O Email! My Email! Our Fearful Trip is Just Beginning: Further Collaborations with Archiving Email

Apologies to Walt Whitman for co-opting the first line of his famous poem “O Captain! My Captain!”, but solutions for archiving email are not yet anchor’d safe and sound. Thanks to the collaborative and cooperative community working in this space, however, we’re making headway on the journey.


Email Archiving Stewardship Tools Workshop final panel. Franziska Frey, Christopher Prom, Glynn Edwards, Riccardo Ferrante, and Wendy Gogel. Photo courtesy of Kari Smith.

Email archiving has been a distinct research area for a while, but the discipline is still very much emergent. Stanford University Libraries, for example, has been working on acquiring and processing email from collections since 2010. ePADD’s Glynn Edwards can trace her initial conversation on developing email archiving software with Smithsonian Institution Archives’ Ricc Ferrante to the 2012 Society of American Archivists conference in San Diego. She agrees that it is very gratifying to see the growth of support and interest, especially over the past year.

The Archiving Email Symposium (videos of the presentations are now available), hosted by the Library of Congress and the National Archives in June 2015, was one of the inspirations for the Email Archiving Stewardship Tools (Harvard EAST) workshop at Harvard Library on March 2-3, 2016. In addition to Harvard and the Library of Congress, participants for the workshop included the Smithsonian Institution Archives, Stanford University Libraries’ ePADD project, MIT Institute Archives and Special Collections, University of Illinois Urbana-Champaign, Artefactual Systems and BitCurator Consortium.


The high-level goals of the two-day workshop, organized by Harvard’s Wendy Gogel and Grainne Reilly, were community building, updating each other on current work, identifying and prioritizing gap areas and exposing the Harvard Library (HL) community to email-archiving efforts in the field at large. Just bringing the group together ticked off the first goal, so we started the day with a mark in the win column.

Glynn Edwards summed up the mood in the room this way: “It was exciting to be part of the working group at Harvard sharing information about our various tools, processes, and needs and to begin conceptualizing a path of data and metadata through different tools contingent on their workflows. There was a lot of energy in the room and a willingness to work together to find ways to re-purpose metadata between tools and collaborate on building shared lexicons to assist with processing and discovery.”


Harvard’s Widener Library. Photograph courtesy of Kate Murray

Edwards also found inspiration in Prom’s statement that “email is one of the richest, one of the most revealing, if not the most revealing, of sources currently being generated.” She goes on to say that “while correspondence has always been an important format in archival collections; email is often more – more immediate, more complex, more exposing. This is highlighted again on an almost weekly basis in breaking news – as the Governor’s emails regarding Flint Michigan water crisis were released or emails and documents referred to as the Panama Papers were leaked.”

My personal interest is in the digital formats used for email messages and for other personal information manager (PIM) data, including calendaring, text messages and instant messages. As Prom indicated in the DPC Technology Watch Report Preserving Email (PDF), there’s a convergence in the email archiving community around the MBOX family and EML as de facto preservation formats for email messages, primarily because of two related factors: transparency and integration with toolsets.


EML format description from LC’s Sustainability of Digital Formats website

The Library of Congress’s Sustainability of Digital Formats website defines transparency, one of seven sustainability factors, as “the degree to which the digital representation is open to direct analysis with basic tools, including human readability using a text-only editor.”
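
To make the transparency factor concrete, here is a minimal sketch of reading an EML message with nothing but Python’s standard library. The sample message, its addresses and the `summarize` helper are invented for illustration; they are assumptions, not part of any tool mentioned in this post.

```python
from email import message_from_string
from email.policy import default

# Hypothetical sample message, invented for illustration only.
# An EML file is plain text: header lines, a blank line, then the body.
SAMPLE_EML = (
    "From: sender@example.org\n"
    "To: archive@example.org\n"
    "Subject: Workshop notes\n"
    "Date: Wed, 02 Mar 2016 09:00:00 -0500\n"
    "\n"
    "See you at the workshop.\n"
)


def summarize(raw_text):
    """Parse the text of an EML message and return a few readable fields."""
    message = message_from_string(raw_text, policy=default)
    return {
        "from": str(message["From"]),
        "subject": str(message["Subject"]),
        "body": message.get_content().strip(),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_EML))
```

The same text could be opened in any text-only editor and read directly, which is what the transparency factor describes. Real messages with attachments or multiple parts would need more handling than this sketch provides.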

Native or normalized MBOX and EML files can also be used as access copies because they can be imported into a variety of email clients. It’s no surprise, then, that these two transparent plain-text formats are integrated into popular email archiving tools, and that most modern email clients can import and export one or both of them. The Smithsonian Institution Archives’ CERP toolset ingests MBOX-formatted messages before converting them to XML, as will the still-in-development DArcMail (Digital Archive Mail System). The ePADD project developed at Stanford University Libraries also requires MBOX for ingest. Harvard University Libraries’ Electronic Archiving System (EAS) ingests EML-formatted messages.
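
Because some tools expect MBOX and others expect EML, moving between the two is a common practical step. The following is a rough sketch, using only Python’s standard `mailbox` module, of splitting an MBOX file into one EML file per message. The function name `mbox_to_eml`, the output file naming and the paths are assumptions made for illustration; this is not how CERP, DArcMail, ePADD or EAS perform ingest.

```python
import mailbox
from pathlib import Path


def mbox_to_eml(mbox_path, out_dir):
    """Write each message in an MBOX file to its own .eml file.

    Returns the list of paths written. File names are a hypothetical
    convention (message-00001.eml, ...), not a standard.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # create=False: raise an error instead of creating an empty mailbox
    # if the source file does not exist.
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for index, key in enumerate(box.keys(), start=1):
            target = out / f"message-{index:05d}.eml"
            # get_bytes() returns the message without the MBOX "From " line.
            target.write_bytes(box.get_bytes(key))
            written.append(target)
    finally:
        box.close()
    return written


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for path in mbox_to_eml("sample.mbox", "eml_out"):
        print(path)
```

This sketch does not attempt to reverse any “From ” quoting that an MBOX variant may have applied to message bodies, and it performs no validation, so the output should be checked before being treated as a preservation copy.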


The MBOX format family from the Sustainability of Digital Formats website

Harvard EAST workshop participants discussed some of the issues with these formats, including the lack of format validation tools and the challenges of working with formats, like MBOX, without documented standards.

Reflecting again on Whitman’s poem, email archiving is still a work in progress and our voyage of discovery is nowhere near closed and done. However, projects like the Harvard EAST workshop move us all further along.
