O Email! My Email! Our Fearful Trip is Just Beginning: Further Collaborations with Archiving Email

Apologies to Walt Whitman for co-opting the first line of his famous poem O Captain! My Captain!, but solutions for archiving email are not yet anchor’d safe and sound. Thanks to the collaborative and cooperative community working in this space, however, we’re making headway on the journey.


Email Archiving Stewardship Tools Workshop final panel. Franziska Frey, Christopher Prom, Glynn Edwards, Riccardo Ferrante, and Wendy Gogel. Photo courtesy of Kari Smith.

Email archiving as a distinct research area has been around a while, but the discipline is still very much emergent. Stanford University Libraries, for example, has been working on acquiring and processing email from collections since 2010. ePADD’s Glynn Edwards can trace her initial conversation on developing email archiving software with Smithsonian Institution Archives’ Ricc Ferrante to the 2012 Society of American Archivists conference in San Diego. She agrees it is very gratifying to see the growth of support and interest, especially over the past year.

The Archiving Email Symposium (videos of the presentations are now available), hosted by the Library of Congress and the National Archives in June 2015, was one of the inspirations for the Email Archiving Stewardship Tools (Harvard EAST) workshop at Harvard Library on March 2-3, 2016. In addition to Harvard and the Library of Congress, participants in the workshop included the Smithsonian Institution Archives, Stanford University Libraries’ ePADD project, MIT Institute Archives and Special Collections, University of Illinois at Urbana-Champaign, Artefactual Systems and the BitCurator Consortium.


The high-level goals of the two-day workshop, organized by Harvard’s Wendy Gogel and Grainne Reilly, were community building, updating each other on current work, identifying and prioritizing gap areas, and exposing the Harvard Library community to email-archiving efforts in the field at large. Just bringing the group together ticked off the first goal, so we started the day with a mark in the win column.

Glynn Edwards summed up the mood in the room this way: “It was exciting to be part of the working group at Harvard sharing information about our various tools, processes, and needs and to begin conceptualizing a path of data and metadata through different tools contingent on their workflows. There was a lot of energy in the room and a willingness to work together to find ways to re-purpose metadata between tools and collaborate on building shared lexicons to assist with processing and discovery.”


Harvard’s Widener Library. Photograph courtesy of Kate Murray

Edwards also found inspiration in Christopher Prom’s statement that “email is one of the richest, one of the most revealing, if not the most revealing, of sources currently being generated.” She goes on to say that “while correspondence has always been an important format in archival collections; email is often more – more immediate, more complex, more exposing. This is highlighted again on an almost weekly basis in breaking news – as the Governor’s emails regarding Flint Michigan water crisis were released or emails and documents referred to as the Panama Papers were leaked.”

My personal interest is in the digital formats used for email messages and for other personal information manager (PIM) data, including calendaring, text and instant messages. As Prom indicated in the DPC Technology Watch Report Preserving Email (PDF), there’s a convergence in the email archiving community around the MBOX family and EML as de facto preservation formats for email messages, primarily because of two related factors: transparency and integration with toolsets.


EML format description from LC’s Sustainability of Digital Formats website

The Library of Congress’s Sustainability of Digital Formats website defines transparency, one of seven sustainability factors, as “the degree to which the digital representation is open to direct analysis with basic tools, including human readability using a text-only editor.”

Native or normalized MBOX and EML files can also serve as access copies because they can be imported into a variety of email clients. It’s no surprise, then, that these two plain-text, highly transparent formats are integrated into popular email archiving tools, and that most modern email clients can import and export one or both of them. The Smithsonian Institution Archives’ CERP toolset ingests MBOX-formatted messages before converting them to XML, as will the still-in-development DArcMail (Digital Archive Mail System). The ePADD project developed at Stanford University Libraries also requires MBOX for ingest. Harvard University Libraries’ Electronic Archiving System (EAS) ingests EML-formatted messages.
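To make the relationship between the two formats concrete, here is a minimal sketch in Python that splits an MBOX file into one EML file per message using only the standard library's mailbox and email modules. It is an illustration, not the method used by CERP, DArcMail, ePADD or EAS. The function names, the output file naming pattern and the example.org addresses are all hypothetical, and a real workflow would also need to think about MBOX variants, character encodings and fixity.

```python
import mailbox
import tempfile
from email.message import EmailMessage
from pathlib import Path


def build_sample_mbox(mbox_path):
    """Create a tiny two-message MBOX file for demonstration (hypothetical data)."""
    box = mailbox.mbox(str(mbox_path))
    try:
        for subject in ("First message", "Second message"):
            msg = EmailMessage()
            msg["From"] = "sender@example.org"
            msg["To"] = "archive@example.org"
            msg["Subject"] = subject
            msg.set_content("Plain-text body.")
            box.add(msg)
        box.flush()
    finally:
        box.close()


def split_mbox_to_eml(mbox_path, out_dir):
    """Write each message in an MBOX file to its own .eml file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for index, message in enumerate(box):
            # Hypothetical naming scheme: message_00000.eml, message_00001.eml, ...
            target = out_dir / f"message_{index:05d}.eml"
            # unixfrom=False leaves out the MBOX "From " separator line,
            # so the file holds only the message headers and body.
            target.write_bytes(message.as_bytes(unixfrom=False))
            written.append(target)
    finally:
        box.close()
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "sample.mbox"
        build_sample_mbox(source)
        files = split_mbox_to_eml(source, Path(tmp) / "eml")
        print(f"Wrote {len(files)} EML files")
```

Because both formats are plain text, the resulting files can be opened in a text-only editor, which is exactly the kind of transparency described above.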


The MBOX format family from the Sustainability of Digital Formats website

Harvard EAST workshop participants discussed some of the issues with these formats, including the lack of format validation tools and the challenges of working with formats such as MBOX that lack documented standards.
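In the absence of a validator, about the best one can do quickly is a rough sanity check. The sketch below, again in Python with the standard library's mailbox module, reports messages in an MBOX file that are missing the From or Date header, the two header fields that RFC 5322 makes mandatory. This is not format validation and was not presented at the workshop; the function name and the choice of checks are assumptions made for illustration, and the parser simply trusts whatever message boundaries Python's mailbox module finds.

```python
import mailbox

# The two header fields RFC 5322 requires in every message.
REQUIRED_HEADERS = ("From", "Date")


def report_missing_headers(mbox_path, required=REQUIRED_HEADERS):
    """Return {message index: [missing header names]} for an MBOX file.

    A heuristic sanity check only: it does not validate the MBOX
    container itself or distinguish between MBOX variants.
    """
    problems = {}
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for index, message in enumerate(box):
            missing = [name for name in required if message[name] is None]
            if missing:
                problems[index] = missing
    finally:
        box.close()
    return problems
```

An empty result means only that every parsed message carries the required headers; it says nothing about whether the file would pass a real validator, if one existed.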

Reflecting again on Whitman’s poem, email archiving is still a work in progress and our voyage of discovery is nowhere near closed and done. However, projects like the Harvard EAST workshop move us all further along.

Bagger’s Enhancements for Digital Accessions

This is a guest post by John Scancella, Information Technology Specialist with the Library of Congress, and Tibaut Houzanme, Digital Archivist with the Indiana Archives and Records Administration. BagIt is an internationally accepted method of transferring files via digital containers. If you are new to BagIt, please watch our introductory video. Bagger is a digital […]

Intellectual Property Rights Issues for Software Emulation: An Interview with Euan Cochrane, Zach Vowell, and Jessica Meyerson

The following is a guest post by Morgan McKeehan, National Digital Stewardship Resident at Rhizome. She is participating in the NDSR-NYC cohort. I began my National Digital Stewardship Residency at Rhizome — NDSR project description here (PDF) — by leading a workshop for the Emulation as a Service framework (EaaS), at “Party Like it’s 1999: […]

APIs: How Machines Share and Expose Digital Collections

Kim Milai, a retired school teacher, was searching on ancestry.com for information about her great grandfather, Amohamed Milai, when her browser turned up something she had not expected: a page from the Library of Congress’s Chronicling America site displaying a scan of the Harrisburg Telegraph newspaper from March 13, 1919. On that page was a story […]

Acquiring at Digital Scale: Harvesting the StoryCorps.me Collection

This post was originally published on the Folklife Today blog, which features folklife topics, highlighting the collections of the Library of Congress, especially the American Folklife Center and the Veterans History Project.  In this post, Nicole Saylor, head of the American Folklife Center Archive, talks about the StoryCorps.me mobile app and interviews Kate Zwaard and […]

Tool Time, or a Discussion on Picking the Right Digital Preservation Tools for Your Program: An NDSR Project Update

The following is a guest post by John Caldwell, National Digital Stewardship Resident at the United States Senate Historical Office. Who remembers Home Improvement? Tim the “Tool Man” Taylor was always trying to show the “Tool Time” audience how to build things, make repairs and of course, demo new tools made by the show’s sponsor, […]

Improving Technical Options for Audiovisual Collections Through the PREFORMA Project

The digital preservation community is a connected and collaborative one. I first heard about the Europe-based PREFORMA project last summer at a Federal Agencies Digitization Guidelines Initiative meeting when we were discussing the Digital File Formats for Videotape Reformatting comparison matrix. My interest was piqued because I heard about their incorporation of FFV1 and Matroska, […]

Cultural Institutions Embrace Crowdsourcing

Many cultural institutions have accelerated the development of their digital collections and data sets by allowing citizen volunteers to help with the millions of crucial tasks that archivists, scientists, librarians, and curators face. One of the ways institutions are addressing these challenges is through crowdsourcing. In this post, I’ll look at a few sample crowdsourcing projects […]

The National Digital Platform for Libraries: An Interview with Trevor Owens and Emily Reynolds from IMLS

I had the chance to ask Trevor Owens and Emily Reynolds at the Institute of Museum and Library Services (IMLS) about the national digital platform priority and current IMLS grant opportunities.  I was interested to hear how these opportunities could support ongoing activities and research in the digital preservation and stewardship communities. Erin: Could you […]

Seeking Comment on Migration Checklist

The NDSA Infrastructure Working Group’s goals are to identify and share emerging practices around the development and maintenance of tools and systems for the curation, preservation, storage, hosting, migration, and similar activities supporting the long term preservation of digital content. One of the ways the IWG strives to achieve their goals is to collaboratively develop […]