Apologies to Walt Whitman for co-opting the first line of his famous poem “O Captain! My Captain!”, but solutions for archiving email are not yet anchor’d safe and sound. Thanks to the collaborative and cooperative community working in this space, however, we’re making headway on the journey.
Email archiving has been a distinct research area for some time, but the discipline is still very much emergent. Stanford University Libraries, for example, has been acquiring and processing email from collections since 2010. ePADD’s Glynn Edwards traces her initial conversation about developing email archiving software with Smithsonian Institution Archives’ Ricc Ferrante to the 2012 Society of American Archivists conference in San Diego. She agrees that it is very gratifying to see the growth of support and interest, especially over the past year.
The Archiving Email Symposium (videos of the presentations are now available), hosted by the Library of Congress and the National Archives in June 2015, was one of the inspirations for the Email Archiving Stewardship Tools (Harvard EAST) workshop at Harvard Library on March 2-3, 2016. In addition to Harvard and the Library of Congress, workshop participants included the Smithsonian Institution Archives, Stanford University Libraries’ ePADD project, MIT Institute Archives and Special Collections, University of Illinois Urbana-Champaign, Artefactual Systems and the BitCurator Consortium.
The high-level goals of the two-day workshop, organized by Harvard’s Wendy Gogel and Grainne Reilly, were building community, updating each other on current work, identifying and prioritizing gap areas, and exposing the Harvard Library (HL) community to email-archiving efforts in the field at large. Just bringing the group together ticked off the first goal, so we started the day with a mark in the win column.
Glynn Edwards summed up the mood in the room this way: “It was exciting to be part of the working group at Harvard sharing information about our various tools, processes, and needs and to begin conceptualizing a path of data and metadata through different tools contingent on their workflows. There was a lot of energy in the room and a willingness to work together to find ways to re-purpose metadata between tools and collaborate on building shared lexicons to assist with processing and discovery.”
Edwards also found inspiration in Prom’s statement that “email is one of the richest, one of the most revealing, if not the most revealing, of sources currently being generated.” She goes on to say that “while correspondence has always been an important format in archival collections; email is often more – more immediate, more complex, more exposing. This is highlighted again on an almost weekly basis in breaking news – as the Governor’s emails regarding Flint Michigan water crisis were released or emails and documents referred to as the Panama Papers were leaked.”
My personal interest is in the digital formats used for email messages and for other personal information manager (PIM) data, including calendar entries, text messages and instant messages. As Prom indicated in the DPC Technology Watch Report Preserving Email (PDF), the email archiving community is converging on the MBOX family and EML as de facto preservation formats for email messages, primarily because of two related factors: transparency and integration with toolsets.
The Library of Congress’s Sustainability of Digital Formats website defines transparency, one of seven sustainability factors, as “the degree to which the digital representation is open to direct analysis with basic tools, including human readability using a text-only editor.”
Native or normalized MBOX and EML files can also be used as access copies because they can be imported into a variety of email clients. It’s no surprise, then, that these two very transparent plain-text formats are integrated into popular email archiving tools, or that most modern email clients can import and export one or both of them. The Smithsonian Institution Archives’ CERP toolset ingests MBOX-formatted messages before converting them to XML, as will the still-in-development DArcMail (Digital Archive Mail System). The ePADD project developed at Stanford University Libraries also requires MBOX for ingest. Harvard University Libraries’ Electronic Archiving System (EAS) ingests EML-formatted messages.
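Because some of these tools expect MBOX and others expect EML, moving messages from one container to the other is a common chore. The following is a minimal sketch of that step using only Python’s standard library; it is not how CERP, ePADD or EAS do it. The function name mbox_to_eml and the message_00001.eml naming scheme are assumptions made for illustration, and the sketch does not try to reverse any “>From ” quoting that an MBOX variant may have applied to message bodies.

```python
import mailbox
from email.generator import BytesGenerator
from pathlib import Path


def mbox_to_eml(mbox_path, out_dir):
    """Write each message in an MBOX file to its own .eml file.

    Hypothetical helper for illustration; returns the list of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []

    # create=False: fail if the file is missing instead of creating an empty one.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, key in enumerate(box.keys(), start=1):
            message = box.get_message(key)
            target = out / f"message_{index:05d}.eml"
            with open(target, "wb") as handle:
                # unixfrom=False leaves out the MBOX "From " separator line,
                # which is not part of the message itself.
                generator = BytesGenerator(handle, mangle_from_=False)
                generator.flatten(message, unixfrom=False)
            written.append(target)
    finally:
        box.close()
    return written
```

A call such as mbox_to_eml("inbox.mbox", "eml_out") would leave one EML file per message in eml_out, each of which can still be opened in a text editor. Any real workflow would also want to record checksums and the original file’s provenance, which this sketch leaves out.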
Harvard EAST workshop participants discussed some of the issues with these formats, including the lack of format validation tools and the challenges of working with formats such as MBOX that have no documented standard.
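To make the validation gap concrete, here is a rough sketch of the kind of structural survey one can run in the absence of a real validator. It is a heuristic, not a conformance check: the SEPARATOR pattern and the rule that a separator should follow a blank line are assumptions drawn from common MBOX practice, and the survey_mbox name and report keys are invented for this example. MBOX variants disagree on exactly these details, which is part of the problem the participants described.

```python
import re

# Loose pattern for a separator line: "From ", an envelope sender, then more text
# (normally a date). An assumption for illustration, not a published grammar.
SEPARATOR = re.compile(rb"^From \S+ +\S.*$")


def survey_mbox(path):
    """Count likely message separators and flag lines that merely look like one."""
    report = {"messages": 0, "starts_with_separator": False, "suspect_lines": []}
    previous_blank = True  # treat the start of the file like a blank line
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            line = raw.rstrip(b"\r\n")
            if line.startswith(b"From "):
                if SEPARATOR.match(line) and previous_blank:
                    report["messages"] += 1
                    if number == 1:
                        report["starts_with_separator"] = True
                else:
                    # Possibly unquoted body text, possibly a variant we do not know.
                    report["suspect_lines"].append(number)
            previous_blank = line == b""
    return report
```

A non-empty suspect_lines list does not prove a file is broken. It only points a person at the lines worth opening in a text editor, which is where the transparency of the format pays off.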
Reflecting again on Whitman’s poem, email archiving is still a work in progress and our voyage of discovery is nowhere near closed and done. However, projects like the Harvard EAST workshop move us all further along.