The “How” of Email Archiving: More Launching Points for Applied Research

Email? from user tamaleaver on Flickr.

In early July I wrote about the “what” of email archiving: that is, what we are trying to preserve when we say we’re “preserving email.” It was admittedly a cursory look at the issue, but hopefully it’s a start for more thorough discussions down the road.

This time I’ll dig in a little deeper and highlight some of the “how” of email archiving: projects and approaches that are attempting to practically address email archiving issues.

Which solution you choose depends, in the first instance, on whether you’re an individual or an institution. NDIIPP offers some high-level guidance for email archiving tailored to individuals (and smaller organizations) as part of our personal archiving tips, but this represents only one possible approach to email archiving. There are solutions available to individuals (including free ones), though some require more active management and resource allocation (that is, $$$) than others.

The Mobisocial lab at Stanford University has an interesting tool that runs on an individual’s computer called Muse. While not a preservation solution, exactly, Muse enables users to access and browse their personal email archives in a variety of creative ways.

Tools like Muse make it easier for end-users to access large collections of email without the collections being subject to significant upfront organizing, sorting or appraisal. Muse (and tools like it) enable a “bypass” approach that may be heretical to advocates of traditional appraisal, but its simplicity, ease-of-use and effectiveness make it valuable to individuals and small organizations that have pulled their email out of an email system but want to continue to access the files.

[A discussion between differing archival approaches (let’s call them “heavy appraisal” vs. “save everything” just to be reductive) may be too incendiary to get into at this point, but an illuminating take on the subject can be found in a 2011 blog post from the New York Digital Archivists Working Group.]

Recall the four main technical preservation strategies for email from the last post:

  • Migrate email to a new version of the software or an open standard
  • Wrap email in XML formats
  • Emulate the email environment
  • Retain the messages within the existing e-mail system

Muse falls largely under the first strategy. In a tipsheet, the Muse developers note that the tool can access a variety of data formats for email, but they prefer that the archived data be migrated to the mbox format, if not already in that form. The tool can also fetch email from one or more online email accounts, suggesting another migration process hidden under the hood.
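To make the migration strategy a little more concrete, here is a minimal sketch of moving mail from one storage format into mbox using Python’s standard-library mailbox module. It is not how Muse does it; the function name and the choice of Maildir as the source format are assumptions made purely for illustration.

```python
import mailbox


def migrate_maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message in a Maildir folder into a single mbox file.

    Hypothetical helper for illustration only; Maildir is just one
    possible source format. Returns the number of messages copied.
    """
    source = mailbox.Maildir(maildir_path, create=False)
    target = mailbox.mbox(mbox_path)  # created if it does not exist yet
    copied = 0
    target.lock()  # guard against other programs writing to the mbox
    try:
        for message in source:
            # mailbox converts the Maildir message to mbox form on add()
            target.add(message)
            copied += 1
        target.flush()  # write pending changes to disk
    finally:
        target.unlock()
        target.close()
        source.close()
    return copied
```

A call such as migrate_maildir_to_mbox("Mail/inbox", "inbox.mbox") (both paths hypothetical) would leave the original folder untouched and write a copy in mbox form. Note that Maildir-specific state, such as read or flagged status, is only approximately mapped by the conversion, which is a small reminder that any migration can lose context.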

"XML" license plate from user anirvan on Flickr.

The Bodleian Library at the University of Oxford in the UK also undertook an email migration effort in 2011 and contributed to the “Preserving Email: Directions and Perspectives” conference that year.

The Collaborative Electronic Records Project is one of the most significant projects to explore XML wrappers for email preservation (though XML conversion was not its only preservation approach). The project worked with the North Carolina Department of Cultural Resources EMCAP project to develop a parser that converts e-mail messages, associated metadata and attachments from mbox into a single preservation XML file that includes the e-mail account’s organizational structure. They also published an XML Schema for a Single E-Mail Account.
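For a rough sense of what “wrapping” an mbox file in XML involves, the sketch below reads an mbox with Python’s standard library and writes one XML file for the whole account. It is emphatically not the CERP/EMCAP parser, and the element names (Account, Message, Header, Body, Attachment) are invented for this example rather than taken from the published schema.

```python
import mailbox
import xml.etree.ElementTree as ET

# Headers to carry over as metadata; an arbitrary selection for this sketch.
HEADERS = ("From", "To", "Date", "Subject", "Message-ID")


def mbox_to_xml(mbox_path, xml_path, account_name):
    """Write the messages in an mbox file to a single XML file.

    Hypothetical element names, not the CERP/EMCAP schema.
    Returns the number of messages written.
    """
    root = ET.Element("Account", name=account_name)
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            node = ET.SubElement(root, "Message")
            for header in HEADERS:
                value = message.get(header)
                if value is not None:
                    ET.SubElement(node, "Header", name=header).text = str(value)
            for part in message.walk():
                if part.is_multipart():
                    continue  # containers hold no content of their own
                filename = part.get_filename()
                if filename:
                    # Record the attachment's metadata only, not its bytes
                    ET.SubElement(node, "Attachment", filename=filename,
                                  contentType=part.get_content_type())
                elif part.get_content_type() == "text/plain":
                    payload = part.get_payload(decode=True) or b""
                    charset = part.get_content_charset() or "utf-8"
                    body = ET.SubElement(node, "Body")
                    body.text = payload.decode(charset, errors="replace")
    finally:
        box.close()
    ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)
    return len(root)  # number of Message children
```

A real preservation workflow would have much more to handle: attachment content, folder structure, HTML bodies, unknown character sets, and characters that are not legal in XML, none of which this sketch attempts. Even so, it shows why an mbox file is a convenient starting point for an XML conversion.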

The NDIIPP-supported Persistent Digital Archives and Library System project also released an open source software tool that extracts email, attachments and other objects from Microsoft Outlook Personal Folders (.pst) files, converting the messages into XML.

Why XML? As the Library of Congress Sustainability of Digital Formats page notes, XML satisfies most, if not all, of the listed sustainability factors, making it highly suitable as a target format for normalization.

As for the first two strategies, Chris Prom pulls them together under what he calls the “whole account approach.” This approach, he says, “reflects the traditional archival model of capturing records at the end of a lifecycle, then taking archival custody over them.”

He contrasts this with the “whole system approach,” which covers the third and fourth strategies above. This approach implements email archiving software to capture an entire email ecosystem, or a portion of that ecosystem, to an external storage environment.

Is Email Dead? from user cambodia4kidsorg on Flickr.

Once captured it may take other tools to provide access. If you’re going to emulate your email environment you may just want to emulate the entire operating system. While not specifically about email, we took a long look at emulation as a service in an interview with Dirk von Suchodoletz of the University of Freiburg back in late 2012.

As for retaining the messages in the existing e-mail system, in some ways this runs counter to traditional archival practice. A 2008 Government Accountability Office report, looking at four federal government agencies, noted that “e-mail messages, including records, were generally being retained in e-mail systems that lacked recordkeeping capabilities, which is contrary to regulation.”

This “strategy” of benign neglect has a lot to do with the recordkeeping challenges posed by email, though efforts like the new “Capstone” approach from the U.S. National Archives are looking to streamline the process.

All of this is to say that there’s plenty of room for applied research in email archiving and preservation and the projects above suggest a variety of potential starting points. Now go to it!
