The “What” of Email Archiving

The Box That Done More, by delphwynd, on Flickr

A couple of weeks ago I wrote about the need for applied digital preservation research. The post generated a number of great comments, and over the next few months I’ll dig a little deeper into each subject area to tease out where the useful efforts are, while also identifying gaps that further research might plug.

This time I’ll dive a little deeper into email archiving. Perhaps it’s best to start with the first question from last time: What are the main challenges of email archiving?

At the highest level, Chris Prom, in his DPC Technology Watch Report on Preserving Email (pdf), identifies two areas: perceived technological barriers, and legal mandates that prioritize minimum legal retention periods, favoring record destruction over long-term access.

While the legal and policy questions are undoubtedly important, I’m going to leave them for another day and focus for now on the technical issues. And before we can address the “how” of email preservation we need to know “what” an email message is.

A strand of research addressing the “what” for any particular format comes under the rubric of “significant properties.” I won’t go into it any more deeply, but a particularly worthy introduction to this branch of research is “Significant properties of digital objects: definitions, applications, implications” (pdf) by Margaret Hedstrom and Cal Lee from 2002. (This path of exploration also leads to tools like JHOVE, JHOVE2 and DROID.)

So how do we know exactly what we need to capture when we’re talking about preserving an email message? Generally speaking, there is common conformance to the “Internet Message Format” syntax (RFC 2822, since superseded by RFC 5322) across mail systems, but there’s been little to no standardization on email storage formats within clients (look at Prom’s chapter on “IETF Standards” for a much richer discussion of these issues).
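To make the “what” a little more concrete, here is a minimal sketch of pulling a message apart into the header fields and body that the Internet Message Format defines. It uses Python’s standard `email` package. The sample message, the addresses and the `describe` helper are invented for illustration and do not come from Prom’s report.

```python
from email import message_from_string
from email.policy import default

# A made-up message in Internet Message Format: header fields,
# a blank line, then the body. All values are illustrative.
RAW_MESSAGE = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: Quarterly report\n"
    "Date: Wed, 03 Jul 2013 07:45:00 -0400\n"
    "\n"
    "The report is attached.\n"
)


def describe(raw: str) -> dict:
    """Return the header fields and body text of a single-part message."""
    msg = message_from_string(raw, policy=default)
    return {
        # Header names map to their values as plain strings.
        "headers": {name: str(value) for name, value in msg.items()},
        # get_content() decodes the body of a simple text message.
        "body": msg.get_content().strip(),
    }


if __name__ == "__main__":
    info = describe(RAW_MESSAGE)
    for name, value in info["headers"].items():
        print(f"{name}: {value}")
    print("---")
    print(info["body"])
```

A real message will usually carry many more header fields (transit data, message identifiers) and often a multipart body with attachments, so this only hints at the properties a preservation workflow would need to account for.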

"Digital preservation buttons" by user wlef70 on Flickr

Gareth Knight looked at email messages through the lens of “significant properties” research in the 2009 “Significant Properties Testing Report: Electronic Mail,” part of the InSPECT series of testing reports on different kinds of electronic content (the folks at Archivematica have helpfully summarized the information in Knight’s report in a concise table).

Email preservation is certainly context-based: a preservationista needs to understand the email client(s) used in the organization and home in on the format native to each one. There are a lot! One web site lists approximately 60 email-related file extensions, but Knight narrows it down to five prominent “representation formats” (Microsoft Outlook Message, Microsoft Outlook Personal Folder, mbox, Maildir and the Email Account XML schema), while offering more detail:

Representation formats are interpreted by the type of information that they contain, as opposed to any characteristic of the format specification itself. An email may be stored in any format that allows the storage of text-based information, as text (ASCII, Unicode) and binary encoded data (Microsoft Personal Folders). Variation of each encoding type is identified by the organisational structure and mark-up contained. For example, mail may be stored individually using maildir or EML, or as a combination of one or more emails in a single file using mboxrd, mboxcl, or other variations.
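The difference between “many messages in a single file” and “one message per file” can be sketched in a few lines. The example below reads an mbox file with Python’s standard `mailbox` module and writes each message out as its own .eml file. The function name, file naming scheme and paths are assumptions made for illustration, and the sketch does not try to undo the “>From ” escaping that some mbox variants apply to body lines.

```python
import mailbox
from pathlib import Path


def split_mbox_to_eml(mbox_path: str, out_dir: str) -> list:
    """Write every message in an mbox file to its own .eml file.

    Returns the list of paths written, in mailbox order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []

    # create=False: fail if the source file is missing rather than
    # silently creating an empty mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, key in enumerate(box.iterkeys(), start=1):
            target = out / f"message-{index:05d}.eml"
            # get_bytes() returns the message without the mbox
            # "From " separator line that precedes it in the file.
            target.write_bytes(box.get_bytes(key))
            written.append(target)
    finally:
        box.close()
    return written


if __name__ == "__main__":
    # Hypothetical paths; substitute a real mbox file to try it.
    for path in split_mbox_to_eml("archive.mbox", "eml-out"):
        print(path)
```

Going the other way, from individual files into a single mbox, is equally possible with the same module. Which direction makes sense depends on the tools an institution plans to use downstream.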

Prom is even more parsimonious, winnowing his list of formats to mbox and EML, with a caveat:

In the case of many proprietary clients, messages cannot be exported from their native system directly into MBOX or EML. Instead, these clients may export the message to a proprietary, though perhaps open, format. The most common of these formats are .pst (Outlook), and .nsf (Lotus). Tools … can then convert these files to MBOX or EML.

Still, Prom notes that “in general, if an institution can get email into one of the MBOX or EML formats, it has taken a very big step on the road toward preserving email.”

Success!

In a future post I’ll take a look at some of the technical approaches being explored to do the “how” of preserving email. Here are some of the most prominent approaches:

  • Migrate email to a new version of the software or an open standard
  • Wrap email in XML formats
  • Emulate the email environment
  • Retain the messages within the existing email system

The Open Planets Foundation has a wiki tracking email-related technical projects, some of which we’ll highlight in future posts. Leads on projects that have explored any of these approaches are greatly appreciated!

2 Comments

  1. David P. Hayes
    July 3, 2013 at 7:45 am

    The best way to assure that archival email can be read and searched in the future is to save folders of it in a format that matches character-by-character the form of the email at the time it came from the server to the user’s email client. Inasmuch as the format on the server by its purpose has to be readable by all email programs (since most senders dispatch their emails without knowledge of what program(s) will be in use at the receiving end), this format has a common denominator which gets diffused after messages are received. (I’m assuming here a server which does not alter messages from how they had been sent. Where this is the case, substitute the intermediary server for the final server in interpreting what I outline here.)

    The problem to resolve is getting from the archives on users’ machines to universally-interpretable archives (that is, converting to the RFC 2822 server-neutral format from the various compressed, altered formats such as .dbx). Fortunately, there are tools within programs that work to undo the proprietary changes rendered by email clients, making them readable by programs which don’t use the proprietary format in question.

    The most obvious of these tools is the one whereby within one program one can forward a message as a self-contained file which then is opened as though a new file containing only what was in the original email as received by the intermediary sender. In Outlook Express, for example, one can right-click on a message and select “Forward as attachment.” The message then is seen to be a line-item next to a paper-clip icon within a new, otherwise-blank message. When sent, the receiver can open this archival email as a message-within-a-message, and have the full archival email, including the sent/received transit data which was embedded within the header and usually can only be read by going into the “Properties” tab. Forwarding-as-attachment(s) is not limited to a single message. One can “forward” hundreds and even thousands of messages within a single new email (limited only by the maximum byte count manageable by the client or the hardware).

    The single new email message created in this way to contain numerous archival messages needn’t be sent. It can instead be saved as a single file. In Outlook Express, one simply selects the “File” menu and then from it “Save as …” (avoiding “Save,” which would create an entry in the client program’s Draft folder). Though Outlook Express will by default create the compendium message file with extension .eml, the message itself will be in a program-neutral, unencrypted character set. This is easily verified by changing the extension of the resultant file from .eml to .txt: the message will open in a plain-text reader (such as NotePad).

    To an untrained eye, the massive text file created this way looks like the message format as originally sent. (If the email client in this case had altered the message, such as by introducing line breaks, or changing special characters to the client’s display-appearance equivalents, these would have to be changed back by a newly-devised program, if desired. If the email client in this case altered the message by deleting or neutralizing signs of viruses, these changes might be irreversible.) If this massive file is deemed to serve as an archive the user can do one of two things:

    (1) keep the .eml extension, so that the file can be opened in a client which uses it, then save the separate emails within the .eml file as separate files which can then be retrieved by another program.

    (2) use it as a .txt file (whether with that extension or another), and then at any later time use a simple-to-create macro to subdivide the contents into separate emails, which can be retrieved by or inserted into another program. There is a recurring pattern in the massive file making it easy to determine where one message ends and the next begins, so slicing the file into its component emails will entail a low-skilled programmer merely recognizing this pattern and then writing the modest amount of code necessary to partition the separate messages within the compendium file.
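As a rough sketch of the partitioning step described in this comment: when a client forwards messages as attachments, each one is normally carried as a MIME part of type message/rfc822, so a MIME parser can find the boundaries without a hand-written pattern matcher. The example below uses Python’s standard `email` package. The function names are hypothetical, and it assumes the compendium file is well-formed MIME, which may not hold for every client.

```python
from email import message_from_bytes


def _collect(part, found: list) -> None:
    """Gather attached messages without descending into them."""
    if part.get_content_type() == "message/rfc822":
        # The payload of a message/rfc822 part is a one-item list
        # holding the attached message itself.
        found.append(part.get_payload(0).as_bytes())
        return
    if part.is_multipart():
        for child in part.get_payload():
            _collect(child, found)


def extract_forwarded(compendium: bytes) -> list:
    """Return each forwarded message in a compendium .eml as bytes."""
    found = []
    _collect(message_from_bytes(compendium), found)
    return found


if __name__ == "__main__":
    # Hypothetical file name for a saved "forward as attachment" draft.
    with open("compendium.eml", "rb") as fh:
        for number, raw in enumerate(extract_forwarded(fh.read()), start=1):
            with open(f"extracted-{number:05d}.eml", "wb") as out:
                out.write(raw)
```

One caveat: the parser re-serializes each message on the way out, so the result is not guaranteed to match the original character by character. Where that level of fidelity matters, slicing the raw bytes between MIME boundaries would be the safer route.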

  2. Butch Lazorchak
    July 3, 2013 at 12:20 pm

    David,

    Thanks for the incredibly thorough comment! We’ll explore the “how” in more detail in a future post. Your approach is certainly one to consider.
