A couple of weeks ago I wrote about the need for applied digital preservation research. The post generated a number of great comments, and I’ll take some time over the next few months to dig a little deeper into each subject area and try to tease out where the useful efforts are, while also identifying gaps that further research might plug.
This time I’ll dive a little deeper into email archiving. Perhaps it’s best to start with the first question from last time: What are the main challenges of email archiving?
At the highest level, Chris Prom, in his DPC Technology Watch Report on Preserving Email (pdf) identifies two areas: perceived technological barriers; and legal mandates that prioritize minimum legal retention periods favoring record destruction over long-term access.
While the legal and policy questions are undoubtedly important, I’m going to leave them for another day and focus for now on the technical issues. And before we can address the “how” of email preservation we need to know “what” an email message is.
A strand of research addressing the “what” for any particular format comes under the rubric of “significant properties.” I won’t go into it any more deeply, but a particularly worthy introduction to this branch of research is “Significant properties of digital objects: definitions, applications, implications” (pdf) by Margaret Hedstrom and Cal Lee from 2002. (This path of exploration also leads to tools like JHOVE, JHOVE2 and DROID.)
So how do we know exactly what we need to capture when we’re talking about preserving an email message? Generally speaking, there is common conformance to the “Internet Message Format” syntax (RFC 2822) across mail systems, but there’s been little to no standardization on email storage formats within clients (look at Prom’s chapter on “IETF Standards” for a much richer discussion of these issues).
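To make the “what” a little more concrete, here is a minimal Python sketch of the header-plus-body structure that the Internet Message Format describes. It uses only the standard library `email` package; the sample message, the addresses and the `describe` helper are invented for illustration and do not come from any of the reports discussed here.

```python
from email import message_from_string
from email.policy import default

# Invented sample message, for illustration only.
RAW = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: Meeting notes\n"
    "Date: Mon, 02 Jan 2012 10:00:00 +0000\n"
    "Message-ID: <1234@example.org>\n"
    "\n"
    "Notes from today's meeting are attached.\n"
)


def describe(raw: str) -> dict:
    """Return the header names, subject and body of a raw message (hypothetical helper)."""
    msg = message_from_string(raw, policy=default)
    return {
        # Header fields come first, one per line, in their original order.
        "headers": [name for name, _ in msg.items()],
        "subject": msg["Subject"],
        # Everything after the first blank line is the body.
        "body": msg.get_content(),
    }


if __name__ == "__main__":
    print(describe(RAW))
```

A real preservation workflow would care about far more than this (attachments, encodings, transit headers), but the sketch shows the basic syntax that mail systems share even when their storage formats differ.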
Gareth Knight looked at email messages through the veneer of “significant properties” research in the 2009 “Significant Properties Testing Report: Electronic Mail,” part of the InSPECT series of testing reports on different kinds of electronic content (the folks at Archivematica have helpfully summarized the information in Knight’s report into a concise table).
Email preservation is certainly context-based: a preservationista needs to understand the email client(s) used in the organization and home in on the format native to each one. There are a lot! One web site lists approximately 60 email-related file extensions, but Knight narrows it down to five prominent “representation formats” (Microsoft Outlook Message, Microsoft Outlook Personal Folder, mbox, Maildir and the Email Account XML schema), while offering more detail:
Representation formats are interpreted by the type of information that they contain, as opposed to any characteristic of the format specification itself. An email may be stored in any format that allows the storage of text-based information, as text (ASCII, Unicode) and binary encoded data (Microsoft Personal Folders). Variation of each encoding type is identified by the organisational structure and mark-up contained. For example, mail may be stored individually using maildir or EML, or as a combination of one or more emails in a single file using mboxrd, mboxcl, or other variations.
Prom is even more parsimonious, winnowing his list of formats to mbox and EML, with a caveat:
In the case of many proprietary clients, messages cannot be exported from their native system directly into MBOX or EML. Instead, these clients may export the message to a proprietary, though perhaps open, format. The most common of these formats are .pst (Outlook), and .nsf (Lotus). Tools … can then convert these files to MBOX or EML.
Still, Prom notes that “in general, if an institution can get email into one of the MBOX or EML formats, it has taken a very big step on the road toward preserving email.”
Success!
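For readers who want to see how the two formats relate, here is a rough Python sketch that splits an mbox file into one EML file per message using the standard library `mailbox` module. The function name, the numbered file naming scheme and the paths are my own assumptions, not something taken from Prom’s report, and the sketch does not attempt to undo any “>From ” escaping that a particular mbox variant may have applied.

```python
import mailbox
from pathlib import Path


def mbox_to_eml(mbox_path: str, out_dir: str) -> list[Path]:
    """Write each message in an mbox file to its own .eml file (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # create=False: fail rather than silently create an empty mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, key in enumerate(box.keys(), start=1):
            target = out / f"{index:06d}.eml"
            # get_bytes() returns the stored message without the
            # mbox "From " separator line that precedes it.
            target.write_bytes(box.get_bytes(key))
            written.append(target)
    finally:
        box.close()
    return written


if __name__ == "__main__":
    # Assumed paths, for illustration only.
    for path in mbox_to_eml("archive.mbox", "eml_out"):
        print(path)
```

Going the other way (many EML files into one mbox) is similarly short with the same module, which is part of why these two formats make a practical target.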
In a future post I’ll take a look at some of the technical approaches being explored to do the “how” of preserving email. Here are some of the most prominent approaches:
- Migrate email to a new version of the software or an open standard
- Wrap email in XML formats
- Emulate the email environment
- Retain the messages within the existing e-mail system
The Open Planets Foundation has a wiki tracking email-related technical projects, some of which we’ll highlight in future posts. Leads on projects that have explored any of these approaches are greatly appreciated!
Comments (2)
The best way to assure that archival email can be read and searched in the future is to save folders of it in a format that matches character-by-character the form of the email at the time it came from the server to the user’s email client. Inasmuch as the format on the server by its purpose has to be readable by all email programs (since most senders dispatch their emails without knowledge of what program(s) will be in use at the receiving end), this format has a common denominator which gets diffused after messages are received. (I’m assuming here a server which does not alter messages from how they had been sent. Where this is not the case, substitute the intermediary server for the final server in interpreting what I outline here.)
The problem to resolve is getting from the archives on users’ machines to universally interpretable archives (that is, converting from the various compressed, altered formats such as .dbx to the RFC 2822 server-neutral format). Fortunately, there are tools within programs that work to undo the proprietary changes rendered by email clients, making the messages readable by programs which don’t use the proprietary format in question.
The most obvious of these tools is the one whereby within one program one can forward a message as a self-contained file which then is opened as though a new file containing only what was in the original email as received by the intermediary sender. In Outlook Express, for example, one can right-click on a message and select “Forward as attachment.” The message then is seen to be a line-item next to a paper-clip icon within a new, otherwise-blank message. When sent, the receiver can open this archival email as a message-within-a-message, and have the full archival email, including the sent/received transit data which was embedded within the header and usually can only be read by going into the “Properties” tab. Forwarding-as-attachment(s) is not limited to a single message. One can “forward” hundreds and even thousands of messages within a single new email (limited only by the maximum byte count manageable by the client or the hardware).
The single new email message created in this way to contain numerous archival messages needn’t be sent. It can instead be saved as a single file. In Outlook Express, one simply selects the “File” menu and then from it “Save as …” (avoiding “Save,” which would create an entry in the client program’s Draft folder). Though Outlook Express will by default create the compendium message file with extension .eml, the message itself will be in a program-neutral, unencrypted character set. This is easily verified by changing the extension of the resultant file from .eml to .txt: the message will open in a plain-text reader (such as NotePad).
To an untrained eye, the massive text file created this way looks like the message format as originally sent. (If the email client in this case had altered the message, such as by introducing line breaks, or changing special characters to the client’s display-appearance equivalents, these would have to be changed back by a newly-devised program, if desired. If the email client in this case altered the message by deleting or neutralizing signs of viruses, these changes might be irreversible.) If this massive file is deemed to serve as an archive the user can do one of two things:
(1) keep the .eml extension, so that the file can be opened in a client which uses it, then save the separate emails within the .eml file as separate files which can then be retrieved by another program.
(2) use it as a .txt file (whether with that extension or another), and then at any later time use a simple-to-create macro to subdivide the contents into separate emails, which can be retrieved by or inserted into another program. There is a recurring pattern in the massive file that makes it easy to determine where one message ends and the next begins, so slicing the file into its component emails requires only that a low-skilled programmer recognize this pattern and write the modest amount of code necessary to partition the separate messages within the compendium file.
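Option (2) might look roughly like the following Python sketch. It rests on an assumption that is not verified here: that the saved compendium is a MIME multipart message in which each forwarded email is a top-level part of type `message/rfc822`. Rather than matching the recurring pattern by hand as described above, the sketch swaps in the standard library `email` parser to find those parts. The function name is hypothetical, and the re-serialized output may not match the original bytes exactly.

```python
from email import message_from_bytes


def extract_attached_messages(compendium: bytes) -> list[bytes]:
    """Return each top-level message/rfc822 part as its own message (hypothetical helper)."""
    outer = message_from_bytes(compendium)
    if not outer.is_multipart():
        return []
    extracted = []
    # Only top-level parts are examined; messages nested inside a
    # forwarded message stay inside that message.
    for part in outer.get_payload():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-item list
            # holding the embedded message.
            inner = part.get_payload(0)
            extracted.append(inner.as_bytes())
    return extracted


if __name__ == "__main__":
    # Assumed file name, for illustration only.
    with open("compendium.eml", "rb") as handle:
        messages = extract_attached_messages(handle.read())
    for number, raw in enumerate(messages, start=1):
        with open(f"message_{number:06d}.eml", "wb") as out:
            out.write(raw)
```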
David,
Thanks for the incredibly thorough comment! We’ll explore the “how” in more detail in a future post. Your approach is certainly one to consider.