As archives increasingly process born-digital collections, one thing is clear: processing digital collections often involves working with tons of email. There is already some great work exploring how to deal with email, and given that it is such a significant problem area it is great to see work focused on developing tools to make sense of this material. Of particular concern is how email is simultaneously so ubiquitous and so messy. I’ve heard of cases in which repositories needed to deal with hundreds of millions of email objects in a single collection. Beyond that, in actual practice people use email for just about everything, so email records are often a messy mixture of public, private, personal and professional material.
To this end, the ePADD project at Stanford, with the help of an NHPRC grant, is working to produce an open-source tool that will allow repositories and individuals to interact with email archives before and after they have been transferred to a repository. I was lucky enough to sit in on a presentation from the project’s technical advisor, Dr. Sudheendra Hangal, on the status of the project, and am thrilled to have this opportunity to discuss work on it with him and his colleagues Glynn Edwards and Peter Chan as part of our Insights Interview series. Glynn is the Head of Technical Services in the Manuscripts division of Stanford Libraries and the Manager of the Born-Digital Program, and Peter is a digital archivist at Stanford Libraries.
Trevor: Could you briefly describe the scope and objective of the ePADD project? Specifically, what problem are you working to solve and how are you going about solving it?
Glynn: The ePADD project grew out of earlier experimentation during the Mellon-funded AIMS grant. One of the collections contained 50,000 unique email messages. Peter, who was our new digital archivist, experimented with Gephi (exporting header information to create social network graphs) and Forensic Toolkit (FTK). Neither provided a suitable tool for processing or facilitated discovery: FTK did not allow us to flag individual messages containing personally identifying information for restriction, and neither tool gave us a view of the entities within the corpus or exposed them to remote researchers.
Most of the previous email projects and tools we researched were focused specifically on acquisition and preservation. They did not address other core functions of stewardship – appraisal, processing and access (discovery & delivery).
During this experimental period, Peter discovered MUSE (Memories USing Email), a research project in the Mobisocial group of Stanford’s Computer Science Dept. Using NLP and a built-in lexicon, it allowed us to extract entities, view messages by correspondent and see a graphical visualization of sentiments based on lexical terms. This was a step in the right direction, and we began a multiyear collaboration with Sudheendra Hangal, MUSE’s creator.
The objective is to create an open-source, Java-based software program built on MUSE that supports activities aligned with core functions of the digital curation lifecycle: appraisal, accessioning, processing, discovery and delivery. In effect it allows a user, anyone from the creator or donor to the curator, archivist or researcher, to work with the collection both before and after transfer to a repository.
Stanford, along with our collaborating partners (NYU, Smithsonian, Columbia, and Bodleian @ Oxford), created and prioritized a set of specifications for the initial development cycle, funded by NHPRC. We also developed and published a beta site to demonstrate our concept for exporting entities and correspondents to facilitate discovery. We have been steadily receiving more email collections. Our most recent acquisition contains over 600,000 unique messages. The grant states that the program will handle at least 250,000 messages – so this latest archive will be more than adequate as a stress test!
Trevor: Could you tell us a bit about the design of the workflow in the tool? How are you envisioning donors and processing archivists working with it?
Peter: The workflow is designed as follows:
Creators of email archives use the appraisal module to scan their archives and identify messages they don’t want to transfer. They can also flag messages as “restricted” and enter annotations to specify the terms of the restriction. The files exported from ePADD will NOT contain the messages flagged as “Do Not Transfer.”
After receiving the files from donors, processing archivists will then identify messages to be restricted according to the policy of their institutions and communication with the donor. Depending on the resources available, processing archivists may want to confirm the email addresses of correspondents suggested by ePADD. Archivists may also want to reconcile the correspondents/person entities extracted with authority records suggested by ePADD. After they finish processing, archivists will output two versions of the archive from ePADD. Neither set contains any restricted messages.
The first set is designed for the discovery module, with all messages redacted except for identified entities (people, places, organizations) and email addresses with masked domain names. This version will be stored on the web server used by the discovery module. Public researchers with internet access can browse and search the archives using the discovery module. They will see only a redacted version of the original messages containing the extracted entities, but this is still useful for getting a sense of the entities present in the archive without being able to see what is said about them.
The second version, designed for the delivery module, will be stored in the reading room computer designated for email delivery. Researchers using the designated computer in the reading room will be able to browse and search the archives. The messages, when displayed, will be the full messages without redaction. Researchers can define their own lexicon to analyze the collection. They may request copies by flagging the messages they need. Public service archivists/librarians can then give the researchers the files according to the policy of their institutions.
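To make the discovery set concrete: it keeps only the extracted entities and domain-masked addresses, blanking out everything else. Here is a minimal sketch of that kind of redaction; the `redact_message` helper, the sample message and the entity list are all invented for illustration, not ePADD’s actual code.

```python
import re

def redact_message(text, entities):
    """Blank out every word that is not part of a recognized entity;
    mask the domain of any email address (bob@example.org -> bob@...)."""
    # Mask email domains first so addresses survive as, e.g., "bob@..."
    text = re.sub(r"(\b[\w.+-]+)@[\w.-]+", r"\1@...", text)
    # Words belonging to any recognized entity are allowed to remain.
    keep = {word.lower() for entity in entities for word in entity.split()}
    out = []
    for token in text.split():
        core = token.strip(".,;:!?")
        if "@" in core or core.lower() in keep:
            out.append(token)
        else:
            out.append("...")
    return " ".join(out)

redacted = redact_message(
    "Met Bob Creeley at Stanford to discuss the visit. Write to bob@example.org.",
    ["Bob Creeley", "Stanford"])
print(redacted)
```

The redacted text still shows which people, places and organizations appear, and who corresponded, without revealing what was said about them.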
Glynn: I would only add that the appraisal module is meant to make it possible for a creator/donor to review their email archives, to create their own lexicon if desired, and prepare the files for export and final transfer to a repository. During this process they may mark specific messages (individually) or sets of messages (in bulk, by topic or correspondent) as restricted, or elect not to transfer them. We felt this functionality was important to offer a donor for two reasons. First, in the hope that they weed out irrelevant messages or spam! Second, there may be individuals they correspond with who do not want their messages archived – this is the case for one of our collections.
Trevor: How do you imagine archivists using this tool? Further, how do you see it fitting in with the ecosystem of other open-source tools and platforms that act as digital repository platforms, and other tools for processing and working with born-digital archival materials, like BitCurator and Archivematica?
Peter: I consider “processing” of born-digital materials to include both identifying restricted materials AND arranging/describing the intellectual content of the materials. My understanding of BitCurator and Archivematica is that neither offers tools to arrange/describe the intellectual content of the materials. ePADD offers four tools to arrange/describe the intellectual content of email archives. First, it uses a natural language processing library to extract personal names, organizational names and locations in email archives to give researchers a sense of the people, organizations and locations in the archives. Second, it gathers all image files in one place for researchers to browse and, if necessary, go to the messages containing the images.
Third, it offers user-definable lexicons which contain categories of words the system will use to search against the emails, so that researchers/archivists can browse emails according to the lexicons they defined. Finally, ePADD reconciles the correspondents and personal names mentioned in messages with the FAST (Faceted Application of Subject Terminology) dataset, which is derived from the Library of Congress Subject Headings. Archivists can then confirm the matches suggested by ePADD. If none of the suggestions are correct, they can enter their own links to the authority records.
I can see people using ePADD to appraise, process, discover and deliver emails and sending the files generated for delivery and discovery to systems using Archivematica for long term preservation.
Trevor: In Sudheendra’s presentation I saw some really interesting things happening with approaches to identifying different distinct email addresses that are associated with the same individual over time in a collection, and some interesting approaches to associating the names of individuals with canonical data for names of people. I think he also illustrated ways that the content of the messages could be identified and associated with subjects. Could you tell us a bit about how this works and how you are thinking about the possibilities and impact on things like archival description that these approaches could have?
Glynn: With email archives – or any born-digital materials – archivists need automated methods to get through large amounts of data. ePADD incorporates several methods of automation to assist with processing of email. Here are three:
1. Correspondents & name resolution
During ingest, ePADD gathers all correspondents and recipients from email headers and performs basic name resolution tasks. When your cursor rolls over a name, different versions that were aggregated appear in a pop-up window. The archivist can go into the back end and override or edit the addresses that are associated with a specific name.
I would direct you to the wonderful documentation on processing and using email archives on the MUSE website. Regarding the resolution of correspondents, Sudheendra states (PDF) in the “MUSE: Reviving Memories Using Email Archives” report that “MUSE performs entity resolution by unifying names and email addresses in email headers when either the name or email address (as specified in the RFC-822 email header) is equivalent. This is essential since email addresses and even name spellings for a person are likely to change in a long-term archive.”
ePADD performs this process during ingest and allows the donor (appraisal module) or archivist (processing module) to correct or edit the email aliases that are automatically bundled together by ePADD at ingestion.
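The unification rule quoted above – merge two header entries whenever either the name or the email address matches – amounts to computing a transitive closure over (name, address) pairs. A minimal union-find sketch with made-up data (an illustration of the idea, not ePADD’s implementation):

```python
from collections import defaultdict

def resolve_correspondents(headers):
    """Group (name, email) header pairs into one person whenever two
    pairs share either a name or an email address (transitive closure)."""
    parent = {}

    def find(x):
        # Union-find root lookup with path halving.
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link each name node to its address node.
    for name, email in headers:
        union(("name", name.lower()), ("email", email.lower()))

    groups = defaultdict(set)
    for name, email in headers:
        groups[find(("email", email.lower()))].add((name, email))
    return list(groups.values())

headers = [
    ("Bob Creeley", "bob@stanford.edu"),
    ("Robert Creeley", "bob@stanford.edu"),    # same address, new name
    ("Robert Creeley", "rcreeley@gmail.com"),  # same name, new address
    ("Glynn Edwards", "glynn@stanford.edu"),
]
groups = resolve_correspondents(headers)
print(len(groups))  # the first three entries collapse into one person
```

The chaining is the important part: two addresses that never appear together are still unified if any name links them, which is exactly why a long-term archive needs this step.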
2. Entity extraction & disambiguation
ePADD extracts entities from the email corpus using Apache’s OpenNLP library and checks them against OCLC’s FAST database to identify authorities. In the case of multiple hits on a name, it shows all the matching records and can read data from DBpedia to automatically rank the likelihood of each record being the correct one. The archivist then confirms which authority record is correct.
Algorithms are also used to help the archivist or researcher understand context while reading a message. For example, suppose a conversation mentions Bob, which could refer to any number of Bobs present in the archive. ePADD analyzes the occurrences of Bob throughout the archive with respect to the text and headers of this message, and thinks: “Hmm…when the name Bob is used with the people copied on this email, and when these other names appear in the message, it’s more likely to be Bob Creeley than other Bobs in the archive like Dylan or Woodward.” It displays a popup with the ranked list of possibilities (see image).
The colored bar underneath each full name indicates the likelihood of that association. This feature can be used by an archivist during processing or by researchers in the delivery module to understand the archive’s contents better. If you think about it, we humans do this kind of context-based disambiguation all the time; ePADD is helping us along by trying to automate some of it.
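A toy version of this context-based ranking can be built from co-occurrence counts alone: score each candidate full name by how often it has historically appeared alongside the people on the current message. The data and scoring below are invented for illustration; ePADD’s actual model is more sophisticated than this:

```python
def rank_candidates(mention, message_people, cooccurrence):
    """Rank full names matching a short mention (e.g. 'Bob') by how
    often each candidate co-occurs with this message's correspondents."""
    candidates = [full for full in cooccurrence
                  if mention.lower() in full.lower().split()]
    scores = {full: sum(cooccurrence[full].get(p, 0) for p in message_people)
              for full in candidates}
    # Highest-scoring candidate first.
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy archive statistics: for each "Bob", counts of messages in which
# he appears together with another correspondent.
cooccurrence = {
    "Bob Creeley": {"Glynn Edwards": 12, "Peter Chan": 7},
    "Bob Dylan": {"Glynn Edwards": 1},
    "Bob Woodward": {},
}
ranking = rank_candidates("Bob", ["Glynn Edwards", "Peter Chan"], cooccurrence)
print(ranking)  # Bob Creeley should rank first
```

The ranked scores play the role of the colored likelihood bars described above.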
3. Lexical searches & review
The archivist can use the built-in lexicons or create one in order to tease out the subjects or topics in the archive. MUSE came with a “sentiment” lexicon, and ePADD will include another default lexicon for searching out personally identifiable information and sensitive material. This will include the ability to match regular expressions – for patterns such as credit card or Social Security numbers – as well as material that may be governed by FERPA or HIPAA. These lexicons are editable, or one could start from scratch and create a specialized one. The beauty of this is that once the terms are indexed by ePADD, the user can view the matching messages individually or in a visualization graph.
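A regular-expression lexicon for sensitive material might contain patterns like the following. These patterns are illustrative only (the lexicons ePADD actually ships are not reproduced here, and real PII screening would add checksum validation, e.g. the Luhn test for card numbers, to reduce false positives):

```python
import re

# Illustrative patterns for two common kinds of PII.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def flag_pii(text):
    """Return the names of the PII categories whose pattern matches."""
    return [name for name, rx in PII_PATTERNS.items() if rx.search(text)]

msg = "My SSN is 078-05-1120 and the card was 4111 1111 1111 1111."
print(flag_pii(msg))  # ['ssn', 'credit_card']
```

Overly broad patterns are exactly where the false positives Glynn mentions come from, which is why making the lexicons editable matters.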
Trevor: As a follow-up to that question, how is the project conceptualizing the role of the archivist engaging with some of these automated processes for description? Sudheendra showed how an archivist could intervene and accept/reject or tweak the resulting bundling of email addresses and associate them with named entities. With that said, I imagine it would be a huge undertaking, and one that seems inconsistent with an MPLP approach, to have an archivist review all of this metadata. To that end, are there ways the project can enable some level of review of particularly important figures and still communicate which part is automated and which part has been reviewed? Or are there other ways the team is thinking about this kind of issue?
Peter: In view of the large number of correspondents and personal names mentioned in an email archive, reviewing ALL name entities is usually not feasible. Depending on the resources we have for each archive, we can review, say, the top 1,000 most-mentioned names in an archive.
Glynn: Agreed. This is similar to processing the analog or paper correspondence in a collection. The archivist usually selects correspondents that are either well known or that have substantive letters, whether in form or extent. Not all correspondents in a collection make it into the finding aid as added entries, into folder-level description, or even into a detailed index. With ePADD the top 50 or 100 correspondents (by extent) are easily and automatically identified.
However, because researchers may be interested in entities/correspondents that we do not “process,” we are considering allowing them much of the same functionality in the full-text access module in the reading room. One example would be allowing the researcher to create a new lexicon and search by their own terms.
Identifying what’s been processed is a work in progress. We still need to build in some administrative features – such as scope and content notes – to let the researcher know the types and depth of actions performed.
Trevor: How are you thinking about authenticity of records in the context of this project? That is, what constitutes the original and authentic format of these records and how does the project work to ensure the integrity of those records over time. Similarly, how are you thinking about documenting decisions and actions taken in the appraisal process on the records?
Peter: According to “ISO 15489-1, Information and documentation–Records Management,” an authentic (electronic) record is one that can be proven:
a) to be what it purports to be,
b) to have been created or sent by the person purported to have created or sent it, and
c) to have been created or sent at the time purported.
Format is not part of the requirements for an authentic electronic record. One of the reasons is that electronically produced documents are not actually objects at all but rather, by their nature, products that have to be processed each time they are used. There is no transfer, and no reading, without a re-creation of the information. Furthermore, electronic records are at risk because of technical obsolescence as newer formats replace older ones.
ePADD does not address the issue of authenticity in this round of funding. This issue is definitely important and complicated and I would like to address it in the future.
Trevor: What lessons has the team learned so far about working with email archives? Are there any assumptions or thoughts you had about working with email as records that have evolved or changed while working on the project?
Peter: Conversion of old archived emails can be tricky. Even though normalization is not within the scope of ePADD, people still need to convert emails to MBox format before ePADD can work on them. One of our partners found headers missing from emails when looking at them in ePADD. The emails came from old GroupWise accounts and had been migrated into Outlook and then converted to MBox. Was this an error in converting the GroupWise emails to Outlook? Or in converting the Outlook emails to MBox? Or does it occur when ePADD parses the emails?
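Missing headers of the kind described here can be spot-checked before ingest with Python’s standard-library mailbox module. A small sketch; the file path in the usage comment and the choice of required headers are our assumptions:

```python
import mailbox

# Headers we assume a well-formed converted message should carry.
REQUIRED = ("From", "Date", "Message-ID")

def check_headers(mbox_path):
    """List (index, missing-headers) for messages in an MBox file that
    lack basic headers: a quick sanity check after format conversion."""
    problems = []
    for i, msg in enumerate(mailbox.mbox(mbox_path)):
        missing = [h for h in REQUIRED if msg[h] is None]
        if missing:
            problems.append((i, missing))
    return problems

# Usage (path is hypothetical):
# for index, missing in check_headers("converted.mbox"):
#     print("message", index, "is missing", missing)
```

Running a check like this after each conversion step (GroupWise to Outlook, Outlook to MBox) would help narrow down where the headers were lost.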
Attachment files come in diverse file formats, and the ability to view attachments is an important feature for a system like ePADD. Apache OpenOffice can read files in ~50 file formats; by contrast, QuickView Plus can view files in ~500 file formats. Should we integrate commercial software into ePADD in order to view files in the 450 file formats that Apache OpenOffice cannot handle? If yes, ePADD will no longer be an open-source project. If no, ePADD users have to accept that there are files they will not be able to view.
Glynn: The sheer volume of data to review can be very daunting. The more specific the terms in the lexicon to perform automated indexing to messages the better. You want to discover messages that should be restricted but not have too many false positives to wade through during review.
The ability to process in bulk cannot be stressed too much. When performing actions on a set of messages (from a lexical result, a correspondent, or a user-defined search), ePADD allows you to apply any action to that entire subset. You can also apply actions to original folders. For example, if messages are organized into a folder marked “human resources,” the archivist or donor may choose to flag all the messages in that folder as “restricted until 2050.”
Trevor: What are the next steps for the project? What sorts of things are you exploring for the future?
Peter: I would like to look at the topics/concepts exchanged in emails (and match them against the Library of Congress Subject Heading – Topical). It would be interesting to know what books and movies were mentioned in emails. Publishing extracted entities as linked open data is definitely one thing I would like to do as well. However, it all depends on funding.
Glynn: This is the fun part – envisioning what else is needed or desired in future iterations. It is, however, reliant on funding and collaboration. Input is needed across different types of institutions – museums, government, academic, corporate to name a few. While many of the use cases would be similar, there are unique aspects or goals for different institutions.
Over the past few weeks, we’ve taken part in the NDSA-sponsored meeting (see Chris Prom’s blog post) and held ePADD’s first Advisory Group meeting. These sparked some wonderful discussions and ideas about next steps for greater discovery, delivery and collaboration.
There is a definite need in the profession to begin defining and documenting use cases; to analyze and document the life cycles of email archives and existing tools in order to evaluate gaps and future needs; and to further discovery by exporting correspondents and extracted entities from ePADD and publishing them with a dynamic search interface across archives. Other avenues we would like to explore include the ability to process and deliver other document types (beyond email), including social media.
The final delivery or access module is intended for reading room access, and we hope to provide more robust tools to allow user interaction with the archives. Additionally, we would like to offer data dumps for text mining/analysis or extractions of header information for social network analysis. Currently these are managed by correspondence through Special Collections.
One suggestion from our Advisory Group was to broaden use of ePADD before final release in the summer (2015). By allowing other repositories to use ePADD for processing we would expose more email collections for researcher use and hopefully get more feedback for the development and specification teams. This will be a better demonstration and test of the program. To this end we plan to release ePADD beyond our grant collaborators to other institutions that have already expressed strong interest.
Sudheendra: Glynn and Peter have answered your questions wonderfully, so I’ll just jump in with a little bit of speculation. In the last couple of decades, we’re seeing that a lot of our lives are reflected in our online activities, be it email, blogs, Facebook, Twitter or any other medium. A small example: a cousin of mine spent almost a year organizing a major dance performance for her daughter. She was reflecting on the effort, and exclaimed to me: “That was so much work. You know, I should save all those emails!” I think that is very telling. All of us have wonderful stories in our lives. There are moments of joy and exasperation, love and sorrow, accomplishment and failure, and they are often captured in our electronic communications. We should be able to preserve them, reflect on them, and hand them over to future generations. We already do this with photographs, which are wonderful. However, text-based communications are complementary to images because they capture thoughts, feelings and intentions in a way that images do not.
Unfortunately, the misuse of personal data for commercial or surveillance reasons is causing many people to be wary of preserving their own records, and even to go out of their way to delete them. This is a pity, because there is so much value buried in archives, if only users could keep their data under their own control, and have good tools with which to make sense of it. So in the next decade, I predict that individuals and families will routinely use tools like ePADD to preserve history important to them. We’re all archivists in that sense.