Exploring Computational Categorization of Records: A Conversation with Meg Phillips from NARA

Continuing the insights interview series, I’m excited to share this conversation with Meg Phillips, External Affairs Liaison at the National Archives and Records Administration. A few years back we “un-chaired” CURATEcamp Processing: Processing Data/Processing Collections together. Meg wrote a guest post reflecting on that event for the Signal titled More Product, Less Process for Born-Digital Collections. I thought it would be good to check back in and see how some of the ideas we were considering about computational processing and digital records in 2012 have continued to play out. In particular, to talk a bit about computational approaches to categorization or processing born-digital materials.

Meg Phillips, External Affairs Liaison at the National Archives and Records Administration and member of the NDSA Coordinating Committee.

Trevor: While electronic records have been an issue for archivists for a long time, the scale and complexity of electronic records keeps growing. Some of these issues are what made electronic records such a high priority in the 2014 National Agenda for Digital Stewardship. Could you give us a few concrete examples of the diversity of electronic records data challenges that NARA and other federal agencies face? I ask in part as I think it sets the stage for why some of these computational tools are particularly important for the future of managing electronic records.

Meg: At the National Archives we need new strategies to cope with very high volume electronic collections with a wide variety of formats and content types, but our challenges definitely aren’t unique.  Many other institutions are facing these same problems, including cultural heritage institutions but also records creators like federal agencies.  What we’re finding is that traditional paper-based processes for managing information where all the steps are done by humans just aren’t able to keep up with the volume of electronic records being created now.

One particularly pressing example that shows up in many environments is the need to screen electronic record content to see if it contains any restricted information before we release it.   The way we’ve done this in the past is just too slow to keep up, and the need to do it on heterogeneous electronic collections that could contain almost any information (like email or social media) is growing.  Without new ways to approach this, archives will bring in vast quantities of valuable information, but only a trickle will make it out into publicly searchable collections.  We all want to do better than that.

A variation of this problem that IS specific to the federal government is the need to eventually declassify and release classified national security information.  Classified records can contain some of the juiciest material for historians to work with.  They’re really important for both history and government accountability.  However, the November 2012 report of the Public Interest Declassification Board, Transforming Classification, described the rate of declassification under current processes with this example:

“It is estimated that one intelligence agency would, therefore, require two million employees to review manually its one petabyte of information each year. Similarly, other agencies would hypothetically require millions more employees just to conduct their reviews.”

“NARA’s grand challenge for industry” cartoon created by James Lappin. Presented with permission from James Lappin.

Trevor: Do you have any examples you could share with us of how NARA or other federal agencies are using tools, or are thinking about using these kinds of tools, to automate some of the categorization, appraisal, or processing of electronic records? If so, could you briefly describe some of the uses?

Meg: In the records management sphere, one of the challenges is getting all government employees to save their records in the right category so the right retention rules can be applied.  The National Archives has found that asking each person to file every email and other document properly rarely works very well.  People are just too busy working to fulfill their agency’s mission to spend a lot of time filing their email.  We are encouraging agencies to try automated approaches to the capture and categorization of their electronic records to take the burden of filing off the end users. In fact, we just released a new draft report and plan on Automated Electronic Records Management that we hope will help move the Federal government toward greater use of these tools.

Some bellwether agencies like the Department of Interior are already implementing solutions like this as part of comprehensive electronic records management strategies.  They are training machine learning systems to automatically recognize and file email messages that belong in different file categories and they intend to expand the system to all kinds of electronic content.
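The kind of automated email filing described above can be sketched with an off-the-shelf text classifier. To be clear, this is a minimal illustration, not the Department of Interior's actual system; the record categories, the sample messages, and the choice of a scikit-learn TF-IDF/Naive Bayes pipeline are all assumptions made for the example.

```python
# Minimal sketch of automated record categorization, assuming
# scikit-learn is installed. Categories and messages are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training set: messages an expert user has already filed
# into the correct retention categories.
train_messages = [
    "Attached is the signed contract for the office lease renewal.",
    "Here is the final invoice for the vendor payment.",
    "Minutes from Tuesday's policy steering committee meeting.",
    "Agenda and notes for the quarterly planning meeting.",
]
train_categories = ["procurement", "procurement", "meetings", "meetings"]

# TF-IDF features feeding a Naive Bayes classifier: a common,
# simple baseline for text categorization.
classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train_messages, train_categories)

# A new, unfiled message gets a suggested category automatically,
# taking the filing burden off the end user.
new_message = ["Please review the draft meeting agenda before Friday."]
print(classifier.predict(new_message)[0])
```

In a real deployment the training set would be far larger and would come from records experts, but the basic shape of the workflow is the same: file a representative sample by hand, then let the system file the rest.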

The main driver for these kinds of artificial intelligence capabilities in the marketplace is the need to do efficient search, categorization, ranking and screening for privileged information during the discovery phase of litigation.  The cool thing for those of us who need to process large electronic collections is that the functions of the eDiscovery process are really similar to what we need to do.  We need to understand what’s there, figure out what to keep and what to toss, categorize and describe it, and screen it for restricted content before releasing it.  And when we have such large collections that a search results in millions of hits, it is really useful to be able to rank the results based on how closely they match what we’re looking for.  Machine learning tools can do this in a much more sophisticated way than old-school keyword searching can.  (And there’s some interesting research from The Sedona Conference that indicates that the machines are more accurate than humans at catching potentially restricted content, as well as better at retrieving all relevant documents.)
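The ranking idea can be illustrated with simple vector-space similarity (commercial eDiscovery products use far more sophisticated models). Everything below is a hypothetical sketch assuming scikit-learn: the documents, the query, and the TF-IDF/cosine approach are stand-ins, not a description of any agency's tooling.

```python
# Sketch: ranking search hits by similarity to a query, rather than
# the flat hit/no-hit result of keyword search. scikit-learn assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Memo on the declassification review schedule for 1975 records.",
    "Cafeteria menu for the week, review it and send feedback.",
    "Declassification review findings for the 1975 intelligence records.",
]
query = ["declassification review of 1975 records"]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform(query)

# A keyword search for "review" matches all three documents equally;
# cosine similarity scores how closely each one matches the whole query.
scores = cosine_similarity(query_vector, doc_matrix)[0]
ranking = scores.argsort()[::-1]
for rank, idx in enumerate(ranking, 1):
    print(f"{rank}. ({scores[idx]:.2f}) {documents[idx]}")
```

With millions of hits instead of three, this kind of scoring is what lets a researcher start with the most relevant cluster rather than wading through everything.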

I also just discovered a fascinating crowdsourcing project called the “Mechanical Curator” from the British Library. They posted a vast collection of public domain images extracted from books digitized by Microsoft on the web.  They say they “plan to launch a crowdsourcing application at the beginning of next year, to help describe what the images portray.” Further, that they would use data from the crowdsourcing effort to “train automated classifiers that will run against the whole of the content.”  This is a perfect example of an application combining two new approaches, crowdsourcing and machine learning, to extend the scalability of description in the digital world.

You can read more about the National Archives’ project to encourage more automation in the management of electronic records here.

Trevor: Across these different methods and use cases I would be curious to hear what you think are the most promising ways we might be making use of these in the lifecycle of electronic records.

Meg: There are lots of stages in the electronic records lifecycle where automated tools could be very useful.  I have to say that we don’t have a lot of experience with this yet.  I encourage organizations to start piloting some of these technologies and share their lessons learned with the community!  Many of the tools were developed for other purposes, and we still need to figure out how well they’ll work in cultural heritage environments.

  • Appraisal: Autocategorization and topic clustering tools can help figure out what topic areas are covered in a collection and how valuable it is.
  • Selection/weeding: We can use automated tools to weed out system files, advertising, and other low value materials that we don’t want to archive.
  • Processing: We can use autocategorization or document clustering to better understand  the content of a collection.  We may not “arrange” it in the same way we would with a physical collection, but we can see different types of inherent organizations for different purposes and can explain those to researchers. We could also flag potentially restricted content (privacy information, for example) if we could train a machine learning system to recognize content similar to other content we already marked as restricted.
  • Description: Description is really all about summarizing information and surfacing the subject terms (perhaps with topic modeling) and names of people and places (for which named-entity extraction could work well). There are already some interesting examples of this sort of work afoot. For instance, Thomas Padilla did some work topic modeling text in Carl Woese’s electronic records (pdf).  Also, Ed Summers has been dabbling with putting a tool together to generate topic models that actually act as interfaces to materials.
  • Reference: Autocategorization tools will help researchers sort through vast sets of hits to find the material worth reading with human eyes.  In a collection of hundreds of millions of emails, even obscure topics can generate lists of many thousands of responsive documents, and that’s overwhelming for many researchers.  These tools can help them make the best use of their time by leading them to the most relevant clusters of documents.

Trevor: As these technologies and methods for categorization continue to augment our abilities to work with and manage digital information I imagine they are going to impact a lot of the workflows and work of archivists across the lifecycle. Do you have any thoughts on what these kinds of tools are going to mean for the future of staffing digital preservation programs?

Meg: In the future, large-scale digital preservation programs are going to need access to someone with not just good technical skills, but also good information science skills.  Perhaps we’ll want some data architects to help users make sense of the content in our collections.

Strictly speaking, these aren’t digital preservation functions. It’s the core archival processing and reference functions that are going to change the most. I also believe the profession of records management is going to undergo a transformation. Records managers are going to need to be able to help expert users select training sets and train their systems to recognize the records that belong in different categories. The need to identify the categories of information generated by their organization and determine how long each category has value will not change. The way they implement those decisions is on the verge of changing a lot.
