Continuing the insights interview series, I’m excited to share this conversation with Meg Phillips, External Affairs Liaison at the National Archives and Records Administration. A few years back we “un-chaired” CURATEcamp Processing: Processing Data/Processing Collections together. Meg wrote a guest post reflecting on that event for the Signal titled More Product, Less Process for Born-Digital Collections. I thought it would be good to check back in and see how some of the ideas we were considering about computational processing and digital records in 2012 have continued to play out. In particular, I wanted to talk a bit about computational approaches to categorizing and processing born-digital materials.
Trevor: While electronic records have been an issue for archivists for a long time, the scale and complexity of electronic records keeps growing. Some of these issues are what made electronic records such a high priority in the 2014 National Agenda for Digital Stewardship. Could you give us a few concrete examples of the diversity of electronic records data challenges that NARA and other federal agencies face? I ask in part as I think it sets the stage for why some of these computational tools are particularly important for the future of managing electronic records.
Meg: At the National Archives we need new strategies to cope with very high volume electronic collections with a wide variety of formats and content types, but our challenges definitely aren’t unique. Many other institutions are facing these same problems, including not just cultural heritage institutions but also records creators like federal agencies. What we’re finding is that traditional paper-based processes for managing information, where all the steps are done by humans, just aren’t able to keep up with the volume of electronic records being created now.
One particularly pressing example that shows up in many environments is the need to screen electronic record content to see if it contains any restricted information before we release it. The way we’ve done this in the past is just too slow to keep up, and the need to do it on heterogeneous electronic collections that could contain almost any information (like email or social media) is growing. Without new ways to approach this, archives will bring in vast quantities of valuable information, but only a trickle will make it out into publicly searchable collections. We all want to do better than that.
A variation of this problem that IS specific to the federal government is the need to eventually declassify and release classified national security information. Classified records can contain some of the juiciest material for historians to work with. They’re really important for both history and government accountability. However, the Public Interest Declassification Board’s November 2012 report, Transforming Classification, described the rate of declassification under current processes with this example:
“It is estimated that one intelligence agency would, therefore, require two million employees to review manually its one petabyte of information each year. Similarly, other agencies would hypothetically require millions more employees just to conduct their reviews.”
Trevor: Do you have any examples you could share with us of how NARA or other federal agencies are using these kinds of tools, or are thinking about using them, to automate some of the categorization, appraisal, or processing of electronic records? If so, could you briefly describe some of the uses?
Meg: In the records management sphere, one of the challenges is getting all government employees to save their records in the right category so the right retention rules can be applied. The National Archives has found that asking each person to file every email and other document properly rarely works very well. People are just too busy working to fulfill their agency’s mission to spend a lot of time filing their email. We are encouraging agencies to try automated approaches to the capture and categorization of their electronic records to take the burden of filing off the end users. In fact, we just released a new draft report and plan on Automated Electronic Records Management that we hope will help move the Federal government toward greater use of these tools.
Some bellwether agencies like the Department of Interior are already implementing solutions like this as part of comprehensive electronic records management strategies. They are training machine learning systems to automatically recognize and file email messages that belong in different file categories and they intend to expand the system to all kinds of electronic content.
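To make the idea of training a system to file email into record categories concrete, here is a minimal sketch of a multinomial naive Bayes classifier in pure Python. The category names and training messages are hypothetical, and a real system like the one described above would use far larger training sets and a production library; this just illustrates the learn-from-labeled-examples mechanic.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

class NaiveBayesFiler:
    """Toy multinomial naive Bayes for filing messages into record categories."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> training doc count
        self.vocab = set()

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for cat in self.doc_counts:
            # log prior + log likelihoods with add-one smoothing
            score = math.log(self.doc_counts[cat] / total_docs)
            cat_total = sum(self.word_counts[cat].values())
            for word in tokenize(text):
                count = self.word_counts[cat][word]
                score += math.log((count + 1) / (cat_total + len(self.vocab)))
            if score > best_score:
                best, best_score = cat, score
        return best

filer = NaiveBayesFiler()
filer.train("quarterly budget spreadsheet attached for review", "fiscal")
filer.train("budget numbers and spending figures", "fiscal")
filer.train("meeting minutes from the staff meeting", "administrative")
filer.train("agenda and minutes for next week's meeting", "administrative")
print(filer.classify("please review the attached budget figures"))  # → fiscal
```

The key point the sketch illustrates is that the end user never files anything: once enough labeled examples exist, new messages are routed to a category automatically, and the appropriate retention rule follows from the category.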
The main driver for these kinds of artificial intelligence capabilities in the marketplace is the need to do efficient search, categorization, ranking, and screening for privileged information during the discovery phase of litigation. The cool thing for those of us who need to process large electronic collections is that the functions of the eDiscovery process are really similar to what we need to do. We need to understand what’s there, figure out what to keep and what to toss, categorize and describe it, and screen it for restricted content before releasing it. And when we have such large collections that a search results in millions of hits, it is really useful to be able to rank the results based on how closely they match what we’re looking for. Machine learning tools can do this in a much more sophisticated way than old school keyword searching can. (And there’s some interesting research from The Sedona Conference that indicates that the machines are more accurate than humans at catching potentially restricted content, as well as better at retrieving all relevant documents.)
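The difference between old-school keyword search and relevance-ranked retrieval can be sketched with classic TF-IDF cosine similarity: instead of a flat yes/no match, every document gets a score for how closely it resembles the query. This is a toy stand-in, not any particular eDiscovery product, and the sample documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf_rank(query, documents):
    """Rank documents by cosine similarity between TF-IDF vectors
    of the query and each document (toy whitespace tokenization)."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(documents)
    # document frequency -> inverse document frequency for each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / count) + 1 for term, count in df.items()}

    def vectorize(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    qvec = vectorize(query.lower().split())
    scored = [(cosine(qvec, vectorize(tokens)), doc)
              for tokens, doc in zip(tokenized, documents)]
    return [doc for _, doc in sorted(scored, key=lambda p: p[0], reverse=True)]

docs = [
    "declassification review of classified records",
    "staff picnic schedule",
    "manual review backlog",
]
ranked = tf_idf_rank("declassification review", docs)
print(ranked[0])  # → "declassification review of classified records"
```

Modern eDiscovery tools layer machine-learned relevance models on top of this kind of baseline, but even this simple scoring shows why ranking matters when a search returns millions of hits.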
I also just discovered a fascinating crowdsourcing project called the “Mechanical Curator” from the British Library. They posted a vast collection of public domain images extracted from books digitized by Microsoft on the web. They say they “plan to launch a crowdsourcing application at the beginning of next year, to help describe what the images portray.” Further, that they would use data from the crowdsourcing effort to “train automated classifiers that will run against the whole of the content.” This is a perfect example of an application combining two new approaches, crowdsourcing and machine learning, to extend the scalability of description in the digital world.
You can read more about the National Archives’ project to encourage more automation in the management of electronic records here.
Trevor: Across these different methods and use cases I would be curious to hear what you think are the most promising ways we might be making use of these in the lifecycle of electronic records.
Meg: There are lots of stages in the electronic records lifecycle where automated tools could be very useful. I have to say that we don’t have a lot of experience with this yet. I encourage organizations to start piloting some of these technologies and to share their lessons learned with the community! Many of the tools were developed for other purposes, and we still need to figure out how well they’ll work in cultural heritage environments.
- Appraisal: Autocategorization and topic clustering tools can help figure out what topic areas are covered in a collection and how valuable it is.
- Selection/weeding: We can use automated tools to weed out system files, advertising, and other low value materials that we don’t want to archive.
- Processing: We can use autocategorization or document clustering to better understand the content of a collection. We may not “arrange” it in the same way we would with a physical collection, but we can see different types of inherent organizations for different purposes and can explain those to researchers. We could also flag potentially restricted content (privacy information, for example) if we could train a machine learning system to recognize content similar to other content we already marked as restricted.
- Description: Description is really all about summarizing information and surfacing the subject terms (perhaps with topic modeling) and names of people and places (for which named-entity extraction could work well). There are already some interesting examples of this sort of work afoot. For instance, Thomas Padilla did some work topic modeling text in Carl Woese’s electronic records (pdf). Also, Ed Summers has been dabbling with putting a tool together to generate topic models that actually act as interfaces to materials.
- Reference: Autocategorization tools will help researchers sort through vast sets of hits to find the material worth reading with human eyes. In a collection of hundreds of millions of emails, even obscure topics can generate lists of many thousands of responsive documents, and that’s overwhelming for many researchers. These tools can help them make the best use of their time by leading them to the most relevant clusters of documents.
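The processing step above mentions flagging potentially restricted content. As a minimal illustration of what screening logic looks like, here is a rule-based sketch using regular expressions for a few privacy-sensitive patterns. The pattern set is hypothetical and deliberately tiny; a production screening system would combine rules like these with a trained classifier, as Meg describes, rather than rely on regexes alone.

```python
import re

# Hypothetical restriction categories and patterns for illustration only.
RESTRICTED_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # US Social Security number shape
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),  # email address
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),   # 10-digit phone number
}

def flag_restricted(text):
    """Return the sorted names of restriction categories whose patterns match."""
    return sorted(name for name, pat in RESTRICTED_PATTERNS.items()
                  if pat.search(text))

print(flag_restricted("Contact jdoe@example.gov, SSN 123-45-6789"))
# → ['email', 'ssn']
print(flag_restricted("Routine agenda, nothing sensitive"))
# → []
```

Even this crude filter shows the workflow value: documents that trip no flags can move toward release quickly, while flagged ones are routed to a human reviewer.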
Trevor: As these technologies and methods for categorization continue to augment our abilities to work with and manage digital information I imagine they are going to impact a lot of the workflows and work of archivists across the lifecycle. Do you have any thoughts on what these kinds of tools are going to mean for the future of staffing digital preservation programs?
Meg: In the future, large-scale digital preservation programs are going to need access to someone with not just good technical skills, but also good information science skills. Perhaps we’ll want some data architects to help users make sense of the content in our collections.
Strictly speaking, these aren’t digital preservation functions. It’s the core archival processing and reference functions that are going to change the most. I also believe the profession of records management is going to undergo a transformation. Records managers are going to need to be able to help expert users select training sets and train their systems to recognize the records that belong in different categories. The need to identify the categories of information generated by their organization and determine how long each category has value will not change. The way they implement those decisions is on the verge of changing a lot.
I think those interested in this post would also be interested in Project Abaca, a UK project (with input from The National Archives of the UK) looking at digitally assisted sensitivity review. See http://projectabaca.wordpress.com/
This is an interesting topic, and one that goes to the core of whether we, as a society, will be able to successfully manage the eRecords we create.
In the meantime, however, I agree that auto-classification will help with the speed of access and delivery. Auto-categorization has its own accuracy problems and cannot match human thought in terms of sense-making, but it is most helpful when used at large scale. I assume the areas of ambiguity can be resolved with human intervention. Classifiers, as it turns out, work best within a specific knowledge domain, and a generic engine performs less well in a specialized one. ConceptClassifier is one of many commercial solutions that exist; Tropes is an open-source classifier that offers specialization in different knowledge domains.
I would just add that we might also need to think about the enterprise (IT) architecture on which the mostly-not-yet-defined eRecords architecture currently stands and operates. The eRecords workflow, from the desktop to final disposition (permanent access repository or destruction), needs a clearly designed path within the enterprise IT architecture. This, I fear, is a strategic and/or cognitive investment that cannot yet be outsourced to computation :-(
Process engineering or re-engineering might be called for with regard to eRecords in the enterprise data workflow. An analogy would be the laundry: dropping dirty clothes in the right bucket right away is not only efficient but also saves prep time, and it enables the machine to perform up to our standards and expectations. Conversely, mixing different colors of clothes into one machine and hoping that perfect magic will happen would not be the right approach. In short, while I agree with automatic classification tools, I also support the view that order in the process leading to classification will prep the job or do it halfway.
Overall, the classification engines will only be as efficient as the whole process around them.
NARA’s Applied Research Branch, in collaboration with the National Science Foundation and the Army Research Lab, has funded research that has significantly advanced the state of the art for many of the tools and technologies that Meg cites. These include content clustering, auto-categorization, named-entity extraction, and automated content summarization and description.
You can find papers related to some of this research on the Applied Research webpage on the NARA website. Once on the Applied Research Page, just click on “Research Partner Publications” on the left side of the screen.
You can find information about visualization and content clustering work here:
Weijia Xu, Maria Esteva, Suyog Dutt Jain, Varun Jain: Interactive visualization for curatorial analysis of large digital collection. Information Visualization 13(2): 159-183 (2014)
Weijia Xu, Maria Esteva: Finding stories in the archive through paragraph alignment. LLC 26(3): 359-363 (2011)
Weijia Xu, Maria Esteva, Suyog Dutt Jain, Varun Jain: Analysis of large digital collections with interactive visualization. IEEE VAST 2011: 241-250
Jae Hyeon Bae, Weijia Xu, Maria Esteva: Facilitating Understanding of Large Document Collections. ICDAR 2011: 1334-1338
Great comments! One of the great things about going public in a blog like this is having commenters provide even more pointers to related resources. (Thanks, Mark and David!)
And I agree with Tibaut about sorting the laundry first. In fact, as part of the Automated Electronic Records Management project I mentioned in the interview we’re trying to figure out how different automation techniques could be used as a series of filters. For example, if electronic records are created as part of a structured, automated process, capturing them for management with a workflow step in the system that generates them would be easier and more accurate than applying autocategorization. Perhaps we could use a sequence of techniques that sorted out all the straightforward cases and only used autocategorization for unstructured content where nothing else would work.
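The series-of-filters idea can be sketched as a simple pipeline: try the cheapest, most reliable technique first, and fall through to autocategorization only for unstructured leftovers. Everything here (the field names, the rule, the stand-in classifier) is hypothetical, meant only to show the cascading structure.

```python
def workflow_capture(record):
    """First filter: a record born in a structured system already carries
    its category (hypothetical 'system_category' metadata field)."""
    return record.get("system_category")

def rule_based(record):
    """Second filter: deterministic rules handle the obvious cases."""
    if "invoice" in record.get("subject", "").lower():
        return "fiscal"
    return None

def autocategorize(record):
    """Last resort: stand-in for a machine-learning classifier applied
    to unstructured content; here it just assigns a default bucket."""
    return "needs-review"

FILTERS = [workflow_capture, rule_based, autocategorize]

def categorize(record):
    # Apply each filter in order; the first one that yields a category wins.
    for f in FILTERS:
        category = f(record)
        if category:
            return category

print(categorize({"subject": "Invoice #42 for services"}))          # → fiscal
print(categorize({"system_category": "personnel", "subject": ""}))  # → personnel
print(categorize({"subject": "Misc notes"}))                        # → needs-review
```

The design choice this illustrates is exactly the one described above: the expensive, error-prone technique (autocategorization) only ever sees the records that the cheaper filters could not resolve.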
For one small example of a processing project, the Byrd Center for Legislative Studies has been processing a significant amount of born-digital material from Senator Byrd’s papers. They are sharing their process, workflows, and results: http://www.historyassociates.com/blog/digital-archives-blog/digital-preservation-project-step1/