From time to time, co-chairs of the National Digital Stewardship Alliance Arts and Humanities Content Working Group will bring you guest posts addressing the future of research and development for digital cultural heritage as a follow-up to a dynamic forum held at the 2014 Digital Preservation Conference.
The following is a guest post from Meg Phillips, External Affairs Liaison, National Archives and Records Administration. Opinions expressed are those of the author and do not necessarily represent positions of the National Archives and Records Administration.
Digital humanists and digital historians are employing research methods that most of us did not anticipate when we were learning to be archivists. Do new types of research mean archivists should re-examine the way we learned to do appraisal?
The new types of researchers are experimenting with methods beyond the scholarly tradition of “close reading.” When paper archives were the only game in town, close reading was all a researcher could do – it’s what we generally mean by “reading.” Researchers studied individual records, extracting meaning and context from the information contained in each document. Now, however, digital humanists are using born-digital or digitized collections to explore the benefits of computational analysis techniques, or “distant reading.” They are using computer programs to analyze patterns and find meaning in entire corpora of records without a human ever reading any individual record at all.
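To make that concrete for readers who have not seen distant reading in practice, here is a minimal sketch, in Python, of the kind of analysis involved: counting how often a term appears across an entire corpus, grouped by year, with no one opening a single document. The directory layout, filename convention and example term are invented purely for illustration.

```python
# A toy illustration of distant reading: tally how often a term appears across
# an entire corpus of plain-text records, grouped by year, without a human
# reading any individual document. The directory layout and the convention of
# filenames starting with a four-digit year are hypothetical.
from collections import Counter
from pathlib import Path
import re

def term_frequency_by_year(corpus_dir: str, term: str) -> Counter:
    """Count occurrences of `term` per year in files named like 'YYYY-anything.txt'."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    counts = Counter()
    for path in Path(corpus_dir).glob("*.txt"):
        year = path.name[:4]  # year taken from the hypothetical filename convention
        counts[year] += len(pattern.findall(path.read_text(errors="ignore")))
    return counts

# For example: when did a term like "telework" first appear in an agency's
# correspondence, and how quickly did its use spread?
# print(sorted(term_frequency_by_year("corpus/", "telework").items()))
```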
I have been interested in digital scholarship and its implications for archives for a while, but I hadn’t heard the phrase “distant reading” until I saw reviews of Franco Moretti’s book “Distant Reading” earlier this year. (See “What is Distant Reading?” in the New York Times and “In Praise of Overstating the Case: A review of Franco Moretti, Distant Reading” in Digital Humanities Quarterly for a taste of the debate over the book.) The phrase stuck with me as provocative shorthand for a new way of using records, and I started thinking about what distant reading might mean for archival appraisal.
Our traditions of archival appraisal are based on locating records that reward close reading. A series appraised as permanent is one whose individual records contain historically valuable information. Both appraisal itself and the culling that happens during transfer or processing focus on removing records that do not contain permanently valuable information.
Now, however, it is possible to ask and answer entirely new kinds of questions with born-digital or digitized records. What did the network of influence in an organization look like? How did communication flow? Was the chief executive interacting with a particular vendor unusually often? When did a new concept or term first appear and how quickly did use of the new term spread? How did a disease spread through a community? Not only is it possible, but early adopters are now teaching these research methods to a new generation of students. For example, Professor Matthew Connelly is teaching a seminar at Columbia University called Hacking the Archives. The course challenges students of international history to explore the new kinds of questions computational research allows. These are questions whose answers emerge not from deep reading of individual records but from analysis of patterns in large bodies of records.
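None of these questions requires exotic tooling; what they require is the full corpus. As a hedged illustration of the network-of-influence question above, the sketch below tallies who wrote to whom using email header metadata alone, never the message text. The file name and column names are invented stand-ins for whatever an email system might actually export.

```python
# A sketch of the "network of influence" question: count messages per
# (sender, recipient) pair using only header metadata. The CSV file and its
# column names ("sender", "recipient") are hypothetical stand-ins for a real
# email system export.
import csv
from collections import Counter

def correspondence_counts(metadata_csv: str) -> Counter:
    """Tally messages per (sender, recipient) pair from an export of email headers."""
    edges = Counter()
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            edges[(row["sender"], row["recipient"])] += 1
    return edges

# Pairs with unusually high counts (say, a chief executive and one vendor)
# only stand out when the whole corpus is available, not a curated subset.
# for pair, n in correspondence_counts("email_headers.csv").most_common(10):
#     print(pair, n)
```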
The interesting thing about these questions is that the answers may rely on the presence of records that would clearly be temporary if judged on their individual merits. Consider email messages like “Really sick today – not coming in” or a message from the executive of a regulated company saying “Want to meet for lunch?” to a government policymaker. In the aggregate, the patterns of these messages may paint a picture of disease spread or the inner workings of access and influence in government. Those are exactly the kinds of messages traditional archival practice would try to cull. In these cases, appraising an entire corpus of records as permanent would support distant reading much better. The informational value of the whole corpus cannot be captured by selecting just the records with individual value.
If we adjusted practice to support more distant reading, archivists would still do appraisal, deciding what is worth permanent preservation. We would just be doing it at a different level of granularity – appraising the research value of an entire email system, SharePoint site or social media account, for example.
Incidentally, on a practical level, appraising at this granularity might also lead to disposition instructions that are easier for creating offices to carry out.
Figuring out how to do appraisal to support both distant reading and close reading would be an excellent project for the archival and digital preservation fields. What would we want to know? We could start with questions like these:
- How many researchers are actually engaged in distant reading? What fields do they work in? Are their numbers increasing?
- Do they want to apply computational techniques to archival materials, for example Federal records in the National Archives or holdings in other archival environments? Or perhaps they are getting their source material somewhere else, bypassing archives entirely.
- To what extent do their research methods rely on having a complete set of the records created rather than a subset of the most permanently valuable records?
- Do current definitions of a record and current recordkeeping regulations support a change to appraisal of entire corpora of records?
- How would we know which corpora of records were most useful to researchers?
- Is the benefit of distant reading worth the cost and risk of retaining more material that could have personal privacy or other protected content?
- Is there a meaningful difference between trying to support computational research and actually just keeping everything? (Perhaps this whole discussion is just the modern version of the old tension between historians who want to save everything and archivists who are trying to put their resources toward the most important materials.)
Staff at the National Archives and other institutions are starting to create opportunities for archivists to discuss questions like these. Josh Sternfeld of NEH, Jordan Steele of Johns Hopkins and Paul Wester and I from NARA will be holding a panel discussion of these issues at the Fall 2014 Mid Atlantic Regional Archives Conference meeting in Baltimore, for example. Paul and I will also be speaking with Matthew Connelly and others on an American Historical Association panel at the 2015 annual meeting in New York City, “Are We Losing History? Capturing Archival Records for a New ERA of Research.”
However, we need to create even more opportunities for archivists to explore these issues with digital humanists. A forum that pulled together digital researchers, archivists, librarians and technologists could be a great opportunity for us all to learn from each other. Such an event could also spread the word about the exciting new things that can be done with digital primary sources and the rich collections of digital resources that are now available in archives and libraries.
Of course, we can also blog about the issues and hope that the community leaps into the fray!
In that spirit, do you think archival appraisal needs to change, and if so, how?
Comments (5)
I’m a scholar who does distant reading (mostly of literary texts, mostly in large digital libraries) at the University of Illinois. Just wanted to chime in with a note of support. These seem to me really good questions.
They’re also hard questions, and I don’t immediately have answers to them. E.g., ‘is there a meaningful difference between trying to support computational research and just keeping everything?’
I don’t know. One thought (probably already familiar to you) is that it’s often helpful to know what kind of sample we’re looking at. I.e., if some portion of the record has to be pruned out, it’s often helpful to know how much was pruned. “How many pages, or boxes,” but perhaps also, if it’s possible to estimate, “How many words?” Of course, if it’s possible to retain more detailed metadata, that’s even better.
This is interesting. It is an avenue I am currently exploring from a couple of different angles: 1) trying to move to a big bucket approach to managing email; 2) collecting documents that we haven’t in the past due to the sheer volume of the paper, which may be more doable in a digital environment (but should we just because we can?).
Left unsaid here is information asymmetry. Whether you are looking at preserved records closely or from a distance, you need to take into account how people correspond and what encourages or discourages doing it in writing. Even with a computational approach, human beings are very much in the picture.
Records capture only the portion of events that officials decided to memorialize. The record never is, and never will be, complete. Writing something down requires (1) trusting the record keeping climate and (2) establishing and maintaining a record keeping culture that incentivizes memorialization.
As to “disposition instructions that are easier for creating offices to carry out,” in traditional records management, you have human beings creating, receiving, and trying to preserve records.
Rather than thinking in terms of creating offices (the term of art), I always think in terms of people–what they are capable of and what they are not. Helpful for future researchers to keep in mind, too. Not just in using records, but understanding at deep levels what went into preserving them.
Or why in some cases there are no records about certain actions.
And why the process of preserving electronic records now requires a fresh look. That calls for situational awareness by future researchers, of course.
Lack of memorialization can be innocuous. Or, like the dog that didn’t bark in the night, lack of records can point to clues about the climate in which people worked. In the particular examples that Meg gives, email messages could not be considered comprehensive as evidence. You’d have to factor that in when using them to try to ascertain patterns or behaviors.
With the official asking about lunch, most likely he would do that off the record, using an intermediary who calls, sounds things out and sets up the meeting. There wouldn’t necessarily be an email. If there isn’t, it is the deliberate failure to memorialize that provides clues to the most significant environmental issue. Except you might not know it.
You might see clues elsewhere to the two having had contact. Or if they are very discreet, perhaps not. If they meet offsite, and talk over lunch, there wouldn’t be any visit records such as the temporary records created in the building access and appointment process.
Even messages about sick leave could not be considered comprehensive. Preserved email records wouldn’t capture oral communications such as a person telling a colleague, “My cough isn’t getting better, I’m taking the day off tomorrow to see a doctor.”
So in both cases, preserved records would have to be studied as fragmentary evidence.
The main difference we’ve found so far, when providing data-mining/analytical access to web archives, is that researchers require more context in order to interpret any trends in the data (rather like the ‘pruning’ example given above by Ted).
When we focus on discovering and accessing individual items this is less of an issue, but if you are trying to understand things like how word usage has changed over time (a la Google Books), then you really need to have some grasp on the biases in the way the data was collected. Things like crawl frequencies and scopes, and how you handle duplicate items, all become of paramount importance.
Many researchers are used to studying fragmentary evidence, and as Maarja said, this needs to be borne in mind when exploiting these resources. However, we’ve found that it is possible to partially compensate for the incompleteness of the record by preserving more information about how the collection process worked.
Good points, Andy! Some of what researchers don’t know about paper records will apply here, as well. Creators of records don’t tend to memorialize what leads to “chilling effects,” fear in a record keeping environment, etc. To the extent such context is known, intuited or adduced at the time of records creation, it may be known only to a small circle of contemporaneous observers outside the traditional RM and IT functions. Whether that context survives to inform later researchers credibly or authoritatively is going to be extremely random.