The diaries of George Washington, the first map that used the name “America,” jazz recordings from the 1920s, pictures from presidential inaugurations—the Library of Congress has a very diverse collection of documents, recordings, pictures and maps that help us understand the story of our nation. Most discussions of saving cultural heritage information involve looking into the past.
However, the wide use of the internet over the last 20 years to publish documents, recordings, pictures and maps prompted us to imagine how the passage of time would affect these digital artifacts. To find out, we conducted a project called the Archive Ingest and Handling Test. In this post, I look back to 2003 to share some of the lessons we learned.
I was privileged to work with a group of partners on the project to test what might happen to a digital collection as time moved forward. The content used for the test was the September 11th Digital Archive at George Mason University. This archive was collected from people who contributed photos, documents and stories about their experiences on September 11, 2001. The archive project did not require standard information forms but rather took whatever its contributors uploaded. Capturing immediate responses to the event, it was spontaneous and diverse. We often referred to it as “content from the wild.”
The Sheridan Libraries at Johns Hopkins University, Harvard University Library, the Stanford University Library and the Computer Science Department at Old Dominion University joined Library of Congress staff to explore the events and conditions that digital content might encounter as time moved forward. The imagined circumstances included changes in technical systems, in administrative organizations taking care of the information, and in data forms.
Each partner worked with an identical copy of the archive, took it into their local system and examined it. When each had developed a good understanding of the files and information in the archive, they exported a copy and passed it along to another partner. The new copy was taken into the local system and the team examined it for differences. Some teams also tried migrating some of the files from one format to another to learn about changes that would influence future use.
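One way to picture the "examine it for differences" step is a checksum manifest compared before and after a transfer. The sketch below is only an illustration of that idea and does not describe the tools the AIHT partners actually used; the function names `build_manifest` and `compare_manifests` are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def compare_manifests(sent, received):
    """Report files that went missing, appeared, or changed in transit."""
    shared = set(sent) & set(received)
    return {
        "missing": sorted(set(sent) - set(received)),
        "added": sorted(set(received) - set(sent)),
        "changed": sorted(n for n in shared if sent[n] != received[n]),
    }
```

A receiving team could build a manifest of the copy it ingested and compare it with the manifest supplied by the sending team; any entry under "missing", "added" or "changed" marks a file worth a closer look.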
The project also examined some of the commonly understood standards and repository systems of the time. The reports provide rich information about each partner’s experiences. While we had no crystal ball to see the radical changes to information witnessed in the last few years, we gained an understanding of how to approach these changes. Some of the lessons learned follow.
The more varied digital information is within a single collection, the greater the risk to saving it all. We saw unusual file formats in the data that would be hard to use in the future because we might not be able to identify appropriate software to open them. We had difficulty even determining some file formats because they were not standard or correctly formed. The less standard the format, the harder it is to guarantee the survival of the content within the file. Libraries, archives and museums are assessing digital content to understand what can survive and what may disappear without heroic efforts. Recognition of future value is the “rocket science” of digital preservation.
Use a variety of tools to characterize and understand the digital information. The teams explored any tool they could find to help understand the data. Many of the tools developed during the NDIIPP program were inspired by the project. It is better to have several good tools than a single perfect tool. The universe of digital information is far too vast to ignore the need for automated evaluation of content.
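To make the "several good tools" idea concrete, here is a minimal sketch that asks two independent checks what a file is and flags disagreement. It is not one of the tools the teams used: the signature table is a tiny assumed sample, and `guess_by_signature`, `guess_by_extension` and `characterize` are hypothetical names. The extension check relies on Python's standard `mimetypes` module.

```python
import mimetypes

# A few well-known leading byte sequences; real tools carry far larger tables.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def guess_by_signature(data):
    """Guess a media type from the first bytes of the file content."""
    for prefix, mime in SIGNATURES:
        if data.startswith(prefix):
            return mime
    return None


def guess_by_extension(filename):
    """Guess a media type from the file name alone."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


def characterize(filename, data):
    """Combine both guesses and say whether they agree."""
    by_sig = guess_by_signature(data)
    by_ext = guess_by_extension(filename)
    if by_sig is None and by_ext is None:
        status = "unknown"
    elif by_sig is None or by_ext is None:
        status = "single-source"
    elif by_sig == by_ext:
        status = "agree"
    else:
        status = "conflict"
    return {"signature": by_sig, "extension": by_ext, "status": status}
```

Files that come back as "conflict" or "unknown" are the ones most likely to need a person's attention, which is where automated evaluation earns its keep on content from the wild.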
Preservation systems must be flexible enough to accommodate all types of digital information. The partners tested the storage and management systems their organizations were using for materials that had been scanned and digitally formatted. There were challenges bringing diverse data from the Web into systems designed for content that had strict requirements for formats and metadata. The debate over whether to change the data to fit the system (migration), or to have systems that change with the data (emulation), continues today. In the intervening years, we have been grappling with the knowledge that one solution does not fit all.
Working together increases the expertise and capability of preservation action teams. All the project partners were recognized experts in different technical and organizational aspects. Their organizations were early leaders in digital preservation action. The combination of computer scientists, librarians, and technical developers fostered healthy debate and resulted in stronger conclusions and recommendations.
AIHT team working. Photo: M. Anderson
Diversity has negative and positive results. The more alike content is within a given collection, the more likely digital stewards can efficiently and economically ensure its survival. However, there is great risk in relying on single technical solutions and single organizations. A diverse community of organizations—public, private, stewardship organizations, content producers, commercial and non-profit—is needed to preserve digital information for future use.
What have you learned about saving digital information since you became aware of the need?
Update: I corrected my grammar and added a picture of the AIHT team at work.