Web Archive Preservation Planning

The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Repository Development Group.

"Máquina de Rube Goldberg en la base del Alinghi" by Flickr user freshwater2006

Though presented as a unified experience, a website depends on many interrelated parts: document markup and dynamic code, assorted binary file types, software interpreters and platforms. The challenge of web archive preservation planning is to save this experience, accounting for ongoing changes in the constitutive layers. Neither web archives nor, for that matter, websites themselves have been around long enough for us to have had to contend with wide-scale digital obsolescence at any layer of the stack, but we are currently devising strategies to handle such eventualities.

The basic level of preservation applied across all of our production web archives is bit preservation – an unadulterated copy of the captured web content as it’s generated by the Heritrix crawler is stored on multiple, geographically-distributed backup tapes. We are in the process of devising a plan for more active preservation, both in concert and in parallel with other web archiving cultural heritage institutions. The primary organization through which such efforts are coordinated is the International Internet Preservation Consortium, and the Library of Congress has been heavily involved in the activities of its Preservation Working Group.
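As an illustration of what bit preservation implies in practice, here is a minimal sketch of a fixity check: it compares cryptographic digests of several copies of the same file to confirm they are still identical. The post does not describe how copies are actually verified, so the choice of SHA-256 and the function names `sha256_of` and `copies_match` are assumptions made for the example.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large container files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(paths):
    """Return True when every listed copy has the same SHA-256 digest.

    Hypothetical helper: an empty list is treated as "no match".
    """
    digests = {sha256_of(path) for path in paths}
    return len(digests) == 1
```

A check like this only detects that copies have diverged; deciding which copy is the good one would require a digest recorded at capture time.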

"my first piece of hand-written HTML in 1997!" by Flickr user laihiu

Web archives present particular challenges for preservation. Web content itself is extremely heterogeneous: idiosyncratically coded, non-standards-compliant HTML is common, and there are hundreds of different file types, each potentially boasting multiple specified versions. This already-variegated topology also changes over time. If one were digitizing a collection of print materials, one might only have to deal with a handful of output formats that would be generated according to precise and knowable parameters; it’s much harder to get a handle on the “messy” web.

Another challenge is the layers of abstraction. The “intellectual object” we’re really interested in preserving is the archived website (i.e., the experience thereof, not simply the files that constitute it). When the crawler is running, content from multiple websites is streamed simultaneously and arbitrarily into multiple web archive container files called WARCs. This architecture complicates even determining where all of the files constituting a given website are, let alone preserving them. Beyond whatever actions may be necessary to preserve the WARC file format, each individual file type within the archive may itself require a unique preservation strategy.
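To make the interleaving concrete, here is a minimal sketch that walks the records in a WARC stream and groups the captured URLs by host. It assumes an uncompressed WARC whose records follow the basic layout of a version line, named header fields, a blank line, and a content block of `Content-Length` bytes. The function names are hypothetical, and this is not the Library's tooling; real archives are usually gzip-compressed and better handled by a dedicated WARC library.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line (or end of stream) ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Read exactly Content-Length bytes, so blank lines inside the
        # payload are never mistaken for record boundaries.
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


def group_records_by_host(stream):
    """Map each host to the list of response URLs captured for it."""
    by_host = defaultdict(list)
    for headers, _payload in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue  # ignore request, metadata, and other record types
        uri = headers.get("warc-target-uri", "").strip("<>")
        host = urlsplit(uri).hostname
        if host:
            by_host[host].append(uri)
    return dict(by_host)
```

Run over every container file from a crawl, a grouping like this would show how the captures for one site are scattered among the captures for others.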

We’ve lately been making inroads on being able to identify and analyze which files constitute a given site through custom scripts; adapting the PREMIS metadata schema for use with WARC files; and exploring analytics made possible through the use of Hadoop-based tools. In collaboration with the IIPC, we’re developing a comprehensive list of preservation risks to web archives with corresponding mitigation strategies; reference hardware and software platforms on which current web archives can be successfully rendered; and a bibliography of web archiving resources. We’re also sponsoring development of an extension to JHOVE2 to support analysis of files within WARCs.
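One simple form such analysis could take is a per-site format profile: a count of the media types captured for each host, which hints at how many distinct preservation strategies a site might eventually need. The sketch below assumes the caller has already extracted `(uri, content_type)` pairs from the archive; the function name `format_profile` and the input shape are illustrative assumptions, not a description of the custom scripts mentioned above.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def format_profile(records):
    """Count media types per host from an iterable of (uri, content_type) pairs."""
    profile = defaultdict(Counter)
    for uri, content_type in records:
        host = urlsplit(uri).hostname or "(unknown)"
        # Drop parameters such as "; charset=utf-8" and normalize case.
        media_type = content_type.split(";", 1)[0].strip().lower() or "(unknown)"
        profile[host][media_type] += 1
    return {host: dict(counts) for host, counts in profile.items()}
```

Note that server-reported media types are frequently wrong on the "messy" web, which is one reason format identification tools inspect the file contents instead.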

In sum, there’s much work yet to be done. And while we’d all be pleased to wake up to a shiny, jetpack-wearing future, decades from now, to find that digital obsolescence had not (thus far!) proved to be the threat we’d imagined it to be, we’re not taking any chances in the meantime. With diligent attention to the changing web and the risk it poses to web archives, we aim to keep our collections accessible far into the future.