The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Repository Development Group.
Though presented as a unified experience, a website depends on many interrelated parts: document markup and dynamic code, assorted binary file types, software interpreters and platforms. The challenge of web archive preservation planning is to save this experience, accounting for ongoing changes in the constitutive layers. While web archives and, for that matter, websites themselves haven’t been around long enough that we’ve yet had to contend with issues of wide-scale digital obsolescence at any layer of the stack, we are currently devising strategies to handle such eventualities.
The basic level of preservation applied across all of our production web archives is bit preservation: an unadulterated copy of the captured web content, exactly as generated by the Heritrix crawler, is stored on multiple, geographically distributed backup tapes. We are in the process of devising a plan for more active preservation, both in concert with and in parallel to other cultural heritage institutions engaged in web archiving. The primary organization through which such efforts are coordinated is the International Internet Preservation Consortium (IIPC), and the Library of Congress has been heavily involved in the activities of its Preservation Working Group.
Web archives present particular challenges for preservation. Web content itself is extremely heterogeneous: idiosyncratically-coded, non-standards-compliant HTML is common, and there are hundreds of different file types, each potentially boasting multiple specified versions. This already-variegated topology also changes over time. If you were digitizing a collection of print materials, you might only have to deal with a handful of output formats, generated according to precise and knowable parameters; it’s much harder to get a handle on the “messy” web.
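To give a sense of what even the first step of wrangling hundreds of file types involves, here is a minimal sketch of signature-based format identification, in the spirit of characterization tools like JHOVE2. The signature table and function are illustrative only, covering a few common web formats; they are not the Library’s actual tooling.

```python
# A few well-known "magic byte" signatures that appear at the start
# of files in the corresponding formats. (Illustrative subset only.)
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"%PDF-": "application/pdf",
}

def identify(payload: bytes) -> str:
    """Guess a MIME type from a file's leading magic bytes."""
    for magic, mime in SIGNATURES.items():
        if payload.startswith(magic):
            return mime
    # Anything we can't match falls back to the generic binary type.
    return "application/octet-stream"
```

In practice, identification has to go well beyond this: many formats lack distinctive signatures, declared HTTP `Content-Type` headers are often wrong, and each identified format may itself have multiple versions to distinguish.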
Another challenge is the layers of abstraction. The “intellectual object” we’re really interested in preserving is the archived website (i.e., the experience thereof, not simply the files that constitute it). When the crawler is running, content from multiple websites is streamed simultaneously and arbitrarily into multiple web archive container files called WARCs. This architecture complicates even determining where all of the files constituting a given website are, let alone preserving them. Beyond whatever actions may be necessary to preserve the WARC file format, each individual file type within the archive may require a unique preservation strategy of its own.
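Because captures from many sites are interleaved across WARCs, reassembling “the files that constitute site X” amounts to grouping records by the site they belong to. The sketch below, assuming we have already extracted each record’s target URI, groups captures by host; real processing would read the `WARC-Target-URI` header from the container files themselves, and a production version would also have to reconcile subdomains, redirects, and embedded third-party resources.

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_host(uris):
    """Group a list of captured URIs by their host, as a first
    approximation of which captures constitute a given website."""
    sites = defaultdict(list)
    for uri in uris:
        host = urlparse(uri).netloc
        sites[host].append(uri)
    return dict(sites)
```

Even this naive grouping makes the abstraction problem concrete: the "website" emerges only after records scattered across many container files are pulled back together.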
We’ve lately been making inroads on identifying and analyzing which files constitute a given site through custom scripts; adapting the PREMIS metadata schema for use with WARC files; and exploring analytics made possible through the use of Hadoop-based tools. In collaboration with the IIPC, we’re developing a comprehensive list of preservation risks to web archives with corresponding mitigation strategies; reference hardware and software platforms on which current web archives can be successfully rendered; and a bibliography of web archiving resources. We’re also sponsoring development of an extension to JHOVE2 to support analysis of files within WARCs.
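As a toy illustration of the kind of aggregate question Hadoop-based tools answer at archive scale, consider tallying declared content types across a collection of captures. The record structure here is hypothetical (a real job would extract metadata from the WARCs themselves), but the reduce step is the same shape whether it runs on a laptop or a cluster.

```python
from collections import Counter

def tally_content_types(records):
    """Count declared content types across capture records.

    `records` is assumed to be an iterable of dicts with a
    'content_type' field -- a stand-in for metadata that a real
    job would read out of WARC records."""
    return Counter(r.get("content_type", "unknown") for r in records)
```

Distributions like this feed preservation planning directly: they show which formats dominate a collection and therefore which obsolescence risks matter most.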
In sum, there’s much work yet to be done. And while we’d all be pleased to wake up to a shiny, jetpack-wearing future, decades from now, to find that digital obsolescence had not (thus far!) proved to be the threat we’d imagined it to be, we’re not taking any chances in the meantime. With diligent attention to the changing web and the risk it poses to web archives, we aim to keep our collections accessible far into the future.