Anatomy of a Web Archive

The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.

I’m inclined to blame the semantic flexibility of the word “archive” for the fact that someone with no previous exposure to web archives might variously suppose that they are: the result of saving web pages from the browser, institutions acting as repositories for web resources, a navigational feature of some websites allowing for browsing of past content, online storage platforms imagined to be more durable than the web itself, or, simply, “the Wayback Machine.” For all the variety in the policies and practices that guide cultural heritage institutions’ approaches to web archiving, however, the “web archives” that they create and preserve are remarkably consistent. What are web archives, exactly?

WARC, West African Research Center, by Robin, on Flickr

At the most basic level, web archives are one of two closely-related container file formats for web content: the Web ARChive (WARC) format or its precursor, the ARChive (ARC) format. A quick perusal of the data formats used by the international web archiving community shows a strong predominance of WARC and/or ARC. The ratification of WARC as an ISO standard in 2009 made it an even more attractive preservation format, though both WARC and ARC had been de facto standards since well before then. First used in 1996, the ARC format is more specifically described by the Sustainability of Digital Formats website as the “Internet Archive ARC file format”, a testament both to the outsized contribution of the Internet Archive to the web archiving field and to how recently the community’s membership has broadened.

Anatomically, a WARC or ARC file can be thought of as a single document made up of a series of concatenated records. For the WARC format, these records can be one of eight different types, the most predictable of which represents an archived resource (e.g., HTML, JavaScript, image, video, Flash, etc.) retrieved from the web. Examples of other record types include crawler characteristics, HTTP responses, HTTP requests, resource capture details, pointers to previously-captured content (i.e., when crawler-based content de-duplication is enabled), alternate formats for previously-captured content (e.g., format obsolescence use case), and resources spanning multiple WARC files. Aside from the field designating the record type, there are three other mandatory fields found in the header of every WARC record: a record identifier, the record body size, and a timestamp.
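To make the four mandatory fields concrete, here is a minimal sketch in Python that splits the header of a single record into named fields and reports any mandatory ones that are missing. The field names (WARC-Type, WARC-Record-ID, Content-Length, WARC-Date) are the ones used in the WARC specification rather than in this post, the sample record is invented for illustration, and the helper names are hypothetical. It also assumes headers are not folded across lines and that field names use the spec's capitalization.

```python
# The four mandatory header fields, using the names from the WARC specification.
MANDATORY_FIELDS = ("WARC-Type", "WARC-Record-ID", "Content-Length", "WARC-Date")


def parse_warc_header(raw: bytes):
    """Return (version_line, fields) for the header block of one WARC record."""
    # The header block ends at the first blank line (CRLF CRLF).
    header_block = raw.split(b"\r\n\r\n", 1)[0].decode("utf-8")
    lines = header_block.split("\r\n")
    version = lines[0]
    fields = {}
    for line in lines[1:]:
        # Split on the first colon only: values such as URNs and
        # timestamps contain colons of their own.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


def missing_mandatory(fields):
    """List the mandatory fields that are absent from a parsed header."""
    return [name for name in MANDATORY_FIELDS if name not in fields]


# Invented sample record (not taken from any real crawl).
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"WARC-Date: 2013-11-12T17:03:00Z\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>hello</p>\n"
    b"\r\n\r\n"
)

if __name__ == "__main__":
    version, fields = parse_warc_header(SAMPLE_RECORD)
    print(version, fields["WARC-Type"], missing_mandatory(fields))
```

A production parser would need to handle more than this (case-insensitive field names, for instance), so treat it as a reading aid rather than a validator.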

This extensive technical metadata is what distinguishes a web archive from, say, a copy of a web page. Aside from testifying to the provenance and facilitating temporal browsing of the archived data, the variety and ubiquity of record headers also create intriguing opportunities for metadata extraction and analysis.

Lego Bin, by Josh Hallett, on Flickr

As for the archived resources themselves, objects from different parts of the same website or multiple websites may be placed at random in one or more WARC files. The arbitrary packaging of harvested content facilitates parallelization of crawling and efficiencies in storing assets common to multiple sites (e.g., JavaScript libraries) but also explains the relatively slower load times of sites in the Wayback Machine; every single object that makes up the page must be unpacked from an arbitrary offset in many different files.
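The lookup cost described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular replay tool is implemented: the "files" are in-memory byte buffers rather than real WARC files, the index is a plain dictionary mapping each URL to a hypothetical (file number, offset, length) triple, and the function names are made up for illustration.

```python
import io


def pack(resources, n_files=2):
    """Spread (url, payload) pairs across n_files buffers, round-robin,
    and remember where each payload landed."""
    files = [io.BytesIO() for _ in range(n_files)]
    index = {}
    for i, (url, payload) in enumerate(resources):
        file_no = i % n_files
        buf = files[file_no]
        offset = buf.tell()  # current end of this buffer
        buf.write(payload)
        index[url] = (file_no, offset, len(payload))
    return files, index


def fetch(files, index, url):
    """Look up one object and read it back from its file and offset."""
    file_no, offset, length = index[url]
    buf = files[file_no]
    buf.seek(offset)
    return buf.read(length)


# Invented objects from two sites, in the order a crawler happened to fetch them.
RESOURCES = [
    ("http://a.example/index.html", b"<html>A</html>"),
    ("http://b.example/index.html", b"<html>B</html>"),
    ("http://b.example/logo.png", b"\x89PNG\r\n"),
    ("http://a.example/style.css", b"body{}"),
]

if __name__ == "__main__":
    files, index = pack(RESOURCES)
    # Rendering a.example needs objects from both files.
    for url in ("http://a.example/index.html", "http://a.example/style.css"):
        print(url, index[url], fetch(files, index, url))
```

Even in this toy version, rendering one page of a.example means one lookup and one seek per object, spread over two files, which is the overhead the paragraph above attributes to replay.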

If you want to see for yourself, an appendix to the draft WARC specification contains examples of each of the WARC record types, including archived resources. Internet Archive also provides a set of test WARC files for download. Since record headers are plain text and textual resources such as HTML are stored as captured (binary resources are stored as raw bytes rather than as encoded text), the files are surprisingly legible once unzipped and opened in a text editor. It’s not as seamless a way to navigate the past web as, say, Wayback Machine or Memento, but it will give a deeper understanding of the well-considered and widely-used data structure that makes those technologies work.
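If you would rather explore a file programmatically than in a text editor, the following Python sketch unzips a WARC and walks its concatenated records using only the standard library. It builds its own tiny two-record sample in memory so that it runs as-is. The helper names and sample values are hypothetical, and compressing each record as its own gzip member is an assumption about how the file was written; Python's gzip module reads such multi-member data as one continuous stream.

```python
import gzip
import io


def iter_records(stream):
    """Yield (fields, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        fields = {"version": line.strip().decode("utf-8")}
        # Header lines run until the first blank line.
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            fields[name.strip()] = value.strip()
        # Content-Length says exactly how many bytes of body follow.
        body = stream.read(int(fields["Content-Length"]))
        yield fields, body


def make_record(warc_type, record_id, date, body):
    """Build one invented record for the demo (not a full WARC writer)."""
    header = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Record-ID: <urn:uuid:{record_id}>\r\n"
        f"WARC-Date: {date}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return header + body + b"\r\n\r\n"


# Two records, each gzipped separately and then concatenated.
SAMPLE_GZ = b"".join(
    gzip.compress(record)
    for record in (
        make_record("warcinfo", "00000000-0000-0000-0000-000000000001",
                    "2013-11-12T17:03:00Z", b"software: demo\r\n"),
        make_record("resource", "00000000-0000-0000-0000-000000000002",
                    "2013-11-12T17:03:01Z", b"<p>hello</p>\n"),
    )
)

if __name__ == "__main__":
    with gzip.open(io.BytesIO(SAMPLE_GZ), "rb") as stream:
        for fields, body in iter_records(stream):
            print(fields["WARC-Type"], fields["WARC-Date"], len(body))
```

To try it on a downloaded test file, replace `io.BytesIO(SAMPLE_GZ)` with the path to the file.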

2 Comments

  1. Ross Spencer
    November 12, 2013 at 5:03 pm

    Great post Nicholas. Lots of useful information I can forward to my team.

    What tools would you recommend for people wanting to create web archives? Can a non-expert user pick up a tool and archive a single page in a WARC, for example? Would that make sense in such a format?

    Ross

  2. Nicholas Taylor
    November 13, 2013 at 2:49 pm

    Hi Ross, thanks for the comment. The tools for personal archiving of web pages and websites to WARC format are getting better, with the capture side further along than the replay side. Archive Ready (http://archiveready.com/) and WARCreate (http://warcreate.com/) can both be used to create a WARC containing all of the objects that make up an individual web page. GNU Wget 1.14+ (http://www.archiveteam.org/index.php?title=Wget_with_WARC_output) and WAIL (http://matkelly.com/wail/) can both be used to capture entire websites to WARC. WAIL also bundles a standalone Wayback Machine that runs locally, which is the easiest way I know of for users to view the content they’ve collected in WARC format.
