The Magnificent Seven: Looking Back on a Year of Exploring the Web Archives Datasets

It has been just over a year since we kicked off a deep dive into the Library of Congress Web Archives on the Signal! Now at over 2 petabytes, the web archives are a complex aggregation of interrelated web objects that make up the internet as we know it (images, text, code, audio, video, etc.). In keeping with the Digital Strategy for the Library of Congress, we are working to “throw open the treasure chest” by making this digital content as broadly available as possible. However, without the proper tools to navigate this complex resource, users may think of the treasure chest as more of a Pandora’s box! Two broad goals directed our investigation: 1) to develop a better understanding of the individual media objects that comprise the web archives, and 2) to surface specific sets of individual resources from the web archives that will support users exploring research and creative uses of archived content. Let’s check in on how things have progressed.

The Datasets

Over the last year, we released seven web archive file datasets. Each dataset consists of 1,000 files of a specific media type hosted on .gov domains, selected at random using indexes of the web archives, along with associated metadata extracted by Apache Tika and other tools.
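For readers who want to try a similar selection on their own index data, here is a minimal sketch of the idea: filter index records by media type and .gov host, then draw a random sample. It is not the Library's actual pipeline. The record layout (dictionaries with "url" and "mime" keys) and the function name sample_by_media_type are assumptions made for illustration.

```python
import random
from urllib.parse import urlsplit


def is_gov(url):
    """Return True when the URL's host is on a .gov domain."""
    host = urlsplit(url).hostname or ""
    return host == "gov" or host.endswith(".gov")


def sample_by_media_type(records, media_type, k=1000, seed=0):
    """Pick up to k random records of one media type hosted on .gov domains.

    `records` is a hypothetical list of dicts with "url" and "mime" keys,
    standing in for rows read from a web archive index.
    """
    matches = [r for r in records if r["mime"] == media_type and is_gov(r["url"])]
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(matches, min(k, len(matches)))
```

Passing a fixed seed makes the sample repeatable, which helps when a published list of files needs to be regenerated or audited later.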

Alongside the seven datasets, we published five blog posts that illustrate some questions to ask and explore in each of these datasets.

Creative Uses of the Datasets

Speaking of which, we were excited to see our datasets used in creative ways with a variety of digital tools. For example:

  • Matt Miller’s Byzantine PDF creates a “Frankenstein” PDF document by cobbling together bits and pieces from the 1,000 PDFs in our dataset.
  • Anaphora (also by Miller) uses AWS Transcribe to generate transcripts of audio files that can be used to find repeated phrases (see Figure 1 for a glimpse of the interface).

Please let us know if you are aware of any additional uses! We’d love to help spread the word and are eager to see how the data is being (re)used.

Figure 1. Detail of the Anaphora interface.

Learning from the Data

Creating, publishing, and exploring these datasets also helped us understand a little more about the individual resources that make up the web archives. One of the most significant issues was the accuracy of technical information in the HTTP responses. Extracting these resources demonstrated how often this data could be misleading. For example, many resources in the audio dataset have ambiguous relationships to specific media types: RealAudio format files could just as easily be audio, video, or even metadata files. On a similar note, the PDF post discusses how some of the PDF creation dates went back to the seventies (note: the PDF format was not introduced until 1993) because the files were encoded with the original date the document was produced in analog form, not the date that the file itself was created.
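Both kinds of misleading metadata can be surfaced with a simple check over each file's metadata row. The sketch below flags rows where the media type declared in the HTTP response differs from the one detected in the file, and rows where a PDF's creation date falls before the format existed. The field names ("declared_mime", "detected_mime", "created"), the ISO 8601 date strings, and the 1993 cutoff are assumptions for illustration, not the layout of the published datasets.

```python
from datetime import datetime, timezone

# Assumed cutoff: PDF 1.0 was released in 1993; adjust as needed.
PDF_INTRODUCED = datetime(1993, 1, 1, tzinfo=timezone.utc)


def base_type(value):
    """Strip parameters such as '; charset=UTF-8' and normalize case."""
    return (value or "").split(";")[0].strip().lower()


def review_flags(row):
    """Return a list of reasons a metadata row deserves a closer look.

    `row` is a hypothetical dict; dates are ISO 8601 strings without a
    trailing 'Z' (older Python versions do not parse that suffix).
    """
    flags = []
    declared = base_type(row.get("declared_mime"))
    detected = base_type(row.get("detected_mime"))
    if declared and detected and declared != detected:
        flags.append("media type mismatch")

    created = row.get("created")
    if detected == "application/pdf" and created:
        when = datetime.fromisoformat(created)
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
        if when < PDF_INTRODUCED:
            flags.append("creation date predates PDF")
    return flags
```

A flagged row is not necessarily wrong. As the PDF example shows, an early date may faithfully record when the analog original was produced, so the flags are best treated as prompts for review and not as corrections.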

Perhaps the trickiest limitations are more conceptual in nature, concerning the strategies and rationales we use for drawing boundaries around particular web objects. The interconnected nature of these web objects requires us to balance a holistic understanding of the web archives with more practical concerns like how to “drill down” into a WARC to isolate and provide access to discrete media objects. One of the more confounding items we encountered is a spreadsheet that lacks any values because it is meant to dynamically extract data from external sources via a server-based backend. And let’s not forget that the ways in which people use formats will vary and change over time! The PowerPoint dataset includes slide decks that feel more like books (or game shows) than presentations, though you would never know this simply by looking at the media type declaration.

Looking Forward

Together, these datasets and blog posts begin to illustrate the wealth of potential value that can come from exploring the web archives. Thanks again for tuning into our In the Library’s Web Archives series on the Signal! We hope this has been as informative (and entertaining) to others as it has been for us.

Now that all the datasets are out, we would love to hear more from the digital preservation community about related efforts with web archives content! We are also interested in hearing about how these datasets have been used for educational and creative purposes. So do tell us if you find any interesting ways to use these files.
