The Library of Congress Web Archives: Dipping a Toe in a Lake of Data

Today’s guest post is from Chase Dooley and Grace Thomas, Digital Collections Specialists on the Library of Congress Web Archiving Team. 

Over the last two decades, the Library of Congress Web Archiving Program has acquired and made available over 16,000 web archives as part of more than 114 event and thematic collections. Each web archive is an aggregate of one or more websites, which, in turn, are aggregates of many files presented together as a web page in a browser. These files are the images you see on the landing page of your favorite news (or gossip, no judging) site; they are the text that fills the articles; they are the bits of code that give you that clean, crisp modern layout. All of this together gives you a single web page. With an archive of over 1.7 petabytes of data in total, keeping track of every web object forming a website, and every website forming a web archive, can be a bit like, well… herding cats.

Our program has been quite fortunate to see exponential growth in the total amount of data harvested each year, increases in the number and diversity of collections, and expanded engagement with web harvesting as a tool for routine acquisition of Library materials. However, balancing the ongoing maintenance of the web archives through this growth has previously left little room for exploring the web archives computationally. And if you sensed an “until now” coming, then go ahead, quit your day job and buy yourself a pack of Tarot Cards.

New tools and workflows utilized by the Library’s Web Archiving Team have paved the way for us to begin something that has been overdue: a deep dive into the web archive. By diving in, we hope to form a deeper understanding of the nature of individual web objects in the archive. This deeper understanding will allow more comprehensive maintenance of the archived web objects in the future. It will also ultimately provide several avenues of access to the archive for you, our users, our patrons, which aligns perfectly with the goal of helping to “throw open the treasure chest,” as articulated in the Digital Strategy for the Library of Congress. This post serves as an introduction to the work we’ve done and the work that is yet to come.

The Best Way to Eat the Web Archive Elephant (For Us, For Now)

As we embarked on this brave new world of computational analysis, the first thing we encountered was a dilemma, and a rather basic one at that: how do we even begin? Though the tools and resources at our disposal were the best they had been, we still weren’t at a point where we could run analysis over the entire web archive simply because of its size and scale.

Before going further, some basic terms need defining for those not intimately familiar with web archiving. When we talk about the “web archive,” we’re talking specifically about WARC files, or their predecessor, ARC files, maintained on digital storage systems at the Library. W/ARC is the standard web archiving file format, and W/ARC files are compressed containers of web objects and metadata about those web objects. It is these W/ARC files that make up the 1.7 petabytes previously mentioned. And it was the size and scale of these files that were the cause and concern of our little endeavor. So, now, what to do?

Luckily, we decided, as a pilot and our first dip into this veritable lake of data, we didn’t need to analyze the entirety of the web archive. We were only interested in a handful of fields that just so happened to be represented in CDX index files—files we create as part of our routine workflow. Again, for those not as familiar with web archiving or those who don’t follow links in blog posts, CDX index files are concatenated lines of metadata about objects contained within W/ARC files. Each line in a CDX index file represents a single web object.
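To make "one line, one web object" a bit more concrete, here is a minimal sketch of reading a single CDX line in Python. It assumes the common 11-column, space-delimited CDX layout; the post does not say which layout the Library's indexes actually use, and the names `CdxRecord` and `parse_cdx_line`, along with the sample line, are made up for illustration.

```python
from typing import NamedTuple


# Assumed field order: the common 11-column "CDX N b a m s k r M S V g" layout.
# The Library's own CDX files may use a different layout.
class CdxRecord(NamedTuple):
    urlkey: str
    timestamp: str
    original_url: str
    mime_type: str
    status: str
    digest: str
    redirect: str
    meta_tags: str
    compressed_size: str
    offset: str
    filename: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one space-delimited CDX line into named fields."""
    fields = line.strip().split(" ")
    if len(fields) != len(CdxRecord._fields):
        raise ValueError(
            f"expected {len(CdxRecord._fields)} fields, got {len(fields)}"
        )
    return CdxRecord(*fields)


# Made-up sample line, for illustration only.
sample = (
    "gov,example)/report.pdf 20080115120000 http://example.gov/report.pdf "
    "application/pdf 200 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - - 12345 678 "
    "example-20080115.warc.gz"
)
record = parse_cdx_line(sample)
print(record.mime_type, record.digest)
```

The fields that matter for the rest of this post are the original URL, the MIME type, and the digest.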

So, for our initial foray into the web archive analysis, we examined the metadata about the web objects as opposed to the web objects themselves. Our determination was that this approach would afford us a high-level view of the archive and a solid foundation from which to build out future analyses.

How and What We Learned From Our Quick Dip

The “how” is pretty simple. For those of you technically inclined, we ran a series of MapReduce jobs over the CDX index files, mapping and reducing on the MIME type and digest fields. For those of you less technically inclined, we programmatically read the CDX index files, sorted each line based on the MIME type and digest field, and counted the results. As a part of this process, we also included the URL of the captured web object to determine top-level domain aggregates, but more on that later.
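For the curious, the map-and-reduce idea can be sketched in a few lines of plain Python. This is not our production job, just an in-memory illustration that assumes the MIME type sits in the fourth column and the digest in the sixth; the function names and sample lines are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple


def map_line(line: str) -> Iterator[Tuple[str, str]]:
    """Map step: emit a (mime_type, digest) pair for one CDX line."""
    fields = line.split()
    if len(fields) >= 6:  # assumed layout: MIME type in column 4, digest in column 6
        yield fields[3], fields[5]


def reduce_pairs(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Reduce step: count the distinct digests seen for each MIME type."""
    digests_by_mime: Dict[str, Set[str]] = defaultdict(set)
    for mime_type, digest in pairs:
        digests_by_mime[mime_type].add(digest)
    return {mime: len(digests) for mime, digests in digests_by_mime.items()}


def count_unique_objects(lines: Iterable[str]) -> Dict[str, int]:
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_pairs(pairs)


# Made-up CDX lines: the first two share a digest, so they count once.
sample_lines = [
    "gov,example)/a.pdf 20080101000000 http://example.gov/a.pdf application/pdf 200 DIGESTONE - - 10 0 x.warc.gz",
    "gov,example)/a.pdf 20090101000000 http://example.gov/a.pdf application/pdf 200 DIGESTONE - - 10 9 y.warc.gz",
    "gov,example)/b.pdf 20080101000000 http://example.gov/b.pdf application/pdf 200 DIGESTTWO - - 10 5 x.warc.gz",
    "gov,example)/ 20080101000000 http://example.gov/ text/html 200 DIGESTTHREE - - 10 7 x.warc.gz",
]
unique_counts = count_unique_objects(sample_lines)
print(unique_counts)
```

Holding every digest in memory only works for a toy sample; at the scale of a full archive, the grouping and counting would be spread across a cluster.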

For clarification, the MIME type field is the media type as identified and maintained by the Internet Assigned Numbers Authority, and as reported by the web object’s server at the time of capture. The digest field is a unique cryptographic hash of the web object’s payload at the time of the crawl, which provides a distinct fingerprint for that object. For our high-level analysis, the MIME or media type helped organize the web object metadata we examined, and the digest helped ensure uniqueness.
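As a rough illustration of why a digest works as a fingerprint, the sketch below hashes a payload and encodes the result. It assumes a Base32-encoded SHA-1 hash, a convention used by some web archiving tools; the post does not state which algorithm produced our digests, and `payload_digest` is a hypothetical helper.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a Base32-encoded SHA-1 digest of a payload (assumed convention)."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


first_capture = b"%PDF-1.4 example payload"
second_capture = b"%PDF-1.4 example payload"  # same bytes, captured again
changed_capture = b"%PDF-1.4 revised payload"

print(payload_digest(first_capture) == payload_digest(second_capture))   # True
print(payload_digest(first_capture) == payload_digest(changed_capture))  # False
```

Two captures with identical bytes share a digest, which is what lets a count of digests stand in for a count of unique objects.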

After the “how” came the “what.” Our analysis of the CDX index files, particularly of the PDF web object metadata, yielded some very interesting data points. For instance, we counted 42,188,995 unique web objects with the PDF media type. But what does that number actually mean? Well, if each PDF web object (that is, a line in the CDX index file with a recognizable PDF MIME type) is indeed what it reports to be, meaning a reference to an actual PDF file in the web archive, then that would imply there are a little over 42.1 million PDFs in the Library’s web archive. However, web archives are a special flower in many regards, and there are a few caveats that go along with this seemingly basic assumption.

Those caveats can be summed up as: 1) CDX MIME types lie; and 2) metadata about the web objects are not the same as web objects themselves. The server that reported the object’s MIME type at the time of crawl possibly fibbed. Since that report is the basis of the metadata used for our analysis, we must understand that we are working from derivative metadata rather than the object itself. Validating the resources contained in the W/ARC files is a labor-intensive process that we are not yet poised to do.

You can’t say we didn’t warn you! But for simplicity’s sake, let’s put the caveats aside for now and take those numbers at face value; as we move forward, we’ll refer to the metadata extracted from the CDX index files as web objects. Even with that distinction made, how can we begin to wrap our minds around 42.1 million of anything? One thing working in our favor here is that PDFs are arguably the type of digital object most closely resembling something tangible: printed pages. Thankfully, there has been some great work on actual PDF documents in web archives, furthering this line of thinking, and we can build on that work to draw some inferences about our web objects.

Analysis of the 2008 End of Term Web Archive suggests that PDF documents in that collection contained an average of 13.8 pages. So, if we printed out our 42,188,995 unique PDFs, keeping that average in mind, we would have approximately 582,207,579 pages burying our printer. If we work from the notion that there are roughly 1,800 pages in a linear foot, then those 582,207,579 printed pages would be equivalent to more than 61 linear miles of physical shelf space!
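If you would like to check the back-of-the-envelope math yourself, here is a small sketch using only the figures quoted above. The 5,280 feet per mile is the standard conversion; the variable names are ours. Multiplying straight through lands within a few hundred pages of the page count quoted above, and a little over 61 miles.

```python
UNIQUE_PDFS = 42_188_995
AVG_PAGES_TENTHS = 138  # 13.8 pages per PDF, kept in tenths to stay in integer math
PAGES_PER_LINEAR_FOOT = 1_800
FEET_PER_MILE = 5_280

# Estimated printed pages, rounded down to a whole page.
total_pages = UNIQUE_PDFS * AVG_PAGES_TENTHS // 10

# Convert pages to shelf space.
linear_feet = total_pages / PAGES_PER_LINEAR_FOOT
linear_miles = linear_feet / FEET_PER_MILE

print(f"{total_pages:,} pages")
print(f"{linear_miles:.1f} linear miles")
```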

Number of Unique PDFs by Top-Level Domain
gov 19,211,519
org 9,471,589
com 4,485,917
us 2,714,703
id 2,397,448
br 1,557,179
ir 1,293,551
au 960,933
net 762,488
edu 649,763
int 577,685
za 518,213
mil 429,655
de 428,656
ca 374,287
in 369,021
my 359,225
eu 281,752
ro 252,564
jp 177,187

If you remember, we also included the URL of the web object as part of our analysis. We did some post-processing work, and summed up the frequency of top-level domains for the PDF web objects.

The corpus of unique PDF web objects comes from 739 top-level domains. The top-level domain is the last part of a website domain (like .com or .gov). You can see the 20 domains with the most unique PDF web objects associated with them in the table above.
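The post-processing step can be approximated as follows. This is a minimal sketch, not the script we ran: it assumes each digest is counted once per top-level domain, and the helper names and sample records are invented for illustration.

```python
from collections import Counter
from typing import Iterable, Optional, Set, Tuple
from urllib.parse import urlsplit


def top_level_domain(url: str) -> Optional[str]:
    """Return the last label of the URL's hostname, or None if there isn't one."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    if "." not in host:
        return None
    tld = host.rsplit(".", 1)[-1].lower()
    # Skip IPv4 addresses, whose last "label" is just a number.
    return None if tld.isdigit() else tld


def count_pdfs_by_tld(records: Iterable[Tuple[str, str]]) -> Counter:
    """Count unique digests per TLD; records are (url, digest) pairs."""
    seen: Set[Tuple[str, str]] = set()
    counts: Counter = Counter()
    for url, digest in records:
        tld = top_level_domain(url)
        if tld is None or (tld, digest) in seen:
            continue
        seen.add((tld, digest))
        counts[tld] += 1
    return counts


# Made-up (url, digest) pairs for PDF web objects.
sample_records = [
    ("http://example.gov/a.pdf", "DIGESTONE"),
    ("http://www.example.gov/a-copy.pdf", "DIGESTONE"),  # duplicate payload
    ("http://example.gov/b.pdf", "DIGESTTWO"),
    ("https://example.org/c.pdf", "DIGESTTHREE"),
    ("http://192.0.2.1/d.pdf", "DIGESTFOUR"),  # no TLD, skipped
]
tld_counts = count_pdfs_by_tld(sample_records)
print(tld_counts.most_common())
```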

Illustrative of the extent of the Library’s efforts to archive government websites, the most frequent top-level domain for the PDF web objects is .gov. 19,211,519 of the PDF web objects were archived from sites with the .gov top-level domain. Since the .gov domain is restricted for use only by United States government entities at federal, state, and local levels, a further exploration (hint hint) might take these 19.2 million PDF web objects into consideration as government documents.

Looking further down the list, the geographical diversity of our web archiving efforts begins to emerge, which mirrors the Library’s collection of international materials across all mediums. The top-level domains .id, .br, .ir, and .au represent country-specific domains from Indonesia, Brazil, Iran, and Australia, respectively, and offer a potential wealth of more than 6.2 million international documents to explore.

It is crucial to know your data when performing any flavor of large-scale, computational analysis. We know the collecting policies of the Library and subsequently the themes of the Library’s web archive collections. We also know the prevalence of domains such as .com and .org throughout the live web. Our initial pass at dipping into the CDXs has tracked with what we would expect to see for this type of analysis.

“I want more!” you say?

We hear you! In fact, keep watching The Signal for more focused dives into sets of archived web objects. While we still can’t analyze our entire archive in one fell swoop, we are pulling out our microscopes and studying samples across the archive, sharing out our findings as we go. According to our friends at LC Labs, those sample sets will even come with a download link, just for you!
