In the Library’s Web Archives: Sorting through a Set of US Government PDFs

The Digital Content Management section has been working on a project to extract and make available sets of files from the Library’s significant web archives holdings. This is another step to explore the web archives and make them more widely accessible and usable. Our aim in creating these sets is to identify reusable, “real world” content in the Library’s digital collections, which we can provide for public access. The outcome of the project will be a series of datasets, each containing 1,000 files of related media types selected from .gov domains. We will announce and explore these datasets here on The Signal, and the data will be made available through LC Labs. Although we invite usage and interest from a wide range of digital enthusiasts, we are particularly hoping to interest practitioners and scholars working on digital preservation education and digital scholarship projects.

Introduction to the File Datasets

The process to create datasets of the various media types follows a common set of parameters. These sets are not intended as exemplars or test sets, but rather as randomly selected samples that represent aspects of a larger, curated corpus. We do, however, hope that they offer a representative cross-section from a limited sampling frame. The first limit was the larger corpus. The Library seeks to collect sites “broadly,” though not comprehensively, from all the branches of the federal government, and so this dataset initiative amplifies one of our collecting emphases. For this dataset, we selected items that were posted on publicly accessible US .gov domains at the time that the Library archived the resources.

This screen capture illustrates some of the “response header” text, which a server sends alongside the web object in response to a browser request. The red box outlines the media type designation for “application/pdf”.

We further limited our files by media type, which was identified according to metadata recorded in web harvesting logs. This metadata is supplied by the source site at the time of harvest, and it has not been further validated. The information is generally asserted by the provider’s servers and systems, just as it would be on the live web. Because the value may or may not be accurate, the sets offer an added opportunity: they may be used to explore how accurate this supplied technical information is. An example of the media type information that might be received is illustrated in the “Response header information,” where the media type application/pdf, the selector for this dataset, is highlighted (see figure). From the file population limited by domain and media type, we randomly selected 1,000 items. We plan to release additional sets over the coming year, including various office and data documents, audio, video, and other formats. The PDF set is available here.
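For readers who want to experiment with a similar selection, here is a minimal sketch of the idea: filter a list of harvest records by reported media type, then draw a random sample. It is not the Library's actual procedure (that is described in each README). The record layout, the `media_type` key, and the `seed` parameter are assumptions made for illustration.

```python
import random


def sample_by_media_type(records, media_type="application/pdf", k=1000, seed=None):
    """Return up to k randomly chosen records whose reported media type matches.

    `records` is assumed to be a list of dicts with a hypothetical
    "media_type" key holding the value the server reported.
    """
    # Servers sometimes append parameters such as "; charset=binary",
    # so compare only the part before the first semicolon.
    matching = [
        r for r in records
        if (r.get("media_type") or "").split(";")[0].strip().lower() == media_type
    ]
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(matching, min(k, len(matching)))
```

Passing a seed is a design choice that lets someone else reproduce the same draw; leaving it as `None` gives a fresh sample each time.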

Each set will be packaged in a consistent structure and derived from comparable data about the web archives. Each set will be delivered as a ZIP file whose contents are structured according to BagIt, and will include fixity information about the contents, a CSV with metadata, a README, and a subfolder containing the set of files. The methods for creating these sets are detailed further in each accompanying README file.
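Because the fixity information travels with the bag, anyone who downloads a set can confirm that the files arrived intact. The sketch below assumes the ZIP has already been unpacked and that the bag follows the usual BagIt layout, with one or more `manifest-<algorithm>.txt` files listing a checksum and a relative path per line. It is a simplified illustration: it ignores percent-encoded paths and tag manifests, which dedicated BagIt tools handle.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir):
    """Check payload files against the bag's manifest-<algorithm>.txt files.

    Returns a list of relative paths that are missing or whose checksum
    differs from the manifest. An empty list means everything matched.
    """
    bag = Path(bag_dir)
    problems = []
    for manifest in sorted(bag.glob("manifest-*.txt")):
        # The algorithm is encoded in the file name, e.g. manifest-sha256.txt
        algorithm = manifest.stem.split("-", 1)[1]
        for line in manifest.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            expected, rel_path = line.split(None, 1)
            rel_path = rel_path.strip()
            target = bag / rel_path
            if not target.is_file():
                problems.append(rel_path)
                continue
            digest = hashlib.new(algorithm)
            with target.open("rb") as fh:
                # Read in 1 MiB chunks so large PDFs do not fill memory
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected.lower():
                problems.append(rel_path)
    return problems
```

Calling `verify_bag` on the unpacked folder and getting back an empty list suggests the payload matches what was published.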

Understanding the 1,000 PDF Set

This chart plots the number of documents harvested by year. The data was derived from the “timestamp” field in the CSV accompanying the PDF set.

The first installment in the series is a set of 1,000 PDF files, randomly selected from .gov domain sites. The set may be described in many ways, but here are a few salient facts about its technical characteristics. Uncompressed, the 1,000 .gov PDFs comprise 827.5 megabytes. The PDFs were harvested during web crawls conducted over roughly two decades, from 1996 to 2017, with significant peaks in 2009 and 2010 (each of those years saw nearly 200 PDFs harvested).
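The harvest-year counts can be reproduced from the accompanying CSV. The following sketch tallies rows by year using the “timestamp” field. It assumes that each timestamp begins with a four-digit year, as web archive timestamps of the form YYYYMMDDhhmmss do; check the README for the actual format before relying on it.

```python
import csv
from collections import Counter


def harvests_per_year(csv_path, field="timestamp"):
    """Count rows per harvest year, reading the year from the first 4 digits."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            value = (row.get(field) or "").strip()
            # Skip blank or malformed values rather than guessing a year
            if len(value) >= 4 and value[:4].isdigit():
                counts[int(value[:4])] += 1
    return dict(sorted(counts.items()))
```

The same function could be pointed at a different column by changing `field`, provided that column also starts with a year.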

This chart uses the reported “create date” recorded for each web object in the PDF set, to illustrate how many documents were reported for each year, from 1974 to 2017.

The creation dates extracted from the files suggest that the oldest file in the set was created in 1974 and the most recent in 2017. This illustrates one of the challenges of the metadata. According to the published documentation about the PDF family, we know that PDFs weren’t created until the mid-1990s. Why, then, are there dates from the 70s and 80s? When we looked at the 1974 example, we noticed that it was a scan of a memo written in 1974. Someone had entered metadata about the original document. That information is correct, since the source document does date to 1974, but it is misleading in the sense that the date is not when the PDF itself was created. Presumably many of the other dates were automatically generated when the file was created.

This chart plots the number of documents recorded for each PDF version, as reported in the document’s embedded metadata.

What else can we find out from the dataset’s metadata? The source domains mirror the Library’s collecting approach. Although the source domains range from federal to state government sites, there are notable emphases on domains associated with the US Congress (the Library of Congress archives all web pages for members of the House of Representatives and the Senate, as well as House and Senate Committee websites) and on domains associated with the Government Publishing Office (gpo.gov). The set appears to include PDF files of many versions, from 1.0 to 1.7 (as reported in the extracted metadata), although slightly more than half are version 1.4 or 1.5 (together, 534 files).
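One simple way to cross-check both the reported media type and the reported version is to look at the file itself. A PDF normally begins with a header of the form `%PDF-x.y`, so a file served as application/pdf that lacks this header deserves a closer look. The sketch below is an illustration only and is not how the values in the CSV were extracted. It searches the first 1,024 bytes, since some readers tolerate leading bytes before the header. Note too that newer PDFs can declare a later version inside the document, which this check does not read.

```python
import re

# Matches headers such as %PDF-1.4 and captures the version number
_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def pdf_header_version(path, window=1024):
    """Return the version from the %PDF-x.y header, or None if none is found."""
    with open(path, "rb") as fh:
        head = fh.read(window)
    match = _HEADER.search(head)
    return match.group(1).decode("ascii") if match else None
```

Running this over the files in the set and comparing the results with the CSV would show where the header and the extracted metadata disagree.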

This chart shows the number of documents by page count. It shows that more than half of the documents have only 1 or 2 pages. The chart only shows counts for documents that have 12 or fewer pages, so it does not include the “long tail” of page counts. The longest document in the set has more than 1,000 pages!

While most of the documents have one (339) or two (208) pages, there is a wide spectrum of document lengths; the longest document runs to 1,168 pages.

We expect that many other observations could be made about these files, which could reveal further insights about their form and content as government documents. We welcome any insights that you find; please share them in the comments below.

Using the Sets

We envision many possible uses for this and subsequent sets. For example, those who download the set may be able to use it for testing workflows in the processing of digital content or collections. Likewise, it may be used to investigate various methods of file characterization and analysis for technical metadata. Since the files in these sets will be selected entirely from government entities, they may be of particular interest to government document librarians for testing tools or experimenting with ongoing work to identify and auto-categorize documents. Digital preservation and iSchool educators may be interested in using the sets as examples for describing and processing collections with specific content. And of course, while we have suggested these possible uses, there are no doubt many others that we have not imagined. We encourage exploration of the datasets and the accompanying metadata in your own experiments or work to analyze digital content.

This is the first installment of a series of datasets, which we will be posting about over the next year on The Signal. This project is part of our growing efforts to encourage innovation with the Library’s collections, to connect with audiences, and to throw open the treasure chest of the Library’s collections. As more sets become available, we invite you to link to them (//labs.loc.gov/experiments/webarchive-datasets/), download them, and explore them. We hope that they will be of use to the communities that we have already noted, as well as to those that we have not yet considered.

2 Comments

  1. Uldis Bojars
    March 26, 2019 at 12:01 pm

    Is there link text missing in the sentence starting with “For this dataset, we selected items that were […]”?

  2. Aly DesRochers
    March 26, 2019 at 12:06 pm

    Yes, there was, and it has been corrected! Thank you for pointing that out!
