This blog post was co-authored by Chase Dooley (Senior Digital Collections Specialist) and Tracee Haupt (Digital Collections Specialist), members of the Library’s Web Archiving Team.
The Library’s Web Archiving Team recently released a derivative dataset that describes the United States Elections Web Archive, a collection that preserves over twenty years of campaign websites for candidates in presidential, gubernatorial, and congressional elections. This new dataset is part of a larger initiative to support emerging styles of computational research and joins other datasets that have been made publicly available by the Web Archiving Team and LC Labs. To help researchers better understand the dataset and how it might be used, the Web Archiving Team also created a Jupyter Notebook that goes step by step through the technical process of how the dataset was created and demonstrates a few ways to analyze it.
The United States Elections Web Archive
Campaign websites document political messaging in the digital era, and preserving them is crucial because they tend to change frequently throughout campaign seasons and disappear when the elections are over. The Library’s Web Archiving Team began the United States Elections Web Archive in 2000, when the concept of a campaign website was still relatively new and only a fraction of Americans reported getting election information online. In over two decades of collecting, the Library has created a historical record of how candidates’ websites and digital strategies evolved to become more sophisticated, and how the internet became an increasingly influential part of political campaigns.
Although there has been some variation in earlier years, the team typically begins capturing presidential campaign websites in the lead-up to the primaries, and begins capturing gubernatorial and congressional campaign websites once candidates have been selected for the final ballot. The Library collects websites from all major political parties (Democratic, Republican, Libertarian, and Green), as well as candidates from lesser-known parties that make it onto the ballot. Recommending Officers select the websites with the goal of capturing as much of the candidate’s web presence as possible, which sometimes includes social media channels or “spin-off” sites created by a candidate to highlight a particular topic, theme, or constituency. The frequency with which these websites are crawled can vary, but is usually once a week, and the content becomes publicly available after a one-year embargo period. In earlier years, the United States Elections Web Archive also included websites from the government, political parties, advocacy groups, bloggers, and other entities that produced election-relevant content, but these types of websites are now part of other ongoing collections like the Public Policy Topics Web Archive.
The dataset consists of web archive capture indexes (CDX files) from websites in the United States Elections Web Archive. A CDX file is a sequence of metadata lines in which each line represents a single object within a WARC (Web ARChive) file, the standard web archiving format. The dataset contains 411,815 Gzipped CDX files, totaling 250 GB, which can be downloaded in bulk or by election year, from 2000 through the most recent election year out of embargo (2018), with more data to be added as it is released.
CDX files are created as the websites are crawled and are part of the process of providing access to archived websites using Wayback Machine software. Each line of these CDX files consists of eleven metadata fields delimited by a single space. The metadata includes URLs, timestamps, file sizes, digests, status codes, and mime types. It is one way to summarize the vast amount of content in the collection, and it can serve as the basis for large-scale computational analysis. The main fields are listed and described in the table below, with additional details found in the associated README file.
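Because the files are Gzip-compressed plain text, they can be read line by line without unpacking them first. The following is a minimal sketch using only Python's standard library. The function name `iter_cdx_lines`, the file name `example.cdx.gz`, and the sample line are illustrative assumptions, not names taken from the dataset or the notebook.

```python
import gzip
import tempfile
from pathlib import Path
from typing import Iterator


def iter_cdx_lines(path: Path) -> Iterator[str]:
    """Yield the non-empty lines of one Gzip-compressed CDX file."""
    # Mode "rt" decompresses on the fly and decodes the bytes to text.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line


# Stand-in data so the sketch runs without downloading the real dataset.
# The line is a made-up composite, not a real record.
sample = (
    "com,voter)/home/candidates/info 20001002182124 "
    "http://www.voter.com/home/candidates/info text/html 200 "
    "FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP - - 15959 691714720 example.warc.gz\n"
)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "example.cdx.gz"  # hypothetical file name
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        out.write(sample)
    lines = list(iter_cdx_lines(demo_path))

print(len(lines), "line(s) read")
```

Reading lazily with a generator keeps memory use flat, which matters when looping over hundreds of thousands of files.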
|Field|Description|Example|
|---|---|---|
|urlkey|URL of the captured web object, in SURT format|com,voter)/home/candidates/info|
|timestamp|timestamp for when the document was captured, in the format of YYYYMMDDhhmmss|20001002182124|
|original|URL of the captured web object|http://www.voter.com/home/candidates/info|
|mime type|media type as recorded in the CDX|text/html|
|status code|HTTP response code for the document at the time of its capture|200|
|digest|unique, cryptographic hash of the web object’s payload at the time of the crawl; a Base32 encoded SHA-1 hash, derived from the CDX index file|FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP|
|file size|size of the web object in bytes|15959|
|offset|location of the resource in the compressed Web Archive (WARC) file which stores the full archived object|691714720|
|file name|name of the compressed Web Archive (WARC) file which stores the full archived object|LOC-ELECTION2018-001-20180703135034546-00000-25341~wbgrp-crawl216.us.archive.org~8443.warc.gz|
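To make the field layout concrete, here is a minimal sketch that splits one space-delimited CDX line into named fields. It assumes the common eleven-field CDX column order, in which two fields not listed in the table (labeled `redirect` and `metatags` here) sit between the digest and the file size. That ordering is an assumption to check against the README. The sample line is a composite built from the example values in the table, not a real record, and `CdxRecord` and `parse_cdx_line` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CdxRecord:
    urlkey: str
    timestamp: datetime
    original: str
    mimetype: str
    statuscode: str
    digest: str
    redirect: str      # assumed field, not described in the table
    metatags: str      # assumed field, not described in the table
    file_size: Optional[int]
    offset: Optional[int]
    file_name: str


def _to_int(value: str) -> Optional[int]:
    # "-" marks a missing value, so only convert plain digit strings.
    return int(value) if value.isdigit() else None


def parse_cdx_line(line: str) -> CdxRecord:
    parts = line.strip().split(" ")
    if len(parts) != 11:
        raise ValueError(f"expected 11 fields, found {len(parts)}")
    (urlkey, ts, original, mime, status,
     digest, redirect, meta, size, offset, name) = parts
    return CdxRecord(
        urlkey=urlkey,
        timestamp=datetime.strptime(ts, "%Y%m%d%H%M%S"),
        original=original,
        mimetype=mime,
        statuscode=status,
        digest=digest,
        redirect=redirect,
        metatags=meta,
        file_size=_to_int(size),
        offset=_to_int(offset),
        file_name=name,
    )


sample_line = (
    "com,voter)/home/candidates/info 20001002182124 "
    "http://www.voter.com/home/candidates/info text/html 200 "
    "FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP - - 15959 691714720 "
    "LOC-ELECTION2018-001-20180703135034546-00000-25341"
    "~wbgrp-crawl216.us.archive.org~8443.warc.gz"
)

record = parse_cdx_line(sample_line)
print(record.timestamp.isoformat())  # 2000-10-02T18:21:24
print(record.mimetype, record.file_size)
```

Converting the timestamp to a `datetime` makes it straightforward to group captures by day, month, or year later on.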
A previous blog post goes into more detail about CDX files and their limitations, and gives an example of how the Web Archiving Team has used them to analyze our collections. An example of what a CDX file looks like appears below. If the long string of characters still looks a little intimidating, don’t worry: the Jupyter Notebook was created to help make the data easier to use and understand.
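One field that benefits from a worked example is the digest, which the table describes as a Base32 encoded SHA-1 hash of the payload. The sketch below shows how such a value can be computed with Python's standard library, and why two captures with the same digest can be treated as having identical content. Exactly which bytes the crawler hashed for each record is not specified here, so treat the helper `payload_digest` and the sample payloads as illustrative assumptions.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return the Base32 encoded SHA-1 hash of a payload."""
    sha1_bytes = hashlib.sha1(payload).digest()  # 20 raw bytes
    return base64.b32encode(sha1_bytes).decode("ascii")  # 32 characters


# Made-up payloads standing in for two captures of the same page.
page_v1 = b"<html>Vote on November 7</html>"
page_v2 = b"<html>Vote on November 7!</html>"

unchanged = payload_digest(page_v1) == payload_digest(page_v1)
changed = payload_digest(page_v1) != payload_digest(page_v2)
empty_digest = payload_digest(b"")

print(unchanged, changed)
print(empty_digest)
```

Comparing digests across timestamps for the same URL is one inexpensive way to estimate how often a page actually changed between crawls.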
The Jupyter Notebook
Along with the new dataset, the Web Archiving Team has also released a new Jupyter Notebook that provides examples of how to further explore and analyze the data. This notebook joins others that are available through LC for Robots, and includes annotated blocks of Python code that can be reused and altered to suit researcher needs. If you need an introduction to Jupyter Notebooks, or want to brush up on your Python skills before diving in, we encourage you to check out free online tutorials.
One of the first things that the notebook does is provide the means for transforming CDX files into a DataFrame, a tabular format with rows and columns of data that can be easier to read and analyze. Next, the notebook demonstrates how to subdivide the data by election year and analyze the mime type field. A mime type identifies the media type of a resource, such as text/html or image/jpeg, and serves a similar function on the internet as file extensions do on Windows computers. The notebook demonstrates how to generate a list of mime types and the total number of each using a method from Pandas, an open source Python package used for data analysis.
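As a rough illustration of that workflow, the sketch below loads a few CDX lines into a Pandas DataFrame and counts the mime types with `value_counts`. It is not the notebook's code. The column names, the assumed eleven-field order, and the sample lines (including the placeholder digests) are all made up for the example, and the `capture_year` column is taken from the timestamp as a stand-in for the dataset's own election year subdivisions.

```python
import csv
import io

import pandas as pd

# Assumed column order for the eleven space-delimited fields.
COLUMNS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "metatags", "file_size", "offset", "file_name",
]

# Made-up sample lines; real data would come from the Gzipped CDX files.
sample_cdx = "\n".join([
    "com,example)/ 20001002182124 http://example.com/ text/html 200 D1 - - 100 0 a.warc.gz",
    "com,example)/about 20001002182130 http://example.com/about text/html 200 D2 - - 120 100 a.warc.gz",
    "com,example)/logo.gif 20001002182131 http://example.com/logo.gif image/gif 200 D3 - - 50 220 a.warc.gz",
    "com,example)/x 20021002182131 http://example.com/x - 200 D4 - - 10 270 b.warc.gz",
    "com,example)/y 20021002182132 http://example.com/y - 200 D5 - - 10 280 b.warc.gz",
    "com,example)/z 20021002182133 http://example.com/z - 200 D6 - - 10 290 b.warc.gz",
])

df = pd.read_csv(
    io.StringIO(sample_cdx),
    sep=" ",
    names=COLUMNS,
    header=None,
    dtype=str,                 # keep every field as text at first
    keep_default_na=False,     # do not turn any values into NaN
    quoting=csv.QUOTE_NONE,    # quotes inside URLs are ordinary characters
)

# Year of capture, sliced from the YYYYMMDDhhmmss timestamp.
df["capture_year"] = df["timestamp"].str[:4]

# Tally of each mime type, most frequent first.
mime_counts = df["mimetype"].value_counts()
print(mime_counts)
```

Reading every column as text first avoids surprises from placeholder values such as “-”. Numeric columns can be converted afterwards once the placeholders are handled.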
As you can see above, the majority of the values show up as “-”. This means that a mime type wasn’t found during indexing, likely because either the server didn’t provide one or the indexer could not parse the response. Further research into this could prove interesting. To simplify things, however, these blanks were dropped from the analysis for now and the remainder was turned into a bar graph.
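The filtering step can be sketched in a few lines of Pandas. The values below are invented for illustration, and the variable names are assumptions rather than the notebook's own.

```python
import pandas as pd

# Made-up mime type values; "-" marks captures with no recorded mime type.
mimetypes = pd.Series(
    ["-", "-", "-", "-", "text/html", "text/html", "image/gif"],
    name="mimetype",
)

# Share of captures with no recorded mime type.
missing_share = (mimetypes == "-").mean()

# Drop the placeholders, then count what remains.
known_counts = mimetypes[mimetypes != "-"].value_counts()

print(f"missing: {missing_share:.0%}")
print(known_counts)

# With matplotlib installed, known_counts.plot.bar() would draw a bar graph.
```

Recording the missing share before dropping the placeholders keeps the size of the gap visible in any later write-up.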
Because the majority of the remaining resources in the dataset report a text-based mime type, textual analysis is a natural next step. As an example, the notebook shows how to fetch all the text from just the first fifty rows and, using a simple summation and sorting, create a list of the top 25 words. If you have access to a larger machine, it’s possible to increase the number of rows or run the analysis across the whole DataFrame.
The notebook is intended to provide a basic outline of how to get started with the new dataset, and we encourage you to build on it to come up with your own ways to analyze and interpret the data. Do you have interesting ideas for how this data could be used? We invite you to share them with us at [email protected]. Does this dataset leave you wanting more? Check out additional web archiving datasets and other blog posts that have been written about exploring web archiving data.
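The counting and sorting part of that analysis can be sketched with Python's standard library. The example leaves out the step of fetching text from the archived pages and works on a few made-up strings instead. The function name `top_words` and the simple tokenizing rule are assumptions, not the notebook's implementation.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def top_words(texts: Iterable[str], n: int = 25) -> List[Tuple[str, int]]:
    """Return the n most frequent words across all texts."""
    counts: Counter = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and apostrophes as words.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)


# Made-up stand-ins for text extracted from archived campaign pages.
sample_texts = [
    "Vote for jobs and schools",
    "Jobs jobs jobs: a plan for schools",
    "Vote early",
]

for word, count in top_words(sample_texts, n=5):
    print(word, count)
```

On real pages, common words such as “the” and “and” would dominate the list, so filtering out a stop-word list before counting is a likely next refinement.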