Candidates, Campaigns, and CDX Files: A New United States Elections Web Archive Dataset

This blog post was co-authored by Chase Dooley (Senior Digital Collections Specialist) and Tracee Haupt (Digital Collections Specialist), members of the Library’s Web Archiving Team.

The Library’s Web Archiving Team recently released a derivative dataset that describes the United States Elections Web Archive, a collection that preserves over twenty years of campaign websites for candidates in presidential, gubernatorial, and congressional elections. This new dataset is part of a larger initiative to support emerging styles of computational research and joins other datasets that have been made publicly available by the Web Archiving Team and LC Labs. To help researchers better understand the dataset and how it might be used, the Web Archiving Team also created a Jupyter Notebook that goes step by step through the technical process of how the dataset was created and demonstrates a few ways to analyze it.

The United States Elections Web Archive

Campaign websites document political messaging in the digital era, and preserving them is crucial because they tend to change frequently throughout campaign seasons and disappear when the elections are over. The Library’s Web Archiving Team began the United States Elections Web Archive in 2000, when the concept of a campaign website was still relatively new and only a fraction of Americans reported getting election information online. In over two decades of collecting, the Library has created a historical record of how candidates’ websites and digital strategies evolved to become more sophisticated, and how the internet became an increasingly influential part of political campaigns.

Although there has been some variation in earlier years, for presidential elections the team typically begins capturing websites in the lead-up to the primaries, and for gubernatorial and congressional elections it begins capturing websites when candidates have been selected for the final ballot. The Library collects websites from all major political parties (Democratic, Republican, Libertarian, and Green), as well as from candidates of lesser-known parties that make it onto the ballot. Recommending Officers select the websites with the goal of capturing as much of the candidate’s web presence as possible, which sometimes includes social media channels or “spin-off” sites created by a candidate to highlight a particular topic, theme, or constituency. The frequency with which these websites are crawled can vary, but is usually once a week, and the content becomes publicly available after a one-year embargo period. In earlier years, the United States Elections Web Archive also included websites from the government, political parties, advocacy groups, bloggers, and other entities that produced election-relevant content, but these types of websites are now part of other ongoing collections like the Public Policy Topics Web Archive.

The Data

The dataset consists of web archive capture indexes (CDX files) from websites in the United States Elections Web Archive. CDX files are concatenated lines of metadata wherein each line represents a single object within a WARC (Web ARChive) file, the standard web archiving format. The dataset contains 411,815 Gzipped CDX files, totaling 250GB, which can be downloaded in bulk or subdivided by election year—starting from 2000 through the most recent election year out of embargo (2018), with more data to be added as it is released.
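Because each file in the dataset is Gzipped, Python’s standard-library gzip module can read it directly, no manual decompression needed. The sketch below writes a small made-up sample file first so that it runs on its own; with the real data, the path would point at one of the downloaded .cdx.gz files (the filename and sample line here are invented for illustration).

```python
import gzip

# A made-up CDX header plus one made-up capture line, so this sketch
# runs without downloading the real dataset.
SAMPLE = (
    " CDX N b a m s k r M S V g\n"
    "com,example)/ 20001002182124 http://example.com/ text/html 200 "
    "FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP - - 15959 691714720 sample.warc.gz\n"
)

with gzip.open("sample.cdx.gz", "wt", encoding="utf-8") as f:
    f.write(SAMPLE)

def read_cdx_lines(path):
    """Yield metadata lines from a Gzipped CDX file, skipping the header."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            if line and not line.lstrip().startswith("CDX"):
                yield line

lines = list(read_cdx_lines("sample.cdx.gz"))
```

Reading in text mode ("rt") with an explicit encoding keeps the lines ready for splitting into fields, which is how the notebook proceeds.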

CDX files are created as the websites are crawled and are part of the process of providing access to archived websites using Wayback Machine software. These CDX files consist of eleven metadata fields delimited by a single space. This metadata, which includes URLs, timestamps, file sizes, digests, status codes, and mime types, provides a general summary of the vast amount of content in the collection and can serve as the basis for large-scale computational analysis. The main fields are listed and described in the table below, with additional details found in the associated README file.

Attribute | Definition | Example
urlkey | URL of the captured web object, in SURT format | com,voter)/home/candidates/info
timestamp | timestamp for when the document was captured, in the format YYYYMMDDhhmmss | 20001002182124
original | URL of the captured web object |
mime type | media type as recorded in the CDX | text/html
status code | HTTP response code for the document at the time of its capture | 200
digest | unique cryptographic hash of the web object’s payload at the time of the crawl; a Base32-encoded SHA-1 hash | FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP
file size | size of the web object in bytes | 15959
offset | location of the resource in the compressed Web Archive (WARC) file that stores the full archived object | 691714720
file name | name of the compressed Web Archive (WARC) file that stores the full archived object |
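Putting the fields together, a single CDX line can be split into a small named record. The sketch below assumes the common eleven-field CDX layout, in which two additional fields (redirect and robot flags, often just “-”) sit between the digest and the file size; the original URL and WARC filename in the sample line are illustrative, not taken from the dataset.

```python
# Minimal sketch: split one space-delimited CDX line into named fields.
# Assumes the common eleven-field CDX layout; the original URL and the
# WARC filename in the sample line below are made up for illustration.
FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "robotflags", "length", "offset", "filename",
]

def parse_cdx_line(line):
    """Return a dict mapping CDX field names to their values."""
    return dict(zip(FIELDS, line.split(" ")))

record = parse_cdx_line(
    "com,voter)/home/candidates/info 20001002182124 "
    "http://voter.com/home/candidates/info text/html 200 "
    "FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP - - 15959 691714720 example.warc.gz"
)
```

Everything comes back as a string, since CDX files carry no type information; the file size and offset would need converting to integers before any arithmetic.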

A previous blog post goes into more detail about CDX files, their limitations, and an example of how the Web Archiving Team has used them to analyze our collections. An example of what a CDX file looks like appears below. If the long string of characters still looks a little intimidating, don’t worry: the Jupyter Notebook was created to help make the data easier to use and understand.

Example of a CDX file

The Jupyter Notebook

Along with the new dataset, the Web Archiving Team has also released a new Jupyter Notebook that provides examples of how to further explore and analyze the data. This notebook joins others that are available through LC for Robots, and includes annotated blocks of Python code that can be reused and altered to suit researcher needs. If you need an introduction to Jupyter Notebooks, or want to brush up on your Python skills before diving in, we encourage you to check out free online tutorials.

One of the first things that the notebook does is provide the means for transforming CDX files into a DataFrame, a tabular format with rows and columns of data that can be easier to read and analyze. Next, the notebook demonstrates how to subdivide the data by election year and do an analysis of the mime type field. Mime types are the media types contained within a webpage, and they serve a similar function for the internet as file extensions do for Windows computers. The notebook demonstrates how to generate a list of mime types and the total number of each using a method from Pandas, an open source Python package used for data analysis.
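That step can be sketched as follows. The column names assume the eleven-field CDX layout, and the three sample lines are invented stand-ins; the notebook itself builds the DataFrame from the downloaded CDX files instead.

```python
import pandas as pd

# Sketch: load space-delimited CDX lines into a pandas DataFrame and
# count mime types. The sample lines are invented for illustration.
COLUMNS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "robotflags", "length", "offset", "filename",
]

lines = [
    "com,example)/ 20001002182124 http://example.com/ text/html 200 AAA - - 100 0 a.warc.gz",
    "com,example)/logo.png 20001002182130 http://example.com/logo.png image/png 200 BBB - - 200 100 a.warc.gz",
    "com,example)/404 20001002182140 http://example.com/404 - 404 CCC - - 50 300 a.warc.gz",
]

df = pd.DataFrame([line.split(" ") for line in lines], columns=COLUMNS)

# value_counts() tallies each distinct mime type, most common first.
mime_counts = df["mimetype"].value_counts()
```

The value_counts() call is the pandas method the notebook relies on for the mime type totals shown below.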

Most common mimetypes in the United States Elections Web Archive
As you can see above, the majority of the values show up as “-”. This means that a mime type wasn’t recorded during indexing, likely because the server didn’t provide one or the indexer could not parse the response. Further research into this could prove interesting. To simplify things, however, these blanks were dropped from the analysis for now and the remainder was turned into a bar graph.
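Dropping the blanks is a one-line filter in pandas. The sketch below uses a tiny invented DataFrame rather than the real collection, and the commented-out final line indicates how the filtered counts could become a bar graph (that step requires matplotlib).

```python
import pandas as pd

# Sketch: filter out the "-" placeholder before charting. The values
# here are invented; in the notebook the DataFrame holds the real CDX
# fields built from the collection.
df = pd.DataFrame(
    {"mimetype": ["text/html", "text/html", "image/png", "-", "-", "-"]}
)

known = df[df["mimetype"] != "-"]
counts = known["mimetype"].value_counts()

# With matplotlib installed, a bar graph is one call away:
# counts.plot(kind="bar")
```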

Most common mimetypes in the United States Elections Web Archive

Since the majority of the remaining resources in the dataset report a text-based mime type, the data lends itself to textual analysis. As an example, the notebook shows how to fetch all the text from just the first fifty rows and, using simple summation and sorting, create a list of the top 25 words. If you have access to a larger machine, you can increase that number or run the analysis across the whole DataFrame.
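The word-count step amounts to tokenizing the fetched text and tallying it. In the sketch below, two invented strings stand in for the text the notebook fetches from the archived pages, since retrieving the real pages requires network access.

```python
import re
from collections import Counter

# Sketch: tally the most common words in a body of text. The two sample
# strings are invented stand-ins for text fetched from the archive.
texts = [
    "Vote for our candidate. The candidate supports education and jobs.",
    "Jobs, education, and healthcare are on the ballot this November.",
]

counter = Counter()
for text in texts:
    # Lowercase and keep alphabetic runs only, so "Jobs," matches "jobs".
    counter.update(re.findall(r"[a-z]+", text.lower()))

top_words = counter.most_common(25)
```

Counter.most_common(25) handles the summation and sorting in one call; swapping in the text from the first fifty DataFrame rows would reproduce the notebook’s result.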

Top 25 words in a selection from the United States Elections Web Archive

The notebook is intended to provide a basic outline of how to get started with the new dataset, and we encourage you to build on it to come up with your own ways to analyze and interpret the data. Do you have interesting ideas for how this data could be used? We invite you to share them with us at [email protected]. Does this dataset leave you wanting more? Check out additional web archiving datasets and other blog posts that have been written about exploring web archiving data.
