In the Library’s Web Archives: Totally Tabular Data

This is a guest post by Pedro Gonzalez-Fernandez, a Digital Collections Specialist in the Digital Content Management Section at the Library of Congress. He holds both an MLS and an MA in Music History and Literature from University of Maryland, as well as a BA in Music from Shepherd University.

The Digital Content Management section has been working on a project to extract and make available sets of files from the Library’s significant Web Archives holdings. This is another step to explore the Web Archives and make them more widely accessible and usable. Our aim in creating these sets is to identify reusable, “real world” content in the Library’s digital collections, which we can provide for public access. The outcome of the project will be a series of datasets, each containing 1,000 files of related media types selected from .gov domains. We will announce and explore these datasets here on The Signal, and the data will be made available through LC Labs. Although we invite usage and interest from a wide range of digital enthusiasts, we are particularly hoping to interest practitioners and scholars working on digital preservation education and digital scholarship projects.

It has been argued that spreadsheets were the original “killer app” of personal computing. This level of hype may seem out of place when applied to the humble software tools we use to arrange information in rows and columns, be it for work, school, or home; however, the intuitive structure of tabular data and the ease with which it can be disseminated have had a major impact on research as a whole. What other format can so easily organize and allow us to analyze, say, DNA microarray data, a breakdown of domestic airline fares for Q1 2016, and histological information related to Puerto Rican oysters?

Analysis of data from the Library of Congress Web Archives is useful in documenting how various forms of tabular data have become part of digital government publishing. We introduced the file datasets and our methods of creating them in recent posts about our PDF and audio datasets, and are glad to be continuing the series with the release of three CSV/TSV/XLS datasets. A wide range of government agencies are making tabular data available in these formats and many of these files have been archived in the Library of Congress Web Archives. Essentially, all of what we find in these files is—structurally speaking—tabular data. Still, exploring these sample datasets illustrates some of the varied and creative ways that these kinds of files have been used by different federal agencies.

Understanding the 1,000 CSV/TSV/XLS File Datasets

The third installment of our series includes three sets of 1,000 files with tabular data media types randomly selected from archived .gov domain sites. We selected files from media types corresponding to Microsoft Excel files (XLS and XLSX), comma-separated values (CSV), and tab-separated values (TSV). Technical and structural information about the digital files was extracted using Apache Tika (check out the READMEs for more in-depth information). We have taken a look across the three sample datasets and have some initial observations to share with the hope of inspiring further analysis. What follows is a preliminary examination of the 1,000 randomly selected URLs for each media type.

For many, the formats that first jump to mind are XLS and XLSX, the ubiquitous formats used by Excel, the spreadsheet component of the Microsoft Office suite. Over the years, XLS has gained several features that add to its versatility, including support for multiple sheets in a single workbook, formulas that can be applied directly to the data, enhanced formatting capabilities, and the ability to embed different formats and hyperlinks within cells. These features allow for complexity that can make these files challenging from a preservation perspective and, as such, place XLS on the lower end of “Acceptable” in the Library’s Recommended Formats Statement. In contrast, “flat” formats like comma-separated values (CSV) and tab-separated values (TSV) are less problematic in this regard. Both are simple, line-oriented formats that differ only in their choice of delimiter. This means that these files, by virtue of having no bonus features, can be accurately rendered by a wider range of programs.
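The delimiter really is the only mechanical difference between the two flat formats. A minimal Python sketch (the station number echoes the NOAA example later in this post; the other values are made up for illustration):

```python
import csv
import io

# The same record, serialized two ways; only the delimiter differs.
csv_text = "station,date,temp_c\n27995099999,2016-01-01,-7.2\n"
tsv_text = "station\tdate\ttemp_c\n27995099999\t2016-01-01\t-7.2\n"

csv_rows = list(csv.reader(io.StringIO(csv_text), delimiter=","))
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

# Both parse to the identical table.
assert csv_rows == tsv_rows
```

The same parsing code handles either format, which is a big part of why so many tools can read these files accurately.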

Uncompressed, the 3,000 .gov tabular data files comprise approximately 1.4 GB (~1 GB for CSV, ~45 MB for TSV, ~332 MB for XLS). Many of the files lack embedded creation dates, though the harvested date (which comes from the timestamp column in the metadata CSV) provides some temporal information about the files. Like the PDF and audio file datasets, items in this set were harvested during Web crawls conducted over two decades, from 1996 to 2017. The majority of this content, for all three formats, was captured between 2007 and 2017.
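Web-archive capture timestamps are commonly written as 14-digit YYYYMMDDhhmmss strings; assuming the metadata CSV’s timestamp column follows that convention (the READMEs describe the actual layout), the harvested dates are easy to work with in Python:

```python
from datetime import datetime

def parse_harvest_timestamp(ts: str) -> datetime:
    """Parse a 14-digit web-archive timestamp, e.g. '20071103120000'."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

harvested = parse_harvest_timestamp("20071103120000")
```

From there, grouping captures by year is a one-liner with any tabular tool.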

Files Archived from Various Federal Agencies Sites

What government sites feature most prominently in these samples? The table below lists the number of URLs for the top ten domains in each of the three datasets, which offers some helpful context for beginning to explore this content. To illustrate general trends by media type, full URLs have been reduced to their base URLs (e.g., https://www.spc.ncep.noaa.gov becomes www.noaa.gov); the 1,000 unique URLs for each media type thereby shrink to far less headache-inducing numbers.
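The reduction from full URL to base URL can be sketched in a few lines of Python. The keep-the-last-two-labels heuristic here is an assumption that holds for .gov hostnames; general-purpose code should consult the Public Suffix List instead:

```python
from urllib.parse import urlparse

def base_url(url: str) -> str:
    """Collapse a full URL to a base .gov domain, re-prefixed with 'www.'.

    Heuristic: keep the last two labels of the hostname. This works for
    *.gov hosts but is not safe for arbitrary top-level domains.
    """
    host = urlparse(url).hostname or url
    return "www." + ".".join(host.split(".")[-2:])

print(base_url("https://www.spc.ncep.noaa.gov/products/data.csv"))  # www.noaa.gov
```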

Base URL (CSV)           Instances   Base URL (TSV)   Instances   Base URL (XLS)   Instances
www.noaa.gov                   676   www.nrel.gov           382   www.ed.gov             195
www.dot.gov                     61   www.sandia.gov         146   www.ca.gov             108
www.nrel.gov                    60   www.lanl.gov           134   www.nsf.gov             60
www.energy.gov                  38   www.nasa.gov           133   www.dot.gov             45
www.bea.gov                     29   www.anl.gov             32   www.wa.gov              39
www.ca.gov                      23   www.noaa.gov            32   www.dhs.gov             36
www.nersc.gov                   15   www.llnl.gov            30   www.gpo.gov             35
www.federalreserve.gov           8   www.usgs.gov            12   www.bea.gov             32
www.faa.gov                      8   www.mt.gov               4   www.usda.gov            32
www.itdashboard.gov              8   www.loc.gov              2   www.fbi.gov             23

Right away we can observe different usage patterns. Note how the XLS instances are distributed more evenly than the top-heavy consolidation of CSV and TSV base URLs. Sure enough, if we look beyond the above table to include all unique base URLs from the long tail, there are ~102 unique base URLs for XLS, whereas CSV and TSV have ~46 and ~13, respectively. It’s easy to imagine that a very wide range of government agencies are producing Excel spreadsheets, so this isn’t much of a grand reveal; however, it may be illuminating to look across the relatively small number of agencies using CSV and TSV. Let’s dig deeper by examining the agencies that post the highest number of datasets in each format, as well as those that make up the long tail.
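Tallies like the ones above are straightforward to reproduce once the URLs have been normalized. A minimal sketch with made-up values standing in for one media type’s metadata column:

```python
from collections import Counter

# Toy list of already-normalized base URLs; the real values come from the
# 1,000-row metadata CSV that accompanies each dataset.
base_urls = [
    "www.noaa.gov", "www.noaa.gov", "www.noaa.gov",
    "www.dot.gov", "www.dot.gov",
    "www.nrel.gov",
]

counts = Counter(base_urls)
top = counts.most_common(3)    # the "top ten"-style ranking
unique_domains = len(counts)   # size of the long tail
```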

You Can’t Spell “Department of Commerce” Without C…SV?

This chart illustrates CSV files by top-level domain.

The vast majority of CSVs in the dataset come from the National Oceanic and Atmospheric Administration’s website (NOAA.gov). The 676 files from NOAA.gov are more than eleven times the 61 files from the Department of Transportation (DOT.gov). It appears the Library’s Web Archives have accumulated detailed climate data from NOAA weather stations across the globe. For example, one CSV file from the National Centers for Environmental Information includes regular weather updates from station 27995099999 in Samara Bezenchuk, Russia. Additionally, data from the Tides & Currents website appears to update automatically in response to sensor readings across the United States; these reports are then released as CSVs and are now archived in the Web Archives.

They Might Be (Tab-Separated) Datasets

This chart shows TSV files per top-level domain.

Looking at the largest TSV file captured (~0.4 MB), from the National Renewable Energy Laboratory, a thought pops into my mind: “this doesn’t look like a structured research dataset, this looks like an issue from a bug-tracking system!” It turns out to illustrate a common use for tabular data formats that we noted throughout the sample: exporting data from ticketing systems.

Of the 1,000 sample TSVs, ~845 of the URLs include the string “trac,” and these files range in size from 38 KB to 406 KB. There are no instances of “trac” in the XLS sample and only 24 in the CSV sample. In these cases, “trac” indicates that the files are exports of ticket information from Trac, an open-source project management system that integrates with version control systems for software development. Opening up some of the larger Trac tickets gives us a glimpse into various task summaries, user IDs for reporters and owners, prioritization information, and more.
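Finding the Trac exports in a metadata file is a simple substring filter. The `url` column name here is an assumption for illustration; the READMEs document the actual layout:

```python
import csv
import io

# Hypothetical metadata rows standing in for the dataset's metadata CSV.
sample = io.StringIO(
    "url,timestamp\n"
    "https://www.nrel.gov/trac/ticket/42?format=tab,20091105000000\n"
    "https://tidesandcurrents.noaa.gov/data.tsv,20120301000000\n"
)
trac_rows = [row for row in csv.DictReader(sample) if "trac" in row["url"]]
```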

This demonstrates another key aspect of working with files from the Web Archives. Because each of these Trac files represents an individual issue in a ticketing system, the system may output many small files; in contrast, other files in the 1,000-file dataset contain massive numbers of rows. Still, considering that roughly 85% of our TSV sample consists of project tasks, feature requests, bug reports, and software support issues, it seems probable that there is a vast number of tickets within the Web Archives that may provide insights into the world of managing government IT projects.

Excel’s Affordances Make Things More Complicated

This pie chart shows the distribution of Excel files per top-level domain.

There is a range of interesting datasets among the XLS files: for example, student origin data from the New York State Education Department, the fastest-growing occupations in Alaska between 2006 and 2016, and the results of the 2010 Annual Volunteer Survey of Peace Corps participants working on projects in Nicaragua.

Opening up some of the larger files in this sample provides great illustrations of the varied ways individuals use Excel to present and manipulate tabular data.

Our biggest sample file, the BPA Regional Dialogue Generation Schedule Break Out Illustrative Tool (19.5 MB), comes from the Bonneville Power Administration, a nonprofit federal power marketing administration that is part of the Department of Energy. The Illustrative Tool is essentially an interactive software product, contained within a complex workbook, that uses several Excel functions to respond dynamically to user input. For this reason, we see all sorts of embedded documentation, functions, color-coded sections, and so on. It’s unclear whether one could make use of these functions without access to Excel.

This screen capture illustrates an interactive Excel workbook with various options for the user to make changes.

And let’s not forget that a common feature of spreadsheet applications is to extract data dynamically from external sources. For example, the NASA Langley Aero Compass Text Execution Management System has a logo image that pops up before asking you “Connect to database to get/store Table Data?” As one might expect, the link to a server-based backend for this file is not functioning. As a result, the archived file is more like an interface to a Web application than a source of data itself.

How Can We Use (and Get More of) This Information?

When we first opened up the batches of CSV, TSV, and XLS files, we anticipated being presented with a wide range of examples of structured data. Indeed, there is a great variety of datasets in the Web Archives, but we also found some unexpected things. The affordances of tabular data formats and the software used for presentation allow users to produce content that does not exactly fit most definitions of a typical research dataset.

These sample files offer a point of entry from which to explore and study the life and function of these formats in government Web publishing. So dive in and explore! As you do, please share any and all interesting details that you can tease out or surface as you explore these tabular datasets.
