Selected Datasets: A New Library of Congress Collection

Friends, data wranglers, lend me your ears: the Library of Congress’ Selected Datasets Collection is now live! You can now download datasets of the Simple English Wikipedia, the Atlas of Historical County Boundaries, sports economic data, half a million emails from Enron, and urban soil lead abatement from this online collection. This initial set of 20 datasets represents the public start of an ongoing collecting program tied to the Library’s plan to support emerging styles of data-driven research, such as text mining and machine learning. In this post we’ll take a broad look at the work that went into implementing this program, check out some of the acquired datasets, and discuss our plans for continued program development.

Figure 1. About page for the Selected Datasets Collection.

What’s up, DAWG?

Last year, in support of the Library of Congress Digital Collecting Plan’s goal to “Expand Collecting of Appropriate Datasets and Other Large Units of Content,” the Library of Congress Dataset Acquisitions Working Group (DAWG) piloted the acquisition of varied, exemplar datasets to explore the feasibility of a dataset acquisition program at the Library. DAWG focused on accomplishing three critical objectives:

  • define the attributes of a dataset acquisition to consider during scoping and selection processes,
  • establish end-to-end acquisition workflows, and
  • determine how to provide appropriate access to patrons.

These goals required navigating a range of interconnected technical/workflow issues and subject matter questions that called on the expertise of staff across the Library. For more information on these considerations, check out the Library’s updated Supplementary Guidelines for dataset acquisitions that DAWG revised in collaboration with the Collection Development Office.

Datasets in scope

Per the Supplementary Guidelines, “the Library focus[es] on the selective acquisition of datasets in fixed and published forms that are:

  • Within scope under relevant Collections Policy Statements, and
  • Rank high on the following list of criteria:
    • Usefulness in serving the current and future informational needs of Congress and researchers,
    • Unique information provided,
    • Scholarly content,
    • Currency of information, and
    • Risk of loss.”

Prepping content for Selected Datasets

When datasets and their associated materials are acquired, an authentic and accurate copy of the content as received will be packaged using Bagger (a digital records packaging/validation tool based on the BagIt Specification) and stored permanently in the digital repository. For content that is licensed in a way that allows for patrons to download their own copies, the Digital Content Management section creates zipped access files that can be downloaded via individual item records on loc.gov. Here’s an example of how a dataset’s available versions are displayed in a loc.gov item record, and how an unzipped data package generally looks:

Figure 2. (Left) Multiple versions of a dataset are downloadable from loc.gov records as zipped data packages (//www.loc.gov/item/2018487648); (Right) An unzipped data package will generally result in a single directory that is named according to the item’s Library of Congress Control Number (LCCN). This directory will contain the dataset(s), documentation, and any other relevant material (//loc.gov/item/2019205400).
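To make the packaging step above a little more concrete: under the BagIt Specification, a "bag" pairs a data/ payload directory with a manifest of checksums, so a copy can later be checked against what was originally received. The following is a minimal sketch of that idea using only the Python standard library. It is not Bagger and not the Library's actual workflow; the function names and file names are hypothetical and chosen for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, files: dict[str, bytes]) -> None:
    """Write payload files under data/ plus a SHA-256 manifest (BagIt-style layout)."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    # Bag declaration file required by the BagIt Specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    lines = []
    for name, content in sorted(files.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Each manifest line is: checksum, whitespace, path relative to the bag.
        lines.append(f"{digest}  data/{name}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


def validate_bag(bag_dir: Path) -> list[str]:
    """Return payload paths that are missing or no longer match the manifest."""
    problems = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        payload = bag_dir / rel_path
        if not payload.is_file():
            problems.append(rel_path)
            continue
        actual = hashlib.sha256(payload.read_bytes()).hexdigest()
        if actual != expected:
            problems.append(rel_path)
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "example_bag"  # hypothetical bag name
        make_bag(bag, {"readme.txt": b"hello", "table.csv": b"a,b\n1,2\n"})
        print(validate_bag(bag))  # nothing to report right after packaging
        (bag / "data" / "table.csv").write_bytes(b"changed")
        print(validate_bag(bag))  # the altered file is now reported
```

A real bag may carry additional tag files and other checksum algorithms; the sketch only covers the payload manifest check.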

After downloading and unzipping the content, you are able to dive right in!
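If you would rather script that first look, here is a minimal Python sketch that unpacks a zipped data package and lists what is inside. It assumes the zip has already been downloaded from the item record. The stand-in package built in the demo, including its file names, is hypothetical; only the convention of a single top-level directory named for the LCCN comes from the description above.

```python
import tempfile
import zipfile
from pathlib import Path


def unpack_data_package(zip_path: Path, dest_dir: Path) -> list[Path]:
    """Extract a zipped data package and return the extracted files, sorted by path."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return sorted(p for p in dest_dir.rglob("*") if p.is_file())


def top_level_dirs(dest_dir: Path) -> list[str]:
    """Names of the directories directly under dest_dir (usually just the LCCN)."""
    return sorted(p.name for p in dest_dir.iterdir() if p.is_dir())


def build_stand_in_package(zip_path: Path) -> None:
    """Create a small fake package for the demo; contents are made up."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("2019205400/data/example.csv", "id,value\n1,42\n")
        archive.writestr("2019205400/documentation/readme.txt", "placeholder")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        build_stand_in_package(work / "package.zip")
        files = unpack_data_package(work / "package.zip", work / "unzipped")
        print("Top-level directories:", top_level_dirs(work / "unzipped"))
        for path in files:
            print(path.relative_to(work / "unzipped"))
```

To use it on a real download, point `unpack_data_package` at the zip file you saved and at an empty destination directory.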

So what’s in the collection?

The Digital Content Management section (DCM) piloted the acquisition of varied datasets in order to address potential technical and workflow issues. Here’s some info on three of the available items:

  • Rodney Fort’s Sports Business Data Pages is an openly available aggregation of professional sports economics data from the early eighties to present. The downloadable zip includes several XLS files documenting individual leagues, player salaries, and more, with plenty of documentation and reference materials. We will collect a new version of the dataset once a year from the creator.
  • The Atlas of Historical County Boundaries includes GIS and KMZ files that comprehensively document the size, shape, and location of every U.S. county and state (plus D.C.!). The project team provides an abundance of historical and technical context in sidecar PDFs.
  • FMA : a dataset for music analysis includes thousands of CC-licensed mp3s from the Free Music Archive, along with sidecar metadata files containing track-level information (such as ID, title, artist, genres, tags, and play counts), genre IDs, and data generated using librosa and Echonest (music/audio analysis software). The Library also archived a copy of the FMA git repository in order to capture all the code (Jupyter notebooks, scripts) developed for interacting with the dataset and metadata files.

Building on a history of data collecting

This is not the Library’s first effort to collect and provide access to datasets. Technically, the Library has received published data in print volumes for its entire history, but let’s keep our focus on datasets stored in digital form. The Library has also collected a large number of datasets on external media carriers (e.g., CD-ROM, floppy disk). DAWG collaborated with the Preservation Reformatting Division to copy the content from these carriers in order to provide downloadable access to disk images that are free of rights restrictions. For example, check out this Environmental Protection Agency data from 1996 or this NOAA data from 1992.

Figure 3. Two external media items that were imaged by the Preservation Reformatting Division. (Left) //www.loc.gov/item/2006570348/; (Right) //www.loc.gov/item/97128645/.

What’s next for the Library’s dataset collecting program?

The dataset collecting program has a lot of growing to do! DAWG recommended the formation of a standing technical group to address the wide range of issues that may pop up during future dataset acquisitions. This new team is the Dataset Acquisitions Technical Group (DATG), which is coordinated by the Digital Content Management section.

In addition to bringing in new dataset content, here are a few of DATG’s priorities:

  • Refine and further develop MARC protocols for describing datasets,
  • Consult with reference and acquisitions librarians on ways to streamline procedures for scoping and nominating dataset acquisitions, and
  • Spread the word about the program to increase involvement with stakeholders throughout the organization.

That’s all for this announcement! As the program matures, we will continue to post updates here on the Signal. We look forward to seeing the exciting ways that users interact with the data and encourage any/all feedback, so please reach out to let us know what you think!

2 Comments

  1. Diane K
    June 24, 2020 at 5:36 pm

    suggestion: data mining both House & Senate congressional records from say 1917-1950. Non-classified, of course…

  2. Bianca Aguglia
    June 30, 2020 at 7:08 pm

    Thank you. ❤️

    These kinds of datasets are a wonderful gift to data hobbyists and data scientists. Thank you for making them freely available.
