Reflecting On a Year of Selected Datasets

Introduction

The Selected Datasets Collection launched publicly in June 2020 as part of the Library’s ongoing efforts to support emerging data-driven styles of research. Since then, our initial offering of twenty datasets has grown to nearly 200 unique items, and we’ve continued to refine the technical workflows by which content is prepared and delivered to users via loc.gov. In this post, we share how these workflows have allowed the Library to provide access to certain LC-published datasets, and we highlight some of the new items added to the Selected Datasets Collection.

Providing access to LC-published datasets

This past fiscal year, the Dataset Acquisitions Technical Group (DATG) developed a streamlined template for packaging datasets published by the Library of Congress for end-user access via loc.gov.

Library staff can consult this generic structural and technical template when questions come up while processing content, such as: How do I select the right file format for an access derivative? Am I naming the files correctly? Are there associated files that I should package with the dataset? Will my documentation make sense to someone regardless of their experience working with datasets?

For an example of our template in action, let’s unpack one of the datasets from By the People, the Library’s crowdsourced transcription program. By the People now packages all of the volunteer-created transcriptions for retired campaigns as datasets and has published seven to date. The Rosa Parks Papers dataset includes the full text and related metadata in a single CSV (Fig. 1), along with a README file that serves as the main dataset documentation. Both of these files are contained in a ZIP that is named according to the item’s Library of Congress Control Number (LCCN): 2020445590.zip. Library staff used the template to ensure that all of these elements would match the other datasets available on loc.gov. For more information on the use of LC-published derivative datasets, check out this Signal post covering how the Branch Rickey crowdsourced transcription dataset was processed during the University of Michigan School of Information’s Ann Arbor Data Dive.

Fig. 1 – Image of README for Dataset from Rosa Parks Papers (LCCN 2020445590).
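For readers who want to explore a package like this after downloading it, here is a minimal Python sketch that lists the README and summarizes each CSV inside a dataset ZIP. It relies only on what is described above (a ZIP holding a CSV and a README). The function name, the assumption that the README file name begins with "readme", and the UTF-8 encoding are ours for illustration; the README in each package remains the authority on its contents.

```python
import csv
import io
import zipfile


def summarize_package(package):
    """Summarize a dataset ZIP: README names plus header and row count per CSV.

    `package` may be a filesystem path or a binary file-like object.
    """
    summary = {"readmes": [], "csv_files": {}}
    with zipfile.ZipFile(package) as zf:
        for name in zf.namelist():
            base = name.rsplit("/", 1)[-1].lower()
            if base.startswith("readme"):
                # Assumption: the documentation file is named README-something.
                summary["readmes"].append(name)
            elif base.endswith(".csv"):
                with zf.open(name) as raw:
                    # Assumption: UTF-8 text; utf-8-sig also tolerates a BOM.
                    text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                    reader = csv.reader(text)
                    header = next(reader, [])
                    rows = sum(1 for _ in reader)
                summary["csv_files"][name] = {"columns": header, "rows": rows}
    return summary
```

Calling `summarize_package("2020445590.zip")` on a downloaded copy of the Rosa Parks Papers package would return the README name and the CSV's column headers and row count, which is a quick way to check a package before loading the full text into an analysis tool.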

Let’s take a quick look at two other new dataset examples and how the template was used during their processing. The MARC Distribution Services Dataset is an export of MDSConnect, an openly available set of nearly 25 million MARC records that is split into 9 subsets: 1) serials, 2) maps, 3) music, 4) classification, 5) subjects, 6) books all, 7) computer files, 8) name authorities, and 9) visual materials. The records are available in two file formats: UTF-8 and XML. The Dot Gov Datasets are the result of exploratory work conducted by the Web Archiving Team to make the Web Archives more widely accessible and usable. These 7 datasets each contain information related to 1,000 files of related media types (i.e., image, PowerPoint, PDF, audio, and tabular data formats) selected from .gov domains in the Library’s Web Archives. Here’s a post from 2020 that provides more detail.
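As a rough illustration of working with the XML form of the MARC records, the Python sketch below streams through a file and pulls out each record's control number (field 001) and title (field 245, subfield a). It assumes the XML files follow the MARCXML schema with the `http://www.loc.gov/MARC21/slim` namespace; the function name and the sample record are made up for the example and are not taken from the dataset.

```python
import io
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Hypothetical two-record sample, invented for this sketch.
SAMPLE = b"""<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">demo0001</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">A made-up title /</subfield>
    </datafield>
  </record>
  <record>
    <controlfield tag="001">demo0002</controlfield>
  </record>
</collection>"""


def iter_titles(source):
    """Yield (control number, title) pairs, one MARCXML record at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != MARC_NS + "record":
            continue
        control = elem.findtext(f"{MARC_NS}controlfield[@tag='001']")
        title = elem.findtext(
            f"{MARC_NS}datafield[@tag='245']/{MARC_NS}subfield[@code='a']"
        )
        yield control, title
        # Drop the finished record's children so memory use stays modest.
        elem.clear()


for control, title in iter_titles(io.BytesIO(SAMPLE)):
    print(control, title)
```

Streaming with `iterparse` rather than loading the whole file matters here because the subsets together hold nearly 25 million records. A record without a 245 field yields `None` for the title instead of raising an error.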

Other highlights added to the Selected Datasets Collection

Library staff added ten datasets from web archived instances of US Government websites contained in the Library’s Web Archives to the Selected Datasets Collection. Figure 2 shows where the 2009 RECS Survey dataset can be located from within an archived instance of the U.S. Energy Information Administration website (left), and how the files comprising the 2009 RECS Survey dataset have been packaged as a discrete dataset item, described, and made available on loc.gov (right).

Fig. 2 – An archived instance of the U.S. Energy Information Administration website, which includes several links to components of the 2009 RECS Survey data (L), and the related loc.gov item record (LCCN 2020445582) from which users may download the data files in a single package (R).

The Selected Datasets Collection has also continued to add items that have been extracted from external media carriers by staff in the Preservation Services Division (PSD). Figure 3 provides an example of a CD-ROM processed by PSD that contains “Employment and Wages, Annual Averages, 2005,” a publication from the Bureau of Labor Statistics that includes several TSV datasets that aggregate information from the Quarterly Census of Employment and Wages program by State/county and industry. More information regarding datasets on external media is available in this post covering the launch of the Selected Datasets Collection.

Fig. 3 – Example of an external media carrier from which dataset files have been extracted and made available via loc.gov (LCCN 84645713). 

Stay tuned for more updates

We look forward to sharing more dataset developments. If you have questions about datasets, please let us know in the comments.

For those in search of general information about datasets, check out this Research Guide created by Library staff.

Interested in an example of how to perform research using datasets? Check out this Signal post on using a dataset of crowdsourced transcriptions as a tool for open research.
