Datasets as Primary Sources: An Archaeological Dig into Our Collective Brains, Part 1

This post was written by Peter DeCraene, a 2021-22 Albert Einstein Distinguished Educator Fellow at the Library of Congress.

An archeological excavation, 1979. Photo by Eugene Prince.

“Data-driven decision making” is one of those buzz-phrases often heard in recent years, and it seems to carry with it the belief that data is objective and leads to the best decisions. However, data is a primary source that requires close analysis and questioning. Recently, I’ve been exploring datasets as primary sources with Eileen Jakeway Manchester, in the Library’s Digital Innovation Division, and while that experience with a particular dataset was interesting, it also opened up a host of other questions for me. To begin with, these “born digital” primary sources require a better understanding of how they are created and a new set of tools for digging in and understanding them.

Aside from the technical questions about being able to look at a dataset – which might be stored in a variety of different formats – I wondered what we could do once we have access to it. So I started with something familiar, the Library’s Primary Source Analysis Tool. The Analysis Tool’s questions prompt students to make observations, reflect on what they observe, and generate questions about a primary source. Similarly, when working with a dataset, we can take some time to observe what’s in the dataset, reflect on its purpose and importance, and question the kinds of information to be gleaned from it. Eileen and I came up with some additional questions specific to looking at data to prompt student discussion when exploring these items:

  • If there is a README or description page for the dataset, what information does it provide?
  • How is the information in the dataset organized?
  • What relationships might there be between different parts of the dataset?
  • What could the dataset be used for?

As an example, consider the item, “Dataset from Rosa Parks Papers,” which is part of the Library’s Selected Datasets online collection. This example, containing volunteer-created, full-text transcriptions of Parks’ papers and related metadata, has a short download time and consists of files that can be opened and explored with common software (which is not true of all of the Selected Datasets). The simplicity provides an opportunity for students to focus on learning to analyze a dataset without a heavy focus on historical background or math and computer science knowledge.

Downloading and unzipping the dataset will yield two files to look at: a README file in text format, and a .csv (comma-separated values) file that can be opened using widely available spreadsheet software. Present students with a spreadsheet version of the data file first. (This is similar to the strategy of showing a primary source photograph without providing any context to encourage close observation.) Encourage them to observe, reflect on, and question the spreadsheet. Ask questions to prompt conversation as needed:

  • What might the column headings mean, and what kind of information is stored in each column?
  • Several columns all appear to have one entry repeated over and over. Does that value repeat throughout the entire column? Why might this happen?
  • Do any of the columns appear to be related to one another?
  • What do the URLs in the seventh column link to?
  • What appears to be the purpose of this spreadsheet?
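
For classes that are comfortable with a little code, some of these questions can also be explored outside a spreadsheet program. What follows is only a minimal sketch in Python, not a description of how the Library built or uses the dataset. The sample rows and column names (Campaign, Project, Item, Transcription, DownloadUrl) are invented for illustration and are assumptions, not the actual headings in the file. The sketch counts how many distinct values each column holds, which is one way to check whether a column really does repeat a single entry all the way down.

```python
import csv
import io

# Hypothetical sample rows. The column names and values are invented for
# illustration and do not reproduce the actual dataset.
SAMPLE_CSV = """Campaign,Project,Item,Transcription,DownloadUrl
Rosa Parks,Letters,Item 1,Dear friend,https://example.org/1
Rosa Parks,Letters,Item 2,Thank you,https://example.org/2
Rosa Parks,Notes,Item 3,Meeting at noon,https://example.org/3
"""


def summarize_columns(csv_file):
    """Return {column name: number of distinct values} for a CSV file object."""
    reader = csv.DictReader(csv_file)
    distinct = {name: set() for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            distinct[name].add(row[name])
    return {name: len(values) for name, values in distinct.items()}


def constant_columns(summary):
    """List the columns in which one value repeats in every row."""
    return [name for name, count in summary.items() if count == 1]


summary = summarize_columns(io.StringIO(SAMPLE_CSV))
for name, count in summary.items():
    print(f"{name}: {count} distinct value(s)")
print("Columns with a single repeated value:", constant_columns(summary))
```

To try this on a downloaded file instead of the sample text, one option is to pass an opened file, for example open(path, newline="", encoding="utf-8"), where path is wherever the unzipped .csv file was saved. The encoding is an assumption and may need adjusting.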

Next, ask students to look at the README file to provide some context to the data. Encourage them to revise their observations with the new information provided, and reflect on the purpose of the data. Why was this dataset created? How might it be used? What questions come to mind about the dataset?

The Rosa Parks Papers by themselves present a very personal view of a civil rights icon. The dataset built from the papers provides a glimpse at the organization and transcription strategies of the Library, as well as a collection of transcribed papers in one file. Additionally, the spreadsheet itself gives students the opportunity to learn or practice skills with this type of file, such as formatting, filtering, finding, and sorting. Analyzing the spreadsheet and README files together will stretch students’ analysis skills into a new and growing set of born-digital primary sources that document human thinking.
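
The finding and sorting skills mentioned above have close equivalents in code, which may be a useful bridge for students moving from spreadsheets toward computer science. The following is a rough sketch under stated assumptions: the rows and the column names Project, Item, and Transcription are hypothetical stand-ins, and the helper functions are invented for this example. It finds rows whose transcription contains a search word and sorts rows by one column.

```python
import csv
import io

# Hypothetical sample rows, invented for illustration only.
SAMPLE_CSV = """Project,Item,Transcription
Letters,Item 2,Thank you for the ride
Notes,Item 3,Waiting for the bus
Letters,Item 1,The bus was late
"""


def load_rows(csv_file):
    """Read every row of a CSV file object into a list of dictionaries."""
    return list(csv.DictReader(csv_file))


def find_rows(rows, column, text):
    """Keep rows whose value in `column` contains `text`, ignoring case."""
    needle = text.lower()
    return [row for row in rows if needle in row[column].lower()]


def sort_rows(rows, column):
    """Return the rows ordered alphabetically by the value in `column`."""
    return sorted(rows, key=lambda row: row[column])


rows = load_rows(io.StringIO(SAMPLE_CSV))
matches = find_rows(rows, "Transcription", "bus")
by_project = sort_rows(rows, "Project")

print([row["Item"] for row in matches])
print([row["Item"] for row in by_project])
```

Students might compare the results with what a spreadsheet's find and sort commands return for the same rows, and discuss what each approach makes easier or harder to see.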

We will be looking at the challenges and opportunities that arise when considering datasets as primary sources in future blog posts. If you’ve already explored items like this, let us know about your observations, reflections, and questions!

Comments (2)

  1. Thanks for taking a deeper dive into the data and how it can be used with students. Even teachers who consistently integrate primary sources into their daily teaching sometimes get in the “rut” of using only images and/or text documents themselves. Having students gather and analyze the data provided both by the surface and the metadata/historic information provided from multiple sources can truly help students hypothesize and draw more accurate conclusions.

  2. Your readers may be interested in our library guide “Datasets at the Library of Congress: A Research Guide.” https://guides.loc.gov/datasets/introduction

    Datasets are structured collections of data generally associated with a unique body of work. This guide provides information about various dataset collections, and suggests sites and resources for data science and machine learning projects.
