Datasets as Primary Sources: An Archaeological Dig into Our Collective Brains, Part 2

This post was written by Peter DeCraene, a 2021-22 Albert Einstein Distinguished Educator Fellow at the Library of Congress. This is part 2 of an ongoing occasional series.

Part 1 of this series looked at the transcription records of the Rosa Parks Papers. Other datasets from the Library’s Selected Datasets online collection include the diary of Samuel J. Gibson, a Union soldier held in a Confederate prisoner of war camp, and the papers of Susan B. Anthony. Each of these data files contains information about historical people and their views on the world around them at the time. The collection also includes information about topics as diverse as U.S. Geological Survey reports on water use and the Grand Comics Database.

Webpage for Selected Datasets collection

Some of the files in the Selected Datasets collection are easier to access than others, and some are better documented than others. And the collection is always growing. One of the well-documented and quickly accessible datasets is the “Dataset from a picture of subsidized households: 2008.” For computer science students interested in learning to access, clean, and analyze complex data, this dataset provides a wealth of information about the state of public housing in the U.S. at the beginning of the 21st century. It also includes a detailed document describing the information in the data files. Digging into this data would make a great cross-curricular project with social studies students researching the history and current state of public housing.

For teachers and students wishing to bypass some of the more technical aspects of accessing items in the Selected Datasets collection, viewing the Chronicling America collection as a dataset can also yield interesting information using the advanced search features. For example, students might look at the number of newspapers in Virginia with the words “free” and “independent” on their front pages in the years leading up to the Civil War, and compare the number of occurrences to those from the same search of newspapers in California, Alabama, or Ohio. In what contexts are those words used in each state? Performing the search one year at a time from the start of James Buchanan’s presidency in 1857 through the end of the war in 1865 might also reveal some interesting trends. Or, search for those words appearing on the second page and discuss the reasons the words might show up more or less frequently there. Determining search parameters, then analyzing and representing the results, would be a good collaboration between students in math and social studies classes.

Chronicling America advanced search features
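
One way students might organize the year-by-year searches described above is to list every state-and-year combination first, run each search by hand, and then record the number of results. The short Python sketch below illustrates that bookkeeping. The function names are hypothetical, and the counts in the example are placeholders made up for illustration, not real search results.

```python
from itertools import product

STATES = ["Virginia", "California", "Alabama", "Ohio"]
YEARS = range(1857, 1866)  # 1857 through 1865, inclusive


def search_plan(states, years):
    """Return every (state, year) pair that needs its own search."""
    return list(product(states, years))


def yearly_change(counts_by_year):
    """Map each year to its change in count from the previous recorded year."""
    years = sorted(counts_by_year)
    return {
        later: counts_by_year[later] - counts_by_year[earlier]
        for earlier, later in zip(years, years[1:])
    }


plan = search_plan(STATES, YEARS)
print(len(plan), "searches to run")

# Placeholder counts for one state (invented values, NOT real results).
example_counts = {1857: 10, 1858: 14, 1859: 9}
print(yearly_change(example_counts))
```

Listing the searches in advance makes it easier to split the work among students and to notice if a state or year has been skipped.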

Data scientists perform this type of frequency analysis all the time on data gathered from many sources: polling information, website usage, or social media accounts, for example. In addition to the typical questions we ask about primary sources (Who created this item, why was it created, who was the audience?), this type of primary source analysis also raises other questions: What might be missing from the data? How might this data have been used or misused? Would different data representations lead to different interpretations? The connections across school subjects and to current cultural practices make analysis of datasets as primary sources a vital and engaging part of our lessons.
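
To see how a different representation can change an interpretation, students might compare raw counts with rates. The Python sketch below is a minimal illustration under stated assumptions: the state labels, the hit counts, and the page totals are all invented placeholders, and the function names are hypothetical.

```python
def rate_per_thousand(hits, pages_searched):
    """Express a raw hit count as hits per 1,000 pages searched."""
    return 1000 * hits / pages_searched


def rank(values):
    """Return the keys of a {label: number} mapping, largest value first."""
    return sorted(values, key=values.get, reverse=True)


# Placeholder numbers (invented for illustration, NOT real search results).
hits = {"State A": 120, "State B": 45}
pages = {"State A": 6000, "State B": 1500}

rates = {state: rate_per_thousand(hits[state], pages[state]) for state in hits}

print("Ranked by raw count:", rank(hits))
print("Ranked by rate:     ", rank(rates))
```

With these placeholder numbers, State A leads by raw count, but State B leads once the counts are scaled by how many pages were searched. Students could discuss which representation better answers their question, and why.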


  1. Readers may also be interested in Datasets at the Library of Congress: A Research Guide
    Datasets are structured collections of data generally associated with a unique body of work. This guide provides information about various dataset collections and suggests sites and resources for data science and machine learning projects.
