Datasets as Primary Sources, Part II

This post was written by Peter DeCraene, a 2021-22 Albert Einstein Distinguished Educator Fellow at the Library of Congress. This is part 2 of an ongoing occasional series about using datasets as primary sources. We thank Peter and the Teaching with LC team for allowing us to cross-post his writing for Signal Blog readers! 

Part 1 of this series looked at the transcription records of the Rosa Parks Papers. Other datasets from the Library’s Selected Datasets online collection include the diary of Samuel J. Gibson, a Union soldier held in a Confederate prisoner of war camp, and the papers of Susan B. Anthony. Each of these data files contains information about historical people and their views on the world around them at the time. The collection also includes information about topics as diverse as U.S. Geological Survey reports on water use and the Grand Comics Database.

Webpage for Selected Datasets collection

The files in the Selected Datasets collection vary widely and require different approaches to teaching and learning, and the collection is always growing. One well-documented and quickly accessible dataset is the “Dataset from a picture of subsidized households: 2008.” For computer science students interested in learning to access, clean, and analyze complex data, this dataset provides a wealth of information about the state of public housing in the U.S. at the beginning of the 21st century. It also includes a detailed document describing the information in the data files. Digging into this data would make a great cross-curricular project with social studies students researching the history and current state of public housing.

For teachers and students wishing to bypass some of the more technical aspects of accessing items in the Selected Datasets collection, viewing the Chronicling America collection as a dataset can also yield interesting information using the advanced search features. For example, students might look at the number of newspapers in Virginia with the words “free” and “independent” on their front pages in the years leading up to the Civil War, and compare the number of occurrences to those from the same search of newspapers in California, Alabama, or Ohio. In what contexts are those words used in each state? Performing the search one year at a time from the start of James Buchanan’s presidency in 1857 through the end of the war in 1865 might also reveal some interesting trends. Or, search for those words appearing on the second page and discuss the reasons the words might show up more or less frequently there. Determining search parameters, then analyzing and representing the results, would be a good collaboration between students in math and social studies classes.

Chronicling America advanced search features
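
For classes that want to try this kind of counting in code, here is a minimal Python sketch of the tallying step. It assumes students have already gathered page text by hand and labeled each passage with a state (or a year); it does not contact Chronicling America. The function names `count_terms` and `tally_by_group` are hypothetical, and the sample passages are invented placeholders, not real newspaper text.

```python
import re
from collections import Counter


def count_terms(text, terms):
    """Count whole-word, case-insensitive occurrences of each term in text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {term: words[term.lower()] for term in terms}


def tally_by_group(pages, terms):
    """Sum term counts per group label (for example, a state or a year).

    `pages` is a list of (group_label, text) pairs.
    """
    totals = {}
    for group, text in pages:
        group_totals = totals.setdefault(group, Counter())
        group_totals.update(count_terms(text, terms))
    return {group: dict(counts) for group, counts in totals.items()}


if __name__ == "__main__":
    # Invented placeholder passages; replace with text students collect.
    sample_pages = [
        ("Virginia", "A free and independent press serves free citizens."),
        ("Ohio", "The county fair opens on Tuesday."),
        ("Virginia", "An independent voice for the valley."),
    ]
    for state, counts in tally_by_group(sample_pages, ["free", "independent"]).items():
        print(state, counts)
```

Because the match is on whole words, “freedom” is not counted as “free.” Whether it should be is exactly the kind of search-parameter decision students can debate before they compare states or years.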

Data scientists perform this type of frequency analysis all the time on data gathered from many sources: polling information, website usage, or social media accounts, for example. In addition to the typical questions we ask about primary sources (Who created this item? Why was it created? Who was the audience?), this type of primary source analysis also raises other questions: What might be missing from the data? How might this data have been used or misused? Would different data representations lead to different interpretations? The connections across school subjects and to current cultural practices make analysis of datasets as primary sources a vital and engaging part of our lessons.
