Finding By the People Transcriptions in the Library’s Digital Collections

Today’s guest post is from Dr. Victoria Van Hyning, who served as a By the People Community Manager at the Library from 2018 to 2020. Starting in Fall 2020, she will be an Assistant Professor of Library Innovation at the University of Maryland iSchool, where she will continue her research on crowdsourcing, outreach, and inclusion.


The Library of Congress launched the By the People (crowd.loc.gov) crowdsourcing project in October 2018. The project invites anyone with an internet connection to transcribe, review, and tag digitized images of manuscripts and typed materials from the Library’s collections. Everyone is welcome to take part! Volunteers don’t even need to create an account, but those who do have access to additional features such as tagging and reviewing other people’s transcriptions.

All transcriptions are created and reviewed by volunteers before they are made available on loc.gov, the Library’s main website and discovery layer. These transcriptions improve search, readability, and access to handwritten and typed documents for those who cannot read the handwriting of the original documents, or who use screen readers.

The By the People team works with a range of technical and curatorial staff across the Library to bring image files and metadata from loc.gov to crowd.loc.gov (where the materials are transcribed and reviewed) and to bring the resulting transcriptions back to loc.gov (where they improve search and access).

By following this link, you’ll get to the list of all By the People transcriptions that have been published on loc.gov. The search results for this link are updated automatically over time as new content is added. As of July 2020, we’ve published 16,315 transcriptions on the Library’s main website for the following collections:

  • Branch Rickey baseball scouting reports
  • Mary Church Terrell
  • Abraham Lincoln
  • Clara Barton
  • Susan B. Anthony
  • William Oland Bourne and the disabled Civil War soldiers

Publish early, acknowledge often

Image: Abraham Lincoln papers, Series 4. Addenda, 1774-1948: Cigar label, “El Biejo Onesto Abe Cigarros”, 1860. 

By the People is designed as a stand-alone website, meaning it is not directly tied to the Library’s main website. It was created here at the Library in 2018, when the LC Labs team joined with Library Services and the Platform Services Division of the Office of the Chief Information Officer to develop the crowdsourcing initiative and its software platform, Concordia. By the People built on an earlier experiment launched by LC Labs in 2017, called Beyond Words, as well as other crowdsourcing investigations. By the People has now moved from experiment to flourishing program, hosted by the Library’s Digital Content Management Section. The underlying code from Concordia is freely available for use and reuse via the Library of Congress’s GitHub.

Changes made to images, metadata or transcriptions on one site are not automatically reflected on the other. This means we need to reintegrate the transcriptions into the Library’s digital collections access systems in bulk. The Library was keen to demonstrate our ability to bring transcriptions back from By the People to the Library’s main website, so just three months after the project launched we published our first batch of crowdsourced transcriptions, consisting of 781 transcriptions of pages in the Abraham Lincoln papers.

We were also committed to prominently acknowledging the work of volunteers on this project. An attribution is included in each searchable, downloadable .txt file, and an overlay appears over the transcription viewer on loc.gov stating “Transcribed and reviewed by volunteers participating in the By the People project at crowd.loc.gov.”

Testing, testing

These 781 Lincoln pages were a pilot for our process and taught us many valuable things. The first was that it was best to bring back only completed items (i.e., a whole diary or letter, rather than a hodgepodge of completed pages from within an item that is still being transcribed and reviewed). Although an argument could be made that more searchable text more quickly would optimize research and access, it turned out that researchers and our volunteers were confused when they only saw transcriptions for a handful of pages within a larger object. Therefore, we adjusted our process on the Concordia application to export only completed items.
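
For readers who think in code, the "completed items only" rule can be sketched roughly as follows. This is an illustration, not Concordia's actual export code: the record layout, the item ids, and the status names ("completed", "in_review") are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical status label; Concordia's real status values may differ.
COMPLETED = "completed"


def completed_items(pages):
    """Return the sorted ids of items whose pages are ALL completed.

    `pages` is an iterable of (item_id, page_number, status) tuples,
    a made-up layout used only for this sketch.
    """
    statuses = defaultdict(list)
    for item_id, _page_number, status in pages:
        statuses[item_id].append(status)
    # An item qualifies only if every one of its pages is completed.
    return sorted(
        item_id
        for item_id, page_statuses in statuses.items()
        if all(status == COMPLETED for status in page_statuses)
    )


# Example data: a finished two-page letter and a diary still in review.
pages = [
    ("letter-1", 1, "completed"),
    ("letter-1", 2, "completed"),
    ("diary-7", 1, "completed"),
    ("diary-7", 2, "in_review"),
]
print(completed_items(pages))  # ['letter-1']
```

Here the diary is held back even though one of its pages is finished, which mirrors the choice described above to avoid publishing a handful of pages from a larger object.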

Other iterations involved changing how we display transcriptions from other sources in the Lincoln papers. A new overlay was applied to some 10,000 transcribed pages in the Lincoln papers, created years ago by members of Knox College, thus bringing greater prominence to that collaboration. In early discussions about how to include and phrase attributions, senior staff reflected that this overlay might eventually enable the Library to indicate the provenance of other kinds of text, including Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR).

Whole datasets

In March 2019, we released our second data ingest, featuring the entirety of the Branch Rickey baseball scouting reports, which volunteers had transcribed and reviewed in just four months. Five weeks after the Campaign was completed, these 1,926 pages were published on loc.gov to celebrate Opening Day, while the bulk data was made available in .csv form, along with a README file and some preliminary analysis on labs.loc.gov.

Later this year, this version of the data, along with an updated version (v2), will be released as a government report with its own catalog record, titled “Datasets from Branch Rickey scouting reports.” Additional datasets will be added to the catalog as campaigns are completed. Examples of completed Campaigns include left-handed penmanship entries to a competition run by preacher and publisher William Oland Bourne for disabled Union veterans of the Civil War; the letters, autobiographical fragments, protest coordination notebook and other documents from Rosa Parks’s papers; and the diaries, speeches and letters of suffragists including Susan B. Anthony and Carrie Chapman Catt.

These whole campaign datasets might interest researchers of machine learning, linguistics, history, sociology, and other domains.
