Finding By the People Transcriptions in the Library’s Digital Collections

Today’s guest post is from Dr. Victoria Van Hyning, who served as a By the People Community Manager at the Library from 2018 to 2020. Starting in Fall 2020, she will be an Assistant Professor of Library Innovation at the University of Maryland iSchool, where she will continue her research on crowdsourcing, outreach, and inclusion.


The Library of Congress launched the By the People (crowd.loc.gov) crowdsourcing project in October 2018. The project invites anyone with an internet connection to transcribe, review, and tag digitized images of manuscripts and typed materials from the Library’s collections. Everyone is welcome to take part! Volunteers don’t even need to create an account, but those who do have access to additional features such as tagging and reviewing other people’s transcriptions.

All transcriptions are created and reviewed by volunteers before they are made available on loc.gov, the Library’s main website and discovery layer. These transcriptions improve search, readability, and access to handwritten and typed documents for those who cannot read the handwriting of the original documents, or who use screen readers.

The By the People team works with a range of technical and curatorial staff across the Library to bring image files and metadata from loc.gov to crowd.loc.gov (where the materials are transcribed and reviewed) and to bring the resulting transcriptions back to loc.gov (where they improve search and access).

By following this link, you’ll get to the list of all By the People transcriptions that have been published on loc.gov. The search results for this link are updated automatically over time as new content is added. As of July 2020, we’ve published 16,315 transcriptions on the Library’s main website for the following collections:

  • Branch Rickey baseball scouting reports
  • Mary Church Terrell
  • Abraham Lincoln
  • Clara Barton
  • Susan B. Anthony
  • William Oland Bourne and the disabled Civil War soldiers

Publish early, acknowledge often

Image: Abraham Lincoln papers, Series 4. Addenda, 1774-1948: Cigar label, “El Biejo Onesto Abe Cigarros”, 1860. 

By the People is designed as a stand-alone website, meaning it is not directly tied to the Library’s main website. It was created here at the Library in 2018, when the LC Labs team joined with Library Services and the Platform Services Division of the Office of the Chief Information Officer to develop the crowdsourcing initiative and its software platform, Concordia. By the People built on an earlier experiment launched by LC Labs in 2017, called Beyond Words, as well as other crowdsourcing investigations. By the People has now moved from experiment to flourishing program, hosted by the Library’s Digital Content Management Section. The underlying code from Concordia is freely available for use and reuse via the Library of Congress’s GitHub.

Changes made to images, metadata or transcriptions on one site are not automatically reflected on the other. This means we need to reintegrate the transcriptions into the Library’s digital collections access systems in bulk. The Library was keen to demonstrate our ability to bring transcriptions back from By the People to the Library’s main website, so just three months after the project launched we published our first batch of crowdsourced transcriptions, consisting of 781 transcriptions of pages in the Abraham Lincoln papers.

We were also committed to prominently acknowledging the work of volunteers on this project. An attribution is included in each searchable, downloadable .txt file, and an overlay appears over the transcription viewer on loc.gov stating “Transcribed and reviewed by volunteers participating in the By the People project at crowd.loc.gov.”

Testing, testing

These 781 Lincoln pages were a pilot for our process and taught us many valuable things. The first was that it was best to bring back only completed items (i.e., a whole diary or letter, rather than a hodgepodge of completed pages from within an item that is still being transcribed and reviewed). Although an argument could be made that more searchable text more quickly would optimize research and access, it turned out that researchers and our volunteers were confused when they only saw transcriptions for a handful of pages within a larger object. Therefore, we adjusted our process on the Concordia application to export only completed items.
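The "completed items only" rule can be illustrated with a minimal sketch. This is not Concordia's actual code: the record layout, the status names, and the `exportable_items` helper are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical page records: (item_id, page_number, status).
# The status names are assumptions for illustration, not Concordia's real values.
PAGES = [
    ("letter-001", 1, "completed"),
    ("letter-001", 2, "completed"),
    ("diary-002", 1, "completed"),
    ("diary-002", 2, "in_review"),
]


def exportable_items(pages):
    """Return the sorted IDs of items in which every page is completed."""
    statuses = defaultdict(list)
    for item_id, _page_number, status in pages:
        statuses[item_id].append(status)
    return sorted(
        item_id
        for item_id, page_statuses in statuses.items()
        if all(status == "completed" for status in page_statuses)
    )


print(exportable_items(PAGES))  # ['letter-001']
```

Here the diary is held back even though one of its pages is finished, which mirrors the decision to avoid publishing a handful of transcribed pages from within a larger object.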

Other iterations involved changing how we display transcriptions from other sources in the Lincoln papers. A new overlay was applied to some 10,000 transcribed pages in the Lincoln papers, created years ago by members of Knox College, thus bringing greater prominence to that collaboration. In early discussions about how to include and phrase attributions, senior staff reflected that this overlay might eventually enable the Library to indicate the provenance of other kinds of text, including Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR).

Whole datasets

In March 2019, we released our second data ingest, featuring the entirety of the Branch Rickey baseball scouting reports, which volunteers had transcribed and reviewed in just four months. Five weeks after the Campaign was completed, these 1,926 pages were published on loc.gov to celebrate Opening Day, while the bulk data was made available in .csv form, along with a README file and some preliminary analysis on labs.loc.gov.

Later this year, this version of the data, along with an updated version (v2), will be released as a government report with its own catalog record, titled “Datasets from Branch Rickey scouting reports.” Additional datasets will be added to the catalog as campaigns are completed. Examples of completed Campaigns include left-handed penmanship entries to a competition run by preacher and publisher William Oland Bourne for disabled Union veterans of the Civil War; the letters, autobiographical fragments, protest coordination notebook and other documents from Rosa Parks’s papers; and the diaries, speeches and letters of suffragists including Susan B. Anthony and Carrie Chapman Catt.

These whole campaign datasets might interest researchers of machine learning, linguistics, history, sociology, and other domains.
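As a rough illustration of how a researcher might start exploring such a dataset, here is a minimal sketch that totals the words transcribed per item. The column names (`item_id`, `page`, `transcription`) and the sample rows are invented for this example; the published .csv files and their README define the real layout.

```python
import csv
import io

# Invented sample rows standing in for a campaign dataset.
# The column names are assumptions, not the published schema.
SAMPLE_CSV = """item_id,page,transcription
report-1,1,Good arm and fast on the bases
report-1,2,Needs work at the plate
report-2,1,Strong hitter
"""


def words_per_item(csv_file):
    """Sum the whitespace-separated word count of each item's pages."""
    totals = {}
    for row in csv.DictReader(csv_file):
        word_count = len(row["transcription"].split())
        totals[row["item_id"]] = totals.get(row["item_id"], 0) + word_count
    return totals


print(words_per_item(io.StringIO(SAMPLE_CSV)))
```

To run the same function against a downloaded file, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) in place of the `io.StringIO` wrapper, after adjusting the column names to match the actual data.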
