Today’s guest post is from Lauren Algee, a Senior Digital Collections Specialist & By the People community manager at the Library of Congress.
The Library of Congress launched the By the People crowdsourced transcription program in 2018. Since then, we have invited anyone to volunteer by transcribing Library of Congress digital collections through our online platform, Concordia. Completed transcriptions go back into Library of Congress digital collections on loc.gov to make them keyword searchable and improve accessibility. We also publish transcriptions in bulk as open datasets. By the People transcription campaigns have always included typed and printed text – typed scouting reports of baseball great Branch Rickey were one of our first campaigns! Our team is often asked why we have volunteers transcribe print and typed collections instead of using OCR (Optical Character Recognition).
To OCR or not to OCR?
Optical Character Recognition is a software tool that extracts text from images and is widely used by libraries to create searchable and accessible text for digitized books and print materials. However, not all print or typed document collections are good candidates for applying OCR at scale.
Some print materials – such as those with dark, light, or blurry text; text with ornate or unusual fonts; and “noisy” pages (like those with ink bleed-through or other marks on the page) – do not produce accurate text through OCR. The Library is constantly improving its workflows (see our colleagues’ recent blog post on newspaper OCR enhancement), but many material types remain challenging. Additionally, many Library of Congress collections are a mix of both handwritten and printed pages. These collections don’t make good candidates for batch OCR processing, since the majority of the collection (i.e., the handwritten items) would produce poor results.

While many of our collections are not ideal for a mass digitization OCR workflow, we realized that a page-level OCR tool within Concordia could aid volunteers in transcribing some items. Since the launch of By the People in 2018, we have consistently seen that some of the hardest pages for volunteers to finish are newspaper clippings, book proofs, or other lengthy print materials within manuscript collections – items that are good candidates for OCR.
We also heard from volunteers who were using OCR on their own by running By the People images through a third-party tool and then pasting the resulting text back into Concordia. These volunteers reported that even if OCR only worked well for part of a page, such as a letterhead, it could still be a useful starting point for transcription. They found that this page-level OCR made transcription less tedious. We wanted to help volunteers use their time and skills efficiently and leverage technology to make the volunteer experience better.

Incorporating OCR into a transcription workflow
After several months of development, the Concordia team released a first version of the OCR tool in July 2023. The new “Transcribe with OCR” button allows registered volunteers to determine when and if to use OCR. The button appears under the image viewer on Not Started and In Progress pages when a volunteer is registered and logged into their account.
Users can learn more about the tool by clicking the question mark next to the button. When you click the OCR button, you are prompted to select the language of the page. It usually takes just a few seconds for the OCR text to generate. The OCR text can then be edited just like any other transcription.
We emphasize to volunteers that OCR is imperfect and all text should be reviewed thoroughly before it is submitted. If you notice a lot of errors and would prefer to transcribe as usual, you can always delete or click “Undo” to remove the OCR text. Otherwise, the transcription process proceeds as it always does – once a volunteer decides the transcription is accurate and complete they submit it for final review.
A tested method
We have refined the OCR feature over time based on user testing and other evaluation. Initially, pages where OCR was used were labelled “This transcription was started with OCR.” In user testing we found that because this text appeared on OCR-ed pages that had been edited, some volunteers believed that the OCR tool was much more accurate than it actually is. We removed the notice to eliminate any confusion.
We also observed that anonymous users were more likely than registered users to use OCR on pages where it was not suitable or to submit OCR text without editing, so we decided that the OCR option should only be available to registered users. Another update we introduced is the ability for By the People Community Managers to turn the OCR tool completely off for certain campaigns. This helps volunteers avoid using the OCR feature in campaigns that feature very little printed material, like the papers of Joseph Holt and James Garfield.
How are volunteers using the OCR tool?
We wanted to know if OCR could help streamline the transcription process. Our initial data suggest that OCR-ed pages take fewer passes of transcription to complete on average. For example, in our American Federation of Labor Records campaign, which is primarily typed letters, the overall average of transcription actions (saves, submits, edits) per page was 2.68 actions, whereas the average for pages with OCR text was 1.38 actions.
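For readers who want to run a similar comparison on their own data, here is a minimal sketch of how an "actions per page" average could be computed. The log format, the page IDs, and the function names are hypothetical and made up for illustration; they are not Concordia's actual data model. Only the two averages passed to `relative_reduction` at the end (2.68 and 1.38) come from the campaign figures above.

```python
from collections import Counter
from statistics import mean

# Hypothetical action log: (page_id, action, page_was_started_with_ocr).
# These rows are invented sample data, not real By the People records.
ACTIONS = [
    ("page-1", "save", False),
    ("page-1", "submit", False),
    ("page-1", "edit", False),
    ("page-2", "submit", True),
    ("page-3", "save", True),
    ("page-3", "submit", True),
    ("page-4", "save", False),
    ("page-4", "save", False),
    ("page-4", "edit", False),
    ("page-4", "submit", False),
]


def average_actions_per_page(actions, ocr_only=False):
    """Mean number of logged actions per page.

    With ocr_only=False every page is counted (the "overall" average);
    with ocr_only=True only pages started with OCR are counted.
    """
    per_page = Counter(
        page_id
        for page_id, _action, used_ocr in actions
        if used_ocr or not ocr_only
    )
    return mean(per_page.values())


def relative_reduction(overall_avg, ocr_avg):
    """Fraction by which the OCR average falls below the overall average."""
    return (overall_avg - ocr_avg) / overall_avg


if __name__ == "__main__":
    print(average_actions_per_page(ACTIONS))                 # sample data, all pages
    print(average_actions_per_page(ACTIONS, ocr_only=True))  # sample data, OCR pages
    # Averages reported for the American Federation of Labor Records campaign:
    print(f"{relative_reduction(2.68, 1.38):.1%}")
```

Applied to the reported averages, the last line works out to roughly 48.5 percent fewer actions per page on OCR-ed pages. That figure compares OCR pages against the overall average (which itself includes the OCR pages), so it describes this one campaign rather than a general rate.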
So, does the OCR tool make transcription more efficient? It depends! The process, like computers and humans, is imperfect. Sometimes volunteers try the OCR feature on pages that aren’t good candidates for the tool and it produces mixed results. But we have also heard from many volunteers who say they find the tool useful and that it makes their experience more efficient and more enjoyable, which is our ultimate goal. We’ll continue gathering volunteer feedback and evaluating it alongside the transcription data.
If you’re a registered user – try it out! If not, sign up here to access OCR, as well as the ability to review, view a profile page to track your work, and more. Then let us know what you think of the By the People OCR tool – we’d love to hear from you!
Comments
Hello
I have been registered with the citizen archivist program for a while, and while my time is very limited, those few times and few entries that I have made into the system have provided me with this sense of joy and connecting to history that I have never felt before. Thank you