
black and white photograph of woman working at a typewriter

Volunteers Leverage OCR to Transcribe Library of Congress Digital Collections


Today’s guest post is from Lauren Algee, a Senior Digital Collections Specialist & By the People community manager at the Library of Congress.


The Library of Congress launched the By the People crowdsourced transcription program in 2018. Since then, we have invited anyone to volunteer by transcribing Library of Congress digital collections through our online platform, Concordia. Completed transcriptions go back into Library of Congress digital collections on loc.gov to make them keyword searchable and improve accessibility. We also publish transcriptions in bulk as open datasets. By the People transcription campaigns have always included typed and printed text – typed scouting reports of baseball great Branch Rickey were one of our first campaigns! Our team is often asked why we have volunteers transcribe printed and typed collections instead of using OCR (Optical Character Recognition).

To OCR or not to OCR? 

Optical Character Recognition is a software tool that extracts text from images and is widely used by libraries to create searchable and accessible text for digitized books and print materials. However, not all print or typed document collections are good candidates for applying OCR at scale.

Some print materials – such as those with dark, light, or blurry text; text with ornate or unusual fonts; and “noisy” pages (like those with ink bleed-through or other marks on the page) – do not produce accurate text through OCR. The Library is constantly improving its workflows (see our colleagues’ recent blogpost on newspaper OCR enhancement), but many material types remain challenging. Additionally, many Library of Congress collections are a mix of both handwritten and printed pages. These collections don’t make good candidates for batch OCR processing, since the majority of the collection (i.e., the handwritten items) would produce poor results.

MEMORANDUM OF GAME BETWEEN SANTURCE AND PONCE AT SAN JUAN, PUERTO RICO, ON JANUARY 25, 1955 CLEMENTE played centerfield for Santurce. I would guess him to be at least 6' tall, weight about 175 pounds, right hand hitter, very young. I have been told very often from many sources about his running speed. I was sorely disappointed with it. His running form is bad, definitely bad, and based upon what I saw tonight, he has only a bit above average major league running speed. He has a beautiful throwing arm. He throws the ball down and it really goes places. However, he runs with the ball every time he makes a throw and that's bad. He has no adventure whatever on the bases, takes a comparatively small lead, and doesn't have in mind, apparently, getting a break. I can imagine that he has never stolen a base in his life with his skill or cleverness. I can that it if was done it was because he was pushed off. His form at the plate is perfect. The bat is out and back and in good position to give him power. There is not the slightest hitch of movement in his hands or arms and the big end of the bat is completely quiet when the ball leaves the pitcher’s hand. His sweep is level - very level. His stride is short and his stance is good to start with and he finishes good with his body. I know of no reason why he should not become a very fine hitter. I would not class him, however, as even a prospective home run hitter. I do not believe he can possibly do a major league club any good in 1955. It is just too bad that he could not have had his first year in a Class B or C league and then this year he might have profited greatly with a second year as a regular say in Class A. In 1956 he can be sent out on option by Pittsburgh only by first securing waivers, and waivers likely cannot be secured. So, we are stuck with him, - stuck indeed, until such time as he can really help a major league club. The most disappointing feature about CLEMENTE is his lack of adventure, - of chance taking. 
He had at least two chances tonight to make a good play. He simply waited for the bounce. I hope he looks better to me tomorrow night when Santurce plays San Juan, - the final game of the regular season and the city championship of San Juan is at stake. Perhaps this boy will put out in that game. Transcribed and reviewed by volunteers participating in the By The People project at crowd.loc.gov.
A typed January 25, 1955 Branch Rickey scouting report which would not produce accurate OCR. This report concerns Roberto Clemente, who would go on to amass 3,000 hits in a Hall-of-Fame career for the Pittsburgh Pirates.

While many of our collections are not ideal for a mass digitization OCR workflow, we realized that a page-level OCR tool within Concordia could aid volunteers in transcribing some items. Since the launch of By the People in 2018, we have consistently seen that some of the hardest pages for volunteers to finish are newspaper clippings, book proofs, or other lengthy print materials within manuscript collections – items that are good candidates for OCR.

We also heard from volunteers who were using OCR on their own by running By the People images through a third-party tool and then pasting the resulting text back into Concordia. These volunteers reported that even if OCR only worked well for part of a page, such as a letterhead, it could still be a useful starting point for transcription. They found that this page-level OCR made transcription less tedious. We wanted to help volunteers use their time and skills efficiently and leverage technology to make the volunteer experience better.

Screenshot of By the People transcription page with the first step of the By the People OCR workflow activated. Orange arrow points to either "cancel" or "yes, select language."
The first step of the By the People OCR workflow, initiated by the “Transcribe with OCR” button beneath the image viewer, informs users that existing text will be replaced.

Incorporating OCR into a transcription workflow 

After several months of development, the Concordia team released a first version of the OCR tool in July 2023. The new “Transcribe with OCR” button allows registered volunteers to determine when and if to use OCR. The button appears under the image viewer on Not Started and In Progress pages when a volunteer is registered and logged into their account.

Users can learn more about the tool by clicking the question mark next to the button. When you click the OCR button, you are prompted to select the language of the page. It usually takes just a few seconds for the OCR text to generate. The OCR text can then be edited just like any other transcription.

We emphasize to volunteers that OCR is imperfect and all text should be reviewed thoroughly before it is submitted. If you notice a lot of errors and would prefer to transcribe as usual, you can always delete or click “Undo” to remove the OCR text. Otherwise, the transcription process proceeds as it always does – once a volunteer decides the transcription is accurate and complete they submit it for final review.

A tested method

We have refined the OCR feature over time based on user testing and other evaluation. Initially, pages where OCR was used were labelled “This transcription was started with OCR.” In user testing we found that because this text appeared on OCR-ed pages that had been edited, some volunteers believed that the OCR tool was much more accurate than it actually is. We removed the notice to eliminate any confusion.

We also observed that anonymous users were more likely than registered users to use OCR on pages where it was not suitable, or to submit OCR text without editing, so we decided that the OCR option should only be available to registered users. Another update we introduced is the ability for By the People Community Managers to turn the OCR tool off entirely for certain campaigns. This helps volunteers avoid using the OCR feature in campaigns that feature very little printed material, like the papers of Joseph Holt and James Garfield.
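
The availability rules described in this post (registered and logged-in users only, Not Started and In Progress pages only, and a per-campaign off switch) can be summarized in a short sketch. This is an illustration in Python, not Concordia's actual code: the function name, parameter names, and status strings are assumptions drawn from the wording of this post.

```python
# Hypothetical status labels taken from the post's wording;
# Concordia's internal values may differ.
OCR_ELIGIBLE_STATUSES = {"Not Started", "In Progress"}


def ocr_button_visible(
    is_logged_in: bool,
    page_status: str,
    campaign_ocr_enabled: bool = True,
) -> bool:
    """Return True when the "Transcribe with OCR" button should be shown.

    All three conditions described in the post must hold:
    - the volunteer is registered and logged in,
    - Community Managers have not turned OCR off for the campaign,
    - the page is still open for transcription.
    """
    return (
        is_logged_in
        and campaign_ocr_enabled
        and page_status in OCR_ELIGIBLE_STATUSES
    )
```

Under these assumptions, a logged-in volunteer on an In Progress page sees the button, while an anonymous visitor, or anyone working in a campaign where OCR has been switched off, does not.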

How are volunteers using the OCR tool? 

We wanted to know if OCR could help streamline the transcription process. Our initial data suggest that OCR-ed pages take fewer passes of transcription to complete on average. For example, in our American Federation of Labor Records campaign, which is primarily typed letters, the overall average of transcription actions (saves, submits, edits) per page was 2.68 actions, whereas the average for pages with OCR text was 1.38 actions. 
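
To make the comparison concrete, here is a minimal sketch of how such averages could be computed. The page records below are invented for illustration and are not By the People data; only the 2.68 and 1.38 figures come from the campaign described above, and the function names are hypothetical.

```python
from statistics import mean

# Invented per-page action counts (saves, submits, edits).
# These are NOT real By the People data.
pages = [
    {"used_ocr": True, "actions": 1},
    {"used_ocr": True, "actions": 2},
    {"used_ocr": False, "actions": 3},
    {"used_ocr": False, "actions": 4},
]


def average_actions(pages, used_ocr=None):
    """Mean transcription actions per page.

    used_ocr=None averages over every page (the "overall" figure);
    used_ocr=True or False restricts the average to that subset.
    """
    counts = [
        p["actions"]
        for p in pages
        if used_ocr is None or p["used_ocr"] == used_ocr
    ]
    return mean(counts)


def relative_reduction(baseline, value):
    """Fraction by which `value` falls below `baseline`."""
    return (baseline - value) / baseline


overall_avg = average_actions(pages)             # all pages
ocr_avg = average_actions(pages, used_ocr=True)  # OCR-started pages only

# Applying the same arithmetic to the figures reported in the post:
reported_reduction = relative_reduction(2.68, 1.38)
print(f"{reported_reduction:.1%} fewer actions on OCR pages")
```

With the reported figures, pages with OCR text needed roughly half as many actions as the campaign-wide average. Note that the overall average includes the OCR pages themselves, so this is a comparison against the whole campaign rather than against non-OCR pages only.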

So, does the OCR tool make transcription more efficient? It depends! The process, like computers and humans, is imperfect. Sometimes volunteers try the OCR feature on pages that aren’t good candidates for the tool and it produces mixed results. But we have also heard from many volunteers who say they find the tool useful and that it makes their experience more efficient and more enjoyable, which is our ultimate goal. We’ll continue gathering volunteer feedback and evaluating it alongside the transcription data.

If you’re a registered user – try it out! If not, sign up here to access OCR, as well as the ability to review, view a profile page to track your work, and more. Then let us know what you think of the By the People OCR tool – we’d love to hear from you! 

Comments

  1. Hello

    I have been registered with the citizen archivist program for a while, and while my time is very limited, those few times and few entries that I have made into the system have provided me with this sense of joy and connecting to history that I have never felt before. Thank you
