Making Scanned Content Accessible Using Full-text Search and OCR

The following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library.

We live in an age of cheap bits: scanning objects en masse has never been easier, storage has never been cheaper and large-scale digitization has become routine for many organizations. This poses an interesting challenge: our capacity to generate scanned images has greatly outstripped our ability to generate the metadata needed to make those items discoverable. Most people use search engines to find the information they need but our terabytes of carefully produced and diligently preserved TIFF files are effectively invisible for text-based search.

The traditional approach to this problem has been to invest in cataloging and transcription but those services are expensive, particularly as flat budgets are devoted to the race to digitize faster than physical media degrades. This is obviously the right call from a preservation perspective but it still leaves us looking for less expensive alternatives.

OCR is the obvious solution for extracting machine-searchable text from an image but the quality rates usually aren’t high enough to offer the text as an alternative to the original item. Fortunately, we can hide OCR errors by using the text to search but displaying the original image to the human reader. This means our search hit rate will be lower than it would be with perfect text, but since the content in question is otherwise completely unsearchable, anything better than no results will be a significant improvement.

Since November 2013, the World Digital Library has offered combined search results similar to what you can see in the screenshot below:

adams080414image1

This system is entirely automated, uses only open-source software and existing server capacity, and provides an easy process to improve results for items as resources allow.

How it Works: From Scan to Web Page

Generating OCR Text

As we receive new items, any item which matches our criteria (currently books, journals and newspapers created after 1800) will automatically be placed in a task queue for processing. Each of our existing servers has a worker process which uses idle capacity to perform OCR and other background tasks. We use the Tesseract OCR engine with the generic training data for each of our supported languages to generate an HTML document using hOCR markup.

The hOCR document has HTML markup identifying each detected word and paragraph and its pixel coordinates within the image. We archive this file for future usage but our system also generates two alternative formats for the rest of our system to use:

  • A plain text version for the search engine, which does not understand HTML markup
  • A JSON file with word coordinates which will be used by a browser to display or highlight parts of an image on our search results page and item viewer
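As a rough illustration of that conversion step, here is a minimal Python sketch which reads hOCR and produces both derived formats. It assumes the common hOCR convention of marking each word with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`. The names `HocrWordParser` and `convert_hocr` and the JSON layout are made up for this example and are not taken from the WDL code.

```python
import json
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect the text and bounding box of every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []     # finished words: {"text": ..., "bbox": [x0, y0, x1, y1]}
        self._bbox = None   # bounding box of the word being read, if any
        self._chunks = []   # text fragments of the word being read
        self._depth = 0     # nesting depth of tags inside the current word

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            self._depth += 1  # e.g. <strong> nested inside a word
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocrx_word" in classes and match:
            self._bbox = [int(value) for value in match.groups()]
            self._chunks = []
            self._depth = 0

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        text = "".join(self._chunks).strip()
        if text:
            self.words.append({"text": text, "bbox": self._bbox})
        self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)


def convert_hocr(hocr_html):
    """Return (plain text for the search engine, JSON word list for the browser)."""
    parser = HocrWordParser()
    parser.feed(hocr_html)
    parser.close()
    plain_text = " ".join(word["text"] for word in parser.words)
    word_json = json.dumps(parser.words, ensure_ascii=False)
    return plain_text, word_json


# Tiny hand-written hOCR fragment used only for this example.
sample = """
<div class='ocr_page' title='bbox 0 0 2000 3000'>
  <p class='ocr_par' title='bbox 30 90 400 120'>
    <span class='ocrx_word' title='bbox 36 92 150 116; x_wconf 93'>VILLAGE</span>
    <span class='ocrx_word' title='bbox 160 92 260 116; x_wconf 88'>FOULA</span>
  </p>
</div>
"""
plain_text, word_json = convert_hocr(sample)
```

The sketch discards paragraph structure and confidence values; a production converter would probably keep at least the page dimensions so the browser can scale coordinates to the displayed image size.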

Indexing the Text for Search

Search has become a commodity service with a number of stable, feature-packed open-source offerings such as Apache Solr, ElasticSearch or Xapian. Conceptually, these work with documents (i.e. complete records) which are used to build an inverted index: essentially a list of words and the documents which contain them. When you search for “whaling” the search engine performs stemming to reduce your term to a base form (e.g. “whale”) so it will match closely-related words, finds the term in the index, and retrieves the list of matching documents. The results are typically sorted by calculating a score for each document based on how frequently the terms are used in that document relative to the entire corpus (see the Lucene scoring guide for the exact details about how term frequency-inverse document frequency (TF-IDF) works).
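To make the inverted index idea concrete, here is a toy Python sketch. It uses a textbook TF-IDF variant rather than Lucene's actual scoring formula, skips stemming entirely, and all function names and sample documents are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into words; a real engine would also stem."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(Counter)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term][doc_id] += 1
    return index


def search(index, total_docs, query):
    """Rank documents by the sum of tf * idf over the query terms."""
    scores = Counter()
    for term in tokenize(query):
        postings = index.get(term)
        if not postings:
            continue  # term does not occur anywhere in the corpus
        # Terms found in fewer documents carry more weight.
        idf = math.log(total_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]


documents = {
    "a": "whale hunting and the whale trade",
    "b": "the village harbour",
    "c": "a whale near the village",
}
index = build_index(documents)
whale_hits = search(index, len(documents), "whale")
```

Here document "a" ranks above "c" for the query "whale" because the term occurs twice in "a" and only once in "c".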

This approach makes traditional metadata-driven search easy: each item has a single document containing all of the available metadata and each search result links to an item-level display. Unfortunately, we need to handle both very large items and page-level results so we can send users directly to the page containing the text they searched for rather than page 1 of a large book. Storing each page as a separate document provides the necessary granularity and avoids document size limits but it breaks the ability to calculate relevancy for the entire item: the score for each page would be calculated separately and it would be impossible to search for multiple words which fall on different pages.

The solution for this final problem is a technique which Solr calls Field Collapsing (the ElasticSearch team has recently completed a similar feature referred to as “aggregation”). This allows us to make a query and specify a field which will be used to group documents before determining relevancy. If we tell Solr to group our results by the item ID the search ranking will be calculated across all of the available pages and the results will contain both the item’s metadata record and any matching OCR pages.

(The django-haystack Solr grouped search backend with Field Collapsing support used on wdl.org has been released into the public domain.)
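For readers who have not used grouping before, a grouped query is an ordinary Solr query with a few extra parameters. The following Python sketch builds such a request URL. The `group`, `group.field`, `group.limit`, `hl` and `hl.fl` parameters are standard Solr options, but the field names `item_id` and `ocr_text`, the core URL, and the function name are assumptions made for this example rather than WDL's actual schema.

```python
from urllib.parse import urlencode


def grouped_search_url(base_url, user_query):
    """Build a Solr select URL which groups page documents by their item."""
    params = {
        "q": user_query,
        "wt": "json",
        "group": "true",            # enable result grouping (Field Collapsing)
        "group.field": "item_id",   # hypothetical field shared by an item's pages
        "group.limit": 3,           # return up to 3 matching documents per item
        "hl": "true",               # ask Solr to highlight matching terms
        "hl.fl": "ocr_text",        # hypothetical field holding the plain OCR text
    }
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr core, for illustration only.
url = grouped_search_url("http://localhost:8983/solr/wdl/select", "village foula")
```

Each group in the response then corresponds to one item, with the matching pages listed inside it.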

Highlighting Results

At this point, we can perform a search and display a nice list of results with a single entry for each item and direct links to specific pages. Unfortunately, the raw OCR text is a simple unstructured stream of text and any OCR glitches will be displayed, as can be seen in this example where the first occurrence of “VILLAGE FOULA” was recognized incorrectly:

adams080414image2

The next step is replacing that messy OCR text with a section of the original image. Our search results list includes all of the information we need except for the locations for each word on the page. We can use our list of word coordinates but this is complicated because the search engine’s language analysis and synonym handling mean that we cannot assume that the word on the page is the same word that was typed into the search box (e.g. a search for “runners” might return a page which mentions “running”).

Here’s what the entire process looks like:

1. The server returns an HTML results page containing all of the text returned by Solr with embedded microdata indicating the item, volume and page numbers for results and the highlighted OCR text:

adams080414image3

2. JavaScript uses the embedded microdata to determine which search results include page-level hits and an AJAX request is made to retrieve the word coordinate lists for every matching page. The word coordinate list is used to build a list of pixel coordinates for every place where one of our search words occurs on the page:

adams080414image7

Now we can find each word highlighted by Solr and locate it in the word coordinates list. Since Solr returned the original word and our word coordinates were generated from the same OCR text which was indexed in Solr, the highlighting code doesn’t need to handle word tenses, capitalization, etc.

3. Since we often find words in multiple places on the same page and we want to display a large, easily readable section of the page rather than just the word, our image slice will always be the full width of the page starting at the top-most result and extending down to include subsequent matches until there is either a sizable gap or the total height is greater than the first third of the page.

Once the image has been loaded, the original text is replaced with the image:

adams080414image4

4. Finally, we add a partially transparent overlay over each highlighted word:

adams080414image5
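Steps 2 and 3 above can be sketched in a few lines. On wdl.org this logic runs as JavaScript in the browser; the version below is Python only to keep the example short, and it is an illustration under stated assumptions rather than the production code. It assumes Solr's default highlighter markup (`<em>` tags), a word list of `{"text", "bbox"}` entries, and a gap threshold of 5% of the page height, which is an arbitrary stand-in for "a sizable gap".

```python
import re

EM_RE = re.compile(r"<em>(.*?)</em>", re.DOTALL)


def find_boxes(snippets, page_words):
    """Return the bounding box of every OCR word which Solr highlighted."""
    wanted = set()
    for snippet in snippets:
        for fragment in EM_RE.findall(snippet):
            wanted.update(fragment.split())
    # Exact comparison is enough: both sides come from the same OCR text.
    return [word["bbox"] for word in page_words if word["text"] in wanted]


def choose_slice(boxes, page_width, page_height, max_gap=None):
    """Pick a full-width slice starting at the top-most match.

    Later matches are included until the vertical gap to the next one
    exceeds max_gap or the slice would be taller than a third of the page.
    """
    if not boxes:
        return None
    if max_gap is None:
        max_gap = page_height * 0.05  # assumed threshold, not from the source
    ordered = sorted(boxes, key=lambda box: box[1])
    top, bottom = ordered[0][1], ordered[0][3]
    for _, y0, _, y1 in ordered[1:]:
        if y0 - bottom > max_gap or y1 - top > page_height / 3:
            break
        bottom = max(bottom, y1)
    return (0, top, page_width, bottom)


# Invented sample data: three matches, the last one far down the page.
page_words = [
    {"text": "VILLAGE", "bbox": [36, 500, 150, 530]},
    {"text": "of", "bbox": [160, 500, 190, 530]},
    {"text": "FOULA", "bbox": [200, 560, 300, 590]},
    {"text": "FOULA", "bbox": [100, 2500, 200, 2530]},
]
snippets = ["the <em>VILLAGE</em> of <em>FOULA</em>"]
boxes = find_boxes(snippets, page_words)
image_slice = choose_slice(boxes, page_width=2000, page_height=3000)
```

With this sample the slice covers the first two matches and stops before the third, which sits far below them. The overlays from step 4 would still be drawn for every box which falls inside the slice.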

Notes

  • The WDL management software records the OCR source and review status for each item. This makes it safe to automatically reprocess items when new versions of our software are released without the chance of inadvertently overwriting OCR text which was provided by a partner or which has been hand-corrected.
  • You might be wondering why the highlighting work is performed on the client side rather than having the server return highlighted images. In addition to reducing server load, this design improves performance because a given image segment can be reused for multiple results on the same page (rounding the coordinates improves the cache hit ratio significantly) and both the image and word coordinates can be cached independently by CDN edge servers rather than requiring a full round-trip back to the server each time.
  • This benefit is most obvious when you open an item and start reading it: the same word coordinates used on the search results page can be reused by the viewer and since the page images don’t have to be customized with search highlighting, they’re likely to be cached on the CDN. If you change your search text while viewing the book, highlighting for the current page will be immediately updated without having to wait for the server to respond.
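The coordinate rounding mentioned above could look something like the following Python sketch, which snaps the top and bottom of a slice outward to a fixed grid so that slightly different searches request the same image segment. The 50 pixel grid and the function name are assumptions chosen for illustration; the source does not say what rounding WDL actually uses.

```python
def snap_slice(y0, y1, page_height, grid=50):
    """Expand a vertical range outward to the nearest grid lines."""
    top = (y0 // grid) * grid                          # round down
    bottom = min(page_height, -(-y1 // grid) * grid)   # round up, clamp to page
    return top, bottom


# Two slightly different matches end up requesting the same segment.
first = snap_slice(500, 590, page_height=3000)
second = snap_slice(512, 577, page_height=3000)
```

Because both calls return the same range, the second request can be served from the browser or CDN cache instead of generating a new image.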

adams080414image6

Challenges & Future Directions

This approach works relatively well but there are a number of areas for improvement:

  • The OCR process described above can be improved considerably. There is plenty of room for technical improvements such as more sophisticated image processing, OCR engine training, and workflow systems incorporating human review and correction.
  • For collections such as WDL’s which include older items, OCR accuracy is reduced by the condition of the materials and by typographic conventions like the long s (ſ) or ligatures which are no longer in common usage. The Early Modern OCR Project is working on this problem and will hopefully provide a solution for many needs.
  • Finally, there’s considerable appeal to crowd-sourcing corrections as demonstrated by the National Library of Australia’s wonderful Trove project and various experimental projects such as the UMD MITH ActiveOCR project.
  • This research area is of benefit to any organization with large digitized collections, particularly projects with an eye towards generic reuse. Ed Summers and I have casually discussed the idea for a simple web application which would display images with the corresponding hOCR with full version control, allowing the review and correction process to be a generic workflow step for many different projects.

3 Comments

  1. John
    August 4, 2014 at 7:54 pm

    We are doing something similar at the University of Auckland, using our b-engine platform and PDFs. The text behind the image in the PDFs is extracted and becomes part of the index for searching. The extracted text is used to create a display of the search term in context. This means users can judge for themselves the accuracy of the OCR.
    The search is carried over into the PDF and search terms are highlighted in the PDF. Opening the KWIC search term display is very fast, opening the PDFs (one per chapter) a bit slower. You can copy and paste sections of text from the PDF, or save a whole chapter or article.
    http://www.nzjh.auckland.ac.nz/
    http://www.thebookshelf.auckland.ac.nz/browse.php

  2. Peter Gray
    August 4, 2014 at 8:49 pm

    Hello. At the National Library of Australia we are working on a new digital delivery system for Books, Manuscripts, Maps, Journals, Ephemera, and other types of content.

    We have developed two web ‘applications’ – one for internal use which ingests scanned material, links the scanned images to bibliographic data, and then sends them through OCR software to capture the structure and content (the structure can be marked out either manually or automatically, depending upon the level of structural detail required). This package of images and METS/ALTO is then delivered via our second application – our digital delivery layer.

    The general public are able to edit the OCR content, view document structure, conduct full text searches, download content, create citations, and also view the content in a full screen mode using the Internet Archive Bookreader and Jpeg2000 technology.

    We are in early stages of development at the moment – we are actually in the middle of user acceptance testing. In a few weeks we will launch the latest version which will integrate further with Trove for a more consistent user experience.

    Here is a link to the current version – http://nla.gov.au/nla.obj-1047813
    (the new version is much more user friendly and less buggy :)

  3. Faith Hamby
    August 5, 2014 at 5:47 am

    As an intern in the Law Library of Congress’s Digital Resources Division, I’ve been working on creating metadata for Statutes at Large. Though I don’t have an MLS, I’ve had over 5 years as a paraprofessional in the library field and an M.A. in Writing. With the rise of technology, I’ve become more and more interested in using technology to preserve our cultural heritage and historical documents so that they’re not only archived but accessible to anyone who wishes to search them. This article was very informative about the new technologies available to help in searchability and findability, especially using OCR, which was buggy at best when it was first released.
