Working at Scale: The Firehose of Data

This is a guest post written by Amanda May, Digital Projects Specialist in the Preservation Services Division. Her work includes managing digital files for the division, recovering data from removable media in Library collections, and providing consultation and services for born-digital collections data.

The Library of Congress is the largest library in the world, with over 173 million cataloged items. Over the past decade, digitization of this enormous collection has grown dramatically. The Preservation Services Division (PSD) is responsible for a huge portion of this effort, managing contracts for the digitization of millions of pages of books, newspapers, and microfilm frames each year. All of this imaging results in a lot of data: hundreds of millions of files that have to be safely shepherded into the digital repository that preserves them forever* and into access systems that allow researchers to use them.

A lot of work happens before digitization, from collation, to pinning and linking, to delivery to the vendor. The vendor delivers the digitized images to PSD on external hard drives. One delivery might consist of two or three hard drives, each containing over 4,000 newspaper issues. Each issue is a “bag” – a directory structured according to the BagIt specification that contains preservation and descriptive metadata about what is contained within. Each page of the original item is represented by a JPEG2000 image and an ALTO XML file containing the optical character recognition (OCR) transcript of the page. The bag also contains a PDF of the entire issue as well as a METS XML file describing the entire issue. An issue could be as small as one page or as large as 1,000 pages.
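For illustration, one issue bag might look something like this on disk. The file names here are hypothetical, but the general shape follows the BagIt specification: checksums in manifest files at the top level, with all of the content under a data/ payload directory.

```
sn12345678_19230104/            # hypothetical bag for one issue
├── bagit.txt                   # BagIt version declaration
├── bag-info.txt                # descriptive and preservation metadata
├── manifest-sha256.txt         # checksums for every payload file
└── data/
    ├── issue.pdf               # PDF of the entire issue
    ├── issue_mets.xml          # METS XML describing the issue
    ├── 0001.jp2                # JPEG2000 image of page 1
    ├── 0001_alto.xml           # ALTO XML (OCR) for page 1
    ├── 0002.jp2
    └── 0002_alto.xml
```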

Caption: A directory on a hard drive containing 1,000 newspaper issues. Screenshot credit: Amanda May

When the hard drives are delivered to PSD, a set of scripts known as “Validation Station” is run on each to ensure that the XML files are valid and to create inventory files that help with the division’s quality review workflows. Then the drives are mounted on two ingest stations – dedicated Linux computers that perform a series of checks before transferring the files to servers for quality review. Each bag goes through three pre-check tasks on the ingest stations:

Inventory: creates a record of the bag in CTS, the digital repository system, along with a record of what the bag metadata file claims is contained inside.

Malware Scan: checks the bag contents for anything that might harm the repository.

Verify: checks the contents of the bag against the manifest, to confirm that everything that is supposed to be there is there, and nothing else (a sketch of this check appears just below).
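The division’s actual tooling is internal, but the Verify step can be sketched with the Library of Congress’s open-source bagit-python library, which recomputes a checksum for every payload file and reports anything missing, altered, or unexpected. The bag path below is hypothetical:

```python
import bagit

def verify_bag(bag_path: str) -> bool:
    """Check a bag's payload against its manifest: every listed file
    must be present with a matching checksum, and nothing extra may
    appear in the payload."""
    bag = bagit.Bag(bag_path)
    try:
        bag.validate(processes=4)  # checksum verification, in parallel
        return True
    except bagit.BagValidationError as err:
        for problem in err.details:  # each missing, extra, or mismatched file
            print(problem)
        return False

verify_bag("/mnt/delivery_drive/sn12345678_19230104")  # hypothetical path
```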

It is not wise to ingest all of the bags in a delivery at once. Only one of these tasks can run at a time on each ingest station, so ingesting a large number of bags creates a huge queue of tasks that can take days to clear. Once the Verify task is complete for a bag, the bag is added to the queue of items to be copied to the Staging servers, and then to the queue to be added to Sampler, the quality review portal. Sampler is used not just by PSD but by staff throughout the whole Library; if PSD tried to add 10,000 bags to Sampler at once, it would block the queue for days for everyone else trying to perform quality review work. To be good neighbors, and to manage the queues on the ingest stations and in Sampler efficiently, PSD ingests about 1,200 bags at a time. Ingesting an entire delivery can take a week or more.
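The batching logic itself is simple. A minimal sketch, with a made-up bag list and a placeholder submit function (the real queueing happens inside CTS and the ingest stations), might look like this:

```python
BATCH_SIZE = 1200  # roughly what PSD ingests at a time

def batches(bags, size=BATCH_SIZE):
    """Yield successive slices of the full bag list."""
    for start in range(0, len(bags), size):
        yield bags[start:start + size]

def submit_to_ingest(batch):
    # Placeholder: in practice this queues the Inventory, Malware Scan,
    # and Verify tasks for each bag on an ingest station.
    print(f"queued {len(batch)} bags")

all_bags = [f"bag_{n:05d}" for n in range(4000)]  # e.g. one hard drive
for batch in batches(all_bags):
    submit_to_ingest(batch)  # in practice, wait for each batch to clear
```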

Caption: Two ways to view the ingest queue. Screenshot credit: Amanda May

PSD staff review the issues in Sampler and track the work in the workflow database. When a bag is accepted in Sampler, the repository system copies all of its files to long-term storage so that they can be safely managed in perpetuity as part of a digital preservation cycle. Because these newspaper issues are still protected by copyright, access to the digitized files is restricted, through the Stacks access portal, to researchers physically located on the Library of Congress campus. The final step of the ingest process is copying the issue-level PDF file to that location. At the time of writing, Stacks contains over 1 million issues of newspapers and journals as well as over 150,000 books. Items that are not under copyright protection are available more widely via the Library’s digital collections website.
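That final copy, like every transfer in the pipeline, is only trustworthy if fixity is confirmed along the way. The actual CTS and Stacks transfer tooling is internal, but a typical fixity-checked copy in Python looks something like this (the paths are hypothetical):

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large files don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_with_fixity(src: Path, dest_dir: Path) -> Path:
    """Copy a file, then confirm the copy's checksum matches the source."""
    dest = dest_dir / src.name
    shutil.copy2(src, dest)
    if sha256_of(src) != sha256_of(dest):
        raise IOError(f"fixity mismatch copying {src} to {dest}")
    return dest

copy_with_fixity(Path("/staging/issue.pdf"), Path("/stacks/pdfs"))  # hypothetical paths
```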

Caption: A CTS item record for a fully-ingested bag, an issue of a newspaper. Three instances of the bag are inventoried – the original copy on the hard drive, the long-term preservation copy, and the PDF that is viewable through Stacks. Screenshot credit: Amanda May

When all the files have been ingested and reviewed, the hard drives are returned to the vendor and the cycle begins again. PSD received 19 deliveries in FY22, so this process was repeated 19 times. This resulted in 138,531 new newspaper issues, 3,260 newly digitized books, and a total of 6,339,789 images being made available to Library of Congress researchers. Managing the firehose of data is an important part of the massive digitization efforts at the Library of Congress.

 
