The following is a guest post by Carl Fleischhauer, a Digital Initiatives Project Manager in NDIIPP.
Yesterday, I blogged about the digital reformatting of historical books and other documents. I reported that virtually all digitization projects in memory institutions present the information from the pages in the form of searchable text. I also noted the variation in the types of images that are typically employed to represent the book as an artifact or, at minimum, to compensate for the inaccuracies in the automated OCR transcriptions of the text by presenting a facsimile image of each page.
Although not typical of memory institution reformatting programs, it is worth spotlighting a special form of scanning, perhaps the maximal expression of artifactual imaging, that documents the physical structure of the inks, the paper or parchment, and other aspects of the original. This form of imaging is intended to support scientific study and the careful work of object conservation. As an example, the Library of Congress Preservation Directorate has carried out a “hyperspectral” look at the historic Waldseemueller map. (See also the team’s discussion of the technology.)
Another example of scientific imaging is from the Archimedes Palimpsest project, an examination of a medieval manuscript on parchment on deposit at The Walters Art Museum in Baltimore. Unlike paper, parchment is sufficiently durable that you can take a knife, scrape off the text, and then overwrite it with a new text. The pages in the Archimedes Palimpsest came from five older books that medieval scribes had taken apart, scraped, and reinscribed to rebind as a prayerbook. The science team at the Walters used multispectral imaging to see the hidden writing (and diagrams) under the new text.
Meanwhile, at the informational end of the spectrum, we have a multiyear project to scan the Copyright Office card catalog, where the words on the card are of paramount importance. The 45 million cards in this catalog provide an index to copyright registrations and transfers of ownership in the United States from 1870 to 1977, offering a window into the literary, musical, artistic, and scientific production of the United States and foreign countries. These cards are an important supplement to the Library’s main catalog because only some of the works deposited for copyright are selected for inclusion in the Library’s collections and we do not always fully catalog the works we select.
The copyright card scanning project started a little over a year ago, and relatively high-quality uncompressed master files have been produced. The planners are certain that very good OCR results can be obtained from these images. Meanwhile, some in the planning group feel that equally good results could also be obtained using a lossy compressed variant of the JPEG 2000 format, and there has been some informal discussion of producing future master files in this image format.
Why do we care about these varying imaging specifications? They are all on the docket for the Still Image Working Group in the Federal Agencies Digitization Guidelines Initiative, in which the Library is a key player. The Working Group continues to refine its guidelines for still-image scanning, recognizing that recommended imaging specifications will vary according to a given project’s objectives. For books (and other textual materials), every project seeks to capture informational content. But as reported above, there is considerable variation in the degree of importance assigned to artifactual values, which accounts for the wide range of image types: from something that looks like a Xerox copy to something that looks like an art museum poster to scientific representations of a page’s microstructure.
The Working Group guidelines will also cover pictorial materials, notably photographs. The group’s consideration of the imaging of photographic negatives and transparencies distinguishes between informational and artifactual in a slightly different way. That will be the subject of my next blog.