Information or Artifact: Digitizing a Book, Part 1

The following is a guest post by Carl Fleischhauer, a Digital Initiatives Project Manager in NDIIPP.

How do you reproduce a book in digital form?  This may seem like a simple question until you pick up a book and page through it.  You may be struck by “how” in the methodological sense, knowing you need to scan the, say, two hundred pages and, often, not wishing to cut the book into pieces to do so.  This need has led to the development of a number of marvelous mechanical devices including the book scanner known as SCRIBE from the Internet Archive, and a scanner that turns the pages as it makes the pictures.

But there is another how, one that we first explored at the Library nearly twenty years ago, at a 1992 workshop on electronic texts. This was in the early years of American Memory, when our content was pressed onto CDs, before the arrival of Mosaic, the first popular Web browser in the U.S. We convened a group of experts to sort out the degree to which a digitized book should be a searchable text or a set of facsimile images.  This cross-discipline discussion was a bit of a first, I think, with the searchable (and marked-up) text advocates representing academia and the Text Encoding Initiative while the image side was repped by library preservation specialists starting their transition from microfilm to digital reproduction.

This is a bit of an oversimplification, but the searchable text folks were drawn to the words on the page without a strong passion for the book as a physical object.  To borrow a term from audiovisual archiving, a book and its paper pages were seen as a carrier for information.  The important thing for them was to get at the text, which could then be elaborated upon by editorial comments, variant readings from other editions (think of the many printings of Shakespeare), and the like, all carefully set off by the symbols of markup language.

Figure 1. Sample of marked-up text following Text Encoding Initiative guidelines from the Inscriptions of Roman Tripolitania Project

Meanwhile, the imaging folks brought their microfilm habits forward but with some added nuances.  They reminded us of the importance of the bookness of a book, its value as an artifact and not just as a carrier.  In the realm of microfilm, this artifactuality had to be presented in virtual form: a series of microphotographs (generally black and white) that lacked the heft and presence of a real book.  In the digital realm, by contrast, the images could even be produced in a manner that would permit the printing out of a paper reproduction of the book, i.e., the creation of a physical replica.

During the 1990s, most libraries that digitized books proceeded in a dual mode, presenting online books as combinations of page image sets and searchable texts.  One trail-blazing project from the period is the Making of America at the University of Michigan.  The MOA online presentation permits an easy switch back and forth between page and text.  Their format helped establish what has become the typical pattern: the page images are presented first and you dig down to get to the text.  Years later, the Internet Archive book project and Google Books do more or less the same thing, albeit with some new elaborations.

Figure 2. Image of a title page from the Making of America at the University of Michigan. If you change the pulldown at top from image to text, the display presents you with the OCR-converted and marked-up text.

In part, the dual-mode presentation reflects an interest in conveying both the informational and artifactual aspects of the original book.  But the real driver has been another fact of digital life: the high cost of producing accurate and well-marked-up versions of the text.  The founders of the Text Encoding Initiative were connected to an academic world in which careful rendering and parsing of the text were de rigueur (lots of human editing).  The librarians, the Internet Archive, and Google Books oversee massive book scanning efforts, and they employ efficient and cheap Optical Character Recognition (OCR) to produce the searchable texts.  OCR provides good but not great accuracy: you can figure on one to four typographic errors per page, or more.  This imperfect “full text” (as the Internet Archive calls it) becomes a great resource for indexing and searching while the page images serve as the authoritative representation of the printed text.  This is an affordable outcome.
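
For readers who think in code, the dual-mode arrangement can be sketched roughly as follows.  This is only an illustration in Python, not a description of how Making of America, the Internet Archive, or Google Books actually build their systems; the names `Page`, `DigitizedBook`, and `search`, along with the sample file names and texts, are all hypothetical.  The point is simply that the imperfect OCR text is what gets searched, while every hit leads back to a page image.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    number: int
    image_file: str  # hypothetical file name for the scanned page image
    ocr_text: str    # uncorrected OCR output; may contain errors


@dataclass
class DigitizedBook:
    title: str
    pages: List[Page] = field(default_factory=list)

    def search(self, term: str) -> List[Page]:
        """Return pages whose OCR text contains the term (case-insensitive)."""
        needle = term.lower()
        return [p for p in self.pages if needle in p.ocr_text.lower()]


# Invented sample data: page 1 carries a typical OCR slip ("m" read as "rn").
book = DigitizedBook(
    title="Sample volume",
    pages=[
        Page(1, "page_0001.tif", "The Making of Arnerica"),
        Page(2, "page_0002.tif", "A chapter about America and its libraries"),
    ],
)

hits = book.search("america")
hit_images = [p.image_file for p in hits]
print(hit_images)  # ['page_0002.tif']
```

In this made-up example the search finds page 2 but misses page 1 because of the OCR slip.  That is the trade-off described above: the text is good enough to get a reader into the book, and the page image remains the thing to trust.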

The page image specifications in these dual-mode presentations have turned up in a number of variations.  In the 1990s, some book scanning projects envisaged printing back to paper, binding the new set of pages, and returning a fresh copy of an old book to the shelf.  These projects focused on the, um, routine imprints of the nineteenth and early twentieth century.  Their imaging approach captured the typography, scraping the ink off the page, as it were.  Such an image could be used to print the text (and illustrations) back on clean paper.  The paperness of the original paper (and the actual edge of the sheet) was not a concern, nor was the binding.

At about the same time, however, some other projects focused on medieval illuminated manuscripts, incunabula, and various important first editions.  These projects used a different style of imaging.  They scanned in color, even when the item was black ink on white paper, since (of course) there are many shades of white, especially as paper ages.  These projects also tended to scan just past the edges of the sheet and often produced images of the binding and spine.  The resulting image set represented a facsimile of the sort we might associate with an art museum.  By the way, here’s a video of such a project at the Smithsonian Institution Libraries.

Figure 3. From the Book of Hours, a 1524 French illuminated manuscript, number 10 in the Lessing J. Rosenwald Collection in the Library of Congress Rare Book and Special Collections Division

Enough for today.  Tomorrow, I’ll have a bit more to say about the imaging of the pages in books (and other documents), the informational capture of catalog cards, and some related matters.
