Before You Were Born: We Were Digitizing Texts

We are all pretty familiar with the process of scanning texts to produce page images and converting them with optical character recognition into full text for indexing and searching. But electronic texts have a far older pedigree.

The UVa EText Center Web Site in 1997

Text digitization in the cultural heritage sector started in earnest in 1971, when the first Project Gutenberg text — the United States Declaration of Independence — was keyed into a file on a mainframe at the University of Illinois. The Thesaurus Linguae Graecae began in 1972. The Oxford Text Archive was founded in 1976. The ARTFL Project was founded at the University of Chicago in 1982. The Perseus Digital Library started its development in 1985. The Text Encoding Initiative started in 1987. The Women Writers Project started at Brown University in 1988. The University of Michigan’s UMLibText project was started in 1989. The Center for Electronic Texts in the Humanities was established jointly by Princeton University and Rutgers University in 1991. Sweden’s Project Runeberg went online in 1992. The University of Virginia EText Center was also founded in 1992. These projects focused on keyed-in text structured with markup (ASCII or SGML at the time), later transitioning to HTML and, eventually, to XML.

Mosaic, the first web browser, was released in November 1993. The web was the “killer app” for digitized cultural heritage materials.

Do you want to see some real digitized text history? Check out this archived list of electronic text centers from 1994. It’s an international Who’s Who of digital humanities. And it’s a wonderful piece of computing history in and of itself, with its gopher servers and VAX machines and USENET groups and anonymous FTP sites.

Preservation has long been part of the history of text digitization. The phrase “Preservation Reformatting” is known to all of us, and digitization is part of many institutional preservation strategies, especially for brittle books. Yale University Library and Cornell University Library undertook test projects to digitize text materials and produce preservation microfilm from the digital files. Yale’s Project Open Book started in 1991, and the Cornell demonstration project formally started in 1993. Making of America launched in 1995. The first round of Library of Congress Ameritech digitization for the National Digital Library was in 1996. The National Archives and Records Administration’s Electronic Access pilot project started in 1997.

I’m skipping over much of the last 15 years of text digitization, in which the work has shifted from bespoke projects to mass digitization. There is extensive coverage of the Open Content Alliance, the Universal Digital Library/Million Book Project, and Google Book Search, among others. The Library of Congress is still participating in the development of standards for page imaging of textual materials as part of its Federal Agencies Digitization Guidelines Initiative work.

And I’m not going to enter into the debate over page images versus marked-up keyed text versus OCR, which has existed since the earliest days of text digitization. There are ardent points of view on the pros and cons of each.

There is, of course, the question of preserving the output of these text digitization projects. There is a very thorough report worth reading on that topic: Preservation of Digitized Books and Other Digital Content Held by Cultural Heritage Organizations, written jointly by Cornell University and Portico in 2011.



  1. Sharad
    December 19, 2012 at 3:22 pm


    This was a stroll down memory lane. I was thinking of older internet products last night–particularly the image/font-deprived e-mail systems I used, the battle between Netscape and Mosaic, using VAX (via dial-up), the word “baud,” and the screeching, beeping noises that I associate with it. Alta Vista and Infoseek!

    I remember writing a paper last year on digitization efforts and Copyright issues (*cough*Google Books*cough*) and in the research process, I came across a few books in the Library’s collection on Optical Character Recognition. Didn’t know it dated back to the early-to-mid 60s.

    (Thanks for posting!)

  2. Carl Fleischhauer
    December 21, 2012 at 9:24 am

    I will add to Leslie’s interesting list the following item from our own work here at the Library of Congress: The Workshop on Electronic Texts in June 1992. The proceedings are available here: As far as I can tell, this was the first meeting that brought together two rather different perspectives on the question, “What does it mean to reproduce a book in digital form?” Some attendees came from the realm of microfilming and preservation-via-high-end-photocopying, an approach we often associate with the important book scanning project at Cornell University. (At first it sought to print the digital images back to acid-free paper and then rebind a fresh copy of the book.) At the other end of the spectrum were the academics associated with the Text Encoding Initiative (TEI), launched in 1987. The summum bonum for the TEI practitioners was a perfect text transcription, marked-up with SGML (the parent of HTML and XML). What this conference accomplished, I think, was to begin to define a dual model: facsimile images conjoined with (not necessarily perfect) searchable text. This model has prevailed in the online presentation of textual materials during the last two decades.

  3. Graeme Johanson
    December 25, 2012 at 6:13 pm

    One of the more bizarre exercises in imperialism and projection of human ambition dates from the 1977 ‘golden record’ on Voyager I and II. See: Why any scientist would assume that ‘life’ in remote space would want to decode human sounds is extraordinary. The capsule is much more of an archive of US conceit than a real record of ‘humanity’. It is a declaration of official culture.
