Public Access to Historical Records

The following is a guest post by Maria A. Pallante, Register of Copyrights and Director of the U.S. Copyright Office.

Help Wanted: Have you ever attempted to build an electronic index and searchable database of a complex and diverse collection of 70 million imaged historical records? Neither have we.

One of the largest card catalogs in the world, the U.S. Copyright Office card catalog comprises approximately 46 million cards. Photo by Cecelia Rogers, 2010.

Current records dating back to 1978 are available online and searchable at www.copyright.gov/records. The Office’s records date back to 1870, however, and many pertain to works still under copyright protection. These records are the focus of our current digitization efforts. This is an ambitious project that I announced recently as one of several priorities and special projects the Copyright Office is undertaking. To date, nearly 13 million index cards from our card catalog and over half of the 660-volume Catalog of Copyright Entries have been scanned, and the images have been processed through quality assurance and moved to long-term managed storage.

So, back to the earlier question: How do we go about creating a searchable database composed of 70 million digital objects? For that matter, how do we create metadata for such a large volume of records? Assuming we would like to achieve full-level indexing, how do we do so on a rudimentary indexing budget? What technologies and creative approaches can we profitably employ to get this work done? We welcome your ideas and suggestions on these and many other questions related to this project.

The Copyright Office historical catalog serves as the mint record of American creativity, and there are great benefits to making the collection accessible online. We know that working collaboratively will ensure that the final product best meets the needs of the widest audience of users. I hope you will subscribe to our project blog at http://blogs.loc.gov/copyrightdigitization/ and visit our project web page at www.copyright.gov/digitization from time to time. Most of all, I hope that you will be an active partner in this important effort.

16 Comments

  1. David Fessenden
    December 1, 2011 at 1:53 pm

    I would hope that for works which have entered the public domain, the records could state this clearly. It would save a lot of people a lot of work and research.

    As I read the copyright law, any work by an author who has been dead 70 years or more must be in the public domain, unless it was published posthumously. And yet I frequently get requests for permission to reprint the works of 19th-century authors.

  2. LisaMary
    December 1, 2011 at 1:55 pm

    My hope for the digitization process is that one day we will have a “print on demand” service for the whole archive. With travel grants drying up so rapidly, this would allow much research to be done off site, so that both funds and time would be easier to focus. Thank you.

  3. Andrew Sly
    December 1, 2011 at 10:28 pm

    David: Unfortunately, copyright gets more complicated the more you look into it. In the USA, as I understand it, the life+70 term only came into effect on March 1, 1989, and it was _not_ retroactive. Before that was a different system which required copyright registration, and had one optional renewal period (which was later made mostly automatic.) And of course, this is only in the US. If you are in a different country, that country’s laws will apply.

    Asking for a clear indication of the public domain status of every item is a little far-fetched because many items may require hours of research and still have inconclusive results.

  4. David Starner
    December 1, 2011 at 11:34 pm

    “Print on demand”? There are times and places for printing even today, but 660-volume reference works aren’t the place. Online reference and downloadable ebooks should be just fine.

  5. WJM
    December 1, 2011 at 11:53 pm

    Distribute the work. Let researchers and third parties contribute their own labour, subject to cross-checks and QC.

  6. Sharad Shah
    December 2, 2011 at 8:33 am

    With regard to metadata and making early claims (i.e. handwritten) searchable, there are OCR programs available, but the problem comes down to recognition, and dependence on such software could lead to unnecessary errors in the text (“r” for “n” or “a” for “o” are some of the better examples). It’s guesswork and not 100% accurate.

    While thorough, to go back and proofread every application and compare it with the converted (and searchable) text would be extremely time consuming.

    The other option is to limit the metadata and converted information to that which is typically used for searches: author/claimant, date, type of work, and title.
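
A minimal sketch of that last option, in Python. It assumes each card is reduced to the four fields named above and that title searches should tolerate small OCR slips such as “r” for “n”. The names `CardRecord` and `search_titles`, the 0.8 threshold, and the sample records are all hypothetical and exist only for illustration; nothing here reflects how the Copyright Office actually indexes its records.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class CardRecord:
    """Hypothetical index entry limited to the four commonly searched fields."""
    claimant: str
    date: str
    work_type: str
    title: str


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search_titles(records, query, threshold=0.8):
    """Return records whose title is close to the query, best match first.

    The threshold is an assumed value; a real project would tune it
    against a proofread sample of cards.
    """
    scored = [(similarity(record.title, query), record) for record in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for score, record in scored if score >= threshold]


# Made-up sample data: the first title carries an "r for n" OCR error.
records = [
    CardRecord("Example Claimant A", "1925", "book", "Huckleberry Firr"),
    CardRecord("Example Claimant B", "1931", "book", "Leaves of Grass"),
]

for match in search_titles(records, "Huckleberry Finn"):
    print(match.title, "|", match.claimant, "|", match.date)
```

Approximate matching of this kind trades some false positives for not having to proofread every converted card, so the threshold would need to be checked against real data before anyone relied on it.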

  7. Mike Ratoza
    December 2, 2011 at 12:14 pm

    Maria, Google did it.

  8. Rene Ford
    December 2, 2011 at 12:38 pm

    One industry that processes millions of digital images every day is the healthcare industry. Claims that are not filed electronically are digitally imaged, and OCR is used to insert the different pieces of the claim into a database. It’s been around long enough that they, I believe, have worked out most of the kinks with OCR. You may want to contact a government-friendly insurance carrier for ideas.

  9. Mary Minow
    December 2, 2011 at 2:49 pm

    Stanford would be a good point of contact. Their renewal database is quite useful.
    http://collections.stanford.edu/copyrightrenewals/bin/page?forward=home

  10. Michael Capobianco
    December 5, 2011 at 12:06 pm

    I would say you should prioritize on the basis of questionability of copyright status, that is, concentrate on works in the 1923-1963 period and renewals, with books being highest priority, then periodicals and short fiction. As has been mentioned, there’s a very clear beginning on this with Stanford’s digitization of copyright renewal year-end summaries and the searchable version of the Stanford digitization that Google produced. So the first order of business should be to take the digitized versions of summaries and verify them to make them official.

  11. PH
    December 5, 2011 at 8:35 pm

    DIY::: / Keep It In House / Check The Test Results From Past Government Scanning Start-up Programs / Work With The Manufacturers In Real Time Usage / With Techs Instructing Your Staff, The Staff From Others Who Keep Our History Alive / Manufacturers Benefit ::: Pre-Market – No Cost Testing, Advertising – National Word Of Mouth, Instructors Less Traveling – Keeps Cost Down – New Data Gathered / LOC Benefits ::: Past & Up To Date Data / Equipment & Instruction Provided / Offers Others In The Field An Opportunity To Be Part Of A History Making Event… Mass Scanning Project… / What Types Of Scanners Needed, Space Needed, Personnel Needed, Do The Math – Min Scanned Per Day / 70,000,000… Have Fun ;>

  12. Marisol Wallace
    December 30, 2011 at 8:09 am

    Bookmarked! Thanks for an amazing post, will read your other posts.

  13. Bernard Espinosa
    June 2, 2012 at 9:51 pm

    Good particulars. I’m back, to read the excellent info.

  14. Manuela Whitted
    June 27, 2012 at 11:39 am

    You might have the very greatest weblog content on the planet.

  15. Vinnie Staffon
    July 6, 2012 at 2:45 am

    The data you can find on this website is as outstanding, if not higher, than every little thing you’ll be able to discover

  16. Torrie Gibes
    July 19, 2012 at 11:40 am

    Your blog is really excellent. Nice articles; they help me a great deal.