How 'Big' Is the Library of Congress?

I want to preface this post by reiterating one of our general disclaimers up front, to wit: “This blog does not represent official Library of Congress communications.” Because this post will edge slightly closer to “editorializing” than most of my previous posts.

Working in the Office of Communications as I do, I’m aware of the lion’s share of news coverage about and references to the Library, whether via Lexis-Nexis, Google News, or any number of other monitoring services to which we have access. This is a rough approximation — and perhaps a bit hyperbolic — but I would guess that somewhere between a quarter and a third of the general references I see to the Library are along these lines: “Our servers can store the equivalent of the Library of Congress,” or “our network is fast enough to download the entire Library of Congress in a millisecond.” You get the drift.

So it raises the question: just how “big” is the Library of Congress in terms of our content, especially if one tried to equate it to the digital realm?

I won’t go into any of the specific claims that are being made, but they’re easy to find out there in the ether, and suffice it to say that the Library would stand behind very few if any of them. There are certain things we can quantify, but far more that are purely speculative.

For instance, we can as of this moment say that the approximate amount of our collections that are digitized and freely and publicly available on the Internet is about 74 terabytes. We can also say that we have about 15.3 million digital items online.

Some may be tempted to extrapolate that those digital items represent a precise percentage of the nearly 142 million items in the Library’s physical collections, and then estimate some kind of digital equivalent. But comparing digital and physical items is apples and oranges, at best. A simple example of the fallacy: a single photograph online that depicts several physical objects.

Another likely source of digital estimates is the number of books and printed items in our collections, which is currently about 32 million. One could attempt to establish the average length of those items (pages, words, characters, etc.) and extrapolate the digital equivalent of those 32 million physical items.
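To see how sensitive that kind of extrapolation is, here is a minimal sketch. The 32 million figure comes from the post; every average below (pages per book, characters per page, bytes per character) is an illustrative assumption, not a Library statistic.

```python
# Back-of-envelope sketch of the "digitize the books" extrapolation.
# All per-book averages are illustrative assumptions, not Library figures.
BOOKS = 32_000_000          # printed items in the collections (from the post)
PAGES_PER_BOOK = 300        # assumed average
CHARS_PER_PAGE = 2_000      # assumed average
BYTES_PER_CHAR = 1          # plain ASCII text, no images or markup

total_bytes = BOOKS * PAGES_PER_BOOK * CHARS_PER_PAGE * BYTES_PER_CHAR
terabytes = total_bytes / 10**12
print(f"~{terabytes:.0f} TB of plain text")  # ~19 TB under these assumptions
```

Nudging any one of those assumed averages swings the total by terabytes, which is exactly why such estimates are so shaky.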

Assuming one could do that with any degree of accuracy — and that’s a big assumption — it overlooks the fact that those 32 million books represent only about one-quarter of the entire physical collections. The rest are in the form of manuscripts, prints, photographs, maps, globes, moving images, sound recordings, sheet music, oral histories, etc. So how does that other three-quarters of the Library equate digitally? Can one automatically assume the digital resolution at which all maps or photographs, for instance, would be scanned? Those are major wildcards indeed.

And then there are our motion pictures, videos and sound recordings — around 6 million items stored at our new Packard Campus for Audio-Visual Conservation in Culpeper, Va. What is their digital equivalent? Most people who record television programs onto a computer or DVR know that a hard drive with hundreds of gigabytes or even a terabyte or more can quickly fill up.
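A quick sketch shows why video fills drives so fast. The 5 Mbps figure is an assumed, plausible standard-definition recording bitrate, not a Library specification:

```python
# Why recorded video fills a hard drive quickly.
# The bitrate is an assumption (a plausible SD recording rate).
BITRATE_BPS = 5_000_000      # 5 megabits per second
SECONDS_PER_HOUR = 3600

bytes_per_hour = BITRATE_BPS * SECONDS_PER_HOUR / 8   # bits -> bytes
gb_per_hour = bytes_per_hour / 10**9
hours_on_1tb = 10**12 / bytes_per_hour
print(f"{gb_per_hour:.2f} GB per hour")                  # 2.25 GB
print(f"~{hours_on_1tb:.0f} hours fit on a 1 TB drive")  # ~444 hours
```

At higher-quality preservation bitrates, a single drive would hold far fewer hours still.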

One more thing we can quantify or at least estimate: The folks at the Packard Campus say that when their systems are fully online, they expect to be able to digitize between 3 and 5 petabytes of content per year. (That is to say, 3,000 to 5,000 terabytes, for those who are playing at home. Put another way, a single petabyte stored on CD-ROMs would create a stack of discs more than a mile high.) And even at that rate, it would still take decades to digitize the existing content.
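The mile-high-stack claim checks out arithmetically. Here is the sanity check, using the standard (assumed) figures of roughly 700 MB capacity and 1.2 mm thickness per CD-ROM:

```python
# Sanity check: a petabyte on CD-ROMs really is a stack over a mile high.
# Disc capacity and thickness are standard figures, assumed here.
PETABYTE = 10**15            # bytes
CD_BYTES = 700 * 10**6       # ~700 MB per CD-ROM
CD_THICKNESS_M = 0.0012      # 1.2 mm per disc
MILE_M = 1609.34

discs = PETABYTE / CD_BYTES
stack_m = discs * CD_THICKNESS_M
print(f"{discs:,.0f} discs, a stack ~{stack_m:,.0f} m high")  # ~1.4M discs, ~1,714 m
```

At 3 to 5 petabytes per year, that is three to five such stacks annually, and still decades of work.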

So at this point, we’re talking about potentially mind-boggling amounts of data — we’re into the territory of angels and heads of pins. And while it is a bit of a tangent, it points to the continued importance of libraries as places (and librarians) when too many people assume that “everything is online.”

While it is certainly flattering that the Library of Congress is used as a typical benchmark against which others measure their content or data capacity, we would do well to take these claims with at least a shaker of salt. We are far “bigger” than many of them might think.

9 Comments

  1. Amanda French
    February 11, 2009 at 4:50 pm

    Fascinating! Thanks for posting this.

  2. Aatom
    February 11, 2009 at 8:02 pm

    This is rather fascinating – it’s daunting to imagine how much human output has been collected and stored by the LoC.

    I picture you in one of the dark, labyrinthine Name of the Rose libraries writing this, btw.

  3. Robin Williams
    February 12, 2009 at 7:37 am

I personally think that the number of books does not make any library big or small. The main concern should be whether the content of the library is a good source of knowledge or not. Anyway, thank you very much for the post.

  4. Jennifer Bowman
    February 23, 2009 at 12:17 am

    This is amazing. I knew that the Library of Congress had an incredible amount of information, but I never fully realized everything that it holds. Thank you for posting; this gave me a new perspective and appreciation.

  5. Library
    March 1, 2009 at 10:40 pm

    Interesting post.

    Library of Congress

  6. jason griese
    March 10, 2009 at 9:09 pm

I was thinking about this the other day. In the event that tragedy struck the LOC, is there a separate record of the property so that maybe a new compilation could be assembled, even though it would not be near complete?

  7. Matt Raymond
    March 17, 2009 at 5:09 pm

    Jason, many of our collections are unique in all the world, but collections security is something that we — particularly the Librarian of Congress — take VERY seriously.

    In terms of our digital content, there are many analogous security measures and redundancies.

  8. Bill Clayton
    August 14, 2009 at 12:41 pm

    Each month, 500 million people interact on the Yahoo! website, doing searches, emailing, reading news and other content. It adds up to 25 terabytes of data each day. If you digitized all of the books in the Library of Congress, you’d get about 10 terabytes of data. So, on Yahoo! alone, people consume more than two Libraries of Congress a day. HOWEVER, data at the Library of Congress are organized, whereas Internet data is scattered. Search companies like Google are trying to organize all of that information but, until they do, the Library of Congress will be the premiere source of information. I suspect the Library will evolve, too, changing its model to fit in seamlessly with the coming Internet age.

  9. Just Think
    September 9, 2014 at 10:13 am

    @Bill Clayton

Your attempt at comparison is very flawed. You cannot on one hand (Yahoo!) look at the size of the data times the number of people accessing it and on the other (LoC) look at only the data. Well okay, you can. But it’s meaningless. Plus, given that the article emphasizes that the LoC holds more than books and that they are storing about 5 PB a year, your post seems trite.
