
A “Library of Congress” Worth of Data: It’s All In How You Define It


When I wrote my post on the “Library of Congress” as a unit of measure, I expected to receive some feedback.

And boy, did I.

"Black is my favorite color" from flickr user cbgrfx123, some rights reserved
Black is my favorite color, by cbgrfx123, on Flickr

As expected, I received some new examples:

  • “In less than two years the app has already hosted more than 500 million images — more than 30 times greater than the entire photo archive of the Library of Congress.”  LINK
  • “MAST is currently home to an estimated 200 terabytes of data, which… is nearly the same amount of information contained in the U.S. Library of Congress.”  LINK
  • “This year, CenturyLink projects that 1.8 zettabytes of data will be created. By 2015, the projection is 7.9 zettabytes. That’s the equivalent of 18 million times the digital assets stored by the Library of Congress today.”  LINK
  • Twitter “needed just 20 terabytes to back up every tweet that’s ever existed… that’s about twice the estimated size of the print collection of the Library of Congress.”  LINK
  • “A TB, or terabyte, is about 1.05 million MB. All the data in the American Library of Congress amounts to 15 TB.”  LINK
  • “One petabyte of data is equivalent to 13.3 years of high-definition video, or all of the content in the U.S. Library of Congress — by its own claim the largest library in the world — multiplied by 50…”  LINK

But what I also got were calls, emails, and tweets asking why I didn’t set the record straight about the size of the Library’s digital collections, and share a number.  The answer to the question about the size of the collections is:  it depends.

Do we count… items or files or the amount of storage used?  What constitutes an item?

Do we count… master files? Derivative files? Copies on servers? Copies on tape?  Second (third, fourth) copies in other distributed preservation locations?

Do we count… files we “own”? Files in our physical control? Files we license access to that live elsewhere?

And, when we digitize one more item at 5 p.m. that hadn’t existed in our collections at 4:59 p.m., do we update our counts/extents?

So, here’s what I can say:  the Library of Congress has more than 3 petabytes of digital collections.  What I can also say with certainty is that by the time you read this, all the numbers — counts and amounts of storage — will have changed.
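To make the spread in these comparisons concrete, here is a short Python sketch that measures each figure quoted in the examples above against the Library’s own “more than 3 petabytes” number. It assumes decimal units (1 PB = 1,000 TB; 1 ZB = 1,000,000,000 TB) and uses only the numbers from the quotes, none of which are authoritative Library figures:

    TB = 1                      # work in terabytes
    PB = 1_000 * TB             # decimal units assumed
    ZB = 1_000_000_000 * TB

    loc_digital = 3 * PB        # "more than 3 petabytes of digital collections"

    # Implied "Library of Congress" sizes from the quotes above
    quoted_figures = {
        "MAST archive (quoted as ~1 LoC)":            200 * TB,
        "2x Twitter backup (quoted as ~1 LoC print)":  20 * TB,
        "'all the data in the LoC' (quoted)":          15 * TB,
        "1 PB = 50 LoC (quoted)":                       1 * PB / 50,
    }

    for label, size_tb in quoted_figures.items():
        print(f"{label}: {size_tb:>6.0f} TB = {size_tb / loc_digital:.3f} of the 3 PB figure")

    # CenturyLink's 1.8 ZB projection, measured in 3 PB "Libraries of Congress"
    print(f"1.8 ZB is about {1.8 * ZB / loc_digital:,.0f} Libraries of Congress at 3 PB each")

Depending on which quote you pick, a “Library of Congress” works out to anywhere from 15 TB to a few hundred terabytes, which is exactly the ambiguity described above.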

Comments (13)

  1. The McKinsey report, “Big data: The next frontier for innovation, competition, and productivity” (June 2011*), states that 235 terabytes of data had been “collected by the US Library of Congress by April 2011.” However, you mention above that “the Library of Congress has more than 3 petabytes of digital collections.” Is it correct to say that in the last year, 2-3 additional petabytes of data have been digitized? I can understand that the amount of digital content will change daily, but the jump between 235 TB and >3 PB in 12 months is a huge amount of digitized content.

    I’m curious: How do you back up this amount of data (tape, disk, optical, something else)? How often is this content backed up?

    Cheers,

    Roger

    * see page vi for the data/attribution of 235 TB: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation

  2. Thanks for asking, Roger. The Library adds to its digital collections in many ways, not just through digitization of its own physical collections of all formats (text, images, maps, video, audio, etc.). We acquire born digital collections through the Copyright Office, individual and vendor agreements, and through collaborative projects with numerous partner organizations. We have also been selectively archiving the web since 2000. That quote from the McKinsey report is referring to the published number for the Library’s collected web archives only, not the full extent of the Library’s digital collections. In 2011 we already had more than 2 PB of digital collections at the Library. As to backups, we copy files to tape, utilizing more than one tape storage architecture in multiple distributed locations to reduce risk.

  3. This is an interesting discussion but have you (or someone) at the Library of Congress performed an estimate of the size of ALL your holdings if they were ALL digitized? I’m seeking a more authoritative number than everyone’s guesses.

  4. No, Christine, we haven’t. The size of our collections changes every day, so any estimate of the size of a complete set of digital surrogates, along with our born-digital collections, would immediately be out of date. We could attempt some calculations for a given point in time, but that would be a considerable effort given the huge range of item types. Let me think about how we might do it.

  5. Given that mixed media will occupy different amounts of storage depending on things like image resolution and sound quality, I was wondering if anyone has estimated the total number of characters required to hold just the text in the LoC collection at present? I’m guessing that it is considerably less than 2 PB, but I could be wrong.
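A purely back-of-the-envelope sketch of the calculation this comment asks about might look like the following; both inputs are made-up placeholders for illustration, not Library statistics:

    # Illustrative only: both inputs are assumptions, not Library figures.
    assumed_print_items = 35_000_000      # hypothetical number of print volumes
    assumed_chars_per_item = 600_000      # hypothetical average characters per volume

    total_chars = assumed_print_items * assumed_chars_per_item
    total_tb = total_chars / 1e12         # one byte per character, decimal terabytes

    print(f"about {total_chars:.1e} characters, or roughly {total_tb:.0f} TB of plain text")
    # With these made-up inputs, plain text lands in the tens of terabytes,
    # far below the multi-petabyte total for the full digital collections.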

  6. The ‘moving target’ nature of the LOC ‘unit’ is somewhat problematic, but otherwise it’s an interesting tool.

    However, has anyone there ever heard of ‘Jumping Jesi’ as a measure of information volume change across history? I was told of this in the late 80s, I believe, but have no source info and cannot find it online. I doubt the source was published in the 80s, but I cannot be sure.

    A “Jesi” is described as the “amount of information on Earth at the time of Jesus’ birth”. And the ‘jumping’ part indicates the time when a jesi doubles. This, of course, would produce a chart showing the rate of information accumulation on Earth, through history.

    Clearly a very Western-centric unit, and somewhat of an estimate since we cannot now measure the information that existed in 0 AD, but also a rather useful one that I cannot seem to find demonstrated or attempted anywhere else, online at least.

    thanks!

  7. I would like to make an observation. The unit LOC should not be considered a unit, but rather a scalar or constant of some form.

    Because of the rapid increase in the size of the Library of Congress, one can compare it to the rapid increase in data. A more diligent person than me could likely determine how the data size of the Library of Congress changes and how that correlates with the boom in overall world data. I propose the constant (LOC) be used in this sort of fashion:

    X = amount of data in the Library of Congress at a given moment
    Y = amount of overall world data, including every last forum post and napkin doodle, at the same moment
    K = exponent, likely associated with time
    (LOC) = constant relating the numbers
    Z = extraneous information (duplications, redundant information, backups, etc.)

    (LOC) * X^K = Y - Z

    Now, the equation would likely be significantly more complex than that, with K itself changing over time, but I suspect one could estimate world data size to varying accuracy if such a constant were found, and extend that to future estimates using some derivative of this.
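As a rough illustration of the relation proposed in the comment above, the following Python sketch uses entirely made-up numbers (not real measurements) to show how two observations of X and Y - Z would determine both K and the (LOC) constant by working in logarithms:

    import math

    # Made-up observations, in petabytes, purely for illustration.
    # x = Library of Congress data size, yz = world data net of duplicates (Y - Z).
    x1, yz1 = 2.0, 800_000.0      # an earlier moment
    x2, yz2 = 3.0, 1_800_000.0    # a later moment

    # Taking logs of (LOC) * X^K = Y - Z gives a line:
    #   log(Y - Z) = log(LOC) + K * log(X)
    K = (math.log(yz2) - math.log(yz1)) / (math.log(x2) - math.log(x1))
    LOC = yz1 / (x1 ** K)

    print(f"K = {K:.2f}, (LOC) constant = {LOC:,.0f}")
    print(f"check: (LOC) * x2^K = {LOC * x2 ** K:,.0f} (should equal {yz2:,.0f})")

With more than two observations, one would fit the log-log line by least squares instead, and K could itself be allowed to vary with time, as the comment suggests.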

  8. When I worked in a library, the Library of Congress was “LC” and the Library of Congress Catalog Number is still “LCCN.” Why do I keep seeing “LOC,” which to me means (number of) “Lines Of Code”?

  9. Hi there! I’m a data journalist, and I think a “library of Congress” worth of data would be a great benchmark – but only if it makes sense on a gut level.

    So how about the digital size of all the *printed* material stored in the LOC? Meaning, anything that originally lived on paper?

    When I think library, I still think physical books – materials I can hold in my hands. I would love a figure for how many terabytes of data are contained in all printed works of importance. It would be a relatable quantity – or would be for me.

  10. Jared Kendall: how do we define “Works of importance”?

  11. I think LOC is a great unit of measurement; I’d particularly prefer just the text in all the print material in the LC.

    As soon as we start digitizing non-text content, the way it was digitized matters more than how much valuable information was in the original.

    So keeping it to just text makes it less ambiguous, and as Darnall rightly pointed out, using just the physical collection is more powerful.

  12. Sigh darn you autocorrect I meant to credit Jared Kendall. Not “Darnall” 🙂
