A “Library of Congress” Worth of Data: It’s All In How You Define It

When I wrote my post on the “Library of Congress” as a unit of measure, I expected to receive some feedback.

And boy, did I.

"Black is my favorite color" from flickr user cbgrfx123, some rights reserved


As expected, I received some new examples:

  • “In less than two years the app has already hosted more than 500 million images — more than 30 times greater than the entire photo archive of the Library of Congress.”
  • “MAST is currently home to an estimated 200 terabytes of data, which… is nearly the same amount of information contained in the U.S. Library of Congress.”
  • “This year, CenturyLink projects that 1.8 zettabytes of data will be created. By 2015, the projection is 7.9 zettabytes. That’s the equivalent of 18 million times the digital assets stored by the Library of Congress today.”
  • Twitter “needed just 20 terabytes to back up every tweet that’s ever existed… that’s about twice the estimated size of the print collection of the Library of Congress.”
  • “A TB, or terabyte, is about 1.05 million MB. All the data in the American Library of Congress amounts to 15 TB.”
  • “One petabyte of data is equivalent to 13.3 years of high-definition video, or all of the content in the U.S. Library of Congress — by its own claim the largest library in the world — multiplied by 50…”
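
To see how far apart these figures are, the size each quote implies for the Library can be worked out with a little arithmetic. The following is only a rough sketch: it uses four of the quotes above, assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), which the quotes do not specify, and treats "nearly the same" and "about twice" as exact. The labels and variable names are illustrative, and the quotes do not all measure the same thing (print collection, "all the data", "all of the content").

```python
# Assumed decimal units; the quoted sources do not say which convention they use.
TB = 10**12  # bytes in a terabyte
PB = 10**15  # bytes in a petabyte

# Size of the Library implied by each quote, in bytes.
implied_sizes = {
    "MAST comparison (200 TB is nearly the same)": 200 * TB,
    "Twitter comparison (20 TB is twice the print collection)": 20 * TB // 2,
    "'All the data' claim (15 TB)": 15 * TB,
    "Petabyte comparison (1 PB is the content multiplied by 50)": PB // 50,
}


def in_terabytes(n_bytes):
    """Convert a byte count to decimal terabytes."""
    return n_bytes / TB


for label, size in implied_sizes.items():
    print(f"{label}: {in_terabytes(size):g} TB")

# Ratio between the largest and smallest implied figures.
spread = max(implied_sizes.values()) / min(implied_sizes.values())
print(f"The largest implied figure is {spread:g} times the smallest.")
```

Under these assumptions the four quotes imply anywhere from 10 TB to 200 TB, depending on what each writer chose to count.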

But I also got calls, emails, and tweets asking why I didn’t set the record straight about the size of the Library’s digital collections and share a number. The answer to the question about the size of the collections is: it depends.

Do we count… items, or files, or the amount of storage used? What constitutes an item?

Do we count… master files? Derivative files? Copies on servers? Copies on tape? Second (third, fourth) copies in other distributed preservation locations?

Do we count… files we “own”? Files we have in our physical control? Content that we license access to but that lives elsewhere?

And, when we digitize one more item at 5 p.m. that hadn’t existed in our collections at 4:59 p.m., do we update our counts/extents?

So, here’s what I can say: the Library of Congress has more than 3 petabytes of digital collections. What I can also say with certainty is that by the time you read this, all the numbers, both counts and amounts of storage, will have changed.

5 Comments

  1. Roger
    May 29, 2012 at 6:46 pm

    The McKinsey report “Big data: The next frontier for innovation, competition, and productivity” (June 2011*) states that 235 terabytes of data had been “collected by the US Library of Congress by April 2011.” However, you mention above that “the Library of Congress has more than 3 petabytes of digital collections.” Is it correct to say that in the last year, 2-3 additional petabytes of data have been digitized? I can understand that the amount of digital content will change daily, but the jump from 235 TB to more than 3 PB in 12 months is a huge amount of digitized content.

    I’m curious: How do you back up this amount of data (tape, disk, optical, something else)? How often is this content backed up?

    Cheers,

    Roger

    * see page vi for the data/attribution of 235 TB: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation

  2. Leslie Johnston
    May 30, 2012 at 9:58 am

    Thanks for asking, Roger. The Library adds to its digital collections in many ways, not just through digitization of its own physical collections of all formats (text, images, maps, video, audio, etc.). We acquire born-digital collections through the Copyright Office, through individual and vendor agreements, and through collaborative projects with numerous partner organizations. We have also been selectively archiving the web since 2000. That quote from the McKinsey report refers to the published number for the Library’s collected web archives only, not the full extent of the Library’s digital collections. In 2011 we already had more than 2 PB of digital collections at the Library. As to backups, we copy files to tape, using more than one tape storage architecture in multiple distributed locations to reduce risk.

  3. Christine Sorenson
    May 1, 2013 at 3:20 pm

    This is an interesting discussion but have you (or someone) at the Library of Congress performed an estimate of the size of ALL your holdings if they were ALL digitized? I’m seeking a more authoritative number than everyone’s guesses.

  4. Leslie Johnston
    May 2, 2013 at 10:58 am

    No, Christine, we haven’t. The size of our collections changes every day, so any estimate of how large a complete set of digital surrogates along with our born-digital collections would be would immediately be out of date. We could attempt some calculations for some point in time, but that would be a not inconsiderable effort given the huge range of item types. Let me think about how we might do it.

  5. Stephen Loughin
    June 18, 2014 at 2:39 pm

    Given that mixed media will occupy different amounts of storage depending on things like image resolution and sound quality, I was wondering if anyone has estimated the total number of characters required to hold just the text in the LoC collection at present? I’m guessing that it is considerably less than 2 PB, but I could be wrong.
