When I wrote my post on the “Library of Congress” as a unit of measure, I expected to receive some feedback.
And boy, did I.
As expected, I received some new examples:
- “In less than two years the app has already hosted more than 500 million images — more than 30 times greater than the entire photo archive of the Library of Congress.” LINK
- “MAST is currently home to an estimated 200 terabytes of data, which… is nearly the same amount of information contained in the U.S. Library of Congress.” LINK
- “This year, CenturyLink projects that 1.8 zettabytes of data will be created. By 2015, the projection is 7.9 zettabytes. That’s the equivalent of 18 million times the digital assets stored by the Library of Congress today.” LINK
- Twitter “needed just 20 terabytes to back up every tweet that’s ever existed… that’s about twice the estimated size of the print collection of the Library of Congress.” LINK
- “A TB, or terabyte, is about 1.05 million MB. All the data in the American Library of Congress amounts to 15 TB.” LINK
- “One petabyte of data is equivalent to 13.3 years of high-definition video, or all of the content in the U.S. Library of Congress — by its own claim the largest library in the world — multiplied by 50…” LINK
But what I also got were calls, emails, and tweets asking why I didn’t set the record straight about the size of the Library’s digital collections, and share a number. The answer to the question about the size of the collections is: it depends.
Do we count… items or files or the amount of storage used? What constitutes an item?
Do we count… master files? Derivative files? Copies on servers? Copies on tape? Second (third, fourth) copies in other distributed preservation locations?
Do we count… files we “own”? Files in our physical control? Files that live elsewhere and that we license access to?
And, when we digitize one more item at 5 p.m. that hadn’t existed in our collections at 4:59 p.m., do we update our counts/extents?
So, here’s what I can say: the Library of Congress has more than 3 petabytes of digital collections. What else I can say with all certainty is that by the time you read this, all the numbers — counts and amount of storage — will have changed.
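For anyone who wants to see how far apart the quoted comparisons are, here is a rough sketch in Python that works out the Library size each one implies. It is only an illustration under stated assumptions: decimal units (1 TB = 10^12 bytes), labels that are my own shorthand for the quotes above, and the caveat that the Twitter quote refers to the print collection rather than the digital collections. The quotes about image counts and zettabyte projections are left out because they do not translate cleanly into a single size.

```python
# Assumption: decimal (SI) units, 1 TB = 10**12 bytes and 1 PB = 10**15 bytes.
TB = 10**12
PB = 10**15

# Library size implied by each quoted comparison, in bytes.
# The labels are shorthand for the quotes, not official figures.
implied_sizes = {
    "MAST (200 TB is nearly the same amount)": 200 * TB,
    "Twitter (20 TB is about twice the print collection)": 20 * TB // 2,
    "Terabyte definition (all the data amounts to 15 TB)": 15 * TB,
    "Petabyte comparison (1 PB is the Library times 50)": 1 * PB // 50,
}


def in_terabytes(size_bytes):
    """Convert a size in bytes to decimal terabytes."""
    return size_bytes / TB


for label, size in implied_sizes.items():
    print(f"{label}: {in_terabytes(size):,.0f} TB")

# The figure stated in this post: "more than 3 petabytes".
stated = 3 * PB
smallest = min(implied_sizes.values())
largest = max(implied_sizes.values())

print(f"Spread among the quoted figures: {largest / smallest:.0f}x")
print(f"Stated figure vs. largest quoted figure: {stated / largest:.0f}x")
```

Even the largest of these implied figures is well below the 3 petabytes stated here, which is one more reason to treat any single number with caution.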


May 29, 2012 at 6:46 pm
The McKinsey Report, “Big data: The next frontier for innovation, competition, and productivity” (June, 2011*) states 235 terabytes of data has been “collected by the US Library of Congress by April 2011.” However, you mention above that “the Library of Congress has more than 3 petabytes of digital collections.” Is it correct to say that in the last year, 2-3 additional petabytes of data have been digitized? I can understand that the amount of digital content will change daily, but the jump from 235 TB to >3 PB in 12 months is a huge amount of digitized content.
I’m curious: How do you back up this amount of data (tape, disk, optical, something else)? How often is this content backed up?
Cheers,
Roger
* see page vi for the data/attribution of 235 TB: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
May 30, 2012 at 9:58 am
Thanks for asking, Roger. The Library adds to its digital collections in many ways, not just through digitization of its own physical collections of all formats (text, images, maps, video, audio, etc.). We acquire born-digital collections through the Copyright Office, individual and vendor agreements, and through collaborative projects with numerous partner organizations. We have also been selectively archiving the web since 2000. That quote from the McKinsey report refers to the published number for the Library’s collected web archives only, not the full extent of the Library’s digital collections. In 2011 we already had more than 2 PB of digital collections at the Library. As to backups, we copy files to tape, utilizing more than one tape storage architecture in multiple distributed locations to reduce risk.
May 1, 2013 at 3:20 pm
This is an interesting discussion, but have you (or someone else at the Library of Congress) estimated the size of ALL your holdings if they were ALL digitized? I’m seeking a more authoritative number than everyone’s guesses.
May 2, 2013 at 10:58 am
No, Christine, we haven’t. The size of our collections changes every day, so any estimate of how large a complete set of digital surrogates, along with our born-digital collections, would be would immediately be out of date. We could attempt some calculations for some point in time, but that would be a not inconsiderable effort given the huge range of item types. Let me think about how we might do it.