In July 2011, Nicholas Taylor posted an entry to this blog about the amount of data transferred to the Library of Congress and the likely sources of some of the public perceptions of the size of the Library’s digital collections. And Matt Raymond of the Library posted an excellent overview of the size of the Library’s collections in February 2009.
Since then, I’ve become somewhat obsessed with references to the size of the collections, and the use of “a Library of Congress” as a unit of measure. Just check Wikipedia under “unusual units of measurement.”
The ur-number seems to come from a 1997 report written by Michael Lesk titled “How Much Information Is There In the World?” In that report he provides the proposed calculation for the “size” of a digitized book, and the guesstimate that the Library had 20 million books. To be fair, this report also makes a guesstimate about the size of collections of photographs, video, and audio, and comes up with the figure of 3 petabytes’ worth of collections. For 1997, this was a very well-informed estimate.
But the numbers that caught the public’s imagination were the ones for books. And that 10 TB figure is everywhere.
So, how many Libraries of Congress does it take to…? Or how many Libraries of Congress can be contained in…?
- “Every Six Hours, the NSA Gathers as Much Data as Is Stored in the Entire Library of Congress.” LINK
- “Facebook’s photo collection has a staggering 140 billion photos, that’s over 10,000 times larger than the Library of Congress.” LINK
- “The [Honeywell India Technology] centre stores some 32 terabytes (32,768 GB) of data. That’s five times more than the world’s largest library – the US Library of Congress.” LINK
- “The fiber optic cable is capable of transmitting data at a maximum of 40 gigabits per second from deep-sea locations where gaps of instrument coverage currently exist. For comparison, the entire print collection of the Library of Congress could be transmitted over the link in just more than 30 minutes.” LINK
- “There are 25 Petabytes (10^15) created every day and thrown into the internet. This is 70 times larger than the Library of Congress.” LINK
- “…it is estimated that the entire collection of the Library of Congress including photos, sound recordings and movies might take 3,000 TB of storage. Assuming $100 each for 2 TB hard drives, the entire book collection of the Library of Congress could be stored on about $1500 worth of hard drives at today’s prices.” LINK
- “The upper end of the reference configurations is 96 blades [servers] with 1,152 cores, 9.2 TB memory and 57.6 TB of disk storage, enough disk space to store the entire Library of Congress six times.” LINK
- “He keeps 500 terabytes of storage near Factual’s headquarters. That’s about twice the amount needed to hold the entire Library of Congress.” LINK
- “The size of Facebook’s data retention database alone would be larger than all of the content that the Library of Congress has put online to date.” LINK
- “… in a world where the entire Library of Congress will soon be accessible on a mobile device with search procedures that are vastly better than any card catalog, factual mastery will become less and less important. ” LINK
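A few of the quotes above contain enough numbers to check against each other. The sketch below is only a back-of-the-envelope illustration: it assumes decimal units (1 TB = 10^12 bytes), takes the popular 10 TB figure as the print-collection size for the fiber optic example, and the function names are my own. It works out how long 10 TB would take over a 40 gigabit link, and what Library size each "N times the Library of Congress" claim implies.

```python
# Back-of-the-envelope check of figures quoted in the list above.
# Assumption: decimal units (1 TB = 10**12 bytes); some sources use 2**40.
TB = 10**12


def transfer_minutes(size_bytes, link_bits_per_second):
    """Minutes needed to move size_bytes over a link of the given speed."""
    return size_bytes * 8 / link_bits_per_second / 60


def implied_library_tb(quoted_tb, multiple):
    """Library size in TB implied by 'quoted_tb equals `multiple` Libraries'."""
    return quoted_tb / multiple


# The popular 10 TB print-collection figure over a 40 gigabit/s link.
minutes = transfer_minutes(10 * TB, 40 * 10**9)
print(f"10 TB over 40 Gbit/s: {minutes:.1f} minutes")

# (quoted size in TB, claimed multiple of the Library), taken from the quotes.
claims = {
    "Honeywell centre (32 TB, 5x)": (32, 5),
    "96-blade configuration (57.6 TB, 6x)": (57.6, 6),
    "Factual storage (500 TB, 2x)": (500, 2),
    "Daily internet data (25 PB, 70x)": (25_000, 70),
}
for label, (quoted_tb, multiple) in claims.items():
    size = implied_library_tb(quoted_tb, multiple)
    print(f"{label}: implies a Library of about {size:.1f} TB")
```

The transfer time comes out at roughly 33 minutes, which matches "just more than 30 minutes" and suggests that quote is built on the 10 TB figure. The implied Library sizes, on the other hand, range from under 10 TB to several hundred TB, depending on which estimate the writer happened to find.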
I have more of these, but I am always looking to add to my growing collection. Please let me know about more by commenting!
Comments (21)
Glad to have an ally in my hopeless battle against the Internet being collectively wrong!
Here are three more:
Infographic: Data Deluge – 8 Zettabytes of Data by 2015
(http://www.readwriteweb.com/enterprise/2011/11/infographic-data-deluge—8-ze.php) — LC digitized is “up to” 462 TB of data.
Seagate Announces World’s First 4-Terabyte External Hard Drive
(http://www.pcmag.com/article2/0,2817,2392562,00.asp) — 7th paragraph, “the entire Library of Congress …, by popular definition, takes up 10 terabytes of data”
Humanity’s Tweets: Just 20 Terabytes
(http://www.pcmag.com/article2/0,2817,2382347,00.asp) — 2nd paragraph, Twitter archive is “twice the estimated size of the print collection of the Library of Congress.”
Thanks! I am always glad to have more. There are a number of product and vendor benchmarks for capacity and speed measured in “Libraries of Congress” that I have not listed here.
I have had a number of people write to ask just how much digital data there really is at the Library of Congress. We don’t provide numbers since it changes every day. All I will say is that it is multiple Petabytes across all the collections, servers, and tape libraries. That should start some conversations!
Catching up on my reading and found a couple more:
Optimism shines through experts’ view of the future (http://www.smh.com.au/national/optimism-shines-through-experts-view-of-the-future-20120323-1vpas.html) — “All the data in the American Library of Congress amounts to 15 TB.”
Will Megaupload’s 28 petabytes of data be deleted? (http://www.computerworld.com/s/article/9225405/Will_Megaupload_39_s_28_petabytes_of_data_be_deleted_?taxonomyId=17) — “One petabyte of data is equivalent to 13.3 years of high-definition video, or all of the content in the U.S. Library of Congress — by its own claim the largest library in the world — multiplied by 50, according to a footnote in the court filing.”
Leslie,
Check out:
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. & Byers, A.H. (2011). Big data: the next frontier for innovation, competition, and productivity. Report. Seoul: McKinsey Global Institute. Retrieved June 1, 2011, from http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
p. vi:
235 terabytes data collected by the US Library of Congress by April 2011
15 out of 17 sectors in the United States have more data stored per company than the US Library of Congress
p. 3
One exabyte of data is the equivalent of more than 4,000 times the information stored in the US Library of Congress. (footnote 6: According to the Library of Congress Web site, the US Library of Congress had 235 terabytes of storage in April 2011.)
NY Times again: “the digital collection of the Library of Congress is a little more than 300 terabytes, according to an estimate earlier this year.”
All the TV News Since 2009, on One Web Site
Though this Internet Archive is a really impressive project!
The estimates, they are ever-present and highly variable. I love reading these, and appreciate that people write to me and tweet them to me.
I meant to post that last week I was sitting across from Michael Lesk at an event dinner. I teased him that his report had a ruinous effect on my life and he laughed. He never expected that to be the item from that report that took on its own life, long after it was correct or relevant.
“Multiple Petabytes” to me means 20.
A nice, solid, conservative figure that might actually last a few years. Really, really accurate? Probably not. But at some point, when making measurements, you have to come up with a standard.
Anne – We have more than 50 PB of storage because we keep multiple copies of files in different locations for preservation purposes. I can say that our digital collection size is over 5 PB.
I read an article saying new methods can store data in synthetic DNA and that, scaled up, about 2.2 petabytes of data can be stored in a gram of DNA. So I got curious and worked out that for a person of my weight (200 lbs, about 90,718 grams), around 199,580 petabytes could be stored. Great, but it’s mostly just numbers to me. HOW MUCH DATA IS 199,580 petabytes?! Please, I’m so intrigued and yet so unable to give this a workable scale in my mind.
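For anyone who wants to redo the arithmetic in the comment above, here is a minimal sketch. It takes the 2.2 petabytes per gram figure from the comment at face value, uses the standard pound-to-gram conversion, and, purely for scale, expresses the result in exabytes and in units of the popular 10 TB "Library of Congress". The names are illustrative, not from any source.

```python
# Re-running the DNA storage arithmetic from the comment above.
GRAMS_PER_POUND = 453.59237  # standard conversion factor
PB_PER_GRAM = 2.2            # figure quoted in the comment, taken at face value
POPULAR_LOC_TB = 10          # the popular 10 TB print-collection figure


def dna_capacity_pb(weight_lb):
    """Petabytes that could be stored in weight_lb pounds of DNA."""
    return weight_lb * GRAMS_PER_POUND * PB_PER_GRAM


capacity_pb = dna_capacity_pb(200)
capacity_eb = capacity_pb / 1000                    # 1 EB = 1,000 PB
libraries = capacity_pb * 1000 / POPULAR_LOC_TB     # 1 PB = 1,000 TB

print(f"{capacity_pb:,.0f} PB, or about {capacity_eb:,.1f} exabytes")
print(f"roughly {libraries:,.0f} ten-terabyte 'Libraries of Congress'")
```

So the commenter's number checks out: it is a little under 200 exabytes, or on the order of 20 million copies of the 10 TB figure.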
I have a much simpler question:
How much space would it take to store the text (text only) of every book ever written?
Arthur: That’s actually a much more difficult question than you think, because there are so many possible ways of calculating how many books have even been written. Written or published? In every language? And what constitutes a book? Is there a minimum length for what constitutes a book? Is a pamphlet a book? Is a serial a book? What about a government publication?
Google made an estimate of 130,000,000 books in 2010 (http://booksearch.blogspot.com/2010/08/books-of-world-stand-up-and-be-counted.html), but that only covers books published in relatively recent history, books that have had ISBNs assigned. That number is definitely NOT inclusive of all definitions of “book.”
In short, there’s no easy way to come up with a definitive number of how many books have ever been written or published. And since more books are published internationally every minute, the number is constantly changing. Just like the size of the digital collections of the Library of Congress.
The link in that fifth bullet point is awful. I feel like it was written by my great-grandfather after I tried telling him what Big Data is.
There’s a new ad out for HP Converged Infrastructure with the tagline, “Need to backup the Library of Congress tonight?” Gag!
http://hp.marketingbridge.com/m/t8r9o?page=7
OK, yes this is all very amusing. Let’s point and laugh at the poor, harassed, writers and editors who continually “get it wrong.”
Or, the LoC could, you know, actually publish an accurate number? You could then point editors to that when you spot one of these hi-lar-i-ous errors and then they could update their stories. Would that be such a bad use of taxpayer dollars?
Anyway, in the absence, I’ll assume from Leslie’s nods and winks that it’s of the order of 10PB.
There’s definitely no intent to harass tech writers. The issue is that the number changes every single day in an effort that includes multiple sites and dozens of digitization projects and acquisitions of born-digital content. And because older numbers continue to live on the web, it’s easy for someone to find those out-of-date numbers, or make estimates based on calculations from many years ago. When people ask our public affairs office, I can always provide a current answer.
I would say the number _today_ is 6.5 PB, not including multiple tape archived copies. We grow at a rate of more than 15 TB/day.
Hi @leslie – I’m wondering if you wouldn’t mind providing an updated estimate of the LOC’s size for us? You estimate 6.5 PB in March of 2013 – what about now some 10 months later?
Rob – today I would say it’s approximately 7 PB.
“…it is estimated that the entire collection of the Library of Congress including photos, sound recordings and movies might take 3,000 TB of storage. Assuming $100 each for 2 TB hard drives, the entire book collection of the Library of Congress could be stored on about $1500 worth of hard drives at today’s prices.” LINK
There’s no accounting for bad math. That would be $150,000 in hard drives at 2012’s pricing. Without redundancy.
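The correction above is easy to verify. The sketch below uses only the numbers in the quoted passage (3,000 TB, 2 TB drives, $100 per drive); the function name is hypothetical, and it ignores redundancy, enclosures, and everything else a real storage system would need.

```python
import math


def drive_cost(collection_tb, drive_tb=2, price_per_drive=100):
    """Return (number of drives, total dollars) to hold collection_tb."""
    drives = math.ceil(collection_tb / drive_tb)
    return drives, drives * price_per_drive


# The full 3,000 TB collection from the quote.
drives, dollars = drive_cost(3000)
print(f"{drives:,} drives costing ${dollars:,}")

# What the quoted $1,500 would actually buy at the same prices.
affordable_drives = 1500 // 100
print(f"$1,500 buys {affordable_drives} drives, or {affordable_drives * 2} TB")
```

At those prices the full 3,000 TB needs 1,500 drives, or $150,000. The quoted $1,500 covers 15 drives and 30 TB, which is closer to the old book-only estimates than to the whole collection.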
@Leslie Hello, I’m writing an article on LOC and wondering about a size update. You said 7 PB in January 2014, I imagine that would put it at at least 9 PB today?