The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Web Archiving Team.
If science reporters, IT industry pundits and digital storage and network infrastructure purveyors are to be believed, devices are being lab-tested even now that can store all of the data in the Library of Congress or transmit it over a network in mere moments. To this list of improbable claims, I’d like to add another: by the most conservative estimates, I transfer more than a Library of Congress’ worth of data to the Library of Congress every month.
Clearly, that doesn’t make any sense, but allow me to explain. You may have noticed that the “data stored by the Library of Congress” has become a popular, if unusual, unit of measurement for capacity (and the subject of a previous Library of Congress blog post, to boot). More cautious commentators instead employ the “data represented by the digitized print collections of the Library of Congress.” My non-exhaustive research (nonetheless corroborated by Wikipedia) suggests that in instances where a specific number is quoted, that number is most frequently 10 terabytes (and, in a curious bit of self-referentiality, the Library of Congress Web Archiving program is referenced in Wikipedia to help illustrate what a “Terabyte” is). Whence, then, 10 terabytes?
The earliest authoritative reference to the 10 terabytes number comes from an ambitious 2000 study by UC Berkeley iSchool professors Peter Lyman and Hal Varian, which attempted to measure how much information was produced in the world that year. In it, they note with little fanfare that 10 terabytes is the size of the Library of Congress print collections. They subsequently elaborate their assumptions in an appendix: the average book has 300 pages, is scanned as a 600 DPI TIFF, and is then compressed, resulting in an estimated size of 8 megabytes per book. At the time of the study’s publication, they supposed that the Library of Congress print collections consisted of 26 million books. Even taking these assumptions for granted, the math yields a number much closer to 200 terabytes. Sure enough, the authors note parenthetically elsewhere in the study that the size of the Library of Congress print collections is 208 terabytes; no explanation is offered for the discrepancy with the other quoted number.
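To make the discrepancy concrete, here is a minimal back-of-envelope sketch using the study’s own assumptions (the decimal-versus-binary unit comparison is my own addition, not something the study discusses):

```python
# Lyman/Varian assumptions: 26 million books, ~8 MB per compressed 600 DPI scan.
books = 26_000_000
mb_per_book = 8

total_mb = books * mb_per_book           # 208,000,000 MB
print(total_mb / 1_000 / 1_000)          # 208.0 TB with decimal units
print(total_mb / 1_024 / 1_024)          # ~198.4 TB with binary units -- nowhere near 10 TB
```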
For whatever reason, though, it’s the 10 terabyte figure that took hold in the public’s imagination. To be sure, 10 terabytes is an impressive amount of data, but it’s far less impressive than the amount of data that the Library of Congress actually contains (and, I suspect, even just counting the print collections). While I’m neither clever nor naïve enough to propose what a more realistic number might be, returning to my original provocation, I did wish to further discuss a digital collection I know quite well: the Library of Congress Web Archives.
As explained previously in The Signal, we currently contract with the Internet Archive to perform our large-scale web crawling. One ancillary task that arises from this arrangement is that the generated web archive data (roughly 5 terabytes per month) must be transferred from the West Coast to the Library of Congress. This turns out to be non-trivial; it may take the better part of a month with near-constant transfers over an Internet2 connection to move 10 terabytes of data. For all the optimism about transmitting “Libraries of Congress” of data over networks, putting data on physical storage media and then shipping that media around remains a surprisingly competitive alternative. Case in point: for all of the ethereality and technological sophistication implied by so-called cloud services, at least one of the major providers lets users upload their data in the comparatively mundane manner of mailing a hard drive.
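For a rough sense of why the transfer is non-trivial, here is a back-of-envelope sketch; the sustained throughput below is a purely hypothetical figure for illustration, not our actual Internet2 provisioning, and real-world throughput varies with tuning and competing traffic:

```python
# Time to move 10 TB at an assumed sustained rate of 100 Mbps.
data_bits = 10 * 10**12 * 8              # 10 TB expressed in bits
sustained_bps = 100 * 10**6              # hypothetical 100 Mbps sustained throughput
days = data_bits / sustained_bps / 86_400
print(days)                              # ~9.3 days of continuous transfer
```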
Of course, transfer is just the initial stage in our management of the web archive data; the infrastructure demands compound when you consider the requirements for redundant storage on tape and/or spinning disk, internal network bandwidth, and processor cycles for copying, indexing, validation, and so forth. In summary, I doubt that we have spare capacity to store and process many more “Libraries of Congress” of data than we do currently (though perhaps that’s self-evident).
Suffice it to say, I look forward to a day when IT hardware manufacturers can legitimately claim to handle magnitudes of data commensurate with what is actually stored within the Library of Congress (whatever that amount may be). In the meantime, however, I suppose I’d settle for the popular adoption of fractional “Library of Congress” units of capacity (e.g., “.000001% of the data stored at the Library of Congress”) – likely no more or less realistic than what the actual number might be, but at least it’d more appropriately aggrandize just how much data the Library of Congress has.
Comments (11)
I’ve always had a problem with the 10 TB number. It’s one of those numbers you see tossed around too frequently, and with too little sourcing, to take seriously (kind of like the “we only use 10% of our brains” claim, which I’m inclined to actually believe in the case of certain reality TV stars). The Library of Congress is, in effect, a mystical unknown to most Americans and represents a romantic ideal of preservation, and as such it makes an evocative measure of data.
The text alone, without formatting, of the 2002 edition of the Encyclopedia Britannica adds up to about 264 GB, so it’s preposterous to assume the LoC amounts to a mere 40 times that number.
Even dealing with raw text, the encoding can matter quite a bit: Unicode UTF-8 and UTF-16 can increase the size of a document considerably. For most English documents and other languages that use a Latin script, UTF-16 doubles the file size relative to ASCII, while UTF-8 only grows when the source document uses characters not present in ASCII. Admittedly, Unicode is only necessary when such characters appear, but it certainly is worth considering.
I’m using this number as a trivia note in a project where, if I’m wrong, it won’t negatively affect anyone in any important way, and I’m going to take the liberty of citing the 208 TB number as a “very conservative estimate circa 2000,” which I think is a lot more reflective of reality than the 10 TB number.
Thanks for the post!
The area is thrilling to me. Thanks a lot!
Characters or images… 26 million books of 500 pages of 2,000 characters per page, at 1 byte per character, is 26 TB.
If we are going to be so technical about everything, why not use the real number of TB based on 8 MB per book and 26 million books, which is closer to 198 TB:
26,000,000 books × 8 MB / 1,024 / 1,024 = 198.36 TB, not 208 TB; the 208 figure comes from lazy division (1,000 instead of 1,024).
I like the LOC, but I also view it as tentacle of state power with a stranglehold on information. So I’m interested in its size and the corresponding size of the internet, the same way as I’d be interested in the size of two fighters when placing bets before a match. Too bad the admins at the LOC have apparently decided it’s too difficult to weigh their fighter.
Re comment #1: 264 MB, not gigabytes. The 2010 version has 32,640 pages at about 8,000 characters per page, which works out to hundreds of millions of characters, not billions.
Re comment #3: The average book at 1 MB seems reasonable. War and Peace is about 4 MB; Fahrenheit 451 is about 0.35 MB.
Re comment #4: Using mythical books that are twice the size of War and Peace as “average” leads to an inflated number.
These numbers assume a full 8-bit ASCII character (one of 256 values) for every letter, number, and space.
Basic lossless compression would cut these file sizes by about a factor of 8, yielding a bit over 3 TB for the LoC printed text collection.
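A quick sketch of that chain of assumptions (the 26 million books, 500 pages, 2,000 characters per page, and 8:1 compression factor are all figures from the comments above, not measured values):

```python
# Plain-text estimate of the LoC print collections, per the comment thread.
books = 26_000_000
pages_per_book = 500
chars_per_page = 2_000            # 1 byte per character (ASCII)
compression_ratio = 8             # assumed lossless compression factor

raw_bytes = books * pages_per_book * chars_per_page
print(raw_bytes / 10**12)                       # 26.0 TB uncompressed
print(raw_bytes / compression_ratio / 10**12)   # 3.25 TB compressed
```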
Thank you, Jim Michaels, for one of the few sensible calculations… admittedly one that ignores the diagrams and pictures in those books. Nonetheless, if we are simply looking at the information stored as language, then a 3 TB number seems appropriate.
What is really quite interesting is how tech capacity has outstripped the generation of real data (not things like static CCTV streams or web records for each purchase of bananas), real knowledge that it would be criminal to lose. Case in point is Toshiba’s recent announcement of a 1.33 Tbit flash memory that can be stacked 16 high in a single semiconductor package (one IC), for a capacity of 2.66 terabytes = roughly 1 LOC 🙂
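As a sanity check on that capacity figure (the 1.33 Tbit per die and the 16-die stack are the numbers quoted in the comment above):

```python
# 16 stacked 1.33 Tbit flash dies, converted to terabytes.
tbit_per_die = 1.33
dies = 16
print(tbit_per_die * dies / 8)   # 2.66 TB per package
```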
Possibly the ‘discrepancy’ is the difference in size between:
a) Just the text of a book, at 1 byte per letter: 26 million × 1 MB per book = 26 TB, compressed ~10 TB.
b) Saving a scan of each page: ‘600 DPI TIFF’ might imply a picture of each page, where you would see every crease and font change: 26 million × 500 pages × 50 KB a page as TIFF = 650 TB, compressed ~200 TB.
(Images compress less well than text; a quick sketch of both estimates follows below.)
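Here is a sketch of those two competing estimates side by side; the per-book text size, page count, per-page TIFF size, and compressed totals are all the commenter’s assumptions rather than measured figures:

```python
# (a) Text-only estimate vs. (b) page-scan estimate, per the comment above.
books = 26_000_000

# (a) ~1 MB of plain text per book
text_tb = books * 1_000_000 / 10**12     # 26 TB raw; the comment compresses this to ~10 TB

# (b) 500 pages per book, ~50 KB per 600 DPI TIFF page
scan_tb = books * 500 * 50_000 / 10**12  # 650 TB raw; the comment compresses this to ~200 TB

print(text_tb, scan_tb)                  # 26.0, 650.0
```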
Can we count the pictures in a book using the “a picture is worth a thousand words” equivalence (and using 5 letters per word plus a space as an approximation)?
PJJ points to the key issue. The “information” in a page of text is FAR less than the # of bytes needed to store a 600 dpi IMAGE of the text on that page.
Graphics are more complex, but can often be represented by less information than is required for a high-res scan — data plots, for example, may be even less information-dense than text occupying the same space, and line drawings only modestly greater.
Artistic works and high-resolution photographs may contain even more information than the human eye and brain are capable of perceiving, but is that level of detail “information?” For purposes of knowledge capture (e.g., for a book about Renaissance Art), a moderate-resolution scan of Da Vinci’s Mona Lisa is as good as the original. For artistic appreciation, of course, that’s another matter.
For books, my standard would be: “The minimum amount of information that would be required to faithfully reproduce the printed original at the original resolution,” whether captured by scan, OCR, or mathematical representation (e.g., vector graphics).
DARPA was responsible for creating the internet backbone. The nodes are located where large computers and their associated storage are, typically at large universities. As to the rationale why, here is an example: Chicago has a node, and it has also long been the unique repository of most if not all of our knowledge of atomic science learned from making the bombs, as well as the plans for making terrifying bombs. So if a member of the Axis of Evil wanted to hurt us badly, the simple task of landing a bomb on one specific place, such as the Fermi Labs at the Univ of Chicago, would let them do it.
So all the knowledge from the Fermi Labs was placed onto/into the internet backbone and now copies reside at each node. This basically describes why DARPA created the internet.
So I would think the internet archive, compressed, flattened, and zipped (here, for discussion purposes only, I will call it 16 LoCs, where 1 LoC is a Library of Congress data volume unit), should also be copied to each node. Understanding that this compressed volume may not fit at each node, the backup copy should be placed in the Cheyenne Mountain Complex.
The web crawlers should run there 24/7, incrementally growing the size of the archive volume. It would be the second copy of the data.
I’ve asked and read around, yet have seen or heard no discussion recognizing that most (and soon all) of the world’s history, work, and arts/humanities content will be available online as copies of the originals, and that our internet then becomes much like the biological network of which all living things on the planet Pandora (in the movie Avatar) were part. All life, and the information every life has possessed since the beginning, existed as a copy in Pandora; no reincarnation necessary, just a collective of organized information energy, changed only from corporeal form into the bio-energy of the planet.
Us one day, I would hope.