I want to preface this post by reiterating one of our general disclaimers up front, to wit: “This blog does not represent official Library of Congress communications.” I do so because this post will edge slightly closer to “editorializing” than most of my previous posts.
Working in the Office of Communications as I do, I’m aware of the lion’s share of news coverage about and references to the Library, whether via Lexis-Nexis, Google News, or any number of other monitoring services to which we have access. This is a rough approximation — and perhaps a bit hyperbolic — but I would guess that somewhere between a quarter and a third of the general references I see to the Library are along these lines: “Our servers can store the equivalent of the Library of Congress,” or “our network is fast enough to download the entire Library of Congress in a millisecond.” You get the drift.
So that raises the question: just how “big” is the Library of Congress in terms of content, especially if one tries to equate it to the digital realm?
I won’t go into any of the specific claims that are being made, but they’re easy to find out there in the ether, and suffice it to say that the Library would stand behind very few, if any, of them. There are certain things we can quantify, but far more that are purely speculative.
For instance, we can as of this moment say that the approximate amount of our collections that are digitized and freely and publicly available on the Internet is about 74 terabytes. We can also say that we have about 15.3 million digital items online.
Some may be tempted to extrapolate that those digital items represent a precise percentage of the nearly 142 million items in the Library’s physical collections, and then estimate some kind of digital equivalent for the whole. But comparing digital and physical items is apples and oranges, at best. A simple example of the fallacy: a single photograph online may depict several physical objects, so one digital item can stand in for many physical ones.
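To make the fallacy concrete, here is the sort of back-of-the-envelope arithmetic one might be tempted to run. This is a sketch only, using the figures above; the derived numbers are exactly the kind the Library would not stand behind:

```python
# A naive -- and fallacious -- extrapolation from digital to physical holdings.
# The input figures come from this post; everything derived from them rests on
# the false premise that one digital item equals one physical item.

digitized_bytes = 74e12    # ~74 terabytes digitized and freely available online
digital_items = 15.3e6     # ~15.3 million digital items online
physical_items = 142e6     # ~142 million items in the physical collections

# Step 1: treat digital items as a fraction of physical items (apples to oranges!)
naive_fraction = digital_items / physical_items        # ~0.108, or ~11%

# Step 2: scale the digitized bytes up by that fraction
naive_total = digitized_bytes / naive_fraction         # ~687 terabytes

print(f"Naive 'percent digitized': {naive_fraction:.1%}")
print(f"Naive 'total Library size': {naive_total / 1e12:.0f} TB")
```

The arithmetic is trivial; the premise is not, which is why the result is meaningless.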
Another likely source of digital estimates is the number of books and printed items in our collections, currently about 32 million. One could attempt to establish the average length of those items (pages, words, characters, etc.) and extrapolate the digital equivalent of those 32 million physical items.
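As a sketch of how that extrapolation might look, assuming (purely for illustration) an average of 300 pages per book and 2,000 characters of plain text per page:

```python
# Hypothetical text-only estimate for the book and printed-item collections.
# Every average below is a guess chosen for illustration, not a Library figure.

books = 32e6              # ~32 million books and printed items (from the post)
pages_per_book = 300      # assumed average length
chars_per_page = 2_000    # assumed average text density
bytes_per_char = 1        # plain ASCII text, uncompressed

bytes_per_book = pages_per_book * chars_per_page * bytes_per_char   # ~600 KB
total_bytes = books * bytes_per_book                                # ~1.9e13

print(f"Text-only estimate: {total_bytes / 1e12:.0f} TB")           # ~19 TB
```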
Assuming one could do that with any degree of accuracy — and that’s a big assumption — it overlooks the fact that those 32 million books represent only about one-quarter of the entire physical collections. The rest are in the form of manuscripts, prints, photographs, maps, globes, moving images, sound recordings, sheet music, oral histories, etc. So how does that other three-quarters of the Library equate digitally? Can one automatically assume the digital resolution at which all maps or photographs, for instance, would be scanned? Those are major wildcards indeed.
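To see how much the resolution wildcard alone matters, consider one hypothetical map scanned as an uncompressed 24-bit image; the dimensions and resolutions below are illustrative assumptions, not Library scanning standards:

```python
# How scanning resolution swings the file size of one hypothetical map.
# Dimensions, bit depth, and dpi values are all illustrative assumptions.

width_in, height_in = 36, 24    # a 36 x 24 inch map
bytes_per_pixel = 3             # 24-bit color, uncompressed

def scan_size_bytes(dpi: int) -> int:
    """Uncompressed size of the scan at a given resolution."""
    return (width_in * dpi) * (height_in * dpi) * bytes_per_pixel

for dpi in (150, 300, 600):
    print(f"{dpi} dpi: {scan_size_bytes(dpi) / 1e6:,.0f} MB")
# 150 dpi:  ~58 MB
# 300 dpi: ~233 MB
# 600 dpi: ~933 MB
```

Doubling the resolution quadruples the bytes, so the choice of scanning standard alone can swing a collection-wide estimate by an order of magnitude.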
And then there are our motion pictures, videos and sound recordings alone: around 6 million items stored at our new Packard Campus for Audio-Visual Conservation in Culpeper, Va. What is their digital equivalent? Most people who record television programs to a computer or DVR know how quickly a hard drive with hundreds of gigabytes, or even a terabyte or more, can fill up.
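Some rough bitrate arithmetic shows why; the bitrates below are typical published figures, not Library digitization specifications:

```python
# How quickly a one-terabyte drive fills at different video bitrates.
# Bitrates are typical published figures, not Library digitization specs.

drive_bits = 1e12 * 8    # one terabyte, expressed in bits

bitrates_mbps = {
    "DVR-quality MPEG-2": 6,         # typical consumer recording
    "Uncompressed SD video": 270,    # standard serial-digital-interface rate
}

for label, mbps in bitrates_mbps.items():
    hours = drive_bits / (mbps * 1e6) / 3600
    print(f"{label}: ~{hours:,.0f} hours per terabyte")
# DVR-quality MPEG-2:   ~370 hours per terabyte
# Uncompressed SD video:  ~8 hours per terabyte
```

At archival quality, it is easy to see how 6 million audio-visual items could run well into the petabytes.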
One more thing we can quantify or at least estimate: The folks at the Packard Campus say that when their systems are fully online, they expect to be able to digitize between 3 and 5 petabytes of content per year. (That is to say, 3,000 to 5,000 terabytes, for those who are playing at home. Put another way, a single petabyte stored on CD-ROMs would create a stack of discs more than a mile high.) And even at that rate, it would still take decades to digitize the existing content.
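For those inclined to check the stack-of-discs image, the arithmetic holds up, assuming a standard 700 MB CD-ROM that is 1.2 mm thick:

```python
# Sanity-checking the "stack of CDs more than a mile high" claim.
# Assumes standard CD-ROM specs: 700 MB capacity, 1.2 mm thickness.

petabyte_bytes = 1e15
cd_bytes = 700e6
cd_thickness_mm = 1.2

discs = petabyte_bytes / cd_bytes                    # ~1.43 million discs
stack_miles = discs * cd_thickness_mm / 1e3 / 1609.34

print(f"Discs needed: {discs:,.0f}")
print(f"Stack height: {stack_miles:.2f} miles")      # ~1.07 miles
```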
So at this point, we’re talking about potentially mind-boggling amounts of data; we’re into the territory of angels and heads of pins. And while it is a bit of a tangent, it points to the continued importance of libraries as places, and of librarians, at a time when too many people assume that “everything is online.”
While it is certainly flattering that the Library of Congress is used as a benchmark against which others measure their content or data capacity, we would do well to take these claims with at least a shaker of salt. We are far “bigger” than many of those claims might suggest.