Data is the New Black

At our recent Preservation Storage Meeting, the word “data” was frequently mentioned.  This was of some note to me, as cultural heritage organizations have, until recently, spoken of “collections” and “content” or even “files,” but not data.  This is of course not the case at universities, where social science and observational datasets are very much a part of the custodial landscape.  But most libraries, archives, and museums have not considered their collections to be data.

Martha Anderson recently blogged on this topic very eloquently.

I want to say this out loud:  we all have data, from metadata to full-text collections to more formal datasets.  We used to talk non-stop about metadata. Now we talk about data. Data is the new black.

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

The Storage Meeting and a recent meeting with a software vendor have had me thinking specifically about what constitutes “Big Data.”  The definition of Big Data is very fluid.  It is a moving target, tied to what can be easily manipulated with common tools, and it is specific to the organization: what can be managed and stewarded by any one institution in its infrastructure.  What one researcher or organization considers a large data set is small to another.

In one conversation not too long ago, an organization was surprised to find that they would need 10 TB of storage for a large digital collection.  I now know of collections that add that much in a single week.

The Twitter archive has tens of billions of tweets in it.

The Chronicling America collection has over 4 million page images with OCR.

Web archives, such as the one at the Library of Congress, may comprise billions of files.

And researchers may want to interact with a collection of artifacts, or they may want to work with a data corpus.  Some may want to search for stories in historic newspapers.  Some may want to mine newspaper OCR for trends across time periods and geographic areas.  Some may want to see what a specific user tweeted.  Some may want to look at the spread of an event hashtag across the world in a day.

We still have collections.  But we also have Big Data, which requires us to rethink the infrastructure needed to support Big Data services.  Our community used to expect researchers to come to us, ask us questions about our collections, and use our digital collections in our environment.  We transitioned into a partially self-serve model when we moved online.  But can we support real-time querying of billions of full-text items?  Can we provide tools for collection analysis and visualization?  Can we support the frequent downloading by researchers of collections that may be over 200 TB each?  These are among the questions that all of our institutions are grappling with as we build large digital collections and discover new ways in which they can be used.
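
To give a sense of what downloading a 200 TB collection means in practice, here is a rough back-of-the-envelope sketch in Python. The link speeds and the 80% efficiency factor are my own illustrative assumptions, not measurements from any institution, and the function name is made up for this example.

```python
def transfer_days(size_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate how many days it takes to move size_tb terabytes over a network link.

    Assumptions (illustrative only): decimal terabytes (10**12 bytes), a link
    rated in gigabits per second, and a flat efficiency factor standing in for
    protocol overhead and contention.
    """
    bits = size_tb * 1e12 * 8                     # terabytes -> bits
    effective_bps = link_gbps * 1e9 * efficiency  # usable bits per second
    seconds = bits / effective_bps
    return seconds / 86_400                       # seconds per day


if __name__ == "__main__":
    # Hypothetical link speeds, chosen only to show the range.
    for gbps in (0.1, 1, 10):
        print(f"{gbps:>4} Gbps: {transfer_days(200, gbps):.1f} days")
```

Under these assumptions, a single 200 TB download ties up a 1 Gbps link for roughly 23 days, which is one reason frequent bulk downloads are an infrastructure question and not just a storage question.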

One Comment

  1. curtis
    October 18, 2011 at 9:02 pm

    My feeling is that private clouds, ones close to the original data, capable of mounting a snapshot or read-only instance of the data, could be a good direction to take in giving researchers access to digital collections.

    We just have to make it easy for them to spark up instances that can manipulate the data in a way that works for them.

    Unfortunately that is the hardest part I think…figuring out the tooling, the software that the researchers will use, because not every scientist/researcher wants to be a computer scientist. Supporting the tools will likely be harder than the private cloud portion, which isn’t easy though getting easier all the time. (See OpenStack.)
