At our recent Preservation Storage Meeting, the word “data” came up frequently. This was of some note to me, as cultural heritage organizations have, until recently, spoken of “collections,” “content,” or even “files,” but not data. This is of course not the case at universities, where social science and observational datasets are very much a part of the custodial landscape. But most libraries, archives, and museums have not considered their collections to be data.
Martha Anderson recently blogged on this topic very eloquently.
I want to say this out loud: we all have data, from metadata to full-text collections to more formal datasets. We used to talk non-stop about metadata. Now we talk about data. Data is the new black.
The Storage Meeting and a recent meeting with a software vendor have had me thinking specifically about what constitutes “Big Data.” The definition is fluid in two ways. It is a moving target, shifting with what can be easily manipulated using common tools. And it is specific to the organization, depending on what any one institution can manage and steward within its infrastructure. What one researcher or organization considers a large dataset is small to another.
In one conversation not too long ago, an organization was surprised to find that it would need 10 TB of storage for a large digital collection. I now know of collections that grow by that much in a single week.
The Twitter archive holds tens of billions of tweets.
The Chronicling America collection has over 4 million page images with OCR.
Web archives, such as the one at the Library of Congress, may comprise billions of files.
And researchers may want to interact with a collection of artifacts, or they may want to work with a data corpus. Some may want to search for stories in historic newspapers. Some may want to mine newspaper OCR for trends across time periods and geographic areas. Some may want to see what a specific user tweeted. Some may want to look at the spread of an event hashtag across the world in a day.
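The trend-mining use case above can be sketched in a few lines. This is a hypothetical illustration only (the record format, sample text, and function name are invented for the example, not any institution's actual tooling): given OCR text tagged by year, count how often a term appears in each year.

```python
# Hypothetical sketch: tracking a term's frequency over time in OCR'd
# newspaper pages. Assumes records are (year, ocr_text) pairs; all names
# and sample data here are invented for illustration.
from collections import Counter

def term_trend(records, term):
    """Return a year -> occurrence-count mapping for `term`."""
    counts = Counter()
    term = term.lower()
    for year, text in records:
        # Count every occurrence of the term in that year's OCR text.
        counts[year] += text.lower().count(term)
    return dict(sorted(counts.items()))

# Toy sample standing in for millions of OCR'd pages.
records = [
    (1918, "influenza outbreak reported; influenza wards full"),
    (1919, "influenza cases decline"),
    (1920, "city recovers"),
]
print(term_trend(records, "influenza"))  # {1918: 2, 1919: 1, 1920: 0}
```

At real scale this per-page counting would run over a distributed corpus rather than an in-memory list, but the analytical shape of the question is the same.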
We still have collections. But what we also have is Big Data, which requires us to rethink the infrastructure needed to support services around it. Our community used to expect researchers to come to us, ask us questions about our collections, and use our digital collections in our environment. We transitioned into a partially self-serve model when we moved online. But can we support real-time querying of billions of full-text items? Can we provide tools for collection analysis and visualization? Can we support frequent downloading by researchers of collections that may exceed 200 TB each? These are among the questions that all of our institutions are grappling with as we build large digital collections and discover new ways in which they can be used.