On October 20 I had the extreme pleasure of being one of the plenary speakers at the 2011 Best Practices Exchange. I rarely have the opportunity to get an hour all to myself to speak about pretty much whatever I want to talk about.
On one hand, I wanted the opportunity to extend the topics that Martha Anderson and I addressed in our recent blog posts (here and here) on data in stewardship organizations. On the other hand, I wanted the chance to share the news about the upcoming launch of our collection data visualization service based on Recollection, called ViewShare.
So I did both, in a talk titled “From Records to Data: It’s Not Just About Collections Any More.” This is what I covered.
- In fifteen years of building digital collections, we have learned two things: researchers do not use digital collections in the same way as they use analog collections, and we can never guess how our collections will be used.
- Stewardship organizations have, until recently, spoken of “collections” or “content” or “records” or even “files,” but not data. Data is not just generated by satellites, experiments, or surveys; publications and archival records also contain data.
- We also need to start thinking in terms of “Big Data.” The definition of Big Data is very fluid, as it is a moving target — what cannot be easily manipulated with common tools — and specific to the organization: what can be managed and stewarded by any one institution in its infrastructure. One researcher or organization’s concept of a large data set is small to another. Not too long ago, an organization would be surprised to need 10 terabytes of storage for a large digital collection. Now a collection can increase by 10 terabytes in a single week.
- More and more, researchers want to use collections as a whole and to mine and organize the collections in novel ways. They use algorithms to do so and new tools that create visual images that transform data into knowledge. Researchers want to mine information from millions of digitized National Digital Newspaper Program newspaper pages and OCR, so the files have been made openly available through a Web API. Researchers of web archives want access to all of the archived site files and to use scripts to query the full text for the information they want. They don’t want to read Web pages.
- The sheer volume of the electronic data cultural stewardship organizations need to keep is a challenge. The Twitter archive comprises tens of billions of individual tweets and grows by several million tweets every hour. LC is determining how best to manage, preserve and provide comprehensive access to this mass of data. Researchers have already requested uses of the collection that include the study of the geographic spread of the dissemination of news, the spread of epidemics and the transmission of new uses of language.
- We have already made the switch to a self-serve model of reference services. Our community used to expect researchers to come to us, ask us questions about our collections and use our digital collections in our environment. Now they want to find the materials they need and then work with them in their own work spaces using their own tools. We need to create mechanisms that make it easy for them to do so, to support real-time querying of billions of full-text items and the frequent downloading of collection datasets that may well be over 200 terabytes. We may also need to think about providing tools that support various forms of collection analysis in our own environments, such as visualization.
- We can’t be afraid of cloud computing. Given the volumes of data coming our way and mounting researcher demands for access to vast quantities of data, the cloud is the only feasible mechanism for storing and providing access to the materials that will come our way. We need to focus on developing authentication, preservation and other tools that enable us to keep records in the cloud.
So what are our institutions doing about preservation and access to our Big Collections and Big Data? They are supporting the use and re-use of their collections by exposing them as open access data, including the growing use of Linked Open Data APIs. They are collaborating through the efforts of the NDSA. And we are developing and using open source tools and sharing information about the use of those tools across the community.
I used the soon-to-be-launched ViewShare as a case study for an open source tool that can benefit small- or medium-sized stewardship organizations.
It’s a tool that allows organizations to upload metadata with links to digital media, make the collections available as linked data and create visualizations. Perhaps most importantly, it allows researchers to grab the data and/or make their own visualizations for their own use, within ViewShare or embedded elsewhere.
What Big Data challenges are your institutions facing, and what are you considering for solutions?