Data-Intensive Librarians for Data-Intensive Research

The following is a guest post by Chelcie Rowell, 2012 Junior Fellow.

A packed house heard Tony Hey and Clifford Lynch present on The Fourth Paradigm: Data-Intensive Research, Digital Scholarship and Implications for Libraries at the 2012 ALA Annual Conference.

Stellar Shrapnel Seen in Aftermath of Explosion - A supernova remnant located in the Large Magellanic Cloud, by Smithsonian Institution, on Flickr

Jim Gray coined The Fourth Paradigm in 2007 to reflect a movement toward data-intensive science. Adapting to this change would, Gray noted, require an infrastructure to support the dissemination of both published work and underlying research data. The return on investment for building that infrastructure would be a faster transformation of raw data into recombined data and then into knowledge.

In outlining the current research landscape, Hey and Lynch underscored how right Gray was.

Hey led the audience on a whirlwind tour of how scientific research is practiced in the Fourth Paradigm. He showcased several projects that manage data from capture to curation to analysis and long-term preservation. One example he mentioned was the Dataverse Network Project, which is working to preserve diverse scholarly outputs, from published work to data, images and software.

Lynch reflected on the changing nature of the scientific record and the different collaborative structures that will be needed to define, generate and preserve that record. He noted that we tend to think of the scholarly record in terms of published works. In light of data-intensive science, Lynch said, the definition must be expanded to include the datasets that underlie results and the software required to render data.

When Gray coined the term Fourth Paradigm, he was thinking primarily of scientific research, but the speakers suggested that it encompasses the humanities and social sciences, too. Data supporting all forms of inquiry are collected for the benefit of countless researchers and research questions in a number of different fields and sub-fields. Yet just as we’re expanding the definition of the scholarly record to include evidence as well as argument, the challenges of preserving these evidence bases are multiplying.

Hey and Lynch observed various kinds of preservation challenges. Some challenges are technical. Think of digital objects that wrap executable models and raw data in a document, or news websites whose databases call for utterly different strategies than stacking up and microfilming newspapers. Other challenges are organizational. For example, while the federal records management mandate is helping to capture federal government data, efforts to preserve state and local government data are strapped for resources.

Record Shelves, by FourthFloor, on Flickr

What role should librarians and data curators play? In response to this audience question, the panelists argued that it’s possible to untwine the role of libraries from the role of librarians. Lynch said that as institutions, libraries should make the case for funding and advocate for public policy provisions (from intellectual property to privacy and permissions). Librarians should, he said, engage these questions within their institutions as well as advocate externally.

Hey added that although we talk about big data, libraries deal with the “long tail of science,” where there are large numbers of small datasets. Often university libraries are the point of last resort for curating and preserving “long tail” data. Whether inside or outside of the library, the role of the librarian is to work with different communities to identify where data is valuable and to build a culture that facilitates its remix and reuse.

2 Comments

  1. James B
    July 5, 2012 at 2:11 am

    When CERN needed to store and disseminate all the data from the LHC project, they made agreements with universities worldwide to create a direct intranet, then hold the data, and then use the computing power of all of those combined machines to make the necessary calculations. This was a huge IT Support project to deliver, but that is one way to do it.

  2. Daina B
    July 31, 2012 at 2:22 pm

    As an MLIS student focusing in Data Science, it’s very exciting to see these conversations taking place. Wonderful.
