Scientific data management has some buzz going. As a longtime data archivist and advocate, I find this a dream come true. I’ve pinched myself a couple of times to make sure it’s really happening.
For decades, scientific practice focused attention on the published results of research. A substantial infrastructure supports this literature, including an article citation system to share learning and credit authors. The research data underlying the articles, however, have been treated as poor relations. Data sets often were haphazardly managed, and preserving them was a challenging and uncertain prospect. The principal issue was one of respect: articles had it and data didn’t.
I recall when, working for another agency in the previous century, I pitched a modest plan to a high-ranking government official, a scientist by training, to collect and preserve data for use in public health studies. In this case, I was confident of holding a winning hand for data preservation. Surely this was a proposition brimming with self-evident beneficence, yes?
Guess again. “I don’t care about that old data,” was the brusque reply.
But the elixir of time has worked its magic, and there is now a coalescing sense among researchers and policy makers that ongoing access to data is needed to replicate scientific results and spur new learning through secondary use.
There has been a steady stream of big, serious reports from big, serious organizations about the importance of data to drive economic and technological innovation. The White House National Science and Technology Council, for example, issued Harnessing the Power of Digital Data for Science and Society in 2009. The report presented a vision for preserving federally funded research data and making it broadly accessible. Earlier this year, McKinsey & Company issued Big data: The next frontier for innovation, competition, and productivity. The report placed high importance on enabling data access, noting that finding and using multiple data sources is increasingly critical in all research fields.
Perhaps most significantly, the National Science Foundation recently declared that all funding proposals must include a data management plan to encourage dissemination and sharing of data sets that underlie formal research results. Given the vast range of scientific work that NSF supports, this edict will have far-reaching impact.
The key to making, and keeping, data accessible is an infrastructure at least as effective as the one in place for publications. For some libraries and other collecting institutions this is a great opportunity to extend their expertise in collection building and bibliographic control to a critical new type of material. Several institutions have, in fact, moved briskly to do just that. The University of California Curation Center at the California Digital Library now provides extensive assistance to researchers in connection with data management, as do the University of Minnesota Libraries and the Inter-university Consortium for Political and Social Research at the University of Michigan. The Managing Research Data project in the UK is also doing fine work to extend practices to cite, link and preserve data sets.
It’s true that, despite the shift in favor of data management, most libraries are still working to understand how they can best play a role. And there is a multiplicity of roles to choose from: become a data repository; provide assistance to data creators; help researchers find existing data sets. One possible role is especially intriguing: data citation. Bibliographic control is, after all, something that libraries have successfully extended into the digital landscape. Researchers expect to use library resources to find and use published articles. Why not build on this to help people find and use data? The key to doing this is developing a methodology to uniquely identify individual data sets and point users to where the data can be found.
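What might that methodology look like in practice? Here is a minimal sketch, assuming Python with the requests library and the standard DOI content-negotiation service offered through the doi.org resolver; the data set DOI shown is hypothetical, standing in for an identifier that a registry such as DataCite would assign at deposit. Given one persistent identifier, a user (or a library catalog) can retrieve machine-readable citation metadata, or simply follow the same identifier to the landing page where the data live.

```python
# Minimal sketch: resolving a data set's persistent identifier (a DOI)
# to machine-readable citation metadata via standard DOI content
# negotiation. The DOI below is hypothetical; a registry such as
# DataCite would assign the real one when the data set is deposited.

import requests

DOI = "10.1234/example-dataset"  # hypothetical dataset identifier
RESOLVER_URL = "https://doi.org/" + DOI

# Asking for CSL JSON returns citation metadata rather than the
# landing page; a plain browser request to the same URL would
# instead redirect users to where the data can be found.
response = requests.get(
    RESOLVER_URL,
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
)
response.raise_for_status()
metadata = response.json()

print("Title:    ", metadata.get("title"))
print("Publisher:", metadata.get("publisher"))
print("Issued:   ", metadata.get("issued"))
```

The same resolver can also return a formatted, human-readable citation (for example, with an Accept header of text/x-bibliography), which is exactly the kind of service a library could stand behind: one identifier, resolvable to both the citation and the data.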
On that score, I had the pleasure of attending the 2011 Annual Summer Meeting of DataCite in Berkeley, CA. The gathering drew about 200 attendees from around the world to discuss advancing the mission of the DataCite organization: creating a global citation framework for data. One of the most popular tweets from the meeting proclaimed that “you can have articles and data at one place and that place is the library.” I’ll recap the meeting in a future post, but that quote serves quite well to frame a library strategy in connection with data sets.
Corrected web links, 9/15/2011
Comments (2)
An interesting issue that the Library of Congress faces is its role of serving as an “office of public record”: striking the balance between ensuring that records are made available and accessible to the public (hence, the digitization initiative) while also serving as the home of the US Copyright Office and ensuring that protected works remain… well, protected. That being said, it’s great to see the Library engaging in an effort to broaden the “public record” function of its existence to scientific data. I suspect this is an area that is often neglected by the non-science community, and it is great to see it being tied into the Library here. I would just want to make sure that opening up data preservation efforts would not infringe copyright in the scientific studies that are published. Copyright is an extremely tricky issue, and it is easy for people to infringe without knowing it. For example, “borrowing” from a source without citing it, simply because it was listed as a “source” in another article. Any information taken from somewhere else should be cited, and this practice has been increasingly abandoned with the if-it’s-on-the-internet-it’s-mine mentality.
Great comment, thanks. I’m optimistic that a uniform data citation regime will work in concert with copyright law, in the US and in other nations. But there are some uncharted waters to navigate ahead!