All that Big Data Is Not Going to Manage Itself: Part One

On February 26, 2003, the National Institutes of Health released the “Final NIH Statement on Sharing Research Data.” As you’ll be reminded when you visit that link, 2003 was eons ago in “internet time.” Yet the vision NIH had for the expanded sharing of research data couldn’t have been more prescient.

“Big Data visualisation” by user stefanobe on Flickr (https://www.flickr.com/photos/stefanobe/11711725656)

As the Open Government Data site notes, government data is a tremendous resource that can have a positive impact on democracy, civic participation, new products and services, innovation and governmental efficiency.

Since 2003 we’ve seen the National Science Foundation release its requirements for Data Management Plans (DMPs) and the White House address records management, open government data and “big data.” There are now data management and sharing requirements from NASA, the Department of Energy, the National Endowment for the Humanities and many others.

In this two-part series on government data management we’ll take a look back at some of the guidance that is driving data management practices across the federal government. In the second part we’ll look at the tools and services that have developed to meet the needs of this expanding data management infrastructure.

It’s 2014 and we’re still struggling to ensure that the outputs of government-funded research are secure and made accessible as building blocks for new knowledge, but it’s not for lack of trying: federal agencies such as the NIH and the NSF recognized the need to preserve data and keep it accessible, and tied requirements for doing so to their grant funding.

The 2003 NIH Statement referenced above noted that “Starting with the October 1, 2003 receipt date, investigators submitting an NIH application seeking $500,000 or more in direct costs in any single year are expected to include a plan for data sharing or state why data sharing is not possible.”
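The rule quoted above is simple enough to express as a quick check. What follows is only a minimal sketch of our reading of that one sentence, not an NIH tool: the function name and its inputs are hypothetical, and we assume “$500,000 or more” is tested against each budget year separately rather than against the project total.

```python
from datetime import date

# Values taken from the sentence quoted from the 2003 NIH Statement.
POLICY_START = date(2003, 10, 1)
DIRECT_COST_THRESHOLD = 500_000  # US dollars of direct costs in a single year


def needs_data_sharing_plan(receipt_date, direct_costs_by_year):
    """Hypothetical helper: True if an application is expected to include a
    data sharing plan (or a statement of why sharing is not possible).

    Assumption: the threshold applies to any single budget year, not to the
    sum across years.
    """
    if receipt_date < POLICY_START:
        return False
    return any(cost >= DIRECT_COST_THRESHOLD for cost in direct_costs_by_year)


# One year reaches the threshold, so a plan is expected.
print(needs_data_sharing_plan(date(2004, 2, 1), [300_000, 520_000, 280_000]))
# The total is $900,000, but no single year reaches $500,000.
print(needs_data_sharing_plan(date(2004, 2, 1), [450_000, 450_000]))
```

Under this reading, a large multi-year award can fall outside the requirement as long as every individual year stays under the threshold.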

They followed that up with Implementation Guidance a little later that year. The guidance was very high-level, but did suggest some possible data sharing methods, such as:

  • Under the auspices of the PI [Principal Investigator].
  • Data archive.
  • Data enclave.
  • Mixed mode sharing.

They didn’t go into great detail on what constitutes a “data archive” or “enclave” (a “data enclave” looks like a “data archive” with restricted access), but they did include this helpful bit of information on what “under the auspices of the PI” might entail:

 “Investigators sharing under their own auspices may simply mail a CD with the data to the requestor, or post the data on their institutional or personal Website.”

We’re now well aware of the challenges of over-relying on PIs to archive data and keep it accessible, concerns that are perfectly summed up in the “Data Sharing and Management Snafu in 3 Short Acts” video.

“Data Dump” by user swanksalot on Flickr (https://www.flickr.com/photos/swanksalot/2704017177)

The NIH has continued to update its policies, now gathered together on the NIH Sharing Policies and Related Guidance on NIH-Funded Research Resources page. It’s important to note that the NIH has different requirements for “data” and for “publications.” Under section 8.2.1 “Rights in Data (Publication and Copyrighting)” in the 10/2013 version of the NIH Grants Policy Statement, “in all cases, NIH must be given a royalty-free, nonexclusive, and irrevocable license for the Federal government to reproduce, publish, or otherwise use the material and to authorize others to do so for Federal purposes.”

A little further on, in section 8.2.2 “NIH Public Access Policy,” they establish the requirements for access to the published results of NIH-funded research. Under this part of their guidelines, NIH-funded investigators are required to submit an electronic version of any final, peer-reviewed manuscript to the PubMed Central archive, to be made available no later than 12 months after the official date of publication.

On January 18, 2011, the National Science Foundation began requiring that Data Management Plans be submitted in conjunction with funding proposals. The most recent NSF “Proposal and Award Policies and Procedures Guide,” effective 2/24/14, describes, at a high level, the categories of information that might be included in the required NSF data management plans:

  • The types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project.
  • The standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies).
  • Policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements.
  • Policies and provisions for re-use, re-distribution, and the production of derivatives.
  • Plans for archiving data, samples, and other research products, and for preservation of access to them.

The NSF has a “Data Management and Sharing Frequently Asked Questions” list, but it’s less prescriptive than it might be. For example, for the question “Am I required to deposit my data in a public database?” the NSF provides this response:

“What constitutes reasonable data management and access will be determined by the community of interest through the process of peer review and program management. In many cases, these standards already exist, but are likely to evolve as new technologies and resources become available.”

The development of data management infrastructures has accelerated over the past 5 years, catalyzed by wide-ranging guidance from the White House, starting with the December 2009 Open Government Directive, which directs executive departments and agencies to take specific actions to implement the principles of transparency, participation and collaboration in dealing with the data they create.

This was followed by the November 2011 Presidential Memorandum on “Managing Government Records” and a series of government directives on “big data,” including the “Big Data: Seizing Opportunities, Preserving Values” report released this month. That report, however, includes these problematic sentences: “Volumes of data that were once un-thinkably expensive to preserve are now easy and affordable to store on a chip the size of a grain of rice. As a consequence, data, once created, is in many cases effectively permanent.”

These requirements are, slowly but surely, leading to a suite of tools and services designed to help researchers prepare plans, while also leading to the creation of repositories for the long-term storage of the resulting research output (or both, as in the case of Penn State University Library’s “Data Repository Tools and Services”).

In part two of this series we’ll look at some of the data management support tools, but feel free to point them out now in the comments.

Update: This is part one of a two-part series. Part two was published on May 28, 2014 and is available here.
