The following is a guest post by Samantha DeWitt, National Digital Stewardship Resident at Tufts University.
Hello readers and a happy winter solstice from Medford, Massachusetts. It’s hard to believe I am already in my third month of the National Digital Stewardship Residency. There’s a chill in the air and the vivid fall colors that decorated the Tufts University campus have given way to a palette of browns and grays.
During my residency here, I have been exploring ways in which the university can get a better handle on its faculty-produced research data. The project has been illuminating. The first thing I discovered is that Tufts is not alone in its uncertainty concerning the status of institutionally connected research data. Many institutions are taking a hard look at how they approach research data management and some of the results are noteworthy. Harvard, for instance, has developed the Dataverse Network, an “open-source application for sharing, citing, analyzing and preserving research data.” Purdue has recently developed an online research repository (PURR), which provides researchers with a collaborative space during their projects and long-term data management assistance. (Published datasets remain online for a minimum of 10 years as a part of the Purdue Libraries’ collection.)
Initial data storage choices
At the beginning of a project, researchers can receive assistance with storage from the Tufts technology services department. Networked (cluster) storage is available for up to several terabytes. One drive is available for smaller amounts of collaborative storage and a second can be used for individual storage space (up to four GB). Lastly, cloud storage is available through Tufts Box. Of course, one can always elect to store data on a personal hard drive or select from an array of portable storage devices.
Unfortunately, hard drives may crash and portable devices may become lost or obsolete… As this is a blog about digital preservation, I realize I don’t need to elaborate on the problems that can befall neglected media. Further, the data remaining in networked storage will be erased when a researcher leaves. Even if this were not the case, attempts to retroactively find or make sense of the data would be prohibitively time-consuming.
Data must be properly managed to be accessible
Data sharing can be seen as fundamental to the most basic tenets of the scientific method: it permits reproducibility, encourages collaboration among researchers and promotes the re-use of valuable resources. These principles have been espoused by the National Institutes of Health (NIH) and the National Science Foundation (NSF) and they, along with provisions to increase financial transparency, have resulted in increasingly stringent data management mandates for grant-seekers.
These days, Washington isn’t the only player putting pressure on researchers to tend to their data. In 2011, the Bill and Melinda Gates Foundation began asking researchers to submit a data access plan (PDF) along with their grant proposal, stating that “a data access plan should at a minimum address the nature and scope of data and information to be disseminated, the timing of disclosure, the manner in which the data and information is stored and disseminated, and who will have access to the data and under what conditions.” The Alfred P. Sloan Foundation, too, asks applicants to describe how their data and code will be “shared, annotated, cited, and archived.”
But placing data in an appropriate subject-based repository does not ensure that those at Tufts know where it is. (Researchers themselves may not even know or remember.) This creates a unique opportunity for the university to consider ways to catalog this data. By better understanding its research output, Tufts could more easily:
- Comply with funders’ data access mandates.
- Visualize institutional research output.
- Encourage inter-departmental collaboration.
- Avoid research duplication.
- Increase institutional visibility by data sharing.
The journals “Science” and “Nature” both require authors to submit data relevant to their publication. Furthermore, in May of this year, the Nature Publishing Group launched an open-access, online-only journal called “Scientific Data,” where researchers can access descriptions of data outputs, file formats, sample identifiers and replication structure. What is worth noting is that the site does not store data but rather acts as a finding aid for data housed in other repositories. The idea of keeping records of data while depositing the data itself elsewhere is intriguing. In fact, it might be possible to devise a similar sort of system here. Tufts already has a Fedora-based digital repository, so the digital object record would merely require adequate metadata and a URL to direct the user to the right repository. This type of system could give the university a better grasp of its research data output.
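To make the idea a little more concrete, here is a minimal sketch in Python of what such a finding-aid record might hold: a handful of descriptive fields plus a pointer to wherever the data actually lives. Everything in it is an assumption made for illustration. The `DatasetRecord` class, its field names and the `find_by_department` helper are hypothetical, and none of it reflects Fedora's API or an actual Tufts metadata schema.

```python
from dataclasses import dataclass, field, asdict
from urllib.parse import urlparse


@dataclass
class DatasetRecord:
    """Hypothetical finding-aid record: metadata plus a pointer to the data."""
    title: str
    creators: list[str]
    department: str
    repository: str  # name of the external repository holding the data
    url: str         # where a user should go to reach the data
    funder: str = ""
    keywords: list[str] = field(default_factory=list)

    def __post_init__(self):
        # The record is only useful if the pointer looks like a web address.
        parts = urlparse(self.url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not a usable web address: {self.url!r}")


def find_by_department(records, department):
    """Return the records whose department matches exactly."""
    return [r for r in records if r.department == department]


# Illustrative values only; the repository name and URL are placeholders.
example = DatasetRecord(
    title="Sample survey data",
    creators=["A. Researcher"],
    department="Biology",
    repository="Example Subject Repository",
    url="https://example.org/datasets/1",
)
print(asdict(example))
```

The record stores no data at all, only enough description to find it again, which is the same division of labor described above for “Scientific Data.”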
Tufts has made definite progress in advocating for best practices in data management for its researchers; the library holds frequent workshops and offers assistance in drafting data management plans. It is likely, however, that both government and non-government funders – as well as scholarly journals – will continue to focus on the effective management of research data. Moreover, because universities such as Tufts should be able to appraise one of their most fundamental assets, research data access continues to require our attention.