Data Infrastructure, Education & Sustainability: Notes from the Symposium on the Interagency Strategic Plan for Big Data

Last week, the National Academies Board on Research Data and Information hosted a Symposium on the Interagency Strategic Plan for Big Data. Staff from the National Institutes of Health, the National Science Foundation, the U.S. Geological Survey and the National Institute of Standards and Technology presented on ongoing work to establish an interagency strategic plan for Big Data. In this short post I recap some of the points and issues raised in the presentations and discussion, and link to some of the projects and initiatives that I think will be of interest to readers of The Signal.

Vision and Priority Actions for National Big Data R&D

Slide with the vision for the interagency big data activity.

Part of the occasion for this event is the current “Request for Input (RFI)-National Big Data R&D Initiative.” Individuals and organizations have until November 14th to provide comments on “The National Big Data R&D Initiative: Vision and Actions to be Taken” (pdf). This short document is intended to inform policy for research and development across various federal agencies. Of particular relevance to those working in digital stewardship and digital preservation, the draft focuses on the trustworthiness of data and resulting knowledge, investment in both domain-specific and shared cyberinfrastructure to support research, improved data analysis education and training, and “ensuring the long term sustainability” of data sets and data resources.

Sustainability as the Elephant in the Room

In the overview presentation about the interagency big data initiative, Allen Dearry from the National Institute of Environmental Health Sciences noted that sustainability and preservation infrastructure for data remains the “elephant in the room.” This comment resonated with several of the subsequent presenters, who referred back to it in their remarks. I was glad to see sustainability and long-term access getting this kind of attention. It is also good to see that “sustainability” is specifically mentioned in the draft document referenced above. That said, the discussion and presentations made clear that the challenges of long-term data management are only growing more complex as ever more data is collected to support a range of research.

From “Data to Knowledge” as a Framework

The phrase “Data to Knowledge” was repeated in several of the presentations. The interagency team working in this space has often made use of it, for example, in relation to last year’s “Data to Knowledge to Action” event (pdf). From a stewardship/preservation perspective, it is worth recognizing that a focus on the knowledge and action that come from data demands additional levels of assurance across the range of activities involved in data stewardship. This is not simply an issue of maintaining data assets, but a more complex activity of keeping data accessible and interpretable in ways that support generating sound knowledge.

Some of the particular examples discussed under the heading of “data to knowledge” illustrate the significance of the concept to the work of data preservation and stewardship. One of the presenters mentioned the importance of publishing negative results and the analytic process of research. Another noted that open source platforms like the IPython Notebook are making it easier for scientists to work on and share their data, code and research. This discussion connected rather directly with many of the issues that were raised in the 2012 NDIIPP content summit Science@Risk: Toward a National Strategy for Preserving Online Science and in its final report (pdf). There is a whole range of seemingly ancillary material that makes data interpretable and meaningful. I was pleased to see one of those areas, software, receive recognition at the event.

Recognition of Software Preservation as Supporting Data to Knowledge

Sky Bristol from USGS presenting on sustainability issues related to big data to an audience at the National Academies of Science in Washington DC.

The event closed with presentations from two projects that won the National Academies Board on Research Data and Information’s Data and Information Challenge Awards. Adam Asare of the Immune Tolerance Network presented on “ITN Trial Share: Enabling True Clinical Trial Transparency” and Mahadev Satyanarayanan from the Olive Executable Archive presented on “Olive: Sustaining Executable Content Over Decades.” Both of these projects represent significant progress in supporting sustainable access to scientific data.

I was particularly thrilled to see the issues around software preservation receiving this kind of national attention. As explained in much greater depth in the Preserving.exe report, arts, culture and scientific advancement are increasingly dependent on software. In this respect, I found it promising to see a project like Olive receiving recognition at an event focused on data infrastructure. Olive has considerable implications for the reproducibility of analysis and for providing long-term access to data and interpretations of data in their native formats and environments. For those interested in what this kind of work could mean for science, this 2011 interview with the Olive project explores many of the possibilities.

Education and Training in Data Curation

Slide from presentation on approaches to analytical training for working with data for all learners.

Another subject I imagine readers of The Signal are tracking is education and training in support of data analysis and curation. Michelle Dunn from the National Institutes of Health presented on the approach NIH is taking to develop the kind of workforce that is necessary in this space. She mentioned a range of vectors for thinking about data science training, including traditional academic programs as well as the potential for the development of open educational resources. For those interested in this topic, it’s worth reviewing the vision and goals outlined in the NIH Data Science “Education, Training, and Workforce Development” draft report (pdf). As libraries increasingly become involved in the curation and management of research data, and as library and information science programs increasingly focus on preparing students to work in support of data-intensive research, it will be critical to follow developments in this area.
