Yesterday, May 9, 2013, the U.S. government issued an executive order and an open data policy mandating that federal agencies collect and publish new datasets in open, machine-readable, and, whenever possible, non-proprietary formats. The new policy gives agencies six months to create an inventory of all the government-produced datasets they collect and maintain; a list of datasets that are publicly accessible; and an online system to collect feedback from the public as to how they would like to use the data. The goals are twofold — greater access to government data for the public, and the availability of data in forms that businesses and researchers can better use. This builds on the earlier White House Memorandum on Transparency and Open Government.
These documents were accompanied by a link to something that caught my fancy even more: a greatly expanded Project Open Data GitHub repository for guidelines, use cases and tools. This, alongside the ever-growing (and soon to be extensively updated) data.gov, is evidence of real efforts to release more data and make it truly useful and usable.
The documents provide guidance on open licensing, metadata, and standards, as well as lifecycle-based information stewardship. But what I personally keep struggling with are two questions: What IS open data? And how is it being preserved?
The project has some defining principles for open data that I think can inform any dataset preservation project. While reading through some of the documents, I came across this bullet point:
- Managed Post-Release. A point of contact must be designated to assist with data use and to respond to complaints about adherence to these open data requirements.
I am thrilled to see guidance about active management of datasets and supporting users in their work with the data. But what this and all open dataset projects could use is more attention to dataset preservation. Here are a few great resources on this topic:
- The Library of Congress Sustainability of Digital Formats site on datasets
- A Report on the Preservation of Public Sector Datasets from Archives New Zealand
- Open Data and Archiving Datasets from the National Archives UK
- DataONE
- Data-PASS
- DataConservancy
- Life of a Dataset from ICPSR
- Best Practices for Archival Processing for Geospatial Datasets from the GeoMAPP Project
- Datasets, Issues, Contexts and Solutions from the Open Planets Foundation
Do public sector datasets present different preservation issues from other datasets? Not really. They do face a potentially much higher level of public scrutiny and use. But they have the same level of investment of time and money in their creation, serve research and the public good, and present the same format preservation issues as other research data.