The following is a guest post by Genevieve Havemeyer-King, National Digital Stewardship Resident at the Wildlife Conservation Society Library & Archives. She participates in the NDSR-NY cohort. This post is Part 1 of 2 posts on Genevieve’s exploration of stewardship issues for preserving geospatial data.
A few weeks ago, I wrote an article for the NDSR-NY Blog about my project developing an OAIS-based pilot digital archive for the Wildlife Conservation Society Library and Archives. My post explored the importance of the Producer-Archive Interface and the challenges of developing electronic records submission policies that can accommodate the limitations of busy staff while still meeting basic standards for long-term preservation.
This week I’d like to continue that conversation in the context of the selection and appraisal of geospatial data sets, which have continued to be a major focal point of my project as a National Digital Stewardship Resident. I’ve had immense support jumping into a realm of preservation I had little experience with. This post includes some takeaways from my conversation with NYU’s Data Services and will be followed by my interview with GIS Librarian at Baruch College, Frank Donnelly, in a future post.
My project began with a series of interviews with key staff in three WCS departments: Education; Exhibits and Graphic Arts; and conservation ecologist Eric Sanderson’s Welikia / Visionmaker (W/V) group in the GIS Lab. These interviews informed detailed profile reports that provided partial inventories of the anticipated collections and a sample cross section of the data lifecycle at WCS, and helped determine the needs and requirements of the Library and Archives. Since then, I’ve worked with my mentors, Processing Archivist Leilani Dawson and Spatial Analyst and Developer Kim Fisher, to decide on a system architecture and begin identifying and transferring sample collections to test our ingest process.
Prior to transferring materials from the GIS Lab, we met as a group to discuss format sustainability, contextual information, the pros and cons of various approaches to preservation, and how to go about selecting, arranging, and processing data. Generally speaking, this collection poses several challenges common to the preservation of geospatial vector and raster data, many of which are outlined in reports like those published by the GeoMAPP project (2007-2011) and the Digital Preservation Coalition (2009). These issues pertain mostly to digital objects such as Shapefiles, GeoTIFFs, GeoPDFs, and similar output data. GeoMAPP’s report, Emerging Trends in Content Packaging for Geospatial Data (PDF), details these issues, which include (but are not limited to):
- Identification of coordinate reference system information;
- Distinction between raw data and final cartographic representations (end-product maps, charts, and other publications that are created for presentation to wider audiences);
- Variation of data packaging formats and the importance of maintaining the relationships between spatial data files therein.
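To make the first and third points concrete, here is a minimal sketch of a pre-ingest check for a Shapefile, which is really a bundle of sibling files rather than a single object. It assumes the conventional layout in which `.shp`, `.shx`, and `.dbf` are the mandatory parts and the optional `.prj` sidecar carries the coordinate reference system. The function name `audit_shapefile` and the shape of its report are hypothetical, not part of any WCS or GeoMAPP tooling.

```python
from pathlib import Path

# The three parts every Shapefile needs in order to be readable.
REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")
# Optional sidecar that stores the coordinate reference system.
CRS_EXTENSION = ".prj"


def audit_shapefile(shp_path):
    """Report which sibling files of a Shapefile are missing.

    Hypothetical helper: it only checks that the files exist next to
    the .shp file, it does not open or validate their contents.
    """
    base = Path(shp_path)
    missing = [
        ext for ext in REQUIRED_EXTENSIONS
        if not base.with_suffix(ext).exists()
    ]
    return {
        "dataset": base.stem,
        "missing_required": missing,
        "has_crs_file": base.with_suffix(CRS_EXTENSION).exists(),
    }
```

A report with an empty `missing_required` list and `has_crs_file` set to `True` would suggest the package is at least structurally complete. Whether the `.prj` contents are correct is a separate question that this sketch does not answer.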
However, as geospatial technology has evolved, preservation issues have become more complex. The 2013 NDSA Report, Issues in the Appraisal and Selection of Geospatial Data (PDF) notes particular concerns about spatial databases. Geodatabases are “proprietary constructs that comprise a number of individual datasets in combination with relationships, behaviors, annotations and models…[they] can be managed forward in time using complex technology regularly available to the producing or managing data agency but which is not widely supported by libraries and archives” (p.8). There are open-source database formats, such as PostGIS and Spatialite, and databases can be exported to contain the database information in a single file, but this bit-level preservation may not fully capture the linkages between multiple databases and their associated spatial data resources, which can be quite numerous. One alternative – exporting individual tables of a given database as spreadsheets or CSV files – is likewise not an ideal solution when dealing with thousands of relationships, joining tables, and other dependencies.
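The limitation of table-by-table export can be illustrated with a small sketch. It assumes a SQLite-based database (the family SpatiaLite belongs to) and uses only Python's standard library. The function name `export_tables_to_csv` is hypothetical. It writes one CSV per table and separately collects the declared foreign keys, which are exactly the relationships that the CSV files themselves do not carry.

```python
import csv
import sqlite3
from pathlib import Path


def export_tables_to_csv(db_path, out_dir):
    """Export every user table to CSV and list declared foreign keys.

    Hypothetical helper for illustration. Returns (tables, relationships),
    where each relationship is
    (table, from_column, referenced_table, referenced_column).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    relationships = []
    conn = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        for table in tables:
            cursor = conn.execute(f'SELECT * FROM "{table}"')
            with open(out / f"{table}.csv", "w", newline="", encoding="utf-8") as fh:
                writer = csv.writer(fh)
                # Header row comes from the cursor's column names.
                writer.writerow([col[0] for col in cursor.description])
                writer.writerows(cursor)
            # The CSV has no way to say "this column points at that table",
            # so the links are recorded separately here.
            for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
                relationships.append((table, fk[3], fk[2], fk[4]))
    finally:
        conn.close()
    return tables, relationships
```

In practice the returned relationship list would need to be saved alongside the CSV files as documentation. With thousands of joins, and with dependencies that are not declared as foreign keys at all, that documentation quickly becomes the hard part.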
These complexities are important to acknowledge, especially considering this project’s combined use of historical ecological data and contemporary city data, all of which serves as the foundation of two interactive web-map applications developed by the W/V group: The Welikia Project (an expansion of the Mannahatta Project) and Visionmaker NYC. Supporting these applications adds preservation of web content and source code to the workflow for processing geospatial data.
GeoMAPP’s final report (PDF) provides a diagram that summarizes the ideal lifecycle for geoarchiving. This lifecycle mirrors the OAIS Reference Model’s principles of producer-archive engagement, archival transfer and ingest with management and preservation planning, and the need to provide access. It also brings the model full circle by representing preservation as an ongoing process that digital archivists must constantly perform in order to maintain their collections as technology and standards change over time.
However, implementation of both models requires making decisions about the various micro-processes within each phase of the workflow. Decisions on normalization for preservation and access, and an archive’s level of adherence to geospatial metadata standards depend partly on the limitations (or flexibility) of processing tools and the amount of time and resources available to archivists. To make things easier, we’ve looked into sources like the Library of Congress’ Recommended Formats Statement, which prioritizes complete geospatial data sets over conversion to open formats (possibly in consideration of the risks and cost associated with normalization?), but recommends using the Federal Geographic Data Committee standard for geospatial metadata (which of course is ideal for large, governmental institutions, but might be demanding for a small organization with a staff of two).
A new standard in development is designed as a specialization of the OAIS standard, ISO 14721:2012 (which focuses on preservation of satellite imagery) and “aims to provide a model for all geospatial data.” Whether this proposed standard will be detailed enough to address these specific processing and packaging issues remains to be seen, but I look forward to hearing more about it!
Moving on to access issues, a librarian’s or patron’s use cases for spatial data differ greatly from those of an institutional archive, but considering the close relationship between the WCS Library and the Archives, the management of published geospatial data at WCS is just as important to consider as the raw data. So, I reached out to some specialists working with GIS in two NYC academic libraries: NYU’s Bobst Library and Baruch College’s Newman Library.
At NYU’s Data Services, Nicholas Wolf, Vicky Steeves, and Stephen Balogh spoke with me about reproducibility and discoverability tools for spatial data collections, including the open source projects GeoBlacklight (and NYU’s instance), the OpenGeoMetadata initiative, and ReproZip.
With regard to metadata, standards like ISO 19115 (Geographic information – Metadata), ISO 19139 (Geographic information – Metadata XML schema implementation), and the Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (PDF) are complex, lengthy, and seem better suited to governmental record keeping than to academic research and discovery. GeoBlacklight is a Ruby on Rails engine based on Blacklight which aims to provide an open repository for sharing geospatial collections. Darren Hardy and Kim Durante wrote that the geoblacklight-schema uses elements from Dublin Core and GeoRSS to leverage their normative semantics and best practices, providing a better experience for finding geospatial data.
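As a rough illustration of how lightweight a discovery record built on Dublin Core and GeoRSS elements can be, here is a minimal sketch. The field names (`dc_title_s`, `dc_description_s`, `dct_provenance_s`, `georss_box_s`, `solr_geom`) follow my recollection of the geoblacklight-schema and should be treated as assumptions to verify against the project's documentation. The function names and the sample values are hypothetical. The two bounding-box encodings are the standard ones: GeoRSS Simple orders a box as south, west, north, east, and Solr's `ENVELOPE` orders it as west, east, north, south.

```python
import json


def bbox_to_georss_box(west, south, east, north):
    """GeoRSS Simple box: lower corner (lat lon), then upper corner (lat lon)."""
    return f"{south} {west} {north} {east}"


def bbox_to_solr_envelope(west, south, east, north):
    """Solr spatial envelope: ENVELOPE(minX, maxX, maxY, minY)."""
    return f"ENVELOPE({west}, {east}, {north}, {south})"


def build_discovery_record(title, description, provenance, bbox):
    """Assemble a minimal discovery record as a dict.

    Field names are assumed from the geoblacklight-schema and are not
    verified here; bbox is (west, south, east, north) in decimal degrees.
    """
    west, south, east, north = bbox
    if south > north:
        raise ValueError("south latitude must not exceed north latitude")
    return {
        "dc_title_s": title,
        "dc_description_s": description,
        "dct_provenance_s": provenance,
        "georss_box_s": bbox_to_georss_box(west, south, east, north),
        "solr_geom": bbox_to_solr_envelope(west, south, east, north),
    }


if __name__ == "__main__":
    # Sample values are invented for illustration only.
    record = build_discovery_record(
        title="Sample ecological communities layer",
        description="Illustrative record, not a real dataset.",
        provenance="Example Institution",
        bbox=(-74.26, 40.49, -73.7, 40.92),
    )
    print(json.dumps(record, indent=2))
```

Compared with a full ISO 19115 or FGDC record, a handful of fields like these is far less to ask of a small staff, which is part of the appeal for discovery purposes. It is not a substitute for fuller preservation metadata.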
ReproZip is a tool being developed at NYU “aimed at simplifying the process of creating reproducible experiments from command-line executions”, and could be something to consider as an alternative to many costly web-archiving services for preservation of internet-based projects and applications. Although WCS may not be in the position to implement a geolibrary or make all of their geospatial data publicly accessible right now, open-source projects like this help to inspire thinking about collaborative, cost-mitigating ways of sharing their collections in the future, and could help frame our expectations and policies for managing the data right now.
In a follow-up post, I interview Frank Donnelly, GIS Librarian at Baruch College CUNY.