Blurred Lines, Shapes, and Polygons, Part 1: An NDSR-NY Project Update

The following is a guest post by Genevieve Havemeyer-King, National Digital Stewardship Resident at the Wildlife Conservation Society Library & Archives. She participates in the NDSR-NY cohort. This post is Part 1 of 2 posts on Genevieve’s exploration of stewardship issues for preserving geospatial data.

NDSR at the Wildlife Conservation Society Library & Archives, (Left to right): Kim Fisher, Spatial Analyst and Developer;  Leilani Dawson, Processing Archivist;  Genevieve Havemeyer-King, NDSR Resident

A few weeks ago, I wrote an article for the NDSR-NY Blog about my project developing an OAIS-based pilot digital archive for the Wildlife Conservation Society Library and Archives. My post explored the importance of the Producer-Archive Interface and the challenges of developing electronic records submission policies that can accommodate the limitations of busy staff while still meeting basic standards for long-term preservation.

This week I’d like to continue that conversation in the context of the selection and appraisal of geospatial data sets, which have continued to be a major focal point of my project as a National Digital Stewardship Resident. I’ve had immense support jumping into a realm of preservation I had little experience with. This post includes some takeaways from my conversation with NYU’s Data Services and will be followed by my interview with GIS Librarian at Baruch College, Frank Donnelly, in a future post.

My project began with a series of interviews with key staff in three WCS departments: Education; Exhibits and Graphic Arts; and the GIS Lab, home to conservation ecologist Eric Sanderson’s Welikia / Visionmaker (W/V) group. These interviews informed detailed profile reports that provided partial inventories of the anticipated collections, offered a sample cross section of the data lifecycle at WCS, and helped determine the needs and requirements of the Library and Archives. Since then, I’ve worked with my mentors, Processing Archivist Leilani Dawson and Spatial Analyst and Developer Kim Fisher, to decide on a system architecture and begin identifying and transferring sample collections to test our ingest process.

Prior to transferring materials from the GIS Lab, we met as a group to discuss format sustainability, contextual information, the pros and cons of various approaches to preservation, and how to go about selecting, arranging, and processing data. Generally speaking, this collection poses several challenges common to the preservation of geospatial vector and raster data, many of which are outlined in reports like those published by the GeoMAPP project (2007-2011) and the Digital Preservation Coalition (2009). These issues pertain mostly to digital objects such as Shapefiles, GeoTIFFs, GeoPDFs, and similar output data. GeoMAPP’s report, Emerging Trends in Content Packaging for Geospatial Data (PDF), details these issues, which include (but are not limited to):

  • Identification of coordinate reference system information;
  • Distinction between raw data and final cartographic representations (end-product maps, charts, and other publications that are created for presentation to wider audiences);
  • Variation of data packaging formats and the importance of maintaining the relationships between spatial data files therein.
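To make the first and third points concrete, here is a minimal sketch of a pre-ingest check for Shapefiles, which are really bundles of sibling files sharing a base name. It assumes the usual convention that `.shp`, `.shx`, and `.dbf` are the mandatory components and that a `.prj` sidecar carries the coordinate reference system. The function name `audit_shapefiles` and the shape of its report are hypothetical, not part of any WCS or GeoMAPP workflow.

```python
from pathlib import Path

# Components the Shapefile format treats as mandatory.
REQUIRED = {".shp", ".shx", ".dbf"}
# The .prj sidecar is optional, but it is where the coordinate
# reference system is normally recorded.
CRS_SIDECAR = ".prj"


def audit_shapefiles(folder):
    """Group files by base name and report missing Shapefile components."""
    groups = {}
    for path in Path(folder).iterdir():
        if path.is_file():
            groups.setdefault(path.stem, set()).add(path.suffix.lower())

    report = {}
    for stem, extensions in groups.items():
        if ".shp" not in extensions:
            continue  # this base name is not a Shapefile dataset
        report[stem] = {
            "missing_required": sorted(REQUIRED - extensions),
            "has_crs": CRS_SIDECAR in extensions,
        }
    return report
```

Calling `audit_shapefiles("transfer_folder")` on a hypothetical transfer directory would flag any dataset whose sibling files were separated in transit or whose projection information never arrived. It only inspects file names; it does not validate file contents.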

However, as geospatial technology has evolved, preservation issues have become more complex. The 2013 NDSA Report, Issues in the Appraisal and Selection of Geospatial Data (PDF), notes particular concerns about spatial databases. Geodatabases are “proprietary constructs that comprise a number of individual datasets in combination with relationships, behaviors, annotations and models…[they] can be managed forward in time using complex technology regularly available to the producing or managing data agency but which is not widely supported by libraries and archives” (p.8). There are open-source spatial database options, such as PostGIS and SpatiaLite, and a database can be exported as a single file, but this bit-level preservation may not fully capture the linkages between multiple databases and their associated spatial data resources, which can be quite numerous. One alternative – exporting individual tables of a given database as spreadsheets or CSV files – is likewise not an ideal solution when dealing with thousands of relationships, joining tables, and other dependencies.
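The limitation of the CSV approach can be illustrated with a small sketch. It assumes a SQLite-based database file (SpatiaLite is built on SQLite) and uses only Python's standard library; the function name `export_tables` and the manifest it returns are hypothetical. Each table becomes one CSV file, and because foreign-key relationships live in the schema rather than in the rows, the sketch records them separately so they are not silently lost.

```python
import csv
import sqlite3
from pathlib import Path


def export_tables(db_path, out_dir):
    """Write every user table in a SQLite database to its own CSV file.

    Returns a manifest mapping each table to its foreign keys as
    (column, referenced_table, referenced_column) tuples.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name")]
        manifest = {}
        for table in tables:
            quoted = '"' + table.replace('"', '""') + '"'
            cursor = conn.execute(f"SELECT * FROM {quoted}")
            with open(out / f"{table}.csv", "w", newline="",
                      encoding="utf-8") as fh:
                writer = csv.writer(fh)
                writer.writerow([col[0] for col in cursor.description])
                writer.writerows(cursor)
            # Relationships are not part of the exported rows, so keep
            # them in a side manifest.
            manifest[table] = [
                (fk[3], fk[2], fk[4])
                for fk in conn.execute(f"PRAGMA foreign_key_list({quoted})")
            ]
        return manifest
    finally:
        conn.close()
```

Even with the manifest, this is only a partial capture: binary geometry columns would not export meaningfully to CSV, and views, triggers, and links to external files are ignored. With thousands of relationships, the side manifest itself becomes something an archive has to document and maintain.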

These complexities are important to acknowledge, especially considering this project’s combined use of historical ecological data and contemporary city data, all of which serve as the foundation of two interactive web-map applications developed by the W/V group: The Welikia Project (an expansion of the Mannahatta Project) and Visionmaker NYC. These applications add preservation of web content and source code to the workflow for processing geospatial data.

THE CHANGING FACE OF MAPS:  Left, Historical Map of Manhattan, Courtesy of NYPL. Right, Screen-capture of Welikia map of Central Park in Manhattan, linking to block-level contemporary and historical data, courtesy of WCS Welikia/Visionmaker GIS Lab.

GeoMAPP’s final report (PDF) provides a diagram that summarizes the ideal lifecycle for geoarchiving. This lifecycle mirrors the OAIS Reference Model’s principles of producer-archive engagement, archival transfer and ingest with management and preservation planning, and the need to provide access. It also brings the model full circle by representing preservation as an ongoing process that digital archivists must constantly perform in order to maintain their collections as technology and standards change over time.

The Geoarchiving Lifecycle  (original diagram by GeoMAPP; this version created by Genevieve Havemeyer-King)

However, implementation of both models requires making decisions about the various micro-processes within each phase of the workflow. Decisions on normalization for preservation and access, and an archive’s level of adherence to geospatial metadata standards depend partly on the limitations (or flexibility) of processing tools and the amount of time and resources available to archivists. To make things easier, we’ve looked into sources like the Library of Congress’ Recommended Formats Statement, which prioritizes complete geospatial data sets over conversion to open formats (possibly in consideration of the risks and cost associated with normalization?), but recommends using the Federal Geographic Data Committee standard for geospatial metadata (which of course is ideal for large, governmental institutions, but might be demanding for a small organization with a staff of two).

A new standard in development is designed as a specialization of the OAIS standard, ISO 14721:2012 (which focuses on preservation of satellite imagery), and “aims to provide a model for all geospatial data.” Whether this proposed standard will be detailed enough to address these specific processing and packaging issues remains to be seen, but I look forward to hearing more about it!

Moving on to access issues, the ways librarians and patrons use spatial data differ greatly from the use cases of an institutional archive. Considering the close relationship between the WCS Library and the Archives, however, the management of published geospatial data at WCS is just as important to consider as the raw data. So, I reached out to some specialists working with GIS in two NYC academic libraries: NYU’s Bobst Library and Baruch College’s Newman Library.

At NYU’s Data Services, Nicholas Wolf, Vicky Steeves, and Stephen Balogh spoke with me about reproducibility and discoverability tools for spatial data collections, including the open source projects GeoBlacklight (and NYU’s instance), the OpenGeoMetadata initiative, and ReproZip.

With regard to metadata, standards like ISO 19115 (Geographic information – Metadata), ISO 19139 (Geographic information – Metadata XML schema implementation), and the Federal Geographic Data Committee (FGDC) Content Standard for Geographic Metadata (PDF) are complex, lengthy, and seem better suited to governmental record keeping than to academic research and discovery. GeoBlacklight is a Ruby on Rails engine based on Blacklight which aims to provide an open repository for sharing geospatial collections. Darren Hardy and Kim Durante wrote that the geoblacklight-schema uses elements from Dublin Core and GeoRSS to leverage their normative semantics and best practices, providing a better experience for finding geospatial data.
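To give a sense of how lightweight such a discovery record can be compared with a full FGDC or ISO 19115 document, here is a rough sketch that builds a small record from a handful of Dublin Core fields plus a bounding box. The field names (`dc_identifier_s`, `dc_title_s`, `dc_description_s`, `georss_box_s`, `solr_geom`) reflect my understanding of an early version of the geoblacklight-schema and should be treated as assumptions to verify against the current schema documentation. The helper names and the sample values are hypothetical.

```python
def bounding_box_fields(west, south, east, north):
    """Express one bounding box in GeoRSS and Solr envelope notation."""
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must fall between -180 and 180")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
    return {
        # GeoRSS Simple box: lower corner then upper corner, latitude first.
        "georss_box_s": f"{south} {west} {north} {east}",
        # Solr spatial envelope: minX, maxX, maxY, minY.
        "solr_geom": f"ENVELOPE({west}, {east}, {north}, {south})",
    }


def make_record(identifier, title, description, bbox):
    """Assemble a minimal discovery record (field names are assumptions)."""
    record = {
        "dc_identifier_s": identifier,
        "dc_title_s": title,
        "dc_description_s": description,
    }
    record.update(bounding_box_fields(*bbox))
    return record
```

A call such as `make_record("example-0001", "Sample layer", "Illustrative record only", (-74.0, 40.7, -73.9, 40.8))` yields a flat dictionary that could be serialized as JSON. The contrast with the hundreds of elements available in the FGDC standard is the point; a production record would need more fields than this.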

ReproZip is a tool being developed at NYU “aimed at simplifying the process of creating reproducible experiments from command-line executions”, and could be something to consider as an alternative to many costly web-archiving services for preservation of internet-based projects and applications. Although WCS may not be in the position to implement a geolibrary or make all of their geospatial data publicly accessible right now, open-source projects like this help to inspire thinking about collaborative, cost-mitigating ways of sharing their collections in the future, and could help frame our expectations and policies for managing the data right now.

In a follow-up post, I interview Frank Donnelly, GIS Librarian at Baruch College CUNY.
