Blurred Lines, Shapes, and Polygons, Part 2: An Interview with Frank Donnelly, Geospatial Data Librarian

The following is a guest post by Genevieve Havemeyer-King, National Digital Stewardship Resident at the Wildlife Conservation Society Library & Archives. She participates in the NDSR-NYC cohort. This post is Part 2 of Genevieve’s exploration of stewardship issues for preserving geospatial data. Part 1 focuses on specific challenges of archiving geodata.

Frank Donnelly, GIS Librarian at Baruch College, CUNY, was generous enough to let me pick his brain about some questions that came up while researching the selection and appraisal of geospatial datasets for my National Digital Stewardship Residency.

Donnelly maintains the Newman Library’s geospatial data resources and repository, creates research guides for learning and exploring spatial data, and also teaches classes in open-source GIS software. In my meeting with him, we talked about approaches to GIS data curation in a library setting, limitations of traditional archival repositories, and how GIS data may be changing – all topics that have helped me think more flexibly about my work with these collections and my own implementation of standards and best practices for geospatial data stewardship.

Genevieve: How do you approach the selection of GIS materials?

Frank: As a librarian, much of my material selection is driven by the questions I receive from members of my college (students, faculty, and staff). In some cases these are direct questions (i.e. can we purchase or access a particular dataset), and in other cases it’s based on my general sense of what people’s interests are. I get a lot of questions from folks who are interested in local, neighborhood data in NYC for either business, social science, or public policy-based research, so I tend to focus on those areas. I also consider the sources of the questions – the particular departments or centers on campus that are most interested in data services – and try to anticipate what would interest them.

I try to obtain a mix of resources that would appeal to novice users for basic requests (canned products or clickable resources) as well as to advanced users (spatial databases that we construct so researchers using GIS can use them as a foundation for their work). Lastly, I look at what’s publicly accessible and readily usable, and what’s not. For example, it was challenging to find well-documented and public sources for geospatial datasets for NYC transit, so we decided to generate our own out of the raw data that’s provided.
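To make that last step concrete, here is a minimal sketch of deriving a distributable point layer from a raw transit feed. It assumes the raw data is published as a GTFS feed (a stops.txt table with stop_lat and stop_lon columns), which is not stated in the interview; the file paths and output name are hypothetical.

```python
# Sketch: build a point layer from a raw GTFS stops table.
# Assumes the raw transit data is a GTFS feed (stops.txt with
# stop_lat / stop_lon columns); paths and the output name are hypothetical.
import pandas as pd
import geopandas as gpd

stops = pd.read_csv("gtfs/stops.txt")  # hypothetical path to the raw feed

gdf = gpd.GeoDataFrame(
    stops,
    geometry=gpd.points_from_xy(stops["stop_lon"], stops["stop_lat"]),
    crs="EPSG:4326",  # GTFS coordinates are WGS 84 lon/lat
)

# Write a distributable layer; the shapefile driver is inferred from the
# extension, but any OGR-supported format would work here.
gdf.to_file("output/transit_stops.shp")
```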

Genevieve: On the limitations of the shapefile – is the field growing out of this format? And do those limitations affect your ability to provide access?

Frank: People in the geospatial community have been grumbling about shapefiles for quite some time now, and have been predicting or calling for their demise. There are a number of limitations to the format in terms of maximum file size, limits on the number of attribute fields and on the syntax used for field headers, lack of Unicode support, etc. It’s a rather clunky format, as you have several individual pieces or files that have to travel together in order to function. Despite attempts to move on – ESRI has tried to de-emphasize shapefiles by moving towards various geodatabase formats, and various groups have promoted plain-text formats like GML, WKT, and GeoJSON – the shapefile is still with us. It’s a long-established open format that can work in many systems, and has essentially become an interchange format that will work everywhere. If you want to download data from a spatial database or from many web-based systems, those systems can usually transform and output the data to the shapefile format, so there isn’t a limitation in that sense. Compared to other types of digital data (music, spreadsheet files), GIS software seems to be better equipped to read multiple types of file formats – just think about how many different raster formats there are. As other vector formats grow in popularity and longevity – like GeoJSON or perhaps SpatiaLite – the shapefile may be eclipsed in the future, but its construction is simple enough that shapefiles should continue to be accessible.
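As a rough illustration of that interchange role, the sketch below round-trips a layer between shapefile and GeoJSON; the file names are hypothetical, and any OGR-supported format could stand in on either side.

```python
# Sketch: treating the shapefile as an interchange format by converting
# between it and GeoJSON. File names are hypothetical.
import geopandas as gpd

# Read a layer exported from a web system or spatial database...
layer = gpd.read_file("downloads/neighborhoods.shp")

# ...and republish it in a newer plain-text vector format,
layer.to_file("exports/neighborhoods.geojson", driver="GeoJSON")

# or go the other way for tools that still expect a shapefile.
gpd.read_file("exports/neighborhoods.geojson").to_file("exports/neighborhoods.shp")
```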

Genevieve: Do you think that a digital repository designed for traditional archives can or should incorporate complex data sets like those within GIS collections? Do you have any initial ideas or approaches to this?

Frank: This is something of an age-old debate within libraries: should the library catalog contain just books, or should it also contain other formats like music, maps, datasets, etc.? My own belief is that people who are looking for geospatial datasets are going to want to search through a catalog specifically for datasets; it doesn’t make sense to wade through a hodgepodge of other materials, and the interface and search mechanisms for datasets are fundamentally different from the interface that you would want or need when searching for documents. Typical digital archive systems tend to focus on individual files as objects – a document, a picture, etc. Datasets are more complex, as they require extensive metadata (for searchability and usability), help documentation and codebooks, etc. If the data is stored in large relational or object-oriented databases, that data can’t be stored in a typical digital repository unless you export the data tables out into individual delimited text files. That might be fine for small datasets, and it is generally good for ensuring preservation, but if you have enormous datasets – imagine if you had every data table from the US Census – it would be completely infeasible.
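For the export path Frank describes, here is a minimal sketch of flattening one table of a spatial database into a delimited text file for deposit in a generic repository; it assumes a PostGIS source, and the connection details, table, and column names are hypothetical.

```python
# Sketch: flatten one spatial database table into an ordinary CSV for
# deposit in a generic repository. Assumes a PostGIS source; connection
# details, table, and column names are hypothetical.
import csv
import psycopg2

conn = psycopg2.connect("dbname=gisdb user=archivist host=localhost")
cur = conn.cursor()

# ST_AsText turns the geometry into plain WKT, so every value in the row
# is ordinary text that any repository (or future reader) can handle.
cur.execute("SELECT id, name, ST_AsText(geom) AS geom_wkt FROM parcels")

with open("deposit/parcels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col.name for col in cur.description])  # header row
    writer.writerows(cur)  # stream the query results into the CSV

cur.close()
conn.close()
```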

For digital repositories I think it’s fine to include small individual datasets, particularly if they are being attached to a journal article or research paper where analysis was done. But in most instances I think it’s better to have separate repositories for spatial and large tabular datasets. As a compromise you can always generate metadata records that link from the basic repository to the spatial one if you want to increase findability. Metadata is key for datasets – unlike documents (articles, reports, web pages), there is no full text to search through, so keyword searching goes out the window. In order to find them you need to rely on searching metadata records or any help documents or codebooks associated with them.
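As a toy example of the cross-linking records described above, the sketch below writes a bare-bones JSON metadata record that points from a general repository entry to the spatial repository. Every field name, identifier, and URL here is hypothetical; a production record would follow a standard such as ISO 19115 or FGDC CSDGM.

```python
# Sketch: a minimal, hypothetical metadata record linking a general
# repository entry to a spatial repository. Field names and the URL are
# invented for illustration; real records would follow ISO 19115 or FGDC.
import json

record = {
    "title": "NYC Neighborhood Boundaries (hypothetical example)",
    "description": "Polygon layer of neighborhood boundaries.",
    "keywords": ["New York City", "neighborhoods", "boundaries", "vector"],
    "format": "Shapefile",
    "spatial_repository_url": "https://example.edu/geoportal/nyc-neighborhoods",
}

# These fields are what keyword searching would actually run against,
# since the dataset itself has no full text to index.
with open("records/nyc-neighborhoods.json", "w") as f:
    json.dump(record, f, indent=2)
```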

Genevieve: How do you see selection and preservation changing in the future, if/when you begin collecting GIS data created at Baruch?

Frank: For us, the big change will occur when we can build a more robust infrastructure for serving data. Right now we have a web server where we can drop files and people can click on links to download layers or tables one by one. But increasingly it’s not enough to just have your own data website floating out there; in order to make sure your data is accessible and findable you want to appear in other large repositories. Ideally we want to get a spatial database up and running (like PostGIS) where we can serve the data out in a number of ways – we can continue to serve it the old-fashioned way but would also be able to publish out to larger repositories like the OpenGeoportal. A spatial database would allow us to grant users access to our data directly through a GIS interface, without having to download and unzip files one by one.
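As a sketch of what that direct access could look like from a scripting environment, assuming a reachable PostGIS database: the connection string, table, and column names below are all hypothetical.

```python
# Sketch: querying a spatial database layer directly instead of downloading
# and unzipping files one by one. Connection string, table, and column
# names are hypothetical; assumes a reachable PostGIS database.
import geopandas as gpd
from sqlalchemy import create_engine

engine = create_engine("postgresql://reader:reader@gis.example.edu:5432/gisdb")

# A researcher can pull just the features they need with an attribute or
# spatial filter, something a flat file download cannot offer.
subway_stops = gpd.read_postgis(
    "SELECT * FROM transit_stops WHERE borough = 'Manhattan'",
    engine,
    geom_col="geom",
)
print(subway_stops.head())
```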
