The following is a guest post by Genevieve Havemeyer-King, National Digital Stewardship Resident at the Wildlife Conservation Society Library & Archives and a member of the NDSR-NYC cohort. This post is Part 2 of Genevieve’s exploration of stewardship issues in preserving geospatial data; Part 1 focused on the specific challenges of archiving geodata.
Frank Donnelly, GIS Librarian at Baruch College CUNY, was generous enough to let me pick his brain about some questions that came up while I was researching the selection and appraisal of geospatial datasets for my National Digital Stewardship Residency.
Donnelly maintains the Newman Library’s geospatial data resources and repository, creates research guides for learning and exploring spatial data, and teaches classes in open-source GIS software. In my meeting with him, we talked about approaches to GIS data curation in a library setting, the limitations of traditional archival repositories, and how GIS data may be changing – all topics that have helped me think more flexibly about my work with these collections and my own implementation of standards and best practices for geospatial data stewardship.
Genevieve: How do you approach the selection of GIS materials?
Frank: As a librarian, much of my material selection is driven by the questions I receive from members of my college (students, faculty, and staff). In some cases these are direct questions (e.g., can we purchase or access a particular dataset?), and in other cases selection is based on my general sense of what people’s interests are. I get a lot of questions from folks who are interested in local, neighborhood-level data in NYC for business, social science, or public policy research, so I tend to focus on those areas. I also consider the sources of the questions – the particular departments or centers on campus that are most interested in data services – and try to anticipate what would interest them.
I try to obtain a mix of resources that appeal to novice users with basic requests (canned products or clickable resources) as well as to advanced users (spatial databases we construct that researchers using GIS can use as a foundation for their work). Lastly, I look at what’s publicly accessible and readily usable, and what’s not. For example, it was challenging to find well-documented, public sources for geospatial datasets covering NYC transit, so we decided to generate our own from the raw data that’s provided.
Genevieve: Given the limitations of the shapefile, is the field outgrowing this format? And do the limitations affect your ability to provide access?
Frank: People in the geospatial community have been grumbling about shapefiles for quite some time now, and have been predicting or calling for their demise. The format has a number of limitations: a maximum file size, limits on the number of attribute fields and on the syntax used for field headers, a lack of Unicode support, etc. It’s also a rather clunky format, as several individual pieces or files have to travel together in order to function. Despite attempts to move on – ESRI has tried to de-emphasize shapefiles by moving toward various geodatabase formats, and various groups have promoted text-based formats like GML, WKT, and GeoJSON – the shapefile is still with us. It’s a long-established open format that works in many systems, and it has essentially become an interchange format that will work everywhere. If you want to download data from a spatial database or from many web-based systems, those systems can usually transform and output the data to the shapefile format, so there isn’t a limitation in that sense. Compared to software for other types of digital data (music, spreadsheet files), GIS software seems better equipped to read multiple file formats – just think about how many different raster formats there are. As other vector formats grow in popularity and longevity – GeoJSON, or perhaps SpatiaLite – the shapefile may eventually be eclipsed, but its construction is simple enough that shapefiles should continue to be accessible.
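As an illustration of that interchangeability, here is a minimal sketch of converting a shapefile to GeoJSON with the Python geopandas library; the file names are hypothetical placeholders, not layers from an actual collection:

```python
# Minimal sketch: shapefile-to-GeoJSON conversion with geopandas.
# File names are hypothetical placeholders.
import geopandas as gpd

# Reading the .shp pulls in its companion files (.shx, .dbf, .prj)
# as a single GeoDataFrame.
layer = gpd.read_file("nyc_transit_stops.shp")

# Writing back out as GeoJSON produces one self-contained text file.
layer.to_file("nyc_transit_stops.geojson", driver="GeoJSON")
```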
Genevieve: Do you think a digital repository designed for traditional archives can or should incorporate complex datasets like those within GIS collections? Do you have any initial ideas or approaches to this?
Frank: This is something of an age-old debate within libraries: should the library catalog contain just books, or should it also contain other formats like music, maps, datasets, etc.? My own belief is that people who are looking for geospatial datasets want to search through a catalog built specifically for datasets; it doesn’t make sense to wade through a hodgepodge of other materials, and the interface and search mechanisms for datasets are fundamentally different from the interface you would want or need when searching for documents. Typical digital archive systems tend to focus on individual files as objects – a document, a picture, etc. Datasets are more complex: they require extensive metadata (for searchability and usability), help documentation, codebooks, and so on. And if the data is stored in large relational or object-oriented databases, it can’t be stored in a typical digital repository unless you export the data tables out into individual delimited text files. That might be fine for small datasets, and it is generally good for ensuring preservation, but for enormous datasets – imagine if you had every data table from the US Census – it would be completely infeasible.
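To make that export step concrete, here is a rough sketch, assuming a hypothetical PostGIS table and connection string, of dumping one spatial table to a delimited text file with its geometry serialized as well-known text (WKT):

```python
# Rough sketch: exporting one spatial database table to delimited text,
# with geometry flattened to WKT so a generic repository can hold it.
# The connection string and table name are hypothetical.
import geopandas as gpd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/gisdb")

# Pull a single table out of the spatial database.
tracts = gpd.read_postgis("SELECT * FROM census_tracts", engine,
                          geom_col="geom")

# Serialize the geometry column as text, then write a plain CSV.
tracts["geom_wkt"] = tracts.geometry.to_wkt()
tracts.drop(columns="geom").to_csv("census_tracts.csv", index=False)
```

A repository full of files like this preserves the content, but, as Donnelly notes, repeating the exercise for every table in an enormous collection quickly becomes impractical.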
For digital repositories, I think it’s fine to include small individual datasets, particularly when they are attached to a journal article or research paper where the analysis was done. But in most instances I think it’s better to have separate repositories for spatial and large tabular datasets. As a compromise, you can always generate metadata records that link from the basic repository to the spatial one to increase findability. Metadata is key for datasets – unlike documents (articles, reports, web pages), they have no full text to search through, so keyword searching goes out the window. To find them you have to rely on searching metadata records or any help documents or codebooks associated with them.
Genevieve: How do you see selection and preservation changing in the future, if/when you begin collecting GIS data created at Baruch?
Frank: For us, the big change will occur when we can build a more robust infrastructure for serving data. Right now we have a web server where we can drop files, and people can click on links to download layers or tables one by one. But increasingly it’s not enough to just have your own data website floating out there; to make sure your data is accessible and findable, you want to appear in other large repositories. Ideally we want to get a spatial database up and running (like PostGIS) so we can serve the data out in a number of ways – we could continue to serve it the old-fashioned way but would also be able to publish out to larger repositories like the OpenGeoportal. A spatial database would also allow us to grant users access to our data directly through a GIS interface, without having to download and unzip files one by one.
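As a sketch of what the publishing side of that could look like, assuming hypothetical layer names and connection details, a layer currently offered as a zipped download could be loaded into PostGIS once and then queried directly by users:

```python
# Sketch of the publishing side: loading a downloadable layer into
# PostGIS so users can query it directly through a GIS client.
# Connection details and names are hypothetical; to_postgis also
# requires the geoalchemy2 package to be installed.
import geopandas as gpd
from sqlalchemy import create_engine

engine = create_engine("postgresql://admin:password@localhost/gisdb")

# Read a layer that is currently served as a zipped shapefile...
layer = gpd.read_file("subway_stops.shp")

# ...and load it into the spatial database.
layer.to_postgis("subway_stops", engine, if_exists="replace")
```

From there, a desktop client such as QGIS can connect to the same database and browse or filter the layer with no downloading or unzipping involved.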