It’s a bird, it’s a plane, it’s a…derivative dataset!

Before I joined LC Labs, I was an English teacher. In my classroom, the word “derivative” had a negative connotation. To be derivative was to be overly indebted to another idea and thus to lack ingenuity and creativity.

When applied to datasets, however, “derivatives” abound. In fact, as you’ll see, derivative datasets sometimes serve a critical purpose in making large digital files more accessible to potential users. In this context, “derivative” is not a slight. The process of altering a dataset can be essential to a digital project, whether that “data transformation” means standardizing information, removing extraneous information, reformatting, or something else.

A new dataset resulting from that data transformation can be considered a “derivative dataset.” The process of alteration changes things about the primary file, including its size, format, and the information it contains. It’s therefore important to document any changes you make and to save your work as a new version so you can revert to the original if needed. You’ll also want to keep track of the editorial decisions you made along the way. Interns working with the Digital Strategy Directorate explored dataset transformations and their effects in a design sprint this summer; check out their posts on the Signal to learn more about the ways they approached understanding and designing around derivative datasets.
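To make that “document and version” advice concrete, here is a minimal sketch of a transformation script that writes its output to a new file and keeps a plain-text log of the changes, leaving the original untouched. All file and column names here are invented for illustration:

# A minimal sketch of "document and version": write the transformed
# data to a NEW file and record the editorial decisions alongside it,
# leaving the original untouched. All names here are hypothetical.
import csv
from datetime import date

DROPPED = ("notes", "tracking_notes")  # columns removed in this derivative

with open("original_dataset.csv", newline="", encoding="utf-8") as src, \
     open("derivative_dataset.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    kept = [c for c in reader.fieldnames if c not in DROPPED]
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(reader)  # extra (dropped) fields are ignored

# Record what changed and when, so the derivative can be traced back
# to its source.
with open("derivative_dataset_LOG.txt", "w", encoding="utf-8") as log:
    log.write(f"{date.today().isoformat()}: derived from original_dataset.csv; "
              f"dropped columns: {', '.join(DROPPED)}\n")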

I got to thinking about derivative datasets while collaborating with Peter DeCraene, the Albert Einstein Distinguished Educator Fellow at the Library of Congress. Peter has been teaching math and computer science for 31 years. At the Library, he has been investigating charts, graphs, and other forms of data (re)presentation in the Library’s collections. Working closely with the team behind the Teaching with Primary Sources program, Peter has been developing lesson plan ideas that use items from our digital collections in the K-12 classroom.

[Photo: Eileen Jakeway Manchester, Innovation Specialist, LC Labs, and Peter DeCraene, Albert Einstein Distinguished Educator Fellow]

Peter and I initially met due to our shared background in teaching and our interest in Library collections data. Our project grew out of a pretty simple question: “What would it mean to treat a Library of Congress dataset as a primary source?”

This work builds on the strong foundation of pedagogical materials that the Teaching with Primary Sources program has made available, as well as the pioneering work driven by the Digital Strategy’s goal of getting more digital collections online and ready for use with a computer.[1]

We took a two-pronged approach. We wanted to start with a general inquiry that would lead to best practices: questions to ask of ALL data sources. Then we thought we could really shed light on the technical and pedagogical complexity of this topic by working together on a specific case study, documenting the process of using a dataset from start to finish.

To narrow our scope, we decided to work with the Selected Datasets collection on loc.gov. As you can see in our recent post, this collection has just turned one! We examined all of the datasets in the collection at the time, 99 in total (there are now over 150). Through a lot of careful analysis, we identified some qualities we thought would be important to teachers. To that end, we decided to focus on datasets that were:

  • Tabular
  • Manageable in size (measured in MB not GB)
  • On loc.gov
  • Available for download as a zipped folder/file
  • Recommended by Library staff who are knowledgeable about the subject matter

Our short list included…

In the end, we settled on the Grand Comics Database for the following reasons:

  • We thought that the subject matter would be appealing to students.
  • The information was stored in database files, which we could access (with some effort) to create derivative datasets as CSV files.
  • Column headings were clearly organized.
  • There was a mix of textual and numeric data.
  • The database file is large (~200 MB) but can be read with a text editor and converted into more manageable formats.

The Library started collecting datasets from the Grand Comics Database (GCD) in 2018 as part of a collaboration between the Digital Content Management Section (DCMS) and the Library’s resident comic book reference librarian, Megan Halsband. The GCD is a non-profit, user-generated, internet-based resource that provides extensive indexing for comic books; in short, it is one of the most extensive resources on comic book creators, artists, series, and other content.

Initially identified for inclusion in the Comics Literature & Criticism Web Archive, the database itself could not be preserved with current web archiving tools. Halsband, who is based in the Serial & Government Publications Division, and staff from DCMS worked with representatives from the GCD to establish a workflow and procedure for obtaining datasets, as well as for acquiring backfiles. As comic book nerds ourselves, we were excited to find this treasure trove of information freely available on the Library’s website!

There was only one problem: this database was, well, a database. The data is stored in a dynamic database (e.g., SQL), as opposed to a flat dataset file (e.g., CSV or TSV) that can be more easily manipulated. As we later learned, SQL is one of the recommended file formats for databases for digital preservation purposes. But neither of us had much experience working with SQL. After some exploration, we decided that in order to truly wrap our heads around this dataset, we needed it in a smaller, more easily manipulable format like CSV. So, we did what lots of data librarians and digital scholars do, and made a derivative dataset in the form of a spreadsheet!
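For readers curious about the mechanics, below is one possible sketch of that conversion, assuming the dump has first been loaded into a local MySQL instance. The table name, column subset, and connection details are our own illustrative guesses, not part of the GCD’s documented schema:

# A sketch of extracting a CSV derivative, assuming the GCD dump has
# already been loaded into a local MySQL instance (e.g., with
# `mysql gcd < dump.sql`). The table name (gcd_series), the column
# subset, and the connection details are illustrative assumptions.
import csv

import pymysql  # third-party driver: pip install pymysql

COLUMNS = ["name", "sort_name", "format", "year_began",
           "year_ended", "publication_dates", "issue_count",
           "publisher_id", "country_id", "language_id"]

conn = pymysql.connect(host="localhost", user="gcd_user",
                       password="secret", database="gcd")
try:
    with conn.cursor() as cur:
        # Pull only series that began between 2000 and 2018,
        # mirroring the subset described below.
        cur.execute(
            "SELECT " + ", ".join(COLUMNS) +
            " FROM gcd_series WHERE year_began BETWEEN 2000 AND 2018"
        )
        rows = cur.fetchall()
finally:
    conn.close()

# Write the derivative as a flat CSV file that opens in Excel.
with open("gcd_series_2000_2018.csv", "w", newline="",
          encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(COLUMNS)  # header row
    writer.writerows(rows)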

Given our research interests and our limited technical experience with databases, we decided it would be easiest for us to work in a CSV file, a format that can be opened, viewed, and easily edited using spreadsheet programs like Microsoft Excel. Our subset included entries in the Grand Comics Database from 2000 to 2018 and contained the following information about comic book series:

  • name,
  • sort_name,
  • format,
  • year_began,
  • year_began_uncertain (yes/no),
  • year_ended,
  • year_ended_uncertain,
  • publication_dates,
  • first_issue_id,
  • last_issue_id,
  • is_current,
  • publisher_id,
  • country_id,
  • language_id,
  • tracking_notes,
  • notes,
  • publication_notes,
  • has_gallery,
  • open_reserve,
  • issue_count,
  • created,
  • modified,
  • reserved,
  • deleted,
  • has_indicia_frequency,
  • has_isbn,
  • has_barcode,
  • has_issue_title,
  • has_volume,
  • is_comics_publication,
  • color,
  • dimensions,
  • paper_stock,
  • binding,
  • publishing_format,
  • has_rating,
  • country_name,
  • publisher_name, and
  • language_name.

Our derivative dataset contained 2,205 entries (shown as rows in a spreadsheet), which was much easier for us to manage and explore. The tabular format of the data was well-suited to Peter’s interests for computer science lesson plans. Furthermore, this process allowed us to better understand the steps it takes to transform larger datasets and to create resources that may allow teachers to use the primary and derivative datasets in their classrooms.
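As a quick usage example, a few lines of pandas are enough to start asking questions of the derivative. The file and column names below follow the earlier sketch and are likewise illustrative:

# Exploring the derivative CSV; file and column names follow the
# earlier sketch and are illustrative, not official.
import pandas as pd

series = pd.read_csv("gcd_series_2000_2018.csv")
print(series.shape)  # (rows, columns); our derivative had 2,205 rows
# How many series began in each year of the subset?
print(series["year_began"].value_counts().sort_index())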

All in all, both the primary Grand Comics Database dataset and our derivative were essential to our project and pedagogical questions. This raises another question as well: if our goal is to treat datasets as “primary sources,” is a derivative truly still one? Or does it, by definition, stand apart as a new and thus different version of the original source? We’re going to ponder these and other questions as our work continues. If you have thoughts, let us know in the comments!

We’re excited to share this project with Signal readers because it illustrates one way we at LC Labs are working with our peers across the Library of Congress to throw open the treasure chest of our rich collections.

[1] See the Digital Scholarship Working Group report and the Library’s Digital Strategy.
