It’s a bird, it’s a plane, it’s a…derivative dataset!

Before I joined LC Labs, I was an English teacher. In my classroom, the word “derivative” had a negative connotation. To be derivative was to be overly indebted to another idea and thus to lack ingenuity and creativity.

When applied to datasets, however, “derivatives” abound. In fact, as you’ll see, derivative datasets sometimes serve a critical purpose in making large digital files more available to potential users. In this context, derivative is not a slight. The process of altering a dataset can be essential to a digital project, whether that “data transformation” involves standardizing information, removing extraneous information, reformatting, or other tasks.

A new dataset resulting from that data transformation can be considered a “derivative dataset.” The process of alteration changes things about the primary file, including its size, format, and the information it contains. Therefore, it’s important to document any changes you make and always save a new version so you can revert to the original if needed. You’ll also want to keep track of the editorial decisions you made along the way. Summer interns working with the Digital Strategy Directorate explored dataset transformations and their effects in a design sprint this summer. Check out their posts on the Signal to learn more about how they approached understanding and designing around derivative datasets.
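One way to keep track of changes is to store a small record next to each derivative file. The sketch below is only an illustration, not a Library workflow: the function name `describe_derivative`, the field names, the date, and the placeholder data are all hypothetical. It records a checksum and size for both versions, plus a plain-language list of the editorial decisions.

```python
import hashlib
import json
from datetime import date


def describe_derivative(original, derivative, changes, made_on):
    """Build a small provenance record linking a derivative to its source."""
    return {
        "original_sha256": hashlib.sha256(original).hexdigest(),
        "derivative_sha256": hashlib.sha256(derivative).hexdigest(),
        "original_size_bytes": len(original),
        "derivative_size_bytes": len(derivative),
        "date": made_on.isoformat(),
        "changes": list(changes),
    }


# Placeholder bytes stand in for real file contents (hypothetical data).
original = b"id,name,notes\n1,Example Series,internal note\n"
derivative = b"id,name\n1,Example Series\n"

record = describe_derivative(
    original,
    derivative,
    changes=["Removed the notes column"],
    made_on=date(2021, 11, 1),  # hypothetical date
)

# The record could be saved as a JSON file alongside the derivative.
print(json.dumps(record, indent=2))
```

Because the original is never overwritten and its checksum is written down, anyone can later confirm which source file a derivative came from.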

I got to thinking about derivative datasets when collaborating with Peter DeCraene, the Albert Einstein Distinguished Educator Fellow at the Library of Congress. Peter has been teaching math and computer science for 31 years. At the Library, he has been investigating charts, graphs, and other forms of data (re)presentations in the Library’s collections. By working closely with the team behind the Teaching with Primary Sources program, Peter has been developing ideas for lesson plans that use items from our digital collections in the K-12 classroom.

Eileen Jakeway Manchester, Innovation Specialist, LC Labs
Peter DeCraene, Albert Einstein Distinguished Educator Fellow

Peter and I initially met due to our shared background in teaching and our interest in Library collections data. Our project grew out of a pretty simple question: “what would it mean to treat a Library of Congress dataset as a primary source?”

This work builds on the strong foundation of the pedagogical materials that the Teaching with Primary Sources program has made available as well as the pioneering work being driven by the goals of the Digital Strategy to get more digital collections online and ready for use with a computer.[1]

We took a two-pronged approach. We wanted to start with a general inquiry that would lead to best practices for questions to ask of ALL data sources. And then we thought we could really shed light on the technical and pedagogical complexity of this topic by working together on a specific case study and documenting the process of using a dataset from start to finish.

To narrow our scope, we decided to work with the Selected Datasets collection on loc.gov. As you can see in our recent post, this collection has just turned one! We examined all 99 datasets that were in the collection at the time (there are now over 150). Through a lot of careful analysis, we identified some qualities we thought would be important to teachers. To that end, we decided to focus on datasets that were:

  • Tabular
  • Manageable in size (measured in MB not GB)
  • On loc.gov
  • Available for download as a zipped folder/file
  • Recommended by Library staff who are knowledgeable about the subject matter.

Our short list included…

In the end, we settled on the Grand Comics Database for the following reasons:

  • We thought that the subject matter would be appealing to students.
  • The information was stored in database files, which we could access (with some effort) to create derivative datasets as .csv files.
  • Column headings were clearly organized.
  • There was a mix of textual and numeric data.
  • The database file is large (~200MB) but can be read with a text editor and converted into more manageable formats.

The Library started collecting datasets from the Grand Comics Database (GCD) in 2018 as part of a collaboration between the Digital Content Management Section (DCMS) and the Library’s resident comic book reference librarian, Megan Halsband. The GCD is a non-profit, user-generated, internet-based resource that provides extensive indexing for comic books. In short, it is one of the most extensive resources on comic book creators, artists, series, and other content.

The database was initially identified for inclusion in the Comics Literature & Criticism Web Archive, but it could not be preserved with current web archiving tools. Halsband, who is based in the Serial & Government Publications Division, and staff from DCMS worked with representatives from the GCD to establish a workflow and procedure for obtaining datasets, as well as acquiring backfiles. As comic book nerds ourselves, we were excited to find this treasure trove of information freely available on the Library’s website!

There was only one problem: this database was, well, a database. The data is stored in a dynamic database (e.g., SQL), as opposed to a flat dataset file that can be more easily manipulated (e.g., CSV, TSV). As we later learned, SQL is one of the recommended file formats for databases for digital preservation purposes. But neither of us had much experience working with SQL. After some exploration, we decided that in order to truly wrap our heads around this dataset, we needed it in a smaller, more easily manipulable format like CSV. So, we did what lots of data librarians and digital scholars do, and made a derivative dataset in the form of a spreadsheet!
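For readers curious what a database-to-spreadsheet step can look like, here is a minimal sketch in Python. It is not the exact procedure we followed. It assumes the data has already been loaded into a database, uses Python’s built-in sqlite3 module as a stand-in, and invents a table called `series` with a few placeholder rows. The function name `export_series_subset` is also hypothetical.

```python
import csv
import io
import sqlite3


def export_series_subset(conn, out_file, first_year, last_year):
    """Write series that began in [first_year, last_year] as CSV; return the row count."""
    cursor = conn.execute(
        "SELECT name, year_began, issue_count FROM series "
        "WHERE year_began BETWEEN ? AND ? ORDER BY year_began, name",
        (first_year, last_year),
    )
    writer = csv.writer(out_file)
    # The header row comes from the column names of the query result.
    writer.writerow([column[0] for column in cursor.description])
    rows = cursor.fetchall()
    writer.writerows(rows)
    return len(rows)


# Hypothetical table and placeholder rows, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE series (name TEXT, year_began INTEGER, issue_count INTEGER)")
conn.executemany(
    "INSERT INTO series VALUES (?, ?, ?)",
    [
        ("Example Series A", 1995, 12),
        ("Example Series B", 2004, 30),
        ("Example Series C", 2018, 3),
    ],
)

# An in-memory buffer stands in for a .csv file on disk.
buffer = io.StringIO()
written = export_series_subset(conn, buffer, 2000, 2018)
print(written)
print(buffer.getvalue())
```

To write a real file instead, the buffer could be replaced with `open("subset.csv", "w", newline="")`. The year filter is passed as query parameters rather than pasted into the SQL text, which keeps the query easy to reuse for other date ranges.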

Given our research interests and limited technical experience with databases, we decided it would be easiest for us to work in a CSV file, a format that can be opened, viewed, and easily edited using spreadsheet software programs like Microsoft Excel. Our subset included entries in the Grand Comics Database from 2000 to 2018 and contained the following information about comic book series:

  • name,
  • sort_name,
  • format,
  • year_began,
  • year_began_uncertain (yes/no),
  • year_ended,
  • year_ended_uncertain,
  • publication_dates,
  • first_issue_id,
  • last_issue_id,
  • is_current,
  • publisher_id,
  • country_id,
  • language_id,
  • tracking_notes,
  • notes,
  • publication_notes,
  • has_gallery,
  • open_reserve,
  • issue_count,
  • created,
  • modified,
  • reserved,
  • deleted,
  • has_indicia_frequency,
  • has_isbn,
  • has_barcode,
  • has_issue_title,
  • has_volume,
  • is_comics_publication,
  • color,
  • dimensions,
  • paper_stock,
  • binding,
  • publishing_format,
  • has_rating,
  • country_name,
  • publisher_name, and
  • language_name.

Our derivative dataset contained 2,205 entries (shown as rows in a spreadsheet), which was much easier for us to manage and explore. The tabular format of the data was well-suited to Peter’s interests for computer science lesson plans. Furthermore, this process allowed us to better understand the steps it takes to transform larger datasets and to create resources that may allow teachers to use the primary and derivative datasets in their classrooms.
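A first classroom exercise with a derivative CSV might be to confirm its shape: which columns it has, how many rows, and how the rows are distributed. The sketch below is one possible approach using only Python’s standard library. The function name `summarize_csv` and the three sample rows are made up for illustration, and the column names are borrowed from the list above.

```python
import csv
import io
from collections import Counter


def summarize_csv(text_file, count_by):
    """Return (column names, number of rows, tally of values in the count_by column)."""
    reader = csv.DictReader(text_file)
    tally = Counter()
    total = 0
    for row in reader:
        total += 1
        tally[row[count_by]] += 1
    return reader.fieldnames, total, tally


# Placeholder rows, not real GCD records.
sample = io.StringIO(
    "name,year_began,language_name\n"
    "Example Series A,2004,English\n"
    "Example Series B,2004,French\n"
    "Example Series C,2018,English\n"
)

columns, total, per_year = summarize_csv(sample, "year_began")
print(columns)
print(total)
print(per_year.most_common())
```

With a file on disk, the same function could be called on `open("subset.csv", newline="")`. Values are read as text, so the years here are tallied as strings such as "2004" rather than as numbers.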

All in all, both the Grand Comics Database primary and derivative datasets were essential to our project and pedagogical questions. However, this raises another question as well: if our goal is to treat datasets as “primary sources,” then is a derivative truly still a primary source? Or does it, by definition, stand apart as a new and thus different version of the original source? We’re going to ponder these and more questions as our work continues. If you have thoughts, let us know in the comments!

We’re excited to share this project with Signal readers because it illustrates one way we at LC Labs are working with our peers across the Library of Congress to throw open the treasure chest of our rich collections.

[1] See Digital Scholarship Working Group report and the Digital Strategy.

One Comment

  1. Jessica
    November 23, 2021 at 10:12 am

    I am grappling with this myself currently on a much smaller scale. I have data that was created in an internal workflow that has great potential for researchers but I find it tricky to verify if “up to standards”. I did not create the original dataset and am leery of the public misinterpreting it and using it in ways that do not align with the actual data presented. People think data is impartial but so much of how it is/was collected is interpretive.
