Before I joined LC Labs, I was an English teacher. In my classroom, the word “derivative” had a negative connotation. To be derivative was to be overly indebted to another idea and thus to lack ingenuity and creativity.
When applied to datasets, however, “derivatives” abound. In fact, as you’ll see, derivative datasets sometimes serve a critical purpose in making large digital files more available to potential users. In this context, derivative is not a slight. The process of altering a dataset can be essential to a digital project, whether that “data transformation” is standardizing information, removing extraneous information, reformatting, or other tasks.
A new dataset resulting from that data transformation can be considered a “derivative dataset.” The process of alteration changes things about the primary file, including its size, format, and the information it contains. Therefore, it’s important to document any changes you make and to save the result as a new version so you can revert to the original if needed. You’ll also want to keep track of the editorial decisions you made along the way. Summer interns working with the Digital Strategy Directorate explored dataset transformations and their effects in a design sprint this summer; check out their posts on The Signal to learn more about the ways they approached understanding and designing around derivative datasets.
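To make that concrete, here is a minimal sketch in Python (using the pandas library) of a transformation that produces a derivative dataset while leaving the original file untouched and logging what changed. The file names and the "year" column are hypothetical stand-ins, not a real Library dataset.

```python
from datetime import date
from pathlib import Path

import pandas as pd

# Hypothetical file names: an original download and the derivative created from it.
ORIGINAL = Path("original_export.csv")
DERIVATIVE = Path("derivative_2000_onward.csv")
CHANGELOG = Path("CHANGELOG.txt")

# Never edit the original in place; load a copy into memory instead.
df = pd.read_csv(ORIGINAL)

# Example transformation (the "year" column is a made-up stand-in):
# keep only rows from the year 2000 onward.
subset = df[df["year"] >= 2000]

# Write the result to a *new* file, leaving the original untouched.
subset.to_csv(DERIVATIVE, index=False)

# Record the editorial decision so the derivative can be traced back to its source.
with CHANGELOG.open("a") as log:
    log.write(
        f"{date.today()}: created {DERIVATIVE.name} from {ORIGINAL.name}; "
        f"kept rows with year >= 2000 ({len(subset)} of {len(df)} rows).\n"
    )
```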
I got to thinking about derivative datasets when collaborating with Peter DeCraene, the Albert Einstein Distinguished Educator Fellow at the Library of Congress. Peter has been teaching math and computer science for 31 years. At the Library, he has been investigating charts, graphs, and other forms of data (re)presentations in the Library’s collections. By working closely with the team behind the Teaching with Primary Sources program, Peter has been creating ideas for lesson plans to use items from our digital collections in the K-12 classroom.
Eileen Jakeway Manchester, Innovation Specialist, LC Labs
Peter DeCraene, Albert Einstein Distinguished Educator Fellow
Peter and I initially met due to our shared background in teaching and our interest in Library collections data. Our project grew out of a pretty simple question: “What would it mean to treat a Library of Congress dataset as a primary source?”
This work builds on the strong foundation of pedagogical materials that the Teaching with Primary Sources program has made available, as well as on the pioneering work, driven by the goals of the Digital Strategy, to get more digital collections online and ready for computational use.[1]
We took a two-pronged approach. First, we wanted to start with a general inquiry that would lead to best practices: questions to ask of any data source. Second, we thought we could shed light on the technical and pedagogical complexity of this topic by working together on a specific case study and documenting the process of using a dataset from start to finish.
To narrow our scope, we decided to work with the Selected Datasets collection on loc.gov. As you can see in our recent post, this collection has just turned one! We examined all 99 datasets in the collection at the time (there are now over 150). Through a lot of careful analysis, we identified some qualities we thought would be important to teachers. To that end, we decided to focus on datasets that were:
- Tabular
- Manageable in size (measured in MB not GB)
- On loc.gov
- Available for download as a zipped folder/file
- Recommended by Library staff who are knowledgeable about the subject matter.
Our short list included…
- Grand Comics Database dataset (computer science, English)
- National Enquirer Index dataset (computer science, history)
- Free Music Archive: A dataset for music analysis (computer science, music, history)
- The GIPHY web archive (history/social studies, English, computer science)
In the end, we settled on the Grand Comics Database for the following reasons:
- We thought that the subject matter would be appealing to students.
- The information was stored in database files, which we could access (with some effort) to create derivative datasets as CSV files.
- Column headings were clearly organized.
- There was a mix of textual and numeric data.
- The database file is large (~200MB), but it can be read with a text editor and converted into more manageable formats.
The Library started collecting datasets from the Grand Comics Database (GCD) in 2018 as part of a collaboration between the Digital Content Management Section and the Library’s resident comic book reference librarian, Megan Halsband. The GCD is a non-profit, user-generated, internet-based resource that provides extensive indexing for comic books; in short, it is one of the most extensive resources on comic book creators, artists, series, and other content.
Initially identified for inclusion in the Comics Literature & Criticism Web Archive, the database itself could not be preserved with current web archiving tools. Halsband, who is based in the Serial & Government Publications Division, and staff from DCMS worked with representatives from the GCD to establish a workflow and procedure for obtaining datasets, as well as for acquiring backfiles. As comic book nerds ourselves, we were excited to find this treasure trove of information freely available on the Library’s website!
There was only one problem: this database was, well, a database. The data is stored in a dynamic database (in this case, an SQL dump), as opposed to a flat dataset file that can be more easily manipulated (e.g., CSV or TSV). As we later learned, SQL is one of the recommended file formats for databases for digital preservation purposes. But neither of us had much experience working with SQL. After some exploration, we decided that in order to truly wrap our heads around this dataset, we needed it in a smaller, more easily manipulable format like CSV. So, we did what lots of data librarians and digital scholars do, and made a derivative dataset in the form of a spreadsheet!
Given our research interests and our limited technical experience with databases, we decided it would be easiest to work in a CSV file, a format that can be opened, viewed, and easily edited using spreadsheet programs like Microsoft Excel. Our subset included entries in the Grand Comics Database from 2000 to 2018 and contained a selection of fields describing each comic book series.
Our derivative dataset contained 2,205 entries (shown as rows in a spreadsheet), which was much easier for us to manage and explore. The tabular format of the data was well-suited to Peter’s interests for computer science lesson plans. Furthermore, this process allowed us to better understand the steps it takes to transform larger datasets and to create resources that may allow teachers to use the primary and derivative datasets in their classrooms.
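To give a sense of how approachable the derivative is once it exists, here is a minimal Python sketch of loading and exploring a file like ours with pandas. The file name and column name are hypothetical placeholders; your own derivative’s fields will depend on what you exported.

```python
import pandas as pd

# Hypothetical file and column names for a GCD-style derivative dataset.
df = pd.read_csv("gcd_series_2000_2018.csv")

# Quick orientation: how many rows and columns, and what are the fields called?
print(df.shape)
print(df.columns.tolist())

# A simple exploratory question: how many series began in each year?
print(df.groupby("year_began").size())
```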
All in all, both the primary Grand Comics Database dataset and our derivative were essential to our project and pedagogical questions. However, this raises another question as well: if our goal is to treat datasets as “primary sources,” is a derivative truly still that? Or does it, by definition, stand apart as a new and therefore different version of the original source? We’re going to ponder these and other questions as our work continues. If you have thoughts, let us know in the comments!
We’re excited to share about this project with Signal readers because it illustrates one way we at LC Labs are working with our peers across the Library of Congress to throw open the treasure chest of our rich collections.
[1] See the Digital Scholarship Working Group report and the Digital Strategy.
Comments (7)
I am grappling with this myself currently on a much smaller scale. I have data that was created in an internal workflow that has great potential for researchers, but I find it tricky to verify whether it is “up to standards.” I did not create the original dataset and am leery of the public misinterpreting it and using it in ways that do not align with the actual data presented. People think data is impartial, but so much of how it is/was collected is interpretive.
Hi, Jessica,
Thank you for this comment! I’m so glad our questions resonate in your context. The questions raised by the Library’s Primary Source Analysis Tool (https://www.loc.gov/static/programs/teachers/getting-started-with-primary-sources/documents/Analyzing_Primary_Sources.pdf) ask students to observe, reflect, and question. Clearly, this process is as important to our study of datasets as it is to any other primary source.
I’m interested in how we might use datasets in the classroom. Will Peter be sharing some of his ideas for computer science lesson plans?
Jill,
Thanks for your interest in using datasets in the classroom! Peter has asked me to share his response with you: “I am currently working on a set of posts, some of which will be posted to the Teaching with the Library of Congress blog: https://blogs.loc.gov/teachers/. The challenges lie in sorting through available data and identifying the additional tools or understanding we need to access and interpret the data. Since different types of data require different solutions, my blog posts will start with some general ways to analyze a dataset as a primary source, based on the Library’s analysis tool: https://www.loc.gov/static/programs/teachers/getting-started-with-primary-sources/documents/Analyzing_Primary_Sources.pdf. Doing some initial observing, reflecting, and wondering about a dataset or about representations of that data is a great way for computer science students to develop a critical lens before they jump in with technical ideas. Stay tuned for more details, and please feel free to share what you are doing as well! This is a learning process for me, and I welcome the thoughts and experiences of others.”
I’m working on a similar problem of converting an SQL database to CSV, but with little experience using SQL. Would it be possible to share more information about the process you used to access and convert the data? (Your comment about doing this “with some effort” is suggestive of a larger story, and echoes my own experience so far!)
Hi, Mara,
And thanks for your comment! After much trial and error, I was able to use a tool called phpMyAdmin to load the SQL dump and render the entire database. This allowed me to more easily view and search through each of the tables in the database. Since we wanted to work with a smaller file, I did not export all of the tables to CSV. As mentioned in the blog post, we decided to work with entries from the year 2000 onward, and we also limited our search to certain fields mentioned in the post. Once I had a sense of the search parameters, I was able to run an SQL query and export the results of that query as a CSV file.
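For anyone who would rather script that last step than click through the export screen, here is a rough Python sketch of the same idea. It assumes the SQL dump has already been imported into a local MySQL server (the same state phpMyAdmin works against), and the connection string, table, and column names are illustrative placeholders rather than the actual GCD schema.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials and database name for a local MySQL server
# into which the SQL dump has been imported.
engine = create_engine("mysql+pymysql://user:password@localhost/gcd")

# Illustrative table and column names; adjust to match the real schema.
query = """
    SELECT name, year_began, publisher_id, country_id, language_id
    FROM series
    WHERE year_began BETWEEN 2000 AND 2018
"""

# Run the query and write the results out as a CSV derivative.
subset = pd.read_sql(query, engine)
subset.to_csv("gcd_series_2000_2018.csv", index=False)
```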
The Signal may wish to look at the Open Knowledge Foundation (OKF) Frictionless Data concept. It allows one to attach metadata to a CSV file and thereby better support provenance tracking without the need to resort to a full database management system. Moreover, the underlying text files can be put under local version control, again to support provenance and rollback, or made available on GitHub or similar for collaborative editing. https://blog.okfn.org/2020/10/08/announcing-the-new-frictionless-framework/