In October 2022, the Computing Cultural Heritage in the Cloud (CCHC) team held a virtual Data Jam featuring speakers from higher education, library, and museum organizations around the world who presented feedback on working with Library of Congress data in the cloud. The LC Labs team recorded our event and invite you to watch it now on the Library’s website! In this accompanying blog post, I summarize some of our team’s highlights from an engaging discussion about the complexities of accessing and analyzing digital cultural materials as datasets.
Led by Senior Innovation Specialist Meghan Ferriter, the Computing Cultural Heritage in the Cloud (CCHC) initiative is one of our team’s latest efforts to make digital materials available for bulk analysis as datasets. Funded by a $1 million grant from the Mellon Foundation, CCHC investigates the service models, cost implications, and technical affordances of providing access to cultural heritage collections as data in cloud-based environments. Grant activity began in October 2019 and has spanned multiple phases.
In CCHC’s penultimate “Evaluate” phase, we put findings from previous phases into action. As you can read about in this blog post, Chase Dooley (on detail from the Web Archives Team) and I first worked in close collaboration with custodial and technical Library staff to design processes for compiling, documenting, and publishing digital collections materials as derivative datasets. But work didn’t stop there.
After compiling the data packages, our team recruited seven experienced professionals with both data and research skills who could provide feedback on accessing Library data predominantly via computer programs by participating in the CCHC Data Jam. We moved the data packages to the experimental s3://data.labs.loc.gov sandbox space in the cloud and assigned each data wrangler a data package and computational access method.
The goal: for each user to provide as much real-life, authentic feedback as possible on the experience of accessing, analyzing, and representing Library of Congress data in this way.
Dr. Zoe LeBlanc, Assistant Professor of Information Sciences, University of Illinois Urbana-Champaign, and Aaron Straup Cope, Head of Internet Typing, San Francisco Aviation Museum and Library, worked with a bulk set of over 30,000 digitized stereograph cards, which they accessed via command line interface. In her presentation, LeBlanc artfully demonstrated the limitations of certain machine learning methods, such as computer vision, in clustering historic images. Her investigation of the Stereograph Cards collection surfaced variations in metadata and descriptive language that were in themselves historic artifacts documenting the collection’s cataloging history.
Daniel van Strien, Digital Curator with the Living with Machines project at the British Library, and Vikram Mohanty, PhD candidate in the Department of Computer Science at Virginia Tech University, analyzed close to 5,000 historic digitized map sheets surveying the expanse of the Austro-Hungarian Empire in the 19th and 20th centuries.
Finally, Dr. Tim Sherratt, Associate Professor of Digital Heritage, University of Canberra; Quinn Dombrowski, Academic Technology Specialist at Stanford University Library; and Nichole Misako Nomura, PhD candidate in Stanford University’s Department of English, showcased what could be done with the text from over 90,000 digitized books. For example, Sherratt used the OCRed text from the 83,135 digitized books dataset to develop the demo application, “American World Gazetteer.” When a user selects a location on this interactive map, they’re presented with a sentence from a digitized book mentioning the name of that place. By contrast, Dombrowski and Nomura demonstrated the methodological challenge of filtering down the corpus to a subset of materials related to their specific research question about historic children’s literature.
Happily, every researcher was able to retrieve their data package successfully! Participants found the REST API method poorly documented and difficult to use, especially compared with the command line interface and software development kits, which have many available online resources and community forums. The data packages themselves were rated favorably as high quality, with areas for improvement centered on issues of scale, errors in metadata, and/or the need for expert counsel on the particularities of the information represented. Finally, each presentation highlighted the trial-and-error nature of computational research: all users reported troubleshooting technical barriers and wanting more time to understand the characteristics of the data they were working with.
At our virtual event in October, Library staff and Data Jam participants also brainstormed additional resources that could help users better navigate and understand Library data. Their suggestions included creating visual representations of datasets, expanding tutorials, and building a community of experts to provide peer support for users. We’ll pair these reflections with the insights gained from our previous cohort of invited researchers, who ended their tenure in January, to inform CCHC’s recommendations report, forthcoming on the Computing Cultural Heritage in the Cloud experiment page.
See some of these presentations for yourself by watching the event recording of the CCHC Data Jam Showcase on the loc.gov website. Email us at [email protected] with questions and don’t forget to subscribe to the Signal Blog if you haven’t already!
It was great to read this and the preceding blog on the Library’s initial provision of access to humanities-oriented digital materials as datasets. It is an impressive effort, well thought out. And, not surprisingly, this blog signals (no pun) the family of tricky challenges to prepare and exploit this type of content. In the age of AI, it was also fun to see the continuing dependency on human effort (no surprise!) in the preparation of datasets like these. Alas, no free lunch (yet). Best wishes as this effort continues, and thanks to the Mellon foundation for the support!