Grounding iterative experimentation with LC Labs: CCHC and Machine Learning

Across the last five years, LC Labs experiments have integrated sundry perspectives and disciplines to connect people, practice, and history: from making collections more legible and discoverable through volunteer crowdsourcing efforts with Beyond Words and By the People, to developing frameworks for ethically engaging people when adopting machine learning with Humans in the Loop, to the Innovator in Residence projects Citizen DJ and Newspaper Navigator. We rarely have the opportunity to share the ways these experiments inform one another and how we build upon their outcomes in creative and iterative approaches.

As I reflect on the last year of work on the Computing Cultural Heritage in the Cloud (CCHC) initiative, I see our success tightly connected to the multifaceted explorations that are the hallmark of LC Labs work, and to our investigation of machine learning (ML) and artificial intelligence (AI). CCHC and our ML and AI efforts have grown up alongside one another, and each deeply informs the other. As we close out 2022, it seems fitting to share more about the ways CCHC data packaging and CCHC Data Jam activities have integrated user perspectives and findings from our recent LC Labs experiments in ML, such as our collaboration with Project Aida.

Setting the stage for computational research through experimentation 

Since 2019, we have explored the contours and challenges of bringing machine learning methods into our context as a federal and cultural memory organization. In that time, we have sponsored events like the Machine Learning & Libraries Summit; shared reports about the state of the field; and released open source code and training data, derivative datasets, and Jupyter notebooks. Furthermore, we’ve publicly shared our experiences and reflected on areas we can improve in conference and workshop presentations. You can get up to speed with that body of work on our Machine Learning with LC Labs page.

We surfaced a preliminary set of seven areas for further exploration as a result of what we’ve learned through that machine learning work, CCHC, and other experiments.

A list of seven principles for adopting machine learning derived from LC Labs experimentation, reports, and user feedback.

A preliminary list of LC Labs machine learning recommendations derived from four years of experimentation and collaboration with practitioners and users.

Many of our LC Labs machine learning experiments have focused on what becomes possible when collections data are made machine-readable, item-level metadata are available, and users can understand collections contexts while being supported by subject matter experts, all key findings in our Digital Scholarship Working Group report. Indeed, machine learning is likely to be a key approach to enabling digital scholarship and research with larger aggregates of cultural heritage data, both to transform data for computational uses and as a method to develop new analyses of history and social life.

Our ML recommendations encourage us to invite computational uses of our collections, but only if we simultaneously continue to assess risk and impact for people and collections, as well as resources, at each step. We can tend to this challenge most effectively if we

  1. take small, specific steps;
  2. share knowledge and lessons learned;
  3. center people and integrate a range of knowledge and expertise; and
  4. work together on these shared sector and interdisciplinary challenges.

In CCHC and in our LC Labs practice alike, we’ve started small – taking iterative approaches to test methods and learn by doing with colleagues and partners. We aim to keep this activity practical and transparent by thinking together with practitioners and users, and then sharing the outcomes widely during and after experimentation. Altogether, CCHC has allowed us to put the learning into action, leading to more space for testing emerging hypotheses; as a result, we arrive at even more coherent and shareable machine learning outcomes.

Cultivating iterative solutions in the Computing Cultural Heritage in the Cloud initiative 

CCHC is a specific investigation into how we might support computational uses of Library resources and data, and the ML recommendations have meshed with CCHC in different but interrelated ways in each phase of the initiative. We can look to the CCHC “Evaluate” phase to see how we centered people and their needs, developed iterative (and appropriate) solutions, and explored infrastructure and policy with our data packages and experimental sandbox.

For October’s CCHC Data Jam event, we invited expert researchers to engage with data packages and provide specific feedback. They surpassed our expectations with their insights on access, opportunities, and challenges, and provided more than a dose of delight! To create these data packages, we wove together three elements of recent LC Labs work: the core values for CCHC, the machine learning recommendations above, and LC Labs practice in delivering experimentation prototypes, data, and documentation via cloud services.

We used our CCHC values to guide how we developed contextualized and well-documented work. That approach is just one way we have met our goals of being inviting, transparent, exploratory, and flexible. In previous experiments we’ve developed datasets, READMEs, and other documentation, sharing them in GitHub and serving them via an S3 bucket (see for example Web Archives derivative datasets). Iterating upon these LC Labs tactics, we integrated interdisciplinary user feedback about the ways our collections data are readily available to answer some questions while simultaneously being complex, heterogeneous, insufficient, and revealing of only some of the stories that may be told with them. We have also asked new questions of the collections to capture broader context about the interpretations and decisions surrounding the digital collections as they exist now.

We’ve been working creatively and flexibly in CCHC to respond to the feedback we’ve gathered from users and better address their desire for context. Those interests include understanding the kinds of data present and absent in a given dataset, as well as technical information that captures trails of transformation. We’ve done that by ensuring our data packages consist of at least six components: the dataset itself, metadata (in CSV, JSON, and text), a technical README, a sample dataset, a data processing plan, and a data coversheet. Explore these resources in the Selected Digitized Books Data Package, derived from the growing Selected Digitized Books collection.

Screenshot of the Selected Digitized Books data package landing page, featuring robust details of the dataset, technical details, and other considerations for use with links for downloading these resources

Selected Digitized Books Data Package landing page on data.labs.loc.gov
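
For readers who prefer to see the package structure in code, here is a minimal sketch in Python of checking that a downloaded package folder contains the six component types listed above. The file names and the flat folder layout are hypothetical, chosen only for illustration; they are assumptions and may not match how the Selected Digitized Books Data Package is actually organized.

```python
import tempfile
from pathlib import Path

# Hypothetical file names. The post names six component types;
# it does not say how the files are named or arranged on disk.
REQUIRED_COMPONENTS = {
    "dataset": ["dataset.zip"],
    "metadata": ["metadata.csv", "metadata.json", "metadata.txt"],
    "technical README": ["README.md"],
    "sample dataset": ["sample.zip"],
    "data processing plan": ["data_processing_plan.pdf"],
    "data coversheet": ["data_coversheet.pdf"],
}


def missing_components(package_dir):
    """Return the component names for which any expected file is absent."""
    root = Path(package_dir)
    missing = []
    for component, filenames in REQUIRED_COMPONENTS.items():
        # A component counts as present only if all of its files exist.
        if not all((root / name).is_file() for name in filenames):
            missing.append(component)
    return missing


if __name__ == "__main__":
    # Build a throwaway folder that lacks only the coversheet, then check it.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for component, filenames in REQUIRED_COMPONENTS.items():
            if component == "data coversheet":
                continue
            for name in filenames:
                (root / name).touch()
        print("Missing:", missing_components(root))
```

A check like this could run before an analysis begins, so that a researcher knows up front whether the documentation that gives the dataset its context came along with the data.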

In addition, the data processing plan and data coversheet allow us to model potential replicable processes that can enable computational research. Calls for refined, specific, and detailed documentation in Always Already Computational: Collections as Data and in T. Gebru et al.’s Datasheets for Datasets (2018, 2021), along with our participation in AI4LAM and federal AI communities of practice, encouraged us to think through data dimensions and transformations and their impact on future uses, whether explicitly for machine learning or for other analyses. We are also committed to preventing harm to subjects and users, both direct and implicit, and to articulating the ways absences could proliferate damaging interpretations through use of our resulting datasets.

Furthermore, while designing data packages that can be created, maintained, and enhanced in our context, we are exploring infrastructure considerations and staffing requirements. We continue to refine how we share data: by developing data packages consisting of datasets and documentation to be used in computational research and by launching an LC Labs experimental sandbox. The CCHC data.labs.loc.gov sandbox offers opportunities to test documentation, data transformation, and data management, and to improve upon our efforts through further experimentation. Since hosting our CCHC expert researchers last year, we have investigated pathways for data transformation to support large-scale computational research, explicitly working to identify the required skills and infrastructure for these complex, iterative queries and analyses.

This phase of our CCHC work reflects our intent to support computational approaches, and machine learning specifically, in ways that are practical, coherent, transparent, and accountable. It’s rewarding and exciting to build upon our previous LC Labs experimentation and collaborations, and to move intentionally to support broader use of our vast collections. We’d love to hear what seems valuable to you in the comments below.

I’d like to close out 2022 by thanking the researchers who have shared their feedback and experiences with CCHC: Lincoln Mullen, Lauren Tilton, Taylor Arnold, Andromeda Yelton, Quinn Dombrowski, Nichole Nomura, Zoe LeBlanc, Vikram Mohanty, Aaron Straup Cope, Daniel Van Strien, and Tim Sherratt. I’m also grateful for the collaboration, support, and insight of our Library colleagues, without whom our work would not be possible, and of my LC Labs and CCHC colleagues.

Get Involved in CCHC and LC Labs experiments

We still have much to learn! Here are a few ways you can be involved with CCHC and contribute to our experimentation. First, find details on our website on the Computing Cultural Heritage in the Cloud experiment page. We will release our broad service model recommendations in the coming months and we’d love to talk with you about the ways you could benefit from staff support, cloud services, data packaging, and other training.

We are also seeking feedback from folks who are interested in using the Library’s collections as data at scale and who are asking critical questions of data, documentation, and presentation. Please get in touch with us at [email protected] if you’d be willing to help us improve our data packages. And thank you if you’ve already shared your feedback! You can also follow along with our ML experiments and let us know how you’d apply and adapt resources like the data processing plan, risk matrix, and conceptual framework on our Machine Learning with LC Labs page.
