Stay “in the loop” with LC Labs experiment combining crowdsourcing and machine learning

In 2020, LC Labs began the Humans in the Loop experiment to explore ways to responsibly combine crowdsourcing experiences and machine learning workflows.

As you may know from following LC Labs’ investigations into these methods, machine learning relies on pattern recognition and on training decisions made by human annotators, which makes it very good at reproducing past classifications. However, complexities emerge in accounting for human bias and error in machine learning, especially given its potential to replicate and even proliferate bias and harmful effects. The practice therefore calls for methodical treatment, even in cases where it can be used to make massive corpora more searchable, as 2020 Library of Congress Innovator in Residence Ben Lee demonstrated in his Newspaper Navigator experiment. Meanwhile, data generated by crowdsourcing participants show promise as training data, but only if participants are fully informed. This type of engagement with participants would also require carefully designed workflows and communications strategies.

For Humans in the Loop, we are collaborating with data management solutions provider AVP as they develop a framework for using crowdsourcing to incorporate human feedback into training data, and into the results they drive, in ways that are ethical, engaging, and useful. The experiment aims to create an experience that is both engaging and educational for users. By providing scaffolding and contextualization, we hope it will also produce ethically created training data that machine learning can use to enrich the collections. In upcoming experiments, AVP will prototype workflows for combining crowdsourced human expertise with machine learning. One workflow will use human-generated input as the data on which to train a machine learning model; for example, the model learns what a ‘cat’ is based on what users have selected as ‘cats’ vs. ‘not-cats.’ Another prototype will incorporate human feedback into machine-generated results. This process is often called validation: users confirm or deny whether what the machine reads as a ‘cat’ is in fact a cat. These data are then used to train the algorithm and guide its future predictions.
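To make the two workflows concrete, here is a minimal sketch in Python. It is an illustration only, not AVP's design: the function names, the majority-vote rule, and the sample image IDs are all assumptions made for this example. The first function turns crowd selections into training labels, and the second turns confirm/deny validation responses into additional labels.

```python
from collections import Counter

# Hypothetical two-label setup taken from the 'cat' example in the text.
LABELS = ("cat", "not-cat")


def aggregate_labels(votes):
    """Workflow 1 (assumed rule): majority vote per item; tied items are left out."""
    training = {}
    for item_id, labels in votes.items():
        top = Counter(labels).most_common(2)
        if len(top) == 2 and top[0][1] == top[1][1]:
            continue  # no clear majority, so the item is not used for training
        training[item_id] = top[0][0]
    return training


def apply_validation(predictions, reviews):
    """Workflow 2: a user confirms (True) or denies (False) each machine prediction."""
    new_labels = {}
    for item_id, confirmed in reviews.items():
        predicted = predictions[item_id]
        if confirmed:
            new_labels[item_id] = predicted
        else:
            # With only two labels, a denial implies the other label.
            new_labels[item_id] = LABELS[1] if predicted == LABELS[0] else LABELS[0]
    return new_labels


# Made-up sample data for illustration.
votes = {
    "img1": ["cat", "cat", "not-cat"],
    "img2": ["not-cat", "not-cat"],
    "img3": ["cat", "not-cat"],  # tie: skipped
}
training = aggregate_labels(votes)

predictions = {"img4": "cat", "img5": "cat"}
reviews = {"img4": True, "img5": False}
training.update(apply_validation(predictions, reviews))
```

In this sketch the combined `training` dictionary would be what a model is retrained on. A real workflow would also need to record who contributed each label and how participants were informed, which the sketch leaves out.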

The Humans in the Loop experiment builds directly on LC Labs’ sustained exploration of machine learning in cultural heritage for tasks such as pre-processing, segmentation, classification, clustering, transcription, and extraction. An example is the Speech-to-Text viewer experiment designed by colleagues in the Library’s Office of the Chief Information Officer and American Folklife Center that tested the feasibility of using an out-of-the-box speech-to-text solution on digital spoken-word collections held by AFC. In 2019, the team partnered with the Project AIDA researchers on a series of demonstration projects applying machine learning to Library of Congress collections in different ways. Project results and Library-specific recommendations can be found in their Digital Libraries, Intelligent Data Analytics, and Augmented Description report and GitHub code repository.

Screenshot of eight scanned manuscript pages. Visual content is identified on the page with yellow and red markings.

Screenshot of the findings presented in the Project AIDA team’s report.

In September 2019, LC Labs hosted the Machine Learning + Libraries Summit, convening over 75 cultural heritage practitioners and machine learning experts. The event coincided with the announcement of Ben Lee as one of the 2020 Innovators in Residence alongside Brian Foo. Lee’s Newspaper Navigator project was released in 2020 and used a machine learning algorithm to identify, segment, and search all of the visual content in the Chronicling America database of historic newspapers. Innovator Brian Foo also used machine learning to identify, classify, and cluster samples of music from Library of Congress collections in his design of the Citizen DJ experiment. Finally, LC Labs commissioned Professor Ryan Cordell to conduct a comprehensive survey of the state of the field regarding machine learning and libraries. In his final report, Cordell built on some of the AIDA team’s recommendations and laid out steps for cultivating responsible ML in libraries.

The front page of a West Virginia newspaper with red boxes around visual content labeled "comics" and a purple box around visual content titled "photograph."

Screenshot of the Newspaper Navigator algorithm being used to identify and categorize visual content on a newspaper page.

Humans in the Loop is both an enactment of the Digital Strategy goal to throw open the treasure chest via computational means and a response to the recommendations made in the reports mentioned above. The University of Nebraska-Lincoln team’s top recommendations focused on developing “social and technical infrastructures” and investing in “intentional explorations and investigations of particular machine learning applications” (30). Humans in the Loop works to achieve both these goals.

Similar to the design principles that guided the development of By the People, the Library’s crowdsourced transcription program, the values guiding Humans in the Loop scrutinize the decisions underlying ML technology to redress bias and mitigate risk. One desired outcome of the project is that increased exposure to machine learning algorithms at work will lead to greater literacy about this technology. The project team’s hope is that users’ participation in the process will reveal the ways in which machine learning relies on human subjectivity and decision-making rather than objective, or neutral, classification.

As Cordell writes, “one of largest challenges facing library ML work is the labor required to create meaningful training data, and crowdsourcing efforts hold much potential for addressing that need” (18). The design of projects that combine the two thus merits careful thought and thorough investigation. When done well, the pay-off can be remarkable. A great example is the Beyond Words experiment designed by Staff Innovator Tong Wang. The application was not only incredibly popular and fun for users who wanted to dig into WWI-era newspapers but also generated derivative data that was instrumental both to the demonstration prototypes done by the University of Nebraska-Lincoln and to the Newspaper Navigator application. Without this wealth of crowd-created data, which was released into the public domain as it was created, neither of these projects would have been possible. Humans in the Loop pilots the creation of interfaces that intentionally combine crowdsourcing and machine learning in the same space.

We will share more information about the project, including the collections being used in the experiment and a call for user testing of prototypes, soon.

If you have questions in the meantime or would like to sign up to test these prototypes, get in touch by email at [email protected].
