Why Machine Learning?
Everyone at the Library of Congress wants the materials we steward and the services we offer to be useful for as many people as possible. It’s why we do what we do! And across the Library, staff have long relied on technological innovations to enable people to use our materials to become more informed, engaged, and inspired. Staff at the Library develop and use technology to preserve collections materials, to make them more available, and to help people more easily find the unique treasures we hold.
However, given the spectacular range of materials the Library stewards – music, film, websites, artwork, photographs, Copyright records, legislative information, not to mention the books – we’re only scratching the surface of what’s possible. Our vision as an institution is for all Americans to be connected to the Library, and we are continuously inspired by the idea that Congress and the American people might one day be able to explore everything at the Library easily, seamlessly, and creatively, from wherever they are.
If we hope to meet our vision, we need to try out bold new approaches that have the potential to radically change our practices. And we must match these creative explorations with the enormous care that our materials and their potential users deserve.
In LC Labs, one of our jobs is to investigate new methods and technology. Through our collaborations and research initiatives, we aim to discover where we might connect with creative partners, explore emerging approaches, and lower barriers to innovation to help the Library broaden its reach. Machine learning experiments have proliferated over the past several years: in our own purposeful explorations of the technology's possibilities, in the work of our Innovators in Residence, and in the projects of researchers we worked with as part of a grant-funded exploration of cloud infrastructure opportunities.
Why experiment? Why not just implement?
With all the possibilities, why aren't we already embedding machine learning into everything we do? The answer is complicated, but at its simplest: in most cases, machine learning (ML) and artificial intelligence (AI) tools haven't demonstrated that they can meet our very high standards for responsible stewardship of information without significant human intervention.
As the Library of Congress, we have a responsibility to the American public. As shepherds of the largest library in the world, we also have a responsibility to all curious people, and especially to those whose stories we hold. We know how important it is to get information right, and we won’t implement technology that automates our work without thorough vetting to make sure we aren’t compromising our trustworthiness or the authenticity of the information we offer.
Frankly, in most cases, commercially available AI systems don't work well on our materials. The potential that ML offers us is exciting, but even the best AI models introduce unexpected errors: most were trained on contemporary materials, and they struggle when asked to process materials produced by people living in a time without computers. A classic example is an image recognition system that labels an object in a 19th-century photograph as a cellphone. Basically, if you think the internet is complicated and diverse, you should check out the Library!
This doesn’t mean the technology is not worth exploring, but the challenges demand that we incorporate these powerful tools only when we know they will work well enough to meet our high standards. We’re working on that, and we’re excited to share more as we go!
Looking back
Readers of the Signal know that ML and AI are not new topics for the Library of Congress. These technologies have been part of our explorations – our attempts to keep scratching at the surface of possibility – for years. The tools and framework we will describe in future posts are one contribution to a growing body of work from across the government and the libraries, archives, and museums field. However, before we begin sharing this next set of tools, it’s useful to look back at some of the highlights of our explorations with ML and AI to date. We’ve shared posts, reports, symposia, experiments, and research galore these past few years! Below is an overview of some of the ways the Library has explored the possibilities provided by machine learning, and some key lessons from those collaborations.
Research and Reports
Name | Collaborator(s) | Description | Findings |
---|---|---|---|
Machine Learning + Libraries Summit Event Report. February 2020, conference proceedings. | Attendees of the September 2019 Machine Learning + Libraries conference organized by LC Labs. | Summarized key themes and conference proceedings from a one-day conference on Machine Learning + Libraries hosted by the Library of Congress on September 20, 2019. | Threads emerging from whole-group discussion at the conference include: ethics, transparency, and communication; access to resources; attracting interest in GLAM (galleries, libraries, archives, museums) datasets; building machine learning literacy; expanding machine learning user communities; operationalization; connecting machine learning and crowdsourcing; metrics for evaluation of vendors and projects; and copyright and implications for the use of content. |
Machine Learning + Libraries: Report on the State of the Field. January 2020, contracted experiment report. | Ryan Cordell, Associate Professor of English, Northeastern University. | Documented the state of the art of using ML and AI in libraries. Offered detailed checklists for library, archives, and museum (LAM) practitioners considering an AI program or service. | Core recommendations: cultivate responsible ML in libraries, increase access to data for ML, develop ML + libraries infrastructure, support ML + library projects, and build ML expertise in libraries. |
Digital Libraries, Intelligent Data Analytics, and Augmented Description: Final Report. Published January 2020, contracted experiment report. | Associate Professor Elizabeth Lorang; Professor Leen-Kiat Soh; Yi Liu; and Chulwoo Pack, University of Nebraska-Lincoln. | Final report from a demonstration project. Details the explorations, addresses the social and technical challenges they raised (critical context for the development of machine learning in the cultural heritage sector), and makes several recommendations to the Library of Congress as it plans for future possibilities. | Social and technical challenges will slow the development of machine learning programs in the cultural heritage sector. |
Experiments
Experiment | Collaborator(s) | Description | AI Findings |
---|---|---|---|
Exploring ML. Completed May 2020. | Associate Professor Elizabeth Lorang; Professor Leen-Kiat Soh; Yi Liu, and Chulwoo Pack, University of Nebraska-Lincoln. | Tests of applying six machine learning tasks to library materials: Document Segmentation; Graphic Element Classification and Text Extraction; Document Type Classification; Digitization Type Differentiation; Document Image Quality Assessment and Advanced Document Image Quality Assessment; and Document Clustering. | Tools performed well on individual tasks. No one-size-fits-all for the variety of document types. |
Speech to Text. Completed June 2020. | Julia Kim, Digital Assets Specialist, American Folklife Center; Chris Adams, Solutions Architect, Office of the Chief Information Officer. | Test of Amazon Web Services (AWS) transcription service with the American Folklife Center (AFC) Regional Dialects Spoken Word collection. | Out-of-the-box tools didn’t perform well with non-dominant accents or low-quality recordings. Output could augment search behind the scenes but shouldn’t be displayed to users. |
Citizen DJ. Completed July 2020. | Brian Foo, 2020 Innovator in Residence. | Prototype that creates, visualizes, and offers for download samples of audio to enable remixing. | Extracting and organizing A/V collections by sonic features created added value for users. Required lengthy, manual data preparation, such as rights validation and content QA, before ML could be employed. |
Newspaper Navigator. Completed September 2020. | Benjamin Charles Germain Lee, 2020 Innovator in Residence. | Prototype that trains an ML classifier to identify and extract images from Chronicling America (ChronAM) pages and enables search by visual similarity (a simplified sketch of this embedding technique appears after the table). Built on layers of OCR text and ML-identified images. | Image embeddings and cloud architecture enabled unprecedented scaling. Uneven representation and errors in the data propagate through each layer of ML. |
Humans in the Loop. Completed June 2021. | AVP | Test workflows that combine ML + crowdsourcing to enhance search and discovery for business directories. | Very promising approach for creating high-quality metadata. Users want to know when data is generated via machine or crowdsourcing. |
Experimental Access. Completed March 2022. | Digirati | Test named entity extraction and linking to generate metadata of value to end users. | ML is helpful in identifying and linking metadata across digital collections. Need to establish quality standards for generating metadata aligned with user needs. |
America’s Public Bible: Machine-Learning Detection of Biblical Quotations Across LOC Collections via Cloud Computing. Completed March 2022, CCHC contracted research report funded by the Mellon Foundation. | Lincoln Mullen, Associate Professor, George Mason University. | Research on detecting biblical quotations across a growing scope of Library collections. | API parameters and the heterogeneity of the data are a challenge. |
Access and Discovery of Documentary Images. Completed March 2022, CCHC contracted research report funded by the Mellon Foundation. | Lauren Tilton and Taylor Arnold, Distant Viewing Lab, University of Richmond. | Using computer vision to aid in the discovery and use of documentary photography collections held by the Library of Congress. | Curation decisions, such as how items were digitized or differences in metadata across Library systems, impact computational outcomes. |
Situating Ourselves in Cultural Heritage: Using Neural Nets to Expand the Reach of Metadata and See Cultural Data on Our Own Terms. Completed March 2022, CCHC contracted research report funded by the Mellon Foundation. | Andromeda Yelton, software engineer. | Exploring and mapping relationships between collection items. | Challenges disambiguating terms like “Reconstruction” across collections. |
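The Newspaper Navigator row above mentions searching by visual similarity with image embeddings. The general technique is straightforward: use a pretrained model to turn each image into a fixed-length vector, then rank images by how close their vectors are to a query image's vector. Below is a minimal Python sketch of that idea using an off-the-shelf torchvision model; it is illustrative only, not Newspaper Navigator's actual pipeline, and the image file names are hypothetical placeholders.

```python
# Minimal sketch of embedding-based visual similarity search.
# Illustrative only; not the Newspaper Navigator pipeline. File names are placeholders.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Use a pretrained CNN as a fixed feature extractor by dropping its classifier head.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()  # output is now the 2048-dim penultimate layer
model.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Return one embedding vector for the image at `path`."""
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        return model(preprocess(img).unsqueeze(0)).squeeze(0)

# Index a handful of extracted newspaper images (hypothetical files).
paths = ["clip_001.jpg", "clip_002.jpg", "clip_003.jpg"]
index = torch.stack([embed(p) for p in paths])

# Rank the indexed images by cosine similarity to a query image.
query = embed("query_clip.jpg").unsqueeze(0)
scores = torch.nn.functional.cosine_similarity(index, query)
for score, path in sorted(zip(scores.tolist(), paths), reverse=True):
    print(f"{path}: {score:.3f}")
```

At the scale of millions of images, the same idea holds, but the exhaustive comparison above is typically replaced by an approximate nearest-neighbor index, one reason cloud infrastructure matters so much at that scale.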
Looking Forward
In the coming days and months, we’ll publish our draft framework for AI planning, along with updates about a new set of experiments we’ve been working on in collaboration with colleagues across the agency to explore potential internal uses of ML and AI. The tools and experiments we’ll describe represent exciting advancements in our work, and they are designed to meet the opportunities and challenges presented by the giant leaps AI has taken over the past couple of years. We’ll be asking for feedback from Signal readers and others, and we hope that the tools we’ve been working on, which have gained so much from participation in federal and library communities, will be valuable for others.