Catalog records are key to storing and finding digital library materials. As the volume of digital materials continues to grow rapidly, the Library of Congress is exploring whether AI can help catalogers by automating the generation of metadata. AI could provide an opportunity to speed up description workflows. Yet there are numerous machine learning (ML) approaches, and questions about benefits, risks, costs, and quality that we must consider before adopting these technologies.
The Library recently released reports from a set of experiments called Exploring Computational Description, which examined which technologies and workflows provide the most promising support for metadata creation and cataloging, assessed the practices of other organizations, tested many different ML approaches with Library ebook data, and evaluated the output in iterative data review interfaces.
Leah Weinryb-Grohsgal, Senior Program Advisor to the Director of Digital Strategy, recently interviewed Abigail Potter, Senior Innovation Specialist in the Library’s Digital Innovation Division (LC Labs), and Caroline Saccucci, Chief of the U.S. Programs, Law and Literature Division, about their hopes for this experiment. The group also discusses how they’re interpreting the automated outputs and the user implications of adopting AI for a core library workflow.
Leah: Thanks for joining us for this interview, Abbey and Caroline. We’re excited to share this experiment to test machine learning (ML) models at the Library.
You’ve described this work as an experiment in exploring computational description. Why did you decide to embark on this research, and what were you hoping to learn?
Caroline: I was contacted by Abbey to discuss whether the backlog of uncataloged ebooks would be a good fit for an experiment using machine learning methods to create bibliographic records. I was really excited to have an opportunity to collaborate with Abbey on this experiment. We got underway in August 2022. Ultimately, we wanted to learn if machine learning models could generate high quality bibliographic records at scale and, if so, which ML models were most promising. Specifically, we wanted the ML models to accurately predict certain key metadata, such as titles, authors, subjects, genres, dates, and identifiers needed to describe ebooks in bibliographic records. We also wanted to learn how ML could assist catalogers in metadata creation workflows.
Abbey: LC Labs has been exploring how the Library could responsibly adopt machine learning since around 2018. In 2021 we had just concluded some research into human-in-the-loop (HITL) workflows and experimental access demonstrations powered by data derived from AI processes. We knew these new tools could be just as powerful in transforming our staff workflows as they are in connecting the public to our collections. So, we worked with partners across the Library to identify priority areas to experiment in. I was thrilled to be connected with Caroline to test how ML could help her staff of catalogers.
Leah: How will you incorporate lessons learned into the Library’s LC Labs AI Planning Framework?
Abbey: Our AI Planning Framework was published on this blog and in our GitHub space about a year ago and was first shared publicly at the 2023 iPres meeting. It was both influential to this experiment and shaped by it. Exploring Computational Description, or as we call it, ECD, was the first time we could really put the Framework into action. It is also the first experiment we did via our Digital Innovation contracting vehicle and the first experiment co-led with a colleague outside of our LC Labs team. These three aspects all contributed to the success of the experiment. It was possible to have a space outside production workflows and pressures to work through the potential risks and benefits of this use case, to bring in technical expertise, and to rely on subject matter expertise from catalogers.
I think all aspects of our Framework are essential (I realize I’m biased)! But this experiment has really shown the importance of establishing design principles for AI. We wanted to center the catalogers in this experiment, include their expertise from the beginning, and allow their decisions and real-life workflows to set our direction. Having Caroline co-lead this project also gave her and her staff an opportunity to directly learn about the technology. Working with Caroline made the whole experiment better since she knows cataloging so well.
Leah: What sorts of approaches did you test here? Which data did you use?
Caroline: We started with approximately 23,000 ebook files in EPUB and PDF formats, mostly in English, of which about 13,000 had been acquired through the Cataloging in Publication program. We also included over 5,800 open access ebooks, 3,700 legal research reports from the Law Library of Congress, and a few hundred ebooks from our main accession workflow. We also provided the associated MARC records to use as ground truth for the experiment. Five open-source ML models were selected for the experiment, and we ran the ebook files through those models to determine how well each model performed in terms of predicting the required metadata.
The models’ performance was assessed by comparing the ML-predicted metadata with the fields in the original MARC records. In addition to performing quality assessments, catalogers were asked to compare the ML-generated titles and authors with the titles and authors as they appeared in the MARC records, and to select whether there was at least one good match for each author and title. Catalogers were also asked to provide feedback on the quality to help refine the models.
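To make that comparison concrete, here is a minimal sketch, not the project’s actual evaluation code, of how a predicted title might be checked against the ground-truth MARC value using a normalized string-similarity score; the sample values and the 0.9 threshold are illustrative assumptions.

```python
# A minimal sketch, not the project's actual evaluation code: check whether an
# ML-predicted title is a "good match" for the ground-truth MARC title using a
# normalized string-similarity score. Sample values and the 0.9 threshold are
# illustrative assumptions.
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def is_good_match(predicted: str, ground_truth: str, threshold: float = 0.9) -> bool:
    """Return True when the predicted value closely matches the MARC value."""
    score = SequenceMatcher(None, normalize(predicted), normalize(ground_truth)).ratio()
    return score >= threshold


# Example: a model-predicted title vs. the title transcribed in the MARC record.
predicted_title = "Exploring computational description: an experiment"
marc_title = "Exploring computational description : an experiment"
print(is_good_match(predicted_title, marc_title))  # True
```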
We also developed two low-fidelity prototypes with suggestion services to assist catalogers. One prototype used ML to suggest possible Library of Congress Subject Headings (LCSH). Using text classification, the ML model identified main topics of the ebook and then translated those topics into LCSH subject heading terms. The HITL cataloger reviewed these LCSH terms as well as broader, narrower, and related terms. The cataloger then selected the correct LCSH terms from the list and compared them to the original MARC record to assess how well the ML model predicted the subject headings.
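The subject-suggestion idea can be illustrated with a small, hypothetical sketch using off-the-shelf zero-shot text classification; the experiment itself used different models (including Annif, discussed below), and a real service would rank candidates against the full LCSH vocabulary rather than the hand-picked list shown here.

```python
# A hypothetical sketch of the subject-suggestion idea using off-the-shelf
# zero-shot text classification; not the models used in the experiment, and the
# candidate list below is a tiny illustrative subset of LCSH.
from transformers import pipeline

classifier = pipeline("zero-shot-classification")  # downloads a default English NLI model

candidate_lcsh = [  # illustrative subset of LCSH terms
    "Machine learning",
    "Cataloging of electronic books",
    "Metadata",
    "Copyright",
]

excerpt = (
    "This report examines whether machine learning models can generate "
    "bibliographic metadata for ebooks at scale."
)

result = classifier(excerpt, candidate_labels=candidate_lcsh, multi_label=True)

# Present the highest-scoring terms to a cataloger for human-in-the-loop review.
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")
```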
The second prototype suggested author names as established in the Library of Congress Name Authority File (NAF). Using token classification, the ML model tried to accurately predict that a string of characters on a title page constituted the author’s name and then suggested possible names as they appear in the NAF. Catalogers were asked to use both prototypes and complete a survey to elicit feedback about the utility of these prototypes, identifying needed refinements and enhancements for the next experiment.
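Similarly, the token-classification step behind the name-suggestion prototype can be sketched with a generic, publicly available NER model; the model name, the title-page string, and the tiny NAF sample below are illustrative assumptions rather than the experiment’s actual components.

```python
# A sketch of the token-classification step with a generic, publicly available
# NER model; the model, title-page string, and tiny NAF sample are illustrative
# assumptions, not the experiment's actual components.
from difflib import get_close_matches
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)

title_page_text = "Exploring Computational Description / by Jane Smith"

# Hypothetical sample of established headings from the Name Authority File.
naf_headings = ["Smith, Jane", "Smith, Jane, 1952-", "Smith, John"]

for entity in ner(title_page_text):
    if entity["entity_group"] == "PER":  # keep only person-name spans
        # Suggest the closest established forms for cataloger review.
        suggestions = get_close_matches(entity["word"], naf_headings, n=3, cutoff=0.3)
        print(entity["word"], "->", suggestions)
```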
Detail from an Assisted Cataloging HITL Workflow Prototype for evaluating the quality of suggested subject terms. There are options to mark “Acceptable,” “Too Broad,” “Too Narrow,” or “Wrong,” and to add comments for each suggested subject term.
Abbey: This is another example of the Framework in action: assessing models as they do specific tasks with our real data and having experts review those outcomes to start to understand what a quality baseline might be. AI is so dependent on the quality of data for accurate outcomes. And it is really important for the people who are experts in the workflow to review the output. Their feedback can both determine the quality of the AI output and help make the model better through retraining and tuning models.
Leah: Can you give any examples of successful approaches?
Caroline: For the purposes of this experiment, we set a quality threshold of a 95% F1 score, a measure of accuracy. The transformer-based models doing token classification tasks, like predicting titles and authors, were most successful, although none of the models reached our 95% threshold, except for identifying Library of Congress Control Numbers (LCCN).
Annif, a model and framework for automated subject indexing developed by the National Library of Finland, showed some promise with text classification of subjects, although the accuracy rate reached only 35%. We were able to test various Large Language Models (LLMs) and vector “searching” to create MARC fields and subfields and got good results, with some fields getting a 90% F1 score (100% being a perfect score). The LLMs scored a 26% F1 when asked to predict the same LCSH terms that catalogers assigned.
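For readers unfamiliar with the metric, the F1 scores quoted above combine precision and recall. A minimal sketch of how such a score might be computed for one record’s subject terms follows; the terms themselves are made up for illustration.

```python
# How an F1 score can be computed for a single record: precision and recall over
# the sets of predicted vs. cataloger-assigned terms. The terms are made up.
def f1_score(predicted: set[str], assigned: set[str]) -> float:
    true_positives = len(predicted & assigned)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of predictions that are correct
    recall = true_positives / len(assigned)      # share of assigned terms that were found
    return 2 * precision * recall / (precision + recall)


predicted = {"Machine learning", "Metadata", "Libraries"}
assigned = {"Machine learning", "Cataloging of electronic books", "Metadata"}
print(round(f1_score(predicted, assigned), 2))  # 0.67
```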
Since high quality catalog records are essential to the Library of Congress and libraries around the world who use our MARC records, the results are showing us that catalogers will need to review ML/AI output prior to publishing, which we expected. The cataloging assistance workflow prototypes enabling cataloger review and feedback showed promise, and this human-in-the-loop (HITL) concept is moving forward for further iteration.
Abbey: We used Data Processing Plans to document what data was used to assess each model. If you are interested in all the technical details about exactly how data was prepared and used and what models were assessed, please check them out! These, along with the reports, data, and prototypes, are a major deliverable of the experiment. This documentation is helping to guide our decision making for future AI technologies that would be implemented after experimentation.
I also want to mention how useful it is to be able to assess different models with the same data. It allows us to evaluate the performance of open models, proprietary models, etc. and make informed decisions. A basic tenet of government procurement is encouraging competition, which drives better value, so it is really important to include multiple approaches and platforms and to verify the performance claims often found in marketing materials.
Leah: What are some challenges of using these models?
Caroline: There were a number of challenges working with the models. The first and most basic is that we did not start with enough training data. It turns out that “more is more” when it comes to training data for a machine learning project. We started with approximately 23,000 ebooks and associated MARC records for training data, but we had almost 100,000 that we could have used. We supplemented with the remaining 77,000 ebooks, so that was a good lesson learned.
A second challenge was “extreme multilabel text classification,” in that a bibliographic record can include multiple subject fields, each with multiple subject terms. Furthermore, there is a very long tail of possible subject terms. About 50% of the subject terms in the training corpus appeared more than once, while the other 50% appeared only once. This huge variety in a relatively small corpus of 23,000 documents resulted in very low accuracy rates for subjects. The accuracy rate for genre was even lower since the majority of the MARC records used in the training data did not include genre terms.
Abbey: Caroline is right on. Having sufficient training data to guide a model to perform a specific task is a universal challenge. Also, selecting the correct string of subjects for a book out of approximately 450,000 possible subjects is incredibly challenging for both humans and AI. The complexity of the task and a lack of balanced training data add risk to use cases. The lack of knowledge about what LLMs are trained on can also increase risk because it increases the likelihood the model will need significant prompting or tuning to perform well or in an unbiased way.
Other conditions that add risk aren’t necessarily about the performance of the model. The terms of service for some models aren’t compliant with federal security and privacy regulations, especially for how federal agencies must treat data. We must also look holistically at how an AI ecosystem could impact our bottom line, our planet, our organization, our users, and our staff. We need to understand what the short- and long-term costs are for these new tools. This is an incredibly fast-moving space; new models, products, and regulations are being released almost weekly. It is challenging to keep up!
Our Framework has given guidance on how to step through assessments of AI technologies. What we are learning in this experiment is helping to formulate how we will assess AI.
Right now, the big questions we are asking are:
- Is the AI responsible? Do the outcomes of the experiment show an appropriate balance of benefits and risks? Is the tool or approach we want to use compliant with our standards and regulations? Do the AI approaches and outcomes support trustworthiness, accountability, and respect for equity, rights, security, and privacy?
- Is the AI effective? Did the combination of our data, staff expertise, and model selection combine to produce an outcome that benefits our users, stakeholders, staff, and the organization?
- Is the AI practical? Can we integrate this AI tool or process in our infrastructure? Can we manage and ensure stable performance over time? Is it affordable?
We are still answering these questions for ECD, but we are encouraged by what we are seeing so far.
Leah: Has anything surprised you in the results?
Caroline: I was not that surprised that ML models could be trained to accurately predict author names, ebook titles, and identifiers. I also was not really surprised that ML models had difficulty with applying subject and genre terms because asking a machine to determine the main topic(s) of a book and convert that to a set of terms from a controlled vocabulary is some serious deep learning. The results for the 23,000 ebooks showed that ML models are not ready for wholesale bibliographic description at scale, although, as already noted, perhaps we might have gotten better results if we had started with 100,000 ebooks.
For me, the most surprising results were the responses from catalogers to ML. I had assumed that catalogers would be really uneasy about ML methods applied to bibliographic description, but the catalogers who participated in the testing showed curiosity and an open mind. This was particularly true for the catalogers who tested the HITL prototypes because they could get a sense of how ML could augment and support their work.
Abbey: I, like others, am really optimistic about ML and how it could improve our processes at the Library. I initially thought of creating metadata from ebooks as a “low-hanging fruit” task and was excited to help move it forward. When widely available LLMs were released, I was incredibly impressed by what they could do. But as I learned more about this technology and started to evaluate it, I realized that you can’t take what you see in demos or cited performance metrics and expect similar results for a complex task like cataloging. There are so many variables that make it challenging.
Most AI tools are not designed to work with long texts. Hallucinations are a regular part of generative AI, and we need high quality and consistent output. Doing this experiment showed me that implementing AI within our workflows at the Library will not be simple or fast. Developing training, tuning, and benchmark datasets that are balanced yet reflect the wide variety of content that is cataloged at the Library will be important steps, as will developing infrastructure to manage the models, test them, and switch them out as needed. New quality standards, interfaces, and programs to review and assess model output will be needed. Feedback mechanisms from catalogers and users to inform model performance, and systems to manage all the data, will also need to exist. And staff and expertise to run, monitor, and maintain this new capability have to be in place. This is all possible, but it will take time and resources.
Leah: Do you expect that the Library will use these technologies widely in the future?
Caroline: The results from the first experiment showed that this technology is still maturing and fast moving. ChatGPT (built on GPT-3.5) was released halfway through the first phase of experimentation. The second phase included the best of the models from the first phase, with additional LLMs and two cataloging assistance prototypes. We may get to the point where ML models can reliably predict titles, author names, publication information, identifiers, etc., but we will need the HITL for the more complex work of bibliographic description.
Abbey: I think our staff are in a very good position to use AI effectively. We collect, preserve, and provide access to information in many formats that have existed over time, across most subject areas. We have created and adopted standards and practices for every previous wave of new technology that came at us.
We created the MARC format to enable machine-readable and sharable bibliographic data. We were one of the first cultural heritage organizations to digitize and share collections online through the American Memory program. The Library led a national digital preservation program that funded and promoted standards and guides to support the preservation of digital materials at organizations of every size across the nation. We lead and model accessibility practices that broaden the reach of our programs and services, and we collaborate with partners across the world to preserve web archives, share innovative practices, and advance responsible adoption of AI. We were the first nation to provide API access to legislation.
Our staff (and librarians in general) know our data and the context of our organizations, and we know how to use new technologies to benefit our users, communities, and staff. This is another step in what we’ve done before.
Leah: Do you see other work following and building on your findings here?
Caroline: We just kicked off a third phase of the experiment in August 2024. I am very excited that we will be experimenting earlier in the pre-publication workflow to create bibliographic metadata for print and ebooks and export the metadata in the BIBFRAME format. I’m hopeful that we will be able to involve other institutions in the assessment and review. It will be an opportunity to think creatively about the entirety of the cataloging workflow and imagine the possibilities for the future.
I see us building on our findings. Each phase is bringing us closer to implementing a pilot program to use ML and HITL for cataloging digital materials in a production environment.
Abbey: We are learning so much from this and other AI experiments we’re doing. They have been foundational to our emerging strategy, roadmap, and policies for AI. I know it is a privilege to have the time, support, and resources to experiment with AI in the way we are. We hope by sharing what we’re learning, others can benefit as well.
This interview discusses an experiment to create MARC records from ebooks using the AI Planning and Implementation Framework released by LC Labs in November 2023. Learn more about the planning framework in the blog post Introducing the LC Labs Artificial Intelligence Planning Framework or in the LC Labs AI Planning Framework repository on the LC Labs GitHub.
Comments (2)
Thank you for this very important information for the Library “world.” I may share this information with my librarian colleagues in Indonesia and SEA.
This is very exciting! The Copyright Office, Office of Copyright Records is also experimenting with AI in our work. Being a participant in the experiment, I’m not fearful of the technology and can see how it can be useful in helping streamline many routine processes we do daily. I do think the human component of review will always be a crucial step in the process. Even if it doesn’t hit the target accuracy, getting the majority of the data is a win, as it is still a time saver to simply review and correct a record with minimal errors as opposed to creating the record in the traditional manner. I think as the AI learns over time, these errors will become fewer, but still not to a level that would exclude human review (trust but verify). Thanks for a very informative interview, Abbey (loved your presentation at the IS&T Conference last year) and Caroline.