Today’s blog post is co-authored by Dave Durden, Madeline Goebel, and Liz Holdzkom, Digital Collections Specialists in the Digital Content Management Section (DCM) at the Library of Congress.
Large-scale harvesting and acquisition of open access books sounds like a tall order and, after several years of attempting such work, we can confirm it is! From varied license terms (“open access” versus “openly available”) to generically named works (“Multilevel Analysis”), sifting through immense collections of open access literature for books that meet a long list of requirements and selection criteria is far more than one librarian (or library) can reasonably do on their own. That’s why we make computers do most of it. A key part of the Library of Congress Digital Collections Strategy is to “expand and routinize acquisition and access of openly licensed and openly available digital works,” so figuring out how to best meet this challenge is an important part of our work. In this post, Digital Content Management (DCM) staff discuss how a massive spreadsheet becomes hundreds of e-books and MARC records available on loc.gov without an individual having to review every single title by hand (or eye, for that matter).
Starting the Process
The Directory of Open Access Books (DOAB) is an independent aggregator of open access books and is not associated with the Library of Congress. As of August 2023, DOAB provides entries for over 69,000 “scholarly, peer-reviewed open access books” (DOAB homepage). DOAB’s invaluable contribution is the aggregation of thousands of titles in one place; its main ‘product’ is the title-level metadata. Entries in DOAB are primarily created by a community of open access publishers; however, some titles are cross-listed from Open Access Publishing in European Networks (OAPEN), which curates and enhances title-level metadata. Using an export of DOAB-aggregated data, Digital Collections Specialists from DCM begin by downloading metadata for all of the titles listed in DOAB at the time of the export. DOAB data is robust, but there’s a catch: it also reflects the variety of data creators and sources, resulting in variations in text formatting, omission of license or download URLs, and every imaginable ISBN format. When presenting this work at the Code4Lib 2023 conference, we learned that we aren’t the only ones identifying these challenges.
To maximize the usefulness of the exported DOAB data, we standardize and clean up the textual data (e.g. abstract, title, CC license) and ISBN fields relevant to our work, as well as extract ISBNs scattered throughout the data using regular expression pattern matching. After these data enhancements are made, we are left with a large pool of title strings and ISBNs that we use to search book APIs for existing metadata records, including Google Books, OpenLibrary, and OCLC. We also search these ISBNs in the Library’s catalog using Z39.50. Matching results are then added back into the dataset for further analysis.
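To give a sense of what the ISBN extraction step can look like, here is a minimal Python sketch. It is an illustration rather than our production script: the regular expression, the function names, and the choice to normalize everything to ISBN-13 are assumptions made for this example. It finds ISBN-like strings in free text, strips hyphens and spaces, keeps only candidates whose check digit is valid, and de-duplicates the results.

```python
import re

# Loose candidate pattern (an assumption for this sketch): an optional 978/979
# prefix, nine digits, and a final check character, with optional hyphens or
# spaces between them. Check-digit validation below weeds out false positives.
ISBN_CANDIDATE = re.compile(
    r"(?<![0-9])(?:97[89][\s-]?)?(?:[0-9][\s-]?){9}[0-9Xx](?![0-9])"
)


def is_valid_isbn10(isbn):
    """Validate an ISBN-10 using the weighted mod-11 check."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(isbn))
    return total % 11 == 0


def is_valid_isbn13(isbn):
    """Validate an ISBN-13 using the alternating 1/3 weighted mod-10 check."""
    if not re.fullmatch(r"[0-9]{13}", isbn):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                for i, ch in enumerate(isbn))
    return total % 10 == 0


def to_isbn13(isbn10):
    """Convert a valid ISBN-10 to ISBN-13 by prefixing 978 and recomputing the check digit."""
    core = "978" + isbn10[:9]
    total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                for i, ch in enumerate(core))
    return core + str((10 - total % 10) % 10)


def extract_isbns(text):
    """Return the unique, valid ISBNs found in text, normalized to ISBN-13."""
    found = []
    for match in ISBN_CANDIDATE.finditer(text):
        raw = re.sub(r"[\s-]", "", match.group()).upper()
        if len(raw) == 10 and is_valid_isbn10(raw):
            raw = to_isbn13(raw)
        elif not is_valid_isbn13(raw):
            continue  # right shape, wrong check digit: skip it
        if raw not in found:
            found.append(raw)
    return found


# Made-up sample field mixing an ISBN-13, the equivalent ISBN-10, and a bad number.
sample = "ISBN 978-0-306-40615-7; print 0-306-40615-2; bad 1234567890"
print(extract_isbns(sample))
```

Normalizing to a single form means the same book listed under both its ISBN-10 and ISBN-13 collapses to one search key before any API or catalog lookup.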
Process
In previous blog posts (read more here and here), DCM described how it has iterated on a workflow for manually processing DOAB (and other) titles. We now use the workflow below to routinize bulk processing.
Once the DOAB data is prepared and cleaned for processing, DCM identifies the titles that would be eligible for a bulk processing workflow.
We start with titles that we know have already been selected for acquisition in print. That way we know they fit selection criteria for the Library’s permanent collections and we can draw from existing records. This has the benefit of making items already held in the Library’s collections much more broadly accessible. For these titles, we work with existing records in our catalog, from which we can pull and transform most of the metadata, and with those that include fields containing Creative Commons language for each book’s individual license. Data about titles that do not match our existing print holdings are shared with subject experts across the Library, referred to as Recommending Officers, who then review and consider adding titles to the collection if they are determined to be in scope per the Library’s Collections Policy Statements.
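The split described above can be sketched in a few lines of Python. This is a simplified illustration, not our actual script: the record layout (a dict with "title" and "isbns" keys), the function name, and the sample data are all hypothetical.

```python
def partition_by_print_match(records, catalog_isbns):
    """Split records into bulk candidates (an ISBN matches a print holding)
    and titles to refer to Recommending Officers (no match)."""
    catalog = set(catalog_isbns)
    bulk_candidates, for_referral = [], []
    for record in records:
        # A record is a match if any one of its ISBNs appears in the catalog set.
        if catalog.intersection(record.get("isbns", [])):
            bulk_candidates.append(record)
        else:
            for_referral.append(record)
    return bulk_candidates, for_referral


# Made-up records and catalog ISBNs, for illustration only.
records = [
    {"title": "Title A", "isbns": ["9780306406157"]},
    {"title": "Title B", "isbns": ["9791234567896"]},
    {"title": "Title C", "isbns": []},
]
bulk, referral = partition_by_print_match(records, ["9780306406157"])
print([r["title"] for r in bulk])      # titles with a print match
print([r["title"] for r in referral])  # titles without one
```

Keeping the unmatched titles in their own list, rather than discarding them, is what lets them be shared later for review.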
As part of the workflow, we run a set of Python scripts to pull content and cover images for the titles that we identified, and to add the metadata needed for processing to their records.
Despite the development of this bulk processing workflow, staff intervention is still needed at points throughout the process. With the addition of several technician positions to the DCM team, we have even more expert eyes to help with this process. Staff are needed to conduct quality assurance on the thumbnails and, of course, to run the processing scripts in order. Throughout the process, titles are dropped from the bulk processing workflow, including titles for which content or cover images could not be pulled or titles that did not have a corresponding print record in the Library’s catalog.
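One way to keep track of why titles leave the bulk workflow is to record a reason at the moment each one is dropped. The Python sketch below illustrates the idea under stated assumptions: the field names ("content_path", "cover_path", "print_record_id") and the reason strings are hypothetical and do not reflect our actual data model.

```python
def drop_reason(title):
    """Return a reason string if the title should leave the bulk workflow, else None."""
    if not title.get("content_path"):
        return "content could not be pulled"
    if not title.get("cover_path"):
        return "cover image could not be pulled"
    if not title.get("print_record_id"):
        return "no corresponding print record"
    return None


def split_batch(titles):
    """Separate titles that continue in the batch from those that are dropped."""
    kept, dropped = [], []
    for title in titles:
        reason = drop_reason(title)
        if reason is None:
            kept.append(title)
        else:
            # Copy the title and attach the reason so it can be tracked later.
            dropped.append({**title, "drop_reason": reason})
    return kept, dropped


# Made-up batch, for illustration only.
batch = [
    {"id": 1, "content_path": "a.pdf", "cover_path": "a.jpg", "print_record_id": "r1"},
    {"id": 2, "content_path": "b.pdf", "cover_path": None, "print_record_id": "r2"},
]
kept, dropped = split_batch(batch)
print(len(kept), [(d["id"], d["drop_reason"]) for d in dropped])
```

A recorded reason makes it easier to route dropped titles to the right follow-up instead of re-diagnosing each one by hand.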
The Results
Over 2,400 e-books have been made available in multiple batches as part of the Open Access Books Collection through these bulk processes since 2022. Development of a comprehensive data management plan has allowed DCM staff to improve upon the previously established workflows, making tracking more efficient and effective with each iteration. We intend to continue improving our workflows with each batch. These bulk-processed e-books cover a wide variety of topics and were published in 24 countries in 12 different languages. A sample of the most-viewed titles represents this diversity: Tracing pathways: interdisciplinary studies on modern and contemporary East Asia, Putting purpose into practice: the economics of mutuality, Typical girls: the rhetoric of womanhood in comic strips, and Salud y equidad: una mirada desde las ciencias sociales.
Once the bulk processing workflow is completed, DOAB titles that were deemed ineligible for batch processing are set aside for manual processing by DCM staff. In recent months, the pace of this manual processing has significantly increased thanks to the great work of DCM technicians. DCM staff is considering developing workflows to process these “manual review” titles in more automated ways, borrowing from established bulk processing workflows. Additionally, we’re looking for other ways to grow this collection, possibly with data from other aggregators of open access books. DOAB titles that are not currently in the Library of Congress catalog are sent to Recommending Officers for review. If selected, these e-books will become part of the collection in the future.
Through bulk or manual processing, open access e-books are continuously released to loc.gov. Check out the ever-growing collection!