More Web Archives, Less Process

This is a guest post by Grace Thomas, a Digital Collections Specialist for the Library of Congress Web Archiving Team.

The Library of Congress Digital Content Management Section is excited to announce the release of 4,240 new web archives across 43 event and thematic collections on loc.gov, our largest single release of web archives to date! Web archives such as Slate Magazine from 2002 to present, Elizabeth Mesa’s Iraq War blog, and Sri Lanka’s current president Maithripala Sirisena’s campaign website (no longer live on the web) are now waiting to be discovered alongside millions of other Library items. Keep watching The Signal for deeper dives into the unique collections with web archives now available on loc.gov. The Web Archiving Team sends its deepest gratitude to all involved in this significant achievement for the Library.

Challenges of Scale

With over 20,000 web archives among 114 ongoing and finished collections, the scale of the Library’s web archive has grown significantly, presenting compelling new challenges for description along the way. To provide access at the same rate the archive continues to expand, the Web Archiving Team (WAT), representatives from Acquisitions and Bibliographic Access (ABA), and Web Services created a new cataloging approach in the spirit of MPLP (“More Product, Less Process”). The approach, known internally as the minimal-record approach, combines the descriptive talents of cataloging librarians with the power of Python scripting to automatically create MODS records.

The Library successfully implemented the minimal-record approach during its previous releases of the Federal Courts, International Tribunals, and Legislative Branch Web Archive collections. In planning subsequent releases, WAT saw that many web archives overlap between thematic collections — this is possible because of the way the Library collects and manages the collections when building them. For example, Hark! A Vagrant appears in both the Webcomics Web Archive and the Small Press Expo Comic and Comic Art Web Archive. In the current release, there are even more complicated examples, such as Beliefnet, which appears in three different collections curated by four different library units.

This interwoven quality prompted WAT to implement the minimal-record approach at a much larger scale than previously attempted and to process the entire backlog of web archives that were publicly available through the Wayback but not yet discoverable on loc.gov. This is the first leap forward in the Library’s effort to streamline the release of web archives, preparing for a future where web archives are made discoverable automatically as they leave the Library’s required one-year embargo on web archives content.

The Minimal-Record Schema

In keeping with the use of the Metadata Object Description Schema (MODS) for Library of Congress web archives descriptive records, ABA created a slimmed-down MODS schema for use in the minimal-record approach. The schema includes all fields necessary to describe the unique web archives at a basic level from data available in Digiboard, WAT’s homegrown curatorial tool. Library staff use Digiboard for a number of web archiving management tasks, from nomination of sites for future capture to quality review of captured content. It does not, however, have a full cataloging and description component. Rather, the minimal-record approach draws on the curatorial data held in Digiboard for description.

The minimal-record MODS fields include:

  • an identifier assigned by Digiboard upon nomination for inclusion in the web archives,
  • a descriptive title,
  • the archived URLs,
  • collection title(s),
  • curatorial department(s),
  • content language(s),
  • country of publication,
  • rights restrictions,
  • thumbnail preview images of the archived URLs, if available,
  • and a summary, if available.

There are also boilerplate fields such as genre, form, and online format, which default to “web site,” “electronic,” and “web page,” respectively.

The minimal-record MODS schema does not, by design, include Library of Congress Subject Headings or Name Authority File information. Some web archives descriptive records created in the past contain information in these fields, but they were added manually during a separate enhancement process.

Description at Scale

With the MODS schema and fields set, the true magic happens when two scripts run to automatically generate the descriptive records. First, a WAT-created Python script pulls data from Digiboard and drops each value into its specified place within the MODS XML blocks. Second, another WAT-created Python script creates the thumbnail preview images of the archived URLs based on unique identifiers set in the MODS.
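The Library’s scripts themselves are internal, but the first step, dropping each curatorial value into its place in the MODS XML, can be sketched in plain Python with the standard library. The field names and sample data below are hypothetical stand-ins for a Digiboard export, not the real format; only the MODS namespace and element names (identifier, titleInfo, location/url, genre) come from the published MODS schema:

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def build_minimal_mods(item):
    """Map one item's curatorial data (a dict standing in for a
    Digiboard export) onto minimal-record MODS fields."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")

    # Identifier assigned by Digiboard upon nomination
    ident = ET.SubElement(mods, f"{{{MODS_NS}}}identifier", type="local")
    ident.text = item["digiboard_id"]

    # Descriptive title
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = item["title"]

    # Archived URLs
    location = ET.SubElement(mods, f"{{{MODS_NS}}}location")
    for url in item["archived_urls"]:
        ET.SubElement(location, f"{{{MODS_NS}}}url").text = url

    # Boilerplate genre field, defaulting to "web site"
    ET.SubElement(mods, f"{{{MODS_NS}}}genre").text = "web site"
    return mods


record = build_minimal_mods({
    "digiboard_id": "example-0001",  # hypothetical identifier
    "title": "Example Campaign Website",
    "archived_urls": ["http://example.org/"],
})
xml_text = ET.tostring(record, encoding="unicode")
```

In a real pipeline, the loop over the Digiboard export would emit one such record per web archive, with the remaining minimal-record fields (collection titles, curatorial departments, languages, and so on) filled in the same way.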

An example of a web archive’s curatorial data in Digiboard.

Examples of Digiboard curatorial data used in the descriptive MODS record for the web archive.

The final MODS XML records and thumbnail JPG images are handed off to Web Services, which completes the extract, transform, load (ETL) process and publicly releases the records on loc.gov.

Librarians + Scripting into the Future

The minimal-record approach to describing web archives works seamlessly for a rolling release of content. However, like standard MPLP practice, the basic records exclude highly tailored information, leaving space for more detailed processing down the line. For this release, the cataloging librarian from ABA manually prepped item titles in Digiboard two days per week for four months, prior to the automatic generation of the MODS. While this involved many hours of focused, hand-curated work, it was a small fraction of the time it would have taken to manually create the 4,240 records included in the release.

To enrich the records later with Library of Congress Subject Headings and Name Authority File information, WAT and ABA have a few ideas brewing. We hope to advance the librarian-plus-scripting partnership by using id.loc.gov, the Library’s linked data service, which gives a unique identifier to each term in Library of Congress authorities and vocabularies.
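Because id.loc.gov exposes each authority term at a stable URI, one hypothetical building block for such a partnership is resolving a known heading label to its identifier. The sketch below only constructs the lookup URL for id.loc.gov’s known-label service (the service answers a request to this URL with an X-URI header naming the term’s canonical URI); the label and vocabulary are illustrative, and no network request is made here:

```python
from urllib.parse import quote


def known_label_url(label, vocabulary="subjects"):
    """Build the id.loc.gov known-label lookup URL for an authority
    label, percent-encoding the label for use in a URL path."""
    return (
        "https://id.loc.gov/authorities/"
        f"{vocabulary}/label/{quote(label)}"
    )


url = known_label_url("Web archives")
```

A batch script could walk a list of candidate labels drawn from curatorial data, issue a HEAD request per URL, and record the returned URIs for a librarian to review before any heading is added to a record.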

With exciting new content processing infrastructure available to WAT at the Library, WAT and ABA will also begin to imagine even more computational methods for assigning descriptive terms. Perhaps using text analysis or natural language processing, the content itself will help librarians with description in the future.

2 Comments

  1. Vinod Bhadu
    August 3, 2018 at 9:34 pm

    Thanks for your help

  2. Mary Johnson
    August 6, 2018 at 3:06 pm

    Since reading your blog post, I’ve been sharing collections of interest to teachers across various professional networks. The “event and thematic collections” link is particularly useful. What an impressive (mind-boggling, actually) and important service you have built and made available!
