In a Web Archives Frame of Mind: Improving Access and Describing the Collections

This is a guest post by Lauren Baker, a Librarian-in-Residence on the Library of Congress Web Archiving Team (a part of the Digital Collections Management & Services Division). The Librarians-in-Residence Program offers early-career librarians an opportunity to contribute to Library projects while learning from professionals in the field.

In 2018, the Library of Congress Web Archiving Team embarked on a journey to streamline description of the Library’s voluminous web archives. As part of that continuing effort, the Library of Congress Digital Content Management Section is excited to announce the release of 4,258 new web archives across 97 event and thematic collections! With this release, some of our newer collections are coming out of embargo, and we are making records available for a number of older web archives and collections not previously available for use. This release marks the next significant milestone in the Library’s ability to serve web archives immediately upon exiting their one-year embargo period, and it builds momentum for the regular, monthly updates coming in 2020.

Discovering Web Archive Collections

Screenshot of framework for the new Women’s and Gender Studies Web Archive: https://www.loc.gov/collections/womens-and-gender-studies-web-archive/about-this-collection/

Figure 1. The new collections page for the recently launched Women’s and Gender Studies Web Archive at the Library of Congress.

To create a better user experience and make it easier to browse the newly available web archives, we have been working to add collection pages, known as “framework” here at the Library. Framework is a page on the Library’s website that acts as an entry point to a digital collection. It provides researchers with a narrative description and some contextual information about the scope of a collection.

Creating framework is a collaboration between the Web Archiving Team and staff from across the Library of Congress.  We collaborate with subject specialists from more than ten different divisions who curate the collections and with colleagues in IT Design & Development, who manage the Library’s web content and post the completed framework to loc.gov to make the collections publicly available. Along with the 4,258 records released, we have nearly doubled the number of web archive collections accessible to users. We created framework for web archives with records previously posted to the Library’s website and for new archives recently released from embargo. For the first time, there are collections on LGBTQ+ studies, veterans history, composers, authors, American business, economics, zines, and many more events and themes!

The creation of so many new collections offered an opportunity to consider how to better describe the web archives at the collection level. Taking cues from our colleagues within the library and in the larger web archiving community (see Peter Webster’s and Jessica Cebra’s work on web archive description), there are four areas where we’ve enriched the collection descriptions:

  1. frequency of collection
  2. languages
  3. acquisition information
  4. expert resources

We were able to harness information that we already had from the nomination process, crawl logs, and the archives themselves. Below, we’ll show how these four types of enhancements provide users with more information about the archives, make the information provided across the collections more consistent, and enable better search.

Understanding and Using the Collections

Figure 2. Detail of a one-time capture of The Guardian’s “Choose your own pope – with our interactive Pontifficator” in the Papal Transition 2013 Web Archive.

What can be gleaned from this additional descriptive information? The frequency of collection gives users information about how often the sites were crawled, which is often based on the frequency of updates on the live site by its owners and the depth of information available from those sites. Within a given collection, sites may be crawled at a variety of frequencies. For example, in the Papal Transition 2013 Web Archive, the majority of sites were targeted for capture weekly given that this was an event-based collection where information was changing quickly. Other sites within the collection, including selected Wikipedia pages, Twitter accounts, news sites, and organization websites, were crawled once, which allowed for a more in-depth capture to provide a snapshot in time of this historic announcement.

Languages is a new field added to the collection description that brings the web archive descriptions more in line with archival finding aids. Language information is already available at the item level, but the languages of the content may determine whether a researcher wants to use a collection at all. Making this information available at the collection level allows that decision to be made earlier in the research process. Language information is crucial for the Afghanistan, Iran, Pakistan Government Web Archive, where sites in the collection contain information in multiple languages. We decided that the most frequently appearing language would be listed first, followed by additional languages.
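For readers who want to try a similar roll-up on their own item-level metadata, here is a minimal Python sketch of the "most frequent language first" idea. It is an illustration only, not the Library's actual workflow: the function name, the input shape (one list of language names per item), and the sample values are all hypothetical, and ordering the remaining languages by frequency with alphabetical tie-breaking is an assumption, since the rule above only fixes which language comes first.

```python
from collections import Counter


def collection_languages(item_languages):
    """Return a collection-level language list, most frequent language first.

    item_languages: an iterable of lists, one list of language names per item.
    (This input shape is an assumption made for the example.)
    """
    # Count how many items list each language.
    counts = Counter(
        language
        for languages in item_languages
        for language in languages
    )
    # Sort by descending count; break ties alphabetically so the result is stable.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [language for language, _ in ranked]


# Made-up item-level records, not real collection data.
items = [
    ["Persian"],
    ["Persian", "English"],
    ["Urdu", "English"],
    ["Persian"],
]

print(collection_languages(items))  # ['Persian', 'English', 'Urdu']
```

Counting items rather than captures or pages is one possible choice; a different unit of measure could produce a different ordering.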

Acquisition information can contribute to web archive provenance. We have started including a statement when collections are ongoing to indicate that sites are continuing to be identified and added. In the case of the East European Government Ministries Web Archive, sites for countries and ministries continue to be added incrementally, which means that there are earlier captures and, therefore, more information for some sites. This is an example of how knowing more about a web archive’s origins can help researchers understand what is and is not in the archive.

Expert resources is the final area where we have enhanced the collection descriptions. The more linking the better, in terms of search and discoverability, so we have placed greater emphasis on linking collections to related resources within and outside the Library, including the relevant reading rooms, Ask a Librarian service, LibGuides, related collections, and blog posts about the web archives.

The overlap among collections for specific items can be seen in the various “Part Ofs” listed on an item page in loc.gov. For example, the Human Rights Campaign can be found in no fewer than five collections: LGBTQ+ Politics and Political Candidates Web Archive, United States Supreme Court Nominations Web Archive, Public Policy Topics Web Archive, Researcher and Reference Services Division, and Law Library of Congress. However, with the addition of more precise expert resources, users can also recognize the interconnectedness earlier at the collection level before drilling down into the items. Two newly posted collections – American Music Creators Web Archive and American Music Industry Web Archive – link to one another. The Comics Literature and Criticism Web Archive references the previously posted Webcomics and Small Press Expo Comic and Comic Art Web Archives. As we continue to add to the Library’s web archive, many more links and discoveries will be possible!

Opening the Past & Moving to the Future

And finally, as a part of this release, in our general web archives, we have released a number of non-candidate web archives that were collected during the early United States Election Web Archiving projects. These had been collected off and on from 2000 through 2008 during United States election periods, but in 2010, a decision was made to simply focus our election archiving efforts on campaign websites. While many of the non-candidate sites archived with the election archives moved to the Public Policy Topics Web Archive for regular, ongoing collection, some did not, and we are pleased to finally make this content available for research use.

All of this sets the stage nicely as the Library begins to celebrate its twentieth year of web archiving in 2020. In early 2020, the Web Archiving Team plans to begin monthly, automated updates of content coming out of the one-year embargo, and we will continue to release framework for additional collections as their records exit embargo.

In the meantime, and once rolling releases become standard practice at the Library, we will remain committed to improving description of the Library’s web archives to enable better access and ease of use. The exponential growth of the Library’s web archive shows no sign of halting and we are excited to embrace new modes and workflows of description that can help the Library serve the content earlier and more comprehensively.

What information would you like to know about the web archives? In what interesting ways are you able to use the newly released collections? Let us know in the comments and keep watching The Signal for more web archive updates!
