Have you ever wondered what exactly web archiving is? How the Library selects which websites to preserve? Or how you would find and search the web archives? The Web Archiving Team’s Senior Digital Collection Specialists gathered to answer these questions and more in a live webinar during the Preservation Directorate’s celebration of Preservation Week. If you missed it, we have good news: a video is now available to watch on the Library’s website, and you can also read a short summary of the presentation here. The presentation consisted of four parts: an overview of the web archiving program, a closer look at the U.S. Elections Web Archive, an introduction to web archive datasets, and a Q&A session with the audience.
Grace Bicho kicked off the presentation by defining web archiving as “the process of selecting, collecting, and preserving parts of the web, then making the resulting archive accessible for use.” As Bicho helpfully explained, web archiving does not mean creating an image or screenshot of a website; rather, it seeks to capture a reproducible copy of a website as it existed at the point in time when it was preserved. “Once it’s made available for access, archive users can interact with web archives via a browser just like you would browse through any website,” Bicho said. The Library tries to capture as much of the user experience and the “look and feel” of the website as possible, but Bicho also offered the important disclaimer that web archiving is not perfect: it’s impossible to capture everything on the internet, and web archiving tools can lag behind the complex, constantly evolving technology of the websites they are trying to preserve.
At the Library of Congress, the web archives are managed by a team of eight full-time staff members, known informally as the “Web Archiving Team,” who are aided by over 235 other staff members currently involved in the process. Bicho noted that one important group of contributors is the Recommending Officers, subject specialists within the Library who nominate content for the web archive that falls within their fields of expertise. These Recommending Officers are guided by the Library’s collection policy statements on specific subject areas, as well as supplementary guidelines for web archives. The Web Archiving Team also relies on the Library’s IT specialists for technical support and on the lawyers in the Office of the General Counsel, who weigh in on issues related to copyright.
It takes many partners and collaborators to create the Library’s web archive, because any way you measure it, the archive is enormous! Bicho listed some of the facts: 3.7 petabytes of data, 14,594 seed URLs, 200 countries of publication, 122 languages, 78 active collections, and 188 total collections. So what kind of material would you find in that vast treasure trove of content, and how would you go about browsing or searching it? To answer those questions, Bicho passed the presentation over to Lauren Baker. Baker explained that most web archive collections are representative rather than exhaustive and are built as event-based or thematic collections (like the Public Policy Topics Web Archive, the Coronavirus Web Archive, and the Web Cultures Web Archive), but the Library collects websites comprehensively in three areas: legislative branch agencies, congressional members and committees, and U.S. national election campaigns.
The U.S. Elections Web Archive spans over twenty years and is the web archive program’s oldest collection. The collection began with presidential campaign websites in 2000 and later expanded to include congressional and gubernatorial campaigns. Baker outlined three ways that users can explore the U.S. Elections Web Archive, or other content in the web archive. The first approach is browsing by collection, which groups together related records with descriptive information to provide additional context. Another way is with a record search on loc.gov, where you can refine the results by applying facets, like the election year or candidate’s name. And lastly, if you already know what URL you are looking for, you can search for it on webarchive.loc.gov. (Our web archives are currently undergoing infrastructure upgrades, which may make access to webarchive.loc.gov intermittent. Read here to find out more.)
For data-savvy researchers, Chase Dooley introduced another, “computationally-friendly” way to explore the web archives: through datasets. The Web Archiving Team has partnered with LC Labs to make web archive data available in bulk by creating derivative datasets from metadata about the resources in the archive. Using the example of the U.S. Elections Web Archive dataset, Dooley explained that it contains over 400,000 CDXs, totaling over 250 gigabytes in size. According to Dooley, these CDXs (or crawl indexes) are “created as the sites are being crawled, [and] they contain metadata about the crawl that are used by Wayback software in order to replay that content.” Explaining further, he said that “each line in a CDX is representative of a web resource,” and “they contain things like the URL captured, the timestamp, mime type, HTTP response code, hash of that resource, byte location inside the WARC file, and the name of that file.” For those who would like to preview the dataset or need a little help getting started, Dooley recommends the Jupyter Notebook that the Library made available on its GitHub page, which provides examples of how to explore and analyze the data. (You can also learn more about CDXs, the U.S. Elections Web Archive dataset, and the Jupyter Notebook here.)
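To make Dooley’s description of a CDX line a little more concrete, here is a minimal Python sketch of how one might split such a line into the fields he listed. It is an illustration only, not the Library’s notebook code. It assumes the common space-delimited, eleven-field CDX layout (header ` CDX N b a m s k r M S V g`); the files in the actual dataset may use a different layout. The names `CdxRecord`, `parse_cdx_line`, and `count_mime_types` are hypothetical, and the sample lines are invented for the example.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class CdxRecord(NamedTuple):
    """One CDX line, assuming the common 11-field layout (an assumption)."""
    url_key: str       # canonicalized URL used for sorting and lookup
    timestamp: str     # capture time, YYYYMMDDhhmmss
    original_url: str  # the URL captured
    mime_type: str
    status_code: str   # HTTP response code
    digest: str        # hash of the resource
    redirect: str      # "-" when not applicable
    meta_tags: str     # "-" when not applicable
    length: str        # compressed record size in bytes
    offset: str        # byte location inside the WARC file
    filename: str      # name of the WARC file


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one whitespace-delimited CDX line into named fields."""
    fields = line.split()
    if len(fields) != len(CdxRecord._fields):
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    return CdxRecord(*fields)


def count_mime_types(lines: Iterable[str]) -> Counter:
    """Tally captures by MIME type, skipping blank lines and the CDX header."""
    counts: Counter = Counter()
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("CDX "):
            continue
        counts[parse_cdx_line(stripped).mime_type] += 1
    return counts


# Invented sample data for illustration; not taken from the real dataset.
SAMPLE_LINES = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20001107000000 http://example.com/ text/html 200 "
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - - 1234 5678 example-2000.warc.gz",
    "com,example)/about 20001107000005 http://example.com/about text/html 200 "
    "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB - - 2345 6912 example-2000.warc.gz",
    "com,example)/logo.gif 20001107000009 http://example.com/logo.gif image/gif 200 "
    "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC - - 345 9257 example-2000.warc.gz",
]

if __name__ == "__main__":
    for mime_type, total in count_mime_types(SAMPLE_LINES).most_common():
        print(mime_type, total)
```

A tally like this is the kind of quick first look a researcher might take before deciding which captures to examine more closely; the Jupyter Notebook mentioned above is the better guide to how the real files are structured.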
We encourage you to watch the video to learn more about the web archives, including more in-depth information about the above topics and the answers to audience questions, such as: What software do you use for crawling websites? How do you prioritize what to archive? How does the Library approach quality assurance? And does the Library ever collaborate with external subject matter experts? You can also learn more about other types of preservation work at the Library by visiting the Preservation Directorate’s website and browsing its list of past Preservation Week presentations.