Diving into Digital Ephemera: Identifying Defunct URLs in the Web Archives

This is a guest post written by Olivia Meehan, a 2022 Junior Fellow working with the Web Archiving Team. Her project was completed under the mentorship of Lauren Baker.

If you have followed a link referenced in an article only to hit a dead-end “page not found” error message, you have experienced link rot. “Link rot” and “reference rot” are terms used to describe the gradual loss of web resources.

According to a 2014 study from Harvard Law Review, “Link rot refers to the URL no longer serving up any content at all. Reference rot, an even larger phenomenon, happens when a link still works but the information referenced by the citation is no longer present, or has changed.” This study found that in 2014, 50% of URLs referenced in Supreme Court opinions from 1996 to 2013 had succumbed to reference rot. Over time, URLs change, content is deleted, businesses close, servers break down and, ultimately, information is lost.

Web archives exist to prevent this loss of information online. This summer, as a Junior Fellow with the Library of Congress Web Archiving Team, I have been investigating ways to identify content in the Library of Congress Web Archives that is no longer available on the live web. Identifying and communicating the status of URLs captured in a collection can not only demonstrate the value of web archives, it can also illustrate the impermanence of the internet more broadly.

The life cycle of a website, however, is unpredictable. Verifying the status of a site requires some level of manual investigation, which is time-consuming and may be impractical when working with large collections. The Library of Congress, for example, has over 58,000 unique seed URLs in its collections.

Over the course of the summer, I was able to speak or correspond with web archivists from Columbia University and the Ivy Plus Libraries, the Bibliothèque nationale de France, the New York Art Resources Consortium (NYARC), and the Collaborative ART Archive (CARTA) about this issue. These conversations helped me identify some relevant concerns and challenges across web archiving organizations. For example: routinely checking the status of sites is not usually possible, so web archivists rely on information from subject experts or error messages from web crawlers (the software used to archive a website) to determine what sites to manually review. In some cases, offline websites might eventually go back online, so one manual review may not even be sufficient to definitively say a site is permanently defunct.

With those considerations in mind, the primary objectives of this project were to take a close look at some of the Library’s web archive collections; identify the status of sites captured in those collections; take note of what tools and information were needed to confirm the status of those URLs; and explore how the resulting data could be effectively communicated with Library staff and users.

The Collection

The Library of Congress has 85 described web archive collections available online. For the purpose of this project, I chose to work primarily with the Papal Transition 2005 Web Archive. This collection contains 217 web archives related to the death of Pope John Paul II and the selection of Pope Benedict XVI in spring 2005. I picked this collection in particular because its size, age, and event-based context suggested that it would contain an informative variety of websites.

Homepage for the official website of the Vatican

The official website of the Vatican as captured by a web crawler on April 7, 2005.

Using an HTTP Status Tool

When we access websites through a browser, the browser sends an HTTP request to a server hosting the website. An HTTP response code is a three-digit value that indicates the outcome of that request. A response code of 200, for example, means the request was successful, while codes in the 400 and 500 ranges indicate errors on the client side and the server side, respectively. Codes in the 300 range indicate redirects and are not typically the final outcome of a request. It is also possible to receive an error message instead of a response code, which typically means the attempt to reach the remote web server was unsuccessful.
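The ranges described above can be expressed as a small lookup. This is only an illustrative sketch: the function name `describe_status` and the label strings are hypothetical, and `None` is used here as an assumed stand-in for the case where no response code was received at all.

```python
from typing import Optional


def describe_status(code: Optional[int]) -> str:
    """Map an HTTP response code to the broad outcome class it belongs to."""
    if code is None:
        # No code at all: the remote server could not be reached.
        return "no response"
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        # Redirects are usually an intermediate step, not the final outcome.
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


for example in (200, 301, 404, 503, None):
    print(example, "->", describe_status(example))
```

As the rest of this post shows, a label like "success" only describes the request, not whether the expected content is still there.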

I used an online tool to generate a list of response codes for all 217 URLs. These response codes provide an informative high-level view of the collection, but they are not definitive indicators of an individual website’s status – in fact, they are frequently misleading.

For example, a 200 response just means that whatever website is hosted at the provided URL is online. URLs can change ownership, however, so the site that loads may not be the site that was originally there when the collection was created. Likewise, a 404 error means only that the server could not find anything at the URL as entered; in many cases, the content originally hosted at that URL is still online, just at a different location.

Statement on Pope John Paul II that appeared on the Muslim Public Affairs Council website

A statement from the Muslim Public Affairs Council as captured by a web crawler on April 22, 2005. In 2022, this statement is still online, just at a different web address.

So, as things stand, the best way to tell if a website is actually offline is to manually go to the URL and check.

Checking URLs manually

I used a browser to systematically check all 217 URLs in the collection and organized them into four categories:

  1. Available – The expected content loads without error.
  2. Content Relocated – The expected content is still online, just at a different web address.
  3. Content Missing – The website itself is still online, but the expected content cannot be found, even at a different web address.
  4. Website down – The entire website is inaccessible at any address.

If a site loaded as expected, I marked it as available. If I encountered any issues or error messages, I did additional research to see if the content was relocated or to find any articles, announcements, or definitive indications that the site is offline.
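The four categories and the order in which they were considered can be sketched as a small decision function. The review itself was done by hand, so this is only a way of writing the logic down: the three yes/no inputs stand for a reviewer's judgments, and the names `SiteStatus` and `categorize` are hypothetical.

```python
from enum import Enum


class SiteStatus(Enum):
    AVAILABLE = "Available"
    CONTENT_RELOCATED = "Content Relocated"
    CONTENT_MISSING = "Content Missing"
    WEBSITE_DOWN = "Website down"


def categorize(loads_as_expected: bool,
               found_at_new_address: bool,
               site_online: bool) -> SiteStatus:
    """Turn a reviewer's three yes/no findings into one of the four categories."""
    if loads_as_expected:
        return SiteStatus.AVAILABLE
    if found_at_new_address:
        # The original URL failed, but the content turned up elsewhere.
        return SiteStatus.CONTENT_RELOCATED
    if site_online:
        # The site still exists, but the expected content could not be found.
        return SiteStatus.CONTENT_MISSING
    return SiteStatus.WEBSITE_DOWN


print(categorize(False, True, True).value)
```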

Results

Once I confirmed the status of all the URLs, I was able to compile some useful statistics:

Pie chart of the status of URLs in the Papal Transition 2005 Web Archive

52% of the sites in the collection were still available; 7% of the sites were still available but at a different location; 21% were completely offline; and 20% were still online but were missing the specific page or content that was intended to be archived.

Graph showing response codes vs. confirmed website status

About 73% of URLs that originally returned a 200 response code were still available. 6% were available but missing the expected page and content, 3% were available at a different address, and 18% were completely offline.
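Percentages like these can be computed from a list of (response code, confirmed status) pairs. The sketch below is one possible way to do that, not the method used for this project; the function names are hypothetical, and the records at the bottom are toy values for illustration only, not the project's data.

```python
from collections import Counter, defaultdict


def percent_breakdown(labels):
    """Return each label's share of the total as a whole-number percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * count / total) for label, count in counts.items()}


def breakdown_by_code(records):
    """Group (response_code, confirmed_status) pairs by code, then summarize each group."""
    grouped = defaultdict(list)
    for code, status in records:
        grouped[code].append(status)
    return {code: percent_breakdown(statuses) for code, statuses in grouped.items()}


# Toy records for illustration only; these are NOT the project's data.
toy_records = [
    ("200", "Available"),
    ("200", "Available"),
    ("200", "Available"),
    ("200", "Website down"),
    ("404", "Content Relocated"),
]
print(breakdown_by_code(toy_records))
```

Because each share is rounded separately, the percentages in a group may not always add up to exactly 100.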

While reproducing these steps with another collection would still require the manual review of URLs, there are some parts of the process that can be improved and automated. I wrote a Python script that reads a text file of URLs, sends an HTTP request to each URL, and then records the URL and the corresponding HTTP response code in a CSV file. This allows for all the URLs in a collection to be processed at once – the online tool I used originally could only accept up to 100 at a time.
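The script itself is not reproduced in this post. A minimal sketch of what such a script might look like, using only the Python standard library, follows. The function names, the CSV column names, and the user-agent string are all assumptions made for illustration.

```python
import csv
import http.client
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the HTTP response code for url as text, or a short error label."""
    try:
        request = urllib.request.Request(
            url, headers={"User-Agent": "link-check-sketch/0.1"}  # hypothetical name
        )
        # urlopen follows redirects, so this is the code of the final response.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return str(response.status)
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions but still carry a code.
        return str(err.code)
    except (OSError, ValueError, http.client.HTTPException) as err:
        # No usable response: DNS failure, timeout, malformed URL, and so on.
        return f"error: {type(err).__name__}"


def read_urls(path):
    """Read one URL per line from a text file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_report(urls, csv_path, fetch=fetch_status):
    """Write a CSV with one row per URL: the URL and its response code."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url", "status"])
        for url in urls:
            writer.writerow([url, fetch(url)])
```

Assuming a file named urls.txt with one URL per line (a hypothetical file name), calling write_report(read_urls("urls.txt"), "statuses.csv") would produce the CSV. The requests are sent one at a time, so a large collection could take a while to process.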

Final Thoughts

Based on the results I have so far and conversations I’ve had with other web archivists, the life cycle of websites is unpredictable to the extent that accurately tracking the status of a site inherently requires nuance, time, and attention, which is difficult to maintain at scale. This data is valuable, however, and is worth pursuing when possible. Reviewing a sample of URLs from larger collections could make this more manageable than comprehensive reviews.
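Drawing such a sample could be as simple as the sketch below. The function name `sample_urls` and the optional `seed` parameter are hypothetical; a fixed seed is only a convenience so that the same sample can be drawn again later for a follow-up review.

```python
import random


def sample_urls(urls, sample_size, seed=None):
    """Pick sample_size distinct URLs at random from a collection's URL list."""
    unique = sorted(set(urls))  # de-duplicate, and sort so a seed is reproducible
    if sample_size >= len(unique):
        return unique
    return random.Random(seed).sample(unique, sample_size)


# Placeholder URLs for illustration only.
collection = [f"http://site{n}.example/" for n in range(1000)]
for url in sample_urls(collection, 5, seed=2022):
    print(url)
```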

Of the content originally captured in the Papal Transition 2005 Collection, 41% is no longer available on the live web: 21% of the sites are completely offline, and another 20% are still online but missing the content that was archived. Without the archived pages, the information, perspectives, and experiences expressed on those websites would potentially be lost forever. They include blogs, personal websites, individually-maintained web portals, and annotated bibliographies. They frequently represent small voices and unique perspectives that may be overlooked or under-represented by large online publications with the resources to maintain legacy pages and articles.

Homepage of ReligionFacts.com

The homepage of ReligionFacts.com as captured by a web crawler on April 13, 2005.

The internet is impermanent in a way that is difficult to quantify. The constant creation of new information obscures what is routinely deleted, overwritten, and lost. While the scope of this project is small within the context of the wider internet, and even within the context of the Library’s Web Archive collections as a whole, I hope that it effectively demonstrates the value of web archives in preserving snapshots of the online world as it moves and changes at a record pace.

3 Comments

  1. Jim Kay
    August 7, 2022 at 6:00 pm

    I found this report fascinating. I am just starting a web archiving project. I would appreciate a future blog or post describing how to use a browser to manually check the status of a web site. It would also be helpful to see the Python script that reads a text file of URLs and records the responses in a CSV file.

  2. Peter Chan
    August 11, 2022 at 2:37 pm

    Would you mind sharing which “HTTP Status Tool” you used?

  3. Shawn Mulligan
    August 16, 2022 at 6:40 pm

    curl (available for pretty much every platform for free) can do this. You can use curl -I URL to only request headers, or if you want a more realistic test, curl -sD - -o /dev/null URL to request the whole page, dump the headers, and throw out the page contents. You’ll have to modify that slightly for non-Linux systems. You can pass the output through grep / sed to clean up the output (to return only the status code and nothing else, for example).
