The following is a guest post by Nicholas Taylor, Data Specialist for the National Digital Information Infrastructure and Preservation Program.
A previous post on this blog explored why it’s so hard to come up with a reliable measurement of the average lifespan of a webpage. In essence, the argument came down to this: links and the websites they represent tend to become decoupled over time. Without a broad understanding of how that process takes place, it’s hard to make definitive claims about the persistence of websites when the available automated tools can reliably check only the persistence of links.
In an ideal web, webmasters would adhere to Tim Berners-Lee’s notion of “cool URIs” – links that have been purposely maintained so as to remain stable. Stable links are more useful to users, and it is technically feasible to maintain any particular link for at least the lifespan of the resource it points to. However, given both the popular perception of and the abundance of scholarly literature on link decay, it’s probably safe to say that Tim Berners-Lee’s vision for a cool URI-enabled web hasn’t yet been realized.
The good news is that websites are more durable than links. This is supported by multiple studies and also makes intuitive sense. The bad news is that most contemporary web archiving tools are actually link archiving tools; they are designed to capture and replay whatever content a link currently points to, not the intellectual objects (i.e., the websites) of interest per se. For the Library of Congress thematic web archives, we can only ensure that the links we’re capturing continue to correspond to the websites we care about preserving by manually inspecting them on an ongoing basis.
To better understand the discrepancy between link and website persistence as well as the disposition of websites that we previously archived, intern Heidi Hernandez and I revisited 1,071 links archived as part of the U.S. Election 2002 web archive collection. We excluded over 1,000 links corresponding to electoral candidate websites, as they were especially short-lived. The remaining links corresponded to state government, political party, advocacy group, newspaper, and political blog sites.
We followed a two-part methodology. First, following the approach of many other link persistence studies, we ran the entire list of links through a link checker and recorded the HTTP response codes (e.g., 200, 301, 404, 500). Second, we visited each of the links and noted whether the corresponding website was the same as the one that had been archived. If the website was different, or if the link didn’t work, we tried to discover the new location of the website using search engines.
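The first, automated step can be sketched in a few lines of Python. This is only an illustration of the general idea, not the link checker we actually used: the names `NoRedirect`, `fetch_status`, `classify`, and `check_links` are hypothetical, the use of `HEAD` requests is an assumption (some servers reject them), and the grouping of codes into labels is one possible convention.

```python
import http.client
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Decline to follow redirects so that codes such as 301 are reported as-is."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        # Assumption: a HEAD request is enough to learn the status code.
        request = urllib.request.Request(url, method="HEAD")
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses (including unfollowed redirects) arrive here.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def classify(status):
    """Group a status code into a coarse, hypothetical label."""
    if status is None:
        return "no response"
    if 200 <= status < 300:
        return "working"
    if 300 <= status < 400:
        return "redirected"
    if 400 <= status < 500:
        return "client error"
    if 500 <= status < 600:
        return "server error"
    return "other"


def check_links(urls, fetch=fetch_status):
    """Map each URL to a (status code, label) pair."""
    results = {}
    for url in urls:
        status = fetch(url)
        results[url] = (status, classify(status))
    return results
```

Calling `check_links` on a list of URLs yields one status code and label per link. Note that a "working" label says nothing about whether the page is still the website that was archived, which is why the second, manual step is needed.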
There were a few noteworthy findings:
- Taking the link persistence measurement as a measurement of website persistence overestimated the latter, because some working links pointed to different websites. In our study, 8% of working links pointed to different websites.
- Taking the link persistence measurement as a measurement of website persistence also underestimated website persistence, and by a larger margin, because websites often existed at new locations even if their previous link either no longer worked or now corresponded to a different website. In our study, 82% of websites associated with non-working links and 48% of websites whose links now corresponded to different websites still existed.
- In aggregate, 94% of the previously archived websites still existed, 3% more than would’ve been predicted by checking links alone.
This last point should most certainly not be interpreted as a sign of the superfluity of web archiving; recall that over 1,000 links to now-disappeared websites were excluded from the analysis. Also consider, for example, that just because the White House website still exists eleven years after we archived it as part of the U.S. Election 2002 web archive collection doesn’t mean that any of the resources that made up the White House website of eleven years ago are still accessible now.
All in all, though, the results suggest a more complicated picture of the ephemeral web than the popular conception that tends to conflate the disappearance of links and that of websites.
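For readers who want to see how the two measurements diverge, here is a minimal sketch of the tallying, assuming each link has been reduced to three yes/no observations. The `Observation` record, its field names, and the `summarize` function are hypothetical, and the sample at the end is made up for illustration; it is not our study data.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    link_works: bool       # the link checker reported a working link
    same_site: bool        # the link still leads to the archived website
    found_elsewhere: bool  # the website was located at a new address


def summarize(observations):
    """Compare link persistence with website persistence for one sample."""
    total = len(observations)
    if total == 0:
        raise ValueError("need at least one observation")
    working = sum(1 for o in observations if o.link_works)
    # Websites still reachable at their original link.
    same = sum(1 for o in observations if o.link_works and o.same_site)
    # Websites that survive only at a new location.
    relocated = sum(
        1
        for o in observations
        if not (o.link_works and o.same_site) and o.found_elsewhere
    )
    return {
        "link_persistence": working / total,
        "website_persistence": (same + relocated) / total,
        "working_but_different": (working - same) / working if working else 0.0,
    }


# Made-up sample of ten links, for illustration only.
sample = (
    [Observation(True, True, False)] * 7     # link works, same website
    + [Observation(True, False, True)]       # link works, website moved
    + [Observation(False, False, True)]      # link broken, website moved
    + [Observation(False, False, False)]     # link broken, website gone
)
summary = summarize(sample)
```

In this made-up sample, link persistence comes out lower than website persistence, because two of the websites survive only at new addresses, while one of the working links no longer leads to the archived website.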