The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Repository Development Group.
What is the average lifespan of a webpage? Predictably, estimates vary and vary over time. A 1997 special report in Scientific American claimed 44 days. A subsequent 2001 academic study in IEEE Computer suggested 75 days. More recently, in 2003, a Washington Post article indicated that the number was 100 days.
While there appear to be overall fewer estimates of webpage longevity floating around than, say, the amount of data stored in the Library of Congress, we can at least feel more assured that they’ve all come from someone who should know: Brewster Kahle, founder of the Internet Archive.
Determining the average lifespan of a webpage is complicated not just by the infrastructure required to analyze a plausibly representative sample of links across the web but also because it’s easy to conflate “the average lifespan of a webpage” with other closely-related concepts that are, in actuality, much more difficult to measure. That is to say, we take for granted that we know what it means that a webpage has “died.”
For instance, is a “webpage” defined by its URI or by its contents? A non-resolving link doesn’t necessarily imply that the content once hosted there no longer exists; it may have been archived or simply exist at a new location (albeit one mediated by a paywall) to which the web server was not configured to redirect page requests. Conversely, a resolving link doesn’t necessarily imply that the same content is still hosted there as it once was.
An automated link checker visiting a list of URIs and logging all ultimately successful and failed requests would miss these subtleties. A human being with a limitless amount of time who set out to manually check the same list might still get hung up on exhaustively concluding that a disappeared webpage did not, in fact, exist at a new URI or on the subjective determination of whether an extant webpage could be said to be the “same” webpage as before.
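To make the limitation concrete, here is a minimal sketch of such a link checker in Python. The names `CheckResult`, `fetch`, and `classify` are hypothetical, invented for this illustration, and comparing a SHA-256 digest of the page body against an earlier visit is only an assumed stand-in for "same content"; it is far stricter than a human judgment, since any dynamic element changes the digest.

```python
import hashlib
import urllib.error
import urllib.request
from typing import NamedTuple, Optional


class CheckResult(NamedTuple):
    uri: str
    status: Optional[int]      # final HTTP status, or None if no response arrived
    final_uri: Optional[str]   # URI after any redirects, for successful requests
    digest: Optional[str]      # SHA-256 of the body, for successful requests


def fetch(uri: str, timeout: float = 10.0) -> CheckResult:
    """Request one URI, following redirects, and record what came back."""
    try:
        with urllib.request.urlopen(uri, timeout=timeout) as resp:
            body = resp.read()
            return CheckResult(uri, resp.status, resp.geturl(),
                               hashlib.sha256(body).hexdigest())
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 410.
        return CheckResult(uri, err.code, None, None)
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return CheckResult(uri, None, None, None)


def classify(result: CheckResult, previous_digest: Optional[str]) -> str:
    """Label a result the way a simple log of successes and failures would."""
    if result.status is None or result.status >= 400:
        return "failed"
    if previous_digest is not None and result.digest != previous_digest:
        return "resolves, content changed"
    return "resolves"
```

Note what the sketch cannot tell us: a "failed" result says nothing about whether the content lives on at a new URI or in an archive, and a "resolves" result with no earlier digest to compare against says nothing about whether the content is the same as before.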
There are additional complications in these sorts of analyses. While even the longest of the aforementioned webpage lifespans suggests that webpages are ephemeral, some are so fleeting that their lifespans are better measured in hours rather than days. Analyzing its web index, Google noticed that the median lifespan of malware-distributing domains decreased from one month in 2007 to a mere two hours by 2010. Since most commercial web search engines penalize listings from such domains, malware distributors are incentivized to churn quickly through massive numbers of new domains. The number of domains being created and their transience may skew average lifespan calculations by automated methods downward.
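A small Python sketch shows how strongly a crowd of short-lived pages can pull an average down. Every number here is invented purely for illustration; none of it is measured data.

```python
from statistics import mean, median

# Hypothetical lifespans in days for five ordinary pages (invented values).
ordinary_pages = [30, 60, 90, 120, 400]

# Twenty hypothetical malware pages that each lasted two hours,
# expressed in days so the units match.
malware_pages = [2 / 24] * 20

combined = ordinary_pages + malware_pages

print("ordinary only:", mean(ordinary_pages), median(ordinary_pages))
print("with malware: ", mean(combined), median(combined))
```

In this made-up sample, the ordinary pages alone average 140 days, while the combined sample averages roughly 28 days and its median collapses to two hours. Which figure counts as "the" average lifespan depends entirely on whether such pages belong in the sample.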
Finally, there’s no practical way of knowing precisely when a webpage disappears; we can only know the time difference between a previous visit when the webpage existed and a later visit when it didn’t. Depending on the breadth of the crawl and the infrastructure available, it may be days, weeks, or even months before the crawler visits the same webpage twice. This margin of variability may undermine the precision of webpage lifespans for which the appropriate scale of measurement likewise appears to be days, weeks, or months.
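The point about revisit intervals can be sketched in a few lines of Python. The function name `disappearance_window` and the visit dates are hypothetical, chosen only to illustrate the idea: all a crawler can report is the window between the last visit that found the page and the first later visit that didn't.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# One crawl observation: (time of visit, whether the page resolved).
Visit = Tuple[datetime, bool]


def disappearance_window(visits: List[Visit]) -> Optional[Tuple[datetime, datetime]]:
    """Return (last visit that found the page, first later visit that did not).

    Returns None if the page was never seen alive or never seen missing afterward.
    """
    last_alive = None
    for when, alive in sorted(visits):
        if alive:
            last_alive = when
        elif last_alive is not None:
            return (last_alive, when)
    return None


# Hypothetical crawl history: the page was found twice, then missing.
visits = [
    (datetime(2011, 1, 1), True),
    (datetime(2011, 2, 1), True),
    (datetime(2011, 4, 1), False),
]
window = disappearance_window(visits)
print(window[1] - window[0])  # width of the uncertainty window
```

Here the page vanished at some point within a 59-day window. If true lifespans are on the order of 44 to 100 days, an uncertainty of that size is as large as the quantity being measured. The same problem applies at the other end, since the crawler rarely knows when the page was first published.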
What this all means for calculations of the average longevity of a webpage is that, while Internet Archive’s estimates may be the best available, there are key limitations and caveats behind any of the numbers proffered to date. Unfortunately, it’s unlikely that we’ll have objective measurements better than the gross methodologies permitted by automated link checking any time soon.
Comments (8)
The Chesapeake Digital Preservation Group has an interesting 2011 study on “link rot” in online legal resources: http://cdm16064.contentdm.oclc.org/cdm/compoundobject/collection/p266901coll4/id/3505
Also interesting would be to calculate the median lifespan of a webpage and see how long the tail is… Are there any stats about that ?
For these very reasons, for ten years, I manually created and painstakingly kept up-to-date a collection page of links and documents in a specific field because information was so fluid and people in the field needed to see the yearly change in documents, regulations and such. Fully 1/2 of my links were eventually to the Internet Archive and its copies as things disappeared. Then some of the authoring organizations realized the old copies were sticking around but they did not want that history available (presumably to hide the fact they were changing things without notice or public disclosure). So they started forcing the removal of PDF and similar documents off the Archive and then eventually my site. So sad, as even history for an educational understanding is now lost and has been reshaped. I liken it to an equivalent of a book burning of old as it truly can become a new form of deliberate censorship and rewriting of history as well.
@Ido Peled: The academic studies that I’ve seen on the topic of link rot have been concerned with the persistence of cited URLs in published research. It’s possible that some of these may offer median webpage lifespans, albeit for a narrower link corpus. If you wanted to explore this, I’d recommend searching for publications that cite the foundational study by Steve Lawrence et al. (2001), “Persistence of Web References in Scientific Research.” I’m not aware of any statistics on the median lifespan of webpages generally, though.
Thanks to Mike Ashenfelder for providing this analysis of Nov 8, 2011, by Nicholas Taylor, and to the others who comment on the short lifespan of websites, including ephemeral links to sources, most recently estimated in 2003 at about 100 days.
http://blogs.loc.gov/digitalpreservation/2011/11/the-average-lifespan-of-a-webpage/
The Internet enables saving lives, time and money with traceability to original sources essential for performing daily work quickly and accurately. Low durability significantly undermines this benefit. One approach is to incorporate sources into a web data base that is locally maintained and cite it there with reference to the original source, so that content can consistently be reviewed quickly when needed.
Is this still true nowadays?
Did this webpage already die?
Thanks x