The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Repository Development Group.
What is the value of a broken link? For understandable reasons, many would say, “not much.” While the destination-unaware nature of hyperlinks has facilitated the decentralized growth of the web, it has also greatly contributed to the perceived ephemerality of its contents. A recent literature review of the extent of link rot in academic publications, for example, found broken-link rates of 39-83%. Concerned with this reality in particular, the Modern Language Association prominently dispensed with the requirement that URLs be included in works-cited lists in the most recent revision of its handbook. It might be said that if links are the currency of the web, then broken links are paper notes in a defunct denomination.
Nonetheless, there are at least two good reasons to conserve broken links: facilitating discovery of archived versions of a resource and providing metadata about the resource, whether or not it’s available.
Wikipedia, an endeavor that is heavily reliant on the validity of external links to corroborate facts, sagely advises in its editing guidelines that dead links should be flagged, not removed. The rationale provided is that 1) even a dead link strengthens the case that a fact was externally corroborated at one time; 2) checking a link and getting a 404 error can only be taken as confirmation that the resource was unavailable at that moment in time and at that particular location, not that it either doesn’t exist or has been permanently removed; and 3) knowing the original URL of a resource makes it much easier to rediscover it either in an archive or at a new location.
Meanwhile, the tools to locate archived versions of web resources continue to improve. We’ve discussed Memento before, but to briefly recap in the context of link rot, Memento is a protocol that theoretically should be able to answer the question: “does an archived version of this now-missing resource exist in any web archive?” Many people are familiar with Internet Archive’s Wayback Machine, yet it is only one of many web archives and may lack content that others have collected. The existence of diverse and distributed web archives increases the possibility that the content on the other side of a broken link may be rediscovered somewhere.
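To make that question concrete, here is a minimal sketch in Python of how a Memento lookup might be prepared. Under the protocol, a client asks a "TimeGate" for the archived version of a URL closest to a preferred date, which it passes in an Accept-Datetime request header. The aggregator address and the function name build_timegate_request below are assumptions for illustration; the sketch only assembles the request and does not send it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: the address of a public Memento aggregator's TimeGate.
# Substitute whichever TimeGate you actually intend to query.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(original_url, when):
    """Return (request_url, headers) for a Memento TimeGate lookup.

    `when` must be a timezone-aware datetime; it is converted to UTC
    and formatted as an HTTP date for the Accept-Datetime header.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    headers = {"Accept-Datetime": http_date}
    # The TimeGate is addressed by appending the original URL to its prefix.
    return TIMEGATE_PREFIX + original_url, headers


if __name__ == "__main__":
    url, headers = build_timegate_request(
        "http://wordpress.org/news/2011/08/state-of-the-word/",
        datetime(2011, 8, 1, tzinfo=timezone.utc),
    )
    print(url)
    print(headers)
```

Sending this request with any HTTP client would, if an archived copy is known, typically produce a redirect to it. Note that the original URL is the one piece of input the lookup cannot do without.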
Aside from helping to recover putatively-disappeared resources, URLs also describe those resources. URLs aren’t arbitrary; they may convey information about website structure, the date a particular resource was published, the document title, the author, descriptive keywords, and other characteristics. Even a URL whose host is merely an IP address potentially indicates the geographic locale where that resource is hosted.
Consider the URL: http://wordpress.org/news/2011/08/state-of-the-word/. Without knowing anything more about that webpage, it can reasonably be assumed that a news item titled “State of the Word” was published on an official WordPress website in August 2011. If you navigate to that page (assuming the link isn’t broken by the time you read this), you’ll learn that 14.7% of the top million websites in the world and 22 out of every 100 newly-registered domains run WordPress. Generalizing somewhat, that implies that an increasing percentage of the web is using content management systems supporting richer-metadata URLs. Moreover, using meaningful, human-readable URLs has garnered increasing attention as a best practice in user-centered design.
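The inferences drawn from the WordPress URL above can be sketched in a few lines of Python using only the standard library. The function name describe_url and its heuristics (a four-digit path segment is read as a year, a following two-digit segment as a month, and the final segment as a title) are assumptions made for illustration; real-world URL schemes vary widely, and a guess of this kind can be wrong.

```python
from urllib.parse import urlparse


def describe_url(url):
    """Guess descriptive metadata from a URL's host and path segments."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    info = {"host": parts.hostname, "segments": segments}

    # Heuristic: the first four-digit segment is a year; a two-digit
    # segment immediately after it is a month.
    for i, seg in enumerate(segments):
        if len(seg) == 4 and seg.isdigit():
            info["year"] = int(seg)
            nxt = segments[i + 1] if i + 1 < len(segments) else ""
            if len(nxt) == 2 and nxt.isdigit():
                info["month"] = int(nxt)
            break

    # Heuristic: a non-numeric final segment is a hyphenated title slug.
    if segments and not segments[-1].isdigit():
        info["title_guess"] = segments[-1].replace("-", " ")
    return info


if __name__ == "__main__":
    print(describe_url("http://wordpress.org/news/2011/08/state-of-the-word/"))
```

None of this requires the page itself to be reachable, which is the point: the description survives even when the link does not.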
Admittedly, broken links themselves are not altogether user-friendly, but, on balance, there is much to be gained from conserving rather than discarding them. As the means to rediscover archived resources become more seamless and URLs become more descriptive and human-legible, if anything, the value of a broken link seems to be increasing.