The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Repository Development Group.

What is the value of a broken link? For understandable reasons, many would say, “not much.” While the destination-unaware nature of hyperlinks has facilitated the decentralized growth of the web, it has also greatly contributed to the perceived ephemerality of its contents. A recent literature review of the extent of link rot in academic publications, for example, demonstrated broken link rates of 39-83%. Concerned, in particular, with this reality, the Modern Language Association prominently dispensed with the requirement that URLs be included in works-cited lists in the most recent revision of its handbook. It might be said that if links are the currency of the web, then broken links are paper notes in a defunct coinage.

Nonetheless, there are at least two good reasons to conserve broken links: facilitating discovery of archived versions of a resource and providing metadata about the resource, whether or not it’s available.

Wikipedia, an endeavor that is heavily reliant on the validity of external links to corroborate facts, sagely advises in its editing guidelines that dead links should be flagged, not removed. The rationale provided is that 1) even a dead link strengthens the case that a fact was externally corroborated at one time; 2) checking a link and getting a 404 error can only be taken as confirmation that the resource was unavailable at that moment in time and at that particular location, not that it either doesn’t exist or has been permanently removed; and 3) knowing the original URL of a resource makes it much easier to rediscover it either in an archive or at a new location.
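The "flag, don't remove" guideline can be sketched in a few lines of Python. This is only an illustration, not Wikipedia's actual tooling: the `Citation` class, the `record_check` function, and the decision to treat any 4xx or 5xx status as a failed check are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass
class Citation:
    """A cited URL plus a history of failed link checks (hypothetical structure)."""
    url: str
    failed_checks: List[Tuple[date, int]] = field(default_factory=list)

    @property
    def flagged_dead(self) -> bool:
        # Flagged as soon as any check has failed; the URL itself is kept.
        return bool(self.failed_checks)


def record_check(citation: Citation, status: int, checked_on: date) -> Citation:
    """Record one HTTP status observation without ever deleting the URL.

    Assumption: any 4xx or 5xx status counts as a failed check. A failure
    only says the resource was unavailable at this URL on this date.
    """
    if status >= 400:
        citation.failed_checks.append((checked_on, status))
    return citation


cite = Citation("https://example.org/report.html")  # placeholder URL
record_check(cite, 200, date(2011, 8, 1))
record_check(cite, 404, date(2012, 3, 28))
print(cite.url, cite.flagged_dead, cite.failed_checks)
```

Because the URL is never discarded, a later check or an archive lookup can still start from it.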

Meanwhile, the tools to locate archived versions of web resources continue to improve. We’ve discussed Memento before, but to recap briefly in the context of link rot: Memento is a protocol that should, in principle, be able to answer the question, “does an archived version of this now-missing resource exist in any web archive?” Many people are familiar with Internet Archive’s Wayback Machine, but it is only one of many web archives and may lack content that others have collected. The existence of diverse and distributed web archives increases the likelihood that the content on the other side of a broken link can be rediscovered somewhere.
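As a rough illustration of how a Memento lookup is framed, the sketch below builds (but does not send) a request to a TimeGate, the Memento component that negotiates on datetime. The `Accept-Datetime` header comes from the Memento specification (RFC 7089); the TimeGate address, the `build_timegate_request` function, and the convention of appending the original URL to the TimeGate base are assumptions for the example, so check the documentation of whichever archive or aggregator you use.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple

# Placeholder TimeGate address; substitute a real archive's or aggregator's.
TIMEGATE_BASE = "https://timegate.example.org/timegate/"


def build_timegate_request(original_url: str, when: datetime,
                           timegate_base: str = TIMEGATE_BASE
                           ) -> Tuple[str, Dict[str, str]]:
    """Return the URL and headers for a Memento datetime-negotiation request.

    Assumption: the TimeGate accepts the original URL appended to its base
    address. That is a common convention, not something the protocol requires.
    """
    # Accept-Datetime carries an HTTP-date, which is always expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return timegate_base + original_url, {"Accept-Datetime": http_date}


url, headers = build_timegate_request(
    "http://example.org/missing-page.html",  # the broken link (placeholder)
    datetime(2012, 3, 28, 12, 0, tzinfo=timezone.utc),
)
print(url)
print(headers)
```

Under the specification, a TimeGate that knows of archived copies answers such a request by pointing to the one closest to the requested datetime. Note that the lookup is keyed on the original URL, which is why keeping the broken link matters.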

Aside from helping to recover putatively disappeared resources, URLs also describe those resources. URLs aren’t arbitrary; they may convey information about website structure, the date a particular resource was published, the document title, the author, descriptive keywords, and other characteristics. Even a URL whose host is merely an IP address potentially indicates the geographic locale where the resource is hosted.

Consider the URL: Without knowing anything more about that webpage, it can reasonably be assumed that a news item titled “State of the Word” was published on an official WordPress website in August 2011. If you navigate to that page (assuming the link isn’t broken by the time you read this), you’ll learn that 14.7% of the top million websites in the world and 22 out of every 100 newly registered domains run WordPress. Generalizing somewhat, that implies that a growing share of the web runs on content management systems that support metadata-rich URLs. Moreover, using meaningful, human-readable URLs has garnered increasing attention as a best practice in user-centered design.
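The kind of inference described above can be approximated mechanically. The following Python sketch pulls a host, a publication year and month, and title keywords out of a URL using only the standard library. The `describe_url` function and its heuristics (a four-digit path segment is a year, the last segment is a title slug) are assumptions for illustration, and the sample URL is a made-up placeholder on `example.org`, not the page discussed in the post.

```python
import re
from typing import Dict
from urllib.parse import urlsplit


def describe_url(url: str) -> Dict[str, object]:
    """Guess descriptive metadata from a URL's structure (heuristic sketch)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]

    year = month = None
    for i, seg in enumerate(segments):
        # Heuristic: a segment like "2011" is a year, and a following
        # two-digit segment from "01" to "12" is a month.
        if re.fullmatch(r"(19|20)\d{2}", seg):
            year = int(seg)
            nxt = segments[i + 1] if i + 1 < len(segments) else ""
            if re.fullmatch(r"0[1-9]|1[0-2]", nxt):
                month = int(nxt)
            break

    # Heuristic: the last path segment is a title slug; drop any file
    # extension and keep only the purely alphabetic words.
    slug = segments[-1].rsplit(".", 1)[0] if segments else ""
    keywords = [w for w in re.split(r"[-_]+", slug) if w.isalpha()]

    return {"host": parts.hostname, "year": year,
            "month": month, "keywords": keywords}


# Made-up example URL in the same style as many CMS-generated links.
print(describe_url("https://example.org/news/2011/08/annual-report-summary/"))
```

None of this requires the page to still exist, which is the point: the URL carries the metadata on its own.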

Admittedly, broken links themselves are not altogether user-friendly, but, on balance, there is much to be gained from conserving rather than discarding them. As the means to rediscover archived resources become more seamless and URLs become more descriptive and human-legible, if anything, the value of a broken link seems to be increasing.


  1. Trevor
    March 28, 2012 at 11:12 am

    Great post, Nick. This is what I find particularly irksome about the MLA style guide’s decision to remove links from bibliographic citations. The base URL, the top-level domain, and each bit of the rest of the link communicate a lot of information. Links themselves tell us something about how information is organized, and at the very least provide a unique identifier for where something once resided.

  2. Gloria
    March 28, 2012 at 2:01 pm

    You make a strong argument, thanks for a great post. For anyone interested in additional related reading on the topic, I suggest the short book “Vanishing Act: The Erosion of Online Footnotes and Implications for Scholarship in the Digital Age” by Michael Bugeja and Daniela Dimitrova (2010). They research the “half-life” of linked citations (the amount of time it takes for half of the links in a given journal volume to rot).

  3. Walker
    March 28, 2012 at 6:34 pm

    Great point and post.

    Wikipedia’s third reason for flagging broken URLs is very true and in my experience an understatement.

    Internet Archive and Wayback Machine both require a URL to begin a search — without that URL, one is at a loss. In a particular case I had to Google and click around for hours to find a lost URL so I could send the Internet Archive on its way to the stored copies of the site.

    It’s funny to think this URL from 12+ years ago, which I never thought at the time to be important, became the key to recovering the material. Thank goodness I found a copy of that broken link.

  4. Robin Camille
    March 29, 2012 at 11:50 am

    Thanks for this fantastic post. I too am always surprised when bibliographies omit URLs, and I was troubled when the new and much-touted tweet citation standard did not include a URL, either. You make a great point, and I hope the effort designers are putting into human-friendly URLs pays off as metadata at the very least.

    Another item of concern: URL-shortening services. I am certain that those links will all be broken (or repurposed?) in a very short period of time, with zero metadata to provide any clues.

  5. Nicholas Taylor
    March 30, 2012 at 3:59 pm

    @Robin Camille: thanks for the comment; those are two very timely examples!

    I can sort of understand the tweet citation standard insofar as the citation itself ostensibly includes all of the “content” of the cited work. But then you look at Raffi Krikorian’s graphic deconstruction of all the tacit information in a tweet and realize just how much data is effectively consigned to obscurity by leaving out the URL. It may not be especially likely, even armed with a tweet’s URL, that one will be able to retrieve it from somewhere, but it’s far more certain that omitting the URL eliminates that possibility.

    URL-shortening services indeed add another point of failure to link resolution. Moreover, Old Dominion University’s Web Research and Digital Libraries Research Group found that randomly sampled URLs dereferenced from one URL shortener were more poorly represented in web archives and search engine caches than randomly sampled URLs from search engine results, a bookmark-sharing service, and an open link directory service. It is largely for this reason that Internet Archive now archives the shortened URL mappings.

    On a related note, if I’ve got my Robin Camilles right, thanks for a great blog post last summer on designing preservable websites! Hope you don’t mind that I ran with your idea in a follow-up post of my own. :)

