Link Persistence, Website Persistence

The following is a guest post by Nicholas Taylor, Data Specialist for the National Digital Information Infrastructure and Preservation Program.

A previous post on this blog explored why it’s so hard to come up with a reliable measurement of the average lifespan of a webpage. In essence, the argument came down to this: links and the websites they represent tend to become decoupled over time. Without a broad understanding of how that process takes place, it’s hard to make definitive claims about the persistence of websites when available automated tools can only capably check for the persistence of links.

Working link, different website: in 2002, http://www.fb.com/ was the American Farm Bureau website. The link still works in 2013, but it now corresponds to Facebook’s website. The American Farm Bureau website has moved to http://www.fb.org/.

In an ideal web, webmasters would adhere to Tim Berners-Lee’s notion of “cool URIs” – links that have been purposely maintained so as to remain stable. Stable links are more useful to users, and it is technically feasible to maintain any particular link for at least the lifespan of the resource it points to. However, given both the popular perception of and the abundance of scholarly literature on link decay, it’s probably safe to say that Tim Berners-Lee’s vision for a cool URI-enabled web hasn’t yet been realized.

The good news is that websites are more durable than links. This is supported by multiple studies and makes intuitive sense as well. The bad news is that most contemporary web archiving tools are actually link archiving tools; they are designed to agnostically capture and replay the content represented by links, not the intellectual objects (i.e., the websites) of interest per se. For the Library of Congress thematic web archives, we can only ensure that the links we’re capturing continue to correspond to the websites we care about preserving by manually inspecting them on an ongoing basis.

Non-working link, but the website still exists: in 2002, the Missouri Secretary of State website could be found at http://www.state.mo.us/. That link no longer works in 2013, but the website still exists, now at http://www.sos.mo.gov/.

To better understand the discrepancy between link and website persistence as well as the disposition of websites that we previously archived, intern Heidi Hernandez and I revisited 1,071 links archived as part of the U.S. Election 2002 web archive collection. We excluded over 1,000 links corresponding to electoral candidate websites, as they were especially short-lived. The remaining links corresponded to state government, political party, advocacy group, newspaper, and political blog sites.

We followed a two-part methodology. First, following the approach of many other link persistence studies, we ran the entire list of links through a link checker and recorded the HTTP response codes (e.g., 404, 200, 301, 500). Second, we visited each of the links and noted whether the corresponding website was the same as the one that had been archived. If the website was different, or if the link didn’t work, we tried to discover the new location of the website using search engines.
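The first, automated step can be approximated in a few lines of Python. This is only a minimal sketch, not the link checker used in the study: the function names `fetch_status` and `classify_status` and the four category labels are assumptions made for illustration. It records the raw status code without following redirects, so that a 301 shows up as a 301 rather than as the status of wherever it leads.

```python
import urllib.error
import urllib.request
from typing import Optional


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so 3xx codes are recorded as-is."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if no response arrived.

    Hypothetical helper. It sends a HEAD request; some servers reject
    HEAD, so a real checker may need to fall back to GET.
    """
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 3xx, 4xx and 5xx responses arrive here and still carry a code.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def classify_status(code: Optional[int]) -> str:
    """Bucket a status code into coarse categories (labels are illustrative)."""
    if code is None:
        return "no response"
    if 200 <= code < 300:
        return "working"
    if 300 <= code < 400:
        return "redirect"
    return "error"
```

A "working" result here says only that the link answers; it cannot tell whether the site behind it is the one that was archived, which is why the second, manual step was needed.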

There were a few noteworthy findings:

  1. Taking the link persistence measurement as a measurement of website persistence overestimated the latter, because some working links pointed to different websites. In our study, 8% of working links pointed to different websites.
  2. Taking the link persistence measurement as a measurement of website persistence also and more significantly underestimated website persistence, because websites often existed at new locations even if their previous link either no longer worked or now corresponded to a different website. In our study, 82% of websites associated with non-working links and 48% of websites whose links now corresponded to different websites still existed.
  3. In aggregate, 94% of the previously archived websites still existed, 3% more than would’ve been predicted by checking links alone.
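The way the two measurements diverge can be shown with a small tally. The sketch below is illustrative only: the `Observation` record, its field names, and the ten sample rows are invented for the example and are not the study's data. It computes link persistence from the automated check alone and website persistence from the manual inspection.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """One revisited link (hypothetical record layout)."""
    link_works: bool       # the link checker got a successful response
    same_site: bool        # the link still leads to the archived website
    found_elsewhere: bool  # the website was located at a new address


def link_persistence(rows: List[Observation]) -> float:
    """Share of links that still work, regardless of what they point to."""
    return sum(r.link_works for r in rows) / len(rows)


def website_persistence(rows: List[Observation]) -> float:
    """Share of websites that still exist, at the old link or a new one."""
    alive = sum((r.link_works and r.same_site) or r.found_elsewhere for r in rows)
    return alive / len(rows)


# Made-up sample: ten links, not figures from the study.
sample = (
    [Observation(True, True, False)] * 7     # working link, same website
    + [Observation(True, False, True)]       # working link, site moved
    + [Observation(False, False, True)]      # dead link, site moved
    + [Observation(False, False, False)]     # dead link, site gone
)

link_rate = link_persistence(sample)
site_rate = website_persistence(sample)
```

In this toy sample the link checker alone would report 8 of 10 links as persistent, while 9 of 10 websites actually survive: one working link is miscounted as a surviving site at that address, and a relocated site behind a dead link is missed.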

This last point should most certainly not be interpreted as a sign of the superfluity of web archiving; recall that over 1,000 links to now-disappeared websites were excluded from the analysis. Also consider, for example, that just because the White House website still exists eleven years after we archived it as part of the U.S. Election 2002 web archive collection doesn’t mean that any of the resources that made up the White House website of eleven years ago are still accessible now.

All in all, though, the results suggest a more complicated picture of the ephemeral web than the popular conception, which tends to conflate the disappearance of links with that of websites.
