Designing Preservable Websites, Redux

The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Repository Development Group at the Library of Congress.

As much as we can do to preserve archived websites once we have them, the challenges we encounter are always already determined by how those websites were originally constructed. In the interest of giving us and others the best possible chance of preserving your online content, I wanted to follow up on an excellent blog post by Robin Davis of the Smithsonian Institution Archives on the topic of designing preservable websites. Here are some best practices to keep in mind:

1)      Follow web standards and accessibility guidelines.

Following web standards and accessibility guidelines is useful for reasons beyond web archiving, namely, (universal) usability and SEO. It also facilitates better website archiving and replay. Because web crawlers, including the archival Heritrix crawler, access websites in a manner similar to a text browser, accessible websites are friendlier to web crawlers. Adherence to web standards makes for fewer cumulative idiosyncrasies that the Wayback Machine must accommodate over time in rendering archived websites.

Lascaux cave painting by Flickr user goforchris

2)      Be careful with robots.txt exclusions.

Certain types of instructions entered into robots.txt may at once be fine for search engine crawlers but prevent archival crawlers from capturing content that is crucial for a faithful reproduction of the website. For example, instructing crawlers to stay out of a website’s CSS and JavaScript directories wouldn’t detract significantly from the quality of a search engine index, but it would make a big difference in the quality of an archival capture.
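As a rough illustration, the sketch below uses Python's standard urllib.robotparser module to check which URLs a given set of rules would put off limits. The robots.txt content, the example.org URLs, and the "archive-crawler" user agent name are all invented for the example; they are not taken from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps every crawler out of the asset directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /css/
Disallow: /js/
"""


def blocked_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given rules forbid user_agent from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical URLs; "archive-crawler" is an assumed user agent name.
urls = [
    "https://example.org/index.html",
    "https://example.org/css/site.css",
    "https://example.org/js/menu.js",
]
blocked = blocked_urls(ROBOTS_TXT, "archive-crawler", urls)
print(blocked)
```

Run against your own robots.txt, a check along these lines would show whether the stylesheets and scripts needed for faithful replay are among the excluded paths.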

3)      Use a site map, transparent links, and contiguous navigation.

A crawler can only capture webpages that it knows about. It discovers webpages by traversing links, meaning that it can ultimately only ever capture pages that are accessible by following links alone. A corollary is that a user browsing an archived website can only navigate by following links, because server-side functionalities like search don’t work in the archive. Avoid relying on Flash, JavaScript, or other techniques that tend to obfuscate links as the sole means of navigating to any specific page, and consider creating a comprehensive site map to ensure that the crawler doesn’t miss anything.
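To make the point about obfuscated links concrete, here is a deliberately simplified model of link discovery, written with Python's standard html.parser module. It only collects href values from anchor tags; real archival crawlers apply more heuristics than this, so treat it as a sketch of the principle rather than a description of how any particular crawler works. The page markup and paths are made up.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, as a simple link-following crawler might."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def discoverable_links(html):
    """Return the link targets that are visible as plain anchors in the markup."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Hypothetical page: one page is reachable only through a script handler.
PAGE = """
<a href="/about.html">About</a>
<span onclick="window.location='/hidden.html'">Hidden</span>
<a href="/sitemap.html">Site map</a>
"""
print(discoverable_links(PAGE))
```

In this model, /hidden.html is never discovered because no plain link points to it, which is exactly the gap a comprehensive site map is meant to close.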

4)      Maintain stable URIs and redirect when necessary.

The stability of the Library of Congress’ URI over time makes it possible to view website captures from 1997 to present in a single unbroken timeline in the Internet Archive Wayback Machine. It also means that any individual bookmarks saved or inbound links published and circulated continue to work as they always have. Link rot on the web generally is, by unfortunate contrast, altogether common.

When a URI changes and no redirect to the new location is put in place, the new URI is less likely to be archived, and captures made before the change will almost certainly be dissociated from those made after it. Web archiving tools’ sensitivity to URI stability also means that URIs containing session IDs may be similarly dissociated from earlier captures of the same resource.
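The session ID problem can be sketched in a few lines of Python using the standard urllib.parse module. The parameter names in SESSION_PARAMS are assumptions chosen for illustration, as are the example.org URIs; a real site would need to substitute whatever names its own software uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed query parameter names that carry session IDs (illustrative only).
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}


def canonical_uri(uri):
    """Return the URI with session-ID query parameters removed."""
    parts = urlsplit(uri)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Two hypothetical visits to the same page, each with its own session ID.
first = canonical_uri("https://example.org/page?id=7&PHPSESSID=abc123")
second = canonical_uri("https://example.org/page?id=7&PHPSESSID=xyz789")
print(first == second)
```

Without some normalization of this kind, the two visits look like two unrelated resources. Keeping session state out of the URI in the first place avoids the problem entirely.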

5)      Consider using a Creative Commons license.

Pending changes to the U.S. copyright statute to address digital preservation needs, we must request permission from most website owners to re-display their crawled website outside of the Library of Congress and/or to even crawl their website in the first place. The Library of Congress is but one of a number of web archiving institutions that must solicit permissions. A website published under a Creative Commons license provides an affirmative permission to be crawled and preserved.

Rosetta Stone_British Museum_2475 by Flickr user KitLKat

6)      Use sustainable data formats.

Though a webpage is presented as a unified experience, it consists of many different files and file types. A commitment to preserving that experience therefore implies a commitment to managing the potentially distinct preservation risks of all the component file types. When deciding what types of code and file formats to use in building a website, open standards and open file formats are generally the best choices for preservation. The exception is when the open format is either poorly documented or allows for vendor-specific extensions; such formats may well be worse than well-documented proprietary formats that are widely implemented in a uniform way. The Sustainability of Digital Formats website outlines a number of criteria that make for a truly “sustainable” format besides ostensible “openness.”

7)      Embed metadata, especially the character encoding.

Since web servers don’t reliably report character encoding, it is important that pages do so themselves. Use an HTML meta tag or the encoding attribute of the XML declaration to indicate what encoding should be used to render the page. Additional embedded metadata is useful both for SEO and for curated collections of web archives, such as those maintained by the Library of Congress, which draw upon site-provided metadata for access points.
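One way to check whether a page declares its own encoding is sketched below, again with Python's standard html.parser module. It looks for either form of HTML meta tag and ignores everything else (HTTP headers, byte order marks, XML declarations), so it is a minimal check under those assumptions, not a full encoding detector. The sample markup is invented.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Record the first character encoding declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: <meta charset="...">
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Long form: <meta http-equiv="Content-Type" content="...; charset=...">
            content = (attributes.get("content") or "").lower()
            if "charset=" in content:
                value = content.split("charset=", 1)[1]
                self.charset = value.split(";")[0].strip()


def declared_charset(html):
    """Return the encoding named in the page's meta tags, or None if absent."""
    finder = CharsetFinder()
    finder.feed(html)
    return finder.charset


print(declared_charset('<html><head><meta charset="UTF-8"></head></html>'))
```

A page for which this returns None is relying on the server, or on guesswork by the replay tool, to get its characters right.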

8)      Use archiving-friendly platform providers and content management systems.

While platform providers have incentives to permit commercial search indexers to access at least some of the content they host, they are not always so accommodating of archival crawlers. If the archivability of your website is important, examine the company’s robots.txt or inquire about their policies before committing to their platform. Also, even if a company doesn’t block archival crawlers outright, the website templates or content management systems they utilize may not archive well. Look at how other websites built on the same platform replay in Internet Archive’s Wayback Machine, and, if you’re using an open source content management system, be sure to review the configuration of any bundled robots.txt.

While adhering to these recommendations won’t guarantee a high-quality archival capture and subsequent flawless preservation of your website, not following them will ensure additional archiving and preservation challenges.

4 Comments

  1. Lynda Schmitz Fuhrig
    February 7, 2012 at 2:53 pm

    Nicholas,
    These are great tips that pair nicely with Robin Davis’ blog posting from the Smithsonian Institution Archives. I think more organizations are starting to realize the importance of web preservation and the role of standards, especially when it comes to their own creations.

    Cheers,
    Lynda Schmitz Fuhrig,
    Smithsonian Institution Archives

  2. Jim Brewer
    February 7, 2012 at 3:09 pm

    So often web sites lack any date reference, not even in the source. Archiving would be significantly helped if administrators used a template that showed the page edit/creation date. Generally this notion seems omitted in any discussion of best practices.

  3. Martha Anderson
    February 8, 2012 at 10:45 am

    Preservation starts at creation. These are good tips from an experienced team.

  4. Beehivews
    June 25, 2014 at 1:08 am

    All above points should be useful for all start up website designers.
