The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Repository Development Group at the Library of Congress.
As much as we can do to preserve archived websites once we have them, the challenges we encounter are largely determined by how those websites were originally constructed. To give us and others the best possible chance of preserving your online content, I wanted to follow up on an excellent blog post by Robin Davis (previously) of the Smithsonian Institution Archives on the topic of designing preservable websites. Here are some best practices to keep in mind:
1) Follow web standards and accessibility guidelines.
Following web standards and accessibility guidelines is useful for reasons beyond web archiving, namely, (universal) usability and SEO. It also facilitates better website archiving and replay. Because web crawlers, including the archival Heritrix crawler, access websites in a manner similar to a text browser, accessible websites are friendlier to web crawlers. Adherence to web standards makes for fewer cumulative idiosyncrasies that the Wayback Machine must accommodate over time in rendering archived websites.
2) Be careful with robots.txt exclusions.
3) Use a site map, transparent links, and contiguous navigation.
4) Maintain stable URIs and redirect when necessary.
The stability of the Library of Congress’ URI over time makes it possible to view website captures from 1997 to present in a single unbroken timeline in the Internet Archive Wayback Machine. It also means that any individual bookmarks saved or inbound links published and circulated continue to work as they always have. Link rot on the web generally is, by unfortunate contrast, altogether common.
When a URI changes and no redirect to the new resource location is put in place, the new URI is less likely to be archived, and captures made before the change will almost certainly be disassociated from those made after it. Because web archiving tools are sensitive to URI stability, URIs containing session IDs may be similarly disassociated from earlier captures of the same resource.
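To make the two points above concrete, here is a minimal Python sketch, not a description of how any particular server or archiving tool works. It assumes a hypothetical table of moved paths (`MOVED`) and a guessed list of session parameter names (`SESSION_PARAMS`); both are illustrative and would need to match your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical old-path -> new-path table for a reorganized site.
MOVED = {"/exhibits/old-page.html": "/exhibits/new-page/"}

# Assumed session parameter names; adjust to whatever your platform emits.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sessionid", "sid"}


def redirect_for(path):
    """Return (301, new_location) for a moved path, or None if it has not moved."""
    target = MOVED.get(path)
    return (301, target) if target else None


def strip_session_ids(uri):
    """Return the URI with session parameters removed from its query string."""
    parts = urlsplit(uri)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

A permanent (301) redirect tells both browsers and crawlers where the resource now lives, and publishing links without session parameters means every visit to a page resolves to the same URI.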
5) Consider using a Creative Commons license.
Pending changes to the U.S. copyright statute to address digital preservation needs, we must request permission from most website owners to re-display their crawled website outside of the Library of Congress, or even to crawl their website in the first place. The Library of Congress is but one of a number of web archiving institutions that must solicit permissions. A website published under a Creative Commons license provides an affirmative permission to be crawled and preserved.
6) Use sustainable data formats.
Though a webpage is presented as a unified experience, it consists of many different files and file types. A commitment to preserving that experience therefore implies a commitment to managing the potentially distinct preservation risks of all the component file types. When deciding what types of code and file formats to use in building a website, open standards and open file formats are generally the best choices for preservation. The exception is when the open format is either poorly documented or allows for vendor-specific extensions; such formats may well be worse than well-documented proprietary formats that are widely implemented in a uniform way. The Sustainability of Digital Formats website outlines a number of criteria that make for a truly “sustainable” format besides ostensible “openness.”
7) Embed metadata, especially the character encoding.
Since web servers don’t reliably report character encoding, it is important that pages do so. Use an HTML meta tag or the encoding attribute of the XML declaration to indicate what encoding should be used to render the page. Additional embedded metadata is useful both for SEO and for curated collections of web archives, such as those maintained by the Library of Congress, which draw upon site-provided metadata for access points.
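One way to check whether a page declares its own encoding is to look for the relevant meta tag. The following is a rough Python sketch using only the standard library; the helper names (`CharsetFinder`, `declared_charset`) are made up for illustration, and it only looks at HTML meta tags, not HTTP headers or XML declarations.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Record the first character encoding declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("charset"):
            # HTML5 style: <meta charset="utf-8">
            self.charset = attr["charset"].strip().lower()
        elif attr.get("http-equiv", "").lower() == "content-type":
            # Older style: <meta http-equiv="Content-Type" content="...; charset=...">
            content = attr.get("content", "").lower()
            if "charset=" in content:
                self.charset = content.split("charset=", 1)[1].split(";")[0].strip()


def declared_charset(html_text):
    """Return the declared encoding in lowercase, or None if the page declares none."""
    finder = CharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

A result of `None` suggests the page is relying on the server or on browser guesswork, which is exactly the situation worth fixing.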
8) Use archiving-friendly platform providers and content management systems.
While platform providers have incentives to permit commercial search indexers to access at least some of the content they host, they are not always so accommodating of archival crawlers. If the archivability of your website is important, examine the company’s robots.txt or inquire about their policies before committing to their platform. Also, even if a company doesn’t block archival crawlers outright, the website templates or content management systems they use may not archive well. Look at how other websites built on the same platform replay in the Internet Archive’s Wayback Machine, and, if you’re using an open source content management system, be sure to review the configuration of any bundled robots.txt.
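Reviewing a robots.txt by eye works, but it can also be scripted. Here is a small Python sketch using the standard library's `urllib.robotparser`; the sample rules and the user-agent tokens (`ia_archiver`, `Googlebot`) are assumptions chosen for illustration, so substitute the tokens of the crawlers you actually care about.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that shuts out one crawler entirely
# while only keeping everyone else out of /admin/.
SAMPLE_ROBOTS_TXT = "\n".join([
    "User-agent: ia_archiver",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /admin/",
])


def blocked_agents(robots_txt, agents, url):
    """Return the user agents from `agents` that may not fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(
        SAMPLE_ROBOTS_TXT,
        ["ia_archiver", "Googlebot"],
        "http://example.org/index.html",
    ))
```

Running the same check against a prospective platform's real robots.txt, pasted in as text, gives a quick sense of whether archival crawlers are treated differently from search indexers.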
While adhering to these recommendations won’t guarantee a high-quality archival capture and subsequent flawless preservation of your website, not following them will ensure additional archiving and preservation challenges.