A Vision of the Role and Future of Web Archives: Conclusions and the Role of Archives

The following is a guest post by Kalev H. Leetaru, University of Illinois, who presented these ideas at the 2012 General Assembly of the IIPC. This post is the third in a three-part series. Read the first and second posts or the full paper on netpreserve.org.

As web archives mature and expand, a growing question revolves around the role of these archives in society. What should their primary mission(s) be and how can they best fulfill those roles? At their most basic level, I believe web archives fulfill three primary roles: Preservation, Research, and Authentication, in that order.

• Preservation. First and foremost, web archives preserve the web. They act as the web equivalent of the archive or library, constantly monitoring for new content, requesting a copy of that content, and keeping a copy of it for posterity. In this role, their mission is to acquire and preserve the web for future generations, with access being primarily through basic browsing and retrieval. Some archives, for legal purposes, may not even be able to provide access to their holdings during the lifetime of the organizations providing them content, instead holding that material under embargo for a certain number of years, but ensuring its continued survival for future generations.

Library of Congress Main Reading Room, Carol M. Highsmith Archive.

• Research. A unique and emerging use of archives is as a research service for scholars. Very few academics, especially in the social sciences and humanities, have the computational expertise or resources to crawl and download large portions of the web for research. Commercial web crawling companies like Google do not provide their data for research, and thus web archives provide a fundamentally unique and enabling resource for the study of the web that scholars can turn to. Even more critically, many key humanities and social science questions revolve around how ideas and communication change over time, and web archives capture the only view of change on the web. In this role, the secondary mission of archives is to provide access to their holdings that goes beyond the basic browsing needed for casual use or deep scholarly reading of a small number of works, towards providing programmatic tools and access policies that support computational data mining of large portions of their holdings.

• Authentication. A final emerging use of archives is as an authentication service. Web data is highly mutable, changing constantly, and there is no way to authenticate whether the page I see today is the same as what I saw yesterday, especially if the change is a small one. It took more than five years for changes to White House press releases to be spotted via copies held in the Internet Archive, and even then the discovery was entirely by accident. Third party archives allow “authentication” of what a page looked like at a given moment. One could even imagine someday a browser plugin that, as a user browsed certain sites on the web (such as government pages, perhaps medical or other pages), would compare each page with the most recent copy stored by a network of web archives, and display an indicator to the user as to whether the page has changed since it was last archived, as well as highlight those changes. In this role, the third peripheral mission of the web archive is to act as a “disinterested third party” that can authenticate and verify the contents of a given web page at a given moment in time.

Wikipedia offers an intriguing vision of what the ultimate web archive might look like. Every edit to every page since the inception of the site has been archived and is available at a mouse click, allowing a visitor or scholar to trace the entire history of every word. Every operation taken on the site and the complete source code to every algorithm used for various automated processes are fully documented and make available, offering complete technical transparently. Finally, a dedicated bulk download page is maintained in which researchers may download a ZIP file containing the entirety of the site and every edit ever performed, which has made Wikipedia a mainstay of considerable social and computer science research.

As our digital world continues to grow at a breathtaking pace and more and more of our daily live occurs within its digital boundaries, we must ensure that web archives are there to preserve our collective global consciousness for future generations.

Add a Comment

Add a Comment Cancel reply

This blog is governed by the general rules of respectful civil discourse. You are fully responsible for everything that you post. The content of all comments is released into the public domain unless clearly stated otherwise. The Library of Congress does not control the content posted. Nevertheless, the Library of Congress may monitor any user-generated content as it chooses and reserves the right to remove content for any reason whatever, without consent. Gratuitous links to sites are viewed as spam and may result in removed comments. We further reserve the right, in our sole discretion, to remove a user's privilege to post content on the Library site. Read our Comment and Posting Policy.

Required fields are indicated with an * asterisk.

Name (no commercial URLs) *

Email (will not be published) *

Comment: