The following is a guest post by Jefferson Bailey, Fellow at the Library of Congress's Office of Strategic Initiatives.
In a previous post on The Signal, we examined some of the themes that emerged from the survey of organizations in the United States that are actively involved in, or planning to start, programs to archive content from the web. The survey was conducted by the Content Working Group of the National Digital Stewardship Alliance from October 3 through October 31, 2011.
The goal of the survey was to gain a better understanding of the landscape of web archiving activities in the United States. The survey garnered 77 unique responses to 28 questions about current web archiving activities.
The full report is now available here (PDF).
Rather than reiterate the content of the report, we want to pull out some interesting or relevant charts and statistics, expand on the summary themes, and highlight survey results not featured in the previous blog post.
Policies for Web Archiving
One emergent theme discussed in the previous post was the lack of consistency around incorporating web archiving into institutional policy. To provide some additional detail to that idea, the following chart shows how institutions with active, testing, or planned web archiving programs are incorporating web archiving into their collection or selection policies:
Web Archives: Acquisition & Access
Another survey question that elicited an interesting result concerned respecting robots.txt files. Robots.txt is a file placed on a web server that tells web-crawling robots which parts of a website, if any, they should not visit or harvest. (More information on robots.txt here.) How institutions handle robots.txt files has a direct impact on their ability to acquire web content. In the recent post, Legal Issues in Web Archiving, Abbie Grotke addressed in detail the challenges of robots.txt files.
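As a rough illustration of what "respecting robots.txt" means in practice, the sketch below uses Python's standard urllib.robotparser module to check whether a crawler may fetch a given URL. The robots.txt content, the crawler names ("example-archiver", "some-crawler") and the example.org URLs are all made up for this example; they do not come from the survey or from any surveyed institution's tools.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: example-archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("some-crawler", "https://example.org/index.html"),
        ("some-crawler", "https://example.org/private/page.html"),
        ("example-archiver", "https://example.org/index.html"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

A crawler that respects robots.txt would skip any URL for which this check returns False. A program that chooses to ignore the file for archival purposes would simply not apply the check, which is the policy choice the survey question asked about.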
Another topic examined in the report is the types of access that institutions are providing to web archives. Chart 3 provides some examples and percentages, though other means of access are listed in the full report.
Tools for Archiving the Web
The web archiving survey also sought to gain a better understanding of the specific tools being used both to collect content and to display archival collections. The Library of Congress does not endorse these tools, but merely provides this information as a resource for understanding what tools are being used by the web archiving community. Of the 63 respondents indicating their tools for harvesting web materials:
- 60% (38) were using an external service for acquisition
- 26% (16) were using an in-house method for acquisition
- 14% (9) were using both in-house and external services for acquisition
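For readers who want to check the arithmetic behind these figures, here is a minimal sketch that converts the three response counts into whole-number percentages. The post does not say how the percentages were rounded. The largest-remainder method used below is an assumption, chosen only because it keeps the total at 100, and the function name is hypothetical.

```python
def largest_remainder_percentages(counts):
    """Convert counts to whole-number percentages that sum to 100.

    Assumption: largest-remainder rounding. The source does not state
    which rounding method was actually used.
    """
    total = sum(counts)
    exact = [c * 100 / total for c in counts]
    rounded = [int(x) for x in exact]  # round every share down first
    shortfall = 100 - sum(rounded)     # points still to hand out
    # Give the leftover points to the shares with the largest fractional parts.
    by_remainder = sorted(
        range(len(counts)),
        key=lambda i: exact[i] - rounded[i],
        reverse=True,
    )
    for i in by_remainder[:shortfall]:
        rounded[i] += 1
    return rounded


if __name__ == "__main__":
    # Counts from the list above: external, in-house, both.
    counts = [38, 16, 9]
    print(largest_remainder_percentages(counts))      # [60, 26, 14]
    print([round(c * 100 / 63, 1) for c in counts])   # [60.3, 25.4, 14.3]
```

Under this assumption the output matches the published 60%, 26%, and 14%. Rounding each share independently would give 60%, 25%, and 14%, which totals 99.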
Chart 4 and Chart 5 document some of the services and tools being used to build archives of web content.
Beyond the charts and statistics offered here, the themes discussed in the previous blog post on the Web Archiving Survey merit repeating. The inconsistent custodianship of web archives and the policy and technical challenges of harvesting web content have not dampened the dramatic increase in the number of active web archiving programs. These issues also have not impeded overall efforts to preserve and provide access to a rich, diverse body of web-born content, as is seen in the large number of programs initiated within the last five years. Web archiving is increasingly a core function of collection development for many institutions and, as the survey documents, the web archiving community has demonstrated a keen interest in collaborative activities, knowledge sharing, and joint efforts on conducting research and determining best practices. Groups such as the NDSA and the IIPC continue to offer an open, cooperative space to support institutions working to archive the web.
Many presentations from the recent IIPC 2012 General Assembly can be found here.
Updated 7/6/12 for typos.