The following is a guest post by Jefferson Bailey, Fellow at the Library of Congress’s Office of Strategic Initiatives.
The NDSA Content Working Group, one of the five working groups of the National Digital Stewardship Alliance, focuses on identifying content already preserved, investigating guidelines for the selection of significant content, discovering at-risk digital content or collections, and matching orphan content with NDSA partners who will acquire the content, preserve it, and provide access to it. As part of this effort, the group conducted a survey of organizations in the United States that are actively involved in, or planning to start, programs to archive content from the web. Conducted from October 3 through October 31, 2011, the survey aimed to better understand the landscape of web archiving activities in the United States by identifying the organizations involved, the types of web content being preserved, the tools and policies being used, and the types of access being provided.
Preliminary results of the report presented here are being released in conjunction with the International Internet Preservation Consortium 2012 General Assembly taking place this week here at the Library of Congress. The full report will be made available soon.
The survey featured 28 questions and garnered 77 unique responses from a range of institutions, with survey participants primarily representing the cultural heritage (29%, 22 of 77), government (22%, 17 of 77), and university communities (46%, 36 of 77). Of the survey respondents, 31% (24 of 77) were members of the NDSA and 8% (6 of 77) were members of the IIPC.
Web Archiving Activity
The current web archiving activities of the survey respondents were as follows:
- 63% (49 of 77) have an active web archiving program.
- 16% (12 of 77) are actively testing a web archiving program.
- 17% (13 of 77) are planning on pursuing a web archiving program in the near future.
- 4% (3 of 77) formerly managed web archiving programs, but no longer do so.
Interestingly, of the 71 respondents that identified their web archiving goals, 80% (57 of 71) were archiving content “from other organizations or individuals for future research,” 69% (49 of 71) were preserving their own institutional web content, and 49% (35 of 71) were doing both.
In reviewing the full survey results, a number of themes emerged.
The recent emergence of web archiving, especially at academic institutions
One surprising result was the preponderance of universities that have initiated web archiving programs in the last 5 years. Of the 68 respondents that identified the specific year their web archiving began, nearly a third, 32% (22 of 68), began their programs within the last two years. That is exactly as many institutions (22, 32%) as began archiving web content in the 17 years between 1989 and 2006. The recent surge in web archiving within the last 5 years – 68% (46 of 68) of those surveyed – is primarily due to universities starting web archiving programs.
One discovery of the survey was the low percentage of respondents that have transferred their archived data from their external service to their institution. Only 18% (9 of 49) of respondents using an external service have transferred their data in-house, including only 2 of the 12 government respondents and only 4 of the 25 university respondents. The remaining 82% have not transferred data to their institution. Free-text comments for this question pointed to many concerns about transferring externally harvested data to in-house systems, including “duplicate costs,” a lack of infrastructure, confidentiality concerns, and cataloging and accessibility challenges.
Lack of policies and unclear guidance on permissions
Internal policy documentation appeared to be an area in need of continued improvement for many institutions. While some programs had incorporated web materials into existing policies and procedures, others had not, and some seemed unsure of their institution’s current policy status for web content.
The survey also brought to light an acute lack of clarity around seeking permission from content creators, both for harvesting and for providing access to collections. Chart 4 and Chart 6 show policies related to seeking permission from content creators to harvest content and provide access to archived web sites.
Collecting trends and collaborative potential
The types of content being acquired included websites, blogs, and social media:
- 78% (60 of 77) included or plan on including websites in their archive.
- 57% (44 of 77) included or plan on including blogs in their archive.
- 38% (29 of 77) included or plan on including social media in their archive.
A free-text survey question asked respondents to “briefly describe the scope of your web archive collections.” Broadly stated, these responses fell into one of three categories: institutional self-documentation, collection enhancement, and thematic. Chart 6 shows the survey responses when asked to choose from among a variety of specific subject topics.
The potential for collaboration was a notable aspect of the survey results. Though only 23% of organizations were currently collaborating on web archiving, 96% (64 of 67) answered either “yes” (34, 51%) or “maybe” (30, 45%) when asked if they were interested in participating in future collaborative collecting activities. As these numbers demonstrate, there is significant interest in joint web archiving, but little current action in this area.
While the survey sometimes exposed the continued challenges of preserving content that is created on the web, as well as the ongoing permission and management challenges of providing access, it also pointed to the growing importance of web archiving as a core function of collection development for many institutions. This, coupled with the openness towards collaboration, suggests that many of the challenges evident in the report will be addressed in due time by the combined efforts of the entire community. Events like the IIPC General Assembly and alliances like the NDSA are a key part of the knowledge-sharing and collaboration essential to organizations as they work to archive and preserve web collections.