Web Archiving Arrives: Results from the NDSA Web Archiving Survey

The following is a guest post by Jefferson Bailey, Fellow at the Library of Congress’s Office of Strategic Initiatives.

The NDSA Content Working Group, one of the five working groups of the National Digital Stewardship Alliance, focuses on identifying content already preserved, investigating guidelines for the selection of significant content, discovering at-risk digital content or collections, and matching orphan content with NDSA partners who will acquire the content, preserve it, and provide access to it. As part of this effort, the group conducted a survey of organizations in the United States that are actively involved in, or planning to start, programs to archive content from the web. The survey ran from October 3 through October 31, 2011. Its goal was to better understand the landscape of web archiving activities in the United States by identifying the organizations involved, the types of web content being preserved, the tools and policies being used, and the types of access being provided.

Preliminary results of the report presented here are being released in conjunction with the International Internet Preservation Consortium 2012 General Assembly taking place this week here at the Library of Congress. The full report will be made available soon.

The survey featured 28 questions and garnered 77 unique responses from a range of institutions, with survey participants primarily representing the cultural heritage (29%, 22 of 77), government (22%, 17 of 77), and university (46%, 36 of 77) communities. Of the survey respondents, 31% (24 of 77) were members of the NDSA and 8% (6 of 77) were members of the IIPC.

Web Archiving Activity

The current web archiving activities of the survey respondents were as follows:

  • 63% (49 of 77) have an active web archiving program.
  • 16% (12 of 77) are actively testing a web archiving program.
  • 17% (13 of 77) are planning on pursuing a web archiving program in the near future.
  • 4% (3 of 77) formerly managed web archiving programs, but no longer do so.
Chart 1: Status of current web archiving activities.

Interestingly, of the 71 respondents that identified their web archiving goals, 80% (57 of 71) were archiving content “from other organizations or individuals for future research,” 69% (49 of 71) were preserving their own institutional web content, and 49% (35 of 71) were doing both.
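As a quick arithmetic check on the figures above, here is a minimal Python sketch. It assumes the three counts all describe the same 71 respondents; the variable names and the `pct` helper are illustrative and not part of the survey itself. Under that assumption, the counts imply that every respondent who identified goals selected at least one of the two.

```python
# Counts quoted in the paragraph above (71 respondents identified goals).
total = 71
others = 57  # archiving content from other organizations or individuals
own = 49     # preserving their own institutional web content
both = 35    # doing both

def pct(count, base):
    """Percentage rounded to the nearest whole number (illustrative helper)."""
    return round(100 * count / base)

# Inclusion-exclusion: respondents pursuing at least one of the two goals.
at_least_one = others + own - both

# Respondents pursuing exactly one goal.
only_others = others - both
only_own = own - both

print(pct(others, total), pct(own, total), pct(both, total))
print(at_least_one, only_others, only_own)
```

Read this way, 22 respondents were archiving only external content and 14 only their own institutional content.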

In reviewing the full survey results, a number of themes emerged.

The recent emergence of web archiving, especially at academic institutions

One surprising result was the preponderance of universities that have initiated web archiving programs in the last 5 years. Of the 68 respondents that identified the specific year their web archiving began, nearly a third, 32% (22 of 68), began their programs within the last two years. That is exactly the same number of institutions (22, 32%) that began archiving web content in the 17 years between 1989 and 2006. The recent surge in web archiving, with 68% (46 of 68) of those respondents starting within the last 5 years, is primarily due to universities starting web archiving programs.

Chart 2: Year began archiving web content.

Inconsistent custodianship

One discovery of the survey was the low percentage of respondents that have transferred their archived data from their external service to their institution. Only 18% (9 of 49) of respondents using an external service have transferred their data in-house, including only 2 of the 12 government respondents and only 4 of the 25 university respondents. The remaining 82% of those using an external service have not transferred data to their institution. Free-text comments for this question pointed to many concerns about transferring externally harvested data to in-house systems, including “duplicate costs,” a lack of infrastructure, confidentiality concerns, and cataloging and accessibility challenges.

Chart 3: Rates of transferring web content in-house for those collecting content through an external service.

Lack of policies and unclear guidance on permissions

Internal policy documentation appeared to be an area still in need of improvement for many institutions. While some programs had incorporated web materials into existing policies and procedures, others had not, and some seemed unsure of their institution’s current policy status for web content.

The survey also brought to light an acute lack of clarity around seeking permission from content creators, both for harvesting and for providing access to collections. Chart 4 and Chart 5 show policies related to seeking permission from content creators to harvest content and to provide access to archived websites.

Chart 4: Policies towards seeking permission to crawl websites.

Collecting trends and collaborative potential

The types of content being acquired included websites, blogs, and social media:

  • 78% (60 of 77) included or plan on including websites in their archive
  • 57% (44 of 77) included or plan on including blogs in their archive
  • 38% (29 of 77) included or plan on including social media in their archive
Chart 5: Policies towards seeking permission to provide access to archived web content.

A free-text survey question asked respondents to “briefly describe the scope of your web archive collections.” Broadly stated, these responses fell into one of three categories: institutional self-documentation, collection enhancement, and thematic. Chart 6 shows the survey responses when respondents were asked to choose from among a variety of specific subject topics.

The potential for collaboration was a notable aspect of the survey results. Though only 23% of organizations were currently collaborating on web archiving, 96% (64 of 67) answered either “yes” (34, 51%) or “maybe” (30, 45%) when asked if they were interested in participating in future collaborative collecting activities. As these numbers demonstrate, there is significant interest in joint web archiving, but little current action in this area.

Chart 6: Subjects currently or planned to be represented in respondents’ web archives.

Chart 7: Current participation and interest in future participation in collaborative web archiving.

While the survey sometimes exposed the continued challenges of preserving content that is created on the web, as well as the ongoing permission and management challenges of providing access, it also pointed to the growing importance of web archiving as a core function of collection development for many institutions. This, coupled with the openness towards collaboration, suggests that many of the challenges evident in the report will be addressed in due time by the combined efforts of the entire community. Events like the IIPC General Assembly and alliances like the NDSA are a key part of the knowledge-sharing and collaboration essential to organizations as they work to archive and preserve web collections.

Comments (3)

  1. Because the Internet Archive crawls and preserves the web, how does this affect decisions of institutions to crawl and archive their own sites? Do they feel they are duplicating the work of the Internet Archive? How are institutions providing access to their preserved sites?

  2. Hi Matt,

    Thanks for your comment. Great question. For a lot of institutions, web archiving has replaced internal records management requirements or even informal existing practices to capture certain types of materials that were formerly created in print form. Many organizations would be hesitant to entrust such self-documentation to a third party. Internet Archive crawls are, to my knowledge, not comprehensive in that they do not capture every change to every page of every website, so an institution could not be assured IA would capture all their webpages.

    Providing access to web archives is tricky, as access is often dependent on the type of collection and any associated copyright, IP, and privacy issues. More details on access will be in the full report, but you can visit our web archive to see how LC provides access: http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html

    Thanks again for your questions!

    Jefferson

  3. I talked about the question of “why archive if Internet Archive is capturing the web” in a recent article about the LC’s archiving activities:

    “Often the question is asked, if the Internet Archive is crawling the web, why are others archiving at all? The answer is simple: No one institution can collect an archival replica of the whole web at the frequency and depth needed to reflect the true evolution of society, government, and culture online. A hybrid approach is needed to ensure that a representative sample of the web is preserved, including the complementary approaches of broad crawls such as the Internet Archive does, paired with deep, curated collections by theme or by site, tackled by other cultural heritage organizations.”

    (Full article is here: http://www.infotoday.com/cilmag/dec11/Grotke.shtml)

    In terms of organizations (such as university archives) archiving their own sites, many we’ve talked to prefer to archive their own sites more often than the Internet Archive might capture them, and need to archive things such as intranets and social media content that IA may not discover. Also, many organizations are interested in capturing and storing this content alongside their other institutional archival materials. Doing their own archiving provides much more control over the process for whatever goals the institution may have.

    -Abbie
