The following is a guest post by Jefferson Bailey, Fellow at the Library of Congress’s Office of Strategic Initiatives.
The NDSA Content Working Group, one of the five working groups of the National Digital Stewardship Alliance, focuses on identifying content already preserved, investigating guidelines for the selection of significant content, discovering at-risk digital content or collections, and matching orphan content with NDSA partners who will acquire the content, preserve it, and provide access to it. As part of this effort, the group conducted a survey of organizations in the United States that are actively involved in, or planning to start, programs to archive content from the web. The survey ran from October 3 through October 31, 2011. Its goal was to better understand the landscape of web archiving activities in the United States by identifying the organizations involved, the types of web content being preserved, the tools and policies being used, and the types of access being provided.
Preliminary results of the report presented here are being released in conjunction with the International Internet Preservation Consortium 2012 General Assembly taking place this week here at the Library of Congress. The full report will be made available soon.
The survey featured 28 questions and garnered 77 unique responses from a range of institutions, with survey participants primarily representing the cultural heritage (29%, 22 of 77), government (22%, 17 of 77), and university communities (46%, 36 of 77). Of the survey respondents, 31% (24 of 77) were members of the NDSA and 8% (6 of 77) were members of the IIPC.
Web Archiving Activity
The current web archiving activities of the survey respondents were as follows:
- 63% (49 of 77) have an active web archiving program.
- 16% (12 of 77) are actively testing a web archiving program.
- 17% (13 of 77) are planning on pursuing a web archiving program in the near future.
- 4% (3 of 77) formerly managed web archiving programs, but no longer do so.
Interestingly, of the 71 respondents that identified their web archiving goals, 80% (57 of 71) were archiving content “from other organizations or individuals for future research,” 69% (49 of 71) were preserving their own institutional web content, and 49% (35 of 71) were doing both.
In reviewing the full survey results, a number of themes emerged.
The recent emergence of web archiving, especially at academic institutions
One surprising result was the preponderance of universities that have initiated web archiving programs in the last 5 years. Of the 68 respondents that identified the specific year their web archiving began, nearly a third, 32% (22 of 68), began their programs within the last two years. That is exactly the same number of institutions (22, 32%) as began archiving web content in the 17 years between 1989 and 2006. The recent surge in web archiving, with 68% (46 of 68) of those surveyed starting within the last 5 years, is primarily due to universities starting web archiving programs.
Inconsistent custodianship
One discovery of the survey was the low percentage of respondents that have transferred their archived data from their external service to their institution. Only 18% (9 of 49) of respondents using an external service have transferred their data in-house, including only 2 of the 12 government respondents and only 4 of the 25 university respondents. The remaining 82% have not transferred data to their institution. Free-text comments for this question pointed to many concerns about transferring externally harvested data to in-house systems, including “duplicate costs,” a lack of infrastructure, confidentiality concerns, and cataloging and accessibility challenges.
Lack of policies and unclear guidance on permissions
Internal policy documentation appeared to be an area still in need of improvement for many institutions. While some programs had incorporated web materials into existing policies and procedures, others had not, and some seemed unsure of their institution’s current policy status for web content.
The survey also brought to light an acute lack of clarity around seeking permission from content creators, both for harvesting and for providing access to collections. Chart 4 and Chart 6 show policies related to seeking permission from content creators to harvest content and provide access to archived web sites.
Collecting trends and collaborative potential
The types of content being acquired included websites, blogs, and social media:
- 78% (60 of 77) included or plan on including websites in their archive
- 57% (44 of 77) included or plan on including blogs in their archive
- 38% (29 of 77) included or plan on including social media in their archive
A free-text survey question asked respondents to “briefly describe the scope of your web archive collections.” Broadly stated, these responses fell into one of three categories: institutional self-documentation, collection enhancement, and thematic. Chart 6 shows how respondents answered when asked to choose from among a variety of specific subject topics.
The potential for collaboration was a notable aspect of the survey results. Though only 23% of organizations were currently collaborating on web archiving, 96% (64 of 67) answered either “yes” (34, 51%) or “maybe” (30, 45%) when asked if they were interested in participating in future collaborative collecting activities. As these numbers demonstrate, there is significant interest in joint web archiving, but little current action in this area.
While the survey sometimes exposed the continued challenges of preserving content that is created on the web, as well as the ongoing permission and management challenges of providing access, it also pointed to the growing importance of web archiving as a core function of collection development for many institutions. This, coupled with the openness towards collaboration, suggests that many of the challenges evident in the report will be addressed in due time by the combined efforts of the entire community. Events like the IIPC General Assembly and alliances like the NDSA are a key part of the knowledge-sharing and collaboration essential to organizations as they work to archive and preserve web collections.
Comments (3)
Because the Internet Archive crawls and preserves the web, how does this affect decisions of institutions to crawl and archive their own sites? Do they feel they are duplicating the work of the Internet Archive? How are institutions providing access to their preserved sites?
Hi Matt,
Thanks for your comment. Great question. For a lot of institutions, web archiving has replaced internal records management requirements or even informal existing practices to capture certain types of materials that were formerly created in print form. Many organizations would be hesitant to entrust such self-documentation to a third party. Internet Archive crawls are, to my knowledge, not comprehensive in that they do not capture every change to every page of every website, so an institution could not be assured IA would capture all their webpages.
Providing access to web archives is tricky, as access is often dependent on the type of collection and any associated copyright, IP, and privacy issues. More details on access will be in the full report, but you can visit our web archive to see how LC provides access: http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
Thanks again for your questions!
Jefferson
I talked about the question of “why archive if Internet Archive is capturing the web” in a recent article about the LC’s archiving activities:
“Often the question is asked, if the Internet Archive is crawling the web, why are others archiving at all? The answer is simple: No one institution can collect an archival replica of the whole web at the frequency and depth needed to reflect the true evolution of society, government, and culture online. A hybrid approach is needed to ensure that a representative sample of the web is preserved, including the complementary approaches of broad crawls such as the Internet Archive does, paired with deep, curated collections by theme or by site, tackled by other cultural heritage organizations.”
(Full article is here: http://www.infotoday.com/cilmag/dec11/Grotke.shtml)
In terms of organizations (such as university archives) archiving their own sites, many we’ve talked to prefer to archive their own sites more often than the Internet Archive might capture them, and need to archive things such as intranets and social media content that IA may not discover. Also, many organizations are interested in capturing and storing this content alongside their other institutional archival materials. Doing their own archiving gives an institution much more control over the process, whatever its goals may be.
-Abbie