Harvesting and Preserving the Future Web: Content Capture Challenges

The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Repository Development Group.

Following our earlier summary of the recent International Internet Preservation Consortium General Assembly, I thought I’d share some of the insights from the workshop, “Harvesting and Preserving the Future Web.”

The workshop was divided into three topics:
1) capture, challenges in acquiring web content;
2) replay, challenges in recreating the user experience from the archived content; and
3) scale, challenges in doing both of these at web scale.

I’ll be talking about capture here, leaving replay and scale for a second post.

Capture
Kris Carpenter Negulescu from the Internet Archive cued up the session with an overview (PDF) of challenges to capturing web content. She noted that the web-as-a-collection-of-documents is rapidly becoming something much more akin to a programming environment, marked by desktop-like interactive applications, complex service and content mashups, social networking services, streaming media, and immersive worlds.

Kris also provided an overview of current and prospective strategies for tackling these challenges: making the traditional web crawler behave more like a browser; integrating diverse approaches into unified workflows; designing and coding new custom tools; recording screenshots to capture look-and-feel; recording video of user interactions; and depositing web content into archives.

Adam Miller, also from the Internet Archive, explained his use of PhantomJS, a headless browser, to identify JavaScript links and trigger AJAX content that might otherwise be opaque to Heritrix. Herbert van de Sompel from the Los Alamos National Laboratory Research Library followed, presenting (PDF) a non-traditional web archiving paradigm called transactional web archiving. Instead of dispatching a client crawler, a transactional archive-enabled web server would “archive itself” by mirroring its HTTP responses to an archive.
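To make the transactional idea concrete, here is a minimal sketch of a server that mirrors its own responses, written as Python WSGI middleware. It is only an illustration and not the implementation presented at the workshop: the class name, the `archive_sink` callable, the record fields, and the demo application are all assumptions made for the example, and it buffers each response body in memory for simplicity.

```python
import hashlib
from datetime import datetime, timezone


class TransactionalArchiveMiddleware:
    """Wrap a WSGI app and mirror each response it serves to an archive sink.

    `archive_sink` is a hypothetical callable that accepts one dict per
    response; a real deployment might write to a file or a remote archive.
    """

    def __init__(self, app, archive_sink):
        self.app = app
        self.archive_sink = archive_sink

    def __call__(self, environ, start_response):
        captured = {}

        def capturing_start_response(status, headers, exc_info=None):
            # Remember the status line and headers, then pass them through.
            captured["status"] = status
            captured["headers"] = list(headers)
            return start_response(status, headers, exc_info)

        result = self.app(environ, capturing_start_response)
        try:
            # Simplification: buffer the whole body instead of streaming it.
            body = b"".join(result)
        finally:
            if hasattr(result, "close"):
                result.close()

        # Mirror exactly what was served for this request to the archive.
        self.archive_sink({
            "uri": environ.get("PATH_INFO", "/"),
            "method": environ.get("REQUEST_METHOD", "GET"),
            "captured_at": datetime.now(timezone.utc).isoformat(),
            "status": captured.get("status"),
            "headers": captured.get("headers", []),
            "body": body,
            "sha256": hashlib.sha256(body).hexdigest(),
        })
        return [body]


def demo_app(environ, start_response):
    """Stand-in for the site being archived (hypothetical)."""
    body = b"<html><body>Hello, archive</body></html>"
    start_response("200 OK", [("Content-Type", "text/html"),
                              ("Content-Length", str(len(body)))])
    return [body]


def serve_once(app, path="/"):
    """Call a WSGI app directly, without a socket, and return (status, body)."""
    seen = {}

    def start_response(status, headers, exc_info=None):
        seen["status"] = status
        seen["headers"] = headers

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return seen["status"], body


if __name__ == "__main__":
    records = []
    archived_app = TransactionalArchiveMiddleware(demo_app, records.append)
    serve_once(archived_app, "/index.html")
    print(records[0]["uri"], records[0]["status"], records[0]["sha256"])
```

Because the capture happens on the server side, the archive records only what was actually delivered to clients, and no crawler has to discover or request the pages.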

Even if some of the technical challenges in capturing content can be overcome, there was consensus that increasing personalization and integration of third-party services were eroding the notion that any sort of canonical user experience could be archived. David Rosenthal from the LOCKSS Program expressed this sentiment most eloquently with the comment, “we may have to settle for capturing ‘a’ user experience rather than ‘the’ user experience.”

On the other hand, it could be said that what has been archived of the web so far hasn’t been as generic as supposed, given existing customization based on geography, user-agent, and cookies. Kris took this point further, arguing that we should be careful not to suppose that archives have ever been representative of a “universalized” social experience; some of the oldest preserved documents reflect only the upper classes of ancient Egyptian society.

My takeaway from the session was that the field needs not only innovative technical approaches to capturing content but also an evolution in the understanding of what it means to “archive the web.”

For another account of the session, please see also David Rosenthal’s write-up.
