Harvesting and Preserving the Future Web: Replay and Scale Challenges

The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Repository Development Group.

This is the second part of a two-post recap of the “Harvesting and Preserving the Future Web” workshop at the recent International Internet Preservation Consortium General Assembly.

The session was divided into three topics:

  1. Capture: challenges in acquiring web content;
  2. Replay: challenges in recreating the user experience from the archived content; and
  3. Scale: challenges in doing both of these at web scale.

Having covered the topic of capture previously, this post addresses replay and scale.

Replay

Replay from user photogestion on Flickr

Andy Jackson from the British Library’s UK Web Archive explained how he enhanced the Wayback Machine’s archival replay to allow for in-page video playback using FlowPlayer and re-enabled dynamic map services using OpenStreetMap. While inarguably providing a better user experience than the alternative of, respectively, video disaggregated to a separate archive and static Google Maps image tiles, the re-intermediation of technologies that were technically absent from the archive prompted questions about what it meant to “recreate the user experience.”

In the ensuing conversation, David Rosenthal from the LOCKSS Program questioned whether in the context of long-term preservation it would be useful to conceptualize the “user experience” as extending beyond the boundaries of the website itself; perhaps the Wayback Machine should be re-engineered to serve the archived website within an emulated contemporaneous browser, for example.

Responding to this suggestion, Bjarne Andersen from Netarchive.dk, the Danish national web archiving program, noted that SCAlable Preservation Environments (SCAPE) will be built to be sensitive to the differences in viewing a website in different generations of browsers; SCAPE will compare screenshots to determine whether the contemporary rendering remains faithful to the website’s historical appearance.
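To make the screenshot-comparison idea concrete, here is a minimal sketch of one way two renderings of the same page could be compared. It is not SCAPE's actual method, which the session did not describe in detail. The function names, the pixel tolerance, and the 1% threshold are all assumptions chosen for illustration, and screenshots are represented as plain nested lists of RGB tuples rather than real image files.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (red, green, blue), each 0-255
Image = List[List[Pixel]]      # rows of pixels; a stand-in for a decoded screenshot


def differing_fraction(a: Image, b: Image, tolerance: int = 0) -> float:
    """Return the fraction of pixels whose color differs by more than `tolerance`.

    Hypothetical helper: both screenshots must have identical dimensions.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("screenshots must have identical dimensions")

    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0

    # A pixel counts as different if any color channel is off by more than the tolerance.
    differing = sum(
        1
        for row_a, row_b in zip(a, b)
        for pixel_a, pixel_b in zip(row_a, row_b)
        if any(abs(ca - cb) > tolerance for ca, cb in zip(pixel_a, pixel_b))
    )
    return differing / total


def renders_faithfully(a: Image, b: Image,
                       max_fraction: float = 0.01, tolerance: int = 8) -> bool:
    """Assumed policy: treat the renderings as equivalent if at most 1% of pixels differ."""
    return differing_fraction(a, b, tolerance) <= max_fraction


if __name__ == "__main__":
    white, black = (255, 255, 255), (0, 0, 0)
    historical = [[white, white], [white, white]]
    contemporary = [[white, white], [white, black]]
    print(differing_fraction(historical, contemporary))
    print(renders_faithfully(historical, contemporary))
```

A per-pixel comparison like this is deliberately naive: a one-pixel layout shift would register as a large difference, so a production tool would more plausibly rely on a perceptual or structural similarity measure.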

Summarizing the session, my takeaway was that robust preservation of the user experience may require more than simply replaying whatever happens to be in the archive.

Scale

Scale-a-Week: 17 December 2011 from user puuikibeach on Flickr

Readers of The Signal will be familiar with some of our past discussions of the challenges of scale.

In the first panelist presentation, Gordon Mohr, lead architect of Heritrix, noted that the dilemma of scaling storage has been “solved” with the maturation of cloud services. The caveat was that funding has thus become the more fundamental scale limitation. Aaron Binns of the Internet Archive argued that the long-term financial sustainability of digital preservation often doesn’t receive as much attention as the infrastructure challenges, and the Blue Ribbon Task Force on Sustainable Digital Preservation and Access was commissioned for this reason.

Returning to the discussion of infrastructure-related scale challenges, Rob Sanderson from the Los Alamos National Laboratory Research Library cited the difficulty of keeping resources located at multiple network endpoints synchronized. The dominant protocol of the web, HTTP, is poorly suited to synchronizing resources that are either large (because failed transfers have to be re-initiated from the beginning) or that change rapidly (because requests can’t necessarily be submitted quickly enough to keep pace with the rate of change). To address these challenges, the Research Library is working on a new framework called ResourceSync.

On an impressive side note, Youssef El Dakar from the Bibliotheca Alexandrina noted that simply regenerating checksums for their 100 TB of web archives would take an entire year with their current infrastructure.
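For a sense of what that figure implies, the sketch below works out the sustained read rate that corresponds to checksumming 100 TB in one year, and shows a generic way to checksum data in fixed-size chunks. The function names, the choice of SHA-256, and the decimal definition of a terabyte (10^12 bytes) are assumptions for illustration; the presentation did not say which algorithm or tooling the Bibliotheca Alexandrina uses.

```python
import hashlib
import io
from typing import BinaryIO


def implied_throughput_mb_per_s(total_tb: float, days: float) -> float:
    """Sustained rate (in MB/s) needed to read `total_tb` terabytes in `days` days.

    Assumes decimal units: 1 TB = 10**12 bytes, 1 MB = 10**6 bytes.
    """
    total_bytes = total_tb * 10**12
    seconds = days * 24 * 60 * 60
    return total_bytes / seconds / 10**6


def sha256_stream(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Checksum a binary stream chunk by chunk so the whole file never sits in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # 100 TB spread over 365 days works out to roughly 3.2 MB/s of sustained reading.
    print(round(implied_throughput_mb_per_s(100, 365), 2))

    # An in-memory stream stands in for an archive file here.
    print(sha256_stream(io.BytesIO(b"example archive payload")))
```

A rate of a few megabytes per second is far below what a single modern disk can sustain, which suggests the bottleneck lies elsewhere in the infrastructure rather than in the checksum computation itself.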

Summarizing the session, my takeaway was that scale was the least tractable of the three sets of challenges; it’s easier to imagine that technical breakthroughs might make a significant difference for either capture or replay, but resource scarcity is a more fundamental problem.

3 Comments

  1. Carl Fleischhauer
    June 19, 2012 at 8:33 am

    From one who did not attend this session, thank you very much for this helpful summary! Interpreting your final paragraph, I make it this way: “of all the technical challenges, scale is the most formidable” and “underlying everything, however, is the problem of continuing cost.” Have I correctly caught your summation?

  2. Renee Marie Jones
    June 19, 2012 at 4:39 pm

    “On an impressive side note, Youssef El Dakar from the Bibliotheca Alexandrina noted that simply regenerating checksums for their 100 TB of web archives would take an entire year with their current infrastructure.”

    This is very hard to believe. I have a very slow, cheap 1TB raid box at home, yet it took well under a day to copy it to a new backup drive. Checksumming is hardly more complex than reading. I could easily checksum the contents of a 1TB raid in one day, then manually switch to a new raid each day, completing 100 TB of checksumming in only 100 days.

    If someone spent more than the 300 dollars I spent for my storage system, surely they could do even better. A whole year for 100 TB? I really doubt that. Unless … is it on floppies?

  3. Nicholas Taylor
    June 29, 2012 at 9:34 am

    @Carl Fleischhauer: Yes, I think that’s a good summation.

    @Renee Marie Jones: Youssef didn’t specify what kind of infrastructure constraints made for the long processing time, but it seems plausible if, for example, the data had to be staged off of tape before it could be checksummed. Another important consideration is that the storage and processor cycles needed to checksum that much data are shared resources used by other projects, as well.
