The MH17 Crash and Selective Web Archiving

The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.

Screenshot of 17 July 2014 15:57 UTC archive snapshot of deleted VKontakte Strelkov blog post regarding downed aircraft, on <a href="http://web.archive.org/web/20140717155720/https://vk.com/wall-57424472_7256">Internet Archive Wayback Machine</a>.


The Internet Archive Wayback Machine has been mentioned in several news articles within the last week (see here, here and here) for having archived a since-deleted blog post in which a Ukrainian separatist leader touted the shooting down of a military transport plane that may actually have been Malaysia Airlines Flight 17. At this early stage in the crash investigation, the significance of the ephemeral post is still unclear, but it could prove to be a pivotal piece of evidence.

An important dimension of the smaller web archiving story is that the blog post didn’t make it into the Wayback Machine by the serendipity of Internet Archive’s web-wide crawlers; an unknown but apparently well-informed individual identified it as important and explicitly designated it for archiving.

Internet Archive crawls the Web every few months, tends to seed those crawls from online directories or compiled lists of top websites that favor popular content, archives more broadly across websites than it does deeply on any given website, and embargoes archived content from public access for at least six months. These parameters make the Internet Archive Wayback Machine an incredible resource for the broadest possible swath of web history in one place, but they don’t dispose it toward ensuring the archiving and immediate re-presentation of a blog post with a three-hour lifespan on a blog that was largely unknown until recently.

Recognizing the value of selective web archiving for such cases, many memory organizations engage in more targeted collecting. Internet Archive itself facilitates this approach through its subscription Archive-It service, which makes web archiving approachable for curators and many organizations. A side benefit is that content archived through Archive-It propagates with minimal delay to the Internet Archive Wayback Machine’s more comprehensive index. Internet Archive also provides a function to save a specified resource into the Wayback Machine, where it immediately becomes available.
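
For readers who want to see what the save function and the resulting snapshot addresses look like in practice, here is a minimal Python sketch. It only builds and parses URLs and makes no network requests. The snapshot format is taken from the archived Strelkov post linked in the caption above; the `/save/` address pattern and the helper names `save_page_now_url` and `parse_snapshot_url` are assumptions made for illustration, not a description of Internet Archive's documented interface.

```python
from datetime import datetime, timezone
from typing import Tuple

WAYBACK_HOST = "web.archive.org"


def save_page_now_url(target_url: str) -> str:
    """Build the address that requests a capture of target_url.

    Assumption: Save Page Now accepts the target appended to /save/.
    """
    return f"https://{WAYBACK_HOST}/save/{target_url}"


def parse_snapshot_url(snapshot_url: str) -> Tuple[datetime, str]:
    """Split a Wayback snapshot URL into (capture time in UTC, original URL).

    Expects the form http://web.archive.org/web/<14-digit timestamp>/<original URL>.
    """
    # Everything after the first "/web/" is "<timestamp>/<original URL>".
    _, marker, rest = snapshot_url.partition("/web/")
    if not marker:
        raise ValueError("not a Wayback snapshot URL")
    timestamp, _, original = rest.partition("/")
    # The timestamp is YYYYMMDDhhmmss, interpreted as UTC.
    captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    return captured, original


if __name__ == "__main__":
    snapshot = (
        "http://web.archive.org/web/20140717155720/"
        "https://vk.com/wall-57424472_7256"
    )
    captured, original = parse_snapshot_url(snapshot)
    print(captured.isoformat(), original)
    print(save_page_now_url(original))
```

Parsing the snapshot address this way recovers the 15:57 UTC capture time quoted in the caption, which is how the timing of a capture can be compared against the lifespan of the original post.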

Considering the six-month access embargo, it’s safe to say that the provenance of everything that has so far been archived and re-presented in the Wayback Machine relating to the five-month-old Ukraine conflict is either the Archive-It collaborative Ukraine Conflict collection or the Wayback Machine Save Page Now function. In other words, all of the content preserved and made accessible to date, including the key blog post, reflects deliberate curatorial decisions on the part of individuals and institutions.

A curator at the Hoover Institution Library and Archives with a specific concern for the VKontakte Strelkov blog actually added it to the Archive-It collection with a twice-daily capture frequency at the beginning of July. Though the key blog post was ultimately recorded through the Save Page Now feature, what’s clear is that subject area experts play a vital role in focusing web archiving efforts and, in this case, facilitated the preservation of a vital document that would not otherwise have been archived.

At the same time, selective web archiving is limited in scope and can never fully anticipate what resources the future will have wanted us to save, underscoring the value of large-scale archiving across the Web. It’s a tragic incident but an instructive example of how selective web archiving complements broader web archiving efforts.

5 Comments

  1. Nicholas Taylor
    July 28, 2014 at 3:47 pm

    Michael Nelson pointed out in an offline conversation that this Internet Archive blog post implies that the six-month public access embargo is no longer in effect for the Wayback Machine. If that’s true, then there’s no hypothetical cache of content that the web-wide crawlers have captured that just hasn’t been made available yet, further underscoring the importance of selective web archiving.

  2. Phil
    July 28, 2014 at 4:17 pm

    Great post!

  3. Gary Price
    July 29, 2014 at 4:05 pm

    Greetings from Silver Spring. I am a librarian, regular visitor to LC, and the editor/writer of infoDOCKET, a blog from Library Journal.

    Quick note about your blog post re: Wayback that I also saw on Sci Am.

    1. While it’s true that Wayback crawls the web on its own, users can now submit any crawlable URL or PDF for immediate crawling, indexing, and access. This service became listed on the Wayback homepage last Fall. In fact, I was using it shortly before reading your blog post.

    Go to http://archive.org/web and look for the “save page now” box on the right side of the page.
    Btw, I also have a bookmarklet that makes using it even easier and faster.

    Although the Save Page Now box first appeared last Fall, I believe it was possible to add pages to Wayback manually before this was introduced.

    2. The six-month embargo from the official crawl is still what’s documented, but in reality the embargo is much shorter.

    This shorter embargo was mentioned by Brewster Kahle in a blog post I noted on infoDOCKET on January 9, 2013.

    http://www.infodocket.com/2013/01/10/nice-milestones-wayback-machine-now-home-to-240000000000-urls-and-with-improved-currency/

    He wrote:

    “Now we cover from late 1996 to December 9, 2012 so you can surf the web as it was up until a month ago.”

    Btw, I spot-checked a few sites and found this one-month delay was accurate for some sites. In other cases, much less.

    For example, the LC homepage is as current as today.
    The Signal was last crawled on July 1.

    DISA.mil last crawled yesterday.

    NATO July 25.

    cheers,
    gary

  4. Jason Scott
    August 1, 2014 at 12:44 pm

    For the last couple of years, Archive Team has been collaborating with the Internet Archive to grab additional, specific websites and URLs that pop up in the news or social media, as well as “it would be a good idea” URLs that represent true loss if the site were to ever go away.

    This program, called “Archive Bot”, allows dozens of people to tell the Bot about sites and URLs to grab. On average, Archive Bot grabs tens of thousands of URLs every day, with a size in the range of 100 to 300 gigabytes a day.

    If you visit archivebot.com, you can see our dashboard in effect, as we grab various URLs extremely quickly. Rest assured, if a site makes the news, or an individual dies, or a tweet or URL contains something controversial, the Archive Bot is probably swooping in, and the results are on archive.org’s Wayback machine within 24-48 hours.

    Next problem, please.

  5. Ivan
    August 1, 2014 at 3:12 pm

    ArchiveBot has moved on from grabbing tens of thousands… we’re at four million URLs per day of selectively targeted crawls.
