The MH17 Crash and Selective Web Archiving

The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.

Screenshot of 17 July 2014 15:57 UTC archive snapshot of deleted VKontakte Strelkov blog post regarding downed aircraft, on <a href="http://web.archive.org/web/20140717155720/https://vk.com/wall-57424472_7256">Internet Archive Wayback Machine</a>.


The Internet Archive Wayback Machine has been mentioned in several news articles within the last week for having archived a since-deleted blog post from a Ukrainian separatist leader boasting of having shot down a military transport plane, which may actually have been Malaysia Airlines Flight 17. At this early stage in the crash investigation, the significance of the ephemeral post is still unclear, but it could prove to be a pivotal piece of evidence.

An important dimension of the smaller web archiving story is that the blog post didn’t make it into the Wayback Machine by the serendipity of Internet Archive’s web-wide crawlers; an unknown but apparently well-informed individual identified it as important and explicitly designated it for archiving.

Internet Archive crawls the Web every few months, tends to seed those crawls from online directories or compiled lists of top websites that favor popular content, archives more broadly across websites than it does deeply on any given website, and embargoes archived content from public access for at least six months. These parameters make the Internet Archive Wayback Machine an incredible resource for the broadest possible swath of web history in one place, but they don’t dispose it toward ensuring the archiving and immediate re-presentation of a blog post with a three-hour lifespan on a blog that was largely unknown until recently.

Recognizing the value of selective web archiving for such cases, many memory organizations engage in more targeted collecting. Internet Archive itself facilitates this approach through its subscription Archive-It service, which makes web archiving approachable for curators and many organizations. A side benefit is that content archived through Archive-It propagates with minimal delay to the Internet Archive Wayback Machine’s more comprehensive index. Internet Archive also provides a function to save a specified resource into the Wayback Machine, where it immediately becomes available.
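As a small illustration of how such a saved snapshot can be identified, the snapshot link in the caption above embeds a 14-digit capture time followed by the original address. The sketch below pulls those two parts back out of a snapshot link. The function name `parse_wayback_url` is hypothetical, and the `/web/<timestamp>/<original URL>` layout is assumed from the link cited in this post rather than taken from any official specification.

```python
from datetime import datetime, timezone


def parse_wayback_url(snapshot_url):
    """Split a Wayback Machine snapshot URL into (capture time, original URL).

    Hypothetical helper; assumes the layout .../web/<YYYYMMDDhhmmss>/<original URL>.
    """
    marker = "/web/"
    if marker not in snapshot_url:
        raise ValueError("not a Wayback Machine snapshot URL")
    # Everything after the first "/web/" is "<timestamp>/<original URL>".
    rest = snapshot_url.split(marker, 1)[1]
    timestamp, _, original = rest.partition("/")
    # Only the first 14 characters are treated as the capture time (assumed UTC).
    captured = datetime.strptime(timestamp[:14], "%Y%m%d%H%M%S")
    return captured.replace(tzinfo=timezone.utc), original


captured, original = parse_wayback_url(
    "http://web.archive.org/web/20140717155720/https://vk.com/wall-57424472_7256"
)
print(captured.isoformat(), original)
```

For the snapshot cited here, the capture time comes out as 17 July 2014, 15:57:20 UTC, matching the screenshot caption, and the original address is the VKontakte post.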

Considering the six-month access embargo, it’s safe to say that the provenance of everything that has so far been archived and re-presented in the Wayback Machine relating to the five-month-old Ukraine conflict is either the Archive-It collaborative Ukraine Conflict collection or the Wayback Machine Save Page Now function. In other words, all of the content preserved and made accessible to date, including the key blog post, reflects deliberate curatorial decisions on the part of individuals and institutions.

A curator at the Hoover Institution Library and Archives with a specific concern for the VKontakte Strelkov blog actually added it to the Archive-It collection with a twice-daily capture frequency at the beginning of July. Though the key blog post was ultimately recorded through the Save Page Now feature, what’s clear is that subject area experts play a vital role in focusing web archiving efforts and, in this case, facilitated the preservation of a vital document that would not otherwise have been archived.

At the same time, selective web archiving is limited in scope and can never fully anticipate what resources the future will have wanted us to save, underscoring the value of large-scale archiving across the Web. It’s a tragic incident but an instructive example of how selective web archiving complements broader web archiving efforts.

Scoring, Not Storing: Digital Preservation Assessment Criteria at #digpres14

The following is a guest post by Seth Anderson, consultant at AVPreserve.  This is part of an ongoing series of posts to highlight and preview the Digital Preservation 2014 program.  Here Seth previews the session he organized, “Digital Preservation Audit and Planning with ISO 16363 and NDSA Levels of Preservation,” scheduled for Wednesday, July 23 […]

Extending the Life of a Story Through Taxonomy at National Public Radio

Hannah Sommers has done just about every job one can do in a library.  Today she serves as NPR’s first Library Program Manager, helping forge a new path for the profession in her role directing product development for the NPR Library. This is her guest post. NPR’s mission is to create a more informed public, […]

Tag and Release: Acquiring & Making Available Infinitely Reproducible Digital Objects

What does it mean to acquire something, like a set of animated .gifs, that is already widely available on the web? Archives and museums are often focused on acquiring, preserving and making accessible rare or unique documents, records, objects and artifacts. While someone might take a photo of an object, or reproduce it in any […]

Preserving Digital and Software-Based Artworks: Recap of a NDSA Discussion

In response to a suggestion from our active membership, the NDSA Standards and Practices Working Group recently hosted a discussion about preserving digital and software-based artworks. Interestingly, the suggestion for this topic came not from a museum staffer but from Winston Atkins, Preservation Officer at Duke University Libraries. Complex materials like digital artworks and […]

All that Big Data Is Not Going to Manage Itself: Part Two

Yesterday’s blog post described some of the federal government initiatives that have driven data management requirements over the past ten years or so. “Data management” is a hot job area right now, and if you tilt the digital stewardship universe a certain direction, almost everything we do falls under the rubric of “data management.” Data […]

When Literature Professors’ Bots Read Collections of ROMS: An interview with Zach Whalen

How are researchers and scholars going to make use of born-digital primary sources? It’s an open question that many working in digital preservation are interested in. As part of the NDSA Innovation Working Group’s ongoing Insights interview series, I am excited to talk with Zach Whalen, an English professor at the University of Mary Washington, […]

The Meaning of the MP3 Format: An Interview with Jonathan Sterne

What does the history of the MP3 format mean for those interested in ensuring long-term access to our digital cultural heritage? In this installment of the NDSA’s Insights interview series I talk with historian Jonathan Sterne about his book MP3: The Meaning of a Format. You can read the introduction to his book, titled “Format […]

Protect Your Data: Information Security and the Boundaries of your Storage System

The following is a guest post from Jane Mandelbaum, co-chair of the National Digital Stewardship Alliance Innovation Working Group and IT Project Manager at the Library of Congress. The NDSA Levels of Digital Preservation are useful in providing a high-level, at-a-glance overview of tiered guidance for planning for digital preservation. One of the most common requests received […]

Exploring Computational Categorization of Records: A Conversation with Meg Phillips from NARA

Continuing the Insights interview series, I’m excited to share this conversation with Meg Phillips, External Affairs Liaison at the National Archives and Records Administration. A few years back we “un-chaired” CURATEcamp Processing: Processing Data/Processing Collections together. Meg wrote a guest post reflecting on that event for the Signal titled More Product, Less Process for Born-Digital […]