The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.
The Internet Archive Wayback Machine has been mentioned in several news articles within the last week for having archived a since-deleted blog post in which a Ukrainian separatist leader touted the downing of a military transport plane that may actually have been Malaysia Airlines Flight 17. At this early stage in the crash investigation, the significance of the ephemeral post is still unclear, but it could prove to be a pivotal piece of evidence.
An important dimension of the smaller web archiving story is that the blog post didn’t make it into the Wayback Machine by the serendipity of Internet Archive’s web-wide crawlers; an unknown but apparently well-informed individual identified it as important and explicitly designated it for archiving.
Internet Archive crawls the Web every few months, tends to seed those crawls from online directories or compiled lists of top websites that favor popular content, archives more broadly across websites than it does deeply on any given website, and embargoes archived content from public access for at least six months. These parameters make the Internet Archive Wayback Machine an incredible resource for the broadest possible swath of web history in one place, but they don’t dispose it toward ensuring the archiving and immediate re-presentation of a blog post with a three-hour lifespan on a blog that was largely unknown until recently.
Recognizing the value of selective web archiving for such cases, many memory organizations engage in more targeted collecting. Internet Archive itself facilitates this approach through its subscription Archive-It service, which makes web archiving approachable for curators at many organizations. A side benefit is that content archived through Archive-It propagates with minimal delay to the Internet Archive Wayback Machine’s more comprehensive index. Internet Archive also provides a Save Page Now function that saves a specified resource into the Wayback Machine, where it immediately becomes available.
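For illustration, the same on-demand capture can be requested programmatically. The following is a minimal sketch, assuming the https://web.archive.org/save/<url> endpoint that the homepage’s Save Page Now box submits to; the service’s exact response behavior may change over time, so treat this as illustrative rather than a supported client:

```python
# Minimal sketch: request an on-demand Wayback Machine capture via
# Save Page Now. Assumes the https://web.archive.org/save/<url>
# endpoint used by the homepage box; behavior may change over time.
from urllib.request import Request, urlopen

def save_page_now(url: str) -> str:
    """Ask the Wayback Machine to capture `url`; return the snapshot URL."""
    req = Request("https://web.archive.org/save/" + url,
                  headers={"User-Agent": "save-page-now-sketch"})
    with urlopen(req) as resp:
        # The service redirects to the newly created snapshot,
        # so the final URL points at the fresh capture.
        return resp.geturl()

print(save_page_now("http://example.com/"))
```

The returned URL is the freshly archived copy, which is why a three-hour-lifespan blog post can be preserved the moment someone recognizes its importance.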
Considering the six-month access embargo, it’s safe to say that the provenance of everything that has so far been archived and re-presented in the Wayback Machine relating to the five-month-old Ukraine conflict is either the Archive-It collaborative Ukraine Conflict collection or the Wayback Machine Save Page Now function. In other words, all of the content preserved and made accessible to date, including the key blog post, reflects deliberate curatorial decisions on the part of individuals and institutions.
A curator at the Hoover Institution Library and Archives with a specific concern for the VKontakte Strelkov blog actually added it to the Archive-It collection with a twice-daily capture frequency at the beginning of July. Though the key blog post was ultimately recorded through the Save Page Now feature, what’s clear is that subject area experts play a vital role in focusing web archiving efforts and, in this case, facilitated the preservation of a vital document that would not otherwise have been archived.
At the same time, selective web archiving is limited in scope and can never fully anticipate what resources the future will have wanted us to save, underscoring the value of large-scale archiving across the Web. It’s a tragic incident but an instructive example of how selective web archiving complements broader web archiving efforts.
Comments (7)
Michael Nelson pointed out in an offline conversation that this Internet Archive blog post implies that the six-month public access embargo is no longer in effect for the Wayback Machine. If that’s true, then there’s no hypothetical cache of content that the web-wide crawlers have captured that just hasn’t been made available yet, further underscoring the importance of selective web archiving.
Great post!
Greetings from Silver Spring. I am a librarian, regular visitor to LC, and the editor/writer of infoDOCKET, a blog from Library Journal.
Quick note about your blog post re: Wayback that I also saw on Sci Am.
1. While it’s true that the Wayback Machine crawls the web on its own, users can now submit any crawlable URL or PDF for immediate crawling, indexing, and access. This service has been listed on the Wayback homepage since last fall. In fact, I was using it shortly before reading your blog post.
Go to http://archive.org/web and look for the “save page now” box on the right side of the page.
Btw, I also have a bookmarklet that makes using it even easier and faster.
Although the Save Page Now box first appeared last fall, I believe it was possible to add pages to the Wayback Machine manually before this was introduced.
2. The six-month embargo from the official crawl is still what’s documented, but in reality the embargo is much shorter.
This shorter embargo was mentioned by Brewster Kahle in a blog post I noted on infoDOCKET on January 9, 2013.
http://www.infodocket.com/2013/01/10/nice-milestones-wayback-machine-now-home-to-240000000000-urls-and-with-improved-currency/
He wrote:
“Now we cover from late 1996 to December 9, 2012 so you can surf the web as it was up until a month ago.”
Btw, I spot checked a few sites and found this month-long delay was accurate for some sites. In other cases, it was much less.
For example, the LC homepage is as current as today.
The Signal was last crawled on July 1, DISA.mil yesterday, and NATO on July 25.
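Btw, spot checks like these can also be scripted. Here’s a minimal sketch using the Wayback Machine availability API (https://archive.org/wayback/available), which returns the most recent capture of a URL by default; the endpoint is public, but consider the sketch illustrative rather than a supported client:

```python
# Minimal sketch: ask the Wayback Machine availability API for the
# newest capture of each site. Endpoint and JSON shape per the public
# API; illustrative only.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def latest_capture(url: str):
    """Return (timestamp, snapshot URL) of the newest capture, or None."""
    query = urlencode({"url": url})
    with urlopen("https://archive.org/wayback/available?" + query) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return (closest["timestamp"], closest["url"]) if closest else None

for site in ("loc.gov", "disa.mil", "nato.int"):
    print(site, latest_capture(site))
```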
cheers,
gary
For the last couple of years, Archive Team has been collaborating with the Internet Archive to grab additional, specific websites and URLs that pop up in the news or social media, as well as “it would be a good idea” URLs that would represent a true loss if the site ever went away.
This program, called ArchiveBot, allows dozens of people to tell the bot about sites and URLs to grab. On average, ArchiveBot grabs tens of thousands of URLs every day, with a total size in the range of 100 to 300 gigabytes a day.
If you visit archivebot.com, you can see our dashboard in action as we grab various URLs extremely quickly. Rest assured, if a site makes the news, or an individual dies, or a tweet or URL contains something controversial, ArchiveBot is probably swooping in, and the results are on archive.org’s Wayback Machine within 24-48 hours.
Next problem, please.
ArchiveBot has moved on from grabbing tens of thousands of URLs a day… we’re now at four million URLs per day of selectively targeted crawls.
What is the current purpose of LOC setting a one-year embargo on archived sites? Will the Wayback Machine’s policy of no embargo ever be adopted by LOC?
Hi Jack, thanks for the question. The Library of Congress imposes a one-year embargo on archived sites in order to avoid competition with active websites. It does not anticipate changing this policy. You can learn more about our program here: https://www.loc.gov/programs/web-archiving/about-this-program/ and visit the web archives at https://www.loc.gov/websites/
-Abbie Grotke, Web Archiving Team Lead