The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.
In a blog post about six months ago I wrote about how the Library of Congress web archiving program was starting to harvest “general” internet news sites such as Daily Kos, Huffington Post and Townhall.com, as well as newer sites such as news.vice.com and verge.com.
Many of these sites are extremely large. How large? One rough measure, far from exact, is the “site” limiter in Google, which returns a count of the digital objects Google has found and indexed. That number is far larger than the number of web pages, but it gives some sense of scale relative to other sites. A “site:huffingtonpost.com” search in Google returns “about 3,470,000 results.” That is large.
When harvesting sites like these in a “traditional” way, the harvest starts at the home page and follows links found on pages that are in scope, capturing each page and its digital bits and pieces (such as JPEG images) so that playback software can recreate the page accurately later. For example, with HuffingtonPost.com that would mean following links to pages on huffingtonpost.com and not links out to other sites.
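To make the idea of “in scope” concrete, here is a minimal sketch in Python of the kind of test a crawler might apply to each link it finds. It is an illustration only, not the Library's actual crawler configuration; the function names `in_scope` and `links_to_follow` are hypothetical, and the sketch assumes that scope simply means "the same domain or one of its subdomains."

```python
from urllib.parse import urljoin, urlparse


def in_scope(url, scope_domain="huffingtonpost.com"):
    """Return True if the URL's host is the scope domain or a subdomain of it.

    Assumption: scope is decided by host name alone. Real crawlers usually
    support richer rules (path prefixes, exclusions, embedded resources).
    """
    host = (urlparse(url).hostname or "").lower()
    return host == scope_domain or host.endswith("." + scope_domain)


def links_to_follow(page_url, hrefs, scope_domain="huffingtonpost.com"):
    """Resolve the links found on a page and keep only the in-scope ones."""
    absolute = (urljoin(page_url, href) for href in hrefs)
    return [url for url in absolute if in_scope(url, scope_domain)]


if __name__ == "__main__":
    found = ["/politics", "https://example.org/elsewhere"]
    # Only the link that stays on the site is kept for crawling.
    print(links_to_follow("https://www.huffingtonpost.com/", found))
```

A link to another site is dropped, so the crawl winds through the one site only. That keeps the harvest focused, but it does nothing to limit how many pages that one site contains.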
Unfortunately such sites are so large that the harvesting process runs out of time. Even though we were harvesting (to stay with this example) HuffingtonPost.com on a weekly basis, capturing whatever we could manage to get each time, there was no assurance that over time we would capture all the content published to the site even once, since with each harvest the process of winding through the site started over again but then followed a different path.
As we reviewed the early results of harvesting very-large-news-sites, I began to think about the different use cases that have been connected with web archiving by the International Internet Preservation Consortium.
What is the use case served by trying to capture the entire site in one visit (leaving aside completeness)? Presumably it is that the archive ends up with multiple copies of the same news item over time, perhaps showing that some changes were made for one reason or another. This is the accountability use case:
Crawling websites over time allows for modifications to content to be observed and analyzed. This type of access can be useful in ensuring accountability and visibility for web content that no longer exists. On one hand, companies may archive their web content as part of records management practices or as a defense against legal action; on the other, public web archives can show changes in governmental, organizational, or individual policy or practices.
This would be a fine use case for our efforts to support, but if we aren’t able to reliably harvest all the content at least once, it seems less than helpful for users of the archived site to have multiple copies of some news items and none of others (on a completely random basis, from the user perspective).
As it turns out, the IIPC has a different use case for “news in the 21st century”:
Blogs, Tweets, and status updates on social media are just as likely sources for news today as traditional newspapers and broadcasts. Traditional media have also adapted to deliver news online. Libraries and archives have always collected newspapers, these are the core collections of many local historical societies. If the news that is distributed online is not preserved there will be a huge hole in our collective memory.
For this use case, completeness of harvest, getting at least one copy of every news story published on the site, is a more useful outcome over time than having multiple copies harvested over a period of time of some of the news stories.
And there is another use case that will be better served by completeness – text mining. The Library of Congress does not now support any type of text mining tools to interact with its web archives, but when it does (someday), completeness of capture of all that was published will be more important than multiple copies of mostly the same thing. But how do we achieve this so-called completeness, if not by attempting regular top-to-bottom harvesting?
Borrowing from an approach used by Icelandic colleagues, we have tried to achieve a more complete harvest over time by making use of RSS feeds provided by the news sites. Over the course of a week, RSS entries for articles published by “target” news sites (such as HuffingtonPost.com) are aggregated into one-time-use “seed lists,” and the crawler can then go to the news site and harvest just those seeds, reliably. Although building custom seed lists week after week takes a certain extra effort, over time (by which I mean years) this approach should assure completeness of capture. We should get at least one capture of every new item published moving forward in time. This is good.
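For readers curious about the mechanics, the aggregation step described above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the tooling the Library actually uses: it assumes plain RSS 2.0 feeds whose `<item>` elements carry a `<link>`, that copies of the feed have been saved at several points during the week, and the function names and the `news.example.com` URLs are made up for the example.

```python
import xml.etree.ElementTree as ET


def article_links(rss_xml):
    """Extract the article URLs from one saved copy of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


def build_seed_list(feed_snapshots):
    """Merge a week's worth of feed copies into one de-duplicated seed list.

    Order of first appearance is preserved, so each article URL is
    handed to the crawler exactly once.
    """
    seen = set()
    seeds = []
    for rss_xml in feed_snapshots:
        for url in article_links(rss_xml):
            if url not in seen:
                seen.add(url)
                seeds.append(url)
    return seeds


# Two hypothetical copies of the same feed, taken on different days.
MONDAY = """<rss version="2.0"><channel><title>Example News</title>
<item><title>A</title><link>https://news.example.com/a</link></item>
<item><title>B</title><link>https://news.example.com/b</link></item>
</channel></rss>"""

FRIDAY = """<rss version="2.0"><channel><title>Example News</title>
<item><title>B</title><link>https://news.example.com/b</link></item>
<item><title>C</title><link>https://news.example.com/c</link></item>
</channel></rss>"""

if __name__ == "__main__":
    for seed in build_seed_list([MONDAY, FRIDAY]):
        print(seed)
```

Because feeds typically show only the most recent items, a story can scroll off before the week ends; collecting several copies and merging them is what keeps the weekly seed list complete. The story that appears in both copies ends up in the list only once.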
We will also make occasional attempts (maybe twice a year) to harvest a whole site thoroughly, both to fill gaps in coverage and to provide multiple captures of some content.
What will this mean for future users of these archived news sites? As we begin to make these sites that depend on harvesting-by-RSS-seed-URL available, users may find more broken link problems when browsing (although maybe not – it simply isn’t clear). While our present interface is an archive that reproduces the browse-the-site experience and it can be useful for users to have the “time machine” experience of browsing older versions of a site, this is not the only significant use case. If a user has a particular URL and wants to see it in the archive, browsing is not necessary. We still need to add textual search to support our users, but that would also offer a path around broken links. And over time (again, years) the completeness of coverage on an ongoing basis will build more reliability.
That is what a national library is supposed to be about – building significant collections over time, steadily and reliably. And this is where we want web archiving to be.