If We Capture, Will They Come? Researcher Uses for Web Archive Collections

This is a guest post by Emily Reynolds, a former Library of Congress Junior Fellow and recent Alternative Spring Break intern.

Last week, as part of the University of Michigan School of Information’s Alternative Spring Break program, I worked on a project to develop web archiving use cases for the International Internet Preservation Consortium. The use cases are intended for a general audience hoping to learn more about what purposes web archives serve, and how they can be used in a variety of disciplines. I spent the week browsing scholarly research and other secondary uses of web archive collections, looking for general trends in how people are actually using web archives.

Visualization of linking between websites of different languages, Babel 2012 project

Researcher use of web archive collections falls into two broad categories: machine access, for text mining, visualization, and other “big data” modes of analysis; and human access, for viewing particular sites that may have changed or disappeared. Early literature on web archives focused on the latter. We all know how satisfying it can be to find a lost site in the Wayback Machine, and how important it is to preserve ephemeral web content.

But researchers are increasingly taking advantage of the massive data sets held in web archives to create visualizations and perform large-scale data analysis to answer questions about language, technology, and history, among other topics.

Contests for research using the Common Crawl data set, which contains roughly 6 billion crawled pages, demonstrate the potential of the huge amounts of data held in web archives. The Norvig Web Data Science Award and the Common Crawl Code Contest both encouraged researchers to visualize and explore the data set. The results show the diversity of data available in web archive collections, as well as the many purposes to which it can be put:

Babel 2012 Web Language Connections visualized links between websites in different languages.
App Rankings extracted links and references to mobile apps and ranked them based on frequency of mentions.
The Web Data Commons project extracted structured data formats from Common Crawl sites to determine which were most widely used.
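
To give a feel for the kind of analysis behind a project like App Rankings, here is a minimal sketch of ranking names by how many pages mention them. It is not the contest entry's actual method: the function name `rank_mentions`, the idea of passing pages as plain strings, and the sample data are all assumptions made for illustration.

```python
import re
from collections import Counter


def rank_mentions(pages, names):
    """Rank names by the number of pages that mention them.

    `pages` is an iterable of page texts and `names` a list of names to
    look for. Both are hypothetical inputs, not a real archive format.
    """
    counts = Counter()
    for text in pages:
        lowered = text.lower()
        for name in names:
            # Whole-word, case-insensitive match; count each page once per name.
            pattern = r"\b" + re.escape(name.lower()) + r"\b"
            if re.search(pattern, lowered):
                counts[name] += 1
    # most_common() sorts from most-mentioned to least-mentioned.
    return counts.most_common()


# Made-up sample pages, for illustration only.
sample_pages = [
    "I use Maps and Mail every day",
    "Mail is the app I open first",
    "A page with no app names at all",
]
ranking = rank_mentions(sample_pages, ["Maps", "Mail", "Chat"])
print(ranking)
```

A real crawl would also need to parse the archived records and strip markup before counting, which this sketch leaves out.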

Some uses of web archives are more unconventional, like The Deleted City, a visualization of the Geocities archive, or the State Library of Queensland’s use of archived webpages in a museum exhibit.

HTML version usage over time, from the UK Web Archive

As use of web archive collections expands beyond simply viewing specific URLs, some collecting organizations are developing their suites of access tools to facilitate and encourage more complex uses of the data. Because of the technological and legal complexity of web archives, this can be a difficult proposition.

The UK Web Archive, administered by the British Library, has a suite of tools and specialized datasets that make it easy to browse and analyze their web archive data. For example, one data set summarizes the formats found in the collection over time, including the progression of HTML versions in use at any given time. They also provide an n-gram search tool, which allows you to visualize the frequency of word or phrase usage in the archive over time.
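
The idea behind an n-gram search can be sketched in a few lines. This is a simplified illustration, not how the UK Web Archive's tool is built: the function `ngram_frequency_by_year`, the `(year, text)` input format, and the whitespace tokenization are all assumptions.

```python
from collections import Counter


def ngram_frequency_by_year(documents, phrase):
    """Return {year: share of that year's n-grams that equal `phrase`}.

    `documents` is an iterable of (year, text) pairs, a hypothetical
    stand-in for archived pages with capture dates.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    if n == 0:
        raise ValueError("phrase must contain at least one word")

    hits = Counter()    # n-grams matching the phrase, per year
    totals = Counter()  # all n-grams seen, per year
    for year, text in documents:
        tokens = text.lower().split()
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            totals[year] += 1
            if tuple(tokens[i:i + n]) == target:
                hits[year] += 1

    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Made-up sample data, for illustration only.
sample_docs = [
    (2000, "web archive web archive"),
    (2001, "the web is big"),
]
frequencies = ngram_frequency_by_year(sample_docs, "web archive")
print(frequencies)
```

Dividing by the total number of n-grams in each year is what makes years comparable, since the number of pages captured can vary widely from one year to the next.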

The UK Web Archive also has a blog, Analytical Access to the Domain Dark Archive, where researchers post proposed research projects using the dark archive collection. The proposals show the wide range of disciplines and research goals that web archives can serve. Projects include mapping the geographic distribution of French blogs written in London, and performing sentiment analysis on discussions of poetry and comparing the results to critical reviews.

What interesting uses (scholarly or otherwise) of web archive collections have you come across?
