
A Vision of the Role and Future of Web Archives: Research Use


The following is a guest post by Kalev H. Leetaru, University of Illinois, who presented these ideas at the 2012 General Assembly of the IIPC. This post is the second in a three-part series. View the first post here.

Data Mining

For millennia, scholarship in archives and libraries has meant intensive reading of a small number of works. In the past decade, the "digital humanities" and the computational social sciences have led to the growing use of computerized analysis of archives, in which software algorithms identify patterns and point to areas of interest in the data. Digital archives have largely been built around the earlier access modality of deep reading, while computational techniques need rapid access to vast volumes of content, often encompassing the entire archive. New programming interfaces and access policies are needed to enable this new generation of scholarship using web archives.

Informal discussions with web archivists suggest a chicken-or-the-egg dilemma: data miners want to analyze archives but can't without the necessary programmatic interfaces, while archives for the most part want to encourage use of their collections but don't know what interfaces to support without working with data miners. Few archives today offer the programmatic interfaces needed for automated access to their collections, and those that do tend to be aimed at metadata rather than fulltext content and to rely on library-centric protocols and mindsets. Some have fairly complex interfaces, with very fine-grained toolkits for each possible use scenario. The few that offer data exports present an all-or-nothing proposition: you either download a ZIP file of the entire contents of the archive or you get nothing; there is no in-between. There are some bright spots, though: the National Endowment for the Humanities has taken initial steps toward helping archivists and data miners work together through grand challenge programs like its Digging into Data initiative, in which a selection of archives made their content available to awardees for large-scale data mining.
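As an illustration of how simple such an interface can be, here is a minimal sketch that asks the Internet Archive's CDX lookup service for a machine-readable list of snapshots of a URL. The endpoint and parameter names reflect that service as publicly documented and may change, so treat them as an assumption to verify rather than a stable contract; the point is how little interface a data miner actually needs.

```python
import json
import urllib.parse
import urllib.request

# Minimal sketch of programmatic snapshot lookup, modeled on the Internet
# Archive's CDX query service. Endpoint and parameters are assumptions to
# verify against current documentation before real use.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def list_snapshots(url, year_from="2000", year_to="2012", limit=50):
    """Return (timestamp, original_url, status) tuples for archived snapshots."""
    params = urllib.parse.urlencode({
        "url": url,
        "from": year_from,
        "to": year_to,
        "output": "json",
        "limit": limit,
    })
    with urllib.request.urlopen(f"{CDX_ENDPOINT}?{params}") as resp:
        rows = json.load(resp)
    if not rows:
        return []
    # The first row is a header such as ["urlkey", "timestamp", "original", ...].
    header, records = rows[0], rows[1:]
    ts = header.index("timestamp")
    orig = header.index("original")
    status = header.index("statuscode")
    return [(r[ts], r[orig], r[status]) for r in records]

if __name__ == "__main__":
    for snapshot in list_snapshots("whitehouse.gov/news")[:10]:
        print(snapshot)
```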

[Image: Partial map of the Internet based on the January 15, 2005 data from the Opte Project. Used under a Creative Commons license.]

Yet one only has to look at Twitter for a model of what archives could do. Twitter provides a single small programming interface with a few very basic options, but through that interface it has been able to support an ecosystem covering nearly every imaginable research question and tool. It even offers a tiered cost-recovery model: users needing only small quantities of data (a "sip") can access the feed for free, while the rest are charged on a tiered pricing scale based on the quantity of data they need, up to the entirety of all 340M tweets at the highest level. Finally, the interfaces provided by Twitter are compatible with the huge numbers of analytical, visualization, and filtering tools provided by the Googles and Yahoos of the world with their open cloud toolkits. If archives took the same approach, with a standardized interface like Twitter's, researchers could leverage these huge ecosystems for the study of the web itself.
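To make the idea of a standardized interface concrete, the sketch below defines a single, purely hypothetical query contract and a piece of research code written once against it. Nothing here corresponds to an existing API: `ArchiveClient` and its `snapshots()` method are invented for illustration, but any archive that implemented such a contract could be analyzed with the same code and the same third-party tools.

```python
from abc import ABC, abstractmethod

class ArchiveClient(ABC):
    """Hypothetical shared contract that any web archive could implement."""

    @abstractmethod
    def snapshots(self, url, start, end):
        """Yield (timestamp, archived_url) pairs for snapshots of `url`
        captured between `start` and `end` (YYYYMMDD strings)."""

def snapshots_per_year(clients, url, start="20000101", end="20121231"):
    """Research code written once against the shared contract and usable
    with every archive that implements it."""
    counts = {}
    for client in clients:
        for timestamp, _archived_url in client.snapshots(url, start, end):
            year = timestamp[:4]
            counts[year] = counts.get(year, 0) + 1
    return counts
```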

For some archives, the bottleneck has become the sheer size of the data, which is now too large to share over the network. Through a partnership with Google, the HathiTrust allows data miners to request a copy of the 1800-1924 Google Books archive, consisting of around 750 million pages of material. Instead of receiving a download link, users must pay the cost of purchasing and shipping a box full of USB drives, because networks, even between research universities, simply cannot keep up with the size of datasets used today. In the sciences, some of the largest projects, such as the Large Synoptic Survey Telescope, go as far as to purchase and house an entire computing cluster in the same machine room as the data archive and to allow researchers to submit proposals to run their programs on that cluster, because even with USB drives the data is simply too large to copy.

Not all of the barriers to offering bulk data mining access to archives are technical: copyright and other legal restrictions can present significant complications. Even here, though, technology can provide a possible alternative: "nonconsumptive analysis," in which software algorithms perform surface-level analyses rather than "deep reading" of the text, may satisfy the requirements of copyright law. In other cases, transforming copyrighted material into another form, such as a wordlist, as was done with the Google Books Ngrams dataset, may provide a solution.
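To make "nonconsumptive analysis" concrete, the sketch below reduces the full text of a page to an unordered word-frequency list, in the spirit of the Google Books Ngrams transformation. The reduction itself is trivial; whether any particular transformation satisfies a given rights regime is of course a legal question rather than a technical one.

```python
import collections
import re

def to_wordlist(text, min_count=1):
    """Reduce full text to an unordered word-frequency table.

    The original prose cannot be reconstructed from the output, which is
    the essence of a wordlist-style nonconsumptive transformation.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = collections.Counter(words)
    return {word: count for word, count in counts.items() if count >= min_count}

# Example: the sentence below becomes {'the': 2, 'president': 1, ...},
# with no word order or sentence structure retained.
print(to_wordlist("The President today announced the new press secretary."))
```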

Not everyone appreciates or understands the value web archives provide to society, and many archives are constantly under pressure just to find enough funds to keep the power running. This is an area where partnering with researchers may help: there are only a few sources of funding for the creation and operation of web archives, compared with the myriad funding opportunities for research. The increased bandwidth, hardware load, and other resource requirements of large data mining projects come at a real cost, but at the same time such projects directly demonstrate the value of those archives to new audiences and disciplines that may be able to partner with the archives on proposals, potentially opening new funding opportunities.

User Insight

While some archives cannot offer access to their holdings for legal reasons and instead serve only as archives of last resort, most archives would hold little value for their constituents if they could not provide some level of access to the content they have archived. User interfaces today are designed for casual browsing by non-expert users, with simplicity and ease of use as their core principles. As archives become a growing source for scholarly research, they must address several key areas of need in supporting these more advanced users:

• Inventory. There is a critical need for better visibility into the precise holdings of each archive. With most digital libraries of digitized materials a visitor can browse through the collection from start to end, though even there one usually can't export a CSV file containing a master list of everything in the collection. Most web archives, on the other hand, are accessible only through a direct lookup mechanism in which the user types in a URL and gets back any matching snapshots. Archives only store copies of material; they don't provide an index to it or even a listing of what they hold, on the assumption that this role is provided elsewhere. For domains that have been deleted or now house unrelated content, that assumption no longer holds. This would be akin to libraries dropping their reading rooms, stacks, and card catalogs and storing all of their books in a robotic warehouse: instead of browsing or requesting a book by title or category, one could only request a book by its ISBN, which would have to be known beforehand and which it would be someone else's responsibility to record. A tremendous step forward would be a list from each archive of all of the root domains it has one or more pages from, but ultimately a list of all URLs, along with the number of snapshots and the dates of those snapshots, would enable an entirely new form of access to these archives (a minimal sketch of such an inventory export follows this list). This data could be used by researchers and others to come up with new ways of accessing and interacting with the data held by these archives.
• Meta Search. With better inventory data, we could build metasearch tools that act as the digital equivalent of WorldCat for web archives. Web archives today operate more like document archives than libraries: they hold content, but they themselves often have no idea of the full extent of what they hold. A scholar looking for a particular print document might have to spend months or even years scouring archives all over the world for one that holds a copy of that document, whereas if she were looking for a book, a simple search on WorldCat would turn up a list of every participating library that holds a copy in its electronic catalog. This is possible because libraries have invested in maintaining inventories of their holdings and in standardizing the way those inventories are stored, so that third parties can aggregate them and develop services that allow users to search across them. Imagine being able to type in a URL and see every copy from every web archive in the world, rather than just the copies held by any one archive.
• Specialty Archives. Metasearch would allow federated search across all archives, but it also raises the question of backups for smaller specialty archives. Larger whole-web archives like the Internet Archive still can't possibly archive everything that exists. Specialty archives fill this niche, often with an institutional focus or through a researcher building an archive on a particular niche topic for her own use. Often these archives are created for a particular research project and then discarded when the paper is published. How do we bring them "into the fold"? Perhaps some mechanism is needed that allows those archives to be submitted to a network of web archives, essentially saying "if you're interested, here you go." They would need to be marked separately, since their content was produced outside of the main archive's processes, but as web crawlers become easier to use and more researchers create their own specialty curated collections, should we have mechanisms to archive those collections and thereby leverage their creators' resources to penetrate areas of the web we might not otherwise reach?
• Citability. For archives to be useful in scholarly research, a particular snapshot of a page must have a permanent identifier that can be cited in the references list of a publication. The Internet Archive provides an ideal example: each snapshot has its own permanent URL that includes both the page URL and the exact timestamp of that snapshot, and this URL can be cited in a publication in the same format as any other webpage. Yet not every archive provides this type of access: some make use of AJAX (interactive JavaScript applications) to provide a more "desktop-like" browsing experience, but this masks the URL of each snapshot, making it impossible to point others to that copy.
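As a concrete illustration of the inventory export described in the first bullet above, the sketch below turns a stream of (URL, timestamp) snapshot records, the minimal information most archives already hold internally, into the kind of master CSV a researcher could download: one row per URL, with its snapshot count and snapshot dates. The record format is an assumption for illustration; real archives would export from their own indexes.

```python
import collections
import csv

def write_inventory(snapshot_records, out_path="inventory.csv"):
    """Write a per-URL inventory from (url, 14-digit timestamp) records.

    `snapshot_records` is assumed to be any iterable of (url, timestamp)
    pairs; the timestamps follow the common YYYYMMDDHHMMSS convention.
    """
    dates = collections.defaultdict(list)
    for url, timestamp in snapshot_records:
        dates[url].append(timestamp)

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "snapshot_count", "snapshot_timestamps"])
        for url in sorted(dates):
            stamps = sorted(dates[url])
            writer.writerow([url, len(stamps), ";".join(stamps)])

# Example with two hypothetical snapshots of one page:
write_inventory([("example.org/", "20050115083000"),
                 ("example.org/", "20061102120000")])
```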

Technical Insight

In the modern era, libraries and archives have existed decoupled from their researchers: a professional class collected and curated the collections, and scholars traveled to whichever institutions held the materials they needed. Few records exist as to why a given library collected this work rather than that one, and as scholars we simply accept this. Yet perhaps in the digital era we can do better, as most of these decisions are recorded in emails, memos, and other materials, all of them searchable and indexable. Web crawlers are seeded with starting URLs and crawl based on deterministic software algorithms, both of which can be documented for scholars.

Most web archives operate as “black boxes” designed for casual browsing and retrieval of individual objects, without asking too many questions about how that object got there. This is in stark contrast to digitized archives, in which every conceivable piece of metadata is collected. A visitor to the Internet Archive today encounters an odd experience: retrieving a digitized book yields a wealth of information on how that digital copy came to be, from the specific library it came from to the name of the person who operated the scanner that photographed it, while retrieving a web page yields only a list of available snapshot dates.

• Snapshot Timestamps. All archives store an internal timestamp recording the precise moment when a page snapshot was downloaded, but their user interfaces often mask this information. For example, when examining changes in White House press releases, we found that clicking on a snapshot for "April 4, 2001" in the Internet Archive would always take us to a snapshot of the page we requested, but if we looked in the URL bar (the Internet Archive includes the timestamp of the snapshot in the URL), we noticed that occasionally the snapshot we were ultimately given was from days or weeks before or after our requested date. Upon further research, we found that some archives automatically redirect a user to the nearest date when a given snapshot becomes unavailable due to hardware failure or other reasons. This is ideal behavior for a casual user, but for an expert user tracing how changes in a page correspond to political events occurring each day, it is problematic. Archives should provide a notice when a requested snapshot is not available, allowing the user to decide whether to proceed to the closest available one or to select another date; in the meantime, researchers can at least detect the substitution themselves by comparing the requested date with the timestamp embedded in the returned URL (see the sketch after this list).
• Page Versus Site Timestamps. Some archives display only a single timestamp for all pages collected from a given site during a particular crawl: usually the time at which the crawlers started archiving that site. Even a medium-sized site may take hours or days to fully crawl when rate-limiting and other factors are taken into account, and for some users it is imperative to know the precise moment when each page was requested, not when the crawlers first entered the site. Most archives store this information, so it is simply a matter of providing access to it via the user interface for those users requesting it.
• Crawl Algorithms. Not every site can be crawled in its entirety: some sites may simply be too large, may have complex linking structures that make it difficult to find every page, or may be dynamically generated. Some research questions may be affected by the algorithm used to crawl the site (depth-first vs. breadth-first), the seed URLs used to enter the site (the front page, table-of-contents pages, content pages, etc.), where the crawl was aborted (if it was), which pages returned errors during the crawl (and whose links were therefore not followed), and so on. If, for example, one wishes to estimate the size of a dynamic database-driven website, such factors can be used to estimate its total size and composition, but only if users can access these technical characteristics of the crawl.
• Raw Source Access. Current archives are designed to provide a transparent "time machine" view of the web, where clicking on a snapshot attempts to render the page in a modern browser in a way that reproduces, as faithfully as possible, what it originally looked like when it was captured. However, a page might contain embedded HTML instructions, such as a meta refresh tag or JavaScript code, that automatically forward the browser to a new URL. This may happen transparently, without the user noticing. In our study of White House press releases, we were especially interested in pages that had been blanked out, where a press release had been replaced with a meta refresh tag and an editorial note inside an HTML comment. Clicking on these pages in the Internet Archive interface simply forwarded us to the new URL indicated by the refresh command, so we had to download the pages with a separate downloading tool in order to review the source code of each page without being redirected. This is a relatively rare scenario, but it would be helpful for archives to provide a "view source" mode, where clicking on a snapshot takes the user directly to the source code of a page instead of trying to display it.
• Crawler Physical Location. Several major foreign news outlets embargo content or present different selections or versions of their content depending on where the visitor's computer is physically located. A visitor accessing such a site will see a very different picture depending on whether she is in the United States, the United Kingdom, China, or Russia. This issue is growing as more sites adopt content management systems that dynamically adjust the structure and layout of the site for each individual visitor based on their actions as they click through the site. Analyses of such sites require information on where the crawlers were physically located and the exact order in which they requested pages from the site. As with the other recommendations listed above, most archives already hold this information; it is simply a matter of making it available to users.
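The date substitution described in the Snapshot Timestamps bullet above can at least be detected from the user's side, because Wayback-style snapshot URLs embed a 14-digit capture timestamp. The sketch below assumes that URL convention; other archives would need a different pattern.

```python
import re
from datetime import datetime

# Wayback-style snapshot URLs embed a 14-digit capture timestamp, e.g.
# https://web.archive.org/web/20010404083015/http://www.whitehouse.gov/news/
TIMESTAMP_RE = re.compile(r"/web/(\d{14})")

def check_snapshot_date(requested, served_url, tolerance_days=0):
    """Warn when the snapshot actually served differs from the date requested."""
    match = TIMESTAMP_RE.search(served_url)
    if match is None:
        raise ValueError("no 14-digit timestamp found in archive URL")
    served = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    drift = abs((served - requested).days)
    if drift > tolerance_days:
        print(f"Warning: requested {requested:%Y-%m-%d} but served "
              f"{served:%Y-%m-%d} ({drift} days apart)")
    return served

# Example: a request for April 4, 2001 silently answered with a March 30 snapshot.
check_snapshot_date(
    datetime(2001, 4, 4),
    "https://web.archive.org/web/20010330120000/http://www.whitehouse.gov/news/")
```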

Fidelity and Linkage

• Fidelity. Modern web archiving platforms capture not only the HTML code of a page, but also interpret the HTML and associated CSS files to compile a list of images, CSS files, JavaScript code, and other files necessary to properly display the page, and archive these as well. The rise of interactive and highly multimedia web pages is challenging this approach, as pages may have embedded Flash or AJAX/JavaScript applications, streaming video, and embedded widgets displaying information from other sites. No longer limited to "high design" or highly technical sites, these features are making their way into more traditional websites, such as the news media. For example, the BBC's website includes both Flash and JavaScript animations on its front page, while the Chicago Tribune's site includes Flash animations on its front page that respond to mouseovers and animate or perform other actions. The BBC also includes an embedded JavaScript widget that displays advertisements, and both sites include extensive embedded streaming Flash-based video. Many of these tools reference data or JavaScript code on other sites: for example, many sites now make use of Google's Visualization API toolkit for interactive graphs and displays and simply link to the code housed on Google's site. On the one hand, we might dismiss advertisements and embedded content as not worth archiving, yet a rich literature in the advertising discipline addresses the psychological impact of advertisements and other sidebar material on the processing of information in the web era. Even digitized historical newspaper archives have been careful to offer access to the entire scanned page image, rather than just the article text, so that scholars can study advertisements and layout. Excluding dynamic content would make it impossible for scholars of the future to understand how advertisements were used on the web. Yet simply saving a copy of a Flash or AJAX widget may not be sufficient, as technical dependencies may render it unexecutable 20 years from now. One possibility might be creating a screen capture of each page as it is archived, to provide at least a coarse record of what that page looked like to a visitor of the time period (a minimal sketch of this follows this list).
• Web/Social Linkage. Many sites are making use of social media platforms like Twitter and Facebook as part of their overall information ecosystem. For example, the front page of the Chicago Tribune prominently links to its Facebook page, where editors post a curated assortment of links to Tribune content over the course of each day. Visitors “like” stories and post comments on the Facebook summary of the story, creating a rich environment of commentary that exists in parallel to the original webpage on the Tribune site. Other sites allow commentary directly on their webpages through a “user comments” section. Some sites may only allow comments for a few days after a page is posted, while others may allow comments years later. This social narrative is an integral part of the content seen by visitors of the time, yet how do we properly preserve this material, especially from linked social media platform profiles?
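The screen-capture idea in the Fidelity bullet above can be prototyped with any headless browser. The sketch below assumes Selenium with Firefox and geckodriver installed; the specific toolchain is an illustrative choice, and the point is simply to store a rendered image alongside the crawled files.

```python
from datetime import datetime, timezone

# Assumes the Selenium package plus Firefox/geckodriver are installed;
# any headless browser with a screenshot call would serve the same purpose.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

def capture_page_image(url, out_dir="."):
    """Save a PNG screenshot of `url`, named by its capture timestamp."""
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
        path = f"{out_dir}/snapshot_{stamp}.png"
        driver.save_screenshot(path)
        return path
    finally:
        driver.quit()

# Example: store a rendered image next to the crawled copy of this page.
# capture_page_image("http://www.bbc.co.uk/")
```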

Tomorrow, the third piece in this three-part series will offer conclusions and cover the role of archives.
