The following is a guest post by Kalev H. Leetaru, University of Illinois, who presented these ideas at the 2012 General Assembly of the IIPC. This post is the second in a three-part series. View the first post here.
For millennia, scholarship in archives and libraries has meant intensive reading of a small number of works. In the past decade the “digital humanities” and the computational social sciences have led to the growing use of computerized analysis of archives, in which software algorithms identify patterns and point to areas of interest in the data. Digital archives have largely been built around the earlier access modality of deep reading, while computational techniques need rapid access to vast volumes of content, often encompassing the entire archive. New programming interfaces and access policies are needed to enable this new generation of scholarship using web archives.
Informal discussions with web archivists suggest a chicken-or-the-egg dilemma: data miners want to analyze archives, but can’t without the necessary programmatic interfaces, while archives for the most part want to encourage use of their collections, but don’t know which interfaces to support without working with data miners. Few archives today support the programmatic interfaces needed for automated access to their collections, and those that do tend to be aimed at metadata rather than fulltext content, relying on library-centric protocols and mindsets. Some have fairly complex interfaces, with very fine-grained toolkits for each possible use scenario. The few that offer data exports present an either-or proposition: you either download a ZIP file of the entire contents of the archive or you get nothing; there is no in-between. There are some bright spots, though: the National Endowment for the Humanities has taken initial steps toward helping archivists and data miners work together through grand challenge programs like its Digging into Data initiative, in which a selection of archives made their content available to awardees for large-scale data mining.
Yet one only has to look at Twitter for a model of what archives could do. Twitter provides a single small programming interface with a few very basic options, but through that interface it has been able to support an ecosystem covering nearly every imaginable research question and tool. It even offers a tiered cost-recovery model: users needing only small quantities of data (a “sip”) can access the feed for free, while the rest are charged according to the quantity of data they need, up to the entirety of all 340M tweets at the highest tier. Finally, the interfaces provided by Twitter are compatible with the huge number of analytical, visualization, and filtering tools provided by the Googles and Yahoos of the world through their open cloud toolkits. If archives took the same approach with a standardized interface like Twitter’s, researchers could leverage these ecosystems for the study of the web itself.
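To make the idea concrete, here is a minimal sketch of what a Twitter-style standardized query interface for a web archive might look like from a researcher’s side. The endpoint, parameter names, and “sip” limit are all hypothetical illustrations, not an API any archive actually offers today:

```python
from urllib.parse import urlencode

# Hypothetical base endpoint -- no archive exposes this today; it simply
# sketches what a single, small, standardized query interface could be.
ARCHIVE_API = "https://archive.example.org/api/v1/snapshots"

def build_query(url, start=None, end=None, limit=100):
    """Build a request URL for a hypothetical archive snapshot API.

    start/end are YYYYMMDD strings bounding the snapshot dates wanted;
    limit caps how many snapshot records come back per call -- the
    free-tier "sip" in a Twitter-style cost-recovery model.
    """
    params = {"url": url, "limit": limit}
    if start:
        params["from"] = start
    if end:
        params["to"] = end
    return ARCHIVE_API + "?" + urlencode(params)
```

Because the interface is one flat query with a handful of parameters, any generic HTTP or cloud analytics toolkit could consume it without archive-specific code, which is the point of the Twitter comparison.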
For some archives, the bottleneck has become the size of the data, which is now too large to share over the network. Through a partnership with Google, data miners can request from the HathiTrust a copy of the 1800-1924 Google Books archive, consisting of around 750 million pages of material. Instead of receiving a download link, users must pay the cost of purchasing and shipping a box full of USB drives, because networks, even between research universities, simply cannot keep up with the size of datasets used today. In the sciences, some of the largest projects, such as the Large Synoptic Survey Telescope, are going so far as to purchase and house an entire computing cluster in the same machine room as the data archive, allowing researchers to submit proposals to run their programs on the cluster, because even with USB drives the data is simply too large to copy.
Not all of the barriers to offering bulk data mining access to archives are technical: copyright and other legal restrictions can present significant complications. Even here, though, technology can provide a possible alternative: “nonconsumptive analysis,” in which software algorithms perform surface-level analyses rather than “deep reading” of text, may satisfy the requirements of copyright. In other cases, transformations of copyrighted material into another form, such as a wordlist, as was done with the Google Books Ngrams dataset, may provide possible solutions.
Not everyone appreciates or understands the value web archives provide to society, and the archives themselves are constantly under pressure just to find enough funds to keep the power running. This is an area where partnering with researchers may help: there are only a few sources of funding for the creation and operation of web archives, compared with the myriad funding opportunities for research. The increased bandwidth, hardware load, and other resource requirements of large data mining projects come at a real cost, but at the same time they directly demonstrate the value of those archives to new audiences and disciplines that may be able to partner with the archives on proposals, potentially opening new funding opportunities.
While some archives cannot offer access to their holdings for legal reasons and instead serve only as archives-of-last-resort, most archives would hold little value to their constituents if they could not provide some level of access to the content they have preserved. User interfaces as a whole today are designed for casual browsing by non-expert users, with simplicity and ease of use as their core principles. As archives become a growing source for scholarly research, they must address several key areas of need in supporting these more advanced users:
• Inventory. There is a critical need for better visibility into the precise holdings of each archive. In most digital libraries of digitized materials a visitor can browse the collection from start to end, though even there one usually can’t export a CSV file containing a master list of everything in the collection. Most web archives, on the other hand, are accessible only through a direct lookup mechanism: the user types in a URL and gets back any matching snapshots. Archives store only copies of material; they don’t provide an index to it, or even a listing of what they hold, on the assumption that this role is filled elsewhere. For domains that have been deleted or now house unrelated content, it often is not. This would be akin to libraries dropping their reading rooms, stacks, and card catalogs and storing all of their books in a robotic warehouse: instead of browsing or requesting a book by title or category, one could only request a book by its ISBN code, which had to be known beforehand, and it would be someone else’s responsibility to store those codes. A tremendous step forward would be a list from each archive of all of the root domains it has one or more pages from; ultimately, a list of all URLs, along with the number of snapshots and the dates of those snapshots, would enable an entirely new form of access to these archives. Researchers and others could use this data to devise new ways of accessing and interacting with the material these archives hold.
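The inventory export described above is a small amount of code once the holdings data exists. A minimal sketch, assuming a hypothetical `holdings` mapping from each archived URL to its list of snapshot dates, might look like this:

```python
import csv
import io

def export_inventory(holdings):
    """Produce a master inventory as CSV: URL, snapshot count, snapshot dates.

    `holdings` maps each archived URL to a list of snapshot dates
    (YYYYMMDD strings) -- exactly the kind of master listing the post
    argues every archive should be able to emit for researchers.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["url", "snapshot_count", "snapshot_dates"])
    for url in sorted(holdings):
        dates = sorted(holdings[url])
        # Dates are semicolon-joined so each URL stays on one CSV row.
        writer.writerow([url, len(dates), ";".join(dates)])
    return buf.getvalue()
```

A root-domain list, the more modest first step the post suggests, would be the same idea with URLs collapsed to their hostnames.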
• Meta Search. With better inventory data, we could build metasearch tools that act as the digital equivalent of WorldCat for web archives. Web archives today operate more like document archives than libraries: they hold content, but they themselves often have no idea of the full extent of what they hold. A scholar looking for a particular print document might have to spend months or even years scouring archives all over the world for one that holds a copy of that document, whereas if she were looking for a book, a simple search on WorldCat would turn up a list of every participating library holding a copy in its electronic catalog. This is possible because libraries have invested in maintaining inventories of their holdings and standardizing the way those inventories are stored, so that third parties can aggregate them and develop services that let users search across them. Imagine being able to type in a URL and see every copy from every web archive in the world, rather than just the copies held by any one archive.
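Given standardized inventories, the aggregation step itself is trivial. A minimal sketch, assuming each archive publishes an inventory in a common URL-to-snapshot-dates form (a hypothetical format, since no such standard exists today):

```python
def metasearch(url, inventories):
    """Return every archive holding at least one snapshot of `url`.

    `inventories` maps an archive's name to its published inventory
    (url -> list of snapshot dates) -- the WorldCat-style aggregation
    across participating archives that the post imagines.
    """
    return {name: inv[url] for name, inv in inventories.items() if url in inv}
```

The hard part is not this lookup but convincing archives to publish inventories in a shared format; that is exactly the standardization investment libraries already made for their catalogs.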
• Specialty Archives. Metasearch would allow federated search across all archives, but it also raises the question of backups for smaller specialty archives. Larger whole-web archives like the Internet Archive still can’t possibly archive everything that exists. Specialty archives fill this niche, often with an institutional focus, or through a researcher building an archive of material on a particular niche topic for her own use. Often these archives are created for a particular research project and then discarded when the paper is published. How do we bring them “into the fold”? Perhaps a mechanism is needed that lets those archives submit their collections to a network of web archives and say, in essence, “if you’re interested, here you go.” Such collections would need to be marked separately, since their content was produced outside the main archive’s processes, but as web crawlers become easier to use and more researchers create their own specialty curated collections, should we have mechanisms to allow them to be archived, leveraging their resources to penetrate areas of the web we might not otherwise reach?
In the modern era libraries and archives have existed decoupled from their researchers: a professional class collected and curated the collections, and scholars traveled to whichever institutions held the materials they needed. Few records exist as to why a given library collected this work rather than that one, and as scholars we simply accept this. Yet perhaps in the digital era we can do better, as most of these decisions are recorded in emails, memos, and other materials, all of them searchable and indexable. Web crawlers are seeded with starting URLs and crawl according to deterministic software algorithms, both of which can be documented for scholars.
Most web archives operate as “black boxes” designed for casual browsing and retrieval of individual objects, without asking too many questions about how that object got there. This is in stark contrast to digitized archives, in which every conceivable piece of metadata is collected. A visitor to the Internet Archive today encounters an odd experience: retrieving a digitized book yields a wealth of information on how that digital copy came to be, from the specific library it came from to the name of the person who operated the scanner that photographed it, while retrieving a web page yields only a list of available snapshot dates.
• Snapshot Timestamps. All archives store an internal timestamp recording the precise moment when a page snapshot was downloaded, but their user interfaces often mask this information. For example, when examining changes in White House press releases, we found that clicking on a snapshot for “April 4, 2001” in the Internet Archive would always take us to a snapshot of the page we requested, but looking in the URL bar (the Internet Archive includes the timestamp of the snapshot in the URL), we noticed that occasionally the snapshot we were given was from days or weeks before or after our requested date. Upon further research, we found that some archives automatically redirect a user to the nearest date when a given snapshot becomes unavailable due to hardware failure or other reasons. This is ideal behavior for a casual user, but for an expert user tracing how changes in a page correspond to political events occurring each day, it is problematic. Archives should instead display a notice when a requested snapshot is unavailable, allowing the user to decide whether to proceed to the closest available snapshot or select another date.
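An expert user can guard against this silent redirect today by checking the timestamp the Internet Archive embeds in its snapshot URLs, a 14-digit YYYYMMDDhhmmss field. A small sketch of that check (the example URL below is illustrative, not one of the press-release pages from the study):

```python
import re
from datetime import datetime

def snapshot_drift(wayback_url, requested_date):
    """Return how many days a delivered snapshot drifts from the date requested.

    Wayback Machine URLs embed the capture moment as a 14-digit
    YYYYMMDDhhmmss field after /web/; comparing it against the date the
    user asked for exposes the silent nearest-date redirect.
    """
    match = re.search(r"/web/(\d{14})/", wayback_url)
    if not match:
        raise ValueError("no 14-digit capture timestamp found in URL")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return (captured - requested_date).days
```

A nonzero result is the cue to treat the snapshot with care, which is precisely the notice the interface itself ought to surface.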
• Page Versus Site Timestamps. Some archives display only a single timestamp for all pages collected from a given site during a particular crawl: usually the time at which the crawlers started archiving that site. Even a medium-sized site may take hours or days to fully crawl once rate-limiting and other factors are taken into account, and for some users it is imperative to know the precise moment each page was requested, not when the crawlers first entered the site. Most archives store this information, so it is simply a matter of exposing it through the user interface for those users who request it.
• Crawl Algorithms. Not every site can be crawled in its entirety: some sites may simply be too large, may have complex linking structures that make it difficult to find every page, or may be dynamically generated. Some research questions may be affected by the algorithm used to crawl the site (depth-first vs. breadth-first), the seed URLs used to enter the site (the front page, table of contents pages, content pages, etc.), where the crawl was aborted (if it was), which pages errored during the crawl (and thus whose links were not followed), and so on. If, for example, one wishes to estimate the size of a dynamic database-driven website, such factors can be used to draw estimates of its total size and composition, but only if users can access these technical characteristics of the crawl.
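Why the frontier strategy matters is easy to see in miniature: over the same link graph, breadth-first and depth-first crawls visit pages in different orders, so a crawl that is aborted partway captures different pages under each. A small sketch over a toy link graph (the URLs are invented for illustration):

```python
from collections import deque

def crawl_order(link_graph, seed, strategy="breadth"):
    """Return the order pages are visited from `seed` under a given strategy.

    `link_graph` maps a URL to the URLs it links to. The frontier is a
    deque: breadth-first pops from the front (FIFO), depth-first from
    the back (LIFO) -- one of the crawl parameters the post argues
    archives should document, since it changes which pages are reached
    first (and, under a page budget, which are reached at all).
    """
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        page = frontier.popleft() if strategy == "breadth" else frontier.pop()
        order.append(page)
        for link in link_graph.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

Recording just the strategy name, the seed list, and any abort point in a crawl’s metadata would let a researcher reason about which parts of a site the snapshot can and cannot represent.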
• Crawler Physical Location. Several major foreign news outlets embargo content or present different selections or versions of their content depending on where the visitor’s computer is physically located. A visitor accessing such a site will see a very different picture depending on whether she is in the United States, the United Kingdom, China, or Russia. This issue is growing as more sites adopt content management systems that dynamically adjust the structure and layout of the site for each individual visitor based on her actions as she clicks through the site. Analyses of such sites require information on where the crawlers were physically located and the exact order of the pages they requested from the site. As with the other recommendations above, this information is already held by most archives; it is simply a matter of making it available to users.
Fidelity and Linkage
• Web/Social Linkage. Many sites make use of social media platforms like Twitter and Facebook as part of their overall information ecosystem. For example, the front page of the Chicago Tribune prominently links to its Facebook page, where editors post a curated assortment of links to Tribune content over the course of each day. Visitors “like” stories and post comments on the Facebook summary of each story, creating a rich environment of commentary that exists in parallel to the original webpage on the Tribune site. Other sites allow commentary directly on their webpages through a “user comments” section; some may allow comments only for a few days after a page is posted, while others accept comments years later. This social narrative is an integral part of the content seen by visitors at the time, yet how do we properly preserve this material, especially from linked social media platform profiles?
Tomorrow, the third piece in this three-part series will offer conclusions and cover the role of archives.