The following is a guest post from Sarah Weissman, a second-year student in the MLS program at the University of Maryland’s iSchool.
This past semester as part of the course Information Access in the Humanities, my classmates and I studied current trends in humanities scholarship. Under the guidance of Kari Kraus we learned about the availability of digital resources for humanities scholars and the use of these resources by researchers in humanities fields. For our final projects, we worked in groups to create humanities-based web collections, an act of curation similar to what many libraries and archives around the world are now undertaking. These student-developed collections are publicly available through the University of Maryland iSchool on the Archive-It website.
Each group of students was responsible for choosing a collection topic, appraising and selecting websites to be included in the collection, and working with the Archive-It tool to collect the chosen websites. My group chose the then-upcoming release of The Hobbit movie as the topic of our collection, foremost because of our shared interest in archiving multimedia content, and also because the movie was a highly anticipated event of some cultural significance, both as part of a popular film series and in the larger context of adaptations of the literary works of J.R.R. Tolkien. Our collection aimed to capture the digital artifacts associated with the movie, such as trailers, teaser videos and artwork published in preparation for the movie release, as well as commentary published by its creators, actors and fans. To this end we selected both official and unofficial/fan sites for our collection, and focused the majority of our collection on social media (including social network sites like Twitter and Google+, as well as web forums and blogs). The timing of the release was also an important factor in our choice, since the design of social media websites and the limitations of crawling software often mean having to capture these pages virtually in real time.
As social media sites grow in popularity and become more and more important in our day-to-day lives, users increasingly face related concerns about privacy, security and content ownership. At the same time, this added relevance makes social media content an important target for archiving. Unfortunately, archiving social media content can be challenging. Reasons include the prohibitive use of robots.txt files (files that give web crawlers access-restriction guidelines, enforced only on the honor system), the inaccessibility of password-protected content, the inaccessibility of content for reasons of monetization, and the fear or threat of litigation from content and/or website owners. This post aims to document some of the technical difficulties in archiving social media sites, in particular the challenges we faced during our student project in archiving web discussion boards. But first I will give a brief background on web archiving with Archive-It and on what it means to scope a web collection.
When creating web collections, scoping is an important concept. In essence, scoping is making choices about what will and won’t be included in the final collection. Topic and website choice are both important factors in scoping, as is the availability of resources, including any limitations in time or data storage, or the budget for archiving-tool subscription costs or hosting fees for online collections. The Archive-It web crawling application, a subscription-based tool, lets users limit scope by time (both the frequency and duration of crawls) and data (the number of documents collected), and also gives users finer control of scope through seed specification and scoping rules.
A seed is a URL given to Archive-It’s web crawler (the open source Heritrix crawler) as a starting point for web collection. The seed can be a whole domain, such as “www.loc.gov”, or a document or directory below a domain, such as “blogs.loc.gov/digitalpreservation/”.
To give you some idea what this looks like, here is the list of seeds for our project:
- http://thehobbit.com/
- http://thehobbitblog.com
- http://www.mckellen.com/cinema/hobbit-movie
- http://twitter.com/TheHobbitMovie
- https://plus.google.com/116428360629190654024
- http://the-hobbit-movie.com/
- http://www.council-of-elrond.com/forums/forumdisplay.php?f=53
- http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=17
How a seed is specified can have drastic implications for collection scope. For example, “www.loc.gov/” with a trailing slash means something different to the web crawler than “www.loc.gov”. Specifying a URL with or without a subdomain also makes a difference in crawling (e.g. “loc.gov” vs. “www.loc.gov” vs. “blogs.loc.gov”). Scoping rules in Archive-It can be used to both expand and limit scope. They can be specified in several ways, including as URL prefixes, regular expressions or SURT rules. Once a seed set is specified, Archive-It users can schedule crawls on one or more of the seeds, either as repeated crawls with a specified frequency or as one-time crawls started on demand. Archive-It’s web crawler works by storing a copy of the seed page, then following URLs from that page to find more pages to archive. By default it archives embedded content (such as images, video, JavaScript and stylesheets) and web pages that fall under the lowest-level directory in the seed URL. To further aid in scoping, Archive-It lets users run test crawls that mimic the behavior of a real web crawl but record only the URLs that would have been crawled, not web page content.
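To make that default behavior concrete, here is a simplified sketch in Perl of the “stay under the seed’s lowest-level directory” rule. This is an illustration only, not Heritrix’s actual implementation, and the test URLs are just placeholders:

```perl
#!/usr/bin/perl
# Simplified illustration of default seed scoping: a URL is in scope only if it
# falls under the lowest-level directory of the seed. Not Heritrix's actual code.
use strict;
use warnings;

my $seed = 'http://blogs.loc.gov/digitalpreservation/';

# Reduce the seed to its directory prefix by dropping anything after the last "/".
(my $prefix = $seed) =~ s{[^/]*$}{};

sub in_scope {
    my ($url) = @_;
    return index($url, $prefix) == 0;    # in scope only if the URL starts with the prefix
}

for my $test ('http://blogs.loc.gov/digitalpreservation/2013/02/some-post/',
              'http://www.loc.gov/pictures/') {
    printf "%-65s %s\n", $test, in_scope($test) ? 'in scope' : 'out of scope';
}
```

Applied to a seed like “http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=17”, the directory prefix works out to “http://newboards.theonering.net/forum/gforum/perl/”, which is why, as we will see below, virtually everything on a dynamically generated forum looks in scope.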
Despite the utilities available to the Archive-It user, scoping of web collections can be quite challenging. As an example, below is a snapshot of the top results from our first production crawl, a one-time crawl of 3 days on all 8 of our seeds:
The report lists the number of URLs collected for each host (which you will recall includes the hosts for our seed sites as well as hosts for embedded content). By default the report is sorted by URLs, but the two hosts at the top of the list on the left, www.council-of-elrond.com and newboards.theonering.net, are also at the top when the list is sorted by data and queued documents, finishing with about 500k and 4.7 million queued docs respectively. This is surprising, given that all other hosts in the crawl finished with no queued documents, but, as we learned during this project, it’s hard to guess how much data any one seed will bring in. Nevertheless, 4.7 million URLs is a lot of URLs, about 67 times more documents than we crawled during our 3-day crawl, and thus way more than we could hope to crawl in a semester. Even if we could collect that many documents, assuming the data scaled in proportion we would be looking at about half a terabyte of data for just one host.
A little detective work helps here. Looking more closely at theonering.net, we see that the specific forum we were trying to crawl reports having 163,995 posts in 5,356 threads. Even allowing one URL per post plus paginated thread and forum views, that should come to a few hundred thousand URLs at most. In other words, we have both crawled too few URLs and queued far too many. So what is going on?
In order to understand what is happening, it helps to have a basic idea of how a web-based forum is constructed. Much like a wiki, a web forum is not stored as a collection of web pages, but is a database-driven, dynamically generated website. Posts live in a backing data store and are displayed to website users through scripts. (Note that a web crawler does not preserve the underlying architecture of the website it is crawling, but only the web pages that are loaded when it visits a URL.) For example, take a look at a typical URL from theonering.net below:
If you are unfamiliar with dynamically generated web content, what we see here is a URL that points to a script, “http://newboards.theonering.net/forum/gforum/perl/gforum.cgi”. Everything after the “?” is an argument to gforum.cgi. We can guess that “post=528337” means load post number 528337, “sb=post_time;so=DESC” means sort by post time in descending order, “forum_view=forum_view_collapsed” means display the rest of the posts in the forum in a collapsed style, and “guest=53819654” means that we are browsing the forum as a guest with a particular session ID.
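For readers who would like to see that structure explicitly, here is a small Perl sketch that splits such a URL into the script path and its “;”-separated arguments. The URL is reassembled from the pieces quoted above; the parsing is just an illustration, not anything Archive-It itself does:

```perl
#!/usr/bin/perl
# Split a gforum.cgi URL into the script path and its ";"-separated arguments.
use strict;
use warnings;

my $url = 'http://newboards.theonering.net/forum/gforum/perl/gforum.cgi'
        . '?post=528337;sb=post_time;so=DESC;forum_view=forum_view_collapsed;guest=53819654';

my ($script, $query) = split /\?/, $url, 2;
my %args = map { split /=/, $_, 2 } split /;/, $query;

print "script: $script\n";
print "  $_ = $args{$_}\n" for sort keys %args;
```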
As mentioned above, Heritrix is designed to scope to the directory level of your seed. This means that anything not falling under “http://newboards.theonering.net/forum/gforum/perl/” is considered out of scope. Unfortunately, because of the dynamic nature of the site, everything falls under the same directory and only the trailing arguments change.
Here’s how the original page looks in a web browser.
Just looking at the page one can identify some potential problems for the web crawler. One is the “Forum Nav” frame that appears on every page. In addition, other navigational links appear above the forum text. This means that starting with the seed:
http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=17
the crawler gives us not only everything in forum #17 (Movie Discussion: The Hobbit), but also everything else linked to in the side nav:
- http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=7
- http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=9
- http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=8
- http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=10
- http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=11
- http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=15
Luckily, as mentioned before, Archive-It gives us a way to limit crawl scope using scoping rules. Here are the scoping rules we used for our crawl:
They might be a little hard to read, but the gist is that we blocked all URLs that matched any of the forums in the side nav. (Regex-savvy readers will probably note that these rules could be more concisely expressed as a single regular expression. This is just how they evolved over time.)
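For the curious, a single rule along roughly these lines would do the same job. The pattern below is a reconstruction for illustration, not the exact rule we had in place:

```
^https?://newboards\.theonering\.net/forum/gforum/perl/gforum\.cgi\?forum=(7|8|9|10|11|15)(;.*)?$
```

It blocks any URL whose forum argument is one of the six side-nav forums while leaving forum 17, our target, alone.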
So why do we have at least 4 million extra documents in our queue?
To answer this question we have to dive into the data. Archive-It lets you download lists of crawled, queued or blocked URLs as text files that you can load into the data analysis tool of your choice. Since I needed to sort URLs and match them against patterns for this analysis, my tool of choice was Perl, a scripting language that is well suited to text manipulation.
- Did we get the scoping rules wrong? No. Only one URL in 4.7 million came from a side-nav forum (and that is probably because I left the trailing ";" in the regex rules).
- Did we miss some forums with our scoping rules? Yes, but that only accounts for about 11k of the crawled or queued URLs.
- Did we collect 4 million images? No. Only about 200 of the URLs link to forum post attachments, which resolve to images, and no .jpg, .png, or .gif files are in the list.
So what is really going on? Remember that session ID (“guest=53819654”)? When Archive-It crawls the web, it uses numerous threads, each of which gets assigned a different ID by the forum. These IDs do not change the content of the web page, but they do change the URLs. This means that URLs are not unique, and the same document might get queued multiple times under different URLs. How big a problem is this? Well, of the documents crawled and queued for theonering, there are 203,526 unique documents out of 4,813,987 URLs. That’s just 4%. (If you look at just the crawled documents it’s a little better: 32,512 unique out of 71,839, or 45%.)
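The uniqueness numbers above come from normalizing the session IDs away and counting what is left. A minimal version of that check, assuming you have downloaded one of Archive-It’s URL lists as a plain text file (the file name here is just an example), looks something like this:

```perl
#!/usr/bin/perl
# Count how many distinct documents hide behind session-ID-laden URLs.
# Usage: perl count-unique.pl theonering-crawled-and-queued.txt
use strict;
use warnings;

my %seen;
my $total = 0;

while (my $url = <>) {
    chomp $url;
    next unless length $url;
    $total++;
    (my $canonical = $url) =~ s/guest=\d+/guest=0/;   # collapse every session ID to one value
    $seen{$canonical}++;
}

my $unique = scalar keys %seen;
printf "%d unique documents out of %d URLs (%.1f%%)\n",
    $unique, $total, $total ? 100 * $unique / $total : 0;
```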
Now that we’ve discovered the problem, can we fix it with more scoping rules? Maybe… but probably not. Here is the list of URLs crawled or queued, sorted by number of repeats (showing only the argument portion of each URL, with the ever-changing session IDs replaced by “nnnnn”):
- do=login;guest=nnnnn 22579
- guest=nnnnn 22578
- do=search;guest=nnnnn 22578
- do=whos_online;guest=nnnnn 22578
- guest=nnnnn;category=3 19148
- do=search;search_forum=forum_17;guest=nnnnn 18675
- forum=17;guest=nnnnn 14952
- username=DanielLB;guest=nnnnn 11147
- do=message;username=DanielLB;guest=nnnnn 9958
- username=Shelob%27sAppetite;guest=nnnnn 8898
- username=Captain%20Salt;guest=nnnnn 8329
- do=message;username=Shelob%27sAppetite;guest=nnnnn 7764
- username=Carne;guest=nnnnn 7174
- username=dormouse;guest=nnnnn 7151
- do=message;username=Captain%20Salt;guest=nnnnn 6923
- username=Ataahua;guest=nnnnn 6846
- username=Silverlode;guest=nnnnn 6500
- username=Estel78;guest=nnnnn 6458
- username=dave_lf;guest=nnnnn 6267
- do=message;username=Carne;guest=nnnnn 6234
- do=message;username=dormouse;guest=nnnnn 6206
- username=Bombadil;guest=nnnnn 6073
- username=Ardam%EDr%EB;guest=nnnnn 5855
- username=Faenoriel;guest=nnnnn 5811
- username=Voronw%EB_the_Faithful;guest=nnnnn 5756
- username=Kangi%20Ska;guest=nnnnn 5737
…
We see that the most repeated URLs correspond to the navigation links at the top of the page, which are found on every page and which weren’t blocked by our scoping rules. These could generally be excluded with additional rules, except for the URLs that point to the forum we want to crawl, of course. Additionally, we can block all user pages. Unfortunately this list has a looooong tail, and includes many URLs that link directly to posts that we want to crawl. The longer we crawl the more links we visit and the more (potentially) different session IDs we generate, which means that the crawl might never finish. Adding more scoping rules may limit the redundant URL queuing enough to get all the URLs we want to capture, but this does not solve the problem in general and there is an obvious trade-off with data usage. More scoping rules may also lead to slower crawling, since the underlying software must test each URL against more patterns in order to make scoping determinations.
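To give a sense of what those additional rules might look like, regular-expression block rules roughly like the following (illustrative patterns, not rules we actually deployed) would knock out the login, search, “who’s online” and user-profile URLs at the top of the list:

```
.*/gforum\.cgi\?do=(login|search|whos_online).*
.*/gforum\.cgi\?.*username=.*
.*/gforum\.cgi\?guest=\d+;category=.*
```

Even with rules like these, though, the post URLs we actually want to keep would still multiply with every newly issued guest ID.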
Unfortunately, we ran out of time for our project before coming up with a solution to this problem. So what might we have done? One option would have been to work with the web archiving service developers to improve their system. For example, more advanced regular expression rules that allow for both pattern matching and substitution might have let us replace the IDs generated by the website with a fixed value. The addition of a user option that limits the number of threads might have let us crawl without generating multiple guest IDs. For certain types of sites it just might be more effective to work directly with website owners to develop local mirrors or to archive content in other formats, rather than relying exclusively on general-purpose web-archiving services.
As a student, creating a real web-archiving collection has been a great opportunity, both to get hands-on experience with tools that librarians and archivists are using on the job, and to witness first-hand the challenges of digital preservation. My unexpected journey through the world of web forums has taught me that the problem of web archiving is by no means solved. Librarians and archivists have an opportunity to lead the way, creating the technology and policy needed so that social media and other at-risk web content can be preserved.
Comments
Great post! Crawling forums is always a difficult task, with halls of mirrors and other traps abounding.
When dealing with content like this, it may be best to use more powerful/granular tools than Archive-It. I don’t know if Archive-It supports crawling a range of single pages, but it may be useful to break the site up into post and forum pages. Since both are linearly indexed, they can be crawled with simple shell scripting; you can exhaust the numberspaces:
posts: http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?post=X for X in {1..max}
forums: http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?do=forum_view_collapsed;forum=Y;page=Z for Y in {1..forums} and Z in {1..pages}
Stitch it back together afterwards (sed is good for this) and you should have a complete, if static, content dump. Getting a single WARC out of all this can be done by feeding your crawler a textfile listing the exhausted numberspaces, and running it as a single command.
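For what it’s worth, generating such a textfile takes only a few lines; here is a sketch in Perl, with a made-up upper bound standing in for the forum’s real post count:

```perl
#!/usr/bin/perl
# Enumerate the post numberspace into a seed list a crawler can consume.
# MAX_POST is a placeholder; substitute the forum's reported post count.
use strict;
use warnings;

use constant MAX_POST => 600_000;
my $base = 'http://newboards.theonering.net/forum/gforum/perl/gforum.cgi';

open my $out, '>', 'post-urls.txt' or die "Cannot write post-urls.txt: $!";
print {$out} "$base?post=$_\n" for 1 .. MAX_POST;
close $out;
```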
The nicest thing about a granular approach is that you can easily tell when a crawl is out of control, because you know exactly what it’s supposed to be doing. Instead of starting from the top and scoping down, you start from the bottom and scope up, which gives you greater control over the crawl. Unfortunately, this approach isn’t very scalable – while the general solution will be applicable to any NewBoard-driven site, you would have to craft specifics for each forum. This is why web archiving is so critical for libraries and archives: robots can do a lot, but they can’t contact site admins or craft site-specific solutions.
When crawling with Heritrix, stripping out session IDs is usually handled by the URL Canonicalization Rules (https://webarchive.jira.com/wiki/display/Heritrix/URI+Canonicalization+Rules), rather than the scoping rules. Unfortunately, if you change the canonicalisation rules, you also have to change your playback system (e.g. Wayback) to use the same rules so that the original links can be resolved.
Which is all rather disappointingly cumbersome.
A prospective approach that requires no other tools than Archive-It itself would be to append the session ID string to the seed URL. So, instead of setting the seed URL as:
http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=17
set the seed URL as:
http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=17;do=forum_view_collapsed;guest=56938819
This way, each of the crawler threads ends up traversing the same link structure and you don’t end up re-capturing the same content as many times as unique session IDs are generated.