Co-Hosting a Datathon at the Library of Congress

Photo of about 20 people sitting at computers in a meeting room.

Archives Unleashed teams at wrap-up, day one. Photo by Jaime Mears.

On June 14 and 15, the Library of Congress hosted Archives Unleashed 2.0, a web archive “datathon” (otherwise known as a “hackathon,” but apparently any term with the word “hack” in it might sound a bit menacing) in which teams of researchers used a variety of analytical tools to query web-archive data sets in the hopes of discovering some intriguing insights before their 48-hour deadline was up. This was the second instance of this event (the University of Toronto hosted the first in March 2016) in what organizers plan to be a regular occurrence.

Why host a datathon?

For organizers Matthew Weber, Ian Milligan and Jimmy Lin, seasoned data scholars and educators, Archives Unleashed is an exercise in balancing discussion and practice (or what Milligan calls yacking and hacking) to help improve web archive research. The text on the Archives Unleashed website states, “This event presents an opportunity to collaboratively unleash our web collections, exploring cutting-edge research tools while fostering a broad-based consensus on future directions in web archive analysis.”

Photo of writing on a white board about file types.

Team Museum’s URL text analysis of mimetypes found on museum websites. Photo by Jaime Mears.

But what is the value for the host institution – the Library of Congress or any other? There are actually many unique benefits; here are a few:

  • New patterns of information emerged from the web archives.
  • We networked with data scholars and, perhaps even more important, learned what we can do to support sophisticated technological research. (It helps the discovery process if staff members participate alongside researchers.)
  • We collaborated across divisions within the Library of Congress and discovered areas of shared interest. National Digital Initiatives was the main point of contact for event hosting, Library Services (which makes Library of Congress collections available) provided the data sets, and the John W. Kluge Center and the Law Library provided content expertise and additional support.
  • We got our colleagues excited about the potential use of our collections and this emerging research service.

Even if none of these points are relevant for your institution, think of this exposure as a way to begin familiarizing your institution with the future of historical research. As Milligan said in his pre-workshop presentation, you “can’t do a faithful historical study post 1996 without web archives.”

Photo of computer engineers at work.

Team Turtle. Photo by Jaime Mears.

What do you need to host a datathon?

  • Technical experts. Weber, Lin, and Milligan provided support to researchers throughout the process, from feedback on initial proposals to technical support with tools and unruly data sets. If you don’t have anyone on staff to fill this role, look outside your staff for technical experts to partner with you.
  • Data sets. Not to be underestimated, data sets are the heart of the datathon. You don’t need to know everything about the data sets you serve (that’s what the researchers will provide), but the data sets need to be fairly small so they can be moved around easily (ours were no more than 10 GB each). It’s important to prepare effective messaging about the data sets in order to entice attendees to use them. If there are use restrictions associated with the data sets, you may need to prepare release statements for the researchers to sign.
  • Content experts. This is where your library can shine. Not understanding the context of a data set can make skewed results difficult to untangle. Someone who understands the subject matter can save researchers a lot of time by helping them analyze visualization results. For example, the law librarians were able to look at a word cloud from a set of Supreme Court nomination websites and explain some of the dominant words that the researchers were seeing, and they were able to suggest particular buzzwords that researchers could use in their text analysis queries.

Whiteboard with writing.

Team Turtle’s ARCs to WARCs workflow. Photo by Jaime Mears.

  • Infrastructure. Researchers brought their own laptops, but reliable and even enhanced broadband was crucial. The bandwidth needed to move or query data sets as large as a terabyte in size, especially when time is an issue, is formidable. Luckily, preventative actions can be taken to mitigate this stress on the network. The Unleashed organizers set size restrictions on our data sets (no more than 10 GB each), and pre-loaded applications and all data sets (including those from the Internet Archive and University of Waterloo) onto virtual machines to minimize transfer times and surprises. If that isn’t an option, ask the researchers to download local copies of the data sets they wish to use in advance of the event so time isn’t wasted moving them around. “Infrastructure” also includes tables and chairs, whiteboards and presentation support.
  • Researchers. The datathon participants came from as far away as Jakarta and from mixed backgrounds and interests, although the majority were involved in academia, with specializations in media studies, history, computer science and political science. Some worked for libraries and archives. Although they had varying technical abilities, most of the participants had experience with data-research methodologies and were familiar with the tools. To attract this group of people, organizers used a simple application process for participants, were able to provide some funding for travel and meals, and coordinated the workshop in conjunction with the Saving the Web symposium hosted by Dame Wendy Hall.
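
To put the bandwidth point from the Infrastructure item into rough numbers, here is a minimal Python sketch that estimates how long a data set takes to move over a shared connection. The function name, the 100 Mbps link speed and the 80 percent efficiency factor are illustrative assumptions, not figures from the event; only the 10 GB cap and the terabyte comparison come from the text above.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move size_gb gigabytes over a link_mbps link.

    efficiency is an assumed fraction of the nominal link speed that is
    actually achieved once protocol overhead and sharing are accounted for.
    """
    if size_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link speed > 0, efficiency in (0, 1]")
    size_megabits = size_gb * 1000 * 8  # decimal units: 1 GB = 8,000 megabits
    seconds = size_megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    ASSUMED_LINK_MBPS = 100  # hypothetical venue connection, not a measured value
    for label, size_gb in [("10 GB data set", 10), ("1 TB data set", 1000)]:
        hours = transfer_hours(size_gb, ASSUMED_LINK_MBPS)
        print(f"{label}: about {hours:.1f} hours")
```

Under these assumed figures, a 10 GB data set moves in roughly 17 minutes, while a terabyte takes roughly 28 hours, more than half of a 48-hour event. Swap in your own link speed to see whether a size cap or pre-loading makes sense for your venue.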

This list is scalable and can be tweaked to fit diverse budgets and spaces. Collaboration is essential. Even if you have staff members who are technical experts, even if you have all the money, partnering with other library units and external experts diversifies who might attend and the data sets they bring, and in general raises the potential for creativity and revelation.
