Co-Hosting a Datathon at the Library of Congress

Photo of about 20 people sitting at computers in a meeting room.

Archives Unleashed teams at wrap-up, day one. Photo by Jaime Mears.

On June 14 and 15, the Library of Congress hosted Archives Unleashed 2.0, a web archive “datathon” (otherwise known as a “hackathon,” but apparently any term with the word “hack” in it might sound a bit menacing) in which teams of researchers used a variety of analytical tools to query web-archive data sets in the hopes of discovering some intriguing insights before their 48-hour deadline was up. This was the second instance of this event (the University of Toronto hosted the first in March 2016) in what organizers plan to be a regular occurrence.

Why host a datathon?

For organizers Matthew Weber, Ian Milligan and Jimmy Lin, seasoned data scholars and educators, Archives Unleashed is an exercise in balancing discussion and practice (or what Milligan calls yacking and hacking) to help improve web archive research. The text on the Archives Unleashed website states, “This event presents an opportunity to collaboratively unleash our web collections, exploring cutting-edge research tools while fostering a broad-based consensus on future directions in web archive analysis.”

Photo of writing on a white board about file types.

Team Museum’s URL text analysis of mimetypes found on museum websites. Photo by Jaime Mears.

But what is the value for the host institution, whether the Library of Congress or any other? There are many benefits; here are a few:

  • New patterns of information emerged from the web archives.
  • We networked with data scholars and, perhaps even more important, learned what we can do to support sophisticated technological research. (It helps the discovery process if staff members participate alongside researchers.)
  • We collaborated across divisions within the Library of Congress and discovered areas of shared interest. National Digital Initiatives was the main point of contact for event hosting, Library Services (which makes Library of Congress collections available) provided the data sets, and the John L. Kluge Center and the Law Library provided content expertise and additional support.
  • We got our colleagues excited about the potential use of our collections and this emerging research service.

Even if none of these points are relevant for your institution, think of this exposure as a way to begin familiarizing your institution with the future of historical research. As Milligan said in his pre-workshop presentation, you “can’t do a faithful historical study post 1996 without web archives.”

Photo of computer engineers at work.

Team Turtle. Photo by Jaime Mears.

What do you need to host a datathon?

  • Technical experts. Weber, Lin, and Milligan provided support to researchers throughout the process, from feedback on initial proposals to technical support with tools and unruly data sets. If you don’t have anyone on staff to fill this role, look outside your staff for technical experts to partner with you.
  • Data sets. Not to be underestimated, data sets are the heart of the datathon. You don’t need to know everything about the data sets you serve (that’s what the researchers will provide), but the data sets need to be fairly small so they can be moved around easily (ours were no more than 10 GB each). It’s important to prepare effective messaging about the data sets in order to entice attendees to use them. If there are use restrictions associated with the data sets, you may need to prepare release statements for the researchers to sign.
  • Content experts. This is where your library can shine. Not understanding the context of a data set can make skewed results difficult to untangle. Someone who understands the subject matter can save researchers a lot of time by helping them analyze visualization results. For example, the law librarians were able to look at a word cloud from a set of Supreme Court nomination websites and explain some of the dominant words that the researchers were seeing, and they were able to suggest particular buzzwords that researchers could use in their text analysis queries.

Whiteboard with writing.

Team Turtle’s ARCs to WARCs workflow. Photo by Jaime Mears.

  • Infrastructure. Researchers brought their own laptops, but reliable and even enhanced broadband was crucial. The bandwidth needed to move or query data sets as large as a terabyte in size, especially when time is an issue, is formidable. Luckily, preventative actions can be taken to mitigate this stress on the network. The Unleashed organizers set size restrictions on our data sets (no more than 10 GB each), and pre-loaded applications and all data sets (including those from the Internet Archive and University of Waterloo) onto virtual machines to minimize transfer times and surprises. If that isn’t an option, ask the researchers to download local copies of the data sets they wish to use in advance of the event so time isn’t wasted moving them around. “Infrastructure” also includes tables and chairs, whiteboards and presentation support.
  • Researchers. The datathon participants came from as far away as Jakarta and from mixed backgrounds and interests, although the majority were involved in academia with specializations in media studies, history, computer science and political science. Some worked for libraries and archives. Although they had varying technical abilities, most of the participants had experience with data-research methodologies and were familiar with the tools. To attract this group of people, organizers used a simple application process for participants, were able to provide some funding for travel and meals, and coordinated the workshop in conjunction with the Saving the Web symposium hosted by Dame Wendy Hall.
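
For hosts who want to enforce a size cap like the 10 GB limit mentioned in the list above, the check can be scripted before the event. The following is only a minimal sketch, not the organizers' actual process: it assumes each data set lives in its own file or folder under a single staging directory, and the function names (dir_size_bytes, oversized_datasets) and the decimal definition of a gigabyte are illustrative choices, not anything the post specifies.

```python
from pathlib import Path

# Assumption: 1 GB = 10**9 bytes. The post does not say which unit was used.
GB = 10**9


def dir_size_bytes(path: Path) -> int:
    """Return the total size in bytes of a file, or of all files under a folder."""
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def oversized_datasets(root: Path, limit_bytes: int = 10 * GB) -> dict[str, int]:
    """Map each top-level entry in root that exceeds the cap to its size in bytes."""
    result = {}
    for entry in sorted(root.iterdir()):
        size = dir_size_bytes(entry)
        if size > limit_bytes:
            result[entry.name] = size
    return result


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the staging directory as the first argument.
    staging = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    for name, size in oversized_datasets(staging).items():
        print(f"{name}: {size / GB:.1f} GB exceeds the cap")
```

Running the script against the staging directory lists only the data sets that would need to be trimmed or split before being copied to virtual machines or shared with researchers.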

This list is scalable and can be tweaked to fit diverse budgets and spaces. Collaboration is essential. Even if you have staff members who are technical experts, even if you have all the money, partnering with other library units and external experts diversifies who might attend and the data sets they bring, and in general raises the potential for creativity and revelation.
