Co-Hosting a Datathon at the Library of Congress

Photo of about 20 people sitting at computers in a meeting room.

Archives Unleashed teams at wrap-up, day one. Photo by Jaime Mears.

On June 14 and 15, the Library of Congress hosted Archives Unleashed 2.0, a web archive “datathon” (otherwise known as a “hackathon,” but apparently any term with the word “hack” in it might sound a bit menacing) in which teams of researchers used a variety of analytical tools to query web-archive data sets in the hopes of discovering some intriguing insights before their 48-hour deadline was up. This was the second instance of this event (the University of Toronto hosted the first in March 2016) in what organizers plan to be a regular occurrence.

Why host a datathon?

For organizers Matthew Weber, Ian Milligan and Jimmy Lin, seasoned data scholars and educators, Archives Unleashed is an exercise in balancing discussion and practice (or what Milligan calls yacking and hacking) to help improve web archive research. The text on the Archives Unleashed website states, “This event presents an opportunity to collaboratively unleash our web collections, exploring cutting-edge research tools while fostering a broad-based consensus on future directions in web archive analysis.”

Photo of writing on a white board about file types.

Team Museum’s URL text analysis of mimetypes found on museum websites. Photo by Jaime Mears.

But what is the value for the host institution – the Library of Congress or any other? There are actually many unique benefits; here are a few:

  • New patterns of information emerged from the web archives.
  • We networked with data scholars and, perhaps even more important, learned what we can do to support sophisticated technological research. (It helps the discovery process if staff members participate alongside researchers.)
  • We collaborated across divisions within the Library of Congress and discovered areas of shared interest. National Digital Initiatives was the main point of contact for event hosting, Library Services (which makes Library of Congress collections available) provided the data sets, and the John L. Kluge Center and the Law Library provided content expertise and additional support.
  • We got our colleagues excited about the potential use of our collections and this emerging research service.

Even if none of these points are relevant for your institution, think of this exposure as a way to begin familiarizing your institution with the future of historical research. As Milligan said in his pre-workshop presentation, you “can’t do a faithful historical study post 1996 without web archives.”

Photo of computer engineers at work.

Team Turtle. Photo by Jaime Mears.

What do you need to host a datathon?

  • Technical experts. Weber, Lin, and Milligan provided support to researchers throughout the process, from feedback on initial proposals to technical support with tools and unruly data sets. If you don’t have anyone on staff to fill this role, look outside your staff for technical experts to partner with you.
  • Data sets. Not to be underestimated, data sets are the heart of the datathon. You don’t need to know everything about the data sets you serve (that’s what the researchers will provide), but the data sets need to be fairly small so they can be moved around easily (ours were no more than 10 GB each). It’s important to prepare effective messaging about the data sets in order to entice attendees to use them. If there are use restrictions associated with the data sets, you may need to prepare release statements for the researchers to sign.
  • Content experts. This is where your library can shine. Not understanding the context of a data set can make skewed results difficult to untangle. Someone who understands the subject matter can save researchers a lot of time by helping them analyze visualization results. For example, the law librarians were able to look at a word cloud from a set of Supreme Court nomination websites and explain some of the dominant words that the researchers were seeing, and they were able to suggest particular buzzwords that researchers could use in their text analysis queries.

Whiteboard with writing.

Team Turtle’s ARCs to WARCs workflow. Photo by Jaime Mears.

  • Infrastructure. Researchers brought their own laptops, but reliable and even enhanced broadband was crucial. The bandwidth needed to move or query data sets as large as a terabyte, especially when time is short, is formidable. Luckily, preventive steps can be taken to mitigate this stress on the network. The Archives Unleashed organizers set size restrictions on our data sets (no more than 10 GB each) and pre-loaded applications and all data sets (including those from the Internet Archive and University of Waterloo) onto virtual machines to minimize transfer times and surprises. If that isn’t an option, ask the researchers to download local copies of the data sets they wish to use in advance of the event so time isn’t wasted moving them around. “Infrastructure” also includes tables and chairs, whiteboards and presentation support.
  • Researchers. The datathon participants came from as far away as Jakarta and from mixed backgrounds and interests, although the majority were involved in academia, with specializations in media studies, history, computer science and political science. Some worked for libraries and archives. Although they had varying technical abilities, most of the participants had experience with data-research methodologies and were familiar with the tools. To attract this group of people, organizers used a simple application process, provided some funding for travel and meals, and coordinated the workshop in conjunction with the Saving the Web symposium hosted by Dame Wendy Hall.
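
To put the bandwidth point from the Infrastructure item into rough numbers, here is a minimal Python sketch that estimates how long a data set of a given size would take to move over a shared connection. The 100 Mbps link speed and the 80 percent efficiency factor are assumptions chosen purely for illustration; they are not measurements from the event. Only the 10 GB cap and the terabyte figure come from the text above, and the function names are hypothetical.

```python
# Back-of-the-envelope transfer-time estimate for datathon data sets.
# ASSUMPTIONS (not from the event): a 100 Mbps link and 80% usable throughput.

SIZE_CAP_GB = 10.0  # per-data-set cap mentioned in the post


def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move size_gb (decimal gigabytes)
    over a link rated at link_mbps megabits per second."""
    if size_gb < 0:
        raise ValueError("size_gb must not be negative")
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    size_megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    seconds = size_megabits / (link_mbps * efficiency)
    return seconds / 3600


def within_cap(size_gb: float, cap_gb: float = SIZE_CAP_GB) -> bool:
    """Return True if a data set fits under the size cap."""
    return size_gb <= cap_gb


if __name__ == "__main__":
    assumed_link_mbps = 100.0  # hypothetical shared connection
    for label, size_gb in [("capped data set", 10.0), ("terabyte data set", 1000.0)]:
        hours = transfer_hours(size_gb, assumed_link_mbps)
        print(f"{label}: {size_gb:.0f} GB, about {hours:.1f} hours, "
              f"within cap: {within_cap(size_gb)}")
```

Under these assumed numbers, a 10 GB data set moves in roughly a quarter of an hour, while a full terabyte would take more than a day, which is most of a 48-hour event. Plug in your own link speed to see whether a size cap or pre-loading makes sense for your space.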

This list is scalable and can be tweaked to fit diverse budgets and spaces. Collaboration is essential. Even if you have staff members who are technical experts, even if you have all the money, partnering with other library units and external experts diversifies who might attend and the data sets they bring, and in general raises the potential for creativity and revelation.

Keeping Our Tools Sharp: Approaching the Annual Review of the Library of Congress Recommended Formats Statement

The following post is by Ted Westervelt, head of acquisitions and cataloging for U.S. Serials in the Arts, Humanities & Sciences section at the Library of Congress. Since first launching its Recommended Formats Statement (then called Recommended Format Specifications) in 2014, the Library of Congress has committed to treating it as an important part of […]

ODF: The Open Document Format

The following is a guest post by Carl Fleischhauer, a Digital Initiatives Project Manager at the Library of Congress. During December 2015, the Library’s Format Sustainability website added descriptions of eleven members of the Open Document Format family, aka OpenDocument and ODF. These eleven join a number of other format descriptions mounted in 2015, many […]

APIs: How Machines Share and Expose Digital Collections

Kim Milai, a retired school teacher, was searching on ancestry.com for information about her great grandfather, Amohamed Milai, when her browser turned up something she had not expected: a page from the Library of Congress’s Chronicling America site displaying a scan of the Harrisburg Telegraph newspaper from March 13, 1919. On that page was a story […]

Acquiring at Digital Scale: Harvesting the StoryCorps.me Collection

This post was originally published on the Folklife Today blog, which features folklife topics, highlighting the collections of the Library of Congress, especially the American Folklife Center and the Veterans History Project.  In this post, Nicole Saylor, head of the American Folklife Center Archive, talks about the StoryCorps.me mobile app and interviews Kate Zwaard and […]

The Veterans History Project Marks 15 Years of Service

“The willingness with which our young people are likely to serve in any war, no matter how justified, shall be directly proportional to how they perceive the Veterans of earlier wars were treated and appreciated by their nation.” — George Washington The Veterans History Project honors the lives and service of all American veterans –not […]

Extra Extra! Chronicling America Posts its 10 Millionth Historic Newspaper Page

Talk about newsworthy! Chronicling America, an online searchable database of historic U.S. newspapers, has posted its 10 millionth page today. Way back in 2013, Chronicling America boasted 6 million pages available for access online. The site makes digitized newspapers (of those published between 1836 and 1922) available through the National Digital Newspaper Program. It also […]

Cooking Up a Solution to Link Rot

This post is cross posted on the blog of the Law Library of Congress, In Custodia Legis, which is an excellent source of information on current legal trends and materials from the Library’s collections pertaining to the law. It is a guest post by the Law Library’s managing editor, Charlotte Stichter. When Charlotte is not […]

Keeping Up With the Joneses: The New Recommended Formats Statement

The following post is by Ted Westervelt, head of acquisitions and cataloging for U.S. Serials in the Arts, Humanities & Sciences section at the Library of Congress. Issuing the Recommended Format Specifications When the Recommended Format Specifications were issued last summer, the Library of Congress was making an attempt to come to grips with the […]