Co-Hosting a Datathon at the Library of Congress


Archives Unleashed teams at wrap-up, day one. Photo by Jaime Mears.

On June 14 and 15, the Library of Congress hosted Archives Unleashed 2.0, a web archive “datathon” (otherwise known as a “hackathon,” though apparently any term containing the word “hack” can sound a bit menacing) in which teams of researchers used a variety of analytical tools to query web-archive data sets in the hopes of discovering intriguing insights before their 48-hour deadline was up. This was the second instance of the event (the University of Toronto hosted the first in March 2016), which organizers plan to make a regular occurrence.

Why host a datathon?

For organizers Matthew Weber, Ian Milligan and Jimmy Lin, seasoned data scholars and educators, Archives Unleashed is an exercise in balancing discussion and practice — or what Milligan calls yacking and hacking — to help improve web archive research. The text on the Archives Unleashed website states, “This event presents an opportunity to collaboratively unleash our web collections, exploring cutting-edge research tools while fostering a broad-based consensus on future directions in web archive analysis.”


Team Museum’s URL text analysis of mimetypes found on museum websites. Photo by Jaime Mears.

But what is the value for the host institution – the Library of Congress or any other? There are actually many unique benefits; here are a few:

  • New patterns of information emerged from the web archives.
  • We networked with data scholars and, perhaps even more important, learned what we can do to support sophisticated technological research. (It helps the discovery process if staff members participate alongside researchers.)
  • We collaborated across divisions within the Library of Congress and discovered areas of common interest. National Digital Initiatives was the main point of contact for event hosting; Library Services (which makes Library of Congress collections available) provided the data sets; and the John W. Kluge Center and the Law Library provided content expertise and additional support.
  • We got our colleagues excited about the potential use of our collections and this emerging research service.

Even if none of these points applies to your institution, think of the exposure as a way to begin familiarizing your organization with the future of historical research. As Milligan said in his pre-workshop presentation, you “can’t do a faithful historical study post 1996 without web archives.”


Team Turtle. Photo by Jaime Mears.

What do you need to host a datathon?

  • Technical experts. Weber, Lin, and Milligan provided support to researchers throughout the process, from feedback on initial proposals to help with tools and unruly data sets. If you don’t have anyone on staff to fill this role, look for outside technical experts to partner with you.
  • Data sets. Not to be underestimated, data sets are the heart of the datathon. You don’t need to know everything about the data sets you serve (that’s what the researchers will provide), but the data sets need to be fairly small so they can be moved around easily (ours were no more than 10 GB each). It’s important to prepare effective messaging about the data sets to entice attendees to use them. If there are use restrictions associated with the data sets, you may need to prepare release statements for the researchers to sign.
  • Content experts. This is where your library can shine. Without an understanding of a data set’s context, skewed results can be difficult to untangle, and someone who knows the subject matter can save researchers a lot of time by helping them interpret visualization results. For example, the law librarians looked at a word cloud generated from a set of Supreme Court nomination websites, explained some of the dominant words the researchers were seeing, and suggested particular buzzwords the researchers could use in their text-analysis queries. (A minimal sketch of that kind of term-frequency analysis follows this list.)

Team Turtle’s ARCs to WARCs workflow. Photo by Jaime Mears.

  • Infrastructure. Researchers brought their own laptops, but reliable (and even enhanced) broadband was crucial: the bandwidth needed to move or query data sets as large as a terabyte, especially when time is short, is formidable. Luckily, preventive steps can mitigate this stress on the network. The Unleashed organizers set size restrictions on our data sets (no more than 10 GB each) and pre-loaded applications and all data sets (including those from the Internet Archive and the University of Waterloo) onto virtual machines to minimize transfer times and surprises. If that isn’t an option, ask the researchers to download local copies of the data sets they wish to use in advance of the event so time isn’t wasted moving them around. “Infrastructure” also includes tables and chairs, whiteboards and presentation support.
  • Researchers. The datathon participants came from as far away as Jakarta and from mixed backgrounds and interests, although the majority were involved in academia, with specializations in media studies, history, computer science and political science. Some work for libraries and archives. Although their technical abilities varied, most had experience with data-research methodologies and were familiar with the tools. To attract this group, organizers used a simple application process, provided some funding for travel and meals, and coordinated the workshop in conjunction with the Saving the Web symposium hosted by Dame Wendy Hall.
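
For readers curious what the word-cloud and buzzword work described above looks like in practice, here is a minimal sketch of the kind of term-frequency counting behind it. It is not the tooling the teams actually used; the file name and stop-word list are placeholders, and it assumes the archived pages have already been reduced to plain text.

```python
# Minimal term-frequency sketch (hypothetical file name and stop words).
# Assumes plain text has already been extracted from the web archive,
# e.g., one archived page per line in "nomination_pages.txt".
import re
from collections import Counter

STOP_WORDS = {"the", "and", "for", "that", "with", "this", "from", "are", "was", "has"}

def top_terms(path, n=25):
    """Return the n most frequent terms longer than three characters."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            for word in re.findall(r"[a-z']+", line.lower()):
                if len(word) > 3 and word not in STOP_WORDS:
                    counts[word] += 1
    return counts.most_common(n)

if __name__ == "__main__":
    for term, count in top_terms("nomination_pages.txt"):
        print(f"{count:8d}  {term}")
```

Output like this is exactly where a content expert earns their keep: they can tell researchers which of the dominant terms are meaningful buzzwords worth querying further and which are just boilerplate from the sites themselves.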

This list is scalable and can be tweaked to fit diverse budgets and spaces. Collaboration is essential: even if you have technical experts on staff, and even if you have ample funding, partnering with other library units and with external experts diversifies who attends and what data sets are available, and in general raises the potential for creativity and revelation.
