Acquiring at digital scale: Harvesting the collection

Thanks to The Great Thanksgiving Listen, the StoryCorps collection of interviews has doubled! Since the launch of the mobile app in March, more than 68,000 interviews have been uploaded as of today, the vast majority of them in the few days following Thanksgiving.

The American Folklife Center at the Library of Congress is the archival home for StoryCorps, one of the largest oral history projects in existence, and the interviews are regularly featured on NPR’s Morning Edition. Since 2003, StoryCorps has collected more than 50,000 interviews via listening booths throughout the country.

However, with the advent of the mobile app, StoryCorps has created a global platform where anyone in the world can record and upload an oral history interview. This effort is a wish come true for StoryCorps founder Dave Isay, the 2015 recipient of the TED Prize. The prize comes with $1 million to invest in a powerful idea. Dave’s idea was to create an app with a companion website.

The surge in mobile interviews is the result of StoryCorps’ efforts to partner with major national education organizations and a half dozen of the nation’s biggest school districts. More than 20,000 teacher toolkits were downloaded, and the initiative was featured on Google’s homepage and in both the Apple and Google app stores, not to mention the remarkable media coverage it received. (There is already talk of The Great Thanksgiving Listen 2016.)

At the Library, we are able to meet the challenge of acquiring tens of thousands of interviews at a time thanks to the ability to harvest them via the web. The process involves using StoryCorps’ application programming interface (API) to download the data. For the last several months, Kate Zwaard and David Brunton, who manage the Library of Congress’s Digital Library software development, have been working with Dean Haddock, StoryCorps’ lead developer, to perfect this means of transfer. This interview with Zwaard and Brunton explains that process, provides advice for those who want to do similar projects, and ponders the future of scaling archival acquisitions.

Q: Can you explain, in layman’s terms, the technology that makes this automated acquisition possible?

David: To get collections out of StoryCorps, we use a Python program called Fetcher. Part of what made this project so fun was that we got to apply lessons learned from past experiences. We worked with the fine folks at StoryCorps and you guys in Folklife to define the StoryCorps API, and we worked iteratively with StoryCorps developers on customizing it.

The Python program connects to the StoryCorps API and downloads the content into a bagged format. The bags are then handed off to the Library’s ingest service, which we use to move and inventory digital collections in the Library. Curatorial staff kick off automated processes that verify collections, make copies onto a durable storage system, then verify the copies were made and release the space on the virtual loading docks. The ingest of collections happens once a month. Right now, we kick off the retrieval manually and the rest is automated. After we have been collecting successfully for a while, we will turn off the manual parts of the process: the retrieval will be automated, and the system will notify us if there is a problem.
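As a rough illustration of the bagging and verification steps David describes, the sketch below packages already-downloaded files into a bag and later re-checks them. It assumes that "bagged format" refers to the BagIt packaging convention (a `data/` payload directory plus a checksum manifest). The function names `write_bag` and `verify_bag`, the sample filenames, and the choice of SHA-256 are illustrative assumptions, not details of the Library's Fetcher program or of the StoryCorps API.

```python
import hashlib
from pathlib import Path


def write_bag(bag_dir, items):
    """Package downloaded items into a BagIt-style directory.

    `items` maps a filename to the bytes already retrieved for it.
    (Hypothetical helper; the real Fetcher is not shown in the post.)
    """
    bag = Path(bag_dir)
    payload = bag / "data"
    payload.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(items.items()):
        (payload / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # One manifest line per payload file: "<checksum>  <relative path>"
        manifest_lines.append(f"{digest}  data/{name}")

    # Bag declaration file, as described in the BagIt specification
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


def verify_bag(bag_dir):
    """Return the payload paths whose contents no longer match the manifest."""
    bag = Path(bag_dir)
    failures = []
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, relative = line.split(maxsplit=1)
        actual = hashlib.sha256((bag / relative).read_bytes()).hexdigest()
        if actual != expected:
            failures.append(relative)
    return failures
```

An empty list from `verify_bag` means every file in the bag still matches its recorded checksum, which is the kind of check an ingest service could run before and after copying a bag to durable storage.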

Q: You worked with StoryCorps’ developer to customize the way the collection is exposed on the web. What special requirements had to be met to accommodate the Library’s needs and why?

Kate: At the Library of Congress, we have to make sure that items we are collecting are exactly what the producer intended them to be. That’s why we ask partners to provide checksums: digital fingerprints, each a sequence of characters that’s unique to a digital file. This way, we can make sure no mistakes have been made in the transmission, and it allows us to establish a chain of custody.
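To make the checksum idea concrete, here is a minimal sketch of how a receiving institution might compare a downloaded file against the checksum its producer supplied. The function names and the default choice of SHA-256 are assumptions for illustration; the post does not say which algorithm StoryCorps or the Library uses.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a file's checksum, reading in chunks so large audio files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_producer_checksum(path, expected, algorithm="sha256"):
    """True if the received file matches the checksum the producer published."""
    return file_checksum(path, algorithm) == expected.strip().lower()
```

If a single byte changes in transit, the computed value will differ from the producer's, so a mismatch signals that the file should be retrieved again before it is accessioned.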

David: The best moment for me during this process was when we asked StoryCorps for checksums, and they told us they could do so for all but one item type hosted by a third-party vendor. StoryCorps told us they told the vendor that the Library of Congress asked, and the third-party vendor added checksums. It was really satisfying.

Q: What take-aways do you have for donors and archivists who want to collaborate in this way?

Kate: We’ve done digital collecting in a number of ways. Fetcher is a good model.

David: Yes, don’t let people push content to us; let us pull it. Another take-away is that in order to accession something, it has to be fixed, and that’s not how StoryCorps thought about their content. They thought about it as fluid.

Kate: With the popularity of digital photography and recent news events, the concept of metadata has moved from being library jargon into the common lexicon. Most people understand now that metadata (information about the content, like the name of a subject or the time of a phone call) can be just as important as the data it describes. I would encourage donors to think not only about their content but about what metadata they would like preserved alongside it. We had a discussion about whether to archive the comments about the interviews and the “likes.” Likes are so fluid that a snapshot at an arbitrary time might not say much; tags, however, could be useful to researchers.

Q: While public access to this collection is available through StoryCorps, what happens to the interviews when they arrive at the Library to ensure long-term preservation and access?

David: Multiple copies are stored in multiple locations. Curatorial staff manage the collection with our inventory system, which keeps track of every StoryCorps file. There are two big classes of errors we try to protect against: 1) a single file or tape gets destroyed, in which case the checksum reveals the damage and we can replace the bad copy with a good copy; and 2) losing access to collections through correlated errors, a class of mistakes where somebody followed a bad process or relied on bad code. In that case, we can use our inventory to identify those kinds of problems.
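The first class of error David mentions, a single damaged copy, can be sketched in a few lines. This is only an illustration under stated assumptions: copies are modeled as an in-memory dictionary keyed by a made-up location name, the inventory is reduced to one recorded SHA-256 value, and `repair_copies` is a hypothetical helper rather than part of the Library's inventory system.

```python
import hashlib


def sha256_hex(data):
    """Checksum used here to stand in for the value recorded in the inventory."""
    return hashlib.sha256(data).hexdigest()


def repair_copies(copies, inventory_checksum):
    """Replace any damaged copy with an intact one; return the repaired locations.

    `copies` maps a storage location name to the bytes held there.
    """
    intact = next(
        (data for data in copies.values() if sha256_hex(data) == inventory_checksum),
        None,
    )
    if intact is None:
        # Every copy is damaged: nothing trustworthy to restore from.
        raise ValueError("no copy matches the inventory checksum")

    repaired = []
    for location, data in list(copies.items()):
        if sha256_hex(data) != inventory_checksum:
            copies[location] = intact
            repaired.append(location)
    return repaired
```

The sketch also hints at why correlated errors are the harder case: if the same bad process damaged every copy, no intact source remains and the function can only report the failure.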

Kate: Tooling lets us establish repeatable, automated processes that help us identify mistakes. Another thing cultural heritage institutions are worried about is the usability of digital objects over time. We’ve seen file formats go obsolete in a few years. We’re lucky that this collection is in commonly used formats.

David: Yes, the file formats JPG and MP3 are both decades old, have been continually available, and are in broad use.

Q: This is the second time that you have helped AFC acquire a collection in this way. (The earlier effort was harvesting Flickr photos showing how people celebrate Halloween.) How prevalent is this acquisition method and how do you see it shaping 21st century archival collecting?

Kate: We would like to establish an area of practice. Born-digital material is currently a small part of our collection, but it will grow explosively. We can make tools available to enable its processing. We already have a robust web archiving program, which focuses on collecting websites as publications in themselves. Born-digital collections differ from that in that they focus on collecting items (photographs, tweets, books, blogs) that are published on the web. There are huge economies of scale in this kind of acquisition, and the results can be extraordinarily useful to researchers.

