Minimal Digital Processing at the American Folklife Center

This is a guest post by former American Folklife Center intern Annie Schweikert on her work to develop a minimal digital processing workflow. Annie is an MA candidate at NYU’s Moving Image Archiving and Preservation program who interned at the Library of Congress American Folklife Center in the summer of 2018. She has also held recent internships in digital audiovisual work at CUNY TV and NYU’s Fales Library & Special Collections. Prior to her graduate education, she worked as a project media archivist at the Smithsonian’s Human Studies Film Archives and as a student worker at the Yale Film Study Center.

The American Folklife Center’s incoming digital material grows exponentially every year. In the last few years, digital materials have even become the majority of AFC’s incoming collections—and the AFC staff responsible for digital processing have had to shift their time accordingly. The ever-growing stream of acquisitions has kept AFC staff so busy processing that they have had little time to sit down and strategize about their workflow. As an intern at AFC this summer, my goal was to work with staff—specifically, Julia Kim, Digital Assets Specialist, and Maya Lerman, Processing Archivist—to lessen their processing burden by codifying a minimal digital processing workflow.

This process was a balancing act. We hoped to define a workflow that could be applied to all incoming digital materials, regardless of content or provenance, but we needed it to store adequate information for long-term preservation. We also wanted this workflow to require less staff time to execute and monitor, but more machine time. Various digital processing workflows at AFC already incorporated one-line commands (executed on a virtual machine command line interface), so we decided to string together many of these commands.

Two scripts emerged. One, called “reportmd,” generates metadata such as checksums and directory structure trees without human intervention. The other, “processfiles,” safely manages the reorganization of files and directory structures. Both lean heavily on the “ingestfile” script in use at CUNY TV (written largely by Dave Rice, with significant contributions and documentation from Dinah Handel and many others), which I came to know during my internship there over the 2017-2018 academic year.
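
The actual scripts are strings of command-line tools, but the kind of unattended report generation “reportmd” performs can be sketched in a few lines of Python. This is a minimal illustration, not AFC’s implementation; all function names here are hypothetical:

```python
import hashlib
from pathlib import Path

def sha256_checksum(path, block_size=65536):
    """Compute a SHA-256 checksum, reading the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def directory_tree(root):
    """Yield an indented listing of every file and folder under root."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        depth = len(path.relative_to(root).parts) - 1
        yield "    " * depth + path.name

def report(root):
    """Collect checksums and a tree listing without human intervention."""
    root = Path(root)
    checksums = {
        str(p.relative_to(root)): sha256_checksum(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }
    return {"tree": list(directory_tree(root)), "checksums": checksums}
```

Because nothing here requires operator input, a run like this can be left to churn through a large accession on machine time rather than staff time.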

In this process, we decided to strengthen the information AFC collects and stores to describe each file’s technical characteristics. This information is particularly crucial for the complex audiovisual media that makes up a major part of AFC’s collections. We chose to run and store reports from several tools, including ExifTool for image and text files, MediaInfo and MediaTrace for audiovisual files, and QCTools for audiovisual files that had been digitized in-house or by a vendor. Each tool generates its metadata reports in a CSV or XML file, which AFC now stores alongside the digital objects themselves, with corresponding filenames.
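A small sketch of the “stored alongside, with corresponding filenames” idea: given a tool’s report text, write it next to the digital object under a matching name. The naming pattern below (for example, `interview.wav` becoming `interview.wav_mediainfo.xml`) is an assumption for illustration, not AFC’s documented convention:

```python
from pathlib import Path

def store_report(object_path, tool_name, report_text, extension="xml"):
    """Write a tool's metadata report alongside the digital object,
    under a filename derived from the object's own name so that
    objects and their reports travel together."""
    object_path = Path(object_path)
    report_path = object_path.with_name(
        f"{object_path.name}_{tool_name}.{extension}"
    )
    report_path.write_text(report_text, encoding="utf-8")
    return report_path
```

In practice the report text would come from invoking the tool itself (for example, MediaInfo’s XML output or ExifTool’s XML output); the helper above only handles the sidecar storage step.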

An example bag before and after metadata extraction: the same directory tree, shown first before extraction and then with the many additional metadata files added afterward.

We also saw an opportunity to store more complete information about the actions we were taking on the files. AFC does have records of much of this information: once files are inventoried by AFC’s long-term storage platform, that platform automatically records events such as fixity checks. But AFC’s workflow means that most processing actions, such as filename changes or metadata extraction, actually take place outside of the repository. Staff must therefore enter the events by hand in an external content management database. We eliminated this time-consuming step by generating a spreadsheet to record preservation events. As the script completes each task, it records the timestamp, action, and operator name in a new spreadsheet row, leaving an easy-to-read record of actions and responsibility.
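A minimal sketch of this event logging, using the column headings from AFC’s preservation action spreadsheet. The function name and the choice of UTC ISO timestamps are assumptions for illustration:

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Column headings mirror the preservation action spreadsheet.
FIELDNAMES = ["Date", "File or Directory Name", "Event",
              "Event Outcome", "Related Filename", "Operator", "Notes"]

def record_event(log_path, name, event, outcome, operator,
                 related="", notes=""):
    """Append one preservation event as a new spreadsheet row,
    writing the header first if the log does not exist yet."""
    log_path = Path(log_path)
    new_file = not log_path.exists()
    with open(log_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "Date": datetime.now(timezone.utc).isoformat(),
            "File or Directory Name": name,
            "Event": event,
            "Event Outcome": outcome,
            "Related Filename": related,
            "Operator": operator,
            "Notes": notes,
        })
```

Calling this after each completed task leaves the easy-to-read record of actions and responsibility described above, with no hand entry into an external database.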

An example preservation action spreadsheet: each row records one preservation event, with columns for Date, File or Directory Name, Event, Event Outcome, Related Filename, Operator, and Notes.

One of our biggest challenges in this process was sorting the files into different categories. AFC receives diverse digital files from a wide range of donors, and different file formats call for different tools. We ended up creating a long list of extensions that typically indicate audio files (e.g. .mp3, .wav) and video files (e.g. .mov, .mpg, .mkv). The script sends files with these extensions to be processed with audiovisual tools. Any files whose extensions do not match these audio or video extensions receive the default treatment for text and image files. It’s not a perfect solution, but it does draw out a good percentage of unambiguous audiovisual files for more comprehensive metadata.
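The routing logic can be sketched as follows. The extension sets here are abridged placeholders containing only the examples above; AFC’s actual lists are much longer:

```python
from pathlib import Path

# Abridged placeholder lists; the real lists cover many more formats.
AUDIO_EXTENSIONS = {".mp3", ".wav"}
VIDEO_EXTENSIONS = {".mov", ".mpg", ".mkv"}

def classify(filename):
    """Route a file by extension: audio and video files go to the
    audiovisual tools; everything else gets the default treatment
    for text and image files."""
    ext = Path(filename).suffix.lower()
    if ext in AUDIO_EXTENSIONS:
        return "audio"
    if ext in VIDEO_EXTENSIONS:
        return "video"
    return "default"
```

Matching on the lowercased suffix means `SONG.MP3` and `song.mp3` route the same way, while extensionless or unrecognized files safely fall through to the default path.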

In addition, to keep our new custom-built tools from failing silently, the scripts incorporate several checks to ensure that every step completes successfully. For example, whenever a log is created, the script checks that the log exists and is larger than zero bytes. The scripts announce any failures in red text in the terminal window, and also write the text of these failures to a text file that records every decision and interaction the operator makes, which can be referenced long after the terminal session has ended. These checks catch obvious failures and provide a record of how the scripts worked, what errors they produced, when these errors occurred, and which staff member might have more information.
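A minimal sketch of one such check, assuming standard ANSI escape codes for the red terminal text; the function and file names are hypothetical:

```python
import sys
from datetime import datetime, timezone
from pathlib import Path

RED, RESET = "\033[31m", "\033[0m"  # ANSI escape codes for red text

def check_log(log_path, session_log):
    """Verify that a log was actually written: it must exist and be
    larger than zero bytes. Failures are announced in red in the
    terminal and appended to a session log that can be consulted
    long after the terminal session has ended."""
    log_path, session_log = Path(log_path), Path(session_log)
    ok = log_path.exists() and log_path.stat().st_size > 0
    if not ok:
        message = f"ERROR: {log_path} is missing or empty"
        print(f"{RED}{message}{RESET}", file=sys.stderr)
        with open(session_log, "a", encoding="utf-8") as f:
            f.write(f"{datetime.now(timezone.utc).isoformat()}  {message}\n")
    return ok
```

Running a check like this after each step costs almost nothing in machine time but turns a silent failure into a timestamped, attributable record.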

My internship was only a summer long, so we were only able to scratch the surface of possibilities, but we made a start. I’m indebted to Julia Kim and Maya Lerman for their time, guidance, and testing help, and to the AFC at large for welcoming me for a whirlwind ten weeks!
