This is a guest post by former American Folklife Center intern Annie Schweikert on her work to develop a minimal digital processing workflow. Annie is an MA candidate at NYU’s Moving Image Archiving and Preservation program who interned at the Library of Congress American Folklife Center in the summer of 2018. Other recent internships include digital audiovisual settings at CUNY TV and NYU’s Fales Library & Special Collections. Prior to her graduate education, she worked as a project media archivist at the Smithsonian’s Human Studies Film Archives and as a student worker at the Yale Film Study Center.
The American Folklife Center’s incoming digital material grows exponentially every year. In the last few years, digital materials have even become the majority of AFC’s incoming collections—and the AFC staff responsible for digital processing have had to shift their time accordingly. The ever-growing stream of acquisitions has kept AFC staff so busy processing that they have had little time to sit down and strategize about their workflow. As an intern at AFC this summer, my goal was to work with staff—specifically, Julia Kim, Digital Assets Specialist, and Maya Lerman, Processing Archivist—to lessen their processing burden by codifying a minimal digital processing workflow.
This process was a balancing act. We hoped to define a workflow that could be applied to all incoming digital materials, regardless of content or provenance, but we needed it to store adequate information for long-term preservation. We also wanted this workflow to require less staff time to execute and monitor, but more machine time. Various digital processing workflows at AFC already incorporated one-line commands (executed on a virtual machine command line interface), so we decided to string together many of these commands.
Two scripts emerged. One, called “reportmd,” generates metadata such as checksums and directory structure trees without human intervention. The other, “processfiles,” safely manages the reorganization of files and directory structures. Both lean heavily on the “ingestfile” script in use at CUNY TV, written largely by Dave Rice with significant contributions and documentation from Dinah Handel and many other contributors; I became familiar with it during my internship there over the 2017–2018 academic year.
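As a rough illustration of the kind of work “reportmd” automates, here is a minimal bash sketch that writes a checksum manifest and a directory tree for a collection directory without any prompts. It is not the actual AFC script; the folder layout, output filenames, and choice of md5sum and tree are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Minimal sketch of a "reportmd"-style metadata step (not the actual AFC script).
# Given a collection directory, write a checksum manifest and a directory tree
# into a "metadata" subfolder, with no human intervention along the way.

set -euo pipefail

collection="$1"                              # e.g. ./afc2018_example_collection
metadata_dir="${collection}/metadata"
mkdir -p "$metadata_dir"

# Checksum manifest: one MD5 per payload file, with paths relative to the
# collection root, similar in spirit to a BagIt manifest-md5.
(cd "$collection" && find . -type f ! -path "./metadata/*" -exec md5sum {} \;) \
    > "${metadata_dir}/manifest-md5.txt"

# Directory structure tree, saved as plain text for long-term reference.
tree "$collection" > "${metadata_dir}/directory_tree.txt"
```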
In this process, we decided to strengthen the information AFC collects and stores to describe each file’s technical characteristics. This information is particularly crucial for the complex audiovisual media that makes up a major part of AFC’s collections. We chose to run and store reports from several tools, including ExifTool for image and text files, MediaInfo and MediaTrace for audiovisual files, and QCTools for audiovisual files that had been digitized in-house or by a vendor. Each tool generates its metadata reports in a CSV or XML file, which AFC now stores alongside the digital objects themselves, with corresponding filenames.
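The commands below sketch what those report-generating steps can look like for individual files. The flags shown for MediaInfo, the MediaTrace report, and QCTools, as well as the sidecar filenames, are illustrative assumptions rather than the exact calls in the AFC scripts.

```bash
# ExifTool for image and text files, stored as CSV alongside the object.
exiftool -csv "photo_001.tif" > "photo_001_exiftool.csv"

# MediaInfo as XML, plus a MediaTrace-style detailed parse, for audiovisual files.
mediainfo --Output=XML "interview_001.wav" > "interview_001_mediainfo.xml"
mediainfo --Details=1 "interview_001.wav" > "interview_001_mediatrace.txt"

# QCTools report for files digitized in-house or by a vendor; qcli writes its
# own .qctools.xml.gz report next to the input file.
qcli -i "video_001.mkv"
```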
We also saw an opportunity to store more complete information about the actions we were taking on the files. AFC does have records of much of this information: once files are inventoried by AFC’s long-term storage platform, that platform automatically records events such as fixity checks. But AFC’s workflow means that most processing actions, such as filename changes or metadata extraction, actually take place outside of the repository. Staff must therefore enter the events by hand in an external content management database. We eliminated this time-consuming step by generating a spreadsheet to record preservation events. As the script completes each task, it records the timestamp, action, and operator name in a new spreadsheet row, leaving an easy-to-read record of actions and responsibility.
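Here is a minimal sketch of that event log, assuming a simple CSV with timestamp, action, and operator columns (the actual column names and layout of AFC’s spreadsheet may differ):

```bash
# Append one row to the preservation event spreadsheet as each task completes.
eventlog="preservation_events.csv"

log_event () {
    local action="$1"
    # Write the header row once, on first use.
    if [ ! -f "$eventlog" ]; then
        echo "timestamp,action,operator" > "$eventlog"
    fi
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ),${action},${USER}" >> "$eventlog"
}

log_event "metadata extraction"
log_event "filename change"
```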
One of our biggest challenges in this process was sorting the files into different categories. AFC receives diverse digital files from a wide range of donors, and different file formats call for different tools. We ended up creating a long list of extensions that typically indicate audio files (e.g. .mp3, .wav) and video files (e.g. .mov, .mpg, .mkv). The script sends files with these extensions to be processed with audiovisual tools. Any files whose extensions do not match these audio or video extensions receive the default treatment for text and image files. It’s not a perfect solution, but it does draw out a good percentage of unambiguous audiovisual files for more comprehensive metadata.
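In bash, that routing can be as simple as a case statement over lowercased extensions. The sketch below uses only the example extensions named above; the real list is much longer, and the echo statements stand in for calls to the appropriate tools.

```bash
route_file () {
    local f="$1"
    # Lowercase the extension so .WAV and .wav match the same pattern.
    local ext="${f##*.}"
    ext="$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')"

    case "$ext" in
        mp3|wav)        echo "audio: run audiovisual tools on $f" ;;
        mov|mpg|mkv)    echo "video: run audiovisual tools on $f" ;;
        *)              echo "default: run text/image tools on $f" ;;
    esac
}

route_file "interview_001.WAV"   # -> audio
route_file "fieldnotes.docx"     # -> default
```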
In addition, to prevent our new custom-built tools from failing silently, the scripts incorporate several checks to ensure that every step completes successfully. For example, whenever logs are created, the script checks that the log exists and is larger than zero bytes. The scripts announce these failures in red text in the terminal window and also write the text of each failure to a text file that records every decision and interaction the operator makes, which can be referenced long after the terminal session has ended. These checks prevent obvious failures and provide a record of how the scripts worked, what errors they produced, when these errors occurred, and which staff member might have more information.
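A hedged sketch of one such check, with a placeholder session log filename: the -s test confirms that a log exists and is larger than zero bytes, failures print in red, and the same message is appended to a text file that outlives the terminal session.

```bash
sessionlog="session_$(date +%Y%m%d).log"      # placeholder name for the session log

check_log () {
    local logfile="$1"
    # "-s" is true only if the file exists and has a size greater than zero bytes.
    if [ ! -s "$logfile" ]; then
        local msg
        msg="$(date -u +%Y-%m-%dT%H:%M:%SZ) ERROR: ${logfile} is missing or empty (operator: ${USER})"
        printf '\033[31m%s\033[0m\n' "$msg"   # announce the failure in red text
        echo "$msg" >> "$sessionlog"          # keep a record beyond the terminal session
    fi
}

check_log "interview_001_mediainfo.xml"
```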
My internship was only a summer long, so we were only able to scratch the surface of possibilities, but we made a start. I’m indebted to Julia Kim and Maya Lerman for their time, guidance, and testing help, and to the AFC at large for welcoming me for a whirlwind ten weeks!
Comments (3)
I’m curious what the updated manifest-md5 file contains. Are the scripts hosted online anywhere to check out?
Yes, we’ll post it soon! It’s just hash values of payload files. This was meant to address significant changes to the payload pre-ingest. I may pull it or rename it so the date stamp goes first. As is, the file is not accepted by our digital repository.
Thanks for your interest!
…and here are the scripts:
https://github.com/LibraryOfCongress/data-exploration/tree/master/americanfolklifecenter