This is a guest post by Julia Kim, archivist in the American Folklife Center at the Library of Congress.
The American Folklife Center just celebrated 40 years since it was founded by Congressional mandate. But its origins far predate 1976; its earlier incarnation was the Archive of Folk Song,which was founded in 1928 and was part of the Library’s Music Division.
Its collections included many early analog audio recordings, like the Alan Lomax Collection and the Federal Cylinder Project’s Native and Indigenous American recordings. [See also the CulturalSurvival.org story.]
While the Library is well known for its work with different tools, guidelines and recommendations, less is known about its systems and workflows. I’ve been asked about my work in these areas and though I’ve only been on staff a relatively short while, I’d like to share a little about digital preservation at AFC.
As part of the Nation’s Library, AFC has a mandate to collect in the areas of “traditional expressive culture.” Of its digital collections, AFC maintains ongoing preservation of 200 TB of content but we project a 50% increase of approximately 100 TB of newly digitized or born-digital content this year. In our last fiscal year, the department’s acquisitions were 96% digital, spanning over 45 collections. StoryCorp’s 2015 accessions alone amounted to approximately 50,000 files (8 TB).
It has been a tremendous challenge to understand AFC’s past strategies with an already mature — but largely dark — repository, as well as how to refine them with incoming content. We have not yet had to systemically migrate large quantities of born-digital files but preserving the previously accessioned collections is a major challenge. More often than not, AFC processors apply the terms migration and remediation to older servers and databases rather than to files. This is an inevitable result of the growing maturity of our digital collections as well as others within the field of Digital Preservation.
The increasing amount of digital content also means that instead of relegating workflows to a single technical person (me), digital content is now handled by most of the archival processors in the division. AFC staff now regularly use a command line interface and understand how to navigate our digital repository. This is no small feat.
Similarly, staff training in core concepts is also ongoing. A common misconception is that ingest is a singular action when, in its fullest definition, it’s an abstraction that encompasses many actions, actors and systems. Ingest is one of the core functions in the OAIS framework. The Digital Preservation Coalition defines ingest as “the process of turning a Submission Information Package into an Archival Information Package, i.e. putting data into a digital archive.” Ingest, especially in this latter definition, can be contingent on relationships and agreements with external vendors, as well as arrangements with developers, project managers, processing staff and curators.
Transferring content is a major function of ingest and it is crucial to ensure that the many preservation actions down the line are done on authentic files. While transferring content involves taking an object into a digital repository, and it may seem to be a singular, discrete process, the transfer can involve many processes taking place over multiple systems by many different actors.
The flexibility inherent throughout the OAIS model requires systematic and clear definitions and documentation to be of any real use. This underscores the need for file verification and creating hash values at the earliest opportunity, as there is no technical ability to guarantee authenticity without receiving a checksum at production.
Ingest can then include validating the SIP, implementing quality assurance measures, extracting the metadata, inputting descriptive administrative metadata, creating and validating hash values and scanning for viruses. In our case, after establishing some intellectual control, AFC copies to linear tape before doing any significant processing and then re-copies again after any necessary renaming, reorganizing and processing.
Our digital preservation ecosystem relies on many commonly used open-source tools (bwfmetaedit, mediainfo, exiftool, Bagit, JHOVE, Tesseract), but one key tool is our modular home-grown repository, our Content Transfer Services ((see more about the development of CTS in this 2011 slide deck), which supports all of the Library of Congress, including the Copyright division and the Congressional Research Services.
CTS is primarily an inventory and transfer system but it continues to grow in capacity and it performs many ingest procedures, including validating bags upon transfer and copy, file-type validations (JHOVE2) and — with non-Mac filesystems — virus scanning. CTS allows users to track and relate copies of content across both long-term digital linear tape as well as disk-based servers used for processing and access. It is used to inventory and control access copies on other servers and spinning disks, as well as copies on ingest-specific servers and processing servers. CTS also supports workflow specifications for online access, such as optical character recognition, assessing and verifying digitization specifications and specifying sample rates for verifying quality.
Each grouping in CTS can be tracked through a chronology of PREMIS events, its metadata and its multiple copies and types. Any PREMIS event, such as a “copy,” will automatically validate the md5 hash value associated with each file, but CTS does not automatically or cyclically re-inventory and check hash values across all collections. Curators and archivists can use CTS for single files or large nested directories of files: CTS is totally agnostic. Its only requirement is that the job must have a unique name.
CTS’s content is handled by file systems. Historically, AFC files are arranged by AFC collections in hierarchical and highly descriptive directories. These structures can indicate quality, file/content types, collection groupings and accessioning groupings. It’s not unusual, for example, for an ingested SIP directory to include as much as five directory levels with divisions based on content types. This requires specific space projections for the creation and allocation of directory structures.
Similarly, AFC relies on descriptive file-naming practices with pre-pended indications of a collection identifier — as well as other identifiers — to create, in most cases, unique IDs. CTS does not, however, require unique file names, just a unique naming of the grouping of files. CTS, then, accepts highly hierarchical sets of files and directories but is unable to work readily at the file level. It works within curated groupings of files with a reasonable limitation of no more than 5,000 files and 1 TB for each grouping.
AFC plans to regularly ingest off-site to tape storage at the National Audio Visual Conservation Center in Culpepper, Virginia (see the PowerPoint overviews by James Snyder and Scott Rife). While most of our collections are audio and audiovisual, we don’t currently send any digital content to NAVCC servers except when we request physical media to be digitized for patrons to access. We’re in the midst of a year-long project to explore automating ingest to NAVCC in a way that integrates with our CTS repository systems on Capitol Hill.
This AFC-led project should support other divisions looking for similar integration and will also help pave the way to support on-site digitization and then ingest to NAVCC. The project has been fruitful in engaging conversations on different ingest requirements for NAVCC and its reliance, for example, on Merged AudioVisual Information System (MAVIS) xml records, previously used to track AFC’s movement of analog physical media to cold storage at NAVCC. AFC also relies heavily on department-created Oracle APEX databases and Access databases.
One pivotal aspect of ingest is data transfer. We receive content from providers in a variety of ways: hard drives and thumb drives sent through the mail, network transfer over cloud services like Signiant Exchange and Dropbox, and API harvesting of the StoryCorps.Me collection. Each method carries some form of risk, from network failures and outages to hard drive failures. And, of course, human error.
AFC also produces and sponsors lots of content production, including in-house lectures and concerts and its Occupational Folklife Collections, which involve many non-archival processing staff members and individuals.
Another aspect that determines our workflows involves the division between born-digital collections and accessions versus our digitized collections meant for online access. As part of my introduction to the Library, I jumped into AFC’s push to digitize and provide online access to 25 ethnographic field projects collected from 1977-1997 (20 TB multi-format). AFC has just completed and published a digitized collection, the Chicago Ethnic Arts project.
These workflows can be quite distinct but in both the concept of “processing” is interpreted widely. In the online access digitization workflows, which have involved the majority of our staff’s processing time over the past six months, we must assess and perform quality control measures on different digital specifications as well as create and inventory derivatives at a mass scale across multiple servers. These collections, which we will continue to process over the years, test the limits of existing systems.
The department quickly maxed out the server space set aside for reorganizing content, creating derivatives and running optical character recognition software. Our highly descriptive directory structures were incompatible with our online access tools and required extensive reorganization. We also realized that there was a very high learning curve to working with a command line interface for many staff and many ongoing mistakes were not found until much later. Also later in the project, we determined that our initial vendor specifications were unsupported by some of the tools we relied on for online display. The list goes on but the processing of these collections served as an intensive continuous orientation to historical institutional practices.
There are many reasons for the road blocks we encountered and some were inevitable. At the time that some of the older AFC practices had been established, CTS and other Library systems could not support our now current needs. However, like many new workflows, the field project digitization workflows are ongoing. Each of these issues required extensive meetings with stakeholders across different departments that will continue to over the coming months. These experiences have been essential in refining stakeholder roles and responsibilities as well as expectations around the remaining unprocessed 24 ethnographic field projects. Not least of all, there is a newer shared understanding of the time, training and space needed to move, copy and transform large quantities of digitized files. Like much of digital preservation itself, this is an iterative process.
As the year winds down, priorities can soon shift to revisiting our department’s digital preservation guidelines for amendment, inventorying unclearly documented content on tape and normalizing and sorting through the primarily descriptive metadata of our digital holdings.
Additionally AFC is focusing on re-organizing and understanding complex, inaccessible collections that are on tape. In doing so, we’ll be pushing our department to focus on areas of our self audit from last year that are most lacking, specifically in metadata. Another summer venture for us is to test and create workflows for identifying fugitive media left mixed in with paper with hybrid collections.
This summer, I’ll work with interns to develop a workflow to label, catalog, migrate and copy to tape, using the Alliance of American Quilts Collection as our initial pilot collection. AFC has also accumulated a digital backlog of collections that has not been processed or ingested in any meaningful way during our focus on digitization workflows. These need to be attended to in the next several months.
While this is just a sampling of our current priorities, workflows, systems and tools, it should paint a picture of some of the work being done in AFC’s processing room. AFC was an early adopter of digital preservation at the Library of Congress and as its scope has expanded over the past few decades, its computer systems and workflows have matured to keep up with its needs. The American Folklife Center continues to pioneer and improve digital preservation and access to the traditional expressive culture in the United States.