Digital Preservation Infrastructure Tours: The Bentley Historical Library


To better understand how organizations working to ensure long-term access to digital content are meeting the challenges of digital stewardship, the NDSA Infrastructure Working Group is running a new series of interviews. In each of these, we ask individuals to answer questions about their organization and the technologies and tools they use to serve as case studies in digital preservation systems. In this first of the series, I interviewed Mike Shallcross and Max Eckard from the Bentley Historical Library at the University of Michigan.

Abbey: Could you tell us a bit about your organization or organizational unit? What kinds of content do you work with and why is long-term access to that content an important issue?

Bentley Staff:  The Bentley Historical Library was established in 1935 by order of the University of Michigan Regents with a mandate to serve as the official archives of the university and a repository for the state of Michigan.  Our library holdings and 11,000 research collections include 45,000 linear feet of primary source material, 10,000 maps, 80,000 printed volumes, 1.5 million photographs, and approximately 20 TB of digital content.  In 2014, the Bentley underwent a reorganization that resulted in a new Curation unit responsible for the arrangement and description of paper, analog, and digital archives as well as the implementation of digital preservation strategies and resources.

Digital archives at the Bentley document key policies, functions, and activities of the University of Michigan in regard to its administration, academics, research, athletics, culture, and social life.  The preservation of the University’s administrative and historical record is essential for its institutional knowledge, accountability, and continued operations.  The Bentley also holds the personal papers of notable individuals and the records of voluntary associations, organizations, and government offices from the state of Michigan.  These materials document the interests, activities, and impact of citizens and groups, providing important information and evidence for scholars as well as professional and private researchers.

In terms of file formats and content types, the Bentley’s digital collections include a wide variety of born-digital materials (Office documents, digital images, datasets, websites and social media, sound recordings, moving images, etc.) in addition to content digitized from original primary source materials by the Bentley and private donors.

Abbey: Specifically, what technologies are you using in your preservation system? In this case, we would be curious to hear about the whole stack from acquisition to description to preservation and access.

Bentley Staff:  The Bentley’s current preservation system utilizes infrastructure provided by the University of Michigan as well as a variety of open-source and proprietary tools that are knitted together in our homegrown ‘AutoPro’ ingest and processing tool.  Archivists established a largely manual workflow as part of the MeMail Project, a 2010-2011 initiative funded by a grant from the Mellon Foundation.  Subsequent development work resulted in AutoPro, a collection of Windows shell and Visual Basic scripts that provide a unified command line interface to more than 20 separate applications and utilities to automate procedures and guide processors through a six-step workflow.  A full overview of our ingest, descriptive, and preservation practices is available in our processing manual.  The following will provide a high-level view of current procedures and practices, many (if not most) of which will be revised as part of our ongoing ‘ArchivesSpace-Archivematica-DSpace Workflow Integration‘ project.

When content is received on physical media (alone or as part of a larger accession of physical materials), we transfer content on a dedicated workstation in a ‘forensics-lite’ workflow in which we apply write-blockers (a Tableau T8-R2 for USB drives or software/media-specific blocking for other storage devices) to avoid altering files or system metadata and use TeraCopy to securely copy content to ensure data integrity and maintain timestamps.  We generally only create disk images in the cases of 5.25″ floppies (using Device Side Data’s FC5025 floppy controller, after which content is extracted from the image using FTK Imager) or in cases where the preservation of key features or functionality require the creation of a disk image (our most common use case is DVD-formatted video that contains a menu and/or special features), in which case the image will be preserved alongside extracted content.  For more information on these procedures, see our Removable Media workflow.
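The integrity guarantee that TeraCopy provides during transfer can be approximated in a few lines of Python: copy the file with timestamps preserved, then compare source and destination checksums. This is only an illustrative sketch (the function names are ours, not part of the Bentley workflow or TeraCopy):

```python
import hashlib
import shutil
from pathlib import Path

def md5sum(path: Path) -> str:
    """Return the MD5 checksum of a file, reading it in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def copy_and_verify(src: Path, dest: Path) -> str:
    """Copy src to dest, preserving timestamps, and verify data integrity."""
    shutil.copy2(src, dest)  # copy2 also carries over modification times
    src_md5, dest_md5 = md5sum(src), md5sum(dest)
    if src_md5 != dest_md5:
        raise IOError(f"checksum mismatch copying {src}")
    return dest_md5
```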

Once received, content is assigned a digital deposit identifier, bagged, and placed in a secure interim repository supported by the University of Michigan’s Information and Technology Services, which employs the Tivoli Storage Manager (TSM) service to create hourly, daily, and weekly snapshots in addition to nightly disaster recovery backups.  When a digital accession is assigned for processing, a copy of the (unbagged) content is placed in a staff member’s workspace on a local NAS device and our AutoPro ingest and processing workflow is initiated.
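The bagging step follows the BagIt packaging convention: a `data/` payload directory, a `bagit.txt` declaration, and a checksum manifest. A minimal sketch of that layout in Python (this illustrates the structure per the BagIt specification, not the Bentley's actual tooling):

```python
import hashlib
from pathlib import Path

def make_bag(payload_dir: str, bag_dir: str) -> None:
    """Lay out a minimal BagIt bag: data/ payload, bagit.txt, manifest-md5.txt."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for src in sorted(Path(payload_dir).rglob("*")):
        if src.is_file():
            rel = src.relative_to(payload_dir)
            dest = data / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(src.read_bytes())
            digest = hashlib.md5(dest.read_bytes()).hexdigest()
            # Manifest lines are "<checksum>  <path relative to bag root>"
            manifest_lines.append(f"{digest}  data/{rel.as_posix()}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    (bag / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")
```

In practice a library such as the Library of Congress's `bagit-python` would handle this, including tag files and validation.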

Upon starting AutoPro, staff add basic collection-level metadata (creator, collection title, donor, a general copyright statement, etc.) and enter/verify their name (for the purposes of creating an audit trail). If this is an initial processing session for the deposit, AutoPro will run four preliminary procedures: a virus scan, creation of a manifest, archive file extraction, and identification of paths that exceed 255 characters (a limitation of Windows operating systems). Subsequent processing sessions will not repeat these procedures; instead, processors will be taken to the application’s main menu, which includes the following steps (with all actions that modify or impact files being recorded in log files):

  • Initial Review: If the contents of the deposit are not adequately known (or if preliminary appraisal during migration from removable media has not taken place), the processor may elect to perform an initial review of the files using TreeSize Professional, Quick View Plus, and other applications.
  • Personally Identifiable Information (PII) Scan: In order to protect the identities of record creators and limit its exposure to risk, the Bentley Historical Library has established policies in regard to PII such as credit card numbers and U.S. Social Security numbers. AutoPro previously employed Identity Finder Data Discovery to scan for PII, but after a change in the licensing structure, we have switched to bulk_extractor.  After running a scan, processors verify search results and–if true positive hits are found–manually delete nonessential files or assign appropriate access restrictions to manage any risks.
  • File Extension Identification: To facilitate patron access to content, processors may add file extensions to content with missing or incorrect extensions. AutoPro runs the UK National Archives’ DROID utility to search for files with missing or mismatched extensions. If any are identified, the processor will review the list and, using information generated by the TrID File Identifier utility and collected from the PRONOM format registry, append correct file extensions as needed.
  • File Format Conversion: In transforming the SIP to an AIP, the Bentley Historical Library relies upon file format conversion as a primary preservation strategy. Based upon the Library of Congress’ work on the “Sustainability of Digital Formats” and documentation from the Florida Center for Library Automation and other peer institutions, the library has identified a number of at-risk (i.e. proprietary or potentially obsolete) file formats and developed conversion pathways to sustainable formats with various open source and freeware tools. AutoPro searches for these at-risk formats (based upon extension) and then employs the following tools (the selection of which was informed by the Archivematica digital preservation system):
    • ImageMagick: raster images to .TIFF
    • Ghostscript: .PS, .EPS and .PDF to .PDF/A (JHOVE is used to verify whether the original PDF is already identified as PDF/A)
    • Inkscape: vector images to .SVG
    • ffmpeg: audio to .WAV; video to MP4 with H.264 encoding
    • Aid4Mail: various email formats to .MBOX
    • LibreOffice: legacy word processing documents to .ODT
    • Microsoft Office File Converter (with the Office Migration Planning Manager utility): legacy MS Office formats to Office Open XML

These preservation versions are stored alongside the original and denoted by a suffix consisting of ‘_bhl-‘ and (where possible) the CRC32 hash of the original file (e.g., oralHistoryProject_bhl-0fbc2cc7.wav).
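The suffix convention above can be sketched with Python’s standard library: hash the original file’s bytes with CRC32 and splice the eight hex digits into the preservation copy’s name. The function name is illustrative, not part of AutoPro:

```python
import zlib
from pathlib import Path

def preservation_name(original: Path, new_ext: str) -> str:
    """Build a preservation filename: stem + '_bhl-' + CRC32 of the original.

    e.g. oralHistoryProject.au converted to WAV becomes
    oralHistoryProject_bhl-XXXXXXXX.wav, where XXXXXXXX is the
    eight-hex-digit CRC32 of the original file's contents.
    """
    crc = zlib.crc32(original.read_bytes()) & 0xFFFFFFFF
    return f"{original.stem}_bhl-{crc:08x}{new_ext}"
```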

  • Arrangement, Packaging, and Description: Processors may then conduct additional appraisal before proceeding to arrange and describe content in a manner that is both convenient to end users and in accordance with the intellectual arrangement of material in the finding aid. Arrangement may include packaging one or more files and/or folders into .zip files with custom batch files that maintain the directory-tree structure of content to preserve the context and original order of materials (and which also contain copyright statements and manifest of the .zip contents for the benefit of researchers). A MS Excel user form then helps standardize the creation of metadata and allows digital objects to be associated with the archival description. The resulting spreadsheet will be maintained by the library for administrative purposes but can also be employed for batch uploads to Deep Blue by University of Michigan Library staff.
  • Once content has been arranged and described, AutoPro calls DROID to extract technical metadata and create md5 checksums (even for files in .zip archives) and then employs BagIt to transfer a copy of all material (and metadata) to a secure dark archives. An additional copy is deposited in our Deep Blue DSpace repository, which serves as the main access mechanism for the library’s born-digital archives. As a final step, AutoPro securely shreds the working directory and temporary files. Processors record information about the completed digital deposit in the Bentley’s collections management database.
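Computing checksums “even for files in .zip archives” can be done without unpacking anything to disk, using Python’s zipfile module. A sketch of the idea (our own illustration, not DROID’s implementation):

```python
import hashlib
import os
import zipfile

def manifest_md5(root: str) -> dict:
    """Map each file path under root, including members of .zip archives,
    to its MD5 checksum. Zip members are keyed as '<zip path>!<member>'."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                manifest[path] = hashlib.md5(f.read()).hexdigest()
            if name.lower().endswith(".zip"):
                with zipfile.ZipFile(path) as zf:
                    for member in zf.namelist():
                        if not member.endswith("/"):  # skip directory entries
                            data = zf.read(member)
                            manifest[f"{path}!{member}"] = (
                                hashlib.md5(data).hexdigest())
    return manifest
```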

Throughout the above workflow steps, AutoPro generates an audit trail, which includes the output of tools (when available) or the creation of logs.  For instance, during file format creation, we document the original and resulting filenames/extensions, conversion timestamp, and associated software. We also create basic PREMIS preservation metadata for each action performed upon content.
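A basic PREMIS event of the kind described, say a format migration, can be serialized with the standard library. The element names below follow the PREMIS v3 schema, but the exact fields the Bentley records are not detailed in this post, so treat this as a hypothetical minimal record:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

PREMIS_NS = "http://www.loc.gov/premis/v3"

def premis_event(event_type: str, detail: str) -> str:
    """Serialize a minimal PREMIS event element as an XML string."""
    ET.register_namespace("premis", PREMIS_NS)
    event = ET.Element(f"{{{PREMIS_NS}}}event")
    ET.SubElement(event, f"{{{PREMIS_NS}}}eventType").text = event_type
    ET.SubElement(event, f"{{{PREMIS_NS}}}eventDateTime").text = (
        datetime.now(timezone.utc).isoformat())
    ET.SubElement(event, f"{{{PREMIS_NS}}}eventDetail").text = detail
    return ET.tostring(event, encoding="unicode")

# e.g. premis_event("migration", "ffmpeg: oralHistoryProject.au -> .wav")
```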

Abbey: What parts of your digital infrastructure are generally performing the best? That is, what tools and services do you think are particularly solid and what is it about them that impresses you?

Bentley Staff:  First and foremost, workflow automation with our AutoPro tool has been very helpful in empowering more staff and graduate student employees to work with digital archives while at the same time increasing efficiency and reducing opportunities for user errors, particularly in the operation of command line utilities, creating and naming log files, and recording preservation metadata.

In addition, the Bentley’s workflows and procedures were developed in accordance with (or with awareness of) recognized standards related to file formats for preservation and access; descriptive, administrative, technical, and preservation metadata schema; and functional entities of the OAIS reference model.  The library has also sought to employ open-source (and, in some cases, widely-used proprietary) software applications that have been adopted and supported by a diverse community of users.  Adhering to best practices and standards has been both helpful and reassuring to us (especially in the early stages of developing our digital curation program) as well as our donors.  By relying upon proven technologies and well-established metadata creation practices, we can avoid reinventing the wheel (or move beyond the limits of our technical expertise) and focus on archival functions of appraisal, arrangement, description, and so forth.

Abbey: What are your biggest pain points in your digital workflows? Or said differently, if there was one choke point in managing digital content that you wish someone would solve for you what is it?

Bentley Staff:  While AutoPro has been very helpful in standardizing our procedures and establishing a framework for ingest and processing activities, the tool has a number of limitations.  The command line interface is not especially user friendly or intuitive (especially for new trainees) and a graphical user interface would make many operations much simpler (for instance, by permitting users to drag/drop files or select multiple files or folders to be packaged together in .zip files instead of copying/pasting paths into text files and then running a separate command).  The CMD.EXE scripts also have poor error handling and string manipulation capabilities; we’re able to accomplish tasks, but a real programmer would be able to produce much more efficient code with a language like Python or Ruby, languages in which we’re finally developing more in-house knowledge via bi-weekly hackathons.  Updating and maintaining AutoPro means adjusting programs and scripts on all workstations, a process that has been simplified by the Bentley’s move to a managed desktop computing environment but is still far from ideal.  We also have concerns about AutoPro’s scalability, particularly in regards to large, media-rich accessions in our backlog.  Running such collections through AutoPro can be resource-intensive and if an event, such as a loss of power, disrupts the process, we would have to start again at the beginning of that process to ensure it completes as expected.

Another major issue involves the creation and replication of archival description.  Our current procedures require adding the same descriptive and administrative metadata multiple times in different systems: first, in our MS Word finding aid templates (which are then converted to EAD and posted online); again during AutoPro’s arrangement and description workflow step (so that the metadata can be associated with the data in our dark archives); and finally, when content is manually uploaded to our Deep Blue repository.  In addition to being labor intensive, this duplication of effort introduces possibilities for irregularities in description, particularly between information entered into the finding aids and AutoPro’s Excel user form.

Abbey: What technologies (software and hardware) are you currently evaluating for use in your system? What sorts of issues are you hoping that these new tools can address?

Bentley Staff:  We are currently in the middle of a Mellon-funded project to expedite the ingest, description and overall curation of digital archives by facilitating the creation and reuse of descriptive and administrative metadata among Archivematica and ArchivesSpace and streamlining the deposit of fully processed content into DSpace, our digital preservation repository.

To achieve our goals, the Bentley has contracted with Artefactual Systems, Inc. (the developers of Archivematica) for development work in the following areas:

  • Introduce functionality into Archivematica that will permit users to review, appraise, deaccession, and arrange content in a new “Appraisal and Arrangement” tab in the system dashboard. (You can follow developments to the “Appraisal and Arrangement” tab on the Archivematica wiki.)
  • Load (and create) ASpace archival object records in the Archivematica “Appraisal and Arrangement” tab and then drag and drop content onto the appropriate archival objects to define Submission Information Packages (SIPs) that will in turn be described as ‘digital objects’ in ASpace and deposited as discrete ‘items’ in DSpace.  This work will build upon the SIP Arrangement panel developed for Simon Fraser University and the Rockefeller Archives Center’s Archivematica-Archivists’ Toolkit integration.
  • Create new archival object and digital object records in ASpace and associate the latter with DSpace handles to provide URIs/’href’ values for <dao> elements in exported EADs.  While our grant focuses on creating AIPs for deposit into DSpace, we aim to create repository-agnostic AIPs using metadata sharing protocols such as ResourceSync. This will accommodate the varied practices of the community as well as, in our own case, the future adoption of a Hydra repository environment.

We’re very hopeful that Archivematica can address some of the issues enumerated in our response to the previous question. Because it is graphical, the interface is more intuitive and processors will not need to learn the command line. Because it is web-based, there will be no need to install clients on individual machines, and system updates will only need to happen once. Archivematica also improves upon AutoPro’s ability to handle errors: while some errors result in a process being halted and the transfer or SIP being moved to the failed directory, for others, processing can continue. Finally, we’re confident that Archivematica’s “pipeline” approach will scale better than AutoPro–we need to start chipping away at those large, media-rich accessions in our backlog!

Archivematica also produces a very rich METS.xml file full of technical and preservation metadata about digital objects, which is indexed in the Archivematica Storage Service. Inspired in part by Binder (developed for MOMA by Artefactual), we’re exploring how best to make use of this information, whether that’s depositing it alongside the digital object in DSpace or using something like Blacklight to provide a “curatorial view” for technical and preservation metadata to inform future preservation actions (or possibly a combination of both approaches).

Because we’ll be using ArchivesSpace as our “system of record” for descriptive and administrative metadata, we’re hoping to overcome the irregularities in description caused by entering metadata in multiple places (e.g., the finding aids, the EAD XML and the Excel user form mentioned above). ArchivesSpace will also allow us to easily export EAD or MARC XML to our discovery system and library catalog.

By integrating ArchivesSpace with Archivematica, we’ll be able to allow each system to do what it does best without having to switch back-and-forth between them. Our workflows will become more efficient, we’ll have less opportunity to introduce human error, and we’ll be able to train new processors faster (since they’ll only need to learn one system, not two).

For both Archivematica and ArchivesSpace, we’re excited to adopt what have emerged as industry standard, open source initiatives. This will allow us to take advantage of the strong communities and lively discussion lists behind Archivematica and ArchivesSpace (in addition to the pieces of software themselves), and will enable us to focus our efforts on the curation of digital archives, rather than maintenance and support of homegrown systems. We’re also excited to contribute back to these communities; we hope that the development we’re sponsoring to integrate Archivematica and ArchivesSpace, to add AutoPro’s ability to appraise and arrange SIPs to the Archivematica dashboard Ingest tab, and to generate repository-agnostic AIPs, which will all be incorporated back into the Archivematica source code, will benefit the broader community as much as it will benefit us!

You can follow the project’s progress on Bentley’s blog and on Twitter. You can also email staff at the Bentley at bhl-mellon-grant at umich dot edu.


Updated 6/25/15 to include Bentley staff in the introduction and contact information.
