Tool Time, or a Discussion on Picking the Right Digital Preservation Tools for Your Program: An NDSR Project Update

The following is a guest post by John Caldwell, National Digital Stewardship Resident at the United States Senate Historical Office.

Who remembers Home Improvement? Tim the “Tool Man” Taylor was always trying to show the “Tool Time” audience how to build things, make repairs, and, of course, demo new tools made by the show’s sponsor, Binford. In true sitcom fashion, he broke more things than he fixed, thanks to his “more power” mantra. But I’m reminded of the episode “Be True To Your Tool,” where Tim tested a new saw, took it apart to analyze its construction, and refused to endorse it on the show because it wasn’t good enough.

I see this as an object lesson for the digital preservation community. There are lots of tools out there, from checksum validators to digital forensics suites and wholesale preservation solutions. Many people feel it’s important to have the latest and greatest (as someone who regularly upgrades his cell phone, I can sympathize), but in this impulse for the new and the now, we sometimes forget to ask the big question: For your institution, is tool X good enough? Or, to put it another way, is this the right tool for you?

I’m trying to answer that question right now in the U.S. Senate Historical Office.

Commercial Break: My NDSR Project

44 USC §2118 and Senate Standing Rules XI and XXVI require the Secretary of the Senate to transfer non-current Senate committee records to NARA’s Center for Legislative Archives for long-term storage and preservation. The Senate Historical Office, specifically the Senate Archivist and her team, takes on the task of working with Senate offices and their archivists to collect, describe and prepare records for transfer to NARA, which assumes responsibility for long-term preservation and storage.

Since the beginning of the 111th Congress in 2009, Senate archivists have transferred nearly 12 TB of Senate Committee digital records to NARA, focusing primarily on gathering digital records and describing their informational content. In 2015, now that the archivists are more experienced in managing electronic records, it’s time to better align the Senate with the digital preservation best practices that have developed over the last six years.

And this is where I come in. My project (PDF) is to help the Senate archivists by:

  1. studying current Senate workflows;
  2. benchmarking current policies against best practices;
  3. reviewing and testing potential digital curation applications;
  4. proposing sustainable workflows that align with current digital curation standards; and
  5. producing a white paper to sum up current processes and propose next steps.

So, where are we now in this project? I’m starting step number 3, “reviewing and testing potential digital curation applications.” In other words… it’s tool time!

Back to our Tool Talk

To determine the right tool, or toolbox, for the Senate archivists, we’re following a modified version of Regine Heberlein’s “Gospel of Metadata,” presented at the Introduction to Metadata Power Tools for the Curious Beginner pop-up session during the 2015 Society of American Archivists Annual Meeting in Cleveland. The session presented case studies from archivists with limited IT experience, who shared the tools and techniques they’ve found successful in processing existing collections and “cleaning up” digital object metadata. Though the tools demoed during the session were designed to assist archivists managing existing data, Heberlein’s process can also inform the selection of tools used to generate new descriptive and preservation metadata.

The first step in Heberlein’s process is to know your records. Here, that meant learning more about how electronic records are being managed in the Senate committees, how committee archivists are processing electronic records, and what NARA does with the materials when they receive the transferred records.

We decided that surveying the archivists, IT staff, and committee clerks would be the best way to ask about electronic records management and electronic records archiving in various office environments. I spent the first few months meeting with committee staff and learning about their particular processes. I also met with the staff of the Center for Legislative Archives to find out what happens when our bytes leave the Hill.

Once we learned about our records, we addressed the second step in Heberlein’s gospel: deciding what we want the end result to be. For the last six years, there have been conscious efforts to increase the variety and quality of metadata that accompanies records transferred from the Senate to NARA. To date, the focus has been on better content and contextual description, and these efforts have reaped much benefit, especially for committees needing to recall records. The Center’s staff doesn’t have the time to augment records description, so it falls to the Senate archivists to generate the descriptive metadata that accompanies the records. There is an automatic 20-year closure on all Senate committee records (and a 50-year closure for investigations, nominations, and records with PII), so it will be at least 2035 before today’s work product may become accessible to researchers; the more that can be done up front, the more discoverable the content will be later.

One of the many things we’re trying to do now is add preservation metadata to the records, establishing their integrity as early in the lifecycle as possible. A lot can happen to a piece of paper in 20 years’ time, but for a digital file, 20 years of benign neglect is tantamount to destruction due to technological obsolescence. Two aspects of integrity that have been identified are file format identification and fixity in the form of cryptographic hashes. Other tasks we’re hoping automated tools will improve include identifying PII, getting more accurate volume information on transfers, de-duplication, and anything to help conquer “the email problem.”
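To make the fixity and de-duplication ideas concrete, here is a minimal sketch using only Python’s standard library. This is purely illustrative, not the Senate’s workflow; in practice, tools like DROID or NARA File Analyzer handle these tasks. The function names and the choice of SHA-256 are my own assumptions.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 checksum, reading in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def fixity_manifest(root: Path) -> dict[str, str]:
    """Map each file's relative path to its checksum. Rerunning this later and
    comparing manifests reveals silent corruption or loss."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def find_duplicates(manifest: dict[str, str]) -> dict[str, list[str]]:
    """Group paths by checksum; any group with more than one entry is a
    candidate for de-duplication."""
    by_hash: dict[str, list[str]] = {}
    for rel_path, checksum in manifest.items():
        by_hash.setdefault(checksum, []).append(rel_path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

The same manifest serves both purposes: compared against a later run, it verifies fixity; compared against itself, it flags duplicate files before transfer.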

Once you know what you want, you need to find the tool for the task. This brings our conversation full circle: do we get the tool we need, or do we find something for the tool to do?

A lot goes into trying to find that perfect fit:

  • Placement: Where does the tool fit into your process?
  • Purpose: What does the tool actually do? Is it replacing a process (making it more efficient) or are we using it for a new process?
  • Utility: How easy is the tool to use and does its output make sense?
  • Viability: Is the tool a long-term solution or a quick fix for today?

These are the questions I’m in the middle of answering right now. Here’s what I have so far:

Placement: Where does the tool fit into your process?

Since there are nine archivists working in eight different offices, there is no single process. I created a workflow for all of the archivists to document how they process electronic records. With the workflows, we can identify specific processes that can be automated, where and when to incorporate tools, and how to make their integration as seamless as possible.

Electronic Records Processing Workflow. Credit: John Caldwell

Purpose: What does the tool actually do?

This is where research skills come into play: identifying all the possible tools you think will get the job done, learning what they do best, hearing about other professionals’ experiences, reading the manuals to see if the tools are actually usable, and winnowing that list down to three or four that seem, on the surface, to be a good fit. (This is very reminiscent of my college search, actually.) If you want to test multiple tools that perform the same primary function, an important question to consider is: what else do they do? For example, NARA File Analyzer and DROID are both principally designed to examine digital files and identify the file type, but they can also generate checksums at the same time; Karen’s Directory Printer, on the other hand, only generates checksums. The workflow analysis is also important. Knowing which steps in the process each tool affects will help you decide whether, and how best, to test it.

Utility: How easy is the tool to use and does its output make sense?

If the tool is too complicated to use or the output too complex, then the tool is unusable. A tool that is easy to use but tries to do too many things or is too specialized may be just as impractical. Using our earlier example of fixity and format identification, is it better to use a tool only for its intended purpose (e.g. DROID only for format identification) and add multiple tools to the process, each doing a specific task (such as Karen’s Directory Printer for checksums)? Or do you try to maximize the multiple functionality of a tool (NARA File Analyzer’s dual fixity/format identification features), even if it makes the resulting data more complicated to use long term? Do you install a full suite of programs (e.g. BitCurator) because there are one or two individual tools that look promising (Bulk Extractor to identify PII in large data sets)? Or do you try to isolate the specific tools you think you want? Even if there isn’t a strong argument now for installing and learning how to use the full BitCurator environment, might there be a situation in a year or two where its full functionality will be useful?

Screenshots of various tools for the digital preservation toolbox. Credit: John Caldwell

These are just some of the issues that we are confronting as we enter the testing phase of tool selection. The seemingly straightforward question of utility is fundamentally tied to the question of purpose, and also the viability question: is the tool a long-term solution or a quick fix for today?

As the testing phase gets underway, we’re developing a procedure that can be replicated with every potential tool for each specific purpose, identifying the essential criteria, and figuring out the logistics for implementation in a production environment. It will be some time before we can hope to make final selections, but we’re following a necessary and invaluable sequence of events that will be beneficial to the digital archivists and the institution as a whole.

Diving into this process has given me a new appreciation for the Tool Man. Maybe if he had taken his time in every episode, instead of just rushing ahead, “More Power” might have led to better results. But then, that would have made for boring TV. Fortunately, my project is anything but boring!