Boxes of Hard Drives and Other Challenges at WGBH: An NDSR Project Update

The following is a guest post by Rebecca Fraimow, National Digital Stewardship Resident at WGBH in Boston.


I have a pretty comprehensive list of goals to accomplish over the course of my time as the National Digital Stewardship Resident at WGBH’s Media, Library and Archives. They are:

  1. Document WGBH’s existing ingest workflow for production media and make recommendations for improvement.
  2. Design, implement, and complete an ingest process for over 70 hard drives’ worth of content created for the American Archive project, which needs to be backed up on LTO with appropriate metadata.
  3. Research the file failures that WGBH discovered last summer when initially pulling video files out of networked storage and putting them on the aforementioned hard drives.
  4. Create a video webinar (or series of video webinars) putting together a set of digital media recommendations to share with other public media stations, based on everything above.

The scope of the work could have been overwhelming, but the structure of the projects has actually flowed very naturally. Starting with phase one – working with WGBH’s workflow – allowed me to ease into the daily operations of the archive. This involves QCing (that is, performing quality control on) and editing the FileMaker metadata that WGBH receives along with the assets from every production, checking for consistency across the different deliverables, and getting familiar with the master database and the institutional workflow as it currently stands.

At the time I arrived, the archives staff had only recently acquired a set of LTO-6 decks, the latest version of the Linear Tape-Open magnetic tape data storage technology. As I got more comfortable understanding the needs of the archive, I also started looking at ways to make the LTO-6 workflow more standardized and eventually wrote a script to automate the generation and comparison of checksums for files during batch ingest. These smaller-scale projects served as building blocks for the creation of a set of workflow diagrams showing the ingest process as it exists at WGBH now, and the ingest process as we think it should exist in the future.
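
For readers curious what automating checksum generation and comparison can look like, here is a minimal Python sketch. It is an illustration only, not the script used at WGBH; the function names (`md5_of_file`, `build_manifest`, `compare_manifests`) are made up for this example, and it assumes the source and the copy are both reachable as ordinary directories.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root, with forward slashes) to its MD5."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(source, copy):
    """Return (missing, mismatched) relative paths when checking copy against source."""
    missing = sorted(name for name in source if name not in copy)
    mismatched = sorted(
        name for name in source if name in copy and source[name] != copy[name]
    )
    return missing, mismatched
```

In use, one would build a manifest for the source drive before transfer, build another for the copy afterward, and treat any name returned by `compare_manifests` as a file to re-transfer or investigate.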

As for phase two, I knew that was ready to kick off when I walked into my cubicle one day and saw that it was entirely filled with boxes of hard drives…

At this point, I’d worked extensively with WGBH’s LTO system during phase one, so I was familiar with the possibilities and limitations of the technology and could put that knowledge to use in designing my own personal workflow for this massive ingest process. An LTO-6 tape, uncompressed and formatted with the Linear Tape File System (which allows the computer to directly access the tape as it would a hard drive), holds about 2.5 TB of data. To write this much data from hard drive to tape, using a high-speed connection such as SATA or USB 3, takes about 4-5 hours. When you’re stuck with a slower connection, such as USB 2, it takes many times longer. We also had to generate metadata to live with the files and be confident that all the information that went onto the tape was 100% accurate, because data cannot truly be deleted from an LTO tape without erasing the whole tape.
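
The timing above can be checked with some quick arithmetic. The sketch below assumes the tape drive is the bottleneck on a fast connection (LTO-6 has a native write speed of 160 MB/s) and uses 35 MB/s as an assumed, typical real-world figure for USB 2; neither number is a measurement from WGBH’s setup.

```python
TAPE_CAPACITY_BYTES = 2.5e12  # LTO-6 native (uncompressed) capacity, decimal TB


def transfer_hours(size_bytes, throughput_mb_per_s):
    """Hours needed to move size_bytes at a sustained rate in MB/s (10**6 bytes)."""
    seconds = size_bytes / (throughput_mb_per_s * 1e6)
    return seconds / 3600


# Assumed sustained rates for illustration, not measured values.
ASSUMED_RATES_MB_PER_S = {
    "LTO-6 native write speed": 160,
    "USB 2, typical real-world": 35,
}

for label, rate in ASSUMED_RATES_MB_PER_S.items():
    hours = transfer_hours(TAPE_CAPACITY_BYTES, rate)
    print(f"{label} ({rate} MB/s): {hours:.1f} hours per full tape")
```

Under these assumptions a full tape takes roughly 4.3 hours at the drive’s native speed and close to 20 hours over USB 2, which is why getting the drives onto fast docking stations mattered.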

Taking all these factors into account, we designed a workflow that included removing all hard drives from their casings (many of which did not include the kind of connections that we needed), barcoding them, and accessing them using high-speed docking stations. I also wrote a script that would allow me to batch-generate metadata in the Dublin Core-based PBCore standard for audiovisual material, incorporating technical information provided by the media file analysis program MediaInfo as well as MD5 checksums, before transferring files to LTO. While it’s not all running perfectly smoothly yet and there are always new complications to discover, at this point the workflow is streamlined enough that I can use the hours when the computer is processing checksums or transfers to work on phase three of the project.
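
To give a sense of what a generated record might contain, here is a simplified Python sketch that builds a PBCore instantiation document for one file. It is not the script described above. The `tech` dictionary stands in for values that would be pulled from a MediaInfo report, the keys and sample values are hypothetical, storing the MD5 in an `instantiationAnnotation` is just one possible local convention, and the output has not been validated against the PBCore schema.

```python
import xml.etree.ElementTree as ET

PBCORE_NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"


def build_instantiation(filename, md5_hex, tech):
    """Build a simplified pbcoreInstantiationDocument element for one media file.

    `tech` is a dict of technical values assumed to come from MediaInfo;
    its keys are hypothetical names chosen for this example.
    """
    # Serialize PBCore elements in the default namespace (no prefix).
    ET.register_namespace("", PBCORE_NS)
    root = ET.Element(f"{{{PBCORE_NS}}}pbcoreInstantiationDocument")

    def add(tag, text, **attributes):
        child = ET.SubElement(root, f"{{{PBCORE_NS}}}{tag}", attributes)
        child.text = str(text)
        return child

    add("instantiationIdentifier", filename, source="filename")
    add("instantiationDigital", tech["mime_type"])
    add("instantiationLocation", tech["location"])
    add("instantiationFileSize", tech["file_size_bytes"], unitsOfMeasure="bytes")
    add("instantiationDuration", tech["duration"])
    # Assumption: the checksum is recorded as an annotation typed "MD5".
    add("instantiationAnnotation", md5_hex, annotationType="MD5")
    return root


# Hypothetical sample values, for illustration only.
sample = build_instantiation(
    "example_0001.mov",
    "d41d8cd98f00b204e9800998ecf8427e",
    {
        "mime_type": "video/quicktime",
        "location": "TAPE0001",
        "file_size_bytes": 1024,
        "duration": "00:28:30",
    },
)
print(ET.tostring(sample, encoding="unicode"))
```

A batch version would loop over every file on a drive, run the checksum and MediaInfo steps once per file, and write one XML record alongside each asset before the transfer to tape.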

Phase three is a new addition to the project plan as initially outlined by WGBH – this became part of the task list last summer, when WGBH started receiving very worrisome reports that a high percentage of the video files they were sending to be included in the American Archive project were failing. These video files were either showing severe signs of corruption in QC or failing to open at all. The persistent difficulties that the archives department had in pulling these files off of WGBH’s institutional LTO-4 tapes over network storage were part of the impetus for WGBH acquiring its own directly connected LTO-6 decks, which led directly to all the work I’ve done above. Now my job is to analyze the failures and see if I can figure out why they occurred.

While I’m still in the beginning phases of this research, so far I’ve managed to rule out the idea that the files are getting corrupted directly on tape; checksum analysis of the files stored on the tapes reveals that they still have the same unique signature as they did before they were written to LTO. I’ve also discovered that the variety of failure types is most likely due to the different structures of the video files themselves – specifically, differences in the placement of the “moov atom,” a section of the media file that contains structural metadata for the file as a whole, and without which the file cannot be read.

Atomic structure of a Quicktime video file, showing the various different elements of data that make up the file structure (generated by Apple's Atom Inspector).
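
For anyone who wants to inspect atom placement without a GUI tool, here is a minimal Python sketch that walks the top-level atoms of a QuickTime or MP4 file and reports where the moov atom sits relative to the media data (mdat). It relies only on the basic atom header layout (a 4-byte big-endian size followed by a 4-byte type, with size 1 signalling a 64-bit extended size and size 0 meaning "to end of file"). The function names are made up for this example, and it is not the analysis method used in this research.

```python
import struct


def list_top_level_atoms(path):
    """Return a list of (type, offset, size) tuples for each top-level atom."""
    atoms = []
    with open(path, "rb") as handle:
        handle.seek(0, 2)  # jump to the end to learn the file size
        file_size = handle.tell()
        offset = 0
        while offset + 8 <= file_size:
            handle.seek(offset)
            size, kind = struct.unpack(">I4s", handle.read(8))
            header = 8
            if size == 1:
                # A 64-bit extended size follows the type field.
                extended = handle.read(8)
                if len(extended) < 8:
                    break
                size = struct.unpack(">Q", extended)[0]
                header = 16
            elif size == 0:
                # The atom runs to the end of the file.
                size = file_size - offset
            if size < header:
                break  # malformed header; stop rather than loop forever
            atoms.append((kind.decode("latin-1"), offset, size))
            offset += size
    return atoms


def moov_placement(atoms):
    """Describe where the moov atom sits relative to mdat."""
    types = [kind for kind, _, _ in atoms]
    if "moov" not in types:
        return "missing"
    if "mdat" not in types:
        return "present, no mdat found"
    if types.index("moov") < types.index("mdat"):
        return "before mdat"
    return "after mdat"
```

Running `moov_placement(list_top_level_atoms(path))` over a batch of files gives a quick way to group them by structure. Note that a file whose moov atom sits at the end and whose tail has been cut off will report "missing", since the walk never reaches it.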


I recently gave a presentation about these failures and my research into the problem at Code4Lib 2015 and will be sharing my slides, as well as my planned next steps for investigation, on the NDSR Boston blog.

As for phase four – creating a video webinar – well, that’s not coming up for another two months, which is probably a good thing given that this post is already getting too long.

My residency work so far has involved everything from graphing out workflows to writing bash scripts to batch editing XML metadata to taking apart hard drives (and putting them back together, and then taking them apart again…). There’s a new and unexpected challenge to conquer every day – it keeps me on my toes in the best possible way. One of the greatest things about digital preservation as a field is how it forces us all to constantly keep learning and pushing our work forward, and I’m incredibly excited to be a part of it.

One Comment

  1. Tod Robbins
    February 13, 2015 at 7:20 pm

    Rebecca,

    As someone who regularly is backing up lots of media to LTO 6, I’d love to have access to that script you wrote for automating MD5 generation.

    Feel free to reach out to me on Twitter: /todrobbins
