Boxes of Hard Drives and Other Challenges at WGBH: An NDSR Project Update

The following is a guest post by Rebecca Fraimow, National Digital Stewardship Resident at WGBH in Boston.

Rebecca Fraimow

I have a pretty comprehensive list of goals to accomplish over the course of my time as the National Digital Stewardship Resident at WGBH’s Media Library and Archives. Those goals are to:

  1. Document WGBH’s existing ingest workflow for production media and make recommendations for improvement.
  2. Design, implement, and complete an ingest process for over 70 hard drives’ worth of content created for the American Archive project, which needs to be backed up on LTO with appropriate metadata.
  3. Research the file failures that WGBH discovered last summer when initially pulling video files out of networked storage and putting them on aforementioned hard drives.
  4. Create a video webinar (or series of video webinars) putting together a set of digital media recommendations to share with other public media stations, based on everything above.

The scope of the work could have been overwhelming, but the structure of the projects has flowed quite naturally. Starting with phase one – working with WGBH’s existing workflow – allowed me to ease into the daily operations of the archive. This involves QCing (quality-control checking) and editing the FileMaker metadata that WGBH receives along with the assets from every production, checking for consistency across the different deliverables, and getting familiar with the master database and the institutional workflow as it currently stands.

At the time I arrived, the archives staff had only recently acquired a set of LTO-6 decks – the latest generation of the Linear Tape-Open magnetic tape data storage technology. As I got more comfortable with the needs of the archive, I also started looking at ways to make the LTO-6 workflow more standardized, and eventually wrote a script to automate the generation and comparison of checksums for files during batch ingest (a simplified sketch follows below). These smaller-scale projects served as building blocks for a set of workflow diagrams showing the ingest process as it exists at WGBH now, and the ingest process as we think it should exist in the future.
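To give a sense of the approach, here’s a bare-bones sketch of that checksum step. The paths, the manifest name and the choice of md5sum are placeholders for this example, not the exact details of the production script:

```bash
#!/bin/bash
# Bare-bones sketch: generate MD5 checksums on the source drive,
# copy the files, then verify the copies against the manifest.
# Paths and manifest name are placeholders, not the real setup.
# (md5sum is the GNU tool; macOS ships `md5` instead.)

SOURCE_DIR="$1"               # e.g. a mounted production hard drive
DEST_DIR="$2"                 # e.g. a mounted LTFS tape or staging disk
MANIFEST="/tmp/checksums.md5"

# 1. Checksum every file on the source, with paths relative to its root.
(cd "$SOURCE_DIR" && find . -type f -print0 | xargs -0 md5sum) > "$MANIFEST"

# 2. Copy the files across, preserving the directory structure.
rsync -a "$SOURCE_DIR"/ "$DEST_DIR"/

# 3. Re-checksum the copies and compare against the manifest.
if (cd "$DEST_DIR" && md5sum -c --quiet "$MANIFEST"); then
    echo "All checksums match."
else
    echo "WARNING: checksum mismatch detected." >&2
    exit 1
fi
```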

As for phase two, I knew that was ready to kick off when I walked into my cubicle one day and saw that it was entirely filled with boxes of hard drives…

At this point, I’d worked extensively with WGBH’s LTO system during phase one, so I was familiar with the possibilities and limitations of the technology and could put that knowledge to use in designing my own workflow for this massive ingest process. An LTO-6 tape holds about 2.5 TB of uncompressed data when formatted with the Linear Tape File System (LTFS), which allows the computer to access the tape directly, as it would a hard drive. Writing that much data from hard drive to tape over a high-speed connection such as SATA or USB 3.0 takes about four to five hours – which tracks with the numbers, since LTO-6 drives write uncompressed data at up to roughly 160 MB/s, and 2.5 TB at that rate works out to just over four hours. Over a slower connection such as USB 2.0, the same transfer can take several times longer. We also had to generate metadata to live with the files, and we had to be confident that all the information going onto the tape was 100% accurate, because data cannot be selectively deleted from an LTO tape without erasing the whole thing.
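For the curious, the LTFS side of this is surprisingly undramatic once the tape is mounted. The device name and mount point below are placeholders, and the exact ltfs invocation varies with the drive vendor’s LTFS implementation:

```bash
# Mount an LTFS-formatted LTO-6 tape as an ordinary filesystem.
# /dev/st0 and /mnt/ltfs are placeholders; the device name and
# options depend on the drive and the vendor's LTFS software.
mkdir -p /mnt/ltfs
ltfs -o devname=/dev/st0 /mnt/ltfs

# The tape now behaves like a (sequential, slow-to-seek) hard drive:
cp -R /media/ingest_drive/project_files /mnt/ltfs/

# Unmount cleanly so the LTFS index is flushed back to the tape.
umount /mnt/ltfs
```

Because writes to tape are sequential under the hood, large batched copies behave far better than lots of small, scattered ones – another reason to get the metadata right before the transfer starts.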

Taking all these factors into account, we designed a workflow that included removing all the hard drives from their casings (many of which did not include the kind of connections that we needed), barcoding them, and accessing them through high-speed docking stations. I also wrote a script to batch-generate metadata in the Dublin Core-based PBCore standard for audiovisual material, incorporating technical information from the media file analysis program MediaInfo as well as MD5 checksums, before transferring files to LTO; a simplified sketch appears below. While it’s not all running perfectly smoothly yet, and there are always new complications to discover, at this point the workflow is streamlined enough that I can dedicate the hours when the computer is processing checksums or transfers to working on phase three of the project.
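The metadata script itself is more involved than I can show here, but the core idea can be sketched briefly. The element set below is a pared-down illustration of a PBCore instantiation record, and the paths, file pattern and MediaInfo fields are examples rather than the full records we actually generate:

```bash
#!/bin/bash
# Sketch: wrap MediaInfo output and an MD5 checksum in a minimal
# PBCore instantiation record. Illustrative only - the paths, file
# pattern and element set are examples, not production values.

for f in /media/ingest_drive/*.mov; do
    md5=$(md5sum "$f" | awk '{print $1}')
    format=$(mediainfo --Inform="General;%Format%" "$f")
    size=$(mediainfo --Inform="General;%FileSize%" "$f")
    duration=$(mediainfo --Inform="General;%Duration/String3%" "$f")

    cat > "${f%.mov}_pbcore.xml" <<EOF
<pbcoreInstantiationDocument xmlns="http://www.pbcore.org/PBCore/PBCoreNamespace.html">
  <instantiationIdentifier source="local">$(basename "$f")</instantiationIdentifier>
  <instantiationDigital>$format</instantiationDigital>
  <instantiationFileSize unitsOfMeasure="bytes">$size</instantiationFileSize>
  <instantiationDuration>$duration</instantiationDuration>
  <instantiationAnnotation annotationType="MD5 checksum">$md5</instantiationAnnotation>
</pbcoreInstantiationDocument>
EOF
done
```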

Phase three is a new addition to the project plan as initially outlined by WGBH – it became part of the task list last summer, when WGBH started receiving very worrisome reports that a high percentage of the video files being sent for inclusion in the American Archive project were failing. These files were either showing severe signs of corruption in QC or failing to open at all. The persistent difficulties the archives department had in pulling these files off of WGBH’s institutional LTO-4 tapes over networked storage were part of the impetus for WGBH acquiring its own directly connected LTO-6 decks, which in turn made possible all the work described above. Now my job is to analyze the failures and see if I can figure out why they occurred.

While I’m still in the beginning phases of this research, so far I’ve managed to rule out the idea that the files are being corrupted on the tape itself: checksum analysis of the files stored on the tapes reveals that they still have the same unique signatures they did before they were written to LTO. I’ve also discovered that the variety of failure types represented is most likely due to differences in the structure of the video files themselves – specifically, in the placement of the “moov atom,” the section of a QuickTime media file that contains the structural metadata for the file as a whole, and without which the file cannot be read.
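To make that concrete: because a QuickTime container is just a sequence of atoms, each starting with a four-byte length and a four-byte type code, you can walk the top-level atoms of a file with a short shell script and see where the moov atom sits relative to the mdat atom (which holds the actual media data). This is a bare-bones sketch – it ignores 64-bit extended atom sizes, which very large video files frequently use:

```bash
#!/bin/bash
# Bare-bones sketch: print the order and size of the top-level atoms
# in a QuickTime/MP4 file. Ignores 64-bit extended sizes (atom size
# field of 1), which large files often need, so treat it as a demo.

file="$1"
total=$(stat -c%s "$file")     # GNU stat; on macOS use: stat -f%z
offset=0

while [ "$offset" -lt "$total" ]; do
    # Each atom header is 8 bytes: big-endian length + 4-char type code.
    header=$(dd if="$file" bs=1 skip="$offset" count=8 2>/dev/null | xxd -p)
    atom_size=$((16#${header:0:8}))
    atom_type=$(echo "${header:8:8}" | xxd -r -p)
    echo "offset $offset: '$atom_type' ($atom_size bytes)"
    # Size 0 means "to end of file"; size 1 means an extended 64-bit
    # size follows - both fall outside this sketch, so stop there.
    [ "$atom_size" -lt 8 ] && break
    offset=$((offset + atom_size))
done
```

On a well-formed progressive file you’d typically see ftyp, then moov, then mdat; a file whose moov atom trails the mdat, or whose atom chain is truncated, is exactly the kind of structural difference this research is chasing.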

Atomic structure of a QuickTime video file, showing the various elements of data that make up the file structure (generated by Apple’s Atom Inspector).

I recently gave a presentation about these failures and my research into the problem at Code4Lib 2015 and will be sharing my slides, as well as my planned next steps for investigation, on the NDSR Boston blog.

As for phase four – creating a video webinar – well, that’s not coming up for another two months, which is probably a good thing given that this post is already getting too long.

My residency work so far has involved everything from graphing out workflows to writing bash scripts to batch-editing XML metadata to taking apart hard drives (and putting them back together, and then taking them apart again…). There’s a new and unexpected challenge to conquer every day – it keeps me on my toes in the best possible way. One of the greatest things about digital preservation as a field is how it forces us all to constantly keep learning and pushing our work forward, and I’m incredibly excited to be a part of it.
