The following is a guest post by Rebecca Fraimow, National Digital Stewardship Resident at WGBH in Boston.
I have a pretty comprehensive list of goals to accomplish over the course of my time as the National Digital Stewardship Resident at WGBH’s Media, Library and Archives. They are:
- Document WGBH’s existing ingest workflow for production media and make recommendations for improvement.
- Design, implement, and complete an ingest process for over 70 hard drives worth of content created for the American Archive project, which needs to be backed up on LTO with appropriate metadata.
- Research the file failures that WGBH discovered last summer when initially pulling video files out of networked storage and putting them on aforementioned hard drives.
- Create a video webinar (or series of video webinars) putting together a set of digital media recommendations to share with other public media stations, based on everything above.
The scope of the work could have been overwhelming, but the structure of the projects has actually flowed very naturally. Starting with phase one – working with WGBH’s workflow – allowed me to ease into the daily operations of the archive. This involves QCing (that is, quality-control checking) and editing the FileMaker metadata that WGBH receives along with the assets from every production, checking for consistency across the different deliverables, and getting familiar with the master database and the institutional workflow as it currently stands.
At the time I arrived, the archives staff had only recently acquired a set of LTO-6 decks, the latest version of the Linear Tape-Open magnetic tape data storage technology. As I got more comfortable understanding the needs of the archive, I also started looking at ways to make the LTO-6 workflow more standardized and eventually wrote a script to automate the generation and comparison of checksums for files during batch ingest. These smaller-scale projects served as building blocks for the creation of a set of workflow diagrams showing the ingest process as it exists at WGBH now, and the ingest process as we think it should exist in the future.
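As an illustration of the kind of checksum generation and comparison described above, here is a minimal Python sketch. It is not the script used at WGBH; the function names (`md5_of_file`, `build_manifest`, `compare_manifests`) and the idea of comparing a source folder against a copy are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(source, copy):
    """Return (missing, mismatched) relative paths, judged against the source."""
    missing = sorted(name for name in source if name not in copy)
    mismatched = sorted(
        name for name in source if name in copy and source[name] != copy[name]
    )
    return missing, mismatched
```

Building one manifest before a transfer and another after it, then passing both to `compare_manifests`, gives a list of files that never arrived and a list of files whose contents changed along the way.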
As for phase two, I knew that was ready to kick off when I walked into my cubicle one day and saw that it was entirely filled with boxes of hard drives…
At this point, I’d worked extensively with WGBH’s LTO system during phase one, so I was familiar with the possibilities and limitations of the technology and could put that knowledge to use in designing my own workflow for this massive ingest process. An LTO-6 tape, uncompressed and formatted with the Linear Tape File System (which allows the computer to access the tape directly, as it would a hard drive), holds about 2.5 TB of data. Writing this much data from hard drive to tape over a high-speed connection such as SATA or USB 3 takes about 4-5 hours. When you’re stuck with a slower connection, such as USB 2, it takes considerably longer. We also had to generate metadata to live with the files and be confident that all the information that went onto the tape was 100% accurate, because data cannot truly be deleted from an LTO tape without erasing the whole tape.
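The 4-5 hour figure can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below assumes decimal units and two throughput values that are not from the text: 160 MB/s, the rated native speed of an LTO-6 drive, and roughly 35 MB/s as a ballpark for real-world USB 2. Actual rates depend on the drives, the files, and the connection.

```python
def transfer_hours(terabytes, megabytes_per_second):
    """Hours needed to move `terabytes` (decimal TB) at a sustained MB/s rate."""
    total_megabytes = terabytes * 1_000_000  # 1 TB = 1,000,000 MB (decimal units)
    return total_megabytes / megabytes_per_second / 3600


TAPE_CAPACITY_TB = 2.5     # uncompressed LTO-6 capacity, from the text above
LTO6_NATIVE_MB_S = 160     # assumed: the drive's rated native (uncompressed) speed
USB2_PRACTICAL_MB_S = 35   # assumed: a rough real-world USB 2 throughput

fast = transfer_hours(TAPE_CAPACITY_TB, LTO6_NATIVE_MB_S)
slow = transfer_hours(TAPE_CAPACITY_TB, USB2_PRACTICAL_MB_S)
print(f"Limited by the tape drive: {fast:.1f} h")  # 4.3 h
print(f"Limited by USB 2:          {slow:.1f} h")  # 19.8 h
```

Under these assumptions, a full tape takes a bit over four hours when the tape drive is the bottleneck and most of a day when the hard drive's connection is.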
Taking all these factors into account, we designed a workflow that included removing all hard drives from their casings (many of which did not include the kind of connections that we needed), barcoding them, and accessing them using high-speed docking stations. I also wrote a script that would allow me to batch-generate metadata in the Dublin Core-based PBCore standard for audiovisual material, incorporating technical information provided by the media file analysis program MediaInfo as well as MD5 checksums, before transferring files to LTO. While it’s not all running perfectly smoothly yet and there are always new complications to discover, at this point the workflow is streamlined enough that I can use the hours when the computer is processing checksums or transfers to work on phase three of the project.
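To give a sense of what a batch-generated record might contain, here is a minimal Python sketch that builds a PBCore instantiation record for one file. It is not the script described above. The choice of elements, the use of `instantiationAnnotation` to carry the MD5 checksum, and the function name `build_instantiation` are all assumptions for illustration, and the output has not been validated against the PBCore schema. In practice the technical values would come from MediaInfo's output; here they are passed in as plain arguments so the example does not depend on MediaInfo being installed.

```python
import xml.etree.ElementTree as ET

PBCORE_NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"


def build_instantiation(filename, location, size_bytes, md5_hex):
    """Build a minimal pbcoreInstantiationDocument element for one file."""
    # Serialize PBCore elements without a namespace prefix.
    ET.register_namespace("", PBCORE_NS)

    def child(parent, tag, text, **attributes):
        element = ET.SubElement(parent, f"{{{PBCORE_NS}}}{tag}", attributes)
        element.text = text
        return element

    root = ET.Element(f"{{{PBCORE_NS}}}pbcoreInstantiationDocument")
    child(root, "instantiationIdentifier", filename, source="filename")
    child(root, "instantiationLocation", location)
    child(root, "instantiationFileSize", str(size_bytes), unitsOfMeasure="bytes")
    # Assumption: the checksum is recorded as an annotation on the instantiation.
    child(root, "instantiationAnnotation", md5_hex, annotationType="MD5")
    return root
```

Calling `ET.tostring(build_instantiation(...), encoding="unicode")` returns the record as an XML string, which could then be written alongside the media file before it goes to tape.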
Phase three is a new addition to the project plan as initially outlined by WGBH – this became part of the task list last summer, when WGBH started receiving very worrisome reports that a high percentage of the video files they were sending to be included in the American Archive project were failing. These video files were either showing severe signs of corruption in QC or failing to open at all. The persistent difficulties that the archives department had in pulling these files off of WGBH’s institutional LTO-4 tapes over network storage were part of the impetus for WGBH acquiring their own, directly connected LTO-6 decks, which led directly to all the work I’ve described above. Now my job is to analyze the failures and see if I can figure out why they occurred.
While I’m still in the beginning phases of this research, so far I’ve managed to rule out the idea that the files are getting corrupted directly on tape; checksum analysis of the files stored on the tapes reveals that they still have the same unique signature as they did before they were written to LTO. I’ve also discovered that the variety of failure types is most likely due to the different structures of the video files themselves – specifically, differences in the placement of the “moov atom,” a section of the media file that contains structural metadata for the file as a whole, and without which the file cannot be read.
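For readers who want to see where the moov atom sits in their own files, here is a minimal Python sketch. It is not the analysis used in this research. It assumes a QuickTime/MP4-style file, in which each top-level atom begins with a 4-byte big-endian size followed by a 4-byte type code; the function names `top_level_atoms` and `moov_position` are made up for the example.

```python
import struct


def top_level_atoms(path):
    """Return [(atom_type, offset, size)] for the top-level atoms of a file."""
    atoms = []
    with open(path, "rb") as handle:
        handle.seek(0, 2)  # jump to the end to learn the file size
        file_size = handle.tell()
        offset = 0
        while offset + 8 <= file_size:
            handle.seek(offset)
            size, kind = struct.unpack(">I4s", handle.read(8))
            header = 8
            if size == 1:
                # A size of 1 means a 64-bit size follows the type field.
                extended = handle.read(8)
                if len(extended) < 8:
                    break
                size = struct.unpack(">Q", extended)[0]
                header = 16
            elif size == 0:
                # A size of 0 means the atom runs to the end of the file.
                size = file_size - offset
            if size < header:
                break  # malformed header; stop rather than loop forever
            atoms.append((kind.decode("latin-1"), offset, size))
            offset += size
    return atoms


def moov_position(path):
    """Describe where the moov atom sits relative to the mdat (media data) atom."""
    kinds = [kind for kind, _, _ in top_level_atoms(path)]
    if "moov" not in kinds:
        return "missing"
    if "mdat" not in kinds:
        return "no mdat"
    return "before mdat" if kinds.index("moov") < kinds.index("mdat") else "after mdat"
```

Because each atom declares its own size, a file that was cut short partway through a large media-data atom will report that atom and nothing after it, so a moov atom that was written at the end of the original file shows up here as "missing".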
I recently gave a presentation about these failures and my research into the problem at Code4Lib 2015 and will be sharing my slides, as well as my planned next steps for investigation, on the NDSR Boston blog.
As for phase four – creating a video webinar – well, that’s not coming up for another two months, which is probably a good thing given that this post is already getting too long.
My residency work so far has involved everything from graphing out workflows to writing bash scripts to batch editing XML metadata to taking apart hard drives (and putting them back together, and then taking them apart again…). There’s a new and unexpected challenge to conquer every day – it keeps me on my toes in the best possible way. One of the greatest things about digital preservation as a field is how it forces us all to constantly keep learning and pushing our work forward, and I’m incredibly excited to be a part of it.
As someone who regularly is backing up lots of media to LTO 6, I’d love to have access to that script you wrote for automating MD5 generation.
Feel free to reach out to me on Twitter: /todrobbins