Format Migrations at Harvard Library: An NDSR Project Update

The following is a guest post by Joey Heinen, National Digital Stewardship Resident at Harvard University Library.

Joey Heinen


As has been famously outlined by the Library of Congress on their website on sustainability factors for digital formats, digital material is just as susceptible to obsolescence as analog formats. Within digital preservation there are a number of strategies that can be employed in order to protect your data including refreshing, emulation or migration, to name a few. As the National Digital Stewardship Resident at Harvard Library, I am responsible for developing a format migration framework which can be continuously adapted for migration projects at Harvard.

In order to test the viability of this framework, I am also planning for migration of three obsolete formats within the Digital Repository Service (DRS) – Kodak PhotoCD, SMIL playlists and RealAudio. While each format will have its own challenges for a standard workflow, there are certain processes which will always be incorporated into the overall migration framework. In a sense I am helping to create a series of incantations that must be uttered in order to raise these much-cherished digital materials back from the dead. No sage-burning necessary.

Migration is the chosen digital preservation strategy for this project since the aim of migration is to move content from its previously tenuous origins to a format with much greater promise in terms of support and usage. Our overall goal is to continue to provide remote access on modern platforms in a way that best matches the original format.

A Framework Emerges – First Steps

I began my residency by performing a broad literature review on the status of migration projects across the library field. This was a great way to acquaint myself with the terrain, but gaining greater depth required working through real examples and understanding the institutional context of Harvard – its staff structure, its resources, its policies and its digital repository. As I bounced back and forth between the broader framework and the individual format plans, some patterns began to emerge. After further processing, we have arrived at some core attributes that will inform the overall framework. The specifics of this framework are still in development and are much too large to narrate here, but I’ll discuss some of the most distinct themes.

Stakeholder involvement

The mention of “stakeholder involvement” first is deliberate – without gaining a sense for the “who,” the project cannot commence. Depending on the type of content, the exact cast of characters may vary but the types of roles will stay somewhat consistent. For the framework, we identified the following key areas of responsibility and corresponding responsible parties:

  • Project Management (that’s me!).
  • Technical Guidance/Format Experts (those who understand the format best).
  • Documentation (that’s me too! Though gathering provenance and creating documentation throughout the migration may fall to other departments, depending on the content).
  • Quality Assurance/Plan Approval (that’s pretty much everyone but at different points in the process).
  • Systems Conformance/Technical Infrastructure (this is almost always our friends in Library IT and Metadata, who inform us of how the plan does or does not comply with current technological procedures and infrastructure).
  • Content Ownership (curators or collection managers, involvement is generally just to be informed of major decisions).

Defined Project Phases

In general, our migration plans can be broken down into these essential phases:

  • Planning for the Test.
  • Testing.
  • Refining the Plan.
  • Executing the Plan.
  • Verifying Results and Project Wrap-Up.

From these project phases, we then defined the following within each phase:

  • Workflow Activities – essential steps in the migration workflow.
  • Workflow Components – ways of grouping the more granular activities.
  • Project Deliverables – this could take on the form of: the migrated content itself; documentation or metadata generated along the way; diagrams of the workflow and the migration path (e.g. how the content in relation to the Harvard repository will change from pre- to post-migration); or new revelations in digital preservation policies e.g. storage and retention plans.

Last but not least, we want to consider how other projects within the library might impact the migration plan, whether through timing and staff availability or through changes to the infrastructure that supports migration. For example, the metadata from Harvard’s DRS is being migrated to a new version of the DRS which includes changes to how relationships between files and objects are described. The relationship structure of still image objects will be completely different before and after this metadata migration, so a plan to migrate the Kodak PhotoCD files will need to take this into consideration.

Format Specifics – Examples

In terms of how this framework has been used on the actual formats, we have made the most progress on Kodak PhotoCD, mostly because it’s less complex and less staff intensive than the SMIL/RealAudio formats. So far we have completed the analysis, the creation of the test and the testing itself, and we are beginning to define how the old image objects will change with the inclusion of migrated content, additional artifacts (e.g. metadata) and the new content model structure. The details of our decisions around successfully migrating PhotoCD content are too lengthy for this post (though more information can be found on the NDSR blog). However, the Migration Workflow and Migration Pathway diagrams shown here help to show “how the sausage is made.”

Migration Workflow


The Migration Workflow demonstrates every step of the process, from gathering documentation for initial analysis to ingest of the migrated content into the repository. In the example at left, we see the first two components of Phase 1 of the Migration Workflow – Format/Tools Research and Confirming Migration Criteria. As is shown in the corresponding legend, stakeholder involvement is indicated by a colored box which names the stakeholder group within each component. These roles were designed based on the RACI Responsibility Assignment Matrix, which defines four levels of responsibility: Responsible, Accountable, Consulted and Informed.
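As a rough illustration of how a RACI-style assignment can be recorded per workflow component, here is a minimal Python sketch. The two component names and the stakeholder groups come from this post, but the specific level assigned to each group is a hypothetical placeholder, not the assignment shown in our diagram. The helper names (`stakeholders_with`, `components_missing_accountable`) are likewise invented for this example.

```python
from enum import Enum


class Raci(Enum):
    """The four RACI levels of responsibility."""
    RESPONSIBLE = "R"
    ACCOUNTABLE = "A"
    CONSULTED = "C"
    INFORMED = "I"


# Hypothetical assignments, for illustration only: the real diagram
# may give these stakeholder groups different levels.
RACI_MATRIX = {
    "Format/Tools Research": {
        "Project Management": Raci.ACCOUNTABLE,
        "Technical Guidance/Format Experts": Raci.RESPONSIBLE,
        "Systems Conformance/Technical Infrastructure": Raci.CONSULTED,
        "Content Ownership": Raci.INFORMED,
    },
    "Confirming Migration Criteria": {
        "Project Management": Raci.RESPONSIBLE,
        "Quality Assurance/Plan Approval": Raci.ACCOUNTABLE,
        "Technical Guidance/Format Experts": Raci.CONSULTED,
        "Content Ownership": Raci.INFORMED,
    },
}


def stakeholders_with(component, level):
    """Return the stakeholder groups holding a given level for a component."""
    roles = RACI_MATRIX[component]
    return sorted(name for name, assigned in roles.items() if assigned is level)


def components_missing_accountable():
    """List components where no group has been marked Accountable."""
    return [
        component
        for component, roles in RACI_MATRIX.items()
        if Raci.ACCOUNTABLE not in roles.values()
    ]


if __name__ == "__main__":
    print(stakeholders_with("Format/Tools Research", Raci.CONSULTED))
    print(components_missing_accountable())
```

Keeping the matrix as plain data makes it easy to run simple checks, such as confirming that every component has an Accountable party before a plan goes out for approval.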

Migration Pathway


The Migration Pathway diagram (at right) shows how content will be transformed by a migration. A diagram is produced for each “bucket” of content, meaning a set of files that share technical characteristics and can therefore be processed uniformly with the same tools, settings and outputs. This example, from the Horblit Collection, a collection of daguerreotypes initially digitized in PhotoCD form, shows the ways in which the original PhotoCD content as found within the DRS will be converted, newly packaged and ingested into the repository. It considers how the image objects look now (DRS1), how they will look after the metadata migration (DRS2) and how the object will look after the content is migrated.
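The idea of a “bucket” can be sketched in a few lines of Python: files are grouped by a key built from their shared technical characteristics, so that each group can be handed the same tools and settings. This is only a sketch under assumptions. The characteristics chosen here (resolution and color encoding), the file identifiers and the sample values are all hypothetical, and the real criteria for our PhotoCD buckets may differ.

```python
from collections import defaultdict
from typing import NamedTuple


class ImageFile(NamedTuple):
    """Hypothetical record of the characteristics used for bucketing."""
    file_id: str
    resolution: str
    color_encoding: str


def bucket_key(image):
    """Files with the same key can share one migration pathway."""
    return (image.resolution, image.color_encoding)


def group_into_buckets(images):
    """Group file identifiers by their shared technical characteristics."""
    buckets = defaultdict(list)
    for image in images:
        buckets[bucket_key(image)].append(image.file_id)
    return dict(buckets)


# Illustrative sample data, not drawn from the actual repository.
FILES = [
    ImageFile("img-001", "3072x2048", "PhotoYCC"),
    ImageFile("img-002", "1536x1024", "PhotoYCC"),
    ImageFile("img-003", "3072x2048", "PhotoYCC"),
]

if __name__ == "__main__":
    for key, file_ids in group_into_buckets(FILES).items():
        print(key, file_ids)
```

Each resulting key would correspond to one Migration Pathway diagram, since every file under that key can be converted with the same tools, settings and outputs.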

In the two months remaining for my residency I will be completing the overall framework, and working on the Kodak PhotoCD and SMIL/RealAudio plans (though execution of these plans will certainly fall outside of this timeline). After planning for the format-specific migration and going through several passes at the overall framework, we are getting closer to an actionable model for ongoing migration projects.

It has been fascinating to oscillate between deep analysis of the technical and infrastructural challenges faced with each format and finding ways to abstract these processes into a template that can be continuously adapted. The result will certainly be of use to Harvard, and our hope is that, by sharing it with the larger digital preservation field, it will be useful to others as well. For the finalized spells and incantations, check the NDSR blog or Harvard website at the end of May. Presto Change-o!
