
Format Migrations at Harvard Library: An NDSR Project Update


The following is a guest post by Joey Heinen, National Digital Stewardship Resident at Harvard University Library.

Joey Heinen

As has been famously outlined by the Library of Congress on their website on sustainability factors for digital formats, digital material is just as susceptible to obsolescence as analog formats. Within digital preservation there are a number of strategies that can be employed in order to protect your data including refreshing, emulation or migration, to name a few. As the National Digital Stewardship Resident at Harvard Library, I am responsible for developing a format migration framework which can be continuously adapted for migration projects at Harvard.

In order to test the viability of this framework, I am also planning for migration of three obsolete formats within the Digital Repository Service (DRS) – Kodak PhotoCD, SMIL playlists and RealAudio. While each format will have its own challenges for a standard workflow, there are certain processes which will always be incorporated into the overall migration framework. In a sense I am helping to create a series of incantations that must be uttered in order to raise these much-cherished digital materials back from the dead. No sage-burning necessary.

Migration is the chosen digital preservation strategy for this project since the aim of migration is to move content from its previously tenuous origins to a format with much greater promise in terms of support and usage. Our overall goal is to continue to provide remote access on modern platforms in a way that best matches the original format.

A Framework Emerges – First Steps

I began my residency by performing a broad literature review on the status of migration projects across the library field. This was a great way to acquaint myself with the terrain, but gaining greater depth required working through some real examples and understanding the institutional context of Harvard – its staff structure, its resources, its policies and its digital repository. As I bounced back and forth between the broader framework and the individual format plans, some patterns began to emerge. After further processing, we have arrived at some core attributes that will inform the overall framework. The specifics of this framework are still in development and are much too large to narrate here, but I’ll discuss some of the most distinct themes.

Stakeholder involvement

The mention of “stakeholder involvement” first is deliberate – without gaining a sense for the “who,” the project cannot commence. Depending on the type of content, the exact cast of characters may vary but the types of roles will stay somewhat consistent. For the framework, we identified the following key areas of responsibility and corresponding responsible parties:

  • Project Management (that’s me!).
  • Technical Guidance/Format Experts (those who understand the format best).
  • Documentation (that’s me too! Though gathering provenance and creation of documentation throughout the migration may originate from other departments, depending).
  • Quality Assurance/Plan Approval (that’s pretty much everyone but at different points in the process).
  • Systems Conformance/Technical Infrastructure (this is almost always our friends in Library IT and Metadata, who inform us of how the plan does or does not comply with current technological procedures and infrastructure).
  • Content Ownership (curators or collection managers; their involvement is generally just to be informed of major decisions).

Defined Project Phases

In general, our migration plans can be broken down into these essential phases:

  • Planning for the Test.
  • Testing.
  • Refining the Plan.
  • Executing the Plan.
  • Verifying Results and Project Wrap-Up.

From these project phases, we then defined the following within each phase:

  • Workflow Activities – essential steps in the migration workflow.
  • Workflow Components – ways of grouping the more granular activities.
  • Project Deliverables – these could take the form of: the migrated content itself; documentation or metadata generated along the way; diagrams of the workflow and the migration path (e.g. how the content’s relationship to the Harvard repository will change from pre- to post-migration); or new revelations in digital preservation policies, e.g. storage and retention plans.
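For readers who think in data structures, the phase breakdown above can be sketched roughly as follows. This is only an illustration, not Harvard's actual framework: the five phase names come from this post, but the class names, the example activity and deliverable, and the placement of "Format/Tools Research" under the first phase are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass, field

# Phase names are taken from the post; everything else is hypothetical.
PHASES = [
    "Planning for the Test",
    "Testing",
    "Refining the Plan",
    "Executing the Plan",
    "Verifying Results and Project Wrap-Up",
]


@dataclass
class Component:
    """A grouping of more granular workflow activities."""
    name: str
    activities: list[str] = field(default_factory=list)


@dataclass
class Phase:
    """One project phase with its components and deliverables."""
    name: str
    components: list[Component] = field(default_factory=list)
    deliverables: list[str] = field(default_factory=list)


def build_plan() -> list[Phase]:
    """Create an empty plan with one Phase per name, then fill in a sample."""
    plan = [Phase(name) for name in PHASES]
    # Assumed example content for the first phase only.
    plan[0].components.append(
        Component(
            "Format/Tools Research",
            ["Gather format documentation (assumed activity)"],
        )
    )
    plan[0].deliverables.append("Test plan document (assumed deliverable)")
    return plan


if __name__ == "__main__":
    for phase in build_plan():
        print(phase.name, "-", len(phase.components), "component(s)")
```

Keeping activities nested inside components, and components inside phases, mirrors the way the framework groups granular steps without fixing what those steps are for any one format.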

Last but not least, we want to consider how other projects within the library might impact the migration plan, whether through timing and staff availability or through changes to the infrastructure on which the migration depends. For example, the metadata from Harvard’s DRS is being migrated to a new version of the DRS which includes changes to how relationships between files and objects are described. The relationship structure of still image objects will be completely different before and after this metadata migration, so a plan to migrate the Kodak PhotoCD files will need to take this into consideration.

Format Specifics – Examples

In terms of how this framework has been applied to the actual formats, we have made the most progress on Kodak PhotoCD, mostly because it’s less complex and less staff-intensive than the SMIL/RealAudio formats. So far we have completed the analysis, the creation of the test and the testing itself, and we are beginning to define how the old image objects will change to accommodate the migrated content, additional artifacts (e.g. metadata) and the new content model structure. The details of our decisions around successfully migrating PhotoCD content are too extensive for this post (though more information can be found on the NDSR blog). However, the Migration Workflow and Migration Pathway diagrams shown here help to show “how the sausage is made.”

Migration Workflow

The Migration Workflow demonstrates every step of the process from gathering documentation for initial analysis to ingest of the migrated content into the repository. In the example at left, we see the first two components of Phase 1 of the Migration Workflow – Format/Tools Research and Confirming Migration Criteria. As shown in the corresponding legend, stakeholder involvement is indicated by a colored box that names the stakeholder group within each component. These roles are based on the RACI Responsibility Assignment Matrix, which defines four levels of responsibility: Responsible, Accountable, Consulted and Informed.
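To make the RACI idea concrete, here is a minimal sketch of how stakeholder groups might be mapped to workflow components. The component and stakeholder names come from this post, but the specific level assigned to each group is invented for illustration and does not reflect the actual diagram; the helper function names are likewise hypothetical.

```python
from enum import Enum


class Raci(Enum):
    """The four RACI levels of responsibility."""
    RESPONSIBLE = "R"
    ACCOUNTABLE = "A"
    CONSULTED = "C"
    INFORMED = "I"


# Hypothetical assignments: names are from the post, levels are assumed.
MATRIX = {
    "Format/Tools Research": {
        "Project Management": Raci.ACCOUNTABLE,
        "Technical Guidance/Format Experts": Raci.RESPONSIBLE,
        "Content Ownership": Raci.INFORMED,
    },
    "Confirming Migration Criteria": {
        "Project Management": Raci.RESPONSIBLE,
        "Quality Assurance/Plan Approval": Raci.ACCOUNTABLE,
        "Systems Conformance/Technical Infrastructure": Raci.CONSULTED,
        "Content Ownership": Raci.INFORMED,
    },
}


def stakeholders_with(level: Raci, component: str) -> list[str]:
    """Return the stakeholder groups holding a given level for a component."""
    roles = MATRIX[component]
    return sorted(group for group, assigned in roles.items() if assigned is level)


def components_missing_accountable() -> list[str]:
    """List components that do not have exactly one Accountable group."""
    return [
        component
        for component, roles in MATRIX.items()
        if list(roles.values()).count(Raci.ACCOUNTABLE) != 1
    ]


if __name__ == "__main__":
    print(stakeholders_with(Raci.INFORMED, "Format/Tools Research"))
    print(components_missing_accountable())
```

The second helper reflects a common RACI convention, that each unit of work has exactly one Accountable party; whether the Harvard workflow enforces that rule is not stated here.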

Migration Pathway

The Migration Pathway diagram (at right) shows how content will be transformed by a migration. A diagram is produced for each “bucket” of content for which the same tools, settings and outputs can be used uniformly based on shared technical characteristics. This example, from the Horblit Collection, a collection of daguerreotypes initially digitized in PhotoCD form, shows the ways in which the original PhotoCD content as found within the DRS will be converted and newly packaged and ingested into the repository. It considers how the image objects look now (DRS1), how they will look after the metadata migration (DRS2) and how the object will look after the content is migrated.
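The "bucket" idea can be sketched as a simple grouping step: files that share the same technical characteristics land in the same bucket and can follow the same pathway. The records, field names and values below are hypothetical and are not drawn from the DRS; which characteristics actually define a bucket would depend on the format analysis.

```python
from collections import defaultdict

# Hypothetical file records; identifiers and field names are invented.
FILES = [
    {"id": "img-001", "format": "PhotoCD", "color_space": "PhotoYCC", "max_resolution": "16Base"},
    {"id": "img-002", "format": "PhotoCD", "color_space": "PhotoYCC", "max_resolution": "16Base"},
    {"id": "img-003", "format": "PhotoCD", "color_space": "PhotoYCC", "max_resolution": "64Base"},
]


def bucket_key(record: dict) -> tuple:
    """Build the key of shared characteristics (an assumed choice of fields)."""
    return (record["format"], record["color_space"], record["max_resolution"])


def group_into_buckets(records: list) -> dict:
    """Group file ids by their shared technical characteristics."""
    buckets = defaultdict(list)
    for record in records:
        buckets[bucket_key(record)].append(record["id"])
    return dict(buckets)


if __name__ == "__main__":
    for key, ids in group_into_buckets(FILES).items():
        print(key, "->", ids)
```

Each resulting key would correspond to one pathway diagram, with one set of tools, settings and outputs applied to every file in that group.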

In the two months remaining for my residency I will be completing the overall framework, and working on the Kodak PhotoCD and SMIL/RealAudio plans (though execution of these plans will certainly fall outside of this timeline). After planning for the format-specific migration and going through several passes at the overall framework, we are getting closer to an actionable model for ongoing migration projects.

It has been fascinating to oscillate between deep analysis of the technical and infrastructural challenges faced with each format and finding ways to abstract these processes into a template that can be continuously adapted. The result will certainly be of use to Harvard, and our hope is that sharing it with the larger digital preservation field will make it useful to others as well. For the finalized spells and incantations, check the NDSR blog or Harvard website at the end of May. Presto Change-o!
