Format Migration: More Launching Points for Applied Research

In June I did a post highlighting segments of the digital stewardship universe that could use applied research attention. I have since looked at the “what” and the “how” of email archiving in two follow-up posts, and now I turn my attention to format migration.

Migration is not a crime by user pshab on <a href="http://www.flickr.com/photos/pshab/1366448271/in/photolist-35KpsX-dkStqe-AkeFU-vKwRb-7LRn37-7LRoGf-7LMrn6-7LMpH4-7LRmcf-7LRkNJ-7LRmA9-7LMqaP-7LMnJa-5i5Sr1-BpDWQ-dJTeCQ-a6E3eK-7GR4fx-brcNvp-CJXYR-bk6o5A-5qZGRU-by1gsc-88DAmF-9fibZ6-Ac7GA-crX8P5-c4De57-bpgcFN-6bau23-dqKcPe-dqKkLY-dqKnNb-dqKbMZ-dqKmHY-dqKc9T-dqKdDi-dqKo9E-dqKmkj-dqKddr-dqKcC2-aufPRg-crXaeU-9T2J7A-6vGMgg-daNhr6-daNmEd-daNpYb-9oJWGd-8heKtM-6ecQuW/" target="_blank">Flickr</a>

The need to migrate file formats arises out of concerns about format obsolescence. As I mentioned in my original post, there are ongoing discussions about how acute the format obsolescence problem might be, but for the purpose of this post we’re going to assume that migration is a possible solution to digital stewardship challenges and concentrate on useful resources that support the activity.

In my original post I proposed a series of largely technical questions that a researcher might ask regarding format migration, mostly about what happens to files and the information they contain in a migration process. This time around we’ll look at the infrastructure needed to do format migration and in a future post we’ll look at the results of a few migration experiments.

The first piece of infrastructure is the format registry. Format registries, such as PRONOM, developed by the UK National Archives, and the Unified Digital Format Registry, developed by the University of California, provide detailed documentation about data file formats and their supporting software products. The registries are important because we need to know as much as possible about the documented state of a format before we can understand what changes take place in a transformation.

[And while it’s not a format registry, the Library’s Sustainability of Digital Formats site has a lot of useful information in this area.]

The next piece is the set of tools that draw on the registry information to support the automated identification of file formats. Some interesting tools include FIDO, the Format Identification for Digital Objects Python command-line tool; the DROID Digital Record Object Identification tool; and JHOVE and JHOVE2. Each of these tools supports file format identification, validation and characterization to varying degrees, though I’m not qualified to discuss their significant differences (I’ll let the developers point them out in the comments!).

They’re all similarly interesting for our purposes in that they allow the “identification” process to be incorporated into automated workflows along with a suite of other identification/characterization/migration/evaluation tools.
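To make the “identification” step a little more concrete, here is a minimal Python sketch of signature-based identification, the general idea of matching a file’s leading bytes against known byte patterns. It is not how FIDO, DROID or JHOVE are implemented, and it does not read any registry. The signature table, the labels and the function names (identify_bytes, identify_file) are assumptions made up for this illustration; only the byte patterns themselves are the well-known leading bytes of those formats.

```python
from typing import Optional

# Illustrative signature table: (label, byte offset, magic bytes).
# The labels are plain names chosen for this sketch, not registry identifiers.
SIGNATURES = [
    ("PNG image", 0, b"\x89PNG\r\n\x1a\n"),
    ("PDF document", 0, b"%PDF-"),
    ("JPEG image", 0, b"\xff\xd8\xff"),
    ("GIF image", 0, b"GIF87a"),
    ("GIF image", 0, b"GIF89a"),
    ("ZIP container", 0, b"PK\x03\x04"),
]


def identify_bytes(data: bytes) -> Optional[str]:
    """Return the label of the first matching signature, or None."""
    for label, offset, magic in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


def identify_file(path: str, header_size: int = 16) -> Optional[str]:
    """Read only the first bytes of a file and try to identify them."""
    with open(path, "rb") as handle:
        return identify_bytes(handle.read(header_size))
```

A function like this can be called once per file inside a larger workflow, with the result recorded before any migration is attempted. Real identification tools go much further, for example by distinguishing format versions and by falling back to other evidence when no signature matches.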

Not this kind of migration. "Tracy (vicinity), California..." Photo courtesy of the Library of Congress.

The next thing you need is files to migrate. I’m sure you’ve got plenty of your own, but if you’re working at scale you may want to access large corpora of data such as that provided by BioMed Central. The Planets testbed was a very effective research environment hosted by the European Planets project to facilitate practical experimentation in digital stewardship and to enable users to repeat experiments in order to validate the results, but I’m still trying to clarify its current status. The successor to Planets, the Open Planets Foundation, does maintain a Formats Corpus.

On a side note, the National Software Reference Library has a research computing environment containing some 18,000,000 unique original files, along with a database containing metadata about the files. Researchers can run an algorithm against the file collection by submitting a job (in code form) to the NSRL, which runs it for them.

Last but not least you need software tools to do the migrations. Here is where it starts to get complicated. A great place to start is the work being done by SCAPE, the SCAlable Preservation Environments project funded by the European Union and coordinated by the Austrian Institute of Technology. They’ve authored a report that looks at what they call “preservation action tools” developed under the Planets, CRiB and RODA projects. The paper introduces models for assessing the appropriateness of any particular piece of software for preservation migration purposes.

Another useful site is the Conversion Software Registry maintained by the Image and Spatial Data Analysis Group at the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. The registry is a repository of information about software packages that are capable of file format conversions, and it is particularly useful for identifying conversion paths between formats.
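The idea of a “conversion path” can be sketched as a graph search: formats are nodes, each tool’s source-to-target capability is an edge, and a breadth-first search finds a path with the fewest conversion steps. The Python below is only an illustration under stated assumptions. It does not query the Conversion Software Registry, and the tool names, format labels and the find_conversion_path function are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Step = Tuple[str, str, str]  # (tool, source format, target format)

# Hypothetical conversion capabilities. The tool names are placeholders,
# not entries taken from any real registry.
CONVERSIONS: List[Step] = [
    ("tool_a", "doc", "odt"),
    ("tool_a", "odt", "pdf"),
    ("tool_b", "tiff", "png"),
    ("tool_c", "png", "jp2"),
    ("tool_d", "doc", "pdf"),
]


def find_conversion_path(
    conversions: List[Step], source: str, target: str
) -> Optional[List[Step]]:
    """Breadth-first search for a conversion chain with the fewest steps."""
    # Build an adjacency list: format -> [(tool, next format), ...]
    graph: Dict[str, List[Tuple[str, str]]] = {}
    for tool, src, dst in conversions:
        graph.setdefault(src, []).append((tool, dst))

    queue = deque([(source, [])])
    visited = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for tool, nxt in graph.get(fmt, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(tool, fmt, nxt)]))
    return None  # no chain of tools connects source to target


print(find_conversion_path(CONVERSIONS, "tiff", "jp2"))
```

Fewest steps is only one possible criterion. Every conversion is a chance to lose information, so a short path is a reasonable default, but a real assessment would also weigh how well each tool performs on the content in question.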

There are proprietary tools already used in some domains (such as the geospatial community) that support the mass transformation of data across multiple formats, but they’re designed more to support the movement of data between databases and applications. It’s not clear to what degree (if any) they’ve considered preservation as a significant use for their tools, but it’s an area for future exploration.

In a future post we’ll take a closer look at the outputs from some migration efforts. Feel free to identify experiments or other migration tools and services in the comments.

3 Comments

  1. Tibaut Houzanme
    August 15, 2013 at 3:28 pm

    This is a well-documented summary of format migration as one preservation option. The area where I see an opportunity relates to the quality of conversion. There, available tools ought to be compared based on the quality of their output, so that a ranking is not just based on how many views or downloads they garner, but also on how well they perform specific functions such as proper object migration, resolution fidelity, and integrity of object metadata migration (system, physical characteristics, intellectual characteristics and any added context).
    Ways to validate the fidelity and/or integrity of the migration also appear to me to be an interesting area of research.

  2. Butch Lazorchak
    August 22, 2013 at 10:08 am

    Tibaut,

    Thanks for the kind words. You’ve definitely identified what seems to be the next step in exploring the viability of format migration as a preservation option. That is, taking a rigorous look at the migration tools and documenting their effectiveness across a range of content types.

    I am looking to expose these types of research in a future blog post and want to encourage readers to share their knowledge of efforts like this, if they exist.

  3. Tibaut Houzanme
    August 28, 2013 at 4:55 pm

    Butch,

    Thanks to you too for your openness and appreciation of contributions.

    A couple of publications I am aware of that deal with migration reliability:
    – Assessing migration risk for scientific data formats (IJDC, 2012)/Chris Frisz et al.
    – Using a multi-criteria decision making approach to evaluate format migration solutions (Feng Luan et al.)
    – A mathematical framework for modeling and analyzing migration time (Feng Luan et al.)
    – There is one study I came across last year that dealt with the migration of photographs and how well the different software packages performed at a technical level. I can’t seem to find it now, but will keep looking.

    Hope these are useful somehow,
