Every Format on the Face of the Planet

Some weeks ago I gave a presentation that I jokingly titled “The Challenges of Preserving Every Digital Format on the Face of the Planet.”

Except it’s not really a joke.

When File Formats Become Graffiti... Bethnal Green, London, from Flickr User DG Jones, http://www.flickr.com/photos/dgjones/1225183400/

We often have little or no control over what comes into the Library of Congress Digital Collections, and we manage and preserve a wide variety of formats.  One collection brings in TIFF, JPEG, JPEG2000, and XML.  Another brings in MPEG-4, MP3, BWF, AVI, and a wide variety of specialized commercial media formats.  Another brings in JPEG, PDF, XML and a variety of metadata formats.  One is all JSON.  One is every flavor of GIS data.  The Web Archives include every format which has ever appeared online.  And yet another collection has, in 18 months, included almost 50 different file extensions (but not that many actual file types) with a huge number of metadata variations.
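To illustrate why a count of file extensions can overstate the number of actual file types, here is a minimal sketch that identifies a file by its leading "magic" bytes instead of its name. The function name `sniff_format` and the short signature table are assumptions made for this example. It is not one of the characterization tools the Library uses, and it covers only a handful of the formats named above.

```python
from pathlib import Path

# A few well-known leading byte signatures. This table is illustrative
# and far from complete; real identification tools carry many more.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),   # little-endian TIFF
    (b"MM\x00*", "TIFF"),   # big-endian TIFF
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000"),
]


def sniff_format(path):
    """Guess a format from the first bytes of a file, ignoring its extension."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"
```

Under this approach, files named `scan.jpg`, `scan.jpeg`, and `scan.JPG` all report the same type, so three extensions collapse into one format.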

So how are we making this easier for the Library of Congress to manage?

On the technology side, the Library is expanding its preservation infrastructure. The Library jointly developed the BagIt transfer specification to reduce the risk of file corruption in the movement of files between and within organizations. The Library inventories all incoming files, and is inventorying all digital content so we know what formats we have. We maintain multiple copies of files on servers and on tape, in geographically distributed locations.  We’re also extending the use of content characterization tools in our internal workflows.
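As a rough illustration of what inventorying a transfer can involve, the sketch below walks a directory, records a SHA-256 checksum for each file, and tallies file extensions. The function names `sha256_of` and `inventory` are hypothetical, and the "checksum, then relative path" line layout is only loosely modeled on a checksum manifest. This is not the Library's actual tooling and not a full BagIt implementation.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory(root):
    """Return (manifest_lines, extension_counts) for every file under root."""
    root = Path(root)
    manifest_lines = []
    extensions = Counter()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(root).as_posix()
        manifest_lines.append(f"{sha256_of(path)}  {relative}")
        # Lowercase the extension so ".TIF" and ".tif" are counted together.
        extensions[path.suffix.lower() or "(none)"] += 1
    return manifest_lines, extensions
```

Recomputing the checksums after a transfer and comparing them with the recorded lines is one simple way to detect corruption in transit, and the extension tally gives a first, coarse view of what formats have arrived.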

The Library is better documenting its internal processes for making digital preservation decisions.  As part of its decision-guiding resources, the Library has documented sustainability factors for file formats, and for cases where we do have control over what comes in, we have a “Best Edition” Preferred Formats statement (PDF), which is currently being updated and expanded.  And the Library is preparing to develop Format Preservation Action Plans.

The Library cannot do it all alone, so it is also part of several preservation partnerships, including the National Digital Stewardship Alliance and the International Internet Preservation Consortium, among others.  The Library has supported the development of the soon-to-launch Unified Digital Format Registry for the larger community.

What are your organizations doing to handle the onslaught of file formats?

One Comment

  1. Peter McKinney
    June 27, 2012 at 7:37 pm

    Hi Leslie,

    At National Library NZ we’re currently sitting at around 96 different formats in our repository (this does not take into account the contents of our whole-of-domain and targeted web harvests). We’ve got three big challenges right now:
    1. the changing face of format identification (we use DROID, which is a bit of a moving landscape in terms of how it’s identifying and grouping stuff)
    2. lack of breadth in technical metadata extraction (we know that perhaps 80% of our content is covered by an extractor, but the other, probably more tricky 20% is not)
    3. the non-standardised nature of files we get in. Should we be normalising to the best version of PDF, for example…?

    Anyway, we are always very happy to work with others to solve these problems. Next pieces of work for us are: develop more adapters for our Metadata Extractor Tool; re-characterise our entire repository to bring all content up to the same version of format identification; characterise our web harvests.

