Digital Formats, part 1: Lots of ’Em and More to Come

The following is a guest post by Carl Fleischhauer, a Digital Initiatives Project Manager in NDIIPP.

Since 2004, the Library has published information about digital formats to support preservation planning at our institution and at our sister archives.

This is, of course, an open-ended activity: new formats come along with some regularity and, in any case, it is often the old ones (obsolete or heading in that direction) that are of special preservation concern.

We distinguish among what you might call subformats.  This granularity lets preservation specialists express their format preferences precisely and offers targeted information to those who are planning future format migrations or system emulations.  For example, our online PDF family currently has fifteen separate descriptions, including entities like versions 1.3, 1.4, 1.6, and 1.7, and the multiple flavors of PDF/A (three now, four more on the way).  The MPEG-4 family is splintered into something like thirty Web-page descriptions, some of which document very narrow distinctions.

Sorting out PDF relationships: the format team's working diagram.

The description pages do not provide a quantitative ranking for how well a given format will serve the needs of long-term management of digital content.  However, each description provides some information about seven sustainability factors that an evaluator ought to consider: disclosure (is there a specification, can you get your hands on it?), adoption (how widely used is this format?), transparency (how easy would it be to decipher the bitstream?), self-documentation (does the format permit the embedding of metadata?), external dependencies (do you need a piece of hardware to open the file?), impact of patents (are there patents that inhibit use or content migration?), and technical protection mechanisms (are there elements that could be used to lock the file?).

The descriptions also include information about quality and functionality factors that characterize each format’s ability to represent the significant characteristics of a given content item required by current and future users.

Our descriptions cover file formats (as indicated by file extensions, MIME type, magic number, etc.), the bitstream encodings that underlie certain file formats (PCM audio, XML, etc.), as well as wrappers and bundling formats.
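To make the "magic number" idea concrete, here is a minimal Python sketch that guesses a file format from its leading bytes. The signatures listed are well-known published ones, but the table, the labels, and the function name `sniff_format` are illustrative assumptions, not something taken from our format descriptions.

```python
import io

# A few well-known leading signatures ("magic numbers").
# This short table is illustrative only and far from complete.
MAGIC_NUMBERS = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]


def sniff_format(stream):
    """Return a format label based on the first bytes, or None if unknown."""
    header = stream.read(16)
    for signature, label in MAGIC_NUMBERS:
        if header.startswith(signature):
            return label
    return None


# Usage: an in-memory stand-in for the start of a PDF file.
print(sniff_format(io.BytesIO(b"%PDF-1.4\n")))
```

A signature match is only a hint: it says what a file claims to be, not which version or profile it conforms to, so finer distinctions need deeper inspection.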

At the moment, there are 260 format descriptions on the Web site and, if we could wave a magic wand, we’d add another fifty or sixty at once.  We want these informative pages (“human readable”) to complement the important Unified Digital Formats Registry (“machine actionable”).  The UDFR is currently under development at the California Digital Library under the watchful eye of Stephen Abrams and with cooperative support from a number of organizations including NDIIPP.

At our Web site, we don’t categorize formats in terms of their internal structure or method of serialization.  This could be done: a few years ago, we corresponded with a Canadian colleague, Carl Eric Codère, who had a nifty breakdown for file formats.  At a high level, Codère described structured and unstructured files, the latter from an earlier day: “data formats that consisted of directly dumping the memory images of one or more structures into the file.”

Structured formats, Codère said, fell into two broad classes: chunk based (“each piece of data is embedded in a container that contains a signature identifying the data, as well the length of the data for binary encoded files”) and directory based (“extensible format that closely resembles a file system . . . where the file is composed of ‘directory entries’ that contain the location of the data within the file itself as well as its signatures”).
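The chunk-based idea can be sketched in a few lines of Python. The layout assumed here (a 4-byte signature followed by a 4-byte big-endian length and then the data) is a hypothetical one chosen for illustration; real chunk-based formats differ in byte order, padding, and nesting, and the names `iter_chunks` and `make_chunk` are invented for this example.

```python
import io
import struct


def iter_chunks(stream):
    """Yield (signature, data) pairs from a stream of simple chunks.

    Assumed layout per chunk: 4-byte signature, 4-byte big-endian
    unsigned length, then exactly that many bytes of data.
    """
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of stream
        if len(header) < 8:
            raise ValueError("truncated chunk header")
        signature, length = struct.unpack(">4sI", header)
        data = stream.read(length)
        if len(data) < length:
            raise ValueError("truncated chunk data")
        yield signature, data


def make_chunk(signature, data):
    """Build one chunk in the assumed layout (helper for the demo)."""
    return struct.pack(">4sI", signature, len(data)) + data


# Usage: two chunks written back to back, then read in order.
sample = make_chunk(b"HEAD", b"\x01\x02") + make_chunk(b"BODY", b"hello")
for signature, data in iter_chunks(io.BytesIO(sample)):
    print(signature, len(data))
```

Because every chunk announces its own length, a reader can skip chunks whose signature it does not recognize. A directory-based format would instead read a table of offsets first and then seek to each entry.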

For preservation planning, however, we have found it preferable to categorize formats by the content domain or domains that they serve.  In 2004, we led off with a quartet of categories familiar to us from our digitization of Library of Congress collections: still image, sound, textual, and moving image.  Then, in following years, we added web archive, dataset (still thinly represented), and generic (also thin).  This year, we are pleased to report that we have a reasonable start at geospatial formats.  That will be the topic for tomorrow’s blog post.

Edit added on 12/22/11: You can find part two of this series here.

