Can I Get a Sample of That? Digital File Format Samples and Test Sets

These are my kind of samples! Photo of chocolate mayo cake samples by Matt DeTurck on Flickr

If you’ve ever been to a warehouse store on a weekend afternoon, you’ve experienced the power of the sample. In the retail world, samples are an important tool to influence potential new customers who don’t want to invest in an unknown entity. I certainly didn’t start the day with lobster dip on my shopping list but it was in my cart after I picked up and enjoyed a bite-sized taste. It was the sample that proved to me that the product met my requirements (admittedly, I have few requirements for snack foods) and fit well within my existing and planned implementation infrastructure (admittedly, not a lot of thought goes into my meal-planning) so the product was worth my investment. I tried it, it worked for me and fit my budget so I bought it.

Of course, samples have significant impact far beyond the refrigerated section of warehouse stores. In the world of digital file formats, there are several areas of work where sample files and curated groups of sample files, which I call test sets, can be valuable.

The spectrum of sample files

Sample files are not all created equal. Some are created as perfect, ideal examples of the archetypal golden file, some have suspected or confirmed errors of varying degrees, while still others are engineered to be non-conforming or just plain bad. Should a sample always be an ideal “golden” everything-works-perfectly example, or do less-than-perfect files have a place? I’d argue that you need both. It’s always good to have a valid and well-formed sample, but you often learn more from non-conforming files because they can highlight points of failure or other issues.

Oliver Morgan of MetaGlue, Inc., an expert consultant working with the Federal Agencies Digitization Guidelines Initiative AV Working Group on the MXF AS-07 application specification, has developed the “Index of Metals” scale for sample files created specifically for testing purposes during the specification drafting process. The scale ranges from gold (engineered to be good/perfect) to plutonium (engineered to be poisonous).

An Index of Metals demonstrating a possible range of sample file qualities from gold (perfect) to plutonium (poisonous on purpose). Slide courtesy of Oliver Morgan, MetaGlue, Inc.

Ideally, the file creator would have the capability and knowledge to make files that conform to specific requirements so they know what’s good, bad and ugly about each engineered sample. Equally important is the accompanying documentation, which describes the goal and attributes of the sample. Some examples of this type of test set are the Adobe Acrobat Engineering PDF Test Suites and Apple’s QuickTime Sample Files.

Of course, not all sample files are planned out and engineered to meet specific requirements. More commonly, files are harvested from available data sets, web sites or collections and repurposed as de facto digital file format sample files. One example of this type of sample set is the Open Planets Foundation’s Format Corpus. These files can be useful for a range of purposes. Viewed in the aggregate, these ad hoc sample files can help establish patterns and map out structures for format identification and characterization when format documentation or engineered samples are either deficient or lacking. On the other hand, these non-engineered test sets can be problematic, especially when they deviate from the format specification. How far can a file diverge from the standard before it is considered fatally flawed or even another file format?

Audiences for sample files

In the case of specification drafting, engineered sample files are useful as part of a feedback loop for the specification authors, highlighting potential problems and omissions in the technical language. They may also be valuable later on to manufacturers and open-source developers who want to build tools that can interact with the file type to produce valid results.

At the Library of Congress, we sometimes examine sample files when working on the Sustainability of Digital Formats website so we can see with our own eyes how the file is put together. Reading specification documentation (which, when it exists, isn’t always as comprehensive as one might wish) is one thing but actually seeing a file through a hex viewer or other investigative tool is another. The sample file can clarify and augment our understanding of the format’s structure and behavior.
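
For readers who haven’t looked at a file this way before, here is a minimal sketch in Python of what a hex viewer shows. It is only an illustration and not a tool we use at the Library; the function name `hex_dump` and its `width` parameter are my own assumptions for the example.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Render bytes the way a simple hex viewer would (hypothetical helper).

    Each output line shows the offset, the bytes in hexadecimal,
    and the same bytes as printable ASCII ('.' for anything else).
    """
    pad = width * 3 - 1  # room for `width` two-digit bytes plus separators
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)


if __name__ == "__main__":
    # The opening bytes of a PDF file, used here as stand-in sample data.
    print(hex_dump(b"%PDF-1.4\n"))
```

To look at a real sample file, you could read its first few hundred bytes with `open(path, "rb").read(256)` and pass them to the function.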

Other efforts focusing on format identification and characterization issues, such as JHOVE and JHOVE2, the National Archives UK’s DROID, OPF’s Digital Preservation and Data Curation Requirements and Solutions and Archive Team’s Let’s Solve the File Format Problem, have a critical need for format samples, especially when other documentation about the format is incomplete or just plain doesn’t exist. Sample files, especially engineered test sets, can help efforts such as NARA’s Applied Research and their partners establish patterns and rules, including identifying magic numbers, which are an essential component of digital preservation research and workflows. Format registries like PRONOM and UDFR rely on the results of this research to support digital preservation services.
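
To make the idea of a magic number concrete, here is a rough sketch in Python of identification by leading bytes. It is a simplified illustration, not how DROID, JHOVE or PRONOM actually work; real signatures can be considerably richer than a fixed prefix. The table `SIGNATURES` and the function `identify` are hypothetical names chosen for this example, and the table covers only a handful of well-known formats.

```python
# A small, hand-picked table of well-known magic numbers.
# Real registries hold far more signatures than this.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]


def identify(header: bytes) -> str:
    """Return a format name if the header starts with a known magic number."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


if __name__ == "__main__":
    print(identify(b"%PDF-1.4\n"))      # a PDF header
    print(identify(b"not a known file"))  # no signature matches
```

A table like this is only as good as the samples used to build it, which is why well-documented test sets matter: a "gold" sample confirms that a signature matches, and a deliberately bad one shows where a simple prefix check breaks down.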

Finally, there are the institutional and individual end users who might want to implement the file type in their workflows or adopt it as a product, but first they want to play with it a bit. Sample files can help potential implementers understand how a file type might fit into existing workflows and equipment and how it might compare on an information storage level with other file format options. They can also help assess the learning curve for staff to understand the file’s structure and behavior. Adopting a new file format is no small decision for most institutions, so sample files allow technologists to evaluate whether a particular format meets their needs and to estimate the level of investment.

7 Comments

  1. Reto Kromer
    December 13, 2013 at 1:46 am

    This is indeed an excellent resource! Thank you!

  2. Kate Murray
    December 13, 2013 at 8:18 am

    Thank you Reto! I’d love to hear about other publicly-available sample sets out there.

    Best wishes – Kate

  3. Kit Arrington
    December 13, 2013 at 8:30 am

    I learned so much from this posting! Though obvious if you really think about it, I hadn’t before pondered the usefulness of file format sample sets – and certainly wasn’t aware of how frequently they are created and are in use. Do any, or all of the format registries also maintain “valid” samples of the formats that they are documenting? I hadn’t considered that as a possibility – which would make for a really interesting resource: “Here’s the file format documentation – click here for a sample!”

  4. Kate Murray
    December 13, 2013 at 2:15 pm

    Thanks Kit! Oh, how I wish that format registries had valid samples widely available! There may be small pockets of them but nothing really comprehensive. It would be great to have such a resource and there have been a few attempts to make it be so but there are all sorts of complications. Some file types are proprietary, it can be hard to get rights-free content samples, who would host, maintain, and validate the submitted samples and with what tools, etc. This doesn’t mean it can’t be done but rather that it’s not something the community could pull together quickly. We have talked about preparing samples that meet the various FADGI star ratings for still images but this isn’t a reality yet.

  5. Mark Evans
    December 13, 2013 at 5:23 pm

    Great post Kate, highlighting an important issue that is at the root of digital preservation. At Tessella we too have compiled a test deck of file formats that we use as part of our system testing, and to validate the PRONOM signatures in DROID.

  6. Libor Coufal
    January 13, 2014 at 6:04 pm

    At the National Library of Australia we are building a test corpus that includes both created and collected sample files in various formats and their versions as part of a project mapping and testing file format-software relationships.

  7. Kate Murray
    January 14, 2014 at 8:16 am

    Thanks for this update, Libor. I’m looking forward to hearing more about this project. Sounds like a great contribution to the community.

    Best wishes – Kate
