It’s Not Just Integrity: Fixity Data in Digital Sound and Moving Image Files

This blog post is co-authored by Carl Fleischhauer, Project Manager, Digital Initiatives, Library of Congress.

People who manage audio and video files over time create fixity data, also known as hash values or checksums, to help monitor the condition of those files in storage and when they are moved from one system or medium to another. And they often do what others also do: create fixity data for a whole file and allow their data management system to retain that historic value and compare it with a fresh calculation, to see if the file has changed and, if so, to take remedial action. The NDSA Infrastructure and Standards and Practices working groups’ Checking Your Digital Content: How, What and When to Check Fixity paper summarizes these concepts. But people who produce audio and video files, and those who manage them, often also create fixity data on parts of files, and sometimes that data is embedded in the file. What gives?

Flickr Checksum Machine Tag by Mark Longair. Photo courtesy of Flickr

An increasingly common practice within the audiovisual community is to generate fixity values at the intra-file or part level in addition to the whole file. Archivist and technologist Dave Rice calls these “little checksums for large files” and explains that “by producing checksums on a more granular level, such as per frame, it is more feasible to assess the extent or location of digital change in the event of a checksum mismatch.” These little checksums allow for monitoring changes in specific parts of the file. Imagine establishing separate fixity values on the whole file and on the data-only part of a file. Changes made to the embedded metadata would change the whole-file fixity value but not that of the video-only data, establishing proof that the video data is unchanged.
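To make the idea of locating a change concrete, here is a minimal Python sketch. It assumes fixed-size "frames" cut from a byte string, which is a simplification: real audiovisual frames are delimited by the container format. The names `frame_md5s`, `changed_frames` and `FRAME_SIZE` are hypothetical helpers for illustration, not part of any tool mentioned here.

```python
import hashlib

def frame_md5s(data: bytes, frame_size: int) -> list[str]:
    """Return one MD5 hex digest per fixed-size frame of `data`."""
    return [
        hashlib.md5(data[i:i + frame_size]).hexdigest()
        for i in range(0, len(data), frame_size)
    ]

def changed_frames(old: list[str], new: list[str]) -> list[int]:
    """Indexes of frames whose digests differ (compares up to the shorter list)."""
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]

# Hypothetical payload: 5 "frames" of 4 bytes each.
FRAME_SIZE = 4
original = bytes(range(20))
stored = frame_md5s(original, FRAME_SIZE)

# Simulate a single flipped bit inside frame 3 (bytes 12 to 15).
damaged = bytearray(original)
damaged[13] ^= 0x01
fresh = frame_md5s(bytes(damaged), FRAME_SIZE)

# The whole-file digest only says "something changed" ...
whole_file_changed = hashlib.md5(original).digest() != hashlib.md5(damaged).digest()
print(whole_file_changed)             # True
# ... while the per-frame digests say where.
print(changed_frames(stored, fresh))  # [3]
```

The whole-file comparison and the per-frame comparison answer different questions, which is why workflows often keep both.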

Practice varies about the level of granularity, but three common levels are the container or “chunk”, the frame and the smaller-than-frame. A well-known example of container or “chunk” level hashing is the audio chunk in RIFF-based Broadcast Wave files. The Federal Agencies Digitization Guidelines Initiative (FADGI) report on Embedding Metadata in Digital Audio Files: Introductory Discussion (PDF) explains that an “audio-data-only [MD5] checksum (including the entire <data> chunk, excluding the chunk id, size declaration, and any optional padding byte)” covers only the audio portion of the file, which “helps validate the integrity of the audio but allows for alteration of the metadata.” Even more granular is frame-level fixity data, such as found in FLAC. More granular still are implementations at the V (or data value) level of the KLV triplet structure found in digital cinema packages as defined in SMPTE ST 429-6:2006.
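The audio-data-only checksum described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the FADGI or BWF MetaEdit implementation: it reads a small RIFF/WAVE file held entirely in memory, ignores RF64 and nested chunks, and uses a made-up `note` chunk as a stand-in for real metadata chunks such as `bext`. The helper names (`riff_chunks`, `audio_only_md5`, `make_wave`) are hypothetical.

```python
import hashlib
import struct

def riff_chunks(buf: bytes):
    """Yield (chunk_id, payload) for each top-level chunk of a RIFF/WAVE file."""
    if buf[:4] != b"RIFF" or buf[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(buf):
        chunk_id = buf[pos:pos + 4]
        (size,) = struct.unpack("<I", buf[pos + 4:pos + 8])
        # The payload excludes the 4-byte id, the 4-byte size, and any pad byte.
        yield chunk_id, buf[pos + 8:pos + 8 + size]
        pos += 8 + size + (size & 1)  # chunks are padded to an even length

def audio_only_md5(buf: bytes) -> str:
    """MD5 of the <data> chunk payload only."""
    for chunk_id, payload in riff_chunks(buf):
        if chunk_id == b"data":
            return hashlib.md5(payload).hexdigest()
    raise ValueError("no data chunk found")

def make_wave(chunks) -> bytes:
    """Build a tiny in-memory WAVE file from (id, payload) pairs (demo helper)."""
    body = b"WAVE"
    for chunk_id, payload in chunks:
        body += chunk_id + struct.pack("<I", len(payload)) + payload
        body += b"\x00" * (len(payload) & 1)
    return b"RIFF" + struct.pack("<I", len(body)) + body

fmt = struct.pack("<HHIIHH", 1, 1, 8000, 8000, 1, 8)  # 8-bit mono PCM header
audio = bytes([128, 130, 127])                        # odd length forces a pad byte
before = make_wave([(b"fmt ", fmt), (b"note", b"old metadata"), (b"data", audio)])
after = make_wave([(b"fmt ", fmt), (b"note", b"new, longer metadata"), (b"data", audio)])

# Editing the metadata changes the whole-file hash but not the audio-only hash.
print(hashlib.md5(before).hexdigest() == hashlib.md5(after).hexdigest())  # False
print(audio_only_md5(before) == audio_only_md5(after))                    # True
```

For real files, a streaming reader that hashes the data chunk in blocks would be the more practical design, since audio files are often far too large to load at once.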

Checksum image of « La marseillaise » as sung by Processing. Photo by Douglas Edric Stanley, courtesy of Flickr

Options for intra-file fixity values in MXF files are an especially interesting bunch. The BBC’s implementation of MXF (PDF), for example, uses both per-track and per-frame checksums (in addition to whole file checksums). MXF also permits the option for “edit unit” level checksums, a term used in SMPTE MXF (defined by ST 377-1) to define one content package. The FADGI AS-07 MXF application specification permits one checksum value for every V data unit in the KLV triplets that represent progressive scanned picture data frame-wrapped essences. Picture data for interlaced video, including JPEG 2000 compressed video case I2 (frame wrapping, interlaced two fields per KLV triplet, defined by SMPTE ST 422:2013), will very often be carried with the data from both fields represented as a single V in a KLV triplet. The checksum value is calculated on the concatenated values of the two Vs in the pair of KLVs.

Intra-file fixity values in audio and video files serve roles beyond those of whole-file checksums, including process monitoring. The first process monitoring use case is creating the fixity value as early in the process as possible – ideally at the time of initial creation. The FADGI report on audio Interstitial Errors documents the problems of writing to disk after initial creation. In short, errors in the digital data can creep into the chain almost immediately, starting with the movement from the analog-to-digital converter to the digital audio workstation (DAW) and in the DAW’s writing of the file to a storage medium. A fixity value at time of creation compared with a fixity value created by the DAW or at the storage destination could be used to validate a successful capture of the data and of the file’s movement through the workflow.

Another process monitoring use case is confirming whether a losslessly transcoded file represents the original encoded data authentically. Dave Rice in Reconsidering the Checksum for Audiovisual Preservation explains that “a preservation-suitable lossless audiovisual encoding should decode to the same data that the original source would decode to, meaning that each pixel, frame, and timing decoded from the lossless version should be the same as the decoded original.”  In other words, the frame-level fixity values of an uncompressed file should be identical to the frame-level fixity values of a losslessly compressed version of the same file if the compression is truly mathematically lossless.
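As a rough illustration of that comparison, the Python sketch below uses `zlib` as a stand-in for a lossless codec and a bit-masking step as a stand-in for a lossy one. Both stand-ins are assumptions made for the sake of a runnable example; a real workflow would compare hashes of frames decoded by the actual audio or video codecs. `frame_hashes` is a hypothetical helper name.

```python
import hashlib
import zlib

def frame_hashes(frames):
    """One MD5 hex digest per decoded frame."""
    return [hashlib.md5(frame).hexdigest() for frame in frames]

# Hypothetical "uncompressed" source: four 64-byte frames.
source_frames = [bytes([n]) * 64 for n in range(4)]

# Stand-in lossless codec: encode each frame, then decode it again.
encoded = [zlib.compress(frame) for frame in source_frames]
decoded = [zlib.decompress(blob) for blob in encoded]

# The encoded bytes differ from the source bytes, so hashes of the stored
# streams would not match, yet the decoded frames hash identically.
print(encoded[0] != source_frames[0])                        # True
print(frame_hashes(decoded) == frame_hashes(source_frames))  # True

# Stand-in lossy codec: clear the lowest bit of every sample.
lossy = [bytes(b & 0xFE for b in frame) for frame in source_frames]
print(frame_hashes(lossy) == frame_hashes(source_frames))    # False
```

The point of hashing decoded frames rather than stored bytes is that two files with completely different encodings can still be shown to carry the same content.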

Clearly, these little checksums are valuable instruments and it’s getting easier to create and store them. Some workflows store intra-file fixity values within the file itself. This is of course impossible to do with whole-file fixity values because of the recursive nature of the process. Embedding a new value in the file changes the file itself, so the fixity value could never “catch up.” Intra-file checksums, however, can be embedded. The FADGI-created BWF MetaEdit tool includes an option to create an MD5 checksum on the audio data chunk of a Broadcast Wave file and put it in the MD5 chunk within the file using the id <MD5 >. The National Archives and Records Administration’s AVI MetaEdit tool does the same for the video data chunk of AVI files. FLAC also allows internal storage of an MD5 checksum value of audio-only data in the STREAMINFO block.
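The “catch up” problem is easy to see in a toy example. The Python sketch below assumes a made-up container layout, the payload followed by a 16-byte MD5 of that payload; it is not the layout of the BWF `<MD5 >` chunk or of FLAC's STREAMINFO block. The function names are hypothetical.

```python
import hashlib

DIGEST_LEN = 16  # size of a raw MD5 digest in bytes

def embed_payload_md5(payload: bytes) -> bytes:
    """Toy container: the payload followed by the MD5 of the payload."""
    return payload + hashlib.md5(payload).digest()

def verify_payload_md5(container: bytes) -> bool:
    """Recompute the payload MD5 and compare it with the embedded value."""
    payload, stored = container[:-DIGEST_LEN], container[-DIGEST_LEN:]
    return hashlib.md5(payload).digest() == stored

container = embed_payload_md5(b"audio samples")
print(verify_payload_md5(container))  # True: a part-level value can live in the file

# Now try the same trick with a whole-file value.
whole = hashlib.md5(container).digest()
with_whole = container + whole  # embedding the value changes the file ...
caught_up = hashlib.md5(with_whole).digest() == whole
print(caught_up)                # False: ... so the stored value is already stale
```

A part-level checksum works because the bytes it covers exclude the place where the checksum itself is written.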

The BBC Ingex system embeds (PDF) frame-level checksum values in the MXF Generic Container System Item (PDF). The FADGI AS-07 MXF application specification will recommend storing the fixity value in the Generic Container System Item but will also permit storing the value within the KLV triplet, following the model of DCPs.

Toolsets like FFmpeg include capability for two types of checksums at the frame level. The framecrc output format creates Adler-32 CRC checksums while the framemd5 output format creates MD5 hash values for each audio and video packet.
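For readers who want to see the two checksum types side by side, here is a small Python sketch. It hashes arbitrary byte strings standing in for packets, so it shows the arithmetic only and does not reproduce FFmpeg's report format or its decoding step. The second function merely assembles an argument list using FFmpeg's `-f framecrc` and `-f framemd5` output formats; the file names passed to it are placeholders, and both function names are hypothetical.

```python
import hashlib
import zlib

def packet_checksums(packets):
    """Return (index, adler32_hex, md5_hex) for each packet of bytes."""
    rows = []
    for index, packet in enumerate(packets):
        adler = f"0x{zlib.adler32(packet):08x}"   # 32-bit Adler checksum
        md5 = hashlib.md5(packet).hexdigest()     # 128-bit MD5 hash
        rows.append((index, adler, md5))
    return rows

def ffmpeg_frame_hash_args(source: str, report: str, kind: str = "framemd5"):
    """Build (but do not run) an ffmpeg argument list for a per-packet report."""
    if kind not in ("framecrc", "framemd5"):
        raise ValueError("kind must be 'framecrc' or 'framemd5'")
    return ["ffmpeg", "-i", source, "-f", kind, report]

for row in packet_checksums([b"a", b"second packet"]):
    print(row)

print(ffmpeg_frame_hash_args("input.mov", "input.framemd5"))
```

Adler-32 is quick to compute and compact to store, while MD5 gives a much longer value per packet; which one suits a workflow depends on how many packets there are and how the values will be used.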

Of course, the audiovisual community isn’t the only one using intra-file checksums or even checksums for process monitoring. The frame check sequence in networked storage is one example, just for starters. The concepts behind checksums on parts of files are extensible to other file types too. There’s nothing stopping say, the imaging community from embedding checksums on specifically EXIF metadata in TIFF files to help establish if that data is altered or stripped during processing.

 

5 Comments

  1. Scott Rife
    March 5, 2014 at 8:48 am

    Fixity is useful and necessary in an archive. I’d like to see more use of forward error correction (FEC) in these files (AXF, I believe, will include FEC). It is important in archive/long term storage that we recognize when fixity has been breached, but also important to recover as much of the content as possible. I would prefer a few missing frames to declaring the file a total loss.

    Thanks for sharing the details and background.

  2. Kate Murray
    March 5, 2014 at 9:16 am

    I agree with your comment, Scott. One of the benefits of intra-file fixity data is that it can be a pointer to specific failure points. This is helpful for large complex files (like video files for example) where it is sometimes necessary to access parts of the file – perhaps a few frames or a clip. If there’s a data failure in a subpart, either just that subpart (in theory, although this may not be practical) or the whole file could be replaced.

    Thanks for the discussion! Kate

  3. Jessica Bushey
    March 6, 2014 at 5:01 pm

    Could you provide further explanation of the last sentence: “There’s nothing stopping say, the imaging community from embedding checksums on specifically EXIF metadata in TIFF files to help establish if that data is altered or stripped during processing.” I know what Exif metadata is and where it is stored in the TIFF header, but I’m not sure if you are suggesting adding a checksum as a standard element in the Exif schema, or if a checksum would be run against the Exif metadata so that, if that metadata were changed, altered or corrupted, the checksum would no longer match.

  4. Kate Murray
    March 7, 2014 at 9:07 am

    Thanks for this query Jessica. My comment was meant to be illustrative about how intra-file fixity data might be extensible to other file types where separate fixity values on metadata and data could be helpful. If there’s likely to be change in the Exif data but not the image data (or vice versa), separate fixity values might be helpful to identify change where there ought not to be change.

    Best wishes – Kate

  5. Peter Krogh
    April 9, 2014 at 10:58 am

    Great to see this in the wild in video and audio. Note that DNG has had this for image files for five or six years.

    The MD5 checksums in DNG refer to the source image data, and the source image file if embedded. So metadata can change, and even embedded adjusted renderings of the image data can change without invalidating the checksum.

    Also note that Adobe has released an openly licensed SDK with command-line capability to verify the checksums.
