This blog post is co-authored by Carl Fleischhauer, Project Manager, Digital Initiatives, Library of Congress.
People who manage audio and video files over time create fixity data, also known as hash values or checksums, to help monitor the condition of those files in storage and as they move from one system or medium to another. They often do what others do: create fixity data for a whole file and let their data management system retain that historic value and compare it against a fresh calculation, to see whether the file has changed and, if so, to take remedial action. The NDSA Infrastructure and Standards and Practices working groups’ paper Checking Your Digital Content: How, What and When to Check Fixity summarizes these concepts. But people who produce audio and video files, and those who manage them, often also create fixity data on parts of files, and sometimes that data is embedded in the file itself. What gives?
An increasing practice within the audiovisual community is to generate fixity values at the intra-file or part level in addition to the whole-file level. Archivist and technologist Dave Rice calls these “little checksums for large files” and explains that “by producing checksums on a more granular level, such as per frame, it is more feasible to assess the extent or location of digital change in the event of a checksum mismatch.” These little checksums allow for monitoring changes in specific parts of the file. Imagine establishing separate fixity values for the whole file and for just the data portion of that file. Changes made to the embedded metadata would change the whole-file fixity value but not that of the video-only data, establishing proof that the video data itself is unchanged.
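To make the idea concrete, here is a minimal Python sketch using an invented toy layout (a fixed-size metadata region followed by essence data); real formats define their own structures, so treat this as illustration rather than a recipe.

```python
import hashlib

# Toy "file": a 64-byte metadata region followed by essence (audio/video) data.
# This layout is invented for illustration; real formats define their own.
metadata = bytearray(b"Title: Field recording 42".ljust(64, b"\x00"))
essence = bytes(range(256)) * 4  # stand-in for the audio/video data itself

def md5(*parts: bytes) -> str:
    h = hashlib.md5()
    for part in parts:
        h.update(part)
    return h.hexdigest()

whole_before = md5(metadata, essence)
data_only_before = md5(essence)

metadata[7:12] = b"Beach"  # edit only the embedded metadata

print(md5(metadata, essence) == whole_before)  # False: the whole-file value changed
print(md5(essence) == data_only_before)        # True: the essence is provably untouched
```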
Practice varies in the level of granularity, but three common levels are the container or “chunk,” the frame, and smaller-than-frame units. A well-known example of container or “chunk” level hashing is the audio chunk in RIFF-based Broadcast Wave files. The Federal Agencies Digitization Guidelines Initiative report on Embedding Metadata in Digital Audio Files: Introductory Discussion (PDF) explains that an “audio-data-only [MD5] checksum (including the entire <data> chunk, excluding the chunk id, size declaration, and any optional padding byte)” covers only the audio portion of the file, “which helps validate the integrity of the audio but allows for alteration of the metadata.” Even more granular is frame-level fixity data, such as that found in FLAC. More granular still are implementations at the level of the V (or data value) of the KLV triplet structure found in digital cinema packages, as defined in SMPTE ST 429-6:2006.
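As a rough sketch of how such an audio-data-only checksum can be computed outside of a dedicated tool, the following Python walks the RIFF chunks of a WAVE/BWF file and hashes only the <data> payload; the file name is hypothetical, and BWF MetaEdit (discussed below) is the usual route in practice.

```python
import hashlib
import struct

def data_chunk_md5(path):
    """MD5 of the <data> chunk payload of a RIFF/WAVE (or BWF) file,
    excluding the chunk id, size declaration, and any padding byte."""
    with open(path, "rb") as f:
        riff, _, wave = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError("no <data> chunk found")
            chunk_id, size = struct.unpack("<4sI", header)
            if chunk_id == b"data":
                h = hashlib.md5()
                remaining = size
                while remaining:
                    block = f.read(min(remaining, 1 << 20))
                    if not block:
                        raise ValueError("truncated <data> chunk")
                    h.update(block)
                    remaining -= len(block)
                return h.hexdigest()
            f.seek(size + (size & 1), 1)  # skip payload plus optional pad byte

# Hypothetical usage:
# print(data_chunk_md5("capture_0001.wav"))
```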
Options for intra-file fixity values in MXF files are an especially interesting bunch. The BBC’s implementation of MXF (PDF), for example, uses both per-track and per-frame checksums (in addition to whole-file checksums). MXF also permits “edit unit” level checksums; in SMPTE MXF (defined by ST 377-1), an edit unit corresponds to one content package. The FADGI AS-07 MXF application specification permits one checksum value for every V (data value) in the KLV triplets that represent frame-wrapped essences of progressively scanned picture data. Picture data for interlaced video, including JPEG 2000 compressed video case I2 (frame wrapping, interlaced, two fields per KLV triplet, defined by SMPTE ST 422:2013), will very often be carried with the data from both fields represented as a single V in a KLV triplet. When the two fields instead arrive in a pair of KLV triplets, the checksum value is calculated on the concatenated values of the two Vs in the pair.
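For readers curious about the KLV side of this, here is a much-simplified Python sketch that walks a raw KLV stream and produces one MD5 per V; a real MXF parser must also handle partitions, index tables, fill items, and filtering for the essence element keys.

```python
import hashlib

def iter_klv(stream):
    """Yield (key, value) pairs from a raw KLV stream: a 16-byte SMPTE
    Universal Label, a BER-encoded length, then the value bytes."""
    while True:
        key = stream.read(16)
        if len(key) < 16:
            return
        first_byte = stream.read(1)
        if not first_byte:
            return
        first = first_byte[0]
        if first < 0x80:              # BER short form: length fits in one byte
            length = first
        else:                         # BER long form: next (first & 0x7F) bytes hold the length
            length = int.from_bytes(stream.read(first & 0x7F), "big")
        yield key, stream.read(length)

def per_value_md5(stream):
    """One MD5 per V, e.g. one per frame-wrapped picture element."""
    return [(key.hex(), hashlib.md5(value).hexdigest()) for key, value in iter_klv(stream)]

# Hypothetical usage on a raw KLV dump:
# with open("essence.klv", "rb") as f:
#     for key, digest in per_value_md5(f):
#         print(key, digest)
```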
Intra-file fixity values in audio and video files serve roles beyond those of whole-file checksums, including process monitoring. The first process monitoring use case is creating the fixity value as early in the process as possible, ideally at the time of initial creation. The FADGI report on audio Interstitial Errors documents the problems of writing to disk after initial creation. In short, errors in the digital data can creep into the chain almost immediately, starting with the movement from the analog-to-digital converter to the digital audio workstation (DAW) and continuing through the DAW’s writing of the file to a storage medium. A fixity value made at the time of creation, compared with a fixity value created by the DAW or at the storage destination, could be used to validate successful capture of the data and the file’s movement through the workflow.
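A minimal sketch of that comparison might look like the following, with a hypothetical file path and a placeholder capture-time value standing in for whatever the capture system reported.

```python
import hashlib

def file_md5(path, block_size=1 << 20):
    """Whole-file MD5, streamed so large captures don't need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

# Placeholder: the value reported at capture time by the capture/QC system.
capture_md5 = "9e107d9d372bb6826bd81d3542a419d6"

# Hypothetical path to the file after the DAW has written it to storage.
stored_md5 = file_md5("/archive/incoming/take_03.wav")

if stored_md5 != capture_md5:
    print("Fixity mismatch: investigate the transfer before accepting the file.")
else:
    print("Capture and stored copies agree.")
```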
Another process monitoring use case is confirming whether a losslessly transcoded file represents the original encoded data authentically. Dave Rice in Reconsidering the Checksum for Audiovisual Preservation explains that “a preservation-suitable lossless audiovisual encoding should decode to the same data that the original source would decode to, meaning that each pixel, frame, and timing decoded from the lossless version should be the same as the decoded original.” In other words, the frame-level fixity values of an uncompressed file should be identical to the frame-level fixity values of a losslessly compressed version of the same file if the compression is truly mathematically lossless.
Clearly, these little checksums are valuable instruments, and it’s getting easier to create and store them. Some workflows store intra-file fixity values within the file itself. This is of course impossible to do with whole-file fixity values because of the recursive nature of the process: embedding a new value in the file changes the file itself, so the fixity value could never “catch up.” Intra-file checksums, however, can be embedded. The FADGI-created BWF MetaEdit tool includes an option to create an MD5 checksum on the audio data chunk of a Broadcast Wave file and embed it in an MD5 chunk within the file, using the chunk id <MD5 >. The National Archives and Records Administration’s AVI MetaEdit tool does the same for the video data chunk of AVI files. FLAC also allows internal storage of an MD5 checksum of the audio-only data in its STREAMINFO block.
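As a rough illustration of checking one of these embedded values, the sketch below finds the <data> and <MD5 > chunks in a Broadcast Wave file and compares a freshly computed digest with the stored one. The file name is hypothetical, and the guess that the <MD5 > payload holds either the 16 raw digest bytes or a 32-character hex string should be verified against what your tool actually writes.

```python
import hashlib
import struct

def read_riff_chunks(path, wanted):
    """Return {chunk_id: payload} for the requested RIFF chunk ids.
    Sketch only: it reads whole chunk payloads into memory."""
    found = {}
    with open(path, "rb") as f:
        f.seek(12)  # skip the RIFF header: 'RIFF' + size + 'WAVE'
        while len(found) < len(wanted):
            header = f.read(8)
            if len(header) < 8:
                break
            chunk_id, size = struct.unpack("<4sI", header)
            if chunk_id in wanted:
                found[chunk_id] = f.read(size)
                if size & 1:
                    f.seek(1, 1)  # skip the pad byte
            else:
                f.seek(size + (size & 1), 1)
    return found

chunks = read_riff_chunks("capture_0001.wav", {b"data", b"MD5 "})  # hypothetical file
computed = hashlib.md5(chunks[b"data"]).hexdigest()

payload = chunks[b"MD5 "]
# Assumption: the payload is either 16 raw digest bytes or a hex string.
embedded = payload.hex() if len(payload) == 16 else payload.decode("ascii").strip("\x00 \r\n\t").lower()

print("match" if embedded == computed else "MISMATCH")
```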
The BBC Ingex system embeds (PDF) frame-level checksum values in the MXF Generic Container System Item (PDF). The FADGI AS-07 MXF application specification will recommend storing the fixity value in the Generic Container System Item but will also permit storing the value within the KLV triplet, following the model of DCPs.
Toolsets like FFmpeg include the capability for two types of checksums at the frame level: the framecrc muxer creates Adler-32 CRC checksums, while framemd5 creates MD5 hash values for each audio and video packet.
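Here is a hedged sketch of using framemd5 to test whether a lossless transcode decodes to the same frames as its source, along the lines of the use case above; the file names are hypothetical, and both files are assumed to decode to the same pixel and sample formats and stream layout.

```python
import subprocess

def frame_hashes(path):
    """Per-packet MD5 hashes from FFmpeg's framemd5 muxer, in decode order.
    The hash is the last comma-separated field on each non-comment line."""
    result = subprocess.run(
        ["ffmpeg", "-nostdin", "-v", "error", "-i", path, "-f", "framemd5", "-"],
        capture_output=True, text=True, check=True,
    )
    return [line.rsplit(",", 1)[-1].strip()
            for line in result.stdout.splitlines()
            if line and not line.startswith("#")]

# Hypothetical files: an uncompressed master and its lossless (e.g. FFV1) transcode.
if frame_hashes("master_v210.mov") == frame_hashes("master_ffv1.mkv"):
    print("decoded frames match")
else:
    print("decoded frames differ: the transcode is not bit-transparent")
```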
Of course, the audiovisual community isn’t the only one using intra-file checksums, or even checksums for process monitoring; the frame check sequence in networked storage is one example, just for starters. The concepts behind checksums on parts of files are extensible to other file types too. There’s nothing stopping, say, the imaging community from embedding checksums specifically on the EXIF metadata in TIFF files to help establish whether that data is altered or stripped during processing.
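As a purely illustrative sketch of that idea, the snippet below uses Pillow (an assumed dependency) to hash the EXIF metadata and the decoded pixel data separately; a production approach would more likely hash the raw EXIF IFD bytes as they are stored in the TIFF.

```python
import hashlib
from PIL import Image  # Pillow, assumed to be installed for this sketch

def exif_md5(path):
    """MD5 of the EXIF metadata as re-serialized by Pillow (sketch only)."""
    with Image.open(path) as img:
        return hashlib.md5(img.getexif().tobytes()).hexdigest()

def pixel_md5(path):
    """MD5 of the decoded pixel data, independent of the metadata."""
    with Image.open(path) as img:
        return hashlib.md5(img.tobytes()).hexdigest()

# Hypothetical: record both values before a processing step, then compare afterward
# to see whether the metadata, the image data, or both were altered.
# print(exif_md5("scan_0001.tif"), pixel_md5("scan_0001.tif"))
```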
Comments (6)
Fixity is useful and necessary in an archive. I’d like to see more use of forward error correction (FEC) in these files (AXF, I believe, will include FEC). It is important in archive/long term storage that we recognize when fixity has been breached, but also important to recover as much of the content as possible. I would prefer a few missing frames to declaring the file a total loss.
Thanks for sharing the details and background.
I agree with your comment, Scott. One of the benefits of intra-file fixity data is that it can be a pointer to specific failure points. This is helpful for large complex files (like video files, for example) where it is sometimes necessary to access parts of the file, perhaps a few frames or a clip. If there’s a data failure in a subpart, just that subpart could be replaced (in theory, although this may not be practical) rather than the whole file.
Thanks for the discussion! Kate
Could you provide further explanation of the last sentence: “There’s nothing stopping, say, the imaging community from embedding checksums specifically on the EXIF metadata in TIFF files to help establish whether that data is altered or stripped during processing.” I know what Exif metadata is and where it is stored in the TIFF header, but I’m not sure if you are suggesting adding a checksum as a standard element in the Exif schema, or if a checksum would be run against the Exif metadata so that, if that metadata were changed, altered, or corrupted, the checksum would no longer match.
Thanks for this query, Jessica. My comment was meant to be illustrative of how intra-file fixity data might be extensible to other file types where separate fixity values on metadata and data could be helpful. If there’s likely to be change in the Exif data but not the image data (or vice versa), separate fixity values might help identify change where there ought not to be change.
Best wishes – Kate
Great to see this in the wild in video and audio. Note that DNG has had this for image files for five or six years.
The MD5 checksums in DNG refer to the source image data, and the source image file if embedded. So metadata can change, and even embedded adjusted renderings of the image data can change without invalidating the checksum.
Also note that Adobe has released an openly licensed SDK with command-line capability to verify the checksums.
Considerable progress has been made with permanent storage media such as micro silica media. There is no degradation in the data. Each edit or modification is re-written to new media and linked to the old data set, and each iteration remains available as a resource. With the use of a blockchain-like capability, everything from the original to the current version is archived, along with all editors and associated metadata. The storage capacity is far beyond that of commercially available technology. Though nothing can replace the historical objects or physical media, their likeness can be digitized in all respects; the chemical or biological makeup can be recorded as well. And that is without ignoring the capabilities of DNA data storage, which go further in this respect. A single person can generate around 1.5 to 2.0 terabytes of data in a single year; 85 years of that data can easily be stored in roughly 2.5 cu mm. A small to midsize company can generate in excess of 2 terabytes per day, or roughly 12.8 cu mm per year. Most successful companies perish within 25 years, and all of that generated data would likely fit inside a 3 cm cube. The US government produces roughly 1.8 petabytes per hour, which totals a cube of roughly 100 cu cm per year.