The following is a guest post by Jefferson Bailey, Strategic Initiatives Manager at Metropolitan New York Library Council, National Digital Stewardship Alliance Innovation Working Group co-chair and a former Fellow in the Library of Congress’s Office of Strategic Initiatives.
Here on The Signal, members of the NDSA Levels of Digital Preservation team have been providing additional explication and resources related to each cell in the Levels guidance document. The Levels cover five topic areas with four levels of activity for each area, twenty cells overall. The first post in the series covered Level 1 – Protect Your Data in the area of Storage and Geographic Location. For an overall explanation of the Levels, see the short white paper The NDSA Levels of Digital Preservation: An Explanation and Uses (pdf). This post will cover the second cell in Level 1, File Fixity and Data Integrity. But before we begin, let us revisit the concept of file fixity.
File Fixity
Fixity, in the preservation sense, means the assurance that a digital file has remained unchanged, i.e. fixed. While there have been multiple posts on file fixity on The Signal (see resources below), it bears repeating: knowing that a digital file remains exactly the same as it was when received, both when adding it to a preservation system and while storing it into the future, is a bedrock principle of digital preservation. Digital information, however, poses many challenges to knowing that the digital thing you want to preserve remains unaltered through time. For one, changes to a digital file may not be perceptible to human scrutiny. Moreover, it would be impossible for any individual to open every digital file being preserved in an archive and confirm that each one remains unchanged.
Lucky for us, our machine frenemies are good at these sorts of tasks. They accomplish this by creating an effectively unique string of letters and numbers that represents the exact bitstream (the sequence of 1s and 0s that make up digital information) of each individual file. This is done using a cryptographic algorithm, such as MD5, SHA1 or SHA256 (Wikipedia has a decent explanation of cryptographic hash functions), which generates the unique sequence of numbers and letters. This value is often called a checksum, hash or message digest. One common metaphor for a checksum is to think of it as the “fingerprint” of a digital file.
If any single bit in that file changes, running the same algorithm on the file will produce a drastically different checksum. By comparing these two checksums, one can determine whether a digital file has changed. Generating a current checksum and comparing it to the checksum created originally is known as fixity checking or auditing. By this method, we can ensure that the content we are preserving remains the same (and if a copy has become corrupted, having multiple copies allows us to replace the corrupted one with an exact copy of the original).
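To make this concrete, here is a minimal sketch of generating a checksum using Python's standard hashlib module; the file name is a hypothetical example, not a file from this post.

```python
# A minimal sketch: compute a checksum ("fingerprint") for a file.
# Uses only Python's standard library; the file name is hypothetical.
import hashlib

def checksum(path, algorithm="sha256", chunk_size=65536):
    """Read the file in chunks and return its hex-encoded digest."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(checksum("old_heidelberg.jpg"))          # SHA256 digest
print(checksum("old_heidelberg.jpg", "md5"))   # same file, MD5 digest
```

Running the function on the same, unchanged file always produces the same digest; comparing a newly generated digest to a stored one is the audit described above.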
As an example, let's take a look at our good friend, Old Heidelberg, the pipe-smoking, beer-drinking dog.
While these digital files appear identical, I have actually gone in and “flipped” (i.e. changed from 0 to 1 or from 1 to 0) two bits in the second image. Running a fixity tool against both files shows that they produce quite different checksums.
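As a rough sketch of the kind of change described above, the snippet below flips a single bit in a copy of a file and compares the two checksums; the file names are hypothetical, and checksum() is the helper sketched earlier.

```python
# Flip one bit in a copy of a file and compare checksums.
# File names are hypothetical; checksum() is the helper sketched above.
data = bytearray(open("old_heidelberg.jpg", "rb").read())
data[100] ^= 0b00000001  # XOR flips the lowest bit of byte 100
with open("old_heidelberg_flipped.jpg", "wb") as f:
    f.write(bytes(data))

print(checksum("old_heidelberg.jpg"))
print(checksum("old_heidelberg_flipped.jpg"))  # a drastically different digest
```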
The Two Requirements of Row Two Column One
Knowing what fixity means, and the purpose it serves, allows us to understand the content of the Level 1 cell for File Fixity and Data Integrity.
Check file fixity on ingest if it has been provided with the content
There are a variety of tools that can generate and audit fixity information and, since this is a relatively simple function computationally, many of them are integrated into other common digital preservation tools. For instance, Bagger (a tool built on the BagIt specification for transferring digital content) generates a “manifest” for each “bag” of files it creates, which includes an MD5 checksum for every file. When you validate a bag after you have received it from a sender, Bagger confirms both the completeness of the bag's contents and the fixity of each file included.
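As an illustration of the same validation step in code, here is a sketch using the Library of Congress's bagit-python library (pip install bagit); the directory name is hypothetical.

```python
# Sketch: validate a received bag, which recomputes each file's checksum
# and compares it against the bag's manifest. Directory name is hypothetical.
import bagit

bag = bagit.Bag("transfer_from_donor/")
try:
    bag.validate()  # checks completeness and per-file fixity
    print("Bag is complete and every file matches its manifest checksum.")
except bagit.BagValidationError as e:
    print("Fixity problem:", e)
```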
The first requirement is quite simple. If a donor, content contributor, or creator sending you digital files (or self-depositing them into your repository) generated fixity information before delivery and included it with the content, then you should verify that the fixity information of the content you received matches what was recorded prior to your receipt of it. You do this by “checking” the fixity information: generate new checksums after you receive the content and compare them to the checksums created prior to transfer.
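A sketch of that ingest check, assuming the donor supplied a plain-text manifest in the "<checksum>  <filename>" format produced by tools like md5sum; the manifest name is hypothetical, and checksum() is the helper sketched earlier.

```python
# Audit received files against a donor-supplied checksum manifest.
# Assumed manifest format: "<checksum>  <filename>" per line (md5sum style).
def audit(manifest_path, algorithm="md5"):
    with open(manifest_path) as manifest:
        for line in manifest:
            expected, filename = line.strip().split(None, 1)
            actual = checksum(filename, algorithm)  # helper sketched above
            status = "OK" if actual == expected else "MISMATCH"
            print(f"{status}\t{filename}")

audit("donor_manifest.md5")
```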
Create fixity info if it wasn’t provided with the content
The second requirement of the cell is, perhaps, even more straightforward. If you, the archivist or preservationist, accept the not-insignificant responsibility of preserving the digital content you are taking in (in some cases for a set period, in others in perpetuity), how will you monitor whether this information has changed, whether through bit rot, intentional or unintentional alteration, or other means? To preserve is to safeguard from change and guarantee accessibility into the future, no small responsibility with digital content. Thus, if the content you have received arrived without fixity information, you should generate it upon taking custody of the content.
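A sketch of generating that fixity information on receipt, assuming a simple accession directory; the paths are hypothetical, and checksum() is the helper sketched earlier.

```python
# Walk an accession directory and record a checksum for every file,
# producing a manifest you can audit against in the future.
import os

def write_manifest(root_dir, manifest_path, algorithm="sha256"):
    with open(manifest_path, "w") as manifest:
        for dirpath, _, filenames in os.walk(root_dir):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                manifest.write(f"{checksum(path, algorithm)}  {path}\n")

write_manifest("accession_2013_001/", "accession_2013_001.sha256")
```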
The Levels guidance, in order to remain as broadly applicable and useful as possible, does not recommend specific tools, processes or policies for generating fixity information, but the resources below provide information on available tools and current practices to help institutions plan and implement Level 1 File Fixity and Data Integrity practices.
Resources:
- Checking Your Digital Content: How, What and When to Check Fixity? [PDF]. NDSA Infrastructure & Standards Working Group, February 2012.
- Checksum Programs Evaluation [PDF]. Carol Kussmann, Minnesota Historical Society, May 2012.
- Fixity Checks: Checksums, Message Digests and Digital Signatures [PDF]. Audrey Novak, November 2011.
Other fixity-related posts on The Signal:
- Fixity and Fluidity in Digital Preservation, Bill LeFurgy, October 2012.
- File Fixity and Digital Preservation Storage: More Results from the NDSA Storage Survey, Jefferson Bailey, March 2012.
- Hashing Out Digital Trust, Kate Zwaard, November 2011.
If you know of links to other good fixity resources, please share them in the comments!
Comments (5)
Thanks for the great overview, Jefferson, and a great idea to do some deep dives on the Levels. It would be great if these posts were eventually linked or packaged with whatever is the primary home of the Levels. And if it's not too much shameless self-promotion, I would like to point to an additional resource that AVPreserve has developed: Fixity is a free application we created that can be set to run regular fixity checks on a specific set of folders, hard drive, NAS or other storage device. It looks at not only checksums, but also whether files have moved, disappeared, or been added to the directory, and sends e-mail notification of changes. Available only for Windows right now, but a Mac version should be ready in about 2 weeks http://www.avpreserve.com/avpsresources/tools/
I don’t understand why a previous comment of mine on the issue of cover sheets and checksums was deleted. Cover sheets certainly impact on the reliability of checksums, and the point at which they are added in the repository workflow is critical. And the use of automated cover sheets is becoming more popular, yet little/no thought appears to have been given to them in the context of checksums/fixity.
Joshua Ranger (and the AVPreserve Crew) thank you for posting your resources link! I’ve been looking for something like your Fixity application for a few days now and feel like it is exactly what I needed. THANK YOU!
When using checksums to confirm fixity, if the checksums do not match, is there a way to know how different the files are? For example, can you tell from the checksums that the second Old Heidelberg dog picture was only 2 bits different?
Should I be running checksums on the original files or the preservation copies (TIFFs)? What about access copies, do these need checksums?