Protect Your Data: File Fixity and Data Integrity

This post is about row two, column one, the second box, in the NDSA Levels of Digital Preservation

The following is a guest post by Jefferson Bailey, Strategic Initiatives Manager at Metropolitan New York Library Council, National Digital Stewardship Alliance Innovation Working Group co-chair and a former Fellow in the Library of Congress’s Office of Strategic Initiatives.

Here on The Signal, members of the NDSA Levels of Digital Preservation team have been providing some additional explication and resources related to each cell in the Levels guidance document. The Levels covers five different topic areas with four different levels of activity for each area, so twenty cells overall. The first post in the series covered Level 1 – Protect Your Data in the area of Storage and Geographic Location. For an overall explanation of the Levels, see this short white paper, The NDSA Levels of Digital Preservation: An Explanation and Uses (pdf). This post will cover the second cell in Level 1, File Fixity and Data Integrity. But before we begin, let us revisit the concept of file fixity.

File Fixity

Fixity, in the preservation sense, means the assurance that a digital file has remained unchanged, i.e. fixed. While there have been multiple posts on file fixity on The Signal (see resources below), it bears repeating: knowing that a digital file remains exactly the same as it was when received, as well as through the process of adding it to a preservation system and storing it into the future, is a bedrock principle of digital preservation. Digital information, however, poses many challenges to knowing that the digital thing you want to preserve remains unaltered through time. For one, changes to a digital file may not be perceptible to human scrutiny. Also, it would be impractical for anyone to open every digital file being preserved in their archive and confirm that each one remains unchanged.

Checksum icon. Public domain image from The Noun Project

Lucky for us, our machine frenemies are good at these sorts of tasks. The way that they accomplish this is by creating a string of letters and numbers that represents the exact bitstream (the sequence of 1s and 0s that make up digital information) of each individual file. This is done using a cryptographic algorithm, such as MD5, SHA1 or SHA256 (Wikipedia has a decent explanation of cryptographic hash functions), which generates a sequence of numbers and letters that is, for practical purposes, unique to that bitstream. This value is often called a checksum, hash or message digest. One common metaphor for a checksum is to think of it as the “fingerprint” of a digital file.

If any single bit in that file changes, running the same algorithm on the file will produce a drastically different checksum. By comparing the two checksums, one can determine whether a digital file has changed. Generating a current checksum and comparing it to the checksum created originally is known as fixity checking or auditing. By this method, we can ensure that the content we are preserving remains the same (and if it has become corrupted, having multiple copies allows us to replace the corrupted one with an exact copy of the original).
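The fixity check described above can be sketched in a few lines of Python using the standard library's hashlib module. This is only a minimal illustration, not the method used by any particular preservation tool; the function names `file_checksum` and `fixity_ok` are made up for this example.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex checksum of a file, reading it in chunks.

    Reading in chunks keeps memory use flat even for very large files.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_checksum, algorithm="sha256"):
    """Compare a freshly generated checksum to the one recorded earlier."""
    return file_checksum(path, algorithm) == recorded_checksum.strip().lower()
```

Calling `fixity_ok("old_heidelberg.tif", recorded)` (with a hypothetical file name and a previously stored checksum) returns False as soon as a single bit of the file differs from the version that was originally checksummed.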

As an example, let's take a look at our good friend, Old Heidelberg, the pipe-smoking, beer-drinking dog.

The Old Heidelberg we know and love. Library of Congress Prints & Photographs. LC-D4-18852. http://hdl.loc.gov/loc.pnp/det.4a12901

Not, fixity-wise, the Old Heidelberg we know and love.

While these digital files appear the same, I have actually gone in and “flipped” (i.e. changed from 0 to 1 or 1 to 0) two bits in the second image. Running a fixity tool against both files shows that they produce quite different checksums.

This image shows the results of running a fixity tool on the two files. The information shown, in order, is: file size [the same for both images], MD5 checksum, SHA256 checksum, and directory path of each of the two files. Notice the difference between the checksums of the two files.
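The same experiment can be reproduced in miniature. The sketch below does not use the actual image files; it flips two arbitrarily chosen bits in a short stand-in byte string (an assumption made purely for illustration) and prints the size, MD5 and SHA256 values for both versions.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    byte_index, offset = divmod(bit_index, 8)
    altered = bytearray(data)
    altered[byte_index] ^= 1 << offset
    return bytes(altered)


# Stand-in content; any bitstream would behave the same way.
original = b"Old Heidelberg, the pipe-smoking, beer-drinking dog."
# Flip two bits, at positions picked arbitrarily for this example.
altered = flip_bit(flip_bit(original, 3), 100)

for label, payload in (("original", original), ("altered", altered)):
    print(
        label,
        len(payload),
        hashlib.md5(payload).hexdigest(),
        hashlib.sha256(payload).hexdigest(),
    )
```

The two lines of output report the same size but entirely different checksums, which is the behavior the fixity tool shows for the two image files.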

The Two Requirements of Row Two Column One

Knowing what fixity means, and the purpose it serves, allows us to understand the content of the Level 1 cell for File Fixity and Data Integrity.

Check file fixity on ingest if it has been provided with the content

Manifest file from a Bagger “bag” containing the two Old Heidelberg files.

There are a variety of tools that can generate and audit fixity information and, since it is a relatively simple function computationally, many of these tools are integrated into other common digital preservation tools. For instance, Bagger (a tool built on the BagIt specification for transferring digital content) generates a “manifest” of each “bag” of files it creates, which includes an MD5 checksum for each file. When you validate the files in a bag after you have received it from a sender, Bagger confirms both the contents of the bag and the fixity of each file included.

The first requirement is quite simple. If a donor, content contributor, or creator who is sending you digital files, or self-depositing them into your repository, generated fixity information before delivery and included it with the content, you should confirm that the content you received still matches that information. You do this by “checking” the fixity information: generate new checksums after you receive the content and compare them to the checksums created prior to transfer.
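As a rough sketch of what checking fixity on ingest involves, the following Python reads a manifest and compares each recorded checksum with a freshly generated one. It assumes a manifest laid out one file per line as a checksum, whitespace, then a relative path, which is the general shape of a BagIt manifest. The function names and return values are hypothetical, and a tool such as Bagger performs additional validation that is not shown here.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="md5"):
    """Return the hex checksum of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, base_dir, algorithm="md5"):
    """Map each path listed in the manifest to 'ok', 'changed' or 'missing'."""
    results = {}
    text = Path(manifest_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Assumed line layout: "<checksum> <relative path>"
        recorded, relative_path = line.split(maxsplit=1)
        relative_path = relative_path.strip()
        target = Path(base_dir) / relative_path
        if not target.is_file():
            results[relative_path] = "missing"
        elif file_checksum(target, algorithm) == recorded.lower():
            results[relative_path] = "ok"
        else:
            results[relative_path] = "changed"
    return results
```

Any entry reported as "changed" or "missing" is a reason to go back to the sender for a fresh copy before accepting the transfer.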

Create fixity info if it wasn’t provided with the content

The second requirement of the cell is, perhaps, even more straightforward. If you, the archivist/preservationist, welcome the not-insignificant responsibility to preserve this digital content that you are accepting (in some cases for a set period, in some in perpetuity), how will you monitor whether this information has changed, be it because of bit rot, intentional or unintentional alteration, or through other means? To preserve is to safeguard from change and guarantee accessibility into the future — no small responsibility with digital content. Thus, if the content you have received was provided without fixity information, you should generate it upon taking custody of the content.
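Generating fixity information for content that arrived without any can be sketched as follows. The manifest layout (checksum, two spaces, relative path), the choice of SHA256 and the function name `create_manifest` are assumptions made for this example. The Levels do not prescribe a format or tool.

```python
import hashlib
from pathlib import Path


def create_manifest(content_dir, manifest_path, algorithm="sha256"):
    """Write one '<checksum>  <relative path>' line per file; return the file count.

    Keep manifest_path outside content_dir so that later runs do not
    include the manifest itself in the listing.
    """
    content_dir = Path(content_dir)
    lines = []
    for path in sorted(p for p in content_dir.rglob("*") if p.is_file()):
        digest = hashlib.new(algorithm)
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        relative = path.relative_to(content_dir).as_posix()
        lines.append(f"{digest.hexdigest()}  {relative}")
    Path(manifest_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

Running this once when you take custody gives you a baseline that every later fixity check can be compared against.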

The Levels guidance, in order to remain as broadly applicable and useful as possible, does not provide recommendations on specific tools, processes or policies on generating fixity information, but the resources below should be able to provide information on available tools, current practices and other resources to help institutions plan and implement Level 1 File Fixity and Data Integrity practices.

Resources:

Other fixity-related posts on The Signal:

If you know of links to other good fixity resources, please share them in the comments!

2 Comments

  1. Joshua Ranger
    April 8, 2014 at 8:49 am

    Thanks for the great overview, Jefferson, and a great idea to do some deep dives on the Levels. It would be great if these posts were eventually linked or packaged with whatever is the primary home of the Levels. And if it’s not too much shameless self-promotion, I would like to point to an additional resource that AVPreserve has developed. Fixity is a free application we created that can be set to run regular fixity checks on a specific set of folders, hard drive, NAS or other storage device. It looks not only at checksums, but also at whether files have moved, disappeared, or been added to the directory, and sends e-mail notification of changes. Available only for Windows right now, but a Mac version should be ready in about 2 weeks: http://www.avpreserve.com/avpsresources/tools/

  2. bh
    April 9, 2014 at 4:35 pm

    I don’t understand why a previous comment of mine on the issue of cover sheets and checksums was deleted. Cover sheets certainly impact on the reliability of checksums, and the point at which they are added in the repository workflow is critical. And the use of automated cover sheets is becoming more popular, yet little/no thought appears to have been given to them in the context of checksums/fixity.
