File Fixity and Digital Preservation Storage: More Results from the NDSA Storage Survey

The following is a guest post by Jefferson Bailey, Fellow at the Library of Congress’s Office of Strategic Initiatives.

A vexing property of digital objects is the difficulty they pose for ensuring their ongoing authenticity and stability. Files can become corrupted by use, bits can rot even when unused, and during transfer the parts essential to an object’s operability can be lost. At the most basic level, digital preservation requires us to be confident that the objects we are working with are the same as they were prior to our interaction with them.

To deal with this problem, those in the digital preservation field often talk about the fixity of digital objects. Fixity, in this sense, is the property of being constant, steady, and stable. Thankfully, there are some great ways that content stewards can check their digital objects to make sure that they maintain these qualities. Fixity checking is the process of verifying that a digital object has not been altered or corrupted. In practice, this is most often accomplished by computing and comparing checksums or hashes. For additional details on methods for checking fixity, see Kate Zwaard’s excellent earlier post, “Hashing out Digital Trust.”
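As a rough illustration of what computing and comparing checksums can look like, the Python sketch below hashes a file and compares the result with a previously recorded value. The function names `compute_checksum` and `verify_fixity` are hypothetical and not taken from any particular preservation tool, and the choice of SHA-256 is an assumption rather than something the survey prescribes.

```python
import hashlib


def compute_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected_digest, algorithm="sha256"):
    """Return True if the file's current digest matches the recorded one."""
    return compute_checksum(path, algorithm) == expected_digest.lower()
```

In this sketch, a steward would record the output of `compute_checksum` when an object is first received and later call `verify_fixity` with that recorded value; a result of `False` signals that the bytes have changed since the digest was recorded.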

NDSA Members’ Approaches to Fixity Checking

The National Digital Stewardship Alliance (NDSA) is a network of partners dedicated to ensuring enduring access to digital information. The Alliance’s mission is to establish, maintain, and advance the capacity to preserve our nation’s digital resources for the benefit of present and future generations. Over the last few months, the NDSA Infrastructure Working Group has been reporting on the results of the NDSA member survey examining trends in preservation storage. See the note at bottom for details on the survey. Previous posts have discussed the role of access requirements and cloud and distributed storage in building preservation infrastructure. Another key theme that emerged from the survey was the prevalence of fixity checking as a performance requirement and the challenges imposed on storage systems by this activity.

Given the essential need to know the validity and consistency of the objects we are preserving, it is great to see that 88% of the responding members are doing some form of fixity checking on content they are preserving. The widespread use of fixity checking illustrates its acceptance as an important component in digital preservation workflows.

With that said, NDSA members are taking distinctly different approaches to checking the fixity of their content. The differences are most likely due to a variety of complicated issues including the scalability of fixity-checking software, network limitations and data transfer costs, transaction volume and access requirements, and other contextual factors around the availability and management of specific sets of content. Amongst survey respondents, fixity checking occurs as follows, with some members maintaining multiple practices:

88% (49 of 56) of the organizations report that they are doing some form of fixity checking on content they are preserving.
57% (32 of 56) of the organizations are doing checks before and after transactions such as ingest.
43% (24 of 56) of the organizations are doing checks on some recurring fixed schedule.
32% (18 of 56) of the organizations are randomly sampling their content to check fixity.
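Two of the practices listed above, checking before and after a transaction such as ingest and randomly sampling stored content, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: objects are held in in-memory dictionaries rather than a real storage system, and the names `ingest`, `sample_check`, `store`, and `manifest` are hypothetical.

```python
import hashlib
import random


def sha256_of(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def ingest(source, store, manifest):
    """Copy each object into `store`, checking its digest before and after."""
    for name, data in source.items():
        before = sha256_of(data)
        store[name] = bytes(data)  # stand-in for a real transfer
        after = sha256_of(store[name])
        if before != after:
            raise ValueError(f"fixity mismatch during ingest: {name}")
        manifest[name] = after  # recorded digest for later checks


def sample_check(store, manifest, sample_size, rng=random):
    """Check a random sample of stored objects; return the names that fail."""
    names = sorted(manifest)
    chosen = rng.sample(names, min(sample_size, len(names)))
    failures = []
    for name in chosen:
        data = store.get(name)
        # A missing object counts as a failure, as does a changed digest.
        if data is None or sha256_of(data) != manifest[name]:
            failures.append(name)
    return failures
```

A scheduled check would be the same comparison run over every entry in the manifest at a fixed interval rather than over a sample.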

While fixity checking itself is widespread, NDSA members also take various approaches to scheduling these checks. Of the 24 organizations that run fixity checks on a fixed schedule:

46% (11 of 24) check fixity of content on at least a monthly basis.
21% (5 of 24) check fixity of content on at least a quarterly basis.
29% (7 of 24) check fixity of content on an annual basis.
4% (1 of 24) check fixity of content on a tri-annual basis.

The Future of Fixity

NDSA Infrastructure Working Group members have frequently noted that the state of the art in fixity checking involves distributed fixity checking and frequent, robust repair of intentional or unintentional corruption. This is done by replacing corrupted data with the distributed, replicated, and verified data held at “mirroring” partner repositories in multi-institutional, collaborative distributed networks. Consortia groups MetaArchive and Data-PASS use LOCKSS for this kind of distributed fixity checking and repair. In addition, some individual institutions use a self-maintained distributed repository system that allows them to replace damaged content with a verified, uncorrupted copy.
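The general idea of repairing a damaged copy from a verified replica can be sketched as follows. This is a deliberately simplified illustration and not a description of how LOCKSS or any other named system works: it assumes a trusted recorded digest is available, models each site as an in-memory dictionary, and uses the hypothetical name `repair_from_replicas`.

```python
import hashlib


def sha256_of(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def repair_from_replicas(name, expected_digest, replicas):
    """Repair damaged or missing copies of `name` using a verified copy.

    `replicas` maps a site label to that site's dict of name -> bytes.
    Returns the list of site labels whose copy was replaced.
    """
    # Find any copy whose digest matches the recorded value.
    good = None
    for site_store in replicas.values():
        data = site_store.get(name)
        if data is not None and sha256_of(data) == expected_digest:
            good = data
            break
    if good is None:
        raise LookupError(f"no verified copy of {name} at any site")

    # Overwrite every copy that is missing or fails verification.
    repaired = []
    for site, site_store in replicas.items():
        data = site_store.get(name)
        if data is None or sha256_of(data) != expected_digest:
            site_store[name] = good
            repaired.append(site)
    return repaired
```

The sketch also shows why replication matters for repair: if no site holds a copy that verifies, the function can only report the loss.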

As previously mentioned, one of the key interests of this NDSA working group was the potential role for cloud storage systems in digital preservation storage architectures. For those using cloud storage systems, complying with fixity requirements can prove problematic. As David Rosenthal has pointed out, cloud services are not able to prove that they are not simply replaying fixity information created and stored at the time of deposit. Rosenthal highlights the need for cloud services to provide a tool or service to verify that the systems hold the content rather than simply caching the fixity metadata. Barring that kind of assurance, it can be prohibitively expensive to run any kind of frequent fixity checks on content in various cloud storage platforms.
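To make the replay concern concrete, the sketch below contrasts a plain stored digest, which a service could cache and return without holding the content, with a response that depends on a fresh random value (a nonce) chosen by the depositor. This is one illustrative technique and an assumption on our part, not a feature the survey or Rosenthal attributes to any cloud provider; the function names are hypothetical, and the approach assumes the depositor precomputes a supply of challenges before handing over the content.

```python
import hashlib
import hmac
import secrets


def respond_to_challenge(content, nonce):
    """What a storage service would compute: a digest keyed by the nonce."""
    return hmac.new(nonce, content, hashlib.sha256).hexdigest()


def make_challenges(content, count):
    """Precompute (nonce, expected_response) pairs before deposit."""
    challenges = []
    for _ in range(count):
        nonce = secrets.token_bytes(32)
        challenges.append((nonce, respond_to_challenge(content, nonce)))
    return challenges


def audit(service_respond, nonce, expected):
    """Send one unused nonce to the service and compare its answer."""
    return hmac.compare_digest(service_respond(nonce), expected)
```

A service that only cached the deposit-time checksum has no way to produce the right answer for a nonce it has never seen, while a service that still holds the intact bytes can. Each nonce should be used only once, since a reused nonce could itself be replayed.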

As we detailed in the previous survey posts, built-in functionality like automated fixity checking and repair was the most desired feature in future preservation storage systems. This desire, along with the challenges of system-type dependencies and the lack of uniform current practices in fixity checking procedures, shows the complex interplay between access, performance, preservation requirements, storage infrastructure, and institutional resources. As practices such as fixity checking become ubiquitous and new options like distributed storage gain further acceptance, the hardware underpinning these requirements will be called upon to meet new demands. Our hope is that preservation stewards navigating these decisions will benefit from the knowledge and experience of other NDSA members as they encounter similar complexities and devise new solutions.

Note on survey data

The NDSA Infrastructure Survey, conducted between August 2011 and November 2011, received responses from 58 members of the 74 NDSA member organizations who are preserving digital content. This represents a 78% response rate. The goal of this survey was to get a snapshot of current storage practices within the organizations of the National Digital Stewardship Alliance.

The original survey was sent out to the then 98 members of the NDSA (current membership is 116 institutions). We confirmed that 24 members do not have content they are actively involved in preserving. These organizations include consortia groups, professional organizations, university departments, funders and vendors. There were 16 organizations that neither responded to the survey nor indicated that they were not preserving digital collections. The 16 non-respondents are distributed across the different kinds of organizations in the NDSA, including state archives, service providers, federal agencies, universities and media producers.

Comments (2)

Peter McKinney says:
March 11, 2012 at 5:31 pm

This is really valuable information and very enlightening. I’m pretty amazed that so many organisations do monthly checks. Is there any correlation between size of collection and frequency of checks?

Also, did the survey collect any information on how many failures are encountered during the checks?

The posts that the Library are putting up here are really excellent.

Cheers,
Pete McKinney,
National Library of NZ

Jefferson Bailey says:
March 13, 2012 at 2:03 pm

Hi Peter and thanks for the kind words! There are more posts forthcoming out of the storage survey, so keep tuning in.

It is hard to answer your first question. One thing that came out of the survey was that many preservation process requirements are very collection-dependent. So one large repository may have different frequencies of fixity checking for different collections. Also, some systems, such as DAITSS and LOCKSS, are continuously checking fixity. Breaking out the results a bit more, though, it is fair to say that larger collections/repositories are more likely to have multiple points at which checking occurs, such as both at ingest and on a schedule.

Regarding your second question, we did not get to the level of detail of collecting information on file corruption rates, though I agree that would be interesting to know and would be useful information for modeling costs, process requirements, infrastructure needs, etc. Though I can’t point to anything directly, I’d imagine LOCKSS or other distributed networks have at some point published metrics about failure rates. Post here if you find anything and in the meantime I will pose this question to the Infrastructure Working Group.

Thanks again for reading and for the comment.
