The following is a guest post by Jefferson Bailey, Fellow at the Library of Congress’s Office of Strategic Initiatives.
A vexing property of digital objects is the difficulty they pose in ensuring their ongoing authenticity and stability. Files can become corrupted through use, bits can rot even at rest, and parts essential to an object's operability can be lost in transfer. At the most basic level, digital preservation requires us to be confident that the objects we are working with are the same as they were before we interacted with them.
To deal with this problem, those in the digital preservation field often talk about the fixity of digital objects. Fixity, in this sense, is the property of being constant, steady, and stable. Thankfully, there are some great ways that content stewards can check their digital objects to make sure that they maintain these qualities. Fixity checking is the process of verifying that a digital object has not been altered or corrupted. In practice, this is most often accomplished by computing and comparing checksums or hashes. For additional details on methods for checking fixity, see Kate Zwaard’s excellent earlier post, “Hashing out Digital Trust.”
NDSA Members’ Approaches to Fixity Checking
The National Digital Stewardship Alliance (NDSA) is a network of partners dedicated to ensuring enduring access to digital information. The Alliance’s mission is to establish, maintain, and advance the capacity to preserve our nation’s digital resources for the benefit of present and future generations. Over the last few months, the NDSA Infrastructure Working Group has been reporting on the results of the NDSA member survey examining trends in preservation storage. See the note at bottom for details on the survey. Previous posts have discussed the role of access requirements and cloud and distributed storage in building preservation infrastructure. Another key theme that emerged from the survey was the prevalence of fixity checking as a performance requirement and the challenges imposed on storage systems by this activity.
Given the essential need to verify the validity and consistency of the objects we are preserving, it is encouraging that 88% of the responding members are doing some form of fixity checking on content they are preserving. The widespread use of fixity checking illustrates its acceptance as an important component of digital preservation workflows.
With that said, NDSA members are taking distinctly different approaches to checking the fixity of their content. The differences are most likely due to a variety of complicated issues including the scalability of fixity-checking software, network limitations and data transfer costs, transaction volume and access requirements, and other contextual factors around the availability and management of specific sets of content. Amongst survey respondents, fixity checking occurs as follows, with some members maintaining multiple practices:
- 88% (49 of 56) of the organizations report that they are doing some form of fixity checking on content they are preserving.
- 57% (32 of 56) of the organizations are doing checks before and after transactions such as ingest.
- 43% (24 of 56) of the organizations are doing checks on some recurring fixed schedule.
- 32% (18 of 56) of the organizations are randomly sampling their content to check fixity.
While fixity checking itself is widespread, NDSA members also take various approaches to scheduling these checks. Of the 24 organizations that run fixity checks on a fixed schedule:
- 46% (11 of 24) check fixity of content on at least a monthly basis.
- 21% (5 of 24) check fixity of content on at least a quarterly basis.
- 29% (7 of 24) check fixity of content on an annual basis.
- 4% (1 of 24) check fixity of content on a tri-annual basis.
The Future of Fixity
NDSA Infrastructure Working Group members have frequently noted that the state of the art in fixity checking involves distributed fixity checking and frequent, robust repair of intentional or unintentional corruption. This is done by replacing corrupted data with the distributed, replicated, and verified data held at "mirroring" partner repositories in multi-institutional, collaborative distributed networks. The MetaArchive and Data-PASS consortia use LOCKSS for this kind of distributed fixity checking and repair. In addition, some individual institutions maintain their own distributed repository systems that allow them to replace damaged content with a verified, uncorrupted copy.
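The core idea behind distributed repair can be sketched as a simple majority vote over replicas: if the local copy disagrees with the digest most mirrors agree on, replace it with a matching mirror copy. This is only an illustration of the principle; LOCKSS itself uses a considerably more sophisticated polling and repair protocol:

```python
import hashlib
from collections import Counter

def repair_from_mirrors(local_bytes, mirror_copies):
    """Majority-vote repair sketch (not the actual LOCKSS protocol).

    Compute digests of all mirror copies, find the majority digest,
    and if the local copy disagrees, return a mirror copy that matches
    the majority. Otherwise return the local copy unchanged.
    """
    digests = [hashlib.sha256(c).hexdigest() for c in mirror_copies]
    majority_digest, _ = Counter(digests).most_common(1)[0]
    if hashlib.sha256(local_bytes).hexdigest() == majority_digest:
        return local_bytes
    for copy in mirror_copies:
        if hashlib.sha256(copy).hexdigest() == majority_digest:
            return copy
    return local_bytes
```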
As previously mentioned, one of the key interests of this NDSA working group was the potential role of cloud storage systems in digital preservation storage architectures. For those using cloud storage, complying with fixity requirements can prove problematic. As David Rosenthal has pointed out, cloud services are not able to prove that they are not simply replaying fixity information created and stored at the time of deposit. Rosenthal highlights the need for cloud services to provide a tool or service that verifies the systems actually hold the content rather than simply caching the fixity metadata. Barring that kind of assurance, it can be prohibitively expensive to run any kind of frequent fixity checks on content in various cloud storage platforms.
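One way to address the replay problem Rosenthal describes is a challenge-response check: the depositor sends a fresh random nonce, and the storage provider must hash the nonce together with the content. Because the nonce changes each time, a cached digest alone cannot produce the right answer. The sketch below is a simplified illustration of that idea, not a description of any existing cloud service's API; note that verifying the response still requires the depositor to hold an independent copy (or precomputed challenge/response pairs):

```python
import hashlib
import os

def make_challenge():
    """Depositor generates a fresh random nonce for each check."""
    return os.urandom(16)

def prove_possession(nonce, content_bytes):
    """Provider's side: hash the nonce concatenated with the content.
    A provider that only cached an earlier digest cannot compute this."""
    return hashlib.sha256(nonce + content_bytes).hexdigest()

def verify(nonce, response, content_bytes):
    """Depositor's side: recompute the expected response from an
    independent copy and compare."""
    return response == prove_possession(nonce, content_bytes)
```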
As we detailed in the previous survey posts, built-in functionality like automated fixity checking and repair was the most desired feature in future preservation storage systems. This desire, along with the challenges of system-type dependencies and the lack of uniformity in current fixity-checking practices, shows the complex interplay between access, performance, preservation requirements, storage infrastructure, and institutional resources. As practices such as fixity checking become ubiquitous and new options like distributed storage gain further acceptance, the hardware underpinning these requirements will be called upon to meet new demands. Our hope is that preservation stewards navigating these decisions will benefit from the knowledge and experience of other NDSA members as they encounter similar complexities and devise new solutions.
Note on survey data
The NDSA Infrastructure Survey, conducted between August 2011 and November 2011, received responses from 58 members of the 74 NDSA member organizations who are preserving digital content. This represents a 78% response rate. The goal of this survey was to get a snapshot of current storage practices within the organizations of the National Digital Stewardship Alliance.
The original survey was sent out to the then 98 members of the NDSA (current membership is 116 institutions). We confirmed that 24 members do not have content they are actively involved in preserving. These organizations include consortia groups, professional organizations, university departments, funders and vendors. There were 16 organizations that neither responded to the survey nor indicated that they were not preserving digital collections. The 16 non-respondents are distributed across the different kinds of organizations in the NDSA, including state archives, service providers, federal agencies, universities and media producers.