The following is a guest post from Stephen Abrams, Associate Director, UC Curation Center/California Digital Library. Stephen recently represented an action team from the NDSA innovation working group in a presentation on this idea at the Designing Storage Architectures for Preservation Collections meeting. His slides from that talk are available online (PDF).
During the recent meeting of the National Digital Stewardship Alliance innovation working group, an interesting question arose: what would be an appropriate metric for characterizing the quality of a preservation storage system or repository? While a preservation repository should support a variety of important functions, its primary obligation is to safeguard the bit-level integrity of the managed digital assets. Without that strong foundation, the value of all higher-level functions is diminished, in whole or in part.
When trying to design a new metric for bit-level integrity, it is useful to consider how we evaluate other infrastructural components of a preservation environment. To measure the consistency of a file system we use fsck, scandisk, or a ZFS scrub; to determine server availability we use ping; and to evaluate the availability of online resources we perform link checking. All of these tools share four common characteristics that would also be useful in a repository quality assurance metric.
- First, they are objective: it is clear what is being measured and what the success and failure criteria are.
- Second, they are repeatable: they can be invoked at will over time to examine systems as conditions change.
- Third, they are independently verifiable: the measurement is made from outside of the system that is being measured. (The importance of this aspect is captured in the adage “trust, but verify”; or in other words, what you say about yourself is less often interesting than what others say about you.)
- Lastly, they are simple, which facilitates widespread adoption.
It is also important to try to exploit information or functions that are already available. Two things are pertinent here:
- Repositories already provide access to managed content through stable URLs. (After all, access is what repositories are for.)
- Content curators already know the sizes and checksums of their content. (A checksum, also known as a hash or message digest, can be thought of as a unique “fingerprint” for a piece of digital content; a minimal sketch of computing one follows this list.)
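To make the “fingerprint” idea concrete, here is a minimal Python sketch of how a curator might record the two reference values for an asset: its size in bytes and a SHA-256 digest. The file name is a hypothetical placeholder, not drawn from any actual collection.

```python
# Minimal sketch: compute the size and SHA-256 "fingerprint" of a file.
import hashlib
import os

def fingerprint(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return (size_in_bytes, hex_digest) for the file at `path`."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()

# The same bytes always yield the same digest; any change yields a different one.
size, checksum = fingerprint("page1.tiff")  # hypothetical file name
print(size, checksum)
```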
Keeping all of this in mind, we can propose our new metric: an external agent that retrieves digital assets on the basis of their stable URLs and verifies their bit-level integrity by comparing size and a computed checksum against values known to the agent to be correct.
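As a rough illustration only, here is what such an agent might look like in Python, assuming it holds its own independently maintained inventory of stable URLs, sizes, and SHA-256 digests. The URL and reference values below are hypothetical placeholders, not part of any actual repository.

```python
# Sketch of an external fixity agent: fetch each asset by its stable URL and
# compare observed size and SHA-256 digest against independently held values.
import hashlib
import urllib.request

# Inventory held by the agent, independent of the repository being checked.
# URL, size, and digest below are placeholders for illustration only.
inventory = {
    "https://repository.example.org/ark:/99999/fk4example": {
        "size": 1048576,
        "sha256": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    },
}

def verify(url, expected):
    """Fetch the object at `url` and report whether size and checksum match."""
    digest = hashlib.sha256()
    size = 0
    with urllib.request.urlopen(url) as response:
        for chunk in iter(lambda: response.read(1024 * 1024), b""):
            size += len(chunk)
            digest.update(chunk)
    return size == expected["size"] and digest.hexdigest() == expected["sha256"]

for url, expected in inventory.items():
    status = "OK" if verify(url, expected) else "FIXITY FAILURE"
    print(url, status)
```

In practice the inventory would come from the curator’s own records, gathered at or before the time of deposit, so that the comparison remains genuinely independent of the system being measured.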
There are many tools that could be used as the basis for this type of activity. Ongoing bit-level verification is a core activity of the LOCKSS system, but for the time being that function is available only for content within a LOCKSS network. At the CDL we have developed a stand-alone Fixity service as part of the evolving micro-services toolkit underlying the Merritt repository. Another option is the ACE tool from the University of Maryland Institute for Advanced Computer Studies.
A useful analogy can be drawn between this type of external validation of a repository’s stewardship obligations and a community Neighborhood Watch, in which neighbors take responsibility for watching out for one another’s safety and well-being. It has often been remarked that effective and sustainable solutions for long-term digital preservation will only come about through concerted community action. We are all neighbors in this endeavor and should be looking out for each other. A digital neighborhood watch provides the tools for multilateral verification that we are collectively meeting our obligations. It functions as an early warning system for conditions that require preservation intervention, and as a confidence-building measure for the clients who have entrusted us with stewardship responsibility for our collective digital heritage.
Comments (3)
I have yet to figure out why digital preservationists have decided to call checksumming fixity checking. It just muddies up the google searching. 🙂
Sysadmins have been validating files with checksums for a long time, but instead of under the topic “digital preservation” it’s under the topic of “security” (i.e., did a hacker alter a file?).
I think the idea behind the term fixity checking is to separate what you are conceptually doing from exactly how you are doing it. So in the digital preservationist vocabulary, the fixity of a digital object is a property that one would want to check. Checksums are one way to check the fixity of a particular object, but there are other methods of checking fixity.
With that said, when most people are talking about fixity checks they are currently talking about checksums. So the point is well taken 🙂
For reference, this paper gives a more in-depth take on the idea: http://www.library.yale.edu/iac/DPC/AN_DPC_FixityChecksFinal11.pdf
Good post. Some further thoughts.
1. On QA metrics. These must satisfy two additional characteristics:
– “Objectivity” combines two characteristics which are more typically separated in the testing literature. I recommend using the standard terminology:
(a) Test Reliability — the consistency of repeated measurements of the same unit.
(b) Test Validity. This gets at the idea of “measuring what you think you are measuring”. Technically this latter concept is “construct validity”, but it’s often impossible to establish construct validity directly, so evidence of criterion and construct validity is usually required for a measure to be convincing.
– Cost. A measure can be simple but costly. This is particularly relevant to the proposed strategy of retrieving entire objects and computing fixity on them. Costs of this approach may be high, both computationally and in direct $ charged to the owner of the content, when applying it to large objects stored in cloud storage systems — or any other systems that charge for bandwidth. (A rough back-of-envelope sketch follows below.)
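As a rough illustration of the scale of those direct charges, here is a back-of-envelope sketch; every figure below is an assumption chosen for the example, not an actual provider rate.

```python
# Back-of-envelope estimate of egress charges for whole-object fixity checking.
collection_tb = 10          # assumed collection size to verify
egress_rate_per_gb = 0.09   # assumed illustrative $/GB charge for outbound transfer
passes_per_year = 4         # assumed quarterly full-collection verification

annual_cost = collection_tb * 1024 * egress_rate_per_gb * passes_per_year
print(f"Approximate annual egress cost: ${annual_cost:,.2f}")
# With these assumptions, roughly $3,686 per year, before any compute costs.
```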
2. “Content curators already know the sizes and checksums of their content.” seems more correct normatively than positively. Clearly, some curators don’t have this information. Moreover, there is a question of independent validation — for example, one should place less trust in fixity information provided by AWS if one wants to test the reliability (etc.) of AWS storage. The fixity information used as a reference in the testing should have a provenance independent of the system being tested.