The following is a guest post from Stephen Abrams, Associate Director, UC Curation Center/California Digital Library. Stephen recently represented an action team from the NDSA innovation working group in a presentation on this idea at the Designing Storage Architectures for Preservation Collections meeting. His slides from that talk are available online (PDF).
During the recent meeting of the National Digital Stewardship Alliance innovation working group, an interesting question arose: what would be an appropriate metric for characterizing the quality of a preservation storage system or repository? While a preservation repository should support a variety of important functions, its primary obligation is to safeguard the bit-level integrity of the managed digital assets. Without this strong foundation, the value of higher-level functions is diminished, in whole or in part.
When trying to design a new metric for bit-level integrity, it is useful to consider how we evaluate other infrastructural components of a preservation environment. To measure the consistency of a file system we use fsck, scandisk, or a ZFS scrub; to determine server availability we use ping; and to evaluate the availability of online resources we perform link checking. All of these tools share four common characteristics that would also be useful in a repository quality assurance metric.
- First, they are objective: it is clear what is being measured and what the success and failure criteria are.
- Second, they are repeatable: they can be invoked at will over time to examine systems as conditions change.
- Third, they are independently verifiable: the measurement is made from outside of the system that is being measured. (The importance of this aspect is captured in the adage “trust, but verify”; or in other words, what you say about yourself is often less interesting than what others say about you.)
- Lastly, they are simple, which facilitates widespread adoption.
It is also important to try to exploit information or functions that are already available. Two things are pertinent here:
- Repositories already provide access to managed content through stable URLs. (After all, access is what repositories are for.)
- Content curators already know the sizes and checksums of their content. (A checksum, also known as a hash or message digest, can be thought of as a unique “fingerprint” for a piece of digital content.)
Keeping all of this in mind, we can propose our new metric: an external agent that retrieves digital assets on the basis of stable URLs and verifies their bit-level integrity by comparing size and a computed checksum with values known to the agent to be correct.
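To make the idea concrete, here is a minimal sketch of such an external verification agent. The URL, size, and SHA-256 value in the inventory are purely illustrative placeholders, and the function names are hypothetical; a real agent would read its inventory of stable URLs and known fixity values from the curator's own records.

```python
# Sketch of an external fixity-checking agent: fetch each asset by its stable
# URL, recompute its size and SHA-256 checksum, and compare against values
# the agent already holds as correct. All identifiers below are illustrative.
import hashlib
import urllib.request

# Hypothetical inventory: stable URL -> (expected size in bytes, expected SHA-256 hex digest)
INVENTORY = {
    "https://repository.example.org/object/demo-asset": (
        1048576,
        "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    ),
}

def verify(url, expected_size, expected_sha256, chunk_size=65536):
    """Retrieve the asset at `url`, recompute size and SHA-256, and return a verdict."""
    digest = hashlib.sha256()
    size = 0
    with urllib.request.urlopen(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            size += len(chunk)
            digest.update(chunk)
    ok = (size == expected_size) and (digest.hexdigest() == expected_sha256)
    return ok, size, digest.hexdigest()

if __name__ == "__main__":
    for url, (exp_size, exp_hash) in INVENTORY.items():
        ok, size, actual_hash = verify(url, exp_size, exp_hash)
        print(("OK  " if ok else "FAIL") + f" {url} size={size} sha256={actual_hash}")
```

Because the agent runs outside the repository and relies only on public access URLs and independently held fixity values, it satisfies the four characteristics above: the pass/fail criteria are objective, the check can be repeated at will, the verification is independent of the system being measured, and the mechanism is simple enough for widespread adoption.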
There are many tools that could be used as the basis for this type of activity. Ongoing bit-level verification is a core activity of the LOCKSS system, but for the time being that function is available for use only on content within a LOCKSS network. At the CDL we have developed a stand-alone Fixity service as part of the evolving micro-services toolkit underlying the Merritt repository. Another option is the ACE tool from the University of Maryland Institute for Advanced Computer Studies.
There is a useful analogy that can be drawn between this type of external validation of a repository’s stewardship obligation and the notion of a community Neighborhood Watch, in which neighbors take responsibility for watching out for the safety and well-being of their neighbors. It has often been remarked that effective and sustainable solutions for long-term digital preservation will only come about through concerted community action. We are all neighbors in this community endeavor and should be looking out for each other. A digital neighborhood watch provides us with the tools for the multilateral verification that we are collectively meeting our obligations. It would function as an early warning system for conditions that require preservation intervention and as a confidence-building measure for the clients who have entrusted us with stewardship responsibilities for our collective digital heritage.