The following is a guest post from Andrea Goethals, digital preservation and repository services manager at the Harvard University Library.
This post is similar to a presentation I gave as part of a panel called “Assessing and Mitigating Bit-Level Preservation Risks” at DigitalPreservation 2012. It grew out of conversations and work within the NDSA Infrastructure Working Group.
As digital preservation practitioners, we often focus on the operational aspects of our work – what I call the “whats” and “hows” – and not enough on the “whys.” We know that it’s a good idea to keep multiple copies of the content we’re preserving, and that it’s a good idea to not house all our copies in the same building, but what are the underlying concepts behind these best practices? As you can guess by the title of this post, they are redundancy and diversity.
When we engage in the preservation of digital content, we are engaging in risk management. It is a certainty that “failures” will happen over time, at some level. Media will decay, files will be lost, and organizations will fail. We have to accept that failures will occur and instead focus on being able to recover from failures, or at least buffer the disturbances they cause. Redundancy and diversity can help mitigate these risks.
Redundancy means having multiple duplicates of something. In ecology there is a theory called the redundancy hypothesis, which holds that, up to a point, species redundancy (the number of species playing the same ecosystem role) enhances ecosystem resiliency (see Figure 1). Note that this theory acknowledges that after a certain point there are diminishing returns to redundancy. Similarly, in digital preservation, we enhance the resiliency of our digital collections by keeping duplicate copies, but we should acknowledge that there is such a thing as too many copies. After a certain number of redundant copies, additional copies can be hard to justify because they don’t contribute enough to the overall risk mitigation effort to warrant the extra cost.
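One way to make the diminishing returns concrete is a back-of-the-envelope calculation. The sketch below is only an illustration: it assumes that each copy is lost independently with the same probability over some period, and the value 0.1 is a made-up placeholder, not a measured failure rate. The function names are hypothetical.

```python
def loss_probability(p_copy_loss: float, copies: int) -> float:
    """Chance that every copy is lost, ASSUMING copies fail independently."""
    return p_copy_loss ** copies


def marginal_benefit(p_copy_loss: float, copies: int) -> float:
    """Reduction in loss probability gained by adding the last copy."""
    return loss_probability(p_copy_loss, copies - 1) - loss_probability(
        p_copy_loss, copies
    )


# Illustrative placeholder: a 10% chance of losing any single copy.
P_COPY_LOSS = 0.1

for n in range(1, 7):
    line = f"{n} copies: P(all lost) = {loss_probability(P_COPY_LOSS, n):.6f}"
    if n > 1:
        line += f", gain from last copy = {marginal_benefit(P_COPY_LOSS, n):.6f}"
    print(line)
```

Under these assumptions each additional copy removes less risk than the one before it, while presumably costing about the same to store. Real failures are rarely fully independent, which is where diversity comes in.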
Diversity means having variations of something. In finance, the portfolio effect suggests that risk to an investment portfolio is reduced by investing in a variety of assets. If your portfolio is diverse and the value of one of your assets goes down, you haven’t lost everything. Similarly, the ecological theory called response diversity assumes that diversification stabilizes ecosystem processes. The theory is that diverse species respond differently to disturbances, thus increasing the chances that entire ecosystem processes won’t be wiped out after a disturbance. If you have many different types of trees on your street, chances are they won’t all be wiped out by a single bug infestation. In digital preservation, we also use diversity to reduce the chances of catastrophic failure, for example by storing copies of content on different types of storage media, or by using storage locations with different geographic threats, so that a single disaster can’t harm all copies.
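The effect of geographic diversity can be sketched with a toy model. Everything here is an assumption made for illustration: a site-wide disaster destroys every copy at that site, disasters at different sites are independent, individual copies can also fail independently, and the probabilities are invented placeholders. The function names are hypothetical.

```python
def p_all_lost_same_site(p_disaster: float, p_copy_fault: float, copies: int) -> float:
    """All copies share one site: a single disaster takes out every copy."""
    # Either the site disaster happens, or it does not and every copy
    # fails on its own.
    return p_disaster + (1 - p_disaster) * p_copy_fault ** copies


def p_all_lost_separate_sites(p_disaster: float, p_copy_fault: float, copies: int) -> float:
    """Each copy sits at its own site; sites are ASSUMED to fail independently."""
    # One copy is lost if its site has a disaster, or if the copy itself fails.
    p_one_copy_lost = p_disaster + (1 - p_disaster) * p_copy_fault
    return p_one_copy_lost ** copies


# Illustrative placeholders, not measured rates.
P_DISASTER = 0.01
P_COPY_FAULT = 0.05
COPIES = 3

print("same site:     ", p_all_lost_same_site(P_DISASTER, P_COPY_FAULT, COPIES))
print("separate sites:", p_all_lost_separate_sites(P_DISASTER, P_COPY_FAULT, COPIES))
```

In this model, the same-site risk can never drop below the disaster probability no matter how many copies are added, whereas spreading the copies out removes that floor. The independence assumption is the weak point: two sites exposed to the same hurricane or the same earthquake fault are not really diverse.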
What are the things that can fail, or put another way, what could affect our ability to preserve digital content long-term? We might be tempted to throw our hands up and say, “What can’t fail?” Although it’s true that there are many digital preservation risks, some are especially likely candidates. In the last decade progress has been made on understanding storage system failures and data corruption (if you want to learn more, see the references below this post). Through this research (and perhaps through our own experience) we know that the most likely storage component faults are latent sector errors (physical problems), followed by silent data corruption (higher-level, usually software problems) and, less likely still, whole disk failures. Our own observations suggest another likely candidate for “failure”: organizational disruptions, for example changes in finances, priorities and staffing. Once we’ve identified likely risks like these, we can think about the strategies that could mitigate them.
The table below shows examples of how we might use redundancy and/or diversity to mitigate some of the risks of data loss.
Of course each of these strategies comes at a cost; otherwise we would all simply implement all of them and declare victory against these risks. Instead, we need to make decisions on which strategies to put into place, based on our estimation of the likelihood of a risk occurring, its impact if it did occur, the cost of the strategy, and the resources we have available to us.
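The trade-off among likelihood, impact, cost and available resources can be sketched as a simple ranking exercise. This is one possible approach, not an established method: it scores each strategy by the expected loss it avoids per unit of cost and picks greedily within a budget. The strategy names, the `risk_reduction` field, and every number below are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    likelihood: float      # estimated chance the risk occurs (0 to 1)
    impact: float          # estimated loss if it occurs, in arbitrary units
    risk_reduction: float  # ASSUMED fraction of the risk the strategy removes
    cost: float            # cost of the strategy, in the same units as budget

    def expected_loss_avoided(self) -> float:
        return self.likelihood * self.impact * self.risk_reduction

    def benefit_per_cost(self) -> float:
        return self.expected_loss_avoided() / self.cost


def choose_strategies(strategies: list[Strategy], budget: float) -> list[Strategy]:
    """Greedy pick by benefit per cost; a heuristic, not a guaranteed optimum."""
    chosen = []
    remaining = budget
    for s in sorted(strategies, key=Strategy.benefit_per_cost, reverse=True):
        if s.cost <= remaining:
            chosen.append(s)
            remaining -= s.cost
    return chosen


# Hypothetical candidates with invented estimates.
candidates = [
    Strategy("second copy on different media", 0.2, 1000, 0.9, 50),
    Strategy("copy in another geographic region", 0.05, 1000, 0.8, 100),
    Strategy("periodic fixity checks", 0.1, 500, 0.5, 10),
]

for s in choose_strategies(candidates, budget=70):
    print(f"{s.name}: avoids {s.expected_loss_avoided():.1f} for cost {s.cost}")
```

The estimates are the hard part; the arithmetic only makes explicit which strategies we believe give the most risk reduction for the resources we have.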
I’d like to hear from you – are you using any redundancy or diversity strategies for digital preservation that aren’t covered in the table above? And do you know of additional concepts like redundancy and diversity that are used in other domains to mitigate risks and that might be relevant for digital preservation?
For More Reading…
Amir, Y., & Wool, A. (1996). Evaluating quorum systems over the internet. Paper presented at The twenty-sixth annual international symposium on fault-tolerant computing. doi: 10.1109/FTCS.1996.534591
Bairavasundaram, L., Arpaci-Dusseau, A., Arpaci-Dusseau, R., Goodson, G., & Schroeder, B. (2008). An analysis of data corruption in the storage stack. ACM Transactions on Storage, 4(3), 8.1-8.28. doi: 10.1145/1416944.1416947
Haeberlen, A., Mislove, A., & Druschel, P. (2005, May). Glacier: Highly durable, decentralized storage despite massive correlated failures. Paper presented at NSDI ’05: 2nd symposium on networked systems design & implementation, Boston, MA.
Jiang, W., Hu, C., Zhou, Y., & Kanevsky, A. (2008). Are disks the dominant contributor for storage failures?: A comprehensive study of storage subsystem failure characteristics. ACM Transactions on Storage (TOS), 4(3), doi: 10.1145/1416944.1416946
Pinheiro, E., Weber, W., & Barroso, L. (2007). Failure trends in a large disk drive population. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST ’07).
Shah, S. (2004). Disk drive vintage and its effect on reliability. In Reliability and Maintainability, 2004 Annual Symposium – RAMS (pp. 163-167).