Learning to Live With Failures With A Little Help From Redundancy and Diversity

The following is a guest post from Andrea Goethals, digital preservation and repository services manager at the Harvard University library.

This post is similar to a presentation I gave as part of a panel called “Assessing and Mitigating Bit-Level Preservation Risks” at DigitalPreservation 2012. It grew out of conversations and work within the NDSA Infrastructure Working Group.

As digital preservation practitioners, we often focus on the operational aspects of our work – what I call the “whats” and “hows” – and not enough on the “whys.” We know that it’s a good idea to keep multiple copies of the content we’re preserving, and that it’s a good idea to not house all our copies in the same building, but what are the underlying concepts behind these best practices? As you can guess by the title of this post, they are redundancy and diversity.

When we engage in the preservation of digital content, we are engaging in risk management. It is a certainty that “failures” will happen over time, at some level. Media will decay, files will be lost, and organizations will fail. We have to accept that failures will occur and instead focus on being able to recover from failures, or at least buffer the disturbances they cause. Redundancy and diversity can help mitigate these risks.

Figure 1: Redundancy hypothesis

Redundancy means having multiple duplicates of something. In ecology there is a theory called the redundancy hypothesis, which holds that, up to a point, species redundancy (the number of species playing the same ecosystem role) enhances ecosystem resiliency (see Figure 1). Note that this theory acknowledges that after a certain point there are diminishing returns to redundancy. Similarly, in digital preservation, we enhance the resiliency of our digital collections by keeping duplicate copies, but we should acknowledge that there is such a thing as too many copies. After a certain number of redundant copies, additional copies can be hard to justify because they don’t contribute enough to the overall risk mitigation effort to warrant the extra cost.
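To make the diminishing returns concrete, here is a minimal sketch in Python. It rests on a simplifying assumption that is mine and not part of the ecological theory: each copy is lost independently with the same probability. The function names and the 5% figure are hypothetical, chosen only for illustration.

```python
def loss_probability(p_copy_loss: float, copies: int) -> float:
    """Probability that every copy is lost, ASSUMING independent failures."""
    if not 0.0 <= p_copy_loss <= 1.0:
        raise ValueError("p_copy_loss must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_copy_loss ** copies


def marginal_benefit(p_copy_loss: float, copies: int) -> float:
    """Drop in total-loss probability gained by adding one more copy."""
    return (loss_probability(p_copy_loss, copies)
            - loss_probability(p_copy_loss, copies + 1))


if __name__ == "__main__":
    p = 0.05  # hypothetical chance of losing any single copy in some period
    for n in range(1, 6):
        # Each extra copy buys a smaller reduction in risk than the last.
        print(n, loss_probability(p, n), marginal_benefit(p, n))
```

Under these assumptions each added copy reduces the remaining risk by less than the one before it, while the cost of storing a copy stays roughly the same. Real failures are rarely independent, which is where diversity comes in.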

Diversity means having variations of something. In finance, the portfolio effect suggests that risk to an investment portfolio is reduced by investing in a variety of assets. If your portfolio is diverse and the value goes down on one of your assets, you haven’t lost everything. Similarly, the ecological theory called response diversity assumes that diversification stabilizes ecosystem processes. The theory is that diverse species respond differently to disturbances, thus increasing the chances that entire ecosystem processes won’t be wiped out after a disturbance. If you have many different types of trees on your street, chances are they won’t be wiped out by a single bug infestation. In digital preservation, we also use diversity to reduce the chance of catastrophic failure, for example by storing copies of content on different types of storage media, or by using storage locations with different geographic threats, so that a single disaster can’t harm all copies.
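One way to picture this is to list, for each copy, the threats it is exposed to and then look for any threat that all copies share. The sketch below does that in Python. The copy names and threat labels are hypothetical, and treating a threat as a simple yes/no exposure is an assumption made to keep the example small.

```python
from __future__ import annotations


def shared_threats(copies: dict[str, set[str]]) -> set[str]:
    """Return the threats that every copy is exposed to.

    A non-empty result means a single event could harm all copies at once.
    """
    if not copies:
        return set()
    return set.intersection(*copies.values())


def surviving_copies(copies: dict[str, set[str]], threat: str) -> list[str]:
    """Names of the copies that are not exposed to the given threat."""
    return sorted(name for name, threats in copies.items()
                  if threat not in threats)


if __name__ == "__main__":
    # Hypothetical inventory: three copies with different media and locations.
    inventory = {
        "disk_copy_campus": {"campus_flood", "disk_firmware_bug"},
        "disk_copy_offsite": {"regional_earthquake", "disk_firmware_bug"},
        "tape_copy_remote": {"tape_degradation"},
    }
    print(shared_threats(inventory))
    print(surviving_copies(inventory, "disk_firmware_bug"))
```

If `shared_threats` returns anything, the copies are redundant but not diverse with respect to that threat, much like a street planted with a single species of tree.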

What are the things that can fail, or put another way, that could affect our ability to preserve digital content long-term? We might be tempted to throw our hands up and ask, “What can’t fail?” Although it’s true that there are many digital preservation risks, some are especially likely candidates. In the last decade progress has been made on understanding storage system failures and data corruption (if you want to learn more, see the references below this post). Through this research (and perhaps through our own experience) we know that latent sector errors (physical problems) are a likely storage component fault, followed to a lesser extent by silent data corruption (higher-level, usually software problems) and, to an even lesser extent, whole disk failures. From our own observations, another likely candidate for “failure” is organizational disruption – for example changes in finances, priorities and staffing. Once we’ve identified likely risks like these, we can think about the strategies that could mitigate them.

The table below shows examples of how we might use redundancy and/or diversity to mitigate some of the risks of data loss.

Risks of Data Loss	Redundancy & Diversity Strategies
Environmental factors (e.g. temperature, vibrations affecting multiple devices in same data center)	Replication to different data centers
Shared component faults (e.g. power connections, cooling, SCSI controllers)	Replication to different data centers or redundant components or software
Large-scale disasters (e.g. earthquakes)	Replication to different geographic regions
Malicious attacks (e.g. worms)	Distinct security zones
Human error (e.g. accidental deletions)	Different administrative control (e.g. under the management of a different group of system administrators)
Organizational faults (e.g. budget cuts)	Different organizational control

Of course each of these strategies comes at a cost; otherwise we would all simply implement all of them and declare victory against these risks. Instead, we need to make decisions on which strategies to put into place, based on our estimation of the likelihood of a risk occurring, its impact if it did occur, the cost of the strategy, and the resources we have available to us.
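The trade-off among likelihood, impact, cost and available resources can be sketched as a small ranking exercise. The Python below is only an illustration under stated assumptions: the `Strategy` fields, the `risk_reduction` estimate, and the greedy "most loss avoided per unit of cost" rule are mine, and any numbers you feed it would be your own estimates, not measured values. A greedy pass like this is a simple heuristic and will not always find the best possible combination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    likelihood: float      # estimated chance the risk occurs (0 to 1)
    impact: float          # estimated loss if it occurs, in arbitrary cost units
    risk_reduction: float  # fraction of that expected loss the strategy removes
    cost: float            # cost of the strategy, same units as impact (> 0)

    @property
    def expected_loss_avoided(self) -> float:
        return self.likelihood * self.impact * self.risk_reduction


def choose_strategies(strategies: list[Strategy], budget: float) -> list[Strategy]:
    """Greedily pick strategies by loss avoided per unit cost, within budget."""
    ranked = sorted(strategies,
                    key=lambda s: s.expected_loss_avoided / s.cost,
                    reverse=True)
    chosen = []
    remaining = budget
    for strategy in ranked:
        if strategy.cost <= remaining:
            chosen.append(strategy)
            remaining -= strategy.cost
    return chosen


if __name__ == "__main__":
    # Hypothetical estimates, for illustration only.
    options = [
        Strategy("second data center", 0.10, 1000, 0.9, 50),
        Strategy("different geographic region", 0.01, 1000, 0.9, 80),
        Strategy("separate admin group", 0.20, 500, 0.5, 20),
    ]
    for s in choose_strategies(options, budget=75):
        print(s.name, s.expected_loss_avoided)
```

The value of an exercise like this lies less in the output than in being forced to write the estimates down, which makes it easier to explain to decision makers why one strategy was chosen over another.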

I’d like to hear from you – are you using any redundancy or diversity strategies for digital preservation that aren’t covered in the table above? And do you know of additional concepts like redundancy and diversity that are used in other domains to mitigate risks and that might be relevant for digital preservation?

For More Reading…

Amir, Y., & Wool, A. (1996). Evaluating quorum systems over the internet. Paper presented at The twenty-sixth annual international symposium on fault-tolerant computing. doi: 10.1109/FTCS.1996.534591

Bairavasundaram, L., Arpaci-Dusseau, A., Arpaci-Dusseau, R., Goodson, G., & Schroeder, B. (2008). An analysis of data corruption in the storage stack. ACM Transactions on Storage, 4(3), 8.1-8.28. doi: 10.1145/1416944.1416947

Haeberlen, A., Mislove, A., & Druschel, P. (2005, May). Glacier: Highly durable, decentralized storage despite massive correlated failures. Paper presented at NSDI ’05: 2nd symposium on networked systems design & implementation, Boston, MA.

Jiang, W., Hu, C., Zhou, Y., & Kanevsky, A. (2008). Are disks the dominant contributor for storage failures?: A comprehensive study of storage subsystem failure characteristics. ACM Transactions on Storage (TOS), 4(3). doi: 10.1145/1416944.1416946

Pinheiro, E., Weber, W., & Barroso, L. (2007). Failure trends in a large disk drive population. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST ’07).

Shah, S. (2004). Disk drive vintage and its effect on reliability. In Reliability and Maintainability, 2004 Annual Symposium – RAMS (pp. 163-167).

One Comment

  1. Susan
    August 28, 2012 at 11:46 am

    Thank you so much for this post. I work with small organizations, and I have been trying to emphasize the importance of digital preservation, but your analogies convey the reasons for this approach in a non-techy fashion much better than I ever could. (The tree analogy is especially apt as people in my community are bemoaning the death of trees. In the 1920s, beautiful trees were planted, but usually all of the same species on the same street, so many are dying now all at the same time.)

    I also appreciate the very simple and straightforward chart on risks and strategies. Too often we overload people with info on why and how to do these things, they figure this is an approach only for the big guys, and throw in the towel.

    Your article is timely because I recently had a library contact me about a donor who wants to give them a big collection and money to digitize. He told the library they didn’t need money for digital preservation because “that’s easy now.” Now we have fodder from two respected sources (LC and Harvard) outlining the questions in a way that trustees and administration can follow.

    Alas, I don’t have any additional redundancy and diversity strategies to share with you. We are having enough trouble implementing just one or two that you have listed. I am curious to hear what others are doing, though.
