Gossiping About Digital Preservation

ANTI-ENTROPY by user 51pct on <a href="https://flic.kr/p/crq2Ef">Flickr</a>.

ANTI-ENTROPY by user 51pct on Flickr.

In September the Library held its annual Designing Storage Architectures for Digital Collections meeting. The meeting brings together technical experts from the computer storage industry with decision-makers from a wide range of organizations with digital preservation requirements to explore the issues and opportunities around the storage of digital information for the long-term. I always learn quite a bit during the meeting and more often than not encounter terms and phrases that I’m not familiar with.

One I found particularly interesting this time around was the term “anti-entropy.”  I’ve been familiar with the term “entropy” for a while, but I’d never heard “anti-entropy.” One definition of “entropy” is a “gradual decline into disorder.” So is “anti-entropy” a “gradual coming-together into order?” Turns out that the term has a long history in information science and is important to get an understanding of some very important digital preservation processes regarding file storage, file repair and fixity checking.

The “entropy” we’re talking about when we talk about “anti-entropy” might also be called “Shannon Entropy” after the legendary information scientist Claude Shannon. His ideas on entropy were elucidated in a 1948 paper called “A Mathematical Theory of Communication” (PDF), developed while he worked at Bell Labs. For Shannon, entropy was the measure of the unpredictability of information content. He wasn’t necessarily thinking about information in the same way that digital archivists think about information as bits, but the idea of the unpredictability of information content has great applicability to digital preservation work.

“Anti-entropy” represents the idea of the “noise” that begins to slip into information processes over time. It made sense that computer science would co-opt the term, and in that context “anti-entropy” has come to mean “comparing all the replicas of each piece of data that exist (or are supposed to) and updating each replica to the newest version.” In other words, what information scientists call “bit flips” or “bit rot” are examples of entropy in digital information files, and anti-entropy protocols (a subtype of “gossip” protocols) use methods to ensure that files are maintained in their desired state. This is an important concept to grasp when designing digital preservation systems that take advantage of multiple copies to ensure long-term preservability, LOCKSS being the most obvious example of this.

gossip_bench by user ricoslounge on Flickr.

gossip_bench by user ricoslounge on Flickr.

Anti-entropy and gossip protocols are the means to ensure the automated management of digital content that can take some of the human overhead out of the picture. Digital preservation systems invoke some form of content monitoring in order to do their job. Humans could do this monitoring, but as digital repositories scale up massively, the idea that humans can effectively monitor the digital information under their control with something approaching comprehensiveness is a fantasy. Thus, we’ve got to be able to invoke anti-entropy and gossip protocols to manage the data.

An excellent introduction to how gossip protocols work can be found in the paper “GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems.”  The authors note three key parameters to gossip protocols: monitoring, failure detection and consensus.  Not coincidentally, LOCKSS “consists of a large number of independent, low-cost, persistent Web caches that cooperate to detect and repair damage to their content by voting in “opinion polls” (PDF). In other words, gossip and anti-entropy.

I’ve only just encountered these terms, but they’ve been around for a long while.  David Rosenthal, the chief scientist of LOCKSS, has been thinking about digital preservation storage and sustainability for a long time and he has given a number of presentations at the LC storage meetings and the summer digital preservation meetings.

LOCKSS is the most prominent example in the digital preservation community on the exploitation of gossip protocols, but these protocols are widely used in distributed computing. If you really want to dive deep into the technology that underpins some of these systems, start reading about distributed hash tables, consistent hashing, versioning, vector clocks and quorum in addition to anti-entropy-based recovery. Good luck!

One of the more hilarious anti-entropy analogies was recently supplied by the Register, which suggested that a new tool that supports gossip protocols “acts like [a] depressed teenager to assure data reliability” and “constantly interrogates itself to make sure data is ok.”

You learn something new every day.

We Want You Just the Way You Are: The What, Why and When of Fixity

Fixity, the property of a digital file or object being fixed or unchanged, is a cornerstone of digital preservation. Fixity information, from simple file counts or file size values to more precise checksums and cryptographic hashes, is data used to verify whether an object has been altered or degraded. Many in the preservation community know […]

QCTools: Open Source Toolset to Bring Quality Control for Video within Reach

In this interview, part of the Insights Interview series, FADGI talks with Dave Rice and Devon Landes about the QCTools project. In a previous blog post, I interviewed Hannah Frost and Jenny Brice about the AV Artifact Atlas, one of the components of Quality Control Tools for Video Preservation, an NEH-funded project which seeks to […]

Beyond Us and Them: Designing Storage Architectures for Digital Collections 2014

The following post was authored by Erin Engle, Michelle Gallinger, Butch Lazorchak, Jane Mandelbaum and Trevor Owens from the Library of Congress. The Library of Congress held the 10th annual Designing Storage Architectures for Digital Collections meeting September 22-23, 2014. This meeting is an annual opportunity for invited technical industry experts, IT  professionals, digital collections […]

Emerging Collaborations for Accessing and Preserving Email

The following is a guest post by Chris Prom, Assistant University Archivist and Professor, University of Illinois at Urbana-Champaign. I’ll never forget one lesson from my historical methods class at Marquette University.  Ronald Zupko–famous for his lecture about the bubonic plague and a natural showman–was expounding on what it means to interrogate primary sources–to cast […]

Upgrading Image Thumbnails… Or How to Fill a Large Display Without Your Content Team Quitting

The following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library. Preservation is usually about maintaining as much information as possible for the future but access requires us to balance factors like image quality against file size and design […]

Untangling the Knot of CAD Preservation

At the 2014 Society of American Archivists meeting, the CAD/BIM Taskforce held a session titled “Frameworks for the Discussion of Architectural Digital Data” to consider the daunting matter of archiving computer-aided design and Building Information Modelling files. This was the latest evidence that — despite some progress in standards and file exchange — archivists and the […]

Emulation as a Service (EaaS) at Yale University Library

The following is a guest post from Euan Cochrane, ‎Digital Preservation Manager at Yale University Library. This piece continues and extends exploration of the potential of emulation as a service and virtualization platforms. Increasingly, the intellectual productivity of scholars involves the creation and development of software and software-dependent content. For universities to act as responsible stewards […]

Curating Extragalactic Distances: An interview with Karl Nilsen & Robin Dasler

While a fair amount of digital preservation focuses on objects that have clear corollaries to objects from our analog world (still and moving images and documents for example), there are a range of forms that are basically natively digital. Completely native digital forms, like database-driven web applications, introduce a variety of challenges for long-term preservation […]