Gossiping About Digital Preservation

ANTI-ENTROPY by user 51pct on <a href="https://flic.kr/p/crq2Ef">Flickr</a>.

In September the Library held its annual Designing Storage Architectures for Digital Collections meeting. The meeting brings technical experts from the computer storage industry together with decision-makers from a wide range of organizations with digital preservation requirements to explore the issues and opportunities around storing digital information for the long term. I always learn quite a bit during the meeting and more often than not encounter terms and phrases that I’m not familiar with.

One I found particularly interesting this time around was the term “anti-entropy.” I’ve been familiar with the term “entropy” for a while, but I’d never heard “anti-entropy.” One definition of “entropy” is a “gradual decline into disorder.” So is “anti-entropy” a “gradual coming-together into order”? It turns out that the term has a long history in information science and is key to understanding some very important digital preservation processes: file storage, file repair and fixity checking.

The “entropy” we’re talking about when we talk about “anti-entropy” might also be called “Shannon Entropy” after the legendary information scientist Claude Shannon. His ideas on entropy were elucidated in a 1948 paper called “A Mathematical Theory of Communication” (PDF), developed while he worked at Bell Labs. For Shannon, entropy was the measure of the unpredictability of information content. He wasn’t necessarily thinking about information in the same way that digital archivists think about information as bits, but the idea of the unpredictability of information content has great applicability to digital preservation work.
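To make "the unpredictability of information content" a little more concrete, here is a minimal Python sketch of the standard Shannon entropy calculation applied to the bytes of a file. The function name `shannon_entropy` and the two sample inputs are my own illustrations, not anything taken from Shannon's paper.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string, in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # H = sum(p * log2(1 / p)) over every byte value that actually occurs,
    # where p = n / total is the observed frequency of that byte value.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


predictable = b"A" * 64        # one symbol repeated: nothing unpredictable
mixed = bytes(range(256))      # every possible byte value exactly once

print(shannon_entropy(predictable))  # 0.0
print(shannon_entropy(mixed))        # 8.0
```

A run of identical bytes scores zero because each next byte is completely predictable, while a sequence that uses every byte value equally often scores the maximum of eight bits per byte.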

“Entropy” represents the idea of the “noise” that begins to slip into information processes over time. It made sense that computer science would co-opt the term, and in that context “anti-entropy” has come to mean “comparing all the replicas of each piece of data that exist (or are supposed to) and updating each replica to the newest version.” In other words, what information scientists call “bit flips” or “bit rot” are examples of entropy in digital files, and anti-entropy protocols (a subtype of “gossip” protocols) use methods to ensure that files are maintained in their desired state. This is an important concept to grasp when designing digital preservation systems that take advantage of multiple copies to ensure long-term preservability, LOCKSS being the most obvious example of this.
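The "compare all the replicas and update each to the newest version" definition can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each replica is an in-memory dictionary, every record carries an integer version number where higher means newer, and the names `Record` and `anti_entropy_pass` are hypothetical. Real systems have to cope with networks, conflicting versions and damaged copies as well.

```python
from dataclasses import dataclass


@dataclass
class Record:
    version: int      # higher means newer (assumed versioning scheme)
    payload: bytes


def anti_entropy_pass(replicas: list[dict[str, Record]]) -> int:
    """Bring every replica up to the newest version of every key.

    Returns the number of records that were copied or overwritten.
    """
    repairs = 0
    # Every key that exists on at least one replica.
    all_keys = set().union(*(r.keys() for r in replicas))
    for key in all_keys:
        # Find the newest version of this key held by any replica.
        newest = max(
            (r[key] for r in replicas if key in r),
            key=lambda rec: rec.version,
        )
        for r in replicas:
            current = r.get(key)
            if current is None or current.version < newest.version:
                r[key] = Record(newest.version, newest.payload)
                repairs += 1
    return repairs


a = {"report.pdf": Record(2, b"v2"), "photo.tif": Record(1, b"v1")}
b = {"report.pdf": Record(1, b"v1")}   # stale copy, missing one file
c = {}                                 # empty replica

print(anti_entropy_pass([a, b, c]))    # 4
print(a == b == c)                     # True
```

After one pass all three replicas hold the same records. A second pass would report zero repairs, which is the steady state an anti-entropy process is trying to maintain.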

gossip_bench by user ricoslounge on Flickr.

Anti-entropy and gossip protocols offer a way to automate the management of digital content and take some of the human overhead out of the picture. Digital preservation systems rely on some form of content monitoring in order to do their job. Humans could do this monitoring, but as digital repositories scale up massively, the idea that humans can monitor the digital information under their control with anything approaching comprehensiveness is a fantasy. Thus, we’ve got to be able to invoke anti-entropy and gossip protocols to manage the data.

An excellent introduction to how gossip protocols work can be found in the paper “GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems.” The authors note three key parameters of gossip protocols: monitoring, failure detection and consensus. Not coincidentally, LOCKSS “consists of a large number of independent, low-cost, persistent Web caches that cooperate to detect and repair damage to their content by voting in ‘opinion polls’” (PDF). In other words, gossip and anti-entropy.
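The idea of caches "voting" to detect and repair damage can be illustrated with a toy poll in Python. To be clear, this is not the LOCKSS protocol, which is considerably more sophisticated. It is a minimal sketch that assumes each peer holds a full copy of one file, uses SHA-256 digests as the votes, and treats a strict majority as the winner. The peer names and the `poll_and_repair` function are hypothetical.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Return the SHA-256 hex digest used as a peer's vote."""
    return hashlib.sha256(content).hexdigest()


def poll_and_repair(copies: dict[str, bytes]) -> list[str]:
    """Hold a hash poll among peers and repair those in the minority.

    `copies` maps a peer name to that peer's copy of one file. Returns the
    sorted names of the peers that were repaired. If no digest wins a strict
    majority, nothing is changed and an empty list is returned.
    """
    if not copies:
        return []
    votes = Counter(digest(content) for content in copies.values())
    winning_digest, count = votes.most_common(1)[0]
    if count * 2 <= len(copies):
        return []  # no strict majority: leave it for a human to look at
    good = next(c for c in copies.values() if digest(c) == winning_digest)
    repaired = []
    for peer, content in copies.items():
        if digest(content) != winning_digest:
            copies[peer] = good   # overwrite the damaged copy
            repaired.append(peer)
    return sorted(repaired)


original = b"digital preservation"
damaged = bytearray(original)
damaged[0] ^= 0x01                # flip one bit to simulate bit rot

peers = {"cache-a": original, "cache-b": bytes(damaged), "cache-c": original}
print(poll_and_repair(peers))     # ['cache-b']
```

The sketch refuses to act when the vote is tied, on the assumption that overwriting content without a clear majority is riskier than leaving it alone.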

I’ve only just encountered these terms, but they’ve been around for a long while. David Rosenthal, the chief scientist of LOCKSS, has been thinking about digital preservation storage and sustainability for many years, and he has given a number of presentations at the LC storage meetings and the summer digital preservation meetings.

LOCKSS is the most prominent example in the digital preservation community of the use of gossip protocols, but these protocols are widely used in distributed computing. If you really want to dive deep into the technology that underpins some of these systems, start reading about distributed hash tables, consistent hashing, versioning, vector clocks and quorum in addition to anti-entropy-based recovery. Good luck!

One of the more hilarious anti-entropy analogies was recently supplied by the Register, which suggested that a new tool that supports gossip protocols “acts like [a] depressed teenager to assure data reliability” and “constantly interrogates itself to make sure data is ok.”

You learn something new every day.
