Planning for Preservation Storage

Every year the Library of Congress hosts a meeting on Designing Storage Architectures for Digital Collections, aka the Preservation Storage Meeting.  The 2013 meeting was held September 23-24, and featured an impressive array of presentations and discussions.

The theme this year was standards. The term applies not just to media or to hardware, but to interfaces as well. In preservation, it is the interfaces – the software and operating system mechanisms through which users and tools interact with stored files – that disappear the most quickly, or that change the least to keep up with changing needs.  The quote of the meeting for me was from Henry Newman of Instrumental, Inc.:  “These are not new problems, only new engineers solving old problems.”

Designing Storage Architectures for Digital Collections 2013 Panel on Developments in Storage Media, photo by Michael Ashenfelder

Library of Congress staff kicked off the meeting by discussing some of the Library’s infrastructure and needs. The Library is recording its files in its inventory service, which includes fixities for future auditing.  We have a wide range of needs, though, which vary with the type of content.  The data center where preservation and access copies of text and images are primarily stored manages millions of files in 10s of petabytes. The data center where preservation and access copies of video and audio are primarily stored manages approximately 700k files in 10s of PB. The different scales of file numbers and sizes mean different requirements for the hardware needed to stage and deliver this content. In terms of the Library’s storage purchases, 30% is purely for capacity expansion, and 70% is for the ongoing refresh of technology, which often also includes adding capacity.
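To make the idea of recording fixities for later auditing concrete, here is a minimal sketch in Python. It is not the Library's inventory service. The function names (compute_fixity, build_inventory, audit) and the choice of SHA-256 are assumptions made purely for illustration. The sketch records a checksum for every file under a directory, then re-computes the checksums later and reports any file that is missing or has changed.

```python
import hashlib
from pathlib import Path


def compute_fixity(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(root):
    """Map each file's path (relative to root) to its fixity value.

    Hypothetical stand-in for an inventory service: a plain dictionary.
    """
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): compute_fixity(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root, inventory):
    """Return the relative paths that are missing or no longer match."""
    root = Path(root)
    failures = []
    for rel_path, expected in inventory.items():
        target = root / rel_path
        if not target.is_file() or compute_fixity(target) != expected:
            failures.append(rel_path)
    return failures
```

Calling build_inventory on a directory at ingest and audit on the same directory at a later date returns an empty list when nothing has changed. Reading in chunks matters at the scales described above, where a single video file can be far larger than available memory.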

Tape technologies are always a big topic at this meeting. T10K tape migration is ever ongoing. Interfaces to tape environments reach end-of-life and are unsupported within 5-10 years of their introduction, according to Dave Anderson of Seagate.  According to Gary Decad of IBM, rates of areal density increases are slowing down, and the annual rate of petabytes of storage manufacturing is no longer increasing.

Tape has by far the highest MSI (millions of square inches) of any storage medium in production use. Tape, hard disk drive, and solid state storage are surface-area intensive technologies.  Many meeting participants believe that solid state improves hard disk drive technology. Less obvious for preservation concerns is the impact of NAND flash storage on the use of hard drive storage. Replacing enterprise hard disk drives would be exorbitantly expensive, and is not happening any time soon.

Across the board there must be technologies licensed to multiple manufacturers and suppliers for stability in the marketplace. But it is extraordinarily expensive to build fabrication facilities for newer technologies such as NAND flash storage. The same is true for LTO tape facilities, not as much for the expense of building the facilities but for the lack of profitability in manufacturing.  After the presentations at this meeting I am more familiar than I was before with the licensing of storage standards to manufacturing companies, and with the monopolies that exist.

The panel on “The Cloud” engendered some of the liveliest discussion. Three quotes stood out. The first, from Andy Maltz of the Academy of Motion Picture Arts and Sciences: “Clouds are nice but sometimes it rains.” And from Fenella France at the Library of Congress: “I have conversations with people who say ‘It’s in the cloud.’ And where is that, I ask. The cloud is still on physical servers somewhere.” And Mike Thuman from Tessella, referencing his slides, said “Those bidirectional arrows between the cloud and local? They’re not based on Kryder’s Law or Moore’s Law, it’s based on Murphy’s Law. You will need to bring data back.”

David Rosenthal of Stanford University pointed out some key topics:

  • When is the cloud better than doing it yourself? When you have spiky demand and not steady use;
  • The use of the cloud is the “Drug Dealer’s algorithm”: The first one is free, and it becomes hard to leave because of the download/exit/migration charges;
  • The cloud is not a technology, it’s a business model. The technology is something you can use yourself.

Jeff Barr of Amazon commented, “I guess I am the official market-dominating drug dealer.” But Amazon very much wants to know from the community what it is looking for in a preservation action reporting system for files stored in the AWS environment.

The session on standards ranged from an introduction to NISO and the standards development process (with a wonderful slide deck based on clip art), to identifiers and file systems, and the specifics of an emerging standard: AXF.

A relatively new topic for this year’s meeting was the use of open source solutions, such as the range of component tools in OpenStack. HTTP-based REST is the up-and-coming interface for files – the technology is moving from file system-based interfaces to object-based interfaces. Everything now has a custom storage management layer from the vendor.

Other forms of media were also discussed. Two of the most innovative are a laser-engraved stainless steel tape in a hermetically sealed cartridge, and a visually recorded metal alloy medium.  Optical media is also not dead.  Ken Wood from Hitachi pointed out that 30-year-old commercial audio CDs are still supported in the hardware marketplace, and that CDs still play. Technically that has just as much to do with the software interface with error correction still being in play as the hardware still being supported. But mechanical compact disc players and storage are disappearing with the rise of mobile devices and thin laptops which have no optical players or hard disks.

Presentations by representatives of the digital curation and preservation community always make up a large percentage of this meeting. Projects such as the Data Conservancy and efforts at the Los Alamos National Labs, the National Endowment for the Humanities, and the Library of Congress were featured. It was noted more than once that content and data creators still do not often feel that preservation is part of their responsibility. The key quote was “You can spend more time figuring out what to save than actually saving it. The cost of curation to assess for retention can be huge.”

You should really check out the agenda and presentations, which are available online.
