Some weeks ago I gave a presentation that I jokingly titled “The Challenges of Preserving Every Digital Format on the Face of the Planet.”
Except it’s not really a joke.
We often have little or no control over what comes into the Library of Congress Digital Collections, and we manage and preserve a wide variety of formats. One collection brings in TIFF, JPEG, JPEG2000, and XML. Another brings in MPEG-4, MP3, BWF, AVI, and a wide variety of specialized commercial media formats. Another brings in JPEG, PDF, XML and a variety of metadata formats. One is all JSON. One is every flavor of GIS data. The Web Archives include every format which has ever appeared online. And yet another collection has, in 18 months, included almost 50 different file extensions (but not that many actual file types) with a huge number of metadata variations.
So how are we making this easier for the Library of Congress to manage?
On the technology side, the Library is expanding its preservation infrastructure. The Library jointly developed the BagIt transfer specification to reduce the risk of file corruption when files move between and within organizations. The Library inventories all incoming files, and is inventorying all digital content so we know what formats we have. We maintain multiple copies of files on servers and on tape, in geographically distributed locations. We’re also extending the use of content characterization tools in our internal workflows.
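To make the inventory idea concrete, here is a minimal sketch of what a format inventory with fixity checksums might look like. It is not the Library's actual tooling: the function names `inventory` and `sha256_of` and the record fields are hypothetical, and it uses only the Python standard library. It walks a directory, counts file extensions, and records a SHA-256 checksum for each file, similar in spirit to the checksum manifests a BagIt bag carries.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory(root):
    """Walk `root` and return (records, extension_counts).

    Hypothetical helper: one record per file, plus a tally of
    lower-cased file extensions.
    """
    root = Path(root)
    records = []
    extension_counts = Counter()
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # The extension is only a hint about the real format.
        extension = path.suffix.lower() or "(none)"
        extension_counts[extension] += 1
        records.append({
            "path": path.relative_to(root).as_posix(),
            "extension": extension,
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return records, extension_counts
```

A tally like this only counts extensions, and as noted above, 50 extensions does not mean 50 actual file types. Identifying the real format of each file is the job of content characterization tools, which look inside the file rather than at its name.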
The Library is better documenting its internal processes for making digital preservation decisions. As part of its decision-guiding resources, the Library has documented sustainability factors for file formats. For cases where we do have control over what comes in, we have a “Best Edition” Preferred Formats statement (PDF), which is currently being updated and expanded. And the Library is preparing to develop Format Preservation Action Plans.
The Library cannot do it all alone, so it is also part of several preservation partnerships, including the National Digital Stewardship Alliance and the International Internet Preservation Consortium. The Library has supported the development of the soon-to-launch Unified Digital Format Registry for the larger community.
What are your organizations doing to handle the onslaught of file formats?