(NOTE: This is an updated article from our digitalpreservation.gov website originally written by Mike Ashenfelder.)
As we discussed in an earlier post, the landscape is changing for the better in terms of the appearance of open source tools to support digital preservation and access.
NDIIPP has contributed by developing tools to transfer large quantities of digital data, because there’s a great need for reliable tools to get important digital content from the “there” where it’s created to the “here” of the stewardship organizations that will care for it for the long term.
These tools have enabled the Library of Congress to transfer and capture a significant amount of digital data over the past few years, and with each digital-collection transfer, the Library refines its tools, simplifying and improving them in iterative stages.
The Library’s transfer workflow starts when a sender of a digital collection prepares for a transfer by packaging a collection in a digital “bag” and making it accessible for the Library to download.
“Bags” are based on a concept developed at the University of Tsukuba in Japan called “enclose and deposit” (PDF). “Enclose and deposit” became “bag it and tag it” in black-humored hands, and the Library has worked with partners to codify this method of structuring transferable objects in the BagIt specification (PDF).
BagIt describes a method where a digital collection is packed into a directory (the bag) along with a machine-readable manifest file (the tag) that lists the contents. Bags have a sparse structure that can accommodate any institutional data architecture or file format. Bags can hold documents, pictures, music, movies and even other folders. Anything digital can fit into a bag.
The Bagit: Transferring Content for Digital Preservation video is a great place to start to understand how digital content makes its way to the Library.
A bag is like a folder or directory on a computer. It essentially consists of three elements: a bag declaration text file, which is like a seal of authenticity; a text-file manifest listing the files in the collection; and a subdirectory – usually titled “data” – filled with the digital content. The manifest is machine readable for automated data ingest. The receiving computer reads the manifest and computes checksums on the contents; if the computed checksums match those recorded in the manifest, the transfer is successful.
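The three elements and the checksum comparison described above can be sketched in a few lines of Python. This is only an illustration, not the Library’s own tooling: the function names (make_bag, verify_bag, file_digest) are hypothetical, and the choice of SHA-256 and the version string written to the declaration are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Return the hex checksum of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir: Path, files: dict[str, bytes], algorithm: str = "sha256") -> None:
    """Write payload files under data/ plus a declaration and a manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name in sorted(files):
        target = data_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(files[name])
        # One manifest line per payload file: checksum, then relative path.
        manifest_lines.append(f"{file_digest(target, algorithm)}  data/{name}")
    # Bag declaration; the version string here is an assumption for the sketch.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / f"manifest-{algorithm}.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path, algorithm: str = "sha256") -> list[str]:
    """Return the payload paths that are missing or whose checksum differs."""
    failures = []
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file() or file_digest(target, algorithm) != expected:
            failures.append(rel_path)
    return failures
```

In this sketch an empty list from verify_bag corresponds to a successful transfer; any path it returns points to a file that was lost or altered along the way.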
A bag can also contain an optional text file, titled “bag-info.txt,” that contains a small amount of administrative metadata, such as contact information for the collection owner and a brief description of the collection. Users can include much more metadata about the collection, but the Library recommends storing it in the “data” directory with the rest of the collection in order to keep the bag root directory uncluttered. Users can note in the “bag-info.txt” file that additional metadata exists and resides in the “data” directory.
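As a rough illustration of how small this file can be, the sketch below renders a “bag-info.txt” body as one “Label: value” line per field. The helper name render_bag_info is hypothetical, and the organization, contact details and metadata path are invented placeholder values; the labels shown are among the reserved field names in the BagIt specification.

```python
def render_bag_info(fields: dict[str, str]) -> str:
    """Render metadata as one "Label: value" line per field."""
    lines = []
    for label, value in fields.items():
        # Keep the sketch simple: reject labels or values that would
        # break the one-line-per-field layout.
        if ":" in label or "\n" in label or "\n" in value:
            raise ValueError(f"unsupported field: {label!r}")
        lines.append(f"{label}: {value}")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
example_info = render_bag_info({
    "Source-Organization": "Example University Library",
    "Contact-Name": "Jane Archivist",
    "Contact-Email": "archivist@example.org",
    "External-Description": (
        "Sample photograph collection. "
        "Additional metadata is stored in data/metadata/."
    ),
})
```

The last field shows one way to note that fuller metadata lives in the “data” directory rather than in the bag root.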
A bag variation, called a “holey bag,” is gaining wider acceptance because of its flexibility. A holey bag has the standard bag structure but its “data” directory is empty.
The holey bag contains an additional text file titled “fetch.txt” at the root level that lists the Uniform Resource Locators (URLs) of the files to be fetched (so-called “holes” in the digital collection to be filled in). A script consults the “fetch.txt” file, follows the URLs, downloads the files and aggregates them into the local “data” directory within the bag. The sender’s source files do not need to reside in the same directory or on the same server; they can be retrieved from many different sources. A holey bag becomes complete after the digital collection is entirely downloaded and its manifest file is verified.
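The kind of script described above might look roughly like the following Python sketch. It assumes each line of “fetch.txt” holds a URL, a length (or “-” when unknown) and a path under “data”, separated by whitespace, as in the BagIt specification; the function names parse_fetch and fill_holes are hypothetical, and the sketch does no retrying or length checking.

```python
import shutil
import urllib.request
from pathlib import Path


def parse_fetch(fetch_text: str) -> list[tuple[str, str, str]]:
    """Split each fetch.txt line into (url, length, path-within-bag)."""
    entries = []
    for line in fetch_text.splitlines():
        if not line.strip():
            continue
        url, length, rel_path = line.split(maxsplit=2)
        entries.append((url, length, rel_path))
    return entries


def fill_holes(bag_dir: Path) -> list[str]:
    """Download every file listed in fetch.txt into the bag's data/ directory."""
    bag_root = bag_dir.resolve()
    data_root = bag_root / "data"
    fetched = []
    fetch_text = (bag_root / "fetch.txt").read_text(encoding="utf-8")
    for url, _length, rel_path in parse_fetch(fetch_text):
        target = (bag_root / rel_path).resolve()
        # Refuse any path that would land outside the bag's data/ directory.
        if data_root not in target.parents:
            raise ValueError(f"path outside data/: {rel_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        # Each URL may point to a different server or directory.
        with urllib.request.urlopen(url) as response, target.open("wb") as out:
            shutil.copyfileobj(response, out)
        fetched.append(rel_path)
    return fetched
```

Once every hole is filled, the bag would still need its manifest checked against the downloaded files before it could be treated as complete.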
An example of a “holey bag” workflow was included in last year’s post on capturing the “end of term” of the George W. Bush presidential administration.
The Library has developed an open source tool based on the BagIt specification called the BagIt Library (BIL), a Java library for the Unix environment. BIL has a range of features, including commands for making and manipulating bags, and the ability to manipulate different aspects of the transfer, validation and verification of bags. The Library has also released a “friendlier” desktop version called Bagger.
Are you using the BagIt specification or any of the Library’s tools to support it? We’d love to hear from you on what you’ve discovered about them.