From There to Here, from Here to There, Digital Content is Everywhere!

(NOTE: This is an updated article from our digitalpreservation.gov website originally written by Mike Ashenfelder.)

As we discussed in an earlier post, the landscape is changing for the better as more open source tools appear to support digital preservation and access.

NDIIPP has contributed by developing tools to transfer large quantities of digital data, because there’s a great need for reliable tools to get important digital content from the “there” where it’s created to the “here” of the stewardship organizations that will care for it for the long term.

These tools have enabled the Library of Congress to transfer and capture a significant amount of digital data over the past few years, and with each digital-collection transfer, the Library refines its tools, simplifying and improving them in iterative stages.

The Library’s transfer workflow starts when a sender of a digital collection prepares for a transfer by packaging a collection in a digital “bag” and making it accessible for the Library to download.

“Bags” are based on a concept developed at the University of Tsukuba in Japan called “enclose and deposit” (PDF). “Enclose and deposit” became “bag it and tag it” in black-humored hands, and the Library has worked with partners to codify this method of structuring transferable objects in the BagIt specification (PDF).

BagIt describes a method where a digital collection is packed into a directory (the bag) along with a machine-readable manifest file (the tag) that lists the contents. Bags have a sparse structure that can accommodate any institutional data architecture or file format. Bags can hold documents, pictures, music, movies and even other folders. Anything digital can fit into a bag.

Still from the Bagit: Transferring Content for Digital Preservation video.

The Bagit: Transferring Content for Digital Preservation video is a great place to start to understand how digital content makes its way to the Library.

A bag is like a folder or directory on a computer. It is essentially composed of three elements: a bag declaration text file, which is like a seal of authenticity; a text-file manifest listing the files in the collection along with their checksums; and a subdirectory – usually titled “data” – filled with the digital content. The manifest is machine readable for automated data ingest. The receiving computer analyzes the manifest and runs checksums on the contents; if the checksums match, the transfer is successful.
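To make the three elements concrete, here is a minimal Python sketch that writes a small bag and then checks it the way a receiver might. It is only an illustration, not the Library's tooling: the function names (make_bag, verify_bag) are hypothetical, and the declared version number and the choice of SHA-256 for the manifest are assumptions made for the example.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write a bag declaration, the payload under data/, and a manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    # Bag declaration. The version number is an assumption for this sketch.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    lines = []
    for name, content in sorted(payload.items()):
        target = data_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        # One manifest line per file: checksum, then path relative to the bag.
        lines.append(f"{file_sha256(target)}  data/{name}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


def verify_bag(bag_dir: Path) -> list[str]:
    """Return manifest paths that are missing or whose checksum differs."""
    failures = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file() or file_sha256(target) != expected:
            failures.append(rel_path)
    return failures
```

An empty list from verify_bag corresponds to the "checksums match" case described above; any path it returns points to a file that was lost or altered in transit.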

A bag can also contain an optional text file, titled “bag-info.txt,” that contains a small amount of administrative metadata, such as contact information for the collection owner and a brief description of the collection. Users can include much more metadata about the collection, but the Library recommends storing it in the “data” directory with the rest of the collection in order to keep the bag root directory uncluttered. Users can note in the “bag-info.txt” file that additional metadata exists and resides in the “data” directory.
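As a rough illustration, the “bag-info.txt” file can be thought of as a short list of “Label: value” lines. The Python sketch below writes and reads such a file. The helper names are hypothetical, the organization and contact values are made up for the example, and the sketch assumes each value fits on a single line.

```python
from __future__ import annotations

from pathlib import Path


def write_bag_info(bag_dir: Path, fields: dict[str, str]) -> Path:
    """Write bag-info.txt with one "Label: value" line per field."""
    info_path = bag_dir / "bag-info.txt"
    lines = [f"{label}: {value}\n" for label, value in fields.items()]
    info_path.write_text("".join(lines), encoding="utf-8")
    return info_path


def read_bag_info(bag_dir: Path) -> dict[str, str]:
    """Parse bag-info.txt back into a dict (single-line values only)."""
    fields = {}
    text = (bag_dir / "bag-info.txt").read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, so values may contain colons.
        label, _, value = line.partition(":")
        fields[label.strip()] = value.strip()
    return fields


# Hypothetical example values; none of these come from a real collection.
EXAMPLE_FIELDS = {
    "Source-Organization": "Example State Archives",
    "Contact-Name": "Jane Archivist",
    "Contact-Email": "jane@example.org",
    "External-Description": "Sample records; fuller metadata is in data/metadata/",
}
```

Calling write_bag_info(bag_dir, EXAMPLE_FIELDS) on an existing bag directory would leave the root holding only a short tag file, while the description line tells the receiver where the richer metadata lives.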

A bag variation, called a “holey bag,” is gaining wider acceptance because of its flexibility. A holey bag has the standard bag structure but its “data” directory is empty.

Basic contents of a digital bag.

The holey bag contains an additional text file titled “fetch.txt” at the root level that lists the Uniform Resource Locators (URLs) of the files to be fetched (so-called “holes” in the digital collection to be filled in). A script consults the “fetch.txt” file, follows the URLs, downloads the files and aggregates them into the local “data” directory within the bag. The sender’s source files do not need to reside in the same directory or on the same server; they can be retrieved from many different sources. A holey bag becomes complete after the digital collection is entirely downloaded and its manifest file is verified.
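The fetch step described above can be sketched in a few lines of Python. This is a simplified illustration rather than the Library's script: it assumes each “fetch.txt” line holds a URL, a length (or “-” when unknown) and a destination path separated by whitespace, and the function names are hypothetical. It also refuses any destination outside the “data” directory, a precaution added for this sketch.

```python
from __future__ import annotations

import shutil
import urllib.request
from pathlib import Path


def parse_fetch(text: str) -> list[tuple[str, str, str]]:
    """Split each fetch.txt line into (url, length, path within the bag)."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        url, length, rel_path = line.split(maxsplit=2)
        entries.append((url, length, rel_path))
    return entries


def fill_holes(bag_dir: Path) -> list[str]:
    """Download every file listed in fetch.txt into the bag's data directory."""
    fetched = []
    fetch_text = (bag_dir / "fetch.txt").read_text(encoding="utf-8")
    for url, _length, rel_path in parse_fetch(fetch_text):
        # Guard added for this sketch: only write inside data/.
        if not rel_path.startswith("data/") or ".." in Path(rel_path).parts:
            raise ValueError(f"refusing to write outside data/: {rel_path}")
        target = bag_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        # Stream the response to disk so large files are not held in memory.
        with urllib.request.urlopen(url) as response, target.open("wb") as out:
            shutil.copyfileobj(response, out)
        fetched.append(rel_path)
    return fetched
```

Because each line carries its own URL, the sources can sit on different servers. Once fill_holes returns, the bag still needs its manifest checked before it counts as complete.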

An example of a “holey bag” workflow was included in last year’s post on capturing the “end of term” of the George W. Bush presidential administration.

The Library has developed an open source tool based on the BagIt specification called the BagIt Library (BIL), a Java library for the Unix environment. BIL has a range of features, including commands for making and manipulating bags, and the ability to manipulate different aspects of the transfer, validation and verification of bags. The Library has also released a “friendlier” desktop version called Bagger.

Are you using the BagIt specification or any of the Library’s tools to support it? We’d love to hear from you about what you’ve discovered.

3 Comments

  1. Commercial Architects
    January 3, 2012 at 4:06 pm

    Great explanation of Bagit. I have been using online libraries for several years now and had never read of an explanation for “how does it get there”. Thanks very much.

  2. Walker Sampson
    January 3, 2012 at 4:34 pm

    We at the Mississippi Department of Archives and History have been looking at BagIt as a packaging/transfer format for electronic government records for a few months now. We’d like to, over time, have state agencies regularly transfer their records series as bags.

    To that end we are planning to rely on Bagger quite a bit as a user-friendly front-end for non-technical persons.

    A draft page: http://mdah.state.ms.us/arrec/digital_archives/bagger/

    Thanks for your work on the specification and library!

  3. Butch Lazorchak
    January 3, 2012 at 4:39 pm

    Great, thanks for the link and keep us posted on your progress!

    The NDIIPP-funded GeoMAPP project took a good look at the Bagit tools and published both a Bagit User Guide (http://www.geomapp.net/docs/Using_BagIt_ver2_geomapp_FINAL_20110321.pdf) and a series of video tutorials.

    You can find them all at http://www.geomapp.net/publications_categories.htm.
