The following is a guest post by Laura Graham, a Digital Media Project Coordinator at the Library.
In late 2008, the Library of Congress, the California Digital Library, the University of North Texas, the Internet Archive and the Government Printing Office began the first collaborative project to capture and archive United States government web sites representing the “end of term” of the George W. Bush presidential administration.
The partners planned, strategized, developed tools to support the work and settled on a division of responsibilities. Months later, when the crawling of content was complete, there were 5.7 terabytes of data at CDL, 1 at UNT and 9.1 at the Internet Archive, for a total of 15.8 terabytes.
Selecting, crawling and archiving the content was quite an accomplishment! But how to share this content? How to make sure that each partner got a copy of the content it wanted? We needed a simple, straightforward method that was, above all, easy to track.
In 2008, the Library had set up a central transfer server for receiving National Digital Information Infrastructure and Preservation Program (NDIIPP) content. This server was for “pull” transfers by the Library from other institutions. We had also set up an rsync server and made Web archive content available for another institution, the University of Maryland, to “pull” from us.
So, we had experience moving content in both directions. We also had an established transfer history with the Internet Archive, which had the bulk of the “End of Term” content.
For these reasons, it made sense for the Library to set up a round-robin transfer workflow: pulling in content from each partner, and as it came in, making it available for transfer to other partners.
The workflow for the transfers worked like this: the Library pulled content from a partner to our transfer server, verified it to make sure that files were not missing or corrupted, then copied the content to a long-term storage server and to the rsync server for the other partners to transfer, or “pull,” for themselves. Once each partner had pulled content from the rsync server, we deleted it to make room for the next transfer.
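To give a concrete picture of the verification step: a bag carries a payload manifest listing a checksum for every file it holds, so checking a transfer means recomputing those checksums. The Python sketch below is ours, for illustration only (the actual work was done with the BagIt Library, described later in this post); it assumes a standard MD5 manifest, “manifest-md5.txt,” whose lines pair a checksum with a file path.

    import hashlib
    from pathlib import Path

    def md5_of(path, chunk_size=1 << 20):
        # Stream the file in 1 MB chunks; crawl files can be multi-gigabyte.
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_bag(bag_dir):
        # Compare every file listed in the payload manifest against disk.
        bag = Path(bag_dir)
        problems = []
        for line in (bag / "manifest-md5.txt").read_text().splitlines():
            expected, relpath = line.split(maxsplit=1)
            payload_file = bag / relpath
            if not payload_file.is_file():
                problems.append("missing: " + relpath)
            elif md5_of(payload_file) != expected:
                problems.append("corrupted: " + relpath)
        return problems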
The transfer of content began in May 2009, and the process was complete by mid-2010. All transfers were done over the network using Internet2, and the collaborative effort was an excellent opportunity to test new transfer specifications and tools.
All the partners organized their content for transfer in “bags.” Bags are based on BagIt, a specification for packaging digital content for transfer and storage that was developed collaboratively by NDIIPP partners.
A complete bag holds all of its content. A “holey” bag is a bag empty of its content (it is full of holes!), but it does contain one crucial file, “fetch.txt,” which lists the URL location of each of the files that make up the complete bag. The complete bag is the target of the transfer, and the holey bag is the means of transferring it.
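Each line of fetch.txt gives a file’s URL, its length in bytes (or a “-” when the length is unknown) and the path the file will occupy inside the bag. A fetch.txt for a hypothetical two-file bag might read like this (the host and file names here are invented for illustration):

    http://example.org/eot2008/crawl-001.warc.gz  1073741824  data/crawl-001.warc.gz
    http://example.org/eot2008/crawl-002.warc.gz  -           data/crawl-002.warc.gz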
That is, a source institution provides pairs of complete and holey bags on its server. The transferring partner downloads the holey bag first, then uses its fetch.txt file to retrieve the complete bag from the source server, “filling” the holey bag on its own end.
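In Python, the “filling” step might be sketched like this. The function name is ours, and our real transfers ran through the BagIt Library over Internet2 rather than plain HTTP downloads, but the mechanism is the same: walk fetch.txt and put each file in its place.

    import urllib.request
    from pathlib import Path

    def fill_holey_bag(bag_dir):
        # Read fetch.txt and retrieve each listed file into its
        # place in the local copy of the bag.
        bag = Path(bag_dir)
        for line in (bag / "fetch.txt").read_text().splitlines():
            url, _length, relpath = line.split(maxsplit=2)
            target = bag / relpath
            target.parent.mkdir(parents=True, exist_ok=True)
            urllib.request.urlretrieve(url, str(target))
        # Once filled, the bag can be verified against its manifests
        # exactly like a bag that arrived complete.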
Increasingly, institutions are using bags not just for transfer but to hold content on storage and access servers. The ability to manipulate holey bags for transfer means that institutions can organize content on their own servers as they wish, without having to re-organize or resize it for transfer.
For example, CDL bagged the files in each of its crawls, but two of these crawls were nearly 3 terabytes each, a very large quantity of information for transfer in one session. But thanks to holey bags, we did not need to ask CDL to do anything. Instead, we simply split the holey bags at our end, making two out of each one, and transferred the content in the size that was most convenient for us.
Conversely, UNT had many small crawls — over 70 bags comprising just 1 terabyte total. Instead of laboriously doing 70+ very small transfers, we simply merged all the holey bags into one and did a single transfer. Thus, the Library was able to easily re-package content for transfer when pulling it from a partner, and make those transfer-efficient bags available to other partners.
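This flexibility comes from the fact that a holey bag is, in essence, just its fetch.txt. Here is a minimal sketch of both tricks, under the simplifying assumption that only fetch.txt needs to be rewritten (a real split or merge would also divide or combine the bags’ payload manifests the same way):

    from pathlib import Path

    def split_holey_bag(bag_dir, out_a, out_b):
        # Divide one fetch.txt into two, so one very large transfer
        # can be done as two smaller sessions.
        lines = (Path(bag_dir) / "fetch.txt").read_text().splitlines()
        half = len(lines) // 2
        for out_dir, part in ((out_a, lines[:half]), (out_b, lines[half:])):
            out = Path(out_dir)
            out.mkdir(parents=True, exist_ok=True)
            (out / "fetch.txt").write_text("\n".join(part) + "\n")

    def merge_holey_bags(bag_dirs, out_dir):
        # The same trick in reverse: combine many small fetch.txt
        # files so 70+ tiny transfers become a single session.
        merged = []
        for d in bag_dirs:
            merged.extend((Path(d) / "fetch.txt").read_text().splitlines())
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        (out / "fetch.txt").write_text("\n".join(merged) + "\n")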
The Library has developed an open source tool, the BagIt Library (BIL), based on the BagIt specification. BIL is a Java library for the Unix environment, with a range of commands for transfer, validation and verification, and for making and manipulating bags.
The Library used BIL to transfer content and provided the tool to all the other partners. They used it to bag their own content and transfer other partners’ content from the Library. Installation was minimal and the tool proved easy to use.
In addition to BIL for the Unix environment, the Library has released an open source desktop version called Bagger, so users can now bag content from their desktops.
When we began this project, we were much less daunted by total content size than by the scheduling and tracking issues. We wanted an orderly way to go about the process. And we wanted to know that every partner had received the content they requested.
Our simple, straightforward methods served those requirements, but were greatly aided by the organization of all content into bags and the use of a common set of tools. All partners could use the same strategy and workflow.
There are some sophisticated systems available for sharing digital content these days, and we may use something different for the next End of Term collaboration. But this relatively simple first experience was a kind of use case for us—a way to sketch out how we would interact with each other, organize a workflow, and carry out a process to reach our goal.