Those 1’s and 0’s are Heavier than You Think!

The following is a guest post by Laura Graham, a Digital Media Project Coordinator at the Library of Congress.

Bit preservation activities for the Web Archiving team include acquiring content, copying it to multiple storage systems, verifying it, and maintaining current information about the content.  But even these minimal steps, which do not include managing the storage systems and instituting statistical auditing, can be labor intensive.

We all know that bit preservation takes a lot of time and people. We factor it into our strategies and calculations, even if our conversations are more about large-scale storage management systems, the most reliable yet cost-effective combinations of disk and tape, new cloud platform technologies, and current best practices.

So, how do we manage this very intensive labor?  Here on the Web Archiving team, we have been acquiring content and managing it in storage and on access servers since 2008.

A key issue in our efforts has been knowing where everything is, what its current status is, and, most important, where we are in processing it. Inventorying and tracking have been necessities in human undertakings for centuries, long before the advent of the “bit.”  And yet, it can still be a surprise to find out just how demanding these tasks can be.

In 2008, we had accumulated approximately 80 terabytes (TB) of Web archive content at the Internet Archive (IA), our crawling contractor in California. This content was growing continuously. We needed to transfer it here to make it publicly accessible onsite, and also to create an additional storage replica.

When we began, we assumed network transfer over Internet2 from California to Washington, DC, would be the all-consuming process. We soon discovered that moving and processing the content after it had landed on our ingest server was far more laborious.

The steps had seemed deceptively simple: pull the content from IA’s servers to the ingest server and verify it; copy it to our tape storage system and verify it; copy it to disk on our access server and verify it there.  But to make progress, we needed to carry out these tasks in multiple overlapping streams of transfer, verification, copying and re-verification. And at this stage, all these processes were done “manually” on the Unix command line.  A lot of work!
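
The copy-and-verify step that repeats at each stage can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure the team actually ran: the function names `file_digest` and `copy_and_verify` and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination_dir):
    """Copy one file, then confirm the copy matches the source checksum."""
    source = Path(source)
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    target = destination_dir / source.name
    expected = file_digest(source)
    shutil.copy2(source, target)
    if file_digest(target) != expected:
        raise IOError(f"checksum mismatch after copying {source} to {target}")
    return target
```

Calling `copy_and_verify` once per destination (ingest, tape, access) mirrors the three verify-after-copy steps described above, and makes it plain why the checksum work multiplies with every replica.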

Our initial tools comprised parallel.py, a network transfer script written in Python that ran multiple threads on a single content transfer target. And for inventorying and tracking? Of course, the ubiquitous spreadsheet.
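
The post does not show parallel.py itself, so the following is only a sketch of the general idea of running several transfers at once with a pool of threads. Local file copies stand in for network transfers, and the name `transfer_all` and its parameters are hypothetical.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def transfer_all(sources, destination_dir, max_workers=4):
    """Copy files concurrently; return a dict of file name -> exception or None."""
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one copy job per file; the pool runs up to max_workers at once.
        futures = {
            pool.submit(shutil.copy2, src, destination_dir / Path(src).name): Path(src).name
            for src in sources
        }
        for future in as_completed(futures):
            name = futures[future]
            # future.exception() is None when the copy succeeded.
            results[name] = future.exception()
    return results
```

Returning the per-file outcome, rather than stopping at the first failure, lets an operator see at a glance which transfers need to be retried.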

The spreadsheet told us how much content we had transferred and how much of it was in storage and on our access server.  Empty cells told us what steps were outstanding on any single day.
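
The "empty cell" idea translates directly into data: each row is a unit of content, each column is a step, and a missing value marks work still to do. The sketch below uses invented step names and sample rows purely for illustration; they are not the team's actual spreadsheet columns.

```python
# Hypothetical workflow steps, in the order they must be completed.
STEPS = ["transferred", "verified_on_ingest", "copied_to_tape", "copied_to_access"]


def outstanding_steps(row):
    """Return the steps whose cell is still empty (None) for one row."""
    return [step for step in STEPS if row.get(step) is None]


def summarize(rows):
    """Map each item name to its outstanding steps, skipping finished items."""
    report = {}
    for name, row in rows.items():
        todo = outstanding_steps(row)
        if todo:
            report[name] = todo
    return report


# Invented sample data: dates mark completed steps, None marks an empty cell.
rows = {
    "crawl-001": {"transferred": "2008-05-01", "verified_on_ingest": "2008-05-02",
                  "copied_to_tape": "2008-05-03", "copied_to_access": "2008-05-04"},
    "crawl-002": {"transferred": "2008-05-02", "verified_on_ingest": None,
                  "copied_to_tape": None, "copied_to_access": None},
}
print(summarize(rows))
```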

But there are customization limits to any desktop software. We could not query and manipulate the accumulating mass of data as we wanted to. And the spreadsheet, while on a shared drive, was not easily accessible to everyone who needed to know what we were doing.

Fortunately, by the time we approached the 80 TB mark in our transfer of content from IA, the Library’s repository development team had reached some milestones of its own in development of a system-wide Content Transfer Service (CTS) with an easily accessible Web interface.

The CTS is complex and could easily be a blog topic all its own. Briefly, it provides a set of services for digital content that has arrived at the Library. The most important of these are receiving and inventorying the content and then copying it to other locations, such as tape storage or access servers, where it verifies and inventories the content again.

Most importantly, the CTS moves the copying and verifying steps off the command line and automates them for us.  While we still have to initiate processes, the CTS queues, monitors, and reports back to us on those processes via the interface and email. It automatically verifies content that has been copied to any new location.  Inventory reports are easily generated by project, location, and date range.

You might think that we then let go of our spreadsheet. Not so! It found new life in the Bag Tracking Tool (BTT).  (The name comes from the “bag,” a straightforward and efficient way to package content for storage and access, on which our suite of bit preservation tools is modeled.)
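
For readers new to the format: a bag, as defined by the BagIt specification, keeps its payload files under a data/ directory next to a manifest file that lists a checksum for each one. The sketch below checks a bag against such a manifest. It is not the Library's tooling; `verify_bag` is a hypothetical name, MD5 is just one of the algorithms a manifest may use, and each file is read into memory whole, which would not suit very large files.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir, algorithm="md5"):
    """Return the payload paths whose file is missing or whose checksum differs."""
    bag_dir = Path(bag_dir)
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<checksum> <path relative to the bag root>".
        expected, relative_path = line.split(None, 1)
        relative_path = relative_path.strip()
        payload = bag_dir / relative_path
        if not payload.is_file():
            failures.append(relative_path)
            continue
        actual = hashlib.new(algorithm, payload.read_bytes()).hexdigest()
        if actual != expected.lower():
            failures.append(relative_path)
    return failures
```

An empty list means every file named in the manifest is present and intact, which is the kind of check that has to be repeated each time a bag lands in a new location.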

While CTS performs inventorying, copying and verifying services for us, the BTT still tells us where we are in our work processes.  Most importantly, it tracks content acquisition processes that fall outside the CTS, such as external transfer from IA, and more recently, onsite crawling projects.

(We use our staff identification photos to indicate the owner of a bag in the BTT, which makes it very important, of course, to get that photo done right!)

The BTT also provides a team-oriented view of our workflow. It is a component of our Web archiving software developed here at the Library for managing the Web archiving lifecycle from nomination of Web sites to crawling and quality review of the captured content.

Eventually, we will integrate the BTT with the CTS, making a seamless inventorying and copying service and tracking system.

As a cumulative record, all of these tools tell us about that “labor intensive” component in bit preservation. They also tell us where we can best automate our efforts.

For we know that there are limits not just on storage, bandwidth and system performance.  There are also limits on that other component: staff, time, and effort!

One Comment

  1. Kate
    September 21, 2011 at 11:52 am

    This is great! As someone who’s trying to make the mental shift from thinking about METS as the end-all be-all packaging solution for digital content to BagIt, this is a great example of how it’s used in practice! Thanks so much!
