Tracking Digital Collections at the Library of Congress, from Donor to Repository

Kathleen O’Neill

When Kathleen O’Neill talks about digital collections, she slips effortlessly into the info-tech language that software engineers, librarians, archivists and other information technology professionals use to communicate with each other. O’Neill, a senior archives specialist in the Library of Congress’s Manuscript Division, speaks with authority about topics such as file signatures, hex editors and checksums even though she has a traditional paper-centric Master of Library Science degree. She picked up her technology expertise on the job, through years of rescuing digital content from erratic computers, troublesome files and unstable storage media.

The Library often acquires a collection at the end of someone’s career, which means that many of the digital files that O’Neill sees in collections may have been created decades ago. And chances are good that some of the storage devices, or the files they contain, will be obsolete and will require the Manuscript Division to process them with special digital forensics resources.

When a collection is first received by the Manuscript Division, a staff member reviews the contents and if digital media devices are found, they are transferred to the digital collections registrar, O’Neill. Archivists might find digital storage devices among the paper documents later when they are processing the collection. In either case, O’Neill records receipt of the materials in a local database. The record includes the collection name, collection number, a registration number and any additional notes about it. Said O’Neill, “If I get digital material, I give it a registration ID and that forms the beginning of what will become a unique ID for each piece of media.” This begins the tracking information or what O’Neill calls the “chain of custody.”

Once O’Neill gets the storage devices (which are essentially computer hardware versions of collection boxes or containers) she completes or supervises the following tasks:

  1. Physical inventory of the storage devices.
  2. Transfer of the files off the storage devices using the Bagger tool (more about that below).
  3. Transfer of the files to the Library’s digital repository for long-term preservation.

Disk with electronic media removal form.

Kimberly Owens, senior archives technician, is responsible for physical inventory of the storage devices. This task includes photographing the disk on a standard paper form filled with information about the disk, most importantly the location from which it came. When she or the archivist assigned to the collection finishes adding the metadata, the paper form is placed into the paper collection where the disk was; the disk goes to a special hardware/software storage area. “If you remove the disk without creating proper documentation, you could lose context,” said O’Neill. “When you’re working on hundreds of pieces of media it’s hard to keep track, so it’s nice to have a visual to go back to and make sure that it’s the right one. We count and photograph each piece of media, which can be labor intensive. One of our larger collections had well over six hundred 3.5″ floppies.”

Before any action is taken on a storage device, O’Neill protects the media from being unintentionally altered by write-protecting it. She “locks” 3.5″ floppies by moving the tab on the disk to the write-protect position. For other media, the division has a variety of write-blockers on hand. O’Neill creates a file directory listing for each piece of media. “We do a simple command-line file directory listing so that we can capture a list of the files and the dates they were modified,” said O’Neill. “It’s been really useful early on because I’ve been able to identify different materials that we should or should not have.”
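The directory listing O’Neill describes can be approximated in a few lines of Python. This is only a sketch of the idea, not the command the Manuscript Division actually runs; the function names are hypothetical, and it assumes the media is already write-protected and mounted at a path the script can read.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def list_directory(root):
    """Return (relative_path, size_in_bytes, modified_time) for every file under root."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = Path(dirpath) / name
            info = full_path.stat()  # reads metadata only; the file is never opened
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            entries.append(
                (full_path.relative_to(root).as_posix(), info.st_size, modified)
            )
    return sorted(entries)


def format_listing(entries):
    """Render the listing as plain text, one file per line."""
    return "\n".join(
        f"{modified:%Y-%m-%d %H:%M}  {size:>10}  {path}"
        for path, size, modified in entries
    )
```

Saving the output of `format_listing(list_directory(mount_point))` to a text file next to the registration record would give a snapshot of file names and modification dates before anything is copied.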

The next step is to bag the media. O’Neill prepares the files by organizing them into bags using the Library of Congress’s Bagger tool. The bags contain the digital collection material along with self-describing documentation. The structure of a bag includes:

  • a directory containing the file or files (data)
  • a checksummed “manifest,” a receipt that itemizes the files in the bag
  • a “bagit.txt” file that declares, “I am a bag.”
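To make that structure concrete, here is a minimal Python sketch that builds a bag by hand. It is not the Bagger tool and leaves out the optional metadata Bagger can record; the function names are hypothetical, and the SHA-256 algorithm and the `BagIt-Version: 0.97` declaration are assumptions chosen for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Checksum a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data, then write bagit.txt and a manifest."""
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)  # raises an error if data/ already exists

    # The declaration file that says, "I am a bag."
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # The manifest: one "<checksum>  <path>" line per file under data/.
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        lines.append(f"{sha256_of(path)}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir
```

Because the manifest travels inside the bag, anyone who receives the bag later can recompute the checksums and confirm that every file arrived intact.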

Often, this step does not go simply or smoothly.

Basic contents of a digital bag.

Old disks can be unpredictable and inconsistent; just trying to view their contents can be a challenge. O’Neill might pop an old floppy disk into one computer and nothing happens; the computer might not recognize the presence of the disk. Or worse, the computer might ask if she wants to format the disk (and erase its contents). If she pops the disk into another computer of the same year, make and model, she might easily see the disk and view its contents. Or the disk might display onscreen but appear to be empty when it really is not.

O’Neill and her colleagues have a number of desktop tools for examining and reporting on disk contents, disk structure, modification dates and so on.

If they need more sophisticated tools for problematic disks, O’Neill can take the disks to the Library’s Preservation Digital Reformatting Program and use the Forensic Recovery of Evidence Device, or FRED (a high-end digital-forensics workstation), to analyze and recover files from the disks.

The FRED can read and access files on a variety of disparate storage media without accidentally damaging or erasing the contents. Loaded onto FRED is the Forensic Toolkit, or FTK, which is software for diagnosing disks and files, searching, sorting and restoring “deleted” files. FTK can identify and create reports about files’ properties and formats, and report which OS and software (and their versions) created the files and what applications will read them. FTK Imager can create a disk image (an exact copy of the content, exactly as it is on the original storage device, including data and structure information) so users can work with the copy to avoid the risk of damaging the original.
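The core idea of a disk image, a byte-for-byte copy whose checksum is recorded at the moment of capture, can be sketched in Python. This is not how FTK Imager works internally and it does none of FTK’s analysis or error recovery; the function names are hypothetical, and the sketch assumes the source is a readable file or a raw device that is already behind a write-blocker.

```python
import hashlib


def image_source(source_path, image_path, chunk_size=1024 * 1024):
    """Copy source_path byte for byte into image_path; return the SHA-256 of the bytes read."""
    digest = hashlib.sha256()
    # Mode "xb" refuses to overwrite an existing image file.
    with open(source_path, "rb") as source, open(image_path, "xb") as image:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            image.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(image_path, expected_hex, chunk_size=1024 * 1024):
    """Re-read the image and confirm it still matches the checksum taken at capture."""
    digest = hashlib.sha256()
    with open(image_path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex
```

Once the image verifies, all further examination can happen on the copy while the original media goes back on the shelf.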

Screenshot of FTK Imager interface by Dorothy Waugh on http://bit.ly/1G1N4jW.

Another resource available to the Manuscript Division staff is BitCurator, which we profiled in the Signal interview with Cal Lee and the story about the Maryland Institute for Technology in the Humanities. Some of BitCurator’s hardware and software functions are modestly comparable to the FRED’s. But BitCurator was designed specifically to help digital archivists manage information, especially sensitive personal information that may be contained within the collections.

O’Neill said that the Manuscript Division has not yet had a major issue with a CD-ROM or DVD, but if it does, staff can consult with the experts in the Library’s Preservation, Research and Testing Division to analyze the disc.

Through their thorough and detailed records, O’Neill and her colleagues have been able to search for patterns among certain storage media. For example, they discovered that old double-sided, high-density disks can be difficult to access with modern equipment. “But, then again, sometimes it could just be a bad set of disks,” said O’Neill. “I’ve hit a whole run where double-sided, high-density disks don’t work. I can’t get them to read. Sometimes I’m able to recover them quickly; sometimes not. Sometimes the third time works. It’s trial and error. But you have to balance that with what’s on the disk and whether or not it is worth the extra levels of work.”

Directory Listing (command line) by David R. Tribble on Wikipedia.

After O’Neill accesses and catalogs the contents of a disk, she copies the files off the disks and into the Library of Congress repository. For this purpose, she uses the Library’s Content Transfer System, which enables staff to describe, inventory and transfer files from local media to the repository.

The Content Transfer System allows staff to validate the integrity of the files and the completeness of the bag by checking them against the manifest to confirm that nothing has changed. “We try to generate a checksum at the earliest point in the process,” said O’Neill, “so that we have that checksum carried through as the file moves through our system. It’s a way to document the authenticity of the file.”
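The kind of check described here, comparing every file against the manifest and confirming nothing is missing or extra, might look like the following Python sketch. It is not the Content Transfer System’s code; the function name is hypothetical, and the manifest file name and SHA-256 algorithm are assumptions.

```python
import hashlib
from pathlib import Path


def validate_bag(bag_dir, manifest_name="manifest-sha256.txt", algorithm="sha256"):
    """Return a list of problems; an empty list means the bag matches its manifest."""
    bag_dir = Path(bag_dir)
    problems = []
    listed = set()

    # Integrity: every file named in the manifest must exist and match its checksum.
    manifest_text = (bag_dir / manifest_name).read_text(encoding="utf-8")
    for line in manifest_text.splitlines():
        if not line.strip():
            continue
        expected, relative_path = line.split(None, 1)
        listed.add(relative_path)
        target = bag_dir / relative_path
        if not target.is_file():
            problems.append(f"missing: {relative_path}")
            continue
        # Reads the whole file at once; a real tool would hash in chunks.
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if actual != expected.lower():
            problems.append(f"checksum mismatch: {relative_path}")

    # Completeness: every file under data/ must be itemized in the manifest.
    for path in sorted((bag_dir / "data").rglob("*")):
        relative_path = path.relative_to(bag_dir).as_posix()
        if path.is_file() and relative_path not in listed:
            problems.append(f"not in manifest: {relative_path}")
    return problems
```

Running the same check at each hand-off, and logging the result, is one way a checksum generated early in the process can be “carried through” as the file moves from system to system.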

O’Neill copies the bagged master files to the Library of Congress repository, which replicates them locally and at a geographically remote location. The Content Transfer System also scans for viruses when it ingests the file. The Content Transfer System tracks the user login information and all the metadata associated with the files, so between the Manuscript Division’s local database and the Content Transfer System, there is a continuous record associated with a given file from the time the first staff member appraised the collection.

Finally, O’Neill shelves the original digital hardware and software for preservation, in case someone ever needs to access them again.

In March 2015, I asked O’Neill for an inventory of the storage media currently in the Library’s Manuscript Division collections, and she came up with this list:
930 – 3.5″ floppies
250 – 5.25″ floppies
145 – Optical media (CDs, DVDs, CD-Rs, etc.)
65 – 8″ floppies
35 – Zip drives
30 – Computer tapes
3 – CPUs
3 – Bernoulli disks
4 – Flash drives
3 – External hard drives

Of course that list is continually expanding and, in fact, it should increase exponentially in the near future as the Library acquires collections from donors who created the bulk of their works digitally. The Manuscript Division has completed ingest to long-term storage on approximately 500 of these media. Kimberly Owens is working to inventory and bag the remaining backlog.

Added metadata.

Researchers visiting the Library of Congress can access copies of some of the digital collections but access depends on copyright and the conditions established by the collection donor. There are also technological challenges to serving up records. While the Division is scheduled for some infrastructure upgrades in the next several months, in the meantime the reading room terminals are connected to the Library’s network and not to a hard drive that is loaded with software that could open, say, graphics or documents. “That means that, depending on the file format, the researcher may or may not be able to read the file,” said O’Neill. “It would have to be something that was renderable in a browser. And if it’s renderable in a browser, somebody could copy it and email it to themselves. So there are security issues that we are trying to work through.” Access is currently available only onsite in the Manuscript Reading Room.

Viewing some files will continue to be a technological challenge, especially the old or obscure file types. “The earliest digital material that we’ve seen is from 1987,” said O’Neill. “That would be Word Perfect 4.1 or something or a really old Apple file format. All of that is tricky because we don’t have the software to read everything. No one at the Library has the software or drives to read every file format.”

The volume of digital files that the Manuscript Division and other divisions around the Library archive far exceeds the ability of the staff to test each one to see whether it displays. A checksum is the best automated option for verifying file integrity right now: it can check an entire inventory at lightning speed and reveal whether a file has changed. Whether the file’s contents can be displayed is a different matter.

More than likely, instances of restoration will happen as researchers come to the Library to experience files in their original environments, on old computers with old operating systems. Maybe there will be an emulation solution and we won’t need original hardware and software. That would require a whole other set of digital forensics tools, many of which the Manuscript Division already has.

Not all researchers will require a perfect rendering of the original file though. “I think there will be two very different levels of interest from researchers,” O’Neill said. “There are the high-powered, technically savvy digital humanities people who seem to be driving a lot of the conversation in this area. But I think quite a lot of people are just interested in the information. They don’t care what the file format is. They want the information.”

At the very least, the Manuscript Division has an efficient end-to-end curatorial system in place, one that they continue to refine. “We tried to understand and perfect the ingest portion of it, so that we knew things were saved and safe and inventoried,” said O’Neill. “Access and appraisal of digital collections are ongoing issues with the archivists. And with every collection that comes in, there’s always some weird quirk that makes us re-think everything.”
