This is a guest post written by Amanda May, Digital Conversion Specialist in the Preservation Services Division. Her work includes recovering data from removable media in Library collections and providing consultation and services for born-digital collections data.
The Preservation Services Division uses a wide array of specialized software to preserve born-digital collections, most of which originally arrived at the Library of Congress on external media such as floppy disks, optical disks, and hard drives. By using this software, I am able to find and preserve millions of files that are otherwise trapped on older media.
Much of the software is used for creating disk images or recovering files from specific types of media. A disk image is an exact, bit-for-bit copy of the original media item, in digital form. I don’t always create a disk image, and we don’t always save the disk images we create; sometimes they serve only as a surrogate for the original media until I am able to save the files we want. As detailed in a previous blog post, the Preservation Services Division uses the KryoFlux and the FC5025’s Disk Image and Browse software to create disk images for 3.5” and 5.25” floppy disks. For optical disks, Zip disks, and hard drives, I most often use FTK Imager, a free application published by AccessData; I sometimes also use the Guymager app in the BitCurator environment. With either tool, I can see all of the drives attached to one of our workstations and choose which one to image or explore for the files we want to save.
For Mac-formatted hard drives of a certain age, I sometimes have to use HFS Explorer to view and image the media. Mac-formatted media presents its own challenges, requiring special techniques and sometimes special software to recover files. Emulation has been used in other parts of the Library to recover and view Mac files.
After a piece of media is imaged and the files have been exported, I need to know what I have recovered. The Preservation Services Division licenses AccessData’s Forensic Toolkit (FTK) to analyze the data recovered from all the media. Because the software was created with the law enforcement community in mind, our collections become a “case” and each piece of media becomes “evidence”. I can analyze all of the disk images I made, or add several directories of files. I use the software to create file listings, find files of a certain profile (emails, executables, videos), and preview each one. Using hashing algorithms, I can calculate hash values, which uniquely identify files and reveal whether they have changed over time or in copying. I can use these values to look for duplicate files, even if they are in different locations and have different filenames. I can also look for personally identifiable information (PII) and keywords using regular expressions; one recent collection needed to be searched for several terms related to notable people and corporations, classified information markers, and PII like credit card numbers and social security numbers. Gathering this sort of information helps the processing archivist know what kind of data they have, making it easier to get the collection ready for public use by enhancing description, organizing files, and redacting some files, if necessary. For instance, the Manuscript Division used reports about the American Lands Alliance records, 1976-2009, one of the first collections with a significant body of electronic material, to reimagine finding aids at the Library and describe the born-digital files alongside similar material found in the paper portions of the collection.
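FTK handles this work at scale, but the two underlying ideas are simple enough to sketch in Python: hash every file’s contents so duplicates group together regardless of name or location, and run a regular expression for a PII pattern. The SSN-style pattern below is a deliberately simplified illustration, not the actual patterns used in screening:

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path

# Deliberately simplified SSN-style pattern; real screening uses many more
# patterns (credit card numbers, keywords, classification markers, etc.)
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_duplicates(root):
    """Group files under `root` by content hash, ignoring names and locations."""
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_hash[hashlib.md5(path.read_bytes()).hexdigest()].append(path)
    # Only hashes shared by two or more files are duplicates
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Because the grouping key is the hash of the file’s bytes, a renamed or relocated copy still lands in the same group as its original, which is exactly why hash values are so useful for deduplication.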
For some curatorial divisions, the content of graphics files is of utmost importance. Cameras, editing software, and creators all leave their mark on graphics files like JPEGs and TIFFs. I can use ExifTool to export the embedded metadata from every graphics file in a collection into a single CSV spreadsheet. Information like the make and model of the camera, creation time, focal length, and geolocation data can all be embedded in a file, but this data stays hidden without this type of report.
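ExifTool can produce this kind of report directly (for example, `exiftool -csv -r <directory>`). The sketch below only illustrates the shape of the resulting spreadsheet, using made-up metadata rows and Python’s csv module rather than real extracted values:

```python
import csv

# Hypothetical metadata rows of the kind ExifTool extracts from graphics files
records = [
    {"SourceFile": "img_0001.jpg", "Make": "Canon", "Model": "EOS 5D",
     "CreateDate": "2008:06:14 10:32:01", "FocalLength": "50.0 mm"},
    {"SourceFile": "img_0002.jpg", "Make": "Nikon", "Model": "D90",
     "CreateDate": "2009:03:02 15:07:44", "FocalLength": "35.0 mm"},
]

# One row per file, one column per metadata field
with open("metadata_report.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)
```

A tabular report like this lets a curator sort and filter thousands of images by camera, date, or location without opening a single file.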
As the final step, all of the data has to be packaged for long-term preservation. The Library uses the BagIt specification as the standard structure for holding all preserved data in the repository. Each set of data becomes a directory called a “bag”. The data itself is held in a “payload” directory called “data”, accompanied by “tag” files that describe the contents of the bag, record the version of the specification in use, and inventory the payload and tag files along with the hash value of each file. I use a homegrown application called Bagger to create bags one by one, or a Python version to create many bags at once. Once bags are created, I ingest the files into the digital repository or return them to the curatorial division for further processing. Bagging records the original state of the data and produces a discrete package for storage and retrieval, both important concepts in digital preservation.
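In practice the Bagger application or the Python tooling handles this, but the layout the BagIt specification requires is easy to sketch. The simplified function below writes a payload into `data/`, plus the `bagit.txt` declaration and an MD5 manifest; real bags also carry additional tag files such as `bag-info.txt`:

```python
import hashlib
from pathlib import Path

def make_bag(payload_files, bag_dir):
    """Write a minimal bag: data/ payload, bagit.txt declaration, MD5 manifest.

    `payload_files` maps file names to bytes content; `bag_dir` is a
    hypothetical destination directory. This is a sketch of the structure,
    not a replacement for the real bagging tools.
    """
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in payload_files.items():
        (bag / "data" / name).write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")

    # The bag declaration tag file required by the BagIt specification
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )
    # One checksum line per payload file, for later fixity checking
    (bag / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")
```

Because the manifest records a hash for every payload file at bagging time, anyone who retrieves the bag years later can recompute the hashes and confirm nothing has changed in storage.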
Many different types of specialized software are used for the Preservation Services Division’s born-digital preservation work. By using this software, we have been able to recover millions of files that might have otherwise been lost – a huge part of our modern cultural history. As we transitioned from a world of paper to a world of computers, more and more was done entirely digitally, with no paper evidence of life or work. If we are not prepared to preserve that data, we will have no archival record available for many modern notables.