The following is a guest post by Senior Archivist Kathleen O’Neill. Kathleen and her colleague Chad Conrady are currently working on a project called Born Digital Access Now! as the 2020 Staff Innovators in LC Labs. Their first blog post introduces the project, which aims to provide greater access to born digital materials held in the Manuscript Division. Today’s post is the first in a series of three blog posts in which Kathleen will discuss different challenges or barriers to born digital collection access through the lens of three different metaphors. Up first is: “Media Format, or, Have Fun Storming the Castle!”
One of the joys of processing a paper collection is the initial review — opening the boxes, noting the condition of the papers, getting clues from the folder headings, documenting any evident organization, dates, and types of materials. It feels like the beginning of an adventure or mystery to unlock, and when a folder heading sparks interest, you simply pick up the folder, open it, and dive in. While there are times paper materials need mold remediation, de-acidification, or stabilization before an archivist can begin processing, for the most part, paper materials allow immediate access to their contents.
For born digital material, the joy described above is not as immediate and is usually hard won. Yes, a review of the media formats and their labels can provide clues to the content and age of the materials. But labels are often incorrect and there is no way to know the number or formats of files by looking at the label. The barriers to overcome include media format, acquiring tools, and in some cases, obsolete software and operating systems. The experience can feel like falling down a rabbit hole, setting off on a mission, translating a foreign language, unlocking a code, or revealing a buried treasure. This work requires patience, trial and error, multiple tools, and most importantly, mixed metaphors. Join me as I walk you through the challenges of accessing born digital materials and, in the process, introduce you to some of our collections.
Media Format or “Have fun storming the castle!”
The Library of Congress Manuscript Division born-digital holdings include content from the 1980s to the present day. The media formats in our collections reflect the myriad ways people captured and stored information during that date range. So yes, we have computer tape, 8”, 5.25”, and 3.5” floppies, CDs and DVDs, hard drives, Bernoulli drives, USB drives, and content from proprietary online services and applications. Each has its inherent challenges.
For the physical media, the most basic question is: do you have the drives to access the media? Working with online services and apps entails questions about passwords, permissions, and concerns about altered metadata. We learn from our collections. I do not just mean learning subject matter and history, but we learn how to be archivists from the opportunities our collections provide us. This, of course, is true of working with paper materials, but born-digital materials are a relatively new format and the archival profession is still developing workflows, standard practices, and tools to access and preserve the digital content. I included the above line of dialogue from The Princess Bride because I think it perfectly captures my mood as I set off to tackle the digital content in the Seth MacFarlane Collection of the Carl Sagan and Ann Druyan Archive. Like the motley group off to save Princess Buttercup, I had a sense of mission and responsibility, a hard deadline to meet, and a blissful ignorance of the difficulties that lay ahead.
Seth MacFarlane collection of the Carl Sagan and Ann Druyan archive
The Manuscript Division received the Seth MacFarlane collection of the Carl Sagan and Ann Druyan archive in 2012 when our born-digital workflow was relatively new and still developing. Carl Sagan did not use a computer so we expected some but not a great deal of digital content in the collection. The first pass on the paper materials uncovered two boxes of almost 200 storage media, a combination of 3.5” and 5.25” floppy disks. As the processing team worked over the next 18 months, more and still more media was found. In the end, the collection contained over 730 pieces of digital storage media and it remains one of the Division’s largest collections in terms of the number of media.
When I inform Sagan admirers that he did not use a computer, they always look surprised and a bit disappointed. He used an array of technology in his fascinating creative process, which involved dictating sections of book drafts that were then transcribed for him. His writing process is described in more detail here: //www.loc.gov/collections/finding-our-place-in-the-cosmos-with-carl-sagan/articles-and-essays/carl-sagan-and-the-tradition-of-science/sagans-thinking-and-writing-process/. Together, the physical and digital parts of the collection document not only his creative process, but an overlay of technologies including the adoption and usage of various digital storage media and file formats.
After I got over the shock of the amount and diversity of digital media, I realized, fortunately, the collection contained primarily 3.5” floppy disks. Why fortunately? Well, since it was 2012, our work computers still had 3.5” floppy drives. We got to answer, in the affirmative, that first basic question: do you have the drives to access the media? Additionally, 3.5” floppies are surprisingly stable. If you have a 3.5” floppy drive, it is relatively easy to copy files off the media. These files were largely from the mid-1980s to 1997 and therefore tended to be small and not complex, with simple or no folder hierarchies.
There were some bumps in the road recovering data from the 3.5” floppies. Some of the disks were damaged or the files were corrupt. We were able to recover content from the damaged and corrupt disks by using Forensic Toolkit (FTK) Imager to create disk images and then exporting the files from the disk images.
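For readers curious about what creating a disk image involves, here is a minimal sketch in Python. It is not how FTK Imager works internally, and it is not the Division's actual workflow. It only illustrates the general idea under some assumptions: the disk is copied sector by sector, unreadable sectors are filled with zeros and logged, and a checksum of the finished image is recorded. The function name `image_device` and the 512-byte sector size are illustrative choices, not details from the project.

```python
import hashlib
import io

SECTOR_SIZE = 512  # assumed sector size; common for PC-formatted floppies


def image_device(device, image, sector_size=SECTOR_SIZE):
    """Copy `device` to `image` one sector at a time.

    `device` and `image` are binary file-like objects. A sector that
    raises OSError on read is written to the image as zeros, and its
    index is recorded. Returns (bad_sector_indexes, md5_of_image).
    """
    bad_sectors = []
    digest = hashlib.md5()
    index = 0
    while True:
        device.seek(index * sector_size)
        try:
            chunk = device.read(sector_size)
        except OSError:
            # Hypothetical handling: keep going and zero-fill the gap
            chunk = b"\x00" * sector_size
            bad_sectors.append(index)
        if not chunk:
            break  # an empty read means the end of the media
        image.write(chunk)
        digest.update(chunk)
        index += 1
    return bad_sectors, digest.hexdigest()


if __name__ == "__main__":
    # Stand-in for a real drive: two healthy sectors held in memory
    source = io.BytesIO(b"A" * SECTOR_SIZE + b"B" * SECTOR_SIZE)
    target = io.BytesIO()
    bad, checksum = image_device(source, target)
    print(f"bad sectors: {bad}, md5: {checksum}")
```

Working from an image rather than the original disk means a fragile floppy only has to be read once, and the list of bad sectors gives the archivist something concrete to document.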
In the end, we were able to recover the content from 420 out of 498 of the 3.5” floppy disks, comprising over 19,000 files (242.6 MB). These were largely text files. The lead archivist, Connie Cartledge, estimated that approximately 95% of the digital content was printed out and could be found in the paper portion of the collection.
There still remained significant media format challenges. We were unable to process the remaining 78 of the 3.5” floppy disks. Most of these were either double-sided, double-density disks or Mac-formatted disks, neither of which our computers could read. In addition, we had not yet developed a workflow for the 5.25” floppy disks and did not have the hardware to read the 8” floppy disks and Bernoulli drives. The 8” floppy disks and Bernoulli drives remain inaccessible.
Digital processing lesson: The proper hardware in the form of floppy drive readers does not guarantee access to digital content.
Where does that leave us with the “storming the castle” metaphor? We’ve made progress, rescued significant content. We’ve scaled the wall and reached the courtyard, only to discover the door to the dungeon is locked and we’ve brought the wrong key.
Next week, join us for Legacy File Formats and Operating Systems or “Lost in Translation” when the Walter Sullivan papers teach me what happens when obsolete file formats meet modern day operating systems.
Note: this post has been slightly edited for clarity.
Thanks for the fascinating intro to the world of born-digital manuscripts. Inconceivable!
I am curious about how you document the disk imaging process when there are errors. How do you describe the issues in the metadata? If a text document is corrupted, what is your process for preserving the file?
Hello, there. This reply is from Kathleen O’Neill, Senior Archivist and author of this post. Thanks for reading!
For each piece of storage media, we assign a digital ID#, create a record for each digital ID# in an Access database, and document all actions taken on the files associated with that digital ID#. We refer to the grouping of files under a digital ID# as a “bag”. In the database record, standard actions on the files have checkboxes or yes/no fields. Errors or extra processes are described in a free text field. If specific errors affected files within a digital ID# group, a READ_ME.txt is added to the bag describing the error and additional processing undertaken to remedy the error. If a collection required the same actions across multiple bags, rather than adding READ_ME.txt files to every bag, recently we have begun documenting that information in the finding aid under Processing History in the Administrative Information section.
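To make that record-keeping concrete, here is a rough sketch in Python. It is only an illustration, not our actual Access database: the class `BagRecord`, its field names, and the function `write_readme` are all hypothetical stand-ins. It models one record per digital ID# with yes/no fields, a free text field, and a READ_ME.txt that is written into the bag only when specific files had errors.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class BagRecord:
    """Hypothetical stand-in for one database record per digital ID#."""
    digital_id: str
    imaged: bool = False          # yes/no field for a standard action
    files_copied: bool = False    # yes/no field for a standard action
    notes: str = ""               # free text for errors or extra processes
    file_errors: list = field(default_factory=list)  # file-level problems


def write_readme(record, bag_dir):
    """Write READ_ME.txt into the bag if the record lists file errors.

    Returns the path of the file written, or None if no file was needed.
    """
    if not record.file_errors:
        return None
    bag_dir = Path(bag_dir)
    bag_dir.mkdir(parents=True, exist_ok=True)
    readme = bag_dir / "READ_ME.txt"
    lines = [f"Digital ID: {record.digital_id}", ""]
    lines += [f"- {error}" for error in record.file_errors]
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

A record with an empty `file_errors` list produces no READ_ME.txt, which mirrors the practice of adding the file only when something went wrong inside that bag.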
Your question about recovering and preserving corrupted text is the subject of Part II of this blog post series! The Part II blog post describes the multiple processes we undertake to recover content from corrupted storage media and legacy content that cannot be accessed in modern operating systems. While the tools and processes described in Part II are often successful, sometimes content cannot be recovered.
Thank you for your interest and your questions!