Kathleen O’Neill is currently serving as one of two Staff Innovators at the Library of Congress. Their 2020 project, Born Digital Access Now!, explores existing pathways for accessing born digital materials in the Manuscript Division. In this series of blog posts, Kathleen describes the complexities of gaining access to born digital materials even before they reach researchers. This is the second post in the series and focuses on legacy file formats through the metaphor of being “lost in translation.”
In the previous blog post, we discussed how media format can sometimes be an obstacle to accessing born digital collection material. Issues like severely damaged storage media or the lack of appropriate drives to read the storage media block access and can lead to total content loss. While obsolete or unidentified file formats can pose their own challenges, my experience has been that contending with an obsolete file format rarely results in a total loss of content. File viewer tools usually allow partial to full viewing of digital content. Sometimes, however, the file formats in our collections and current operating systems do not speak the same language. In these cases, the seemingly simple act of opening a file requires multiple processes and tools.
In the film Lost in Translation, Bill Murray asks Scarlett Johansson, “Can you keep a secret? I’m trying to organize a prison break. I’m looking for, like, an accomplice. We have to first get out of this bar, then the hotel, then the city, and then the country. Are you in or you out?”
Today we’re going to delve into the access challenges that come with obsolete file formats and the processes required to rescue this content. Are you in or you out?
Legacy file formats and operating systems or “Lost in Translation”
I attended my first Personal Digital Archiving Conference at the University of Maryland in 2013. We were still in the early years of building our workflow and on a steep learning curve. During one of the presentations, a speaker casually mentioned that a digital document is not static. What you see on the screen is an interpretation, an interplay between the operating system, the file and media format, and the driver reading the media. Essentially, every time you open a document, it is a “new” document. What? I hadn’t ever really thought about it. My mind tumbled down a rabbit hole as I wondered what constituted a document, and by the end of the conference I found myself seriously considering whether digital preservation might require that I learn dance notation.
That was the moment I concretely understood that access to a bitstream does not ensure access to the content therein. (For a further explanation of bit-level preservation, see this blog post.)
Being able to copy digital files does not guarantee you can open these files. Files still need to be recognized and interpreted by the operating system. Not all file formats are supported in every operating system or can be opened by modern software. Therefore, it is important for both preservation and long-term access to identify and document file formats. Even then, there are times when the content cannot be fully rendered or accessed. The Walter Sullivan papers gave me the opportunity to discover what happens when an operating system and a file format do not speak the same language, and what gets lost in translation.
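To make the idea of identifying a file format a little more concrete, here is a minimal sketch in Python of signature-based identification, which checks the first few bytes of a file against known "magic numbers." It is an illustration only and not the workflow used in the Manuscript Division: the SIGNATURES table is a tiny hand-picked subset, and the function name identify_format is hypothetical. Dedicated identification tools draw on far larger signature registries.

```python
from typing import Optional

# Illustrative subset of well-known file signatures ("magic numbers").
# This table is an assumption made for the example, not a complete registry.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy Microsoft Office)"),
    (b"\xffWPC", "WordPerfect document (version 5 and later)"),
    (b"PK\x03\x04", "ZIP container (for example DOCX or ODT)"),
    (b"{\\rtf", "Rich Text Format"),
]


def identify_format(header: bytes) -> Optional[str]:
    """Return a format name if the leading bytes match a known signature."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    # No match: the format is unidentified and needs further investigation.
    return None


def identify_file(path: str) -> Optional[str]:
    """Read only the first 16 bytes of a file and try to identify it."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(16))
```

A result of None does not mean the file is empty or unreadable. It only means that none of the listed signatures matched, which is exactly the situation where more specialized tools and research come in.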
Walter Sullivan Papers
Walter Sullivan was a writer and science editor for the New York Times. The digital portion of the collection was received on 5.25” and 3.5” floppy disks, comprising 243 digital files (12.82 MB). While the Digital File series includes draft correspondence and “Terra Mobilis”, a continental drift modeling program, the bulk of the digital files are Sullivan’s memoir drafts relating his experiences on expeditions to Antarctica and as a foreign correspondent in China, Korea, and Berlin.
By the time the Walter Sullivan collection was acquired, Amanda May, a Digital Conversion Specialist in the Preservation Reformatting Division (PRD), had developed workflows to recover content from 5.25” floppy disks. Depending on the desired outcome and the age and formatting of the media, there are multiple tools in place to recover content from 5.25” floppy disks. Amanda May describes the various workflows for 5.25” floppy disks below:
“The easiest way to get into a 5.25” is through the FC5025. The card is powered and transmits data via one USB cable, it has a GUI interface, and it works fine with the most common formats that are found on the 5.25” disks. An advantage of this method is that you can browse the disk and transfer individual files if that’s all you want, or a disk image if you want the entire contents.
“When the FC5025 doesn’t work, I try the KryoFlux. It’s a more specialized piece of equipment and can be more difficult to set up and use. The KryoFlux can create a type of disk image called a “stream” which captures the magnetic fluctuations of the original disk, which can then be used in place of the original disk for creating formatted disk images; this is an advantage because it reduces wear and tear on the original item, and allows for some manipulation to make it past damage or deterioration. For both methods, the disk drive must be safely connected to a power source, while the controller card is powered by the same USB cable that connects it to the computer. Both the FC5025 and the KryoFlux have write protection built in.”
After multiple attempts to recover the content from the 5.25” floppy disks using the FC5025 were unsuccessful, Amanda May was able to make an image stream of the media using the KryoFlux workflow. After some research, she was able to discern that the content was most likely created on a Wave Mate Bullet computer from the late 1970s or early 1980s. The Wave Mate Bullet computer used the CP/M operating system, pre-dating the IBM PC.
So how does a modern operating system interpret files created by obsolete software on an obsolete operating system? Well, it depends. Many obsolete file formats can be opened in modern operating systems. For example, it’s been my experience that files from early versions of WordPerfect can be opened in Microsoft Word. Some file formats can be opened but with data and/or format loss. Other file formats, such as those from the Walter Sullivan papers, require specialized tools to access and recover the content. When Amanda May looked at the images using FTK Imager (a disk imaging tool), FTK Imager did not recognize the content as files but as unallocated space. She was able to export some of the unallocated space from the images, but in other cases, the export failed. In these cases, Amanda used the FTK Imager text viewer window to copy the content into .txt documents. That she was able to recover any content at all was amazing but, as you can see from the samples below, some text and formatting are lost.
The opening of the document begins with several pages of å å å å, followed by what appears to be a file list.
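One possible explanation for those pages of å, offered here as an assumption and not something we confirmed for these disks: CP/M fills unused sectors and empty directory entries with the byte 0xE5, and that byte displays as “å” when read as Latin-1 text. The Python sketch below shows, under that assumption, how runs of filler could be skipped and the readable text pulled out of a raw export of unallocated space. The function names and thresholds are hypothetical, and this is not the procedure Amanda used in FTK Imager.

```python
import re
from typing import List

FILLER = 0xE5  # assumption: CP/M fill byte, shown as "å" in Latin-1


def strip_filler(data: bytes, min_run: int = 16) -> bytes:
    """Remove runs of at least min_run consecutive filler bytes."""
    pattern = re.compile(re.escape(bytes([FILLER])) + b"{%d,}" % min_run)
    return pattern.sub(b"", data)


def printable_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII (plus tab, CR, LF) of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e\r\n\t]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


def recover_text(data: bytes) -> List[str]:
    """Drop filler first, then collect whatever readable text remains."""
    return printable_runs(strip_filler(data))
```

Anything that is not plain printable text, such as formatting codes, is simply discarded by this approach, which is one way formatting ends up lost in translation.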
While you cannot see this in the image below, sentences and whole paragraphs are repeated throughout the document. For example, the 2nd paragraph below that begins with “It was getting dark, so we assumed…” is repeated earlier in the document but with one important difference: the earlier version has no lost text. This repetition of text, with slight differences in the completeness and presence of content, may indicate that the unallocated space contains multiple drafts of the same content merged together.
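Spotting this kind of repetition by eye is slow in a long document. As a rough sketch of how it could be flagged automatically, the Python below compares every pair of paragraphs using the standard library's difflib and reports pairs that are nearly, but not necessarily exactly, identical. The function name find_near_duplicates and the 0.8 threshold are assumptions chosen for illustration, and we did not use this script on the Sullivan files.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def find_near_duplicates(
    paragraphs: List[str], threshold: float = 0.8
) -> List[Tuple[int, int, float]]:
    """Return (index_a, index_b, similarity) for paragraph pairs at or above threshold."""
    pairs = []
    for (i, first), (j, second) in combinations(enumerate(paragraphs), 2):
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        ratio = SequenceMatcher(None, first, second, autojunk=False).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 2)))
    return pairs


def split_paragraphs(text: str) -> List[str]:
    """Split recovered text into paragraphs on blank lines, dropping empty pieces."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]
```

A pair with a similarity just under 1.0 is a candidate for the situation described above: the same passage appearing twice, once complete and once with text missing. Comparing every pair is fine for a single memoir draft but would be slow for very large files.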
Digital processing lesson: When you cannot preserve everything, preserve what you can.
When an operating system and file format do not speak the same language and there is no existing software to mediate the relationship, digital content is essentially “imprisoned” and you may need to stage a prison break. And despite your best efforts, both content and formatting can be “lost in translation.” I like to think that Walter Sullivan, as a former foreign correspondent, would have understood this problem.
The final blog post will describe how we used emulation to render and interact with chaos data visualization software from the Edward N. Lorenz papers. We’ll take our inspiration from the literary world using the third of Arthur C. Clarke’s Laws: Any sufficiently advanced technology is indistinguishable from magic.
I’m enjoying this series. The problem of access is compounded when the content is created or posted on a social media platform that changes constantly. Your Facebook or Twitter post from 2015 will not render the same way in Facebook or Twitter today. So even if you’re able to catalog words, images, video and other file elements, and metadata, the original context is lost. How do you decide what’s “good enough”?
Great blog post and series!
I think the point about digital objects not being static is so important yet so often overlooked in digital preservation practice. IMO it’s not enough to know what file format an object is, but often to know what is contained within (from a technical as well as intellectual perspective). Information captured during characterization should include info such as number/type of embedded objects, parametric systems used within, etc. to allow us to check for correct technical interpretation across time as sw/hw changes.
You and/or Amanda May might find the SIMH emulator useful (http://simh.trailing-edge.com/). It is a multi system emulator that can run old operating systems like UNIX v5 or CP/M. I found a CP/M emulator that runs atop SIMH here: https://schorn.ch/altair.html
SIMH runs many old systems so it might come in handy in the future.
Apologies if these tools are well known to you. Perhaps they will prove useful in your efforts. Thanks for the important preservation work you do.