An Archivist’s Perspective on Legacy Files

This is a guest post by Chad Conrady, Archives Specialist in the Manuscript Division. Alongside Senior Archives Specialist Kathleen O’Neill, Chad is a 2020 Staff Innovator working on the project Born Digital Access Now!

In previous blog posts, Chad and Kathleen have written about their project, whose central aim is to research and understand the various file formats in the Library’s Manuscript Division collections containing born-digital material. The challenges and obstacles of this process have been documented in a series of three blog posts by O’Neill. In her final post as a Staff Innovator, she detailed the outcomes of the file format analysis.

In this post, Chad discusses his area of expertise, emulation, which imitates older operating systems in order to open outdated or legacy files that are no longer operable with contemporary operating systems or software.

Mac & Me

In previous born-digital blog posts, Kathleen O’Neill discussed working to preserve and access obsolete media and file formats from DOS-based or older computer systems in the Edward N. Lorenz Papers, the Walter Sullivan Papers, and the Seth MacFarlane Collection of the Carl Sagan and Ann Druyan Archive. As the other Staff Innovator on the Born Digital Access Now! project, I will talk about obsolete Apple operating systems and the steps I took to preserve and access files created by obsolete Apple programs in the Nina V. Fedoroff Papers.

One of the central parts of my job as a processing archivist is to assess, or “appraise,” born-digital content that arrives in newly acquired collections. When I process born-digital collections, the part that I enjoy the most is trying to determine the file format of an unknown file and finding the right tool for accessing files of this type. I find science collections to be the most rewarding in terms of file complexity and variety. I approach these collections as a puzzle: I use what I already know about the technology available to the creator to determine what types of files were created and the best way to access them. The approach I took to resolve some of the complexities while processing the Nina V. Fedoroff Papers and the papers of Elizabeth Blackburn, the collection that I am currently working to process with Connie Cartledge, required a good deal of trial and error, researching how an obsolete operating system and its programs worked, and learning how to use complex programs on the fly.

Processing the Nina V. Fedoroff papers

When I requested to process the Nina V. Fedoroff papers, I was warned that most, if not all, of the media and files were written by Macintosh computers and that there were some issues using the Library of Congress’s Bagger program to “bag” or extract files from media formatted for a Mac.[1] Kathleen O’Neill was excited about the possibility of accessing files in the collection created by the Hypercard program and working on the puzzle of accessing Mac-created files. When I initially started processing the digital portion of the Fedoroff papers, I used a 2016 Apple laptop to bag the Macintosh-formatted media. This initially worked well, but when I moved to a Windows computer to confirm the checksums using Bagger, I received a bag manifest error which said that some of the files were not on the media. I re-checked the files on the Apple laptop in Bagger to confirm that the program had copied all the files on the drive, and there the program did not output any bag manifest errors.
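For readers unfamiliar with what a manifest check involves, here is a minimal Python sketch of the general idea. It is not Bagger’s actual code. It assumes a BagIt-style bag whose manifest file (here `manifest-sha256.txt`) lists one checksum and one relative file path per line. The function names `file_digest` and `check_manifest` are made up for this illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def check_manifest(bag_dir: Path, algorithm: str = "sha256") -> dict:
    """Compare each manifest entry against the files actually present.

    Returns the paths that are listed but missing, and the paths whose
    current checksum differs from the recorded one.
    """
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    problems = {"missing": [], "mismatch": []}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file():
            problems["missing"].append(rel_path)
        elif file_digest(target, algorithm) != expected.lower():
            problems["mismatch"].append(rel_path)
    return problems
```

A file that was listed on one computer but cannot be seen on another would land in the "missing" list, which is the kind of error described above.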

These disappearing and reappearing files were the result of older Macintosh operating systems using resource files, which are used exclusively by Macintosh file systems (MFS, HFS, HFS Plus), making it nearly impossible to open Mac-created files on a non-Mac operating system. On a classic Macintosh computer, a resource file exists for the system, each application or program, and each document. The system resource file contains standard resources shared by all applications, and is initiated when the system is started up. A resource file for an application is initiated when the application starts up. When you open a document, the computer first checks the application’s resource file. If that search comes up empty, the computer will check the system resource file in order to determine how to open the file and its contents. These resource files are made up of two parts, a resource fork and a data fork. The resource fork for an application contains the code for the application and the resources it uses, while the data fork can contain anything that the programmer wants to store there, but it can also be empty.[2] In a resource file for a document, the resource fork can contain document preferences and the last location of the window, and the data fork will contain the contents of a document. Without both forks, the file is inaccessible.[3]
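To make the two-fork idea concrete, here is a toy Python model. It is only an illustration and does not reflect how HFS stores forks on disk. The class name `ClassicMacFile` and the helper functions are hypothetical. The “forkless copy” simply assumes a destination that keeps the data fork and nothing else.

```python
from dataclasses import dataclass


@dataclass
class ClassicMacFile:
    """Toy model: one named file carrying two independent byte streams."""
    name: str
    data_fork: bytes = b""      # document contents (may be empty for an application)
    resource_fork: bytes = b""  # application code, or document preferences

    @property
    def total_size(self) -> int:
        return len(self.data_fork) + len(self.resource_fork)


def copy_without_forks(source: ClassicMacFile) -> ClassicMacFile:
    """Simulate a copy to a destination that only understands a single stream.

    Assumption for this sketch: only the data fork survives the copy.
    """
    return ClassicMacFile(name=source.name, data_fork=source.data_fork)


def lost_bytes(original: ClassicMacFile, copy: ClassicMacFile) -> int:
    """Number of bytes present in the original but absent from the copy."""
    return original.total_size - copy.total_size
```

In this model, a file whose contents live entirely in its resource fork becomes an empty file after the forkless copy, which gives a rough sense of how files can seem to vanish when they leave a Macintosh file system.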

The mysterious “.dra”

In order to preserve the resource forks of the documents created by Nina Fedoroff, I needed to create disk images, which are bit-for-bit exact copies of a piece of digital media. For each piece of media with a Macintosh file system, I worked with Amanda May, a Digital Conversion Specialist in the Preservation Services Division (PRD), to create the necessary images. By extracting the files from the disk images, I could easily revisit the files once I had figured out the best way to properly access them. In the meantime, I used a tool called QuickView Plus to review and appraise the extracted files I could see and worked to research viable ways to access the rest. Most of the files that were viewable in QuickView Plus were created after 2000, when OS 10 came into use. OS 10 did not use resource forks in the same way as older Apple operating systems, so files created using this operating system or newer tend to be more easily viewable with QuickView Plus.[4]

One of the key tools I used to identify the remaining inaccessible files was a hex editor, which allowed me to inspect the fundamental structure of a file and extract possible file signatures (Figure 1). Using the hex editor, I found that many of the unreadable files contained the file signature “DRWGD2” and the file extension “.dra,” along with the text “vector” (Figure 2). From these clues, I decided to start my search with programs that could create vector drawings.

Screen capture of a hex editor tool with numbers running across the top in columns and down the side in rows. Numbers are organized in clusters of two. On the far right-hand side, is a column titled Decoded text and directly beneath it are the letters DRWGD2 highlighted in yellow.

Figure 1: Hex editor showing file signature for .dra.

 

Screen capture of a hex editor tool with numbers running across the top in columns and down the side in rows. Numbers are organized in clusters of two. On the far right-hand side, is a column. After multiple blank rows, the word Vector is discernable.

Figure 2: .dra file showing the “vector” term in a hex-editor
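The kind of inspection shown in Figures 1 and 2 can be approximated in a few lines of Python. This is a rough sketch, not the hex editor I used. The only signature in the lookup table is the “DRWGD2” string described above, and because I am not assuming a fixed offset for it, the sketch simply searches the first 512 bytes. The function names and the 512-byte window are choices made for this illustration.

```python
from pathlib import Path
from typing import Optional

# Signature table for this sketch; the label reflects the identification
# described in this post, not an authoritative format registry.
KNOWN_SIGNATURES = {b"DRWGD2": "MacDraw drawing"}


def read_head(path: Path, size: int = 512) -> bytes:
    """Read only the first `size` bytes of a file."""
    with path.open("rb") as handle:
        return handle.read(size)


def hex_dump(data: bytes, width: int = 16) -> str:
    """Render bytes as offset, hex pairs, and decoded text, like a hex editor."""
    pad = width * 3 - 1  # width of the hex column when a row is full
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02X}" for byte in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08X}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(rows)


def identify(data: bytes, scan_bytes: int = 512) -> Optional[str]:
    """Return a label if a known signature appears near the start of the data."""
    head = data[:scan_bytes]
    for signature, label in KNOWN_SIGNATURES.items():
        if signature in head:
            return label
    return None
```

Calling `identify(read_head(some_path))` on each unreadable file would sort a batch into “probably MacDraw” and “still unknown,” and `hex_dump` gives a view for reading the remaining files by eye.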

I then had to figure out what potential programs could create vector image files dating from 1987 to the 1990s. There were only a few programs capable of creating a vector image file of this vintage, narrowing my options to Adobe Illustrator, AppleWorks, or Claris MacDraw. Through the process of elimination, it seemed most likely that it was a MacDraw file, since that program was used on early Macintosh computers. I eventually came across an archived forum post about MacDraw II from 1993, which confirmed that the header “DRWGD2” did indeed indicate a MacDraw file.[5] Following this process of using the hex editor, I was also able to identify another large batch of files as Hypercard cards. This collection of files turned out to be a specialized version of the Hypercard program, which is a software development program.

Accessing the Identified Files through Emulation

Now that I knew which programs were used to create these files, I had to figure out how to gain access and view them. I realized that due to the ways in which older programs utilized resource forks, no modern program would be able to open these files correctly. In an effort to provide access to data I hoped would supplement processed materials in the analog collection, I chose to work on files that could be opened with the Hypercard program first.

After much trial and error, I was able to use an old Apple computer that ran an OS 9 operating system and a copy of Hypercard in order to open the Hypermaize files. This confirmed that with the necessary software and hardware it was possible to access these files, but in order to provide more practical access for archivists and researchers I would need to look into emulation.

Emulation uses a computer program to imitate an operating system and programs on a host computer. I began by examining the emulation of older Macintosh operating systems using programs such as SheepShaver or Basilisk II.[6] SheepShaver, unlike other emulator programs, requires a ROM file along with an operating system to install and run an Apple system. A ROM file is a small amount of memory programmed when the computer was manufactured that works as an intermediary between the programs on the computer and the installed hardware.[7] Having this ROM file present essentially tricks the software into believing the emulated system is a real Mac computer. Each type of Mac computer has its own ROM file, so if I wanted my OS 7 emulated environment to look, feel, and have the technical specifications of a Macintosh Quadra 650, I would need to extract the ROM file from such a computer. After a little trial and error, I mounted the Hypercard disk image in the SheepShaver settings so I could install the program once the OS 9 operating system was running. The Hypercard program worked just like it did on the Apple laptop: the first Hypercard file was opened through the program, and subsequent files were opened by clicking on the file (see Video 1).

Video 1: An emulated Apple OS 9 environment opening the now obsolete Hypercard program, a type of software specifically designed to work with cards mapping the genomes of maize.

Once the emulated OS 9 environment was set up, I worked to access the other group of files, the “.dra” MacDraw files. The MacDraw files worked similarly to the Hypercard files, with the important distinction that the date of these files’ creation varied widely. Many modern software programs are backwards compatible, which means they’re compatible with older equipment or previous versions of software. This is not always the case for older programs and files. The MacDraw image files contained in the Nina Fedoroff papers spanned a range of dates from around 1987 to about 1995. Some of the older MacDraw files refused to open with the MacDraw Pro program I had installed. At the time, I wasn’t sure why these older MacDraw files refused to open when much more recent files opened without issue in MacDraw Pro (see Video 2). I later learned that MacDraw Pro was created by a different company than the one which created MacDraw II, and while the files had the same file extension, opening all the “.dra” files in the collection required both MacDraw II and MacDraw Pro (see Video 3).

Video 2: A computer program running an emulation of MacDraw Pro software.

Video 3: A computer program running an emulation of MacDraw II software. 

Data Loss

After using SheepShaver to emulate an Apple OS 9 system, I also wanted to view some of the “.dra” files to see what data loss had occurred in my previous attempt and whether emulation was the best route to access older Apple file formats. Figure 4 shows the same file that I opened in Figure 3. As you can see, Figure 4 represents the complete set of data, while much of this original information is lost in Figure 3. Such side-by-side comparison demonstrates the importance of maintaining Apple files in their appropriate file formats, and the usefulness of emulation in providing access to these types of files with the potential for less data loss than would occur if they were removed from the media’s file system format.

Conclusion

The Nina Fedoroff papers were the first collection for which the Manuscript Division worked to resolve barriers to accessing obsolete Apple file formats by using emulation. This project laid the groundwork for access and set processing standards for other collections such as the Art Buchwald Papers and the still-in-process Elizabeth Blackburn Papers. Kathleen O’Neill and I hope to use emulation tools and other digital forensic tools to expand the possibilities for using the Manuscript Division’s born-digital collections. The Library of Congress joins many other national memory institutions that are currently exploring the affordances of emulation for public access. We hope the outcomes of our Staff Innovator project will make useful contributions to moving this conversation along.

To get in touch with Kathleen or Chad about their project, born-digital access, file formats, and/or emulation, please email [email protected] .  

 

[1] For more about the Library’s Bagger program and BagIt File Packaging Format, check out these blog posts describing the program and its features, celebrating its tenth anniversary of being adopted at the Library, and discussing how it can be used for file fixity checks.

[2] Apple Computer, Inc., Inside Macintosh, Vol. 1 (New York: Addison-Wesley Publishing Company, Inc.), I-105.

[3] Apple Computer, Inc., “The Data Fork and the Resource Fork,” http://mirror.informatimago.com/next/developer.apple.com/documentation/mac/MoreToolbox/MoreToolbox-11.html#MARKER-9-91 (accessed: October 13, 2020).

[4] Indiana University, “What is a forked file structure?,” https://kb.iu.edu/d/aarp (accessed: October 13, 2020).

[5] Eric S. Boltz, email to Newsgroups: comp.sys.next.misc mailing list, July 15, 1993, https://ftp.nice.ch/peanuts/GeneralData/Usenet/news/1993/_Misc93-II.html (accessed: October 1, 2020).

[6] “SheepShaver,” https://sheepshaver.cebix.net/ (accessed: October 1, 2020); “Basilisk II,” https://basilisk.cebix.net/ (accessed: October 1, 2020). SheepShaver is used to emulate Mac OS 7.5.2 to OS 9.0.4 on a modern computer system, while Basilisk II emulates Mac OS 7.x to OS 8.1.

[7] Dog Cow [pseud.], “Exploring the Macintosh ROM,” https://macgui.com/news/article.php?t=493 (accessed: October 19, 2020).
