Analyzing the Born-Digital Archive

Kathleen O’Neill is a 2020 Staff Innovator with LC Labs and a Senior Archives Specialist in the Manuscript Division at the Library of Congress. She has written about Born Digital Access Now!, her Staff Innovator project, in previous posts.

In this post, she discusses her analysis of the various file formats in the Manuscript Division’s born-digital holdings.

As a 2020 Staff Innovator working on the Born Digital Access Now! project, I conducted an analysis of the file formats contained in the Manuscript Division (MSS) holdings. Analyzing and documenting file formats is a necessary first step to mapping the 85 processed collections containing born-digital material to the most suitable access pathway. Additionally, this analysis will inform the development of a pilot digital access workstation with the appropriate specifications and tools.

Some of the questions I sought to answer as part of this analysis included: What file formats are in the Manuscript Division’s collections? How many? What do the file formats tell us about the content in our born-digital collections?

These questions may seem straightforward, but the answers sometimes proved to be illusory or even contradictory. In the song “Brilliant Disguise,” Bruce Springsteen warns, “So when you look at me / You better look hard and look twice / Is that me, baby / Or just a brilliant disguise?”

As you’ll see in this post, assumptions about file formats can create “brilliant disguises” that illuminate certain aspects of a collection while obscuring others.

File formats and their “Brilliant Disguise(s)”

The file format analysis focused on the processed born-digital collections inventoried by the Content Transfer Services (CTS), a homegrown system built to transfer files across LC servers. CTS offers file format analysis based on file extensions. File format identification by file extension has limitations that will be discussed later, but the primary advantage to using this report is that it allowed me to create a comprehensive list of all the file extensions at both the collection and MSS born-digital total holdings (“MSS total holdings”) level.

I’m excited to share with you one of the major findings of the report. Astonishingly, we discovered that despite the small file and byte count of MSS total holdings relative to LC CTS total holdings, MSS collections contain more than 4,800 unique file extensions (18.5% of all distinct file extensions in CTS). This figure dramatically illustrates that the Manuscript Division faces significant processing, preservation, and access challenges related to the range of file formats in our collections.

A black, white, grey, and orange word cloud in the shape of a 3.5” floppy disk.

Figure 1: Manuscript Division Unique File Extensions. For legibility, dates and numbers were removed from the unique file extension list.

Does this mean there are more than 4,800 file formats in MSS total holdings? No. It’s more likely that MSS holds only several hundred distinct file formats. A file extension is not a reliable indicator of file format. In modern file systems, a file extension indicates the way data in a file is structured. Depending upon a file’s age, the file system in which the file was created, and the software that created the file, a file extension might also imply how data in a file is stored (e.g., .doc suggests a Microsoft Word document) or what software created it (e.g., MyEssay.wpd suggests WordPerfect)[1]. Further, early file systems and operating systems allowed users to create their own file extensions (e.g., MyEssay.finaldraft) or to use no file extension at all. So MyEssay.finaldraft may have been created in Microsoft Word, or even WordStar; we have no way to confirm a file format without checking for a file signature[2].
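To make the difference between an extension and a signature concrete, here is a minimal Python sketch that looks only at the first bytes of a file. The three signatures in the table (RTF, PDF, and the OLE2 container used by legacy Microsoft Word .doc files) are well documented, but the table and the function name are illustrative assumptions rather than the method used for this analysis. Dedicated tools rely on far richer signature patterns than a simple prefix match.

```python
# Illustrative table only: a few well-known leading byte sequences.
# Real identification tools use many more signatures, with offsets and patterns.
KNOWN_SIGNATURES = [
    (b"{\\rtf", "Rich Text Format"),
    (b"%PDF", "PDF"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g., legacy Word .doc)"),
]


def identify_by_signature(data):
    """Return a format label if the bytes start with a known signature, else None."""
    for magic, label in KNOWN_SIGNATURES:
        if data.startswith(magic):
            return label
    return None


# A file named MyEssay.finaldraft whose content begins with the RTF signature
# is still recognized as RTF, whatever its extension says.
print(identify_by_signature(b"{\\rtf1\\ansi Hello"))
# A file with no recognizable leading bytes stays unidentified.
print(identify_by_signature(b"Dear constituent,"))
```

In practice the bytes would come from reading the start of each file in binary mode, for example `open(path, "rb").read(16)`.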

The chart below displays the top 10 file extensions by file count for MSS total holdings. They are, for the most part, file extensions indicating standard, ubiquitous formats. Given that the top five MSS file extensions (.txt, .tif, .doc, .jpg, and .pdf) account for 98.5% of total MSS holdings, it would appear that these files should be accessible and renderable using currently available software and tools. However, the devil is in the details.

Figure 2: Manuscript Division Top 10 File Extensions by File Count

Rank Extensions File Count
1 txt 1,768,626
2 tif 118,769
3 doc 116,335
4 jpg 67,975
5 pdf 31,198
6 [no extension] 27,170
7 wpd 16,126
8 gif 12,627
9 vfo 10,061
10 htm 9,980
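A tally like the one in Figure 2 can be approximated from any list of file names. The sketch below is a hypothetical Python illustration and not the CTS report itself: the function name, the sample file names, and the decision to fold extensions to lower case are all assumptions.

```python
from collections import Counter
from pathlib import PurePath


def count_extensions(filenames):
    """Tally file names by lower-cased extension; files without one are grouped together."""
    counts = Counter()
    for name in filenames:
        ext = PurePath(name).suffix.lower().lstrip(".")
        counts[ext or "[no extension]"] += 1
    return counts


# Hypothetical sample inventory, for illustration only.
sample = ["MyEssay.wpd", "notes.TXT", "letter.txt", "README", "scan.tif"]
for ext, file_count in count_extensions(sample).most_common(10):
    print(ext, file_count)
```

Because the tally sees only names, it inherits every limitation of extension-based identification described above.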

A file’s format does not necessarily provide an accurate understanding of its content. A full 1.5 million of the more than 1.7 million files with a .txt extension are from the Joseph I. Lieberman Papers and consist of exported email files from the Senate Constituent Services System (CSS). The Lieberman papers are on deposit and have not yet been processed. If and when we process the collection, reconstructing the CSS email relationships will be a monumental task. Emails disguised as text documents in file extension reports clearly illustrate a core finding for this project and for born-digital preservation generally: just because a file appears to be in a ubiquitous, well-documented format, it does not necessarily follow that processing or providing access will be easy or straightforward.

Art Buchwald Papers

Even with file format identification tools that look at the file signature, the findings can be surprising and contradictory. For example, the Manuscript Division holds the papers of Art Buchwald, humorist and political commentator. The collection’s born-digital content includes speeches and writings spanning the years 1987 to 1999 and files from Buchwald’s iBook laptop spanning the years 2002 to 2006. The top 10 file formats are as follows:

A donut-shaped chart showing the top 10 file formats by percentage.

Figure 3: Art Buchwald Papers: Top Ten File Formats by File Extension

After looking at the file extensions in the Buchwald collection, we can now cross-examine those results with file format identification tools that use file signatures to identify file formats. We used Siegfried, a file format identification tool that recognizes file formats based on their file signature, and Brunnhilde, a tool that runs Siegfried against a disk image, queries the PRONOM database[3], and creates a report.

The first surprise in the report was that there were more “Unknown” file formats after the report than there were before. How can that be? Well, it turns out some files with .doc extensions that I had categorized as Text files assuming they were Microsoft Word docs, were not. Either Siegfried could not match the file signature to any file format in the PRONOM database or some files had no file signature at all. (Ok, everybody sing! “Is that you, baby? Or just a brilliant disguise?”).

The second surprise came from contradictory identification of the files. Many files had either a .pwp extension or no file extension at all. An initial internet search for the .pwp file extension revealed more than four file formats associated with this extension, including PhotoWorks Image File, Smith Corona Word Processor, and Professional WritePlus. The hex viewer in QuickView Plus (a file viewer) identified the files as Professional WritePlus. Professional WritePlus was a word processing program from Software Publishing Corporation, popular from the early 1980s to the mid-1990s.[4] The file signature for these Professional WritePlus files was 5B 76 75 65 62 5D. Using this signature, Siegfried searched PRONOM for a match and found that this file signature is associated with AMI Professional Document. The PRONOM entry for Professional WritePlus is an outline record that does not include a file signature. Another mystery is that AMI Professional Document files are associated with the .SAM file extension, which does not appear anywhere in the Art Buchwald Papers.
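Once a signature has been read off in a hex viewer, it can be used to find every file that shares it, whatever its extension. The following Python sketch assumes the files have been copied to a local folder; the function names and the folder path in the usage note are hypothetical, and this is not how Siegfried matches signatures internally.

```python
from pathlib import Path


def has_signature(path, hex_signature):
    """Return True if the file at `path` begins with the bytes given as a hex string."""
    signature = bytes.fromhex(hex_signature)  # spaces between byte pairs are ignored
    with open(path, "rb") as handle:
        return handle.read(len(signature)) == signature


def files_with_signature(root, hex_signature):
    """Return the sorted paths of all files under `root` that begin with the signature."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and has_signature(path, hex_signature)
    )
```

For example, `files_with_signature("buchwald_files", "5B 76 75 65 62 5D")` would list the .pwp files and the extensionless files together if they share the same leading bytes. Here `buchwald_files` is a placeholder folder name.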

File format identification tools will improve as the documentation of obsolete formats improves. One of the outcomes of the Staff Innovator file format analysis is a recommendation to develop a regular process for sharing the file format identification information we encounter with PRONOM and Wikidata’s file format WikiProject. Until then, we keep singing, “So when you look at me / You better look hard and look twice / Is that me, baby / Or just a brilliant disguise?”

 

[1] https://en.wikipedia.org/wiki/Filename_extension accessed 09/22/2020.

[2] A file signature is a unique string of numerical or textual values embedded into the file as metadata. This value is the same for all instances of a specific file type. So, for example, all RTF (Rich Text Format) files include the magic number string {\rtf at the same location in every file. This consistency allows for automation of file format identification.

[3] The UK National Archives’ PRONOM is “an online technical registry providing impartial and definitive information about file formats, software products and other technical components of.” http://www.nationalarchives.gov.uk/PRONOM/Default.aspx, accessed 09/24/2020.

[4] https://winworldpc.com/product/professional-write/plus-1x

 

Note: this post has been edited for clarity.

4 Comments

  1. NIELS E NIELSEN
    October 22, 2020 at 1:04 pm

    Have you analyzed the distribution over time? It would be interesting to focus on the file formats from ~1975-1990 when there were many different word and text processing applications and a rapidly-growing corpus of born-digital content. Also, how will you handle content that is born in the cloud (e.g., Google docs) and adopts a file extension disguise only when saved?

  2. David
    October 23, 2020 at 6:53 am

    Not sure I know what MSS is… Manuscript Services System?

    • Eileen Jakeway
      October 26, 2020 at 9:19 am

      David,

      Thanks for your comment! MSS stands for Manuscript Division. We updated the first sentence for clarity.

  3. Beverly Wright Coleman
    November 15, 2020 at 12:38 pm

    Reading this article was more fun than watching 10 good detective movies! I wanted to share a personal story I know you will appreciate.
    I made my living for many years beginning in the 1980s developing custom databases for a number of nonprofits. For example, an association of energy managers in California would hire me to keep track of its members and even publish an annual membership directory. My business partner wrote a database program for me, and I managed the data. When my contract ended with that client, the Executive Director wanted the data. Legally he was entitled to the data, but without the program written in Pascal and the interface designed for a Sage computer running a UCSD operating system, how could he USE the data?
    Fortunately, the state of computing had evolved to comma separated files and ASCII files that could be read over multiple platforms, but those formats were of little consolation to that Director.
    On a related note, I read just yesterday (November 14, 2020) that Adobe is still defending its .PDF file format, though many young programmers at the company want to drop efforts to keep it alive. I noticed that .PDF turns up in your file extensions in great enough quantity that seeing that program relegated to the trash heaps of history would be a loss.
    Thank you for all you do!
