Analyzing the Born-Digital Archive

Kathleen O’Neill is a 2020 Staff Innovator with LC Labs and a Senior Archives Specialist in the Manuscript Division at the Library of Congress. She has written about Born Digital Access Now!, her Staff Innovator project, in previous posts.

In this post, she discusses her analysis of the various file formats in the Manuscript Division’s born-digital holdings.

As a 2020 Staff Innovator working on the Born Digital Access Now! project, I conducted an analysis of the file formats contained in the Manuscript Division holdings. Analyzing and documenting file formats is a necessary first step to mapping the 85 processed collections containing born-digital material to the most suitable access pathway. Additionally, this analysis will inform the development of a pilot digital access workstation with the appropriate specifications and tools.

Some of the questions I sought to answer as part of this analysis included: What file formats are in the Manuscript Division’s collections? How many? What do the file formats tell us about the content in our born-digital collections?

These questions may seem straightforward, but the answers sometimes proved to be illusory or even contradictory. In the song “Brilliant Disguise,” Bruce Springsteen warns: “So when you look at me / You better look hard and look twice / Is that me, baby / Or just a brilliant disguise?”

As you’ll see in this post, assumptions about file formats can create “brilliant disguises” that illuminate certain aspects of a collection while obscuring others.

File formats and their “Brilliant Disguise(s)”

The file format analysis focused on the processed born-digital collections inventoried by the Content Transfer Services (CTS), a homegrown system built to transfer files across LC servers. CTS offers file format analysis based on file extensions. File format identification by file extension has limitations that will be discussed later, but the primary advantage to using this report is that it allowed me to create a comprehensive list of all the file extensions at both the collection and MSS born-digital total holdings (“MSS total holdings”) level.

I’m excited to share with you one of the major findings of the report. Astonishingly, we discovered that despite the small file and byte count of MSS total holdings relative to LC CTS total holdings, MSS collections contain more than 4,800 unique file extensions (18.5% of all distinct file extensions in CTS). This figure dramatically illustrates that the Manuscript Division faces significant processing, preservation, and access challenges related to the range of file formats in our collections.
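An extension report of this kind can be approximated outside CTS. The following is a minimal Python sketch, not the CTS implementation: it assumes the inventory is available as a plain list of file names, and the function name `count_extensions` and the sample names are hypothetical.

```python
from collections import Counter
from pathlib import PurePath


def count_extensions(filenames):
    """Tally file extensions, lowercased, with '' meaning no extension."""
    counts = Counter()
    for name in filenames:
        # PurePath.suffix returns the final extension including the dot, or ''.
        ext = PurePath(name).suffix.lower().lstrip(".")
        counts[ext] += 1
    return counts


# Hypothetical sample inventory, for illustration only.
sample = ["MyEssay.wpd", "letter.DOC", "notes.txt", "README",
          "MyEssay.finaldraft", "memo.doc"]
counts = count_extensions(sample)
print(len(counts), "unique extensions")
print(counts.most_common(3))
```

Lowercasing folds .DOC and .doc into one entry. Whether a real report should do that is a design choice, since case can carry information about the originating system.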

A black, white, grey, and orange word cloud in the shape of a 3.5” floppy disk.

Figure 1: Manuscript Division Unique File Extensions. For legibility, dates and numbers were removed from the unique file extension list.

Does this mean there are more than 4,800 file formats in MSS total holdings? No. It’s more likely that MSS holds only several hundred distinct file formats. A file extension is not a reliable indicator of file format. In modern file systems, a file extension indicates the way data in a file is structured. Depending upon a file’s age, the file system in which the file was created, and the software that created the file, a file extension also might imply how data in a file is stored (e.g., .doc suggests a Microsoft Word document) or what software created it (e.g., MyEssay.wpd suggests WordPerfect)[1]. Further, early file systems and operating systems allowed users to create their own file extensions (e.g., MyEssay.finaldraft) or to have no file extensions at all. So MyEssay.finaldraft may have been created in Microsoft Word, or even WordStar; we have no way to confirm a file format without checking for a file signature[2].
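To illustrate what checking for a file signature means, here is a minimal Python sketch that uses the RTF signature {\rtf described in footnote [2]. The function name `looks_like_rtf` and the sample file contents are hypothetical. Real identification tools compare files against many signatures, not one.

```python
RTF_MAGIC = b"{\\rtf"  # the five bytes {\rtf that open an RTF file


def looks_like_rtf(data: bytes) -> bool:
    """Return True when the content begins with the RTF signature."""
    return data.startswith(RTF_MAGIC)


# Hypothetical file contents: the name says little, the first bytes say more.
files = {
    "MyEssay.finaldraft": b"{\\rtf1\\ansi Hello}",
    "MyEssay.doc": b"Plain text saved with a .doc extension",
}
for name, data in files.items():
    print(name, "-> RTF" if looks_like_rtf(data) else "-> not RTF")
```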

The chart below displays the top 10 file extensions by file count for MSS total holdings. They are, for the most part, file extensions indicating standard, ubiquitous formats. Given that the top five MSS file extensions (.txt, .tif, .doc, .jpg, and .pdf) account for 98.5% of total MSS holdings, it would appear that these files should be accessible and renderable using currently available software and tools. However, the devil is in the details.

Figure 2: Manuscript Division Top 10 File Extensions by File Count

Rank  Extension       File Count
1     txt              1,768,626
2     tif                118,769
3     doc                116,335
4     jpg                 67,975
5     pdf                 31,198
6     [no extension]      27,170
7     wpd                 16,126
8     gif                 12,627
9     vfo                 10,061
10    htm                  9,980

A file’s format does not necessarily provide an accurate understanding of its content. A full 1.5 million of the more than 1.7 million files with a .txt extension are from the Joseph I. Lieberman Papers and consist of exported email files from the Senate Constituent Services System (CSS). The Lieberman Papers are on deposit and have not yet been processed. If and when we process the collection, reconstructing the CSS email relationships will be a monumental task. Emails disguised as text documents in file extension reports clearly illustrate a core finding for this project and for born-digital preservation generally: just because a file appears to be in a ubiquitous, well-documented format, it does not necessarily follow that processing or providing access will be easy or straightforward.

Art Buchwald Papers

Even with file format identification tools that look at the file signature, the findings can be surprising and contradictory. For example, the Manuscript Division holds the papers of Art Buchwald, humorist and political commentator. The collection’s born-digital content includes speeches and writings spanning the years 1987 to 1999 and files from Buchwald’s iBook laptop spanning the years 2002 to 2006. The top 10 file formats are as follows:

A donut-shaped chart showing the top 10 file formats by percentage.

Figure 3: Art Buchwald Papers: Top Ten File Formats by File Extension

After looking at the file extensions in the Buchwald collection, we can cross-examine those results with file format identification tools that use file signatures. We used Siegfried, a file format identification tool that recognizes file formats based on their file signatures, and Brunnhilde, a tool that runs Siegfried against a disk image, queries the PRONOM database[3], and creates a report.

The first surprise in the report was that there were more “Unknown” file formats after the report than there were before. How can that be? Well, it turns out that some files with .doc extensions, which I had categorized as Text files on the assumption that they were Microsoft Word documents, were not. Either Siegfried could not match the file signature to any file format in the PRONOM database or some files had no file signature at all. (Ok, everybody sing! “Is that you, baby? Or just a brilliant disguise?”).
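To make the before-and-after comparison concrete, here is a minimal Python sketch. It does not parse real Siegfried or Brunnhilde output: the record layout, the `EXTENSION_GUESS` mapping, the `compare` function, and the sample records are all hypothetical, and `None` stands in for a file that a signature-based tool could not identify.

```python
from pathlib import PurePath

# Assumed mapping from extension to a broad category, as an analyst might guess.
EXTENSION_GUESS = {"doc": "Text", "txt": "Text", "jpg": "Image"}


def compare(records):
    """Return (unknown_by_extension, unknown_by_signature) counts.

    Each record is (filename, signature_format), where signature_format is
    None when a signature-based tool found no match.
    """
    by_ext = sum(
        1 for name, _ in records
        if PurePath(name).suffix.lower().lstrip(".") not in EXTENSION_GUESS
    )
    by_sig = sum(1 for _, fmt in records if fmt is None)
    return by_ext, by_sig


# Hypothetical results, not data from the Buchwald collection.
records = [
    ("speech1.doc", "Microsoft Word Document"),
    ("speech2.doc", None),  # .doc extension, but no signature match
    ("column.doc", None),
    ("photo.jpg", "JPEG"),
    ("draft", None),        # no extension and no signature match
]
unknown_ext, unknown_sig = compare(records)
print(unknown_ext, unknown_sig)
```

In this made-up sample, guessing by extension leaves one file unknown while the signature check leaves three, the same direction of change described above.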

The second surprise came from contradictory identification of the files. Many files had either a .pwp extension or no file extension at all. An initial internet search for the .pwp file extension revealed more than four file formats associated with this extension, including PhotoWorks Image File, Smith Corona Word Processor, and Professional WritePlus. The hex viewer in QuickView Plus (a file viewer) identified the files as Professional WritePlus. Professional WritePlus was a word processing program from Software Publishing Corporation, popular from the early 1980s to the mid-1990s.[4] The file signature for these Professional WritePlus files was 5B 76 75 65 62 5D. Using this signature, Siegfried searched PRONOM for a match and found that this file signature is associated with AMI Professional Document. The PRONOM entry for Professional WritePlus is an outline record that does not include a file signature. Another mystery is that AMI Professional Document files are associated with the .SAM file extension, which does not appear anywhere in the Art Buchwald Papers.
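A signature written out in hex, like the one above, can be turned into bytes and compared with the start of a file. The following is a minimal Python sketch, assuming the signature sits at the very beginning of the file, which the post does not state. The function name `starts_with_signature` and the sample content are hypothetical.

```python
# Signature bytes exactly as quoted in the post.
SIGNATURE_HEX = "5B 76 75 65 62 5D"
signature = bytes.fromhex(SIGNATURE_HEX)  # fromhex skips the spaces


def starts_with_signature(data: bytes, sig: bytes = signature) -> bool:
    """Return True when the content begins with the given signature bytes."""
    return data[:len(sig)] == sig


# Hypothetical file content, for illustration only.
sample = signature + b"\r\nrest of the document"
print(len(signature))
print(starts_with_signature(sample))
print(starts_with_signature(b"{\\rtf1"))
```

A match of this kind only says which registry entry shares the byte pattern. As the AMI Professional result shows, it does not settle which program actually wrote the file.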

File format identification tools will improve as the documentation of obsolete formats improves. One of the outcomes of the Staff Innovator file format analysis is a recommendation to develop a regular process for sharing the file format identification information we encounter with PRONOM and Wikidata’s file format WikiProject. Until then, we keep singing, “So when you look at me / You better look hard and look twice / Is that me, baby / Or just a brilliant disguise?”

 

[1] https://en.wikipedia.org/wiki/Filename_extension accessed 09/22/2020.

[2] A file signature is a unique string of numerical or textual values embedded into the file as metadata. This value is the same for all instances of a specific file type. So, for example, all RTF (Rich Text Format) files include the magic number string {\rtf at the same location in every file. This consistency allows for automation of file format identification.

[3] The UK National Archives’ PRONOM is “an online technical registry providing impartial and definitive information about file formats, software products and other technical components of.” http://www.nationalarchives.gov.uk/PRONOM/Default.aspx, accessed 09/24/2020.

[4] https://winworldpc.com/product/professional-write/plus-1x

 

Newspaper Navigator Search Application Now Live!

On September 15, 2020, the Library of Congress announced the release of Newspaper Navigator, an experimental web application which makes 1.5 million photographs from the dataset from Chronicling America available to the public to explore for the first time. Read more about the design and features of the project below or jump straight to the newly launched application at //news-navigator.labs.loc.gov/search !

Computing Cultural Heritage in the Cloud Quarterly Update

This is a guest post from LC Labs Senior Innovation Specialist Laurie Allen. This is the second post in a series where we are sharing experiences from the Andrew W. Mellon-funded Computing Cultural Heritage in the Cloud. The series began with an introductory post.  Learn about the grant on the experiments page, and see the […]

Connections in Sound and at the Library of Congress: Reaching out to experts to connect Irish traditional music through Linked Data

Patrick Egan is a scholar and musician from Ireland, who served as a Kluge Fellow in Digital Studies at the Kluge Center. He has recently earned his PhD in digital humanities with ethnomusicology at University College Cork. Patrick’s interests over the past number of years have focused on ways to creatively use descriptive data in […]

Sprinting toward a Lab: defining, connecting and writing a book in five days

A lab is where experimental and research-focused tools, methods, and services are incubated. The starting premise for a lab is often wanting to spur change and make space for new practice and new people. Yet calling something a lab can also signal separation between traditional services and new approaches. Labs, and innovation in general, can seem like a passing fad that promotes shallow thinking about the application of digital technologies. Considering the limited resources and lack of cutting-edge technologies available at most galleries, libraries, archives, and museums (GLAMs), should GLAMs consider opening labs? 

Open a GLAM Lab book cover.

To begin to answer this question, the British Library Lab, which opened in 2013, held a meeting in September of 2018 called “Building Library Labs” to start a conversation among practitioners who were currently running a lab or thinking of opening one. There was a lively enough discussion to warrant another meeting in March of 2019 in Copenhagen. The buzz from these events created a community of “labbers” and the lab-interested that has grown to 250 participants from 20 countries.

Even with this momentum and interest, participants identified a need to articulate lab values, share relevant experience and case studies, and suggest some best practices for those starting up cultural heritage innovation labs. One year after the first gathering, in late September 2019, a group of 16 librarians, developers, archivists, curators and academics from around the world (including myself) landed in Doha, Qatar, to embark on a BookSprint, a collaborative and rapid publishing methodology, to write a book in five days.

At the end of those five days, the authors strongly argue: yes, “Open a GLAM Lab”.

Labs can be directly tied to achieving the missions of GLAMs and they can be inclusive change-makers. GLAM Labs build on the work of their institutions to create, preserve and provide access to collections. They can work with and share new technologies and methods for creating and disseminating expertise embedded in and adjacent to cultural and memory organizations. By explicitly inviting new users into GLAMs and designing new partnerships with communities, labs can address contemporary challenges around reaching new audiences, collaborating with communities, and sharing the value of collections broadly.

Tools and services created in a GLAM Lab are not devised as permanent. Therefore, space emerges where researchers, artists, educators and the interested public can collaborate with a group of partners with the time and remit to explore questions that help create new collections, tools, and services. These outcomes help transform the future ways in which knowledge and culture are disseminated. The exchange, experimentation and data created in a Lab are open, iterative and shared widely, which can feel risky to authoritative organizations. But GLAMs are full of people who are passionate about connecting collections to communities; Labs provide opportunities to combine new ideas with the deep expertise of existing staff and a mechanism to imagine and test future possibilities.

All this positive thinking about the future of digital transformation in GLAMs can be contagious, and we (the authors) hope that it is. But there is very hard work involved, and a resilient mindset is required. Bureaucracy-hacking, risk-taking and reacting to criticism are all everyday activities in a Lab. It is challenging to navigate ingrained processes, anxieties, user expectations, and technical limitations while generating momentum toward future progress. Not all experiments or partnerships end with the hoped-for results. Labs can help articulate criteria and provide evidence for hard choices that GLAMs make every day.

As exciting as the new methods and technical possibilities are, people are the center of a GLAM Lab. Only through engaging with people can you change the culture and direction of an organization. A GLAM Lab helps to translate expertise and generosity from across the organization to make collections and technologies approachable and usable. Establishing values for a GLAM Lab provides guiding principles for how to engage with partners and communities. Nurturing staff and taking an inclusive and transparent approach to engaging with collaborators and user communities help to ensure all groups feel welcome and supported in lab environments.

GLAM Lab Booksprint in Action


People were also at the center of the experience of writing the book. Capturing the collective experience and perspective of 16 people was a unique undertaking. As we reflected in the Introduction: Making this book was hard but it was also very special. The themes you see reflected in this book: being open to experimentation, risk-taking, iteration, and transformation also capture the methodology of the BookSprint. The process of extracting ideas from sixteen heads and making a coherent narrative under extremely tight deadlines sometimes got messy. There were highs and lows, moments of brilliance, feelings that we’d never finish, and very late nights. We had to push each other to keep going, be uncomfortable, debate, disagree, come to a decision, and move forward to finish. Sometimes we didn’t do this well, but we were always able to come together again. The five days of intense work resulted in a book, but it also resulted in a very bonded group that is galvanized to make positive change. The process allowed diverse experiences and perspectives to meld together into a unified book that we hope you find useful in answering questions about why time, space and resources for experimentation are important to create.

You can download the open access e-book from http://glamlabs.io and sign up for the GLAM Lab listserv for updates. The book is a collective product with contributions from Mahendra Mahey, Abigail Potter, Aisha Al-Abdulla, Armin Straube, Caleb Derven, Ditte Laursen, Gustavo Candela, Katrine Gasser, Kristy Kokegei, Lotte Wilms, Milena Dobreva-McPherson, Paula Bray, Sally Chambers, Sarah Ames, Sophie-Carolin Wagner and Stefan Karner, who are from the following institutions:

  • Austrian National Library, Austria
  • The British Library, UK
  • Fundación Biblioteca Virtual Miguel de Cervantes, Spain
  • Ghent Centre for Digital Humanities, Ghent University, Belgium
  • History Trust of South Australia, Australia
  • Library of Congress, USA
  • KB National Library of the Netherlands, The Netherlands
  • National Library of Scotland, UK
  • Qatar University Library, Qatar
  • The Royal Danish Library, Denmark
  • State Library of New South Wales, Australia
  • UCL Qatar, Qatar
  • University of Alicante, Spain
  • University of Limerick, Ireland

The University College London, Qatar, Qatar University, the British Library and the Library of Congress provided funding to hold the BookSprint.

Recommendations for Enabling Digital Scholarship

Mass digitization — coupled with new media, technology and distribution networks — has transformed what’s possible for libraries and their users. The Library of Congress makes millions of items freely available on loc.gov and other public sites like HathiTrust and DPLA. Incredible resources — like digitized historic newspapers from across the United States, the personal papers […]

Open Science Framework: Meeting Researchers Where They Are

This is a guest post by Megan Potterbusch, National Digital Stewardship resident at the Association of Research Libraries. Openly sharing research data, code and methodology are integral parts of open science. Whether due to disciplinary culture shifts or funder and publisher mandates, the general trend towards open science has been increasing in many research fields. […]