Fun with File Formats

Today’s guest post is from Kate Murray, Marcus Nappier, and Liz Holdzkom of the Digital Collections Management & Services Division at the Library of Congress.

Are you a file format fan? If you’re curious how to pronounce the still image format HEIF (spoiler alert: it rhymes with “beef”) or the difference between PDF/A-3 and PDF/A-4, the Library of Congress’s Sustainability of Digital Formats (a.k.a., Formats) is the place for you. To help you satisfy your need for in-depth technical, and perhaps more than a bit nerdy, knowledge about all things digital file formats, we’ve decided to start a regular series about what we’re up to. Welcome to Issue Number 1 of Fun with File Formats!

The Formats site is one of the premier resources in the world for in-depth information about digital file formats. It covers over 525 formats, encodings and wrappers in a variety of content categories: sound recordings, still images, datasets, textual materials, moving images, website and web archives, geospatial, music composition, 3D and architectural drawings and many more (see figure 1 below).

Figure 1. The defined content categories for the Sustainability of Digital Formats.

One of the hallmarks of the site is the set of seven well-known and widely adopted Sustainability Factors, which evaluate the likely feasibility and cost of preserving information content in the face of future change within the technological environment in which users and archiving institutions operate (see figure 2 below). These factors are significant in the strategy that is adopted as the basis for future preservation actions: migration to new formats, emulation of current software on future computers, or a hybrid approach. Disclosure, Adoption, Transparency, Self-documentation, External dependencies, Impact of patents and Technical protection mechanisms all have a role to play in how a format might perform over the long haul.

Figure 2. The seven sustainability factors.

The first format descriptions were drafted as static HTML files in 2003, with updates and additions continuing in the years that followed. The production process began to move into an XML mode in late 2007. By 2012 all of the existing descriptions had been converted to XML and new descriptions were being created in XML. Today, all of our format descriptions are available in HTML and XML, with the option to access individual XML files or download a frequently updated zip file of the entire set in XML.

Each format description document (known as an “fdd”) is assigned a six-digit number starting with 000001 and a prefix of “fdd” (e.g., fdd000001 for WAVE Audio File Format). New fdds are assigned the next available number in a sequential list. On a practical level, we refer to the fdds by the last three digits of their identifier, for example fdd001 for WAVE.
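For readers who like to see a convention spelled out, here is a minimal Python sketch of the numbering scheme described above. The helper names `full_fdd_id` and `short_fdd_id` are hypothetical and invented for illustration; they are not part of the Formats site or its scripts.

```python
import re


def full_fdd_id(number: int) -> str:
    """Build a full identifier: 'fdd' plus a six-digit, zero-padded number."""
    if not 1 <= number <= 999999:
        raise ValueError("fdd numbers run from 1 to 999999")
    return f"fdd{number:06d}"


def short_fdd_id(full_id: str) -> str:
    """Shorten a full identifier to the everyday form: 'fdd' plus the last three digits."""
    match = re.fullmatch(r"fdd(\d{6})", full_id)
    if match is None:
        raise ValueError(f"not a full fdd identifier: {full_id!r}")
    return "fdd" + match.group(1)[-3:]


print(full_fdd_id(1))             # fdd000001
print(short_fdd_id("fdd000001"))  # fdd001
```

The short form only stays unambiguous while the sequence has fewer than 1,000 entries, which holds for the roughly 525 descriptions mentioned earlier.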

As described in a previous blog post, authoring an fdd is no easy feat and it’s a substantial investment of time, energy and effort. So far in 2021, we’ve researched and published nine new fdds:

But how do we decide which file formats to research for an fdd? It’s a somewhat complicated decision, but the gist is that we focus on formats of interest to the Library of Congress: formats we already have in our collections, formats we are starting to receive, and sometimes a newish format on the horizon that we want to get ahead of. We also prioritize formats listed in the Recommended Formats Statement (RFS) because if we have a preferred or acceptable format, we want to be informed about it.

One of our greatest challenges is keeping our fdds current, so we also have a tiered prioritization protocol to decide which fdds get reviewed and updated each year based on user stats, last update date, RFS status and other project information. We have extensive references in each fdd, many of them from the Web. In the past, we updated broken and outdated URLs yearly, but given our scope and scale, this is no longer feasible. Now we update as needed and make good use of the Internet Archive Wayback Machine for legacy or potentially unstable URLs.

We’re always curious about our users and we have A LOT of them (see figure 3 below). The Formats site averages about 35,000 unique visitors each month, and they come from all over the world, including the United States, United Kingdom, Australia, New Zealand, India, Greenland, French Polynesia, Lesotho, Japan, Panama, Bolivia and many more. In fact, we have users on every continent except Antarctica (so if you know a file format fan in Antarctica, ask them to pay us a visit to complete our global map!).

Figure 3. Geographic distribution of users for 2021.

It’s fun to see which fdd had the most visitors each month. For example, in October 2021, DWG (AutoCAD Drawing) Format Family (fdd445) took the top spot, followed closely by Wavefront Material Template Library (MTL) File Format (fdd508). This tells us that 3D and design formats are on the community’s mind. But in July 2021, Encapsulated PostScript (EPS) File Format, Version 3.x (fdd246) was the number one fdd, followed by Microsoft Office Excel 97-2003 Binary File Format (.xls, BIFF8) (fdd510), so we had some more textual and spreadsheet folks in the mix during the mid-summer months. There are also ebbs and flows across the year overall, with January and June being our slower months and September and October typically being the busier ones (see figure 4 below).

Figure 4. Number of unique visitors to the Sustainability of Digital Formats from January – October 2021.

A relatively new feature, added a few years ago, is mapping our format descriptions to other file format data sources, including PRONOM, hosted by The National Archives UK, and Wikidata. Just as we have our fdd numbering protocol for our unique identification numbers, PRONOM uses PUIDs (PRONOM Unique Identifiers), such as fmt/6, while Wikidata uses a unique title ID number with a “Q” prefix, such as Q217570.
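Because the three identifier schemes look quite different, a script can usually tell them apart by shape alone. The sketch below is illustrative only: the function name `identifier_kind` and the regular expressions are assumptions based on the examples above, plus the `x-fmt/` prefix that PRONOM also uses for some entries. None of it is taken from our published scripts.

```python
import re

# Assumed patterns, inferred from examples such as fdd000001, fmt/6 and Q217570.
PATTERNS = {
    "fdd": re.compile(r"fdd\d{6}"),
    "PUID": re.compile(r"(x-)?fmt/\d+"),
    "QID": re.compile(r"Q\d+"),
}


def identifier_kind(value: str) -> str:
    """Return which scheme an identifier appears to belong to, or 'unknown'."""
    for kind, pattern in PATTERNS.items():
        if pattern.fullmatch(value):
            return kind
    return "unknown"


print(identifier_kind("fdd000001"))  # fdd
print(identifier_kind("fmt/6"))      # PUID
print(identifier_kind("Q217570"))    # QID
```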

When there are equivalent matches in these resources, we make a note of it in our fdds so format researchers can have access to multiple sources of data. For example, let’s look at WebM (fdd518), a non-proprietary, royalty-free, open-source format developed and maintained by Google and optimized for web-based media content. It has exact matches in both PRONOM (fmt/573) and Wikidata (Q309440). Easy peasy. Each of the three resources describes this format at the same level of granularity and specificity, so we can confidently say that fdd518 = fmt/573 = Q309440.

This isn’t always the case, however. Sometimes the match isn’t exact. Take, for instance, TIFF 6 (fdd022). Wikidata (Q27231633) has an exact match, but PRONOM (fmt/353) doesn’t distinguish between the different versions of TIFF as our fdds do. This is not an exact match, and that is worth documenting because there are important differences between the versions of TIFF that a file format researcher or digital preservation practitioner would want to understand.

And in some cases, there’s no match at all. This can just be a case of PRONOM or Wikidata not making an entry for that format yet, as with DivX Video Codec (fdd069), but also, there simply may not be a match at the same level of information hierarchy.

We have a class of fdds we call “family fdds” where we describe the common characteristics of a related group of formats, such as PDF (Portable Document Format) Family (fdd030) but go into more detail in separate fdds focused on subtypes or versions, such as PDF 2.0, ISO 32000-2 (2017, 2020) (fdd474). There’s no PRONOM match for fdd030 because that resource doesn’t describe formats at the aggregate “family” level but there is one (fmt/1129) for the subtype PDF 2.0.

There’s another type of fdds nicknamed “combo packs” which describe a specific encoding in a wrapper or container such as QuickTime File Format with V210 Video Encoding (fdd368). There simply aren’t equivalent entries in PRONOM or Wikidata for this type of fdd.
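To make the different match outcomes concrete, here is a small Python sketch that records a few of the examples above as data. The `MatchRecord` structure, the status labels ("exact", "inexact", "none") and the `exact_matches` helper are hypothetical and chosen for illustration; they are not the vocabulary or schema used in the fdds themselves. Only matches stated in this post are included.

```python
from typing import NamedTuple, Optional


class MatchRecord(NamedTuple):
    fdd: str                   # short fdd identifier
    source: str                # "PRONOM" or "Wikidata"
    identifier: Optional[str]  # PUID or QID, or None when there is no entry
    status: str                # assumed labels: "exact", "inexact", "none"


RECORDS = [
    MatchRecord("fdd518", "PRONOM", "fmt/573", "exact"),      # WebM
    MatchRecord("fdd518", "Wikidata", "Q309440", "exact"),    # WebM
    MatchRecord("fdd022", "Wikidata", "Q27231633", "exact"),  # TIFF 6
    MatchRecord("fdd022", "PRONOM", "fmt/353", "inexact"),    # TIFF, all versions
    MatchRecord("fdd030", "PRONOM", None, "none"),            # PDF family
]


def exact_matches(records, fdd):
    """Return {source: identifier} for the exact matches recorded for one fdd."""
    return {
        r.source: r.identifier
        for r in records
        if r.fdd == fdd and r.status == "exact"
    }


print(exact_matches(RECORDS, "fdd518"))  # {'PRONOM': 'fmt/573', 'Wikidata': 'Q309440'}
print(exact_matches(RECORDS, "fdd022"))  # {'Wikidata': 'Q27231633'}
print(exact_matches(RECORDS, "fdd030"))  # {}
```

Keeping one record per fdd and external source, rather than one row per fdd, lets a "no match" be stated explicitly for one resource without implying anything about the other.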

We’re still playing catch-up trying to map our older fdds to these other resources. There’s actually quite a bit of intellectual reasoning that goes into deciding what’s a match, what’s not, and why. We look at file type signature information including magic numbers, file extensions, IANA media types, standards and documentation, versions and subtypes and more. Fdds from recent years have a declaration of matching status (including when there is no match) and we’ll review the older ones as we can. Thanks in part to the Python scripting skills of our 2021 Junior Fellows, we’ve also released a new spreadsheet of fdd-PUID-QID mappings to help those playing the Format Match Game at home. We’re still tweaking it as we go, but we’ll update it monthly to reflect any new or revised matches. We’ve also included the scripts themselves so users can run them at any time to get the latest updates.

One more new outcome from our work is that, for the first time, we submitted our research information on HyperCard Stack files (fdd537) to PRONOM in order for them to create a new PUID for this format. We were thrilled to see fmt/1490 in the 27th October 2021 PRONOM new release notes! A match made in file format heaven! (Actually, it was a match made via the PRONOM “add a new entry” submission process but the sentiment still holds.)

Figure 5. Mappings for HyperCard Stack fdd to new PRONOM PUID.

As you can perhaps tell, we format folk are all about the details. Whether it’s the various profiles for MPEG-4 Video Encodings (fdd080) or how to distinguish between versions of MBOX (fdd383) for email, we do the (exhaustive) research and break it down for you in all its nerdy glory. Issue Number 2 of Fun with File Formats will come in mid-2022 and we’ll catch you up on all our progress to date. Until then, 66 69 6c 65 20 66 6f 72 6d 61 74 73 20 66 6f 72 65 76 65 72 (hexadecimal for “file formats forever”)!
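If you’d like to check our sign-off for yourself, the following short Python sketch decodes the hexadecimal bytes above and encodes the phrase back again. It assumes the bytes are plain ASCII, which they are for this message.

```python
hex_message = "66 69 6c 65 20 66 6f 72 6d 61 74 73 20 66 6f 72 65 76 65 72"

# bytes.fromhex skips the spaces between byte pairs.
decoded = bytes.fromhex(hex_message).decode("ascii")
print(decoded)  # file formats forever

# Going the other way: text to space-separated hexadecimal byte values.
encoded = " ".join(f"{byte:02x}" for byte in decoded.encode("ascii"))
print(encoded == hex_message)  # True
```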
