Filling in the File Format Gaps

Today’s guest post is from Kate Murray, Marcus Nappier, and Liz Holdzkom of the Digital Collections Management & Services Division at the Library of Congress.


This is the fourth installment of our semi-annual blog series about file format research for the Sustainability of Digital Formats: Planning for Library of Congress Collections at the Library of Congress. If you’re a file format fan, take a look at the other entries: Fun with File Formats, Return to the Fascinating World of File Formats!, and Even More Fun with File Formats!. We may not have the most creative blog post titles, but we know our way around a specification and how to find a magic number.

This has been a busy few months for your favorite file format folks! Let’s catch you up on all the goings on.

New and updated file format descriptions (and LOTS of them)

Thanks in part to a contract with NVision Solutions, we have published 30 new file format descriptions (known as FDDs) to our site this calendar year. A full list of the new entries is available on our 2022-2023 workplan and we’re also keeping our publication log up-to-date so you can follow along at home when we publish a new one.

These new FDDs fall into several content categories:

Screenshot of spreadsheet showing newest Format Description Documents (order from newest to oldest), with FDD numbers, names, URLs, and publication dates.
Formats publication log for new additions from January – June 2023. For the live version, see www.loc.gov/preservation/digital/formats/fdd/fdd_workplan.shtml.

RFS FDD prioritization

Let’s keep the FDD update train rolling! In preparation for the release of the 2023 Recommended Formats Statement (RFS), we’ve also been updating the FDDs called out in the RFS’s various content categories. You may remember from our Return to the Fascinating World of File Formats! blog post last June that we developed a new process to pull the date of last update from our FDD XML to target those RFS FDDs. We’ve continued to build on this work and standardize the process to update these FDDs. “What information are we updating?” is probably a question you’re asking right now. We’re sure by now you’ve checked out an FDD or two and noticed LOTS of links to external resources. That’s where we start with our updates: we make sure that links are still active and resolve to the correct source. We’ve now also developed template language for the “LC Experience” and “LC Preference” sections in our FDDs to better clarify the Library’s holdings of a particular format and whether that format is listed in the RFS. Clarity in the “LC Preference” field is important because we haven’t always been consistent in the past, and that has caused a few (or many) headaches when running our XML parsing script. We’re continuing to work on establishing consistency in that field to save ourselves from future headaches.
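For the curious, here is a minimal sketch of what "pull the date of last update from the XML and target the oldest records" could look like in Python. It is an illustration only, not our actual script: the element names (`fdd`, `title`, `updates`, `date`), the record IDs, and every date in the sample data are made up, and the real FDD XML is richer than this.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical, simplified records. IDs, element names, and dates are
# invented for illustration and do not describe real FDDs.
SAMPLE_FDDS = {
    "example-a": "<fdd><title>Format A</title><updates>"
                 "<date>2012-03-14</date><date>2017-08-02</date></updates></fdd>",
    "example-b": "<fdd><title>Format B</title><updates>"
                 "<date>2021-01-20</date><date>2022-11-03</date></updates></fdd>",
    "example-c": "<fdd><title>Format C</title><updates>"
                 "<date>2014-06-09</date></updates></fdd>",
}


def last_update(xml_text: str) -> date:
    """Return the most recent ISO date found in any <date> element."""
    root = ET.fromstring(xml_text)
    dates = [
        date.fromisoformat(el.text.strip())
        for el in root.iter("date")
        if el.text and el.text.strip()
    ]
    if not dates:
        raise ValueError("record has no update dates")
    return max(dates)


def stale_fdds(fdds: dict, today: date, min_age_years: int = 5) -> list:
    """List (id, last_update) pairs older than the threshold, oldest first.

    Age is approximated as 365 days per year, which is close enough
    for building a review list.
    """
    threshold_days = 365 * min_age_years
    stale = []
    for fdd_id, xml_text in fdds.items():
        updated = last_update(xml_text)
        if (today - updated).days >= threshold_days:
            stale.append((fdd_id, updated))
    return sorted(stale, key=lambda pair: pair[1])


if __name__ == "__main__":
    for fdd_id, updated in stale_fdds(SAMPLE_FDDS, today=date(2023, 6, 30)):
        print(fdd_id, updated.isoformat())
```

A sketch like this also hints at why consistency matters: a date or field written in an unexpected form would make the parser raise an error instead of quietly returning a wrong answer.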

Unlike last year, we actually had a priority one FDD from our prioritization list! WACZ (Web Archive Collection Zipped), as mentioned above, is a brand new FDD in the RFS. We still prioritized FDDs that were listed as a preferred or acceptable format and had gone 5-10 years without a significant update, but we reviewed newer FDDs as well. With over 50 completed FDD updates, we continue to see the high value of this work, and it will remain a critical part of our yearly review.

The 2023-2024 version of the RFS will be published in the coming weeks, so stay tuned for a follow-up blog post highlighting all the changes.

Upcoming work

We are excited to begin a new contract this month with Ashley Blewer, Abi Simkovic and Frances Harrell through Myriad Consulting. Over the next 12 months, this team will research and write close to 40 new FDDs. The 2023-2024 work plan is available and includes a few new areas of interest, such as mobile device support, packaging, software and installation support, forensics and disc imaging, as well as filling in gaps for existing content categories: Email and Personal Information Manager (PIM) Formats, Design and 3D, Datasets and Databases, Still Images, and Text. We’re personally looking forward to the research work on Audio Definition Model (ADM), gzip, bzip, and Apple ProRaw, just to name a few.

We’ve discussed how we prioritize which formats to work on in a previous blog post. More specifically for this upcoming group of FDDs, priority formats were identified via the Library’s Music and Manuscript divisions’ research efforts and holdings, inclusion in projects such as BitCurator (the Library of Congress is a member of the BitCurator Consortium) and wider community discussion.

Fan favorite formats

But it’s not just all about the new FDDs, so let’s look at the old favorites. We looked at the analytics from the last 12 months, and found that CSV is our most popular FDD, followed closely by Wavefront Material Template Library (MTL). We love a good CSV so this makes sense.

Screenshot of CSV Format Description Document
A snippet of everyone’s favorite FDD, CSV Comma Separated Values (RFC 4180)! See www.loc.gov/preservation/digital/formats/fdd/fdd000323.shtml for the full version.

Then DWG (AutoCAD Drawing) Format Family and Email (Electronic Mail Format) come in third and fourth, but with far fewer views than our top two (we’re talking thousands fewer).

And more stats we can love: Thousands of visitors came to our site over the past year from The Signal blog and blog posts just like this! And Wikipedia is also a major referring site, which means Wikipedians are using our FDDs for source material. No matter where you are coming from, whether you are linking from a different site or coming to us directly, we love our visitors just the same.

As always, comments and feedback are very welcome! Leave a comment here or send us a note at [email protected].

Comments

  1. Amazing work, all! Looking forward to seeing these new FDDs : ~ )
