
Graphic of a word cloud that shows keywords for FDDs
Figure 1: Word cloud of FDD keywords

A Picture is Worth a Thousand Data Points: Visualizing File Format Data

Today’s guest post is from Ashley Blewer, independent archives technologist, and Kate Murray, Digital Projects Coordinator in Digital Collections Management and Services at the Library of Congress.

File format fans will be well aware of the Library’s Sustainability of Digital Formats (affectionately known as ‘Formats’), one of the world’s pre-eminent resources for in-depth and well-researched information about a wide variety of digital file formats. Take a look at our semiannual updates with the most recent one from December 2023, Growing the File Format Fam, followed by Filling in the File Format Gaps (June 2023), Even More Fun with File Formats! (December 2022), Return to the Fascinating World of File Formats! (June 2022) and the one that started it all, Fun with File Formats (December 2021).

Formats provides a wealth of data for over 525 formats, encodings and wrappers in a variety of content categories – Still Image, Sound, Textual, Moving Image, Web Archive, Datasets, Geospatial, Email and PIM, Design and 3D, Accessibility and Aggregate, as well as Generic. The current work plan and publication log are updated frequently to increase transparency and accountability.

As described in one of our previous blog posts, the format description documents (or ‘FDDs’) are drafted in XML and converted to HTML for web display, with the option to access individual XML files (along with the XML schema) or download a frequently updated zip file of the entire set in XML. The XML versions of the FDDs provide our data in a reliable and structured form for reuse in other projects and systems.

New Data Visualization Tool: FDD Dashboard

Technologist Ashley Blewer is making great use of this XML data in a new data visualization project, FDD Dashboard. Ashley is more than familiar with the FDD data and structure because she is part of the team from Myriad Consulting that is researching and writing new FDDs for us. Their collaborative work to date is included on our work plan. Ashley undertook this visualization project independently, on her own time, because she was interested in understanding how the FDDs looked together as a dataset instead of as individual documents. Seeing the set from more of a bird’s eye view helped her see certain patterns in writing, how the collection has grown over time, and which references and recommended sources for more information about file formats were most commonly used.

Ashley started by downloading the XML as a group, packaged in a zip file. She then began mapping the data from its element-based, nested XML structure to a flat database structure, thinking about which fields could be combined, which fields needed to be split apart, which could be mapped directly, and which fields or categories were best suited to their own database table.

Format description documents contain many Quality and Functionality factors, which vary in usage, based on the content categories. Rather than making dozens of fields that would mostly be blank for each row, Ashley chose to condense all of the Quality and Functionality factors into one field. This compromise reduces the granularity through which one can search and retrieve specific details such as “How many FDDs contain Quality and Functionality factors detailing sound fidelity?” This data is still available, but mixed in with other Quality factors.
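To make the tradeoff concrete, here is a minimal sketch of what condensing many factors into one field could look like. It is not Ashley's actual script: the element names (`fdd`, `factors`, `factor`) and the sample text are hypothetical and are not taken from the real FDD schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment; the element names are illustrative only and
# do not come from the real FDD schema.
SAMPLE = """
<fdd id="fdd000001">
  <factors>
    <factor name="Normal rendering">Good support.</factor>
    <factor name="Fidelity (high audio resolution)">Up to 24-bit samples.</factor>
    <factor name="Multiple channels"></factor>
  </factors>
</fdd>
"""


def condense_factors(fdd_element):
    """Join every non-empty factor into a single 'name: text' string."""
    parts = []
    for factor in fdd_element.iter("factor"):
        text = (factor.text or "").strip()
        if text:  # skip factors that are blank for this FDD
            parts.append(f"{factor.get('name')}: {text}")
    return " | ".join(parts)


root = ET.fromstring(SAMPLE)
condensed = condense_factors(root)
print(condensed)
```

With a single condensed field like this, a text search for "Fidelity" would still find the row, but counting FDDs per individual factor would require parsing the field again.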

She wrote a Python script to open each XML document and write the fields to a SQLite database. SQLite makes for a good experimental database because, as noted in the FDD for SQLite, it is a cross-platform database engine that is designed to be stored locally as a single file.
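A script along these lines might look roughly like the sketch below. It uses only Python's standard library; the table layout, the element names and the sample records are assumptions made for illustration and do not reflect the real FDD XML or Ashley's code.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified FDD records; real FDD XML is more deeply nested
# and uses different element names.
SAMPLE_DOCS = [
    "<fdd id='fdd000001'><title>Format A</title><category>file format</category></fdd>",
    "<fdd id='fdd000002'><title>Encoding B</title><category>encoding</category></fdd>",
]


def load_fdds(xml_strings, conn):
    """Parse each XML document and insert one flat row per FDD."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fdd (id TEXT PRIMARY KEY, title TEXT, category TEXT)"
    )
    for xml_text in xml_strings:
        root = ET.fromstring(xml_text)
        row = (root.get("id"), root.findtext("title"), root.findtext("category"))
        # INSERT OR REPLACE lets the script be re-run after a fresh download.
        conn.execute("INSERT OR REPLACE INTO fdd VALUES (?, ?, ?)", row)
    conn.commit()


# ":memory:" keeps the example self-contained; a filename would produce
# a single-file database on disk instead.
conn = sqlite3.connect(":memory:")
load_fdds(SAMPLE_DOCS, conn)
rows = conn.execute("SELECT id, title, category FROM fdd ORDER BY id").fetchall()
print(rows)
```

In practice the XML strings would come from the files in the downloaded zip rather than from an in-line list.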

Storing the data together in a SQL format instead of individual XML files allowed the data to easily be integrated with the open source data exploration and publishing tool, Datasette. By using Datasette with a visualization plug-in, Ashley was able to quickly put together some charts and graphs based on the FDD set.

Looking Good, Doing Good

After taking a moment to admire how fantastic our data looks in chart form, the Formats team was quickly able to put these visualizations to work to improve our internal quality control.

For example, the blue wedge in Figure 2 indicates that there were eight FDDs without a “format category” assignment from our controlled vocabulary list of ‘file format,’ ‘encoding,’ ‘family,’ or ‘file group.’ All of our FDDs should have one or more format category assignments, so something went amiss somewhere along the way with these eight entries. Thanks to this chart, we were able to use the data to quickly identify this error and make the correction.

Pie chart showing the breakdown of format categories in FDDs, with a blue wedge indicating FDDs without a category assignment
Figure 2: FDD Category assignments as a pie chart
Screenshot of SQL query results showing the breakdown of format categories across FDDs, with a blank row for the eight FDDs that do not have a category assignment.
Figure 3: FDD Category assignments as a detailed list
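A query in the spirit of Figure 3 could be sketched as follows. The table, column names and sample rows are hypothetical, and the sketch simplifies things by giving each FDD a single category value, whereas real FDDs may have more than one assignment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fdd (id TEXT PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO fdd VALUES (?, ?)",
    [
        ("fdd000001", "file format"),
        ("fdd000002", "encoding"),
        ("fdd000003", ""),    # blank assignment
        ("fdd000004", None),  # missing assignment
        ("fdd000005", "file format"),
    ],
)

# Count FDDs per category, folding NULL and blank values into one label
# so that unassigned entries show up as their own row.
breakdown = conn.execute(
    """
    SELECT COALESCE(NULLIF(TRIM(category), ''), '(none)') AS category_label,
           COUNT(*) AS n
    FROM fdd
    GROUP BY category_label
    ORDER BY n DESC, category_label
    """
).fetchall()

# List the specific FDDs that need a correction.
missing_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM fdd WHERE category IS NULL OR TRIM(category) = '' ORDER BY id"
    )
]
print(breakdown)
print(missing_ids)
```

The first query gives the chart-ready counts, while the second turns the unassigned wedge into a to-do list of individual FDD identifiers.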

Another chart (Figure 4) that grabbed our attention is the number of FDDs updated per year. While the yearly work plan mainly focuses on researching and writing new FDDs, we also strive to update essential FDDs, especially those referenced as preferred or acceptable in the Recommended Formats Statement (RFS). Starting in 2021, we developed a tiered prioritization protocol to identify which FDDs to review and update each year based on user stats, last significant update date, RFS status, links to references and citations and other project information. We go into detail about this review process in the 2022 blog post, Return to the Fascinating World of File Formats!

Bar chart showing the number of FDDs updated by year (2003 to present), broken down by category assigned
Figure 4: FDDs updated per year, 2003 to present

Clearly, we have been BUSY and it’s paying off. Obviously, we had a drop in 2020 thanks to the pandemic, but the focused attention has yielded real results. For the first time since we implemented this scheduled review in 2021, all the FDDs listed in the RFS have been reviewed within the last five years! This is a major accomplishment because the FDDs in the RFS are many and the Formats team is small (but mighty). We are well aware that there are still a number of FDDs that are not listed in the RFS and have not been reviewed in the last five years, but we are chipping away at the mountain. The struggle for maintenance culture is real, folks.
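For the curious, per-year counts like those in Figure 4 and a five-year staleness check could be approximated with a sketch like this one. The `last_updated` column, the sample dates and the `stale_ids` helper are all hypothetical; this is not the team's actual QC script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fdd (id TEXT PRIMARY KEY, last_updated TEXT)")
conn.executemany(
    "INSERT INTO fdd VALUES (?, ?)",
    [
        ("fdd000001", "2021-03-15"),
        ("fdd000002", "2023-11-02"),
        ("fdd000003", "2023-06-20"),
        ("fdd000004", "2017-01-09"),
    ],
)

# Number of FDDs whose most recent update falls in each year.
per_year = conn.execute(
    """
    SELECT substr(last_updated, 1, 4) AS update_year, COUNT(*) AS n
    FROM fdd
    GROUP BY update_year
    ORDER BY update_year
    """
).fetchall()


def stale_ids(conn, as_of, years=5):
    """Return ids whose last update is more than `years` years before `as_of`."""
    rows = conn.execute(
        "SELECT id FROM fdd WHERE last_updated < date(?, ?) ORDER BY id",
        (as_of, f"-{years} years"),
    )
    return [row[0] for row in rows]


stale = stale_ids(conn, "2024-01-01")
print(per_year)
print(stale)
```

Storing dates as ISO 8601 text (YYYY-MM-DD) is what makes the plain string comparison in `stale_ids` behave like a date comparison.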

There are other fun data points as well, such as the FDD with the most citations (we see you, VP9 (FDD 579), with 110!), the FDD with the most relationships to other FDDs (no surprise to see PDF (Portable Document Format) Family take this title with connections to 25 other FDDs) and even how often we reference the excellent file signature resource, GCK’s File Signatures Table, which is a trusted source in almost 10% of our FDDs.

When we shared the Dashboard with friends, colleagues, and file format fans, one of the first comments was “You need to include these in your annual report!” And yes, we’ll do that. But the impact is so much greater than dressing up our reports.

First, we love to see the reuse of the data in creative ways. We provide the XML (and have since 2012) with the hope that people can use it for their own purposes. We didn’t see this use coming but we are 100% supportive!

Second, pictures can tell a story in an approachable way. Who doesn’t love a chart, especially one that tells a good story? Our FDDs are dense documents – on purpose, honestly. We aim to provide robust, in-depth and correct information, and we understand that FDDs aren’t light bedtime reading. Boiling some key points down to digestible visualizations is a great way to get a peek into, and communicate, scope and trends.

Finally, these visualizations have helped improve our internal processes, which will in turn benefit our large community of users. We now use the scripts as part of our internal QC process.

We love to see the FDD data reused in new and interesting ways, especially in these visualizations (thank you, Ashley!). Let us know how you use our Formats data in the comments or [email protected].
