Today’s guest post is from Kate Murray, Marcus Nappier, and Liz Holdzkom of the Digital Collections Management & Services Division at the Library of Congress.
Calling all file formats nerds…and nerds who aren’t file formats nerds…yet! The wait for more file formats news is over. We have some cool updates to share about new additions to the Sustainability of Digital Formats site and our takeover of Stacks! Welcome to Issue Number 2 of Fun with File Formats.
New Content Categories
We are extremely excited to announce new additions to the Sustainability of Digital Formats site, one of the premier resources in the world for in-depth technical information about digital file formats. Since our last update, we have added three new content categories to reflect expanding community needs at the Library of Congress and beyond.
The new content categories are Email and Personal Information Manager; Design and 3D; and Aggregate. The Email and Personal Information Manager (PIM) category, which also includes calendaring and instant messaging formats, has been steadily gaining traction within the Library over the past few years. These Email and PIM formats include functionality for instant messaging, contacts, appointments, and other personal data that are usually bundled together in software packages. The Library is exploring its workflows for processing and archiving its growing number of email collections. Some of these Email and PIM formats include:
- Electronic Mail Format (EML)
- iCalendar Electronic Calendar and Scheduling Format (iCal)
- Internet Message Format (IMF)
- Microsoft Outlook Item (MSG)
- And many others!
The Design and 3D content category includes 2D and 3D computer-aided design (CAD) and computer-aided manufacturing (CAM), built environment, schematics, architectural drawings, photogrammetry scanning, point cloud data and much more! With a new content category come new quality and functionality factors. See our previous blog post that briefly talks about how quality and functionality factors serve as one of the pillars of our formats work. The quality and functionality factors for Design and 3D formats are based on a 2008 analysis from the National Center for Supercomputing Applications (NCSA). Four key aspects of a 3D model serve as the basis of our quality and functionality factors: geometry, appearance, scene, and animation. Geometry describes the model’s shape, based on point clouds, line sets, or meshes, among other approaches. Appearance incorporates colors, textures, and material types. Scene covers light source positions, camera, and other objects relative to the 3D model, while animation defines how a 3D model moves! Popular Design and 3D formats include: STEP, Photoshop Files, STL (STereoLithography), X3D, and DXF.
The Aggregate content category consists of a subset of simple bundling formats that are used to collect multiple data files together into a single file for easier portability and storage, with the option for data compression to save storage space in addition to other features. Aggregate formats, such as ZIP, RAR, tar and the new format description for 7z, are classified as archive files in computing and in many standards specifications. The Formats site, in alignment with other efforts in the digital preservation community, is using the term aggregate instead of archive because the latter term has broader community use beyond the definitions of these formats.
There are three quality and functionality factors for aggregate file formats: compression, support for error detection, and functionality beyond normal rendering. Compression is one of the main features of aggregate file formats: the multiple data files collected in a package can be stored in less space. Aggregate files support a variety of compression algorithms, ratios, and methods. Error detection support refers to the ability of aggregate files to include checksums, hash values, or other fixity information to minimize data loss.
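To make the compression and error detection factors concrete, here is a minimal sketch using Python's standard zipfile module on a small ZIP built in memory. The member name and contents are invented for the example, and ZIP stands in for aggregate formats generally; other formats expose these features differently.

```python
import io
import zipfile
import zlib

# Invented sample content, repeated so that it compresses well.
payload = b"file formats are fun\n" * 100

# Build a small ZIP in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("notes.txt", payload)

with zipfile.ZipFile(buffer) as archive:
    info = archive.getinfo("notes.txt")
    # Compression: original size versus size as stored in the ZIP.
    print("original bytes:", info.file_size, "stored bytes:", info.compress_size)
    # Error detection: each member records a CRC-32 of its uncompressed data.
    print("recorded CRC-32:", f"{info.CRC:08x}")
    # testzip() re-reads every member and returns the name of the first one
    # whose CRC does not match, or None when all members pass.
    first_bad = archive.testzip()
    print("first corrupt member:", first_bad)
```

A CRC-32 is a lightweight check against accidental corruption. It is not a substitute for the stronger checksums typically used for fixity in preservation workflows.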
FDD Review and Updates
In our previous Fun with File Formats blog post, we mentioned that we prioritize new format description documents (or fdds) if the format is listed as preferred or acceptable in the Recommended Formats Statement. Priorities for updating fdds are no different. In preparation for the release of the updated RFS later this summer, we’ve been hard at work updating fdds for preferred and acceptable formats (55 so far!).
With some help from a teammate, we created a script that parses the XML for all of our fdds to pull out the “LC Preference” field and the date of the last update. With that data, we used some Excel magic (and human eyes) to identify the fdds that are listed in the RFS. Through this work, we’ve found that we haven’t always been consistent in our language in the “LC Preference” field. This posed problems when we tried to use Excel to identify only the “acceptable” and “preferred” formats and exclude those that are “not preferred.” We’re now working on establishing consistency in that field to save ourselves from future frustrations.
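As an illustration only, a script of this kind might look like the sketch below. It is not our actual script: the element names (lcPreference, lastUpdate), the fdd identifiers, and the sample wording are all invented placeholders rather than the real fdd XML schema. It also shows one plausible reason inconsistent wording is painful: a plain text search for "preferred" also matches "not preferred," so the negated phrase has to be checked first.

```python
import xml.etree.ElementTree as ET

# Invented sample documents; element names are placeholders, not the real schema.
SAMPLE_FDDS = {
    "fdd-a": "<fdd><lcPreference>Preferred format for datasets.</lcPreference>"
             "<lastUpdate>2012-05-01</lastUpdate></fdd>",
    "fdd-b": "<fdd><lcPreference>Acceptable for still images</lcPreference>"
             "<lastUpdate>2021-11-15</lastUpdate></fdd>",
    "fdd-c": "<fdd><lcPreference>Not preferred; see the RFS.</lcPreference>"
             "<lastUpdate>2009-02-20</lastUpdate></fdd>",
}


def classify_preference(text):
    """Map free-text wording onto one consistent label."""
    lowered = text.lower()
    # Check the negated phrase first: "preferred" is a substring of "not preferred".
    if "not preferred" in lowered:
        return "not preferred"
    if "preferred" in lowered:
        return "preferred"
    if "acceptable" in lowered:
        return "acceptable"
    return "unspecified"


def summarize(fdd_id, xml_text):
    """Pull the preference wording and last-update date out of one document."""
    root = ET.fromstring(xml_text)
    return {
        "id": fdd_id,
        "preference": classify_preference(root.findtext("lcPreference", default="")),
        "last_update": root.findtext("lastUpdate", default=""),
    }


rows = [summarize(fdd_id, xml_text) for fdd_id, xml_text in SAMPLE_FDDS.items()]
for row in rows:
    # Tab-separated output pastes cleanly into a spreadsheet.
    print(row["id"], row["preference"], row["last_update"], sep="\t")
```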
Armed with a list of formats featured in the RFS, we prioritized our updates based on the “Last significant FDD update” date. We’re embarrassed to say that some of our fdds had not been updated for, well, a long while. So the highest priority updates were those fdds that were listed as a preferred or acceptable format but had not received a significant update in over 10 years. With over 500 fdds and counting, these delays are bound to happen, which makes this review all the more important.
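The prioritization rule can be sketched in a few lines of Python. The tier names, the sample records, and the fixed reference date are assumptions made for the example. The 10-year and 5-year thresholds echo the ones mentioned in this post, but treating them as formal tiers is our simplification here.

```python
from datetime import date

RFS_CATEGORIES = {"preferred", "acceptable"}


def whole_years_between(earlier, later):
    """Count completed years between two dates."""
    before_anniversary = (later.month, later.day) < (earlier.month, earlier.day)
    return later.year - earlier.year - before_anniversary


def review_priority(preference, last_update, today):
    """Assign a hypothetical review tier to one fdd."""
    if preference not in RFS_CATEGORIES:
        return "later"
    age = whole_years_between(last_update, today)
    if age >= 10:
        return "highest"
    if age >= 5:
        return "high"
    return "routine"


# Invented records: (identifier, preference label, last significant update).
records = [
    ("fdd-a", "preferred", date(2012, 5, 1)),
    ("fdd-b", "acceptable", date(2021, 11, 15)),
    ("fdd-c", "not preferred", date(2009, 2, 20)),
    ("fdd-d", "acceptable", date(2017, 8, 30)),
]

reference_date = date(2024, 6, 1)  # fixed so the example is reproducible
priorities = {
    fdd_id: review_priority(preference, last_update, reference_date)
    for fdd_id, preference, last_update in records
}
print(priorities)
```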
As of today, we’ve reviewed and updated all fdds previously updated 5+ years ago, for all preferred and acceptable formats. We also now have two new Junior Fellows, Mari Allison and Dan Hockstein, diligently identifying dead links for the other fdds. You’ll hear more about the work of our Junior Fellows in a later blog post.
As we reviewed these documents, we looked for the usual broken links (as mentioned) and typos, but also for connections to newer versions or subtypes and for changes in adoption and use. We also added PRONOM unique identifier (PUID) and Wikidata ID (QID) links when matches were available but not previously documented and, for some of the older fdds, we added missing links to specifications as well as links to updated specifications. This comprehensive review encouraged us to standardize and document our best practices for fdds, both updated and new, to ensure more consistency overall. This standardization will help us as we pull data on our fdds for other projects.
You might remember from our last blog post that we have been running a Python script monthly to track our mappings to PUIDs and QIDs for our fdds. Using this data, we’ve found that this round of updates for the upcoming release of the RFS has resulted in the addition of 12 QIDs and 4 PUIDs for fdds that previously did not have one listed. This number will surely climb as we progress with this project.
Building on some of that recent success with mapping to PUIDs, the Formats team here at LC has also been developing a workflow to map our fdds and related formats data (including documented extensions and file format names) to PRONOM’s DROID signature files. These signature files are generated by PRONOM and used by DROID for file format identification and analysis. Because these signature files are presented in XML form, we felt that there was an opportunity to leverage scripting skills across DCM to parse data from these signature files and match them to our own fdds. Now with this script, we’ll be able to identify missing extensions and PUIDs that can be added to make our fdds more robust.
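For readers curious what this kind of matching involves, here is a rough sketch, not our production script. It assumes the signature file lists each format as a FileFormat element with a PUID attribute and Extension child elements inside the PRONOM signature file namespace; that layout should be checked against an actual signature file before relying on it. The PUIDs, extensions, and fdd records are invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace and layout of a DROID signature file.
NS = {"sig": "http://www.nationalarchives.gov.uk/pronom/SignatureFile"}

SAMPLE_SIGNATURE_XML = """\
<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="0">
  <FileFormatCollection>
    <FileFormat ID="1" Name="Sample Format One" PUID="fmt/90001">
      <Extension>one</Extension>
      <Extension>on1</Extension>
    </FileFormat>
    <FileFormat ID="2" Name="Sample Format Two" PUID="fmt/90002">
      <Extension>two</Extension>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>
"""

# Invented fdd records: the PUID (if documented) and the documented extensions.
SAMPLE_FDDS = {
    "fdd-a": {"puid": "fmt/90001", "extensions": {"one"}},
    "fdd-b": {"puid": None, "extensions": {"two"}},
}


def extensions_by_puid(signature_xml):
    """Return {PUID: set of lowercase extensions} from a signature file."""
    root = ET.fromstring(signature_xml)
    result = {}
    for file_format in root.iterfind(".//sig:FileFormat", NS):
        puid = file_format.get("PUID")
        if puid:
            result[puid] = {
                ext.text.strip().lower()
                for ext in file_format.iterfind("sig:Extension", NS)
                if ext.text
            }
    return result


def find_gaps(fdds, droid):
    """List extensions or candidate PUIDs that an fdd does not yet document."""
    gaps = {}
    for fdd_id, record in fdds.items():
        if record["puid"] in droid:
            missing = droid[record["puid"]] - record["extensions"]
            if missing:
                gaps[fdd_id] = {"missing_extensions": sorted(missing)}
        elif record["puid"] is None:
            # No PUID documented: suggest PUIDs that share an extension.
            candidates = sorted(
                puid for puid, exts in droid.items() if exts & record["extensions"]
            )
            if candidates:
                gaps[fdd_id] = {"candidate_puids": candidates}
    return gaps


gaps = find_gaps(SAMPLE_FDDS, extensions_by_puid(SAMPLE_SIGNATURE_XML))
print(gaps)
```

Matching on a shared extension only suggests candidates. Many formats share extensions, so a person still needs to confirm each match before it is added to an fdd.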
Stacks Format Mapping
All of our work on our new content categories and fdds wasn’t enough, so we undertook a project to review the listed file formats in the Library’s Stacks platform. Our onsite users might be familiar with Stacks, which provides access to rights-restricted content from the Library of Congress’ reading rooms. Stacks allows faceting by media type, but the format names that displayed weren’t always the most helpful to our users, unless you like to search by “application/vnd.openxmlformats-officedocument.wordprocessingml.document” or a similar file type. The Formats team knew we could improve on this.
Starting from a list of file type labels in Stacks, we used resources like IANA (the Internet Assigned Numbers Authority), which maintains a registry of media types, and MIME type lists from Mozilla to identify these mysterious (and sometimes not so mysterious) labels. As we worked through the list and crafted new labels, we kept the users in mind by choosing labels that are likely to be easily understood and recognized. For formats that are often identified by an acronym (everything from PDF to ELF), we opted to use the long-form name only for the uncommon ones. So, in this case, the well-known CSV format will remain CSV.
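The acronym rule can be written down as a small lookup, sketched here in Python. The media types are real registered types, but the label wording, the choice of which acronyms count as well known, and the fallback behavior are assumptions for illustration, not the labels actually used in Stacks.

```python
# Media type -> (acronym, long-form name). Label wording is illustrative only.
FORMAT_NAMES = {
    "text/csv": ("CSV", "Comma-Separated Values"),
    "application/pdf": ("PDF", "Portable Document Format"),
    "application/mxf": ("MXF", "Material Exchange Format"),
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": (
        "DOCX",
        "Microsoft Word Document",
    ),
}

# Hypothetical judgment call: acronyms most users would recognize on sight.
WELL_KNOWN_ACRONYMS = {"CSV", "PDF"}


def display_label(media_type):
    """Return a user-facing label for a media type."""
    entry = FORMAT_NAMES.get(media_type.strip().lower())
    if entry is None:
        # Unknown types fall back to the raw media type so nothing is hidden.
        return media_type
    acronym, long_name = entry
    # Keep well-known acronyms; spell out the uncommon ones.
    return acronym if acronym in WELL_KNOWN_ACRONYMS else long_name


for media_type in FORMAT_NAMES:
    print(media_type, "->", display_label(media_type))
```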
In the end, we updated labels for 62 media types in Stacks and brought users a much more intuitive search experience. Throughout this project, we also developed rules for formatting labels and wrote documentation on pulling format labeling data, researching appropriate labels, and submitting these changes to the Stacks team. When we approach this project again in the future, we’ll be prepared with the standards and practices that we learned along the way.
We’ve certainly been busy over here in the formats group, so we’re excited to share these updates with our file format fanbase. Comments and questions are always welcome at [email protected].