The following is a guest post by Carl Fleischhauer, a Digital Initiatives Project Manager in NDIIPP.
An Office of Strategic Initiatives team began compiling a set of digital format descriptions in 2004, yielding the format sustainability website, described in my Signal blog posts for December 19 and December 20, 2011. The team includes Jimi Jones and me from the Library of Congress and, during the last couple of years, three expert consultants. One of these is Caroline Arms, who helped establish the activity while still a full-time Library employee and who continues to contribute today. The others are Nancy Hoebelheinrich (Knowledge Motifs LLC) and Natalie Munn (Content Innovations LLC), who developed our offering pertaining to geospatial formats.
Our descriptions now cover 270 formats and subformats, presented in heavily, um, formatted pages where sets of structured tables present different categories of descriptive information. For example, one table gives a summary description of the format and how it relates to other digital formats. Another focuses on factors that determine whether content in a format is likely to be sustainable in the long term: is it publicly documented, widely used, unencumbered with patents, and so on? Other tables assess a format’s support for quality (e.g., level of image clarity) and functionality (e.g., support for multi-channel sound), and provide technical information about how a format may be recognized (e.g., “magic numbers”).
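To make the “magic numbers” idea concrete, here is a minimal sketch of recognizing a file format from its leading bytes. The signatures listed are the widely published ones for PNG, PDF, GIF, and JPEG; the table and the function name `identify_format` are illustrative assumptions, not something taken from our format descriptions.

```python
from typing import Optional

# A small, illustrative table of well-known leading byte signatures.
# Real format identification tools use much larger signature sets.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF-": "PDF",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"\xff\xd8\xff": "JPEG",
}


def identify_format(data: bytes) -> Optional[str]:
    """Return a format name if the data starts with a known signature."""
    for signature, name in MAGIC_NUMBERS.items():
        if data.startswith(signature):
            return name
    return None
```

In practice a caller would read only the first few bytes of a file and pass them to the function; a return value of `None` means no signature in the table matched.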
Our webpages serve readers who find relevant entries and read the text, a little like finding and reading an article in an encyclopedia. As on every other Web site, the human interface for readers employs HTML (HyperText Markup Language). Until 2008, we drafted our format sustainability pages directly in HTML, as we sorted through the intricacies of our data elements and structure. As an aside, since this blog is about formatting, I should split one hair. Our pages incorporate server-side includes (SSIs), pointers to subsidiary coding that fill in the header and footer parts of the Web page that repeat throughout the set. Strictly speaking, we create SHTML files.
From the start, however, we saw advantages to authoring in XML (eXtensible Markup Language). XML markup can describe the different pieces of information using “tags” that convey the meaning of each chunk of text. Thus XML files can support a broader set of uses for the underlying data than HTML pages. Some of these uses will employ automated processes, treating our format descriptions as a database rather than as an encyclopedia. For example, an interested organization could take the XML versions of our documents and apply an XSLT that recognizes the tags that are meaningful to the organization and uses them to extract selected segments. (XSLT stands for Extensible Stylesheet Language Transformations, a language for writing scripts that reformat marked-up text or data.) A format registry like the important-and-emerging Unified Digital Format Registry could extract particular elements from our dataset to supplement its own format-specific data.
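As a rough illustration of treating a description as data, the sketch below pulls selected elements out of an XML document. It uses Python's standard-library parser rather than an XSLT, simply to keep the example self-contained. The sample document and its element names (`formatDescription`, `title`, `adoption`, and so on) are invented for the example and are not the tags in our schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical description; element and attribute names are invented.
SAMPLE_XML = """<formatDescription id="example-001">
  <title>Example Format</title>
  <sustainability>
    <disclosure>Fully documented</disclosure>
    <adoption>Widely adopted</adoption>
  </sustainability>
</formatDescription>"""


def extract_segments(xml_text, wanted_tags):
    """Collect the text of every element whose tag is in wanted_tags."""
    root = ET.fromstring(xml_text)
    record = {"id": root.get("id")}
    for element in root.iter():
        if element.tag in wanted_tags:
            record[element.tag] = (element.text or "").strip()
    return record


# Keep only the pieces this (imaginary) organization cares about.
record = extract_segments(SAMPLE_XML, {"title", "adoption"})
```

A registry could run the same kind of extraction over every description and load the resulting records into its own database.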
XML authoring will also make our own internal processes more efficient. If we “master” in XML, we too can apply an XSLT to our files. This means that when the Library’s Web design team updates the look or functionality of our online pages, we can easily batch-generate a fresh set of HTML files using a new XSLT script. The outcome would be that we would only need to manage and update the XML masters and draft any needed XSLTs. The HTML pages would take care of themselves.
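The batch-regeneration idea might look something like the sketch below. It is not our actual workflow: it stands in for an XSLT with a plain Python template, and the element names (`title`, `summary`), the template, and the function names are all assumptions made for illustration.

```python
import html
import xml.etree.ElementTree as ET
from pathlib import Path

# Stand-in for the site design; changing the look means editing only this.
PAGE_TEMPLATE = """<html>
<head><title>{title}</title></head>
<body>
<h1>{title}</h1>
<p>{summary}</p>
</body>
</html>
"""


def render_page(xml_text):
    """Turn one XML master (hypothetical tags) into an HTML page."""
    root = ET.fromstring(xml_text)
    title = root.findtext("title", default="")
    summary = root.findtext("summary", default="")
    return PAGE_TEMPLATE.format(
        title=html.escape(title), summary=html.escape(summary)
    )


def regenerate_all(master_dir, output_dir):
    """Rebuild every HTML page from the XML masters in master_dir."""
    master_dir, output_dir = Path(master_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for master in sorted(master_dir.glob("*.xml")):
        target = output_dir / (master.stem + ".html")
        page = render_page(master.read_text(encoding="utf-8"))
        target.write_text(page, encoding="utf-8")
        written.append(target)
    return written
```

The point of the design is that the XML masters are the only files edited by hand; the HTML output can be discarded and rebuilt whenever the template changes.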
Why am I telling you all of this? This blog post celebrates the completion of the initial XML conversion of all of our legacy HTML pages, concluding a part-time activity that began in 2008. At that time, and in follow-up refinements, we enlisted the expert consultant Ignacio Garcia del Campo (CACI, Inc.) to develop an XML schema, the term for a precise and constrained description (a little like a cookbook) that governs the creation of a particular class of XML documents. Garcia del Campo also developed a script to help us convert our legacy HTML documents to XML and a data entry system that we could use to create new descriptions or to edit our legacy documents.
Alas, the conversion could only be partly automated. The conversion script was dandy but we editors had to manually correct coding errors resulting from inconsistencies in our original HTML. We also needed to adjust for changes in the facts about a given format, particularly with respect to its level of adoption, that is, how much it is used. And of course we found that many of the useful resources we had linked to on the Web had changed their URLs or disappeared.
How can interested parties get a copy of the XML versions of the descriptions? We plan to provide a webpage soon that offers direct downloading of this content. In the meantime, however, we have a bit more editing to do and we also need to finalize our management of the versioning of the XML schema. We will name our current “working” schema version 1.0 (and so label each description that conforms to it), since we foresee the need to change certain details in a year or two, which would yield version 1.1. For the moment, people who wish to obtain copies of the “working” XML resource should contact Jimi or me directly, or send a comment to the format sustainability website.
Would you consider sharing the XML via a collaborative platform like GitHub? That would make it easy for people to extend the entries and for you to fold those changes back if you want them.
Thank you for the comment. Our use of the format data at the Library is best supported by the “human-readable” HTML-based online presentation that we will maintain. The idea for the XML versions of the texts is to provide a way to hand off the information for others to use in other applications. Meanwhile, regarding the compilation of the information in the first place, we do seek (and have received) inputs from readers at our comments page (http://www.digitalpreservation.gov/formats/contact_format.shtml) and then we correct errors or add new information. This is a side project for all concerned and for the moment we have no resources to explore different technology.