Office Opens up with OOXML

The following is a guest post by Carl Fleischhauer, a Digital Initiatives Project Manager in the Office of Strategic Initiatives.

Before VisiCalc, Lotus 1-2-3, and Microsoft Excel, spreadsheets were compiled by hand, although their compilers took advantage of adding machines. And there were contests, natch. This 1937 photograph from the Library’s Harris & Ewing collection portrays William A. Offutt of the Washington Loan and Trust Company. It was produced on the occasion of Offutt’s victory over 29 competitors in a speed and accuracy contest for adding machine operators sponsored by the Washington Chapter, American Institute of Banking.

We are pleased to announce the publication of nine new format descriptions on the Library’s Format Sustainability Web site. This is a closely related set, each of which pertains to a member of the Office Open XML (OOXML) family.

Readers should focus on the word Office, because these are the most recent expression of the formats associated with Microsoft’s family of “Office” desktop applications, including Word, PowerPoint and Excel. Formerly, these applications produced files in proprietary, binary formats that carried the filename extensions doc, ppt, and xls. The current versions employ an XML structure for the data and an x has been added to the extensions: docx, pptx, and xlsx.
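For readers who want to see what "an XML structure for the data" looks like in practice, here is a minimal Python sketch. It relies on a detail not covered in this post: under the Open Packaging Conventions, an OOXML file is a ZIP package whose parts are XML documents. The part names and namespaces below follow the usual docx layout, but the package built here is a hypothetical toy, not a valid Word document, and the function names are invented for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical, minimal part contents: well-formed XML only,
# not enough to be opened as a real Word document.
TOY_PARTS = {
    "[Content_Types].xml": (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"/>'
    ),
    "word/document.xml": (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        "<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>"
        "</w:document>"
    ),
}


def build_toy_package():
    """Write the toy parts into an in-memory ZIP and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name, xml_text in TOY_PARTS.items():
            package.writestr(name, xml_text)
    return buffer.getvalue()


def list_xml_parts(package_bytes):
    """Map each part name to the local name of its XML root element."""
    roots = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for name in package.namelist():
            root = ET.fromstring(package.read(name))
            # ElementTree reports tags as "{namespace}local"; keep the local part.
            roots[name] = root.tag.rsplit("}", 1)[-1]
    return roots


if __name__ == "__main__":
    for part, root_name in list_xml_parts(build_toy_package()).items():
        print(part, "->", root_name)
```

The same `list_xml_parts` approach should work on the bytes of a real docx, pptx, or xlsx file, provided every part in it is XML; packages that embed images or other binary parts would need those entries skipped.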

In addition to giving the formats an XML expression, Microsoft also decided to move the formats out of proprietary status and into a standardized form (now focus on the word Open in the name). Three international organizations cooperated to standardize OOXML. Ecma International, an international, membership-based organization, was the first to publish, in 2006. At that time, Caroline Arms (co-compiler of the Library’s Format Sustainability Web site) served on the ECMA work group, which meant that she was ideally situated to draft these descriptions.

In 2008, a modified version was approved as a standard by two bodies that work together on information technology standards through a Joint Technical Committee (JTC 1): the International Organization for Standardization and the International Electrotechnical Commission. These standards appear in a series with identifiers that lead off with ISO/IEC 29500. Subsequent to the initial publication by ISO/IEC, ECMA produced a second edition with identical text. Clarifications and corrections were incorporated into editions published by this trio in 2011 and 2012.

Here’s a list of the nine:

  • OOXML_Family, OOXML Format Family — ISO/IEC 29500 and ECMA 376
  • OPC/OOXML_2012, Open Packaging Conventions (Office Open XML), ISO 29500-2:2008-2012
  • DOCX/OOXML_2012, DOCX Transitional (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 1-4
  • DOCX/OOXML_Strict_2012, DOCX Strict (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 2-4
  • PPTX/OOXML_2012, PPTX Transitional (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 1-4
  • PPTX/OOXML_Strict_2012, PPTX Strict (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 2-4
  • XLSX/OOXML_2012, XLSX Transitional (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 1-4
  • XLSX/OOXML_Strict_2012, XLSX Strict (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 2-4
  • MCE/OOXML_2012, Markup Compatibility and Extensibility (Office Open XML), ISO 29500-3:2008-2012; ECMA-376, Editions 1-4

Microsoft is not the only corporate entity to move formerly proprietary specifications into the realm of public standards. Over the last several years, Adobe has done the same thing with the PDF family. There seems to be a new business model here: Microsoft and Adobe are proud of the capabilities of their application software–that is where they can make money–and they feel that wider implementation of these data formats will help their business rather than hinder it.

Office work in the days before computer support. This photograph of the U.S. Copyright Office (part of the Library of Congress) was made in about 1920 by an unknown photographer. Staff members are using typewriters and a card file to track and manage copyright information. The original photograph is held in the Geographical File in the Library’s Prints and Photographs Division.

Although an aside in this blog, it is worth noting that Microsoft and Adobe also provide open access to format specifications that are, in a strict sense, still proprietary. Microsoft now permits the dissemination of its specifications for binary doc, ppt, and xls, and copies have been posted for download at the Library’s Format Sustainability site. Meanwhile, Adobe makes its DNG photo file format specification freely available, as well as its older TIFF format specification.

Both developments–standardization for Office XML and PDF and open dissemination for Office, DNG and TIFF–are good news for digital-content preservation. Disclosure is one of our sustainability factors and these actions raise the disclosure levels for all of these formats, a good thing.

Meanwhile, readers should remember that the Format Sustainability Web site is not limited to formats that we consider desirable. We list as many formats (and subformats) as we can, as objectively as we can, so that others can choose the ones they prefer for a particular body of content and for particular use cases.

The Library of Congress, for example, has recently posted its preference statements for newly acquired content. The acceptable category for textual content on that list includes the OOXML family as well as OpenDocument (aka Open Document Format or ODF), another XML-based family of office document formats. ODF was developed by the Organization for the Advancement of Structured Information Standards, an industry consortium. ODF’s standardization as ISO/IEC 26300 in 2006 predates ISO/IEC’s standardization of OOXML. The Format Sustainability team plans to draft descriptions for ODF very soon.

5 Comments

  1. Robert
    February 11, 2015 at 2:58 pm

    OOXML is a format of Microsoft. ODF is the real open standard. who has been paying who here?

  2. Kate Murray
    February 11, 2015 at 3:44 pm

    Thank you for the comment. As noted in the final paragraph of the blog, “The Library of Congress . . . preference statements . . . include [both] the OOXML family as well as OpenDocument (aka Open Document Format or ODF) . . . . The Format Sustainability team plans to draft descriptions for ODF very soon.”

  3. KeithCu
    February 12, 2015 at 11:02 am

    Here is something Google wrote about OOXML:

    ———–
    Although OOXML may formally comply with Ecma, it was clearly not designed with an “open” spirit. Comparing the current with the future situation, interoperability is likely to become more difficult instead of easier. The implementation of a fully compatible ODF importer (the current efforts regarding .doc and .xls) is not an easy task, but it is dwarfed by the implementation of a fully compatible OOXML importer, which we estimate to take something between 50 – 500 person years, or even longer. Therefore, although it is theoretically possible to generate an OOXML document, this document will probably only use a very small subset of the standard.

    In sum, OOXML can be compared to Microsoft giving access to a labyrinth to which it alone owns a map; moreover, certain tunnels within this labyrinth are not accessible without a key that only Microsoft has, and that third parties would need to replicate first. (And, in doing so, these third parties would not know whether they would violate any rights that exposes them to litigation).
    ——–

    I have several pages about OOXML in my book: http://keithcu.com/

  4. Carl Fleischhauer
    February 13, 2015 at 7:59 am

    Thank you for highlighting the challenges to third-party implementation. Our perspective has been rather more concerned with long-term preservation than with implementation. With future researchers in mind, we seek to properly manage the content that is added to our collections. Word processing documents, for example, are most likely to be acquired as a part of collections of personal papers or institutional records, and they typically fall to the custodial care of our Manuscript Division. They generally arrive years after the documents were written and we have no choice but to accept the files in the formats in which they were created.

    The Library’s Recommended Formats statement (//www.loc.gov/preservation/resources/rfs/) is framed with such collection building in mind. In that statement, ODF and OOXML are listed on an equal footing for textual collections. When compared to the binary .doc files that would have comprised the bulk of, say, an individual’s papers at the end of the twentieth century, having XML-based files (ODF or OOXML) in the twenty-first seems like a definite improvement. If digital archeology is eventually required for preservation management, as opposed to migration to some alternate or successor format, the level of effort is likely to be lower for XML-based formats.

    Meanwhile, the Library’s separate Format Sustainability Web site carries descriptions that do not, in and of themselves, make a recommendation, although they do point to preference statements by the Library and by other organizations. These descriptions are intended to help our staff and persons in other organizations with an interest in preservation understand the formats. We hope that this information will help everyone maximize the chance that historical content remains accessible in the future. Meanwhile, we promise a write-up of ODF as soon as possible, and that description will report such facts as ODF’s adoption by a number of national governments for communications between citizens and government when an editable format is required.

  5. Jack
    February 22, 2015 at 9:08 pm

    Carl:

    Respectfully, you should go back to the library and do some research about OOXML. You state in response to one of the comments, “ODF and OOXML are listed on an equal footing.” Why would you do that? You are comparing apples and oranges.

    The ISO doesn’t weigh in on openness or not (hence, they did approve OOXML, even though it is closed. (Just because one can read sections of its XML such as “Do this the way Word 95 does it” does not make it “open”)).

    But the ISO does approve formats for their singular purposes, and won’t approve multiple formats for the same thing. The Library of Congress needs to be asking itself, “What are these formats for? With what purpose did ISO standardize them? What are they supposed to be used for?”

    Open Document Format is the only format approved by the ISO for original documents.

    The only way proponents of OOXML (such as your Caroline Arms, apparently, whose loyalty sure doesn’t seem to rest with transparency and good archival practices, if she was an, ugh, OOXML proponent) were able to get ISO approval for OOXML was to assert that it was somehow different from ODF. In the same way JPEG and GIF are, in a sense, image formats, but do very different things, so do OOXML and ODF. Read the scope statement of OOXML’s ISO specification to see what it is for. Or research the comments of Microsoft employees such as Doug Mahugh, who describe OOXML as a format for “carrying forward” old documents into XML.

    Just because, after ISO approved OOXML for that limited, only to be used for old documents purpose, Microsoft turned around and callously set OOXML as the default for new documents in its products, (which was never its purpose), does not mean that the LOC should be an accessory to such bad faith practices.

    I have read LOC’s Sustainability, Quality, and Functionality factors, and OOXML frankly fails most of them.

    At this point, OOXML should be considered a deprecated format, because, if there ever was a need to carry forward old documents into XML, surely by this point in time, more than half a decade later, that need has ended. Is there seriously even a single person out there in 2015 carefully taking some old documents and transforming them into XML? No, OOXML has zero purpose any longer, if it ever had any at all.

    The LOC could really make a statement, and defend good library principles, by loudly rejecting OOXML from your Format Sustainability site, and calling for its discontinued use as a format by anyone, anywhere, as any need that might ever have been in place for its existence has surely ended. If your Caroline Arms was actually on the ECMA committee pushing for this garbage OOXML format, then she should be well aware of all this. So why isn’t she educating you?

    I strongly recommend you become educated about these formats yourself, because right now, the LOC looks like a laughingstock to anybody who is at all knowledgeable about formats.
