Wouldn’t it be great to have a single technical solution that solves all your long-term digital archiving, stewardship and preservation needs? Perhaps a file format with millions of users, widespread adoption across different computing platforms, free viewers and open documentation?
A lot of hopes and dreams have been poured into the idea of “one preservation tool to rule them all,” and many people, both inside and outside of the preservation community, have come to think of the “archival” version of the widely used Portable Document Format as this single solution.
However, a close examination of the tool shows that while it’s useful and valuable for many things, it’s not the only answer for long-term archiving and preservation. This can’t be stated often enough, especially as awareness grows around the October 2012 release of the latest version of the PDF/A specification.
The specification, which goes by the cumbersome name of “Document management — Electronic document file format for long-term preservation — Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3)” (or ISO 19005-3:2012 for short), defines a file format based on PDF which provides a mechanism for representing electronic documents in a manner that preserves their static visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files. “Static visual appearance” ultimately means that conforming PDF/A files are complete in themselves and use no external references or non-PDF data.
But the scope of the PDF format has significantly expanded since a variety of organizations first met in October 2002 to begin work on the archival version of the specification. In 2011 PDF/A-2 brought the specification into line with the international standardization of PDF itself, and PDF/A-3 now addresses expanding business concerns in addition to the specification’s original strict preservation orientation, which was defined largely by the cultural heritage community.
PDF/A-3 makes only a single, fairly monumental change. In the PDF/A-2 specification users were allowed to embed files, but only PDF/A files. PDF/A-3 now allows the embedding of any arbitrary file format, including XML, CSV, CAD, images and any others.
At first glance this sounds like a gigantic betrayal of everything that the format has stood for. Why define a subset of PDF attributes to ensure the long-term comprehension of the file if you’re going to turn around and allow the kitchen sink to be embedded within it? (You can follow some of the original discussion of this change here.)
The answer is that a wider business community, beyond the traditional archiving and cultural heritage sectors, pushed hard for it. The good news is that the addition of this feature to the specification will open up new application areas without seriously threatening the scope and intent of previous versions.
In the United States the corporate interest in PDF is led by the pharmaceutical, banking and financial sectors. As these industries already use PDF heavily, it makes sense for them to try to extend the PDF/A specification and leverage it for their own purposes.
The pharmaceutical sector, for example, has the challenge of managing a multitude of documents over long timeframes in the process of submitting its work to the FDA for approval. For their legal protection, companies also need to retain and archive these documents for the long term, a natural benefit of PDF/A. Why not create a new version of the specification that would allow the multitude of documents to travel together in a single package?
In theory, this creates external dependency challenges in the newly created PDF/A-3 documents. But the specification makes the PDF/A-3 document a “dumb” container that prohibits “actionable” access to the embedded files. The embedded files should not be required in any way to comprehend the information in the PDF/A-3 document and are supplied merely as support to the information already in the document.
The significant language is in section 6.8, “Embedded files”:
Although embedded files that do not comply with any part of this International Standard should not be rendered by a conforming reader, a conforming interactive reader should enable the extraction of any embedded file. The conforming interactive reader should also require an explicit user action to initiate the process.
For example, you might embed a word processing document that is the source from which the PDF/A-3 document was converted, or a spreadsheet file that is represented in the PDF/A-3 document by an image or a form. A PDF/A “conforming reader” (a software tool that renders a PDF/A-3 document reliably according to the rules of the specification) should not activate the embedded files, but it should let the user extract them to another location for viewing, provided the user has the proper tool to engage with that type of file.
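To make the “dumb container” rule a little more concrete, here is a minimal sketch in Python of how a reader might apply the two requirements quoted above: don’t render embedded files that aren’t PDF/A, and only extract a file when the user explicitly asks. Everything in it (the EmbeddedFile class, the should_render and extract functions, the user_confirmed flag) is a hypothetical illustration of the policy, not part of the specification or of any real PDF library, and it does not parse actual PDF files.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class EmbeddedFile:
    """Hypothetical stand-in for a file embedded in a PDF/A-3 document."""
    name: str
    data: bytes
    is_pdfa: bool  # assumption: True only if the embedded file is itself PDF/A


class ExtractionRefused(Exception):
    """Raised when extraction is attempted without an explicit user action."""


def should_render(embedded: EmbeddedFile) -> bool:
    # Embedded files that do not comply with any part of PDF/A
    # should not be rendered by the reader.
    return embedded.is_pdfa


def extract(embedded: EmbeddedFile, target_dir: Path, user_confirmed: bool) -> Path:
    # Any embedded file can be extracted, but only after an explicit user action.
    if not user_confirmed:
        raise ExtractionRefused(f"explicit user action required for {embedded.name!r}")
    target_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so the file lands inside target_dir.
    out_path = target_dir / Path(embedded.name).name
    out_path.write_bytes(embedded.data)
    return out_path
```

In this sketch an embedded word processing file would never be displayed by the reader itself. It would simply be written out to a folder of the user’s choosing once the user confirms, and opening it would then be up to whatever tool the user has for that format.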
Of course, a big assumption behind this change to the specification is that PDF documents are suitable universal package formats for all kinds of data. While this does fit into the established workflows of many communities, the idea has been met with skepticism in the preservation community.
PDF/A as an archival format isn’t broken with the introduction of PDF/A-3. The allowance of embedded files doesn’t make the preserving institution responsible for keeping the embedded files comprehensible over time, and their inclusion shouldn’t affect the informational content of the document in any way.
As we all get smarter and technology improves, the acute concerns about format obsolescence may diminish and we will likely welcome the fact that source materials have been stored in the PDF/A-3 documents. This change is significant, but before we discount the format altogether let’s explore what it means in practice and see how we can use this change to the advantage of the long-term stewardship community.
Update: Image credits were added on 11/28/12