Wouldn’t it be great to have a single technical solution that solves all your long-term digital archiving, stewardship and preservation needs? Perhaps a file format with millions of users, widespread adoption across different computing platforms, free viewers and open documentation?
A lot of hopes and dreams have been poured into the idea of “one preservation tool to rule them all,” and many people, both inside and outside of the preservation community, have come to think of the “archival” version of the widely used Portable Document Format as this single solution.
However, a close examination of the tool shows that while it’s useful and valuable for many things, it’s not the only answer for long-term archiving and preservation. This can’t be stated often enough, especially as awareness grows around the October 2012 release of the latest version of the PDF/A specification.
The specification, which goes by the cumbersome name of Document management — Electronic document file format for long-term preservation — Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3) (or ISO 19005-3:2012 for short), defines a file format based on PDF which provides a mechanism for representing electronic documents in a manner that preserves their static visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files. “Static visual appearance” ultimately means that conforming PDF/A files are complete in themselves and use no external references or non-PDF data.
But the scope of the PDF format has significantly expanded since a variety of organizations first met in October 2002 to begin work on the archival version of the specification. In 2011 PDF/A-2 brought the specification in concordance with the international standardization of PDF itself, and PDF/A-3 now addresses expanding business concerns in addition to the specification’s original strict preservation orientation defined largely by the cultural heritage community.
PDF/A-3 makes only a single, fairly monumental change. In the PDF/A-2 specification users were allowed to embed files, but only PDF/A files. PDF/A-3 now allows the embedding of any arbitrary file format, including XML, CSV, CAD, images and any others.
At first glance this sounds like a gigantic betrayal of everything that the format has stood for. Why define a subset of PDF attributes to ensure the long-term comprehension of the file if you’re going to turn around and allow the kitchen sink to be embedded within it? (You can follow some of the original discussion of this change here.)
The answer is that a wider business community, beyond the traditional archiving and cultural heritage sectors, pushed hard for it. The good news is that the addition of this feature to the specification will open up new application areas without seriously threatening the scope and intent of previous versions.
In the United States the corporate interest in PDF is led by the pharmaceutical, banking and financial sectors. As these industries already use PDF heavily, it makes sense for them to try and extend the PDF/A specification and leverage it for their own purposes.
The pharmaceutical sector, for example, has the challenge of managing a multitude of documents over long timeframes in the process of submitting their work to the FDA for approval. For their legal protection they also need to retain and archive these documents for the long-term, a natural benefit of PDF/A. Why not create a new version of the specification that would allow the multitude of documents to travel together in a single package?
In theory, this creates external dependency challenges in the newly created PDF/A-3 documents. But the specification makes the PDF/A-3 document a “dumb” container that prohibits “actionable” access to the embedded files. The embedded files should not be required in any way to comprehend the information in the PDF/A-3 document and are supplied merely as support to the information already in the document.
The significant language is in section 6.8, “Embedded files”:
Although embedded files that do not comply with any part of this International Standard should not be rendered by a conforming reader, a conforming interactive reader should enable the extraction of any embedded file. The conforming interactive reader should also require an explicit user action to initiate the process.
For example, you might embed a word processing document that is the converted source of the PDF/A-3 document, or a spreadsheet file that is represented in the PDF/A-3 document by an image or a form. A PDF/A “conforming reader” (a software tool that renders a PDF/A-3 document reliably according to the rules of the specification) should not activate the embedded files but should enable the files to be extracted to another location for viewing, if the user has the proper tool to engage with that type of file.
Of course, a big assumption behind this change to the specification is that PDF documents are suitable universal package formats for all kinds of data. While this does fit into the established workflows of many communities, the idea has been met with skepticism in the preservation community.
PDF/A as an archival format isn’t broken with the introduction of PDF/A-3. The allowance of embedded files doesn’t make the preserving institution responsible for keeping the embedded files comprehensible over time, and their inclusion shouldn’t affect the informational content of the document in any way.
As we all get smarter and technology improves, the acute concerns about format obsolescence may diminish and we will likely welcome the fact that source materials have been stored in the PDF/A-3 documents. This change is significant, but before we discount the format altogether let’s explore what it means in practice and see how we can use this change to the advantage of the long-term stewardship community.
Update: Image credits were added on 11/28/12
Comments (9)
Great article, but I want to add two items:
Although this standard has a “-3”, it does not replace the “-2” version. Since PDF/A is a long term standard, all versions are in the archivist’s tool bag to be used as appropriate in perpetuity.
The inclusion of alternate forms of content in PDFs, especially XML renderings, has been a best practice of many governments. This content is generally consumed by machines and the PDF rendering is consumed by humans. This new version allows using PDF/A as their PDF rendering.
Like many others I was also surprised and not entirely pleased by the possibility to embed just anything in PDF/A-3. I also followed some of the discussions about this on Twitter. I think much of the discussion (and disagreements) essentially boils down to 2 things:
1. Differences in scope between ‘archiving’ as done by dedicated (national) archives and libraries on one hand, and ‘business’ archiving (don’t know a better word for it) on the other. This is also mentioned by Butch. Limiting the discussion to static (text-based) documents here, in the first case the main objective is to ensure that a document remains accessible and renderable in the long term. In the second case (‘business’ archiving) the archived document may be the basis for further revisions. For this use case I definitely see how having the source files (e.g. an MS Word document) wrapped in a container with the PDF/A could be useful, especially considering the fact that the ‘archive’ in this case is usually just a simple file system without any dedicated archival management facilities / tools.
On the other hand, revising (and to a lesser extent reusing) content isn’t of any concern to dedicated archival institutions: we’re really only interested in keeping our content accessible over time. One possible way of dealing with this is to simply ignore embedded files in any deposited PDF/A-3 documents. This is no problem if we’re sure that any embedded files are simply alternative representations of the renderable PDF/A-3 (e.g. its source MS Word file). In that case we can simply decide not to care about the embedded file, since we’re (hopefully?) confident that the ‘frozen’ PDF representation will remain renderable over time (if not, there would be little point in using PDF/A to begin with!). However, here we arrive at something else …
2. Function creep. Since PDF/A-3 allows you to embed anything, you can bet people will end up doing exactly that. This has already been possible in ‘regular’ PDF for ages, but the added ‘A’ and the backing of PDF/A as an archival solution by ISO may add a veneer of this being good archiving practice. The worst case scenario would be that people end up using PDF/A-3 for things that have traditionally been the domain of dedicated container formats (e.g. ZIP, TAR), in the misguided belief that this would somehow make any long-term sustainability concerns magically disappear. Because of this I think PDF/A-3 has some potential of becoming an (unintended) archival obfuscation format.
Implications
Butch points out that “the allowance of embedded files doesn’t make the preserving institution responsible for keeping the embedded files comprehensible over time, and their inclusion shouldn’t affect the informational content of the document in any way”. This is true, but only for the scenario where the embedded file is an alternative (source) representation of the main body of the PDF. On the other hand, should PDF/A-3 gain popularity as a more general-purpose container format, we might end up in a situation where we need to consider any embedded file objects as well. This would (technically) complicate archival management in a variety of ways. For instance, suppose the main body of a PDF document mentions the existence of a video, which is embedded in the same PDF. If we want to ensure that the embedded video remains renderable over time, we first need to be able to identify the video as an individual object, which requires an identification tool or service that is able to deal with the PDF container. Preservation actions (emulation, migration) on embedded files become more complex as well, since they also need to deal with the additional PDF layer. Not that this requires any rocket science, but it does add a layer of complexity that most of us probably could do without.
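To make the identification step concrete, here is a minimal sketch in Python. It assumes the embedded file's bytes have already been pulled out of the PDF container (that extraction is the extra layer discussed above and is not shown). The `SIGNATURES` table and the `identify` function are hypothetical names for illustration only; a real workflow would rely on a proper format registry and identification tool rather than a handful of hard-coded magic numbers.

```python
# Hypothetical, deliberately tiny signature table: (leading magic bytes, label).
# A production tool would consult a full format registry instead.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP container (also used by OOXML, ODF, EPUB)"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"fLaC", "FLAC audio"),
    (b"\xff\xd8\xff", "JPEG image"),
]


def identify(data: bytes) -> str:
    """Return a coarse format label based on the file's leading bytes."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


# Usage: bytes that have already been extracted from the PDF container.
print(identify(b"fLaC\x00\x00\x00\x22"))
print(identify(b"no recognisable signature"))
```

Even this toy version shows the point being made: the identification logic itself is simple, but it can only run after something has unwrapped the PDF layer.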
Johan (and Steve!), thanks very much for the comments. I couldn’t have said it better myself (but I was trying to keep it under 900 words!).
Johan, you’ve amplified several key questions and concerns that I wasn’t able to articulate clearly enough.
The “business” community made the strongest push for embedding in A-3 and the traditional “preservation” community pushed back strongly against the idea (follow the “original discussion” link above for more on that) but the business community view ultimately won out.
I suppose they could have gone and created their own standard, but at least the restrictions from PDF/A-1,2 will remain in effect and that’s beneficial to our community.
Similarly, I’m not sure I’d recommend A-3 to an archival entity starting from scratch unless their workflow was heavily dependent on PDF. I’ve never thought PDF was a good “universal package” format and I still don’t think so, but for some organizations it makes sense. And whether our community likes it or not, creators are cramming more and more content of all types into PDFs and we’ll have to deal with it.
I agree that some organizations might look inappropriately to PDF/A-3 as a solution because of the branding, but they still have to work through their own technical workflows before they come to a decision. If they’re rigorous about self-analysis they’ll determine whether PDF/A-3 is an appropriate fit (or not).
The last point is the most challenging because I’m of two minds. On the one hand, if the embedded objects are “inert” they’re not hurting anybody (except larding up the format and increasing the size of the files, but that’s a different story). This is easy enough to deal with and we can treat them like they don’t exist (as long as they truly are “inert,” which I do have some concern about).
On the other hand, it will be very useful to have the originals around at some point in the future.
I hope (!) that we’ll all continue to get smarter as we go along and that the file format issues that concern us now will be much less vexing. In that (imagined future) case, the concerns about whether we’ll be able to render a spreadsheet uniformly will fall away and we’ll be able to interact with the original files, providing a much richer experience for future researchers.
Keep the comments coming! I’d love to hear more from folks with questions and concerns about the format.
I’m finding it difficult to accept the argument that a desire to keep related artifacts together with an archival PDF that presumably holds a version or derivation of those artifacts for long term rendering should result in placing the related artifacts inside the PDF file. There are multiple generic container formats available in the marketplace that have broad usage and enjoy the status of de facto standard at least. It seems to me that the community has an inverted solution in PDF/A-3. Would it not be better to embed the related artifacts, along with the PDF file(s), in something like a ZIP or, better yet, BagIt container? These would allow self description as well as content hash strings to be maintained along with the artifacts. And these container formats are understood to be simple containers, rather than a file of reduced-functionality PostScript that is intended to support rendering on a wide variety of platforms, into which we have poured un-renderable data files.
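As a rough illustration of what "content hash strings maintained along with the artifacts" looks like in practice, here is a minimal Python sketch of a BagIt-style layout: payload files under `data/`, a `bagit.txt` declaration, and a `manifest-sha256.txt` listing one checksum per file. The function names `make_bag` and `verify_bag` are hypothetical, the version string assumes the 0.97 draft of the BagIt spec, and this is not a complete or validated BagIt implementation (no tag manifests, no `bag-info.txt`, flat file names only).

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> Path:
    """Write payload files under data/ plus a SHA-256 manifest; return the manifest path.

    Assumes flat file names (no subdirectories) to keep the sketch short.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    # Bag declaration; the version number is an assumption for this sketch.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    lines = []
    for name in sorted(payload):
        (data_dir / name).write_bytes(payload[name])
        digest = hashlib.sha256(payload[name]).hexdigest()
        lines.append(f"{digest}  data/{name}")
    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest entry."""
    manifest = bag_dir / "manifest-sha256.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


# Usage: package a PDF alongside its source data, then check fixity.
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "example_bag"
    make_bag(bag, {"report.pdf": b"%PDF-1.7 ...", "source.csv": b"a,b\n1,2\n"})
    print(verify_bag(bag))
```

The PDF and its related artifacts sit side by side as ordinary files, and the manifest gives each one an independently checkable hash, which is the property being argued for here.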
As the Project Leader for PDF/A, let me add a few things to what Butch has already said as well as address some comments from others.
As Butch and Steve said, while the capabilities and requirements of parts 1 and 2 were HEAVILY driven by professional archivists, it was the rest of the world that drove part 3. Governments, Libraries, Businesses, etc. And their needs can be summed up thus: the requirement to have a single electronic document that could be stored (anywhere) or transmitted (eg. emailed/posted online) that could be either human or machine consumed.
Both @Johan and @Tom suggest that this requirement could also have been met by the use of a ZIP (or similar) container format. I would suggest that the answer to that position is “sort of”. While it would provide for machine consumption, the average human doesn’t know what to do with the ZIP file. And while desktop OS platforms have native ZIP support, mobile platforms do not. So going in that direction would have created a HUGE impediment to the human consumption requirement. Secondarily, there does not exist an international standard for digitally signing ZIP files – something that is extremely important in those communities. (NOTE: there now exists an EU-based standard, but it has received little traction).
@Butch pointed out that the international community could have gone off and invented a completely new format/subset to solve their needs – and in fact we did explore that avenue for a few months. However, the world had done such an amazing job on evangelising the benefits of PDF/A for “long term reliable storage of content” that anything else simply could not measure up.
Hope this gives you some more insight into the situation. And like Butch, I’m happy to comment further if you have additional questions.
Some of the justifications given here are quite remarkable. For example:
“their inclusion shouldn’t affect the informational content of the document in any way.” but then “…we will likely welcome the fact that source materials have been stored in the PDF/A-3 documents”
and similarly:
“This is easy enough to deal with and we can treat them like they don’t exist (as long as they truly are ‘inert,’ which I do have some concern about).” but then “On the other hand, it will be very useful to have the originals around at some point in the future.”
Do I really need to point out that this is not a consistent position? Either the attachments contain more information, or they do not, and if they don’t, they can be deleted. But of course they do, or else they would not be there, and whether I’m responsible for preserving that additional information or not is not up to you, Butch!
The underlying problem appears to stem from the lack of consistency around the use cases. On this page, we have the FDA submission example, we have the idea of embedding the ‘originals’ in the PDF (which smacks of ‘to PDF/A’ being treated as some kind of universal preservation action), and the idea of including a machine readable version of information in the parent document. Each of these three may be defensible, or they may not, but that is still not quite the heart of the problem.
The problem is that this variety of distinct use cases, each seeking the PDF/A label, has led to the acceptance of a solution that is so very general that almost anything at all is permissible. The standard does not state that this feature should only be used to embed originals, or for a machine-readable version, or to add supporting information. It does not state that the parent PDF should not explicitly refer to the attachments (which could not be enforced anyway). By simply being attached rather than transcluded, anything goes, and anything will.
In that light, the FDA approval example (as outlined here) is little short of terrifying. The embedded information is not required to ‘comprehend’ the submission, but may ‘support’ it? This appears to imply that while we can be assured that we will be able to read the submission itself, the same cannot be said of the underlying data required in order to *trust* it!
Andrew,
Great points and I’m glad you chimed in. The discussion has been illuminating.
I share concerns about using PDF/A-3 as a packaging format, but I understand that there are workflows where it makes sense (or at least lobbyists for those workflows have claimed that it makes sense).
There are two sides to my admittedly inconsistent response.
Call the first the “see no evil” response.
A valid PDF/A-3 document should be self-contained. That is, the document should be completely comprehensible as a PDF in and of itself without any reference to any embedded files.
That said, imagine a document with a table in it. To understand the information content in the document there’s no need to know the formulas that determine the numbers in the table, but at some point in the future, someone may wonder what the formulas were. They could then extract the spreadsheet and examine it in greater detail.
In this case the embedded files represent optional, secondary content and the maintainers of the files make no promises about the preservation of this content. This is “willful ignorance” on the part of the preserving organization and they do it out of institutional protection, but also because you shouldn’t need the embedded files to understand the content of the PDF.
The second is the “magic pony” response.
Your point that “either the attachments contain more information, or they do not, and if they don’t, they can be deleted” has plenty of merit.
So why keep the files there at all? At some future point (which I imagine is sooner rather than later) the file format concerns around many common file types will fade away.
For example, think of a file with embedded FLAC audio of the author reading the document. We currently think of PDFs as flat files similar to pieces of paper (and PDF/A files are even more “paper-like” than others). In the “future” (when we’re smarter) we’ll be so used to dealing with multi-dimensional PDF files that a flat file will seem archaic.
At that point we’ll welcome the rich supporting material embedded in the PDF because it will be a natural part of our engagement to be able to easily render the embedded files and we’ll welcome their existence.
Again, we won’t need any of the embedded files to understand the document, but their addition will add richness to the work.
There is lots of room for discussion on these points and we welcome more. Here are a few points that the community might want to consider for further discussion:
1. What are the use cases for using PDF/A-3 for long-term preservation purposes and how can we provide guidance on these?
2. What are the use cases where PDF/A-3 IS NOT a good choice and how can we educate the community when it is not an appropriate choice?
3. What are other options for long-term preservation packaging and how can we educate users on these options?
4. How can we ensure that conforming readers enforce the rules on embedded files?
Assuming that we are happy with referring to PDF/A-3 as a very generic container file, do we have good sight of the payload or manifest process that will describe the contained objects? It reads like the main proposed use case is as a generic AIP object (or as Andy said, “as some kind of universal preservation action”).
Will we know the contents? Will we know what formats we are holding in the container?
I appreciate that the idea is that the objects are all self-describing, but that description is surely contextual, not technical?
In the example you offered about the ‘table in a document’, the assumption is that if the spreadsheet that contains the figures that comprise the table data is to exist intelligibly inside the PDF/A-3 container, then the spreadsheet file needs to be understood / rendered / cataloged… If the spreadsheet data is of a renderable ‘PDF’ (PostScript) form, how do we get to any cell based transforms and functions?
In principle it seems like PDF/A-3 will have a purpose, but I suspect that purpose is not well aligned with medium to long term preservation / archival requirements.
Butch: to respond to your points…
1. The committee behind PDF/A-3 clearly has use cases in mind. Perhaps these could be written up as guidelines or usage profiles? If my speculation was accurate then the use cases include “supporting source data”, “machine readable version” and “embedded source document”. Once enumerated, can some more specific restrictions be outlined for each case? Can embedded metadata be used to help identify and distinguish these different use cases?
2. I suspect enumerating all the cases where this approach may not be appropriate is not a good use of anyone’s time. Focusing on the positive use cases should help raise any ‘gotchas’ which can then be addressed explicitly.
3. Again, more clarity on use cases might help. In general, there are plenty of good standards for keeping things together, like ZIP or even just filesystem conventions. The case we seem to be addressing here, “keeping things together so they can be passed around”, is commonly addressed by email.
4. You can’t. Similarly, PDF/A-3 cannot be validated automatically, and I think this departure from the past is the reason behind much of the bad feeling towards the new standard. If stricter, use case oriented guidelines were in place (see point 1), then a manual conformance audit process could be designed, c.f. TRAC certification. The same applies to the evaluation of conforming readers.