We’re lucky in the digital stewardship community that our challenges tend not to be life-threatening. Still, when we get fired up about something, there is guaranteed to be spirited debate and passionate advocacy on all sides.
Such was the case with the release of the PDF/A-3 file format specification in October 2012. We wrote about it on the Signal shortly after and it was immediately a hot topic. To the barricades!
The PDF/A family of international standards defines a file format based on the Portable Document Format which provides a mechanism for representing electronic documents in a manner that preserves their static visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files. “Static visual appearance” ultimately means that conforming PDF/A files are complete in themselves and use no external references or non-PDF data.
The first version of the PDF/A specification (PDF/A-1) was published in September 2005 and has been updated at regular intervals since. The A-3 version of the specification was received with some concern in the stewardship community as it adds a single and highly significant feature to its predecessors. The PDF/A-2 specification permitted the embedding of other files as long as the embedded files were valid PDF/A files. A-3 permits the embedding of files of any format.
While a PDF/A-3 file’s primary document is still intended to be robust against preservation risks over the very long term, PDF/A-3 does not require that the embedded files be considered archival content, creating a series of potential technical and policy challenges for preserving institutions.
The National Digital Stewardship Alliance Standards and Practices Working Group clearly recognized these challenges and felt the community would benefit from an examination of the format and what it means for collecting institutions.
Which leads to today’s release of the NDSA report on “The Benefits and Risks of the PDF/A-3 File Format for Archival Institutions” (pdf).
The report takes a measured look at the costs and benefits of the widespread use of the PDF/A-3 format, especially as it affects content arriving in collecting institutions. It provides background on the technical development of the specification, identifies specific scenarios under which the format might be used and suggests policy prescriptions for collecting institutions to consider.
For example, the report suggests that for memory institutions, the acceptance of embedded files in PDF/A documents would depend on very specific protocols between depositors and archival repositories that clarify acceptable embedded formats and define workflows that guarantee that the relationship between the PDF document and any embedded files is fully understood by the archival institution.
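As a rough illustration of what such a protocol might look like on the repository side, the sketch below checks a deposit's embedded files against an agreed list of formats and relationship labels. It is not taken from the report: the `EmbeddedFile` record, the `ACCEPTED_FORMATS` and `KNOWN_RELATIONSHIPS` sets, and the `review_embedded_files` function are all hypothetical, and the sketch assumes the embedded-file details have already been extracted from the PDF by some other tool.

```python
from dataclasses import dataclass

# Hypothetical allow-list agreed between a depositor and a repository.
ACCEPTED_FORMATS = {"text/csv", "application/xml", "image/tiff"}

# Hypothetical labels describing how an embedded file relates to the
# primary PDF document; the exact set is an assumption for illustration.
KNOWN_RELATIONSHIPS = {"Source", "Data", "Alternative", "Supplement"}


@dataclass
class EmbeddedFile:
    name: str
    mime_type: str
    relationship: str  # how the file relates to the primary document


def review_embedded_files(files):
    """Return (file name, problem) pairs; an empty list means the deposit passes."""
    problems = []
    for f in files:
        if f.mime_type not in ACCEPTED_FORMATS:
            problems.append((f.name, f"format not accepted: {f.mime_type}"))
        if f.relationship not in KNOWN_RELATIONSHIPS:
            problems.append((f.name, f"relationship not understood: {f.relationship}"))
    return problems


deposit = [
    EmbeddedFile("survey-data.csv", "text/csv", "Data"),
    EmbeddedFile("drawing.dwg", "application/acad", "Unspecified"),
]
for name, problem in review_embedded_files(deposit):
    print(f"{name}: {problem}")
```

A real agreement would likely cover more than this, such as format versions and validation of the embedded files themselves, but the point is that both the acceptable formats and the document-to-attachment relationship are stated explicitly and can be checked at ingest.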
Additionally, the report notes that the complexity of the PDF format and the wide variance in PDF rendering implementations and creating applications suggest that PDF/A-3 may be appropriate for use in controlled workflows, but may not be an appropriate choice as a general-purpose bundling format.
Certainly, the introduction of such a problematic new feature in the latest version of the PDF/A family should press the community of memory institutions into a more strategic, active, and vocal role in the standards development process, and in the PDF/A process specifically.
This report is the latest in a series of NDSA publications and activities that provide insight on a range of digital stewardship issues. We welcome your comments on the PDF/A-3 report in addition to suggestions for areas where the NDSA can continue to provide benefit to the digital stewardship community.
Comments (3)
Many thanks to Butch and the other members of the NDSA Standards and Practices Working Group who authored this blog post and report. This report is crucial to understanding the complex and changing nature of PDF/A.
Thanks, Glen. I do want to acknowledge our former Library of Congress colleague Caroline Arms (now an LC consultant) and Sheila Morrissey of Ithaka for their significant contributions to the report. They really carried the heavy load.
One of the goals I thought the PDF/A format was meant to accomplish was an end to the PDF file corruption that renders files useless. Unfortunately, even the newest PDF/A-3 standard does not address this.
Furthermore, the introduction of additional archive materials in a still very easily corruptible file format just opens the door for more data loss.
What is really needed for long-term archiving is a Robust Document Format (RDF) specification that is backward compatible with the PDF standard. The RDF format would need to address the common causes of corruption in the traditional PDF format, and do so in such a way that a PDF reader can still read the file. And in cases where a file is corrupt, an RDF reader could read and repair the file, or simply read the file, thereby replacing a PDF reader altogether.