PDF is Here to Stay: Archiving with the Portable Document Format

Today’s guest post is from Kate Murray (Digital Projects Coordinator, Digital Collections Management and Services Division, Library of Congress), Duff Johnson (Executive Director, PDF Association / ISO Project Leader, ISO 32000), and Kevin De Vorsey (Senior Electronic Records Policy Analyst, Records Management Policy and Standards, National Archives and Records Administration).

PDF in the Federal Archiving Community

As the world’s largest library, the Library of Congress has a wide variety of digital file formats in its collections, from MXF to XML, GIF to TIFF, WAV to WARC and many more. The various flavors of PDF (Portable Document Format, with the .pdf extension) make up a significant share of these holdings: about 110 TB, or over 100 million files, currently in digital storage and growing, including many gathered by harvesting government web archives.

As discussed in a 2017 blog post, the Library is both a collector and a creator of PDF files, which come into Library collections in a number of ways. The Recommended Formats Statement (RFS) includes high quality PDF files, with features such as searchable text, embedded fonts, lossless compression, and high resolution images, as a preferred format for textual works in digital form, electronic serials, digital musical compositions, and accompanying image/text files for digital audio. The RFS also includes web optimized PDF as an acceptable format for this same content as well as for other graphic images – digital. PDF files also make their way into Library collections through the Copyright Office program for group registration of newspapers as well as through projects like the National Digital Newspaper Program.

The National Archives and Records Administration (NARA) is an independent agency established in 1934 to identify, protect, preserve, and make publicly available the historically valuable records of all three branches of the Federal government. NARA manages the federal government’s archives, administers a system of Presidential Libraries, and provides oversight of federal agencies’ records management activities. Agencies use a wide variety of technologies, applications, and file formats when creating records to accomplish their missions. Of these records, a small percentage are identified as having permanent value and are eventually accessioned into the National Archives. To help agencies create electronic records that can be effectively managed over long periods of time, NARA issued guidance on acceptable file formats for permanent electronic records. NARA selected formats based on their ubiquity, sustainability, and flexibility. PDF is accepted in five of the sixteen record types, more than any other format: computer aided design (CAD), scanned text, digital posters, presentation formats, and textual data.

The identification of PDF as acceptable for use with diverse record types reflects its exceptionally wide adoption by federal agencies. From the U.S. Courts, which require that case files be submitted as PDF files, to the Internal Revenue Service (IRS) and the Department of State, which both use PDF forms to gather data and PDF documents to disseminate published information, PDF is everywhere in the federal government.

NARA and the Library of Congress recognize that the complexity of the format, the variability of content that can be captured in PDF, and the extremely high volume of PDF documents already in collections, as well as those still to come, present challenges. The option to “Save as PDF” is baked into many diverse content-generating software programs. It’s up to user communities to make our requirements for preservation known and campaign for solutions to satisfy them.

One way staff at both agencies mitigate risks is by participating in the maintenance of the standards that govern PDF and its subset formats through involvement in ISO TC 171 SC2, as well as by engaging with the vendor community directly through membership in the PDF Association. These connections facilitate advocacy for the needs of the records management and cultural heritage communities.

Why PDF Matters

Due to its flexibility, PDF is able to meet a wide range of business needs and has seen adoption across diverse domains. As a result, PDF is an inevitable and valuable part of the archiving landscape.

PDF technology uniquely combines a very specific set of features and attributes that have made it a “go to” electronic document format since the late 1990s. The Portable Document Format delivers formatted text and graphics with guaranteed precision, but goes far beyond mere digital paper with support for metadata, JavaScript, video, 3D and geospatial data, among others. A single PDF page could accurately represent thousands of square kilometers on a map or include complete layered schematics for an integrated circuit.

These features allow PDF to appeal to numerous individual horizontal marketplaces such as printing, publishing and documentation as well as the financial, legal and engineering worlds.

PDF is popular because it is powerful enough to meet diverse needs, has a low barrier to use, and is always free to view. It delivers the goods in a single, self-contained and transactable object, independent of access, bandwidth, operating system, CSS, fonts, reader software and many other possible variables.

Specification

Underlying PDF’s technical flexibility and commercial success is the most fundamental feature of all: PDF’s open and democratically managed specification.

The PDF family of formats was first developed by Adobe, which published the core PDF reference specification at no charge. Various subset specifications of PDF were adopted as ISO international standards, such as PDF/X (ISO 15930) in 2001 and PDF/A (ISO 19005) in 2005, before PDF version 1.7 itself became ISO 32000-1 in 2008.

PDF specifications and subsets over time, from the Library of Congress Sustainability of Digital Formats website.

PDF/A is a subset of the PDF specification originally published 15 years ago to ensure the long-term preservation of electronic documents. Developed prior to the availability of dedicated preservation repositories, it originally focused on increasing the sustainability of a PDF file by restricting features that could pose a risk to the future rendering of content. As a result, encryption, JavaScript and embedded files were all forbidden in early documents. Digital preservation theory and practice have evolved since PDF/A was first published. Newer versions of the PDF/A standard, in addition to keeping up with current-generation PDF technology, allow support for some previously forbidden features whose risks can be mitigated by other means. Digital preservation tools for PDF/A include veraPDF, an industry supported, free, open source file format validator for all PDF/A parts and conformance levels.
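As a rough illustration of how a PDF/A file announces itself, the sketch below looks for the `pdfaid:part` and `pdfaid:conformance` properties that PDF/A files carry in their XMP metadata. The function name `claimed_pdfa_flavour` is hypothetical, and the byte scan assumes the metadata is stored as plain UTF-8 text. It reports only what a file claims to be; confirming that the claim is true is the job of a validator such as veraPDF.

```python
import re

# XMP can serialize a property as an element (<pdfaid:part>1</pdfaid:part>)
# or as an attribute (pdfaid:part="1"); these patterns accept both forms.
_PART = re.compile(rb'pdfaid:part(?:>|=["\'])\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance(?:>|=["\'])\s*([A-Za-z])')


def claimed_pdfa_flavour(data: bytes):
    """Return the PDF/A flavour the bytes claim (e.g. '1b', '2u'), else None.

    This is a heuristic scan, not a parser: it assumes uncompressed,
    UTF-8 encoded XMP and does no validation of the file itself.
    """
    part = _PART.search(data)
    if part is None:
        return None
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return part.group(1).decode("ascii") + level
```

In practice the bytes would come from something like `open(path, "rb").read()`, and a `None` result simply means no PDF/A claim was found by this scan.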

PDF is ubiquitous in part because developers can acquire and implement PDF’s “cookbook” themselves and make PDF files that are fully usable irrespective of software. PDF viewing software is freely distributed and bundled with most computers, phones, tablets and web browsers. However, even though PDF technology itself is fully transparent, that does not imply that PDF software – including commonplace open source software – is equally capable… or even up-to-date with the latest version of the specification.

Although software can vary, it can also be readily updated. The openness of PDF guarantees fundamental vendor and system independence, a key criterion for long-term preservation.

The Need for Policy

As Peter Parker learned when he became Spider-Man, with great power comes great responsibility. PDF is indeed powerful and, with a base specification of about 1000 pages, complex. Errors of interpretation and software bugs may result in non-conforming files that render perfectly for the creator but nonetheless fail validation on ingest into a repository or, worst of all, fail to open in other applications. Additionally, PDF includes support for many features that collecting institutions may deem to be problematic. Accordingly, PDF’s use for archival purposes should be guided by institutional policies that address potential problems.

For example, PDF files, like many other formats, may be password-encrypted by a user possessing basic PDF software. This useful (and often critical) feature is typically viewed as anathema to digital preservation and to the archival mission in general. Likewise, PDFs created without embedded fonts or color profiles are potential risks to preservation. These factors generally reflect intentional decisions by the author or limitations in the source application or PDF creation software rather than indicating a deficiency of the format. Institutions responsible for preserving externally created PDFs should develop policies and transfer instructions to help minimize problematic content. At NARA, policies are expressed as transfer instructions that apply to any format that provides support for a feature such as font embedding or encryption. The Library’s Recommended Formats Statement has similar statements about digital rights management technologies or encryption (see, for example, textual works).
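To make the idea of a transfer-time policy check concrete, here is a minimal sketch of a pre-check an institution might run before accepting a file. The function name `transfer_precheck` and the result keys are hypothetical and do not represent NARA or Library tooling. It relies on two facts about the format: a PDF begins with a `%PDF-x.y` header, and an encrypted PDF carries an `/Encrypt` entry in its trailer dictionary. Because it scans raw bytes rather than parsing the file, a hit is only a flag for closer review.

```python
import re

# The header normally sits at the very start of the file; readers commonly
# tolerate some leading bytes, so search a small window instead.
_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def transfer_precheck(data: bytes) -> dict:
    """Flag basic policy concerns in raw PDF bytes (heuristic only).

    'version' is the version declared in the header, or None if no
    header was found. 'possibly_encrypted' is True when the /Encrypt
    key appears anywhere in the bytes, which can produce false positives.
    """
    header = _HEADER.search(data[:1024])
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "possibly_encrypted": b"/Encrypt" in data,
    }
```

A check such as font embedding requires actually parsing the document's objects, so it is left to a full PDF library or validator rather than a byte scan like this one.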

The PDF Association

The original purpose of the PDF Association’s forerunner, the PDF/A Competence Center, was to provide a forum in which vendors could determine a common interpretation of PDF/A. Today, the PDF Association is an international non-profit with over 130 member organizations in 23 countries, including both the Library of Congress and the National Archives. Its PDF Validation Technical Working Group supervises the veraPDF Test Suite for the industry-supported PDF/A validator; the technical and application notes it publishes are authoritative commentaries on the specification’s language.

In addition to serving as a meeting place for PDF technology developers and promoting PDF technology, the PDF Association is a locus for interaction between user communities (such as accessibility specialists or government agencies) and the software developers who guide development of PDF technology through industry and ISO deliberations and processes. PDF Days events, such as the one held at NARA in 2018, are prime examples of this interaction and collaboration between PDF users and the vendor community.

Engaging Industry

Although preserving the author’s intent in a final-form document remains the fundamental value proposition of the format, PDF technology development ultimately responds to user requirements beyond cultural preservation – even beyond archiving in general. From video to forms and JavaScript, from passwords to layers and 3D data, PDF’s scope continues to move and grow because customers demand new capabilities as they seek to leverage PDF’s existing strengths in new applications.

The PDF Association is a gateway to subject matter expertise and software developers, including those who create and maintain the ISO specifications for PDF. PDF Association members enjoy direct access to draft ISO documents and may contribute comments for review by the various ISO working groups. The organization welcomes active engagement from third-party organizations, including NARA and the Library of Congress.

Conclusion

PDF is a ubiquitous format for electronic documents, as evidenced by the many millions of PDF files in the collections of the Library of Congress and the holdings of the National Archives and Records Administration. The inclusion of new features in recent and upcoming versions of PDF makes it likely that it will see adoption in additional domains and be used with new types of electronic content. The Library of Congress and NARA are taking an active role in helping shape the future of PDF by working to ensure that the standards committee and vendor community understand the archival requirements for long-term preservation.
