Digital Preservation-Friendly File Formats for Scanned Images

From a preservation standpoint, some digital file formats are better than others.  The basic issue is how readable a format remains over the course of time and successive waves of technological change.  The ideal format will convey its content accurately regardless of advances in hardware, software and other aspects of information technology.

FILE 2009, by Andre Deak, on Flickr

Over the last several years, the Library has developed a web resource to help guide preservation-optimal choices in selecting file formats.  That resource, "Sustainability of Digital Formats: Planning for the Library of Congress Collections," outlines a number of sustainability factors that bear on how effective formats are expected to be for long-term preservation.

The factors are listed below, in brief.

  • Disclosure. Degree to which complete specifications and tools for validating technical integrity exist and are accessible to those creating and sustaining digital content.
  • Adoption. Extent of acceptance by the primary creators, disseminators or users of information resources.
  • Transparency. Openness to direct analysis with basic and non-proprietary tools.
  • Self-documentation. Inclusion of metadata needed to render the data as usable information or understand its context.
  • External dependencies. Degree to which a particular format depends on particular hardware, operating system, or software for rendering or use and the predicted complexity of dealing with those dependencies in future technical environments.
  • Impact of patents. Extent to which licenses may inhibit the ability of archival institutions to sustain content.
  • Technical protection mechanisms. Embedded capabilities to restrict use in order to protect the intellectual property.

Application of these factors to current format choices has led to identification of different flavors of TIFF and JPEG 2000 as preferred choices for scanned digital images.  Also in the mix is PDF/A-1, PDF for Long-term Preservation.
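
As a small illustration of the transparency factor, the formats named above can be recognized from their first few bytes with nothing more than a general-purpose scripting language. The sketch below is a minimal example in Python, not a Library tool: the names SIGNATURES, sniff_format and sniff_file are hypothetical, and it only checks well-known leading signatures. It cannot tell PDF/A from ordinary PDF, and it does not validate the file.

```python
# Hypothetical helper for illustration only; not part of any Library of Congress tool.
# Each entry pairs a well-known leading byte signature with a format label.
SIGNATURES = [
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (bytes.fromhex("0000000C6A5020200D0A870A"), "JPEG 2000 (JP2)"),
    (b"%PDF-", "PDF (PDF/A conformance needs a separate check)"),
    (b"\xff\xd8\xff", "JPEG"),
]


def sniff_format(header: bytes) -> str:
    """Return a format label based on the leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 16 bytes of a file and identify it by signature."""
    with open(path, "rb") as f:
        return sniff_format(f.read(16))
```

A signature match only suggests what a file claims to be. Confirming that a file is well-formed, or that a PDF actually meets PDF/A-1, requires a real validator.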

The Library is also working with the Federal Agencies Digitization Guidelines Initiative to define common guidelines, methods and practices to digitize historical content in a sustainable manner.  The Federal Agencies Still Image Digitization Working Group, a subsection of the larger initiative, is concentrating its efforts on image content such as books, manuscripts, maps and photographic prints and negatives.

11 Comments

  1. Chris Rusbridge
    October 13, 2011 at 6:21 am

    I am always surprised by the choice of JPEG2000 given the added risk of file loss through minor corruption, exposed by Heydegger, V. (2008). Analysing the Impact of File Formats on Data Integrity. Proceedings of Archiving 2008. Bern, Switzerland. Retrieved from http://old.hki.uni-koeln.de/people/herrmann/forschung/heydegger_archiving2008_40.pdf (via Richard Wright of the BBC)

  2. Bill LeFurgy
    October 18, 2011 at 4:40 pm

    Chris–Thanks for your comment. I’ve asked Steve Puglia and Carl Fleischhauer to weigh in, and their response is below.

    Others have conducted similar analyses of JPEG 2000 robustness and have seen similar results in terms of susceptibility to corruption. Nevertheless, some of these organizations have concluded that JPEG 2000 is an appropriate file format choice from a robustness perspective. In "A Format for Digital Preservation of Images," Buonora and Liberati "…conclude that JPEG 2000 is a good current solution for our digital repositories." It is also worth noting that the format includes some “resiliency” elements that add robustness and thereby counteract some effects of data loss. These resiliency elements are described in the notes at the bottom of this page in the Library’s Format Sustainability website.

    Digital preservation is complex, and more than just file format susceptibility to corruption needs to be considered. In other words, just because a file format is less susceptible to corruption, it does not mean the file is preserved. Given the sheer volume of digital photographs being produced, currently up to 375 billion per year (orders of magnitude more files than are being produced by digitization efforts), and probably almost all of them are JPG files, the answer for digital preservation is not going to be insisting all image files be saved as uncompressed formats.

    Data corruption is and will remain a problem. An active part of digital preservation will be to overcome this problem. The LOCKSS concept includes an approach for dealing with the problem: “…the bits and bytes are continually audited and repaired…to protect fragile digital content for the very long time.” LOCKSS now has a 12-year track record.

  3. Ed Summers
    October 19, 2011 at 9:43 am

    My main beef with JPEG2000 is the lack of good open source tools that support it. It’s my impression that the patent situation around JP2 has inhibited tool development and widespread use. Given the narrow dispersion of the tool support and the complexity of the format, I would actually characterize it as a preservation risk.

  4. Roger Howard
    October 20, 2011 at 10:32 am

    “Given the sheer volume of digital photographs being produced, currently up to 375 billion per year (orders of magnitude more files than are being produced by digitization efforts), and probably almost all of them are JPG files, the answer for digital preservation is not going to be insisting all image files be saved as uncompressed formats.”

    So why not recommend JPEG? – far and away the most well-supported, well-understood digital image file format in history, and by far the one most likely to remain supported nearly anywhere pixels are processed.

  5. Andrew Jackson
    October 24, 2011 at 5:20 am

    I believe JP2 can work well, but should be approached with caution. Some ambiguities in the initial standard led to various commercial and open source tools doing the wrong thing with the physical resolution and the colour profile. See http://www.openplanetsfoundation.org/blogs/2011-06-06-paper-jpeg-2000-preservation-9 for details. Progress is being made (e.g. updating the standard) but some tools are lagging behind and have not been improved.

    As for JPG, I think one of the reasons people prefer JP2 for new content is because the compression loss-rate can be very finely tuned, or reduced to zero. Also the tiling/precinct feature assists in presentation. However, I don’t think anyone is suggesting that JPG files are migrated to JP2. That process would be a much greater source of preservation risks than simply supporting JPG which is hardly going to go obsolete anytime soon and has great open source support.

  6. Rick Wiggins
    October 31, 2011 at 12:28 pm

    For our Electronic Theses and Dissertations, we would like our students to save their documents in PDF/A format. Students using Mac OS X can print to PDF and PDF/X, but there is nothing on Apple’s web site that I can find about their support for PDF/A. I know that Adobe Acrobat X will convert PDF to PDF/A, but this requires the student to purchase a piece of software. I wonder if there is any way to “suggest” to Apple that we really need them to support PDF/A in addition to their existing support for PDF/X?

  7. Nandita Chaudhri
    January 30, 2012 at 6:07 am

    I have heard that digital scanned images can be converted to an XML based format which is suitable for long term preservation. One knows of metadata in XML but what about the actual images? Is there any Open Source standard or work towards one which I should know about when considering long term archival?

  8. Bill LeFurgy
    January 30, 2012 at 12:33 pm

    My expert colleagues provide the response below to your question.

    Conceptually, it is possible to encode a raster or pixel-based image using XML to store the brightness and/or color values for each pixel.

    However, compared to a traditional binary file, an XML or other character-based file will be substantially larger in terms of stored file size for the same size digital image (same number of pixels). In other words, it takes a lot of character data to represent a pixel-based image, much more data overall than a traditional binary file format like TIFF.

    At this time, we are not aware of any XML-based raster image file formats. There are vector image file formats that are XML-based, like SVG, but none for raster images.

    So at this time, using XML to encode raster images remains only an interesting theoretical option, but is not practical.

  9. Nandita Chaudhri
    January 31, 2012 at 12:19 am

    Thanks so much, Bill.

  10. Ann Farris
    March 14, 2013 at 8:10 pm

    I was hoping to find information on the best way to scan negatives dating back to the 1930s. We are beginning this process and want to be sure we are using the most effective method. Do you know where this is addressed?

  11. Bill LeFurgy
    March 15, 2013 at 10:00 am

    Ann: It probably wasn’t your intention, but your question perfectly illustrates the intended point of my post! So many people need guidance on digitization and it can be hard to get a comprehensive overview of the subject, which from my perspective includes long term preservation of the digital copies. I suggest you check out http://dpbestflow.org/camera/camera-scanning . It does a good job presenting options for camera copying, as well as digital preservation concerns. Good luck!
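
The size argument in comment 8 can be checked with a little arithmetic. The sketch below, in Python, compares a raw one-byte-per-pixel encoding of a small grayscale image with a character-based encoding of the same pixels. The XML layout (one `<p>` element per pixel inside an `<image>` wrapper) is invented purely for illustration, since the comment notes that no XML-based raster format was known, and the function names are hypothetical.

```python
# Illustrative comparison only: the XML layout below is made up for this sketch
# and does not correspond to any real image format.

def make_pixels(width: int, height: int) -> list[int]:
    """Build a deterministic 8-bit grayscale gradient as a flat list of values."""
    return [(x + y) % 256 for y in range(height) for x in range(width)]


def raw_size(pixels: list[int]) -> int:
    """Size in bytes when each 8-bit pixel is stored as a single byte."""
    return len(bytes(pixels))


def xml_size(pixels: list[int], width: int, height: int) -> int:
    """Size in bytes when each pixel is written as a <p>value</p> element."""
    body = "".join(f"<p>{value}</p>" for value in pixels)
    doc = f"<image width='{width}' height='{height}'>{body}</image>"
    return len(doc.encode("utf-8"))


if __name__ == "__main__":
    w, h = 64, 64
    px = make_pixels(w, h)
    print("raw bytes:", raw_size(px))
    print("xml bytes:", xml_size(px, w, h))
```

Each pixel costs seven bytes of markup plus one to three digits in this layout, so the character-based version is at least eight times larger than the raw bytes before any compression is considered. A more compact text layout would narrow the gap, but not close it.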
