The following is a guest post by Carl Fleischhauer, a Digital Initiatives Project Manager in NDIIPP.
The Still Image Working Group within the Federal Agencies Digitization Guidelines Initiative (FADGI) recently posted a comparison of a few selected digital file formats. We sometimes call these target formats: they are the output formats you reformat to. In this case, we are comparing formats suitable for the digitization of historical and cultural materials that can be reproduced as still images, such as books and periodicals, maps, and photographic prints and negatives.
This activity runs in parallel with an effort in the Audio-Visual Working Group to compare target formats for video reformatting, planned for posting in the next few weeks. Meanwhile, there is a third activity pertaining to preservation strategies for born-digital video. The findings and reports from all three efforts will be linked from the format-compare page cited above.
The two comparisons of digitization formats employ similar, matrix-based tables to compare about forty features that are relevant to preservation planning, grouped under the following general headings:
- Sustainability Factors
- Cost Factors
- System Implementation Factors (Full Lifecycle)
- Settings and Capabilities (Quality and Functionality Factors)
The still image format-comparison is a joint effort of the Government Printing Office, the National Archives, and the Library of Congress. The initial posting compares JPEG 2000, “old” JPEG, TIFF, PNG, and PDF, along with several of their subtypes. In time, the findings from this project will be integrated into the Working Group’s continuing refinement of its general guideline for raster imaging.
Speaking for all of the compilers, I will note that we have varying levels of confidence about our findings, and we hope to benefit from the experience and wisdom of our colleagues. (The FADGI site includes a comment page. As I was drafting this blog, we received very helpful comments from colleagues at Harvard University.) The FADGI working group is not alone in parsing this topic. Members of the digital library community discuss the pros and cons of various still image target formats from time to time. During the first week of May this year, for example, there was a vivid exchange in the Digital Curation Google Group.
In this first blog of two, I’ll sketch a bit of background and offer some notes about the tried-and-true TIFF-file-with-uncompressed-picture-data. The second blog will offer some thoughts about JPEG 2000–one motivation for the format comparison was to size up JPEG 2000–and also PNG. We are not aware of any preservation-oriented libraries or archives that employ PNG as their master target format. The absence of experience narratives for this particular application left us with only a moderate level of confidence in this part of our comparisons.
The “which format” question has two dimensions, although it is not clear that these are always carefully attended to. One aspect is the wrapper, what some would call the file format (although that is narrower than the definition provided in the FADGI glossary). TIFF is an archetypal example of a wrapper–you have a header and a handful of structural features–and it can contain a number of different picture-data encodings.
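To give a sense of how thin the wrapper's structural layer is, the eight bytes that open every TIFF file can be parsed with a few lines of standard-library Python. This is only a sketch; the header bytes below are hand-built for the example rather than read from a real scan:

```python
import struct

# A hand-built 8-byte TIFF header: "II" = little-endian byte order,
# 42 = the TIFF magic number, then the byte offset of the first
# image file directory (IFD), here 8 (immediately after the header).
header = b"II" + struct.pack("<HI", 42, 8)

byte_order = header[:2]  # b"II" (Intel order) or b"MM" (Motorola order)
endian = "<" if byte_order == b"II" else ">"
magic, ifd_offset = struct.unpack(endian + "HI", header[2:8])

print(byte_order, magic, ifd_offset)  # b'II' 42 8
```

Everything else in the file, including which picture-data encoding is used, is described by tagged entries in the directory that the final offset points to.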
These days, the most frequently used encoding employed by memory institutions is uncompressed, barely an encoding at all. With uncompressed data, the raster (aka bitmapped) data is stored in a straightforward manner, one sample point after another in a grid. (The term raster connects back to the word rastrum, the name for a five-pointed pen used to draw music staff lines, a tool that resembles a rake and connects us to Latin radere, more or less to scratch or scrape.) Specialists call the sample points where the grid lines intersect picture elements or pixels.
The values stored in the file on a pixel-by-pixel basis may represent grayscale or color information in varying degrees of precision, depending on how many bits are allocated to each pixel. An uncompressed data structure has one powerful strength: it is relatively transparent. It would not be difficult to build a tool to read the wrapper information and also unpack the rasterized data in order to present the image. To be sure, there is a correlative weakness: the lack of compression makes for big files.
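The size penalty is easy to sketch with back-of-the-envelope arithmetic. The page dimensions and scanning resolution below are hypothetical, chosen only to illustrate the calculation:

```python
# Uncompressed raster size = width_px * height_px * bits_per_pixel / 8.
# Hypothetical example: an 8.5 x 11 inch page scanned at 600 dpi.
width_px = int(8.5 * 600)     # 5100 pixels across
height_px = 11 * 600          # 6600 pixels down
pixels = width_px * height_px

gray_bytes = pixels * 8 // 8   # 8-bit grayscale: 1 byte per pixel
rgb_bytes = pixels * 24 // 8   # 24-bit RGB color: 3 bytes per pixel

print(gray_bytes / 1e6, rgb_bytes / 1e6)  # 33.66 100.98 (megabytes)
```

So a single uncompressed color page at that resolution runs to roughly 100 MB before any wrapper overhead, which is why storage costs loom so large in these comparisons.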
Uncompressed TIFF files consume a lot of storage space, and each time you summon one up, it takes a bit of time to read back from the storage media and travel through the network to your display device. Although not extensively used at the Library of Congress, TIFF does support the use of the LZW compression algorithm, which will generally cut the size of a grayscale or color bitmap in half, with a corresponding decrease in transparency.
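The LZW codec itself is not in Python's standard library, but the meaning of "lossless" can be sketched with zlib's DEFLATE, a different but likewise lossless algorithm: decompression returns every sample exactly. The gradient data below is synthetic, so the compression ratio is illustrative only:

```python
import zlib

# Synthetic scan-like data: a smooth grayscale gradient row, repeated.
row = bytes(range(256)) * 20
raster = row * 100                  # 512,000 bytes of sample data

packed = zlib.compress(raster, level=9)

# Lossless: decompressing restores every sample, bit for bit.
assert zlib.decompress(packed) == raster
print(len(packed) < len(raster))    # True: smaller file, nothing lost
```

Lossy codecs like old JPEG make the opposite trade: the round trip does not restore the original samples, which is the crux of the "good enough" debate later in this discussion.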
The TIFF wrapper specification was developed by the Aldus Corporation, with some Microsoft connections, in the 1980s, and moved to Adobe in the 1990s more or less when Adobe bought Aldus. The most recent complete specification, version 6, dates from 1992. It is a very open and well-documented industry standard, i.e., not a capital-S standard from a Standards Developing Body like the International Organization for Standardization (ISO). As the 1992 date indicates, TIFF is a little long in the tooth, although its endurance over time can be seen as a strength, especially considering the wide array of applications that can read it. Worth noting, however, is the fact that the application array is not as deep as one might wish: TIFF files cannot be read natively in most browsers (you typically need a plug-in, but there are plenty around). Apple’s Safari is a notable exception.
Meanwhile, there are schools of thought about embedding metadata in digital files, and digital library folks sometimes debate about what type, how much, and even whether it is a good idea to embed. (This writer is strongly in favor of embedding a “core” chunk, including an identifier that gives folks a bread crumb trail back to, say, a bibliographic record or other metadata in a database.) The TIFF header can carry an identifier although there are differences of opinion as to exactly where and how. But for those keen on what librarians call descriptive metadata, the native TIFF header is not so helpful. Many folks (especially professional photographers) solve the problem by using Adobe’s XMP specification (now an ISO standard) together with the IPTC metadata standard but, at the Library of Congress, we have not yet taken the plunge.
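To make the "where and how" question concrete, here is a minimal standard-library Python sketch of stashing an identifier in TIFF's ImageDescription field (tag 270, ASCII type) and reading it back by walking the image file directory. The identifier string is hypothetical, and a real TIFF would of course also carry the required image tags and the pixel data, all omitted here:

```python
import struct

# Hypothetical identifier to embed; tag 270 is TIFF's ImageDescription
# field (type 2 = ASCII, NUL-terminated, count includes the NUL).
ident = b"loc.gov/item/12345\x00"

header = b"II" + struct.pack("<HI", 42, 8)             # little-endian, magic 42, IFD at byte 8
entry = struct.pack("<HHII", 270, 2, len(ident), 26)   # tag, type, count, offset of string
ifd = struct.pack("<H", 1) + entry + struct.pack("<I", 0)  # 1 entry, no next IFD
tiff = header + ifd + ident                            # string lands at byte 26

# Reading it back: follow the header's offset, then scan the IFD entries.
(ifd_at,) = struct.unpack_from("<I", tiff, 4)
(n_entries,) = struct.unpack_from("<H", tiff, ifd_at)
for i in range(n_entries):
    tag, ftype, count, offset = struct.unpack_from("<HHII", tiff, ifd_at + 2 + i * 12)
    if tag == 270:
        print(tiff[offset:offset + count - 1].decode("ascii"))  # loc.gov/item/12345
```

The mechanics are simple enough; the unsettled part is the policy question of which tag (or which XMP property) everyone should agree to use, so that the bread crumb trail survives from one institution's tools to the next.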
Part Two of this series appeared on Thursday, May 15, 2014.
Excellent post. Very well explained.
Fantastic work and great to see it being shared so widely! Thank you.
Just a quick word on your matrix. We have been recommending the use of JPEG 2000 and PNG images as masters. PNG files have been easy to create and open.
We have not recommended the use of TIFF, partly because of the tagging issue you discuss above, but also because the multipage option is not readily supported, other than in Microsoft viewers.
Great post…I love the historical context for “raster”!
While I appreciate these debates and examinations, it’s important to keep in mind that the digital images are simply representations of “the real thing,” and will never actually be “the real thing”. The choice of image format will never get one closer to the goal of producing an accurate representation of the original. I’m not saying file format choices don’t matter at all, but in the scheme of things using the most ubiquitous, accessible formats will work fine for pretty much any reasonable purpose (I prefer JPEG), and you are far better off spending time and energy considering good imaging equipment and technique…as they say, garbage in, garbage out.
I used to be of the philosophy that one should store TIFF as master and produce JPEGs for access. However, using good cameras, good lighting, good target surfaces, and good post-processing tools can produce JPEGs that are so beautiful, high quality, and high resolution that if done properly one will never find a need to go back to the TIFF at all. I have done amazing printed reproductions from JPEGs as well.
I realize that it is blasphemous to suggest that just storing JPEGs is ok, but in my experience JPEGs can deliver the quality, are small enough to manage and deliver easily, consume a fraction of the storage footprint of TIFFs, can be opened natively in literally every application that deals with images, and can easily be wrapped up into a PDF or converted to other formats (including JPEG2000 if you want to change your delivery format to accommodate tiling or on-the-fly derivative generation – yes, it’s true!).
While JPEG is “old” (though I prefer the word “mature”), the poor-looking low-res JPEGs from older digitization projects on the web have nothing to do with the fact that they are JPEGs and have everything to do with the facts that scanners were not as sophisticated, digital cameras barely existed, monitors were lower resolution, computers were slower and lower-capacity, and dial-up limited the amount of data you could post on a website.
Use whatever format you are most comfortable with, but I just caution against dismissing the merits of JPEG just because it has fewer 1s and 0s, or because “compressed” is often mistakenly interchanged with “low-quality”. If you choose to save uncompressed TIFFs, you will spend a lot of money storing them. If you choose to use JPEG2000, you will spend a lot of time and money making them accessible.
That said, looking forward to the next post!
Thanks for the responses. And special thanks to Mitchell Brodsky for his graceful handling of an exchange of views! I have a friend whose favorite aphorism is, “Everything’s a trade-off,” and that fits very nicely here.
One issue that Mitchell brings to the surface has to do with lossy compression, no matter the encoding. Often (but not always — see below) we are talking about surrogates for the real thing. (Classic example: an Abe Lincoln signed letter, the original of which will be forever conserved.) In such instances, you could apply a “good enough” yardstick.
In my Digital Curation Google Group posting, I said that some folks have looked at various categories of material, e.g., (a) catalog cards, (b) widely held twentieth-century printed matter (think “congressional documents”), (c) maps, (d) manuscripts, and (e) photographs (including negatives). Reading from left to right, one can make the argument that the imaging stakes go up in a progressive manner. Might lossy compression be acceptable for the first one or two in the spectrum, and not for the others? The Library carried out a bit of an exploration of the lossy-vs-lossless question in the mid-1990s but the findings did not take root.
That twenty-year-old project compared old JPEG to uncompressed images and, for certain classes of manuscript, (slightly) lossy JPEGs seemed acceptable. But the context was different. For example, we were then shy about color (we remembered that monochrome had always been accepted for preservation microfilms) and we worried about big files. Today, I think we would likely embrace color for most manuscripts and, if we still accepted lossy, we would be glad to see the added clarity provided by JPEG 2000. (But Mitchell’s right: JP2 costs more to implement–it’s a trade-off!)
There are, however, some categories where the surrogate argument is weak or even wrong. One archetype for this category is magnetic recordings of video and audio, endangered by the risk of deterioration and, more significantly, by the obsolescence of physical formats. It’s getting more and more difficult to find a player for your 2-inch videotape, even if the tape is in great shape. A reformatted copy will be a reproduction, I suppose, but more of a replacement than a reference-service stand-in.
In a slightly different way, there is also something beyond surrogacy in the digitization of a photo negative. Very few seek to “have” the digital image represent the negative-qua-negative. In the analog mode, a photo negative is a, um, bundle of potentiality that is realized in the darkroom as a print, and one print from a given negative may differ from the next. Similarly, the digital entity from a scanned negative can be a kind of rich data set, capable of being output in various ways, ranging from aesthetically pleasing to forensic. (“Can I see into the deep shadows to find where the killer threw the knife?”) In cases like these–magnetic recordings, photo negatives, no doubt others–the stakes are higher and we will be poorly served by the truncations of lossy compression.
Everything’s a trade-off!
Carl – Thank you for your thoughtful response, and for pointing out that there are scenarios for capturing lossless data which I neglected to acknowledge in my JPEG-espousing fervor.
Regarding audio/visual, I completely agree: the physical materials are at such a high preservation risk that we must assume we will not be able to access them again even if we could find the proper equipment to do so. In those cases, the digital surrogate does become a replacement and should be captured in uncompressed, lossless formats. We abide by that philosophy at the New York Philharmonic. I see the argument for photographic negatives as well, though I still believe that imaging equipment is so high-quality these days that one could capture enough data to manipulate the image and detect the hypothetical knife regardless of the output format. (Great idea for a study!)
Paper is different. We know that under the right conditions, the preservation risk of most paper is far lower than that of the digital surrogates (at least for now), and therefore I believe we should digitize paper-based material for the primary goal of easing/increasing access which carries the collateral benefit of protecting the items from further handling. And if that’s truly the motivation, the point is to utilize our digitization budgets to do as much of it as possible.
One problem I see all the time is that in a world of limited resources, institutions are sacrificing their digitization budgets on storing gigantic uncompressed files because they are told that this is the better way — the acceptable way — in all cases. As you say, one size doesn’t fit all, and in cases where it’s ok to just store JPEGs (which nevertheless can be very high resolution), I would rather see institutions do just that, spending their time, money, and energy on producing as many high-quality, low-maintenance digital surrogates as they can. JPEG offers that balance quite nicely.
This discussion has been great food for thought! Thanks!
“keep in mind that the digital images are simply representations of “the real thing,” and will never actually be “the real thing”.”
Aren’t digital images (those photos and works of graphic art created as digital images) considered real things?
David is correct: born digital content is indeed _real_ and the related preservation issues are a topic unto themselves. For example, we have a FADGI subgroup — led by my colleague Kate Murray — assembling a set of case histories that describe how a number of federal agencies are handling born digital video. This interesting report will be published on the site later in summer. Meanwhile, this blog (and the next) concern still image formats-for-reformatting: if you are scanning books, manuscripts, maps, photographs, what are your options for target formats?
Adding to the discussion by Carl and Mitchell – for items where the physical item is at-risk (or already compromised), I would argue for TIFFs over JPEG2000s. For example, my work involves digitizing negatives on cellulose acetate base, many of which are already suffering chemical damage. Given that the digital master will replace the physical item in time, we opt for keeping the maximum information possible in the form of uncompressed TIFF.
In my opinion, digitizing for access and digitizing for preservation have different requirements in terms of file formats, colour management, etc.
Megapixels aren’t everything. You have to think about image size, image quality, and image type. A 7.2 MP machine is capable of big images, but most cameras take small, medium, and large images. Obviously, smaller images will be smaller in size. Most cameras also have settings for image quality, like low, medium, and high. The lower-quality photos will be smaller, and won’t look as good when enlarged. The other thing to consider is that some cameras save image files in formats other than JPG. Those other formats may produce bigger or smaller files. Unless you posted the EXIF file information for both cameras, there would be no way to tell one way or another what is really going on.