Upgrading Image Thumbnails… Or How to Fill a Large Display Without Your Content Team Quitting

The following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library.

Preservation is usually about maintaining as much information as possible for the future but access requires us to balance factors like image quality against file size and design requirements. These decisions often require revisiting as technology improves and what previously seemed like a reasonable compromise now feels constricting.

I recently ran into an example of this while working on the next version of the World Digital Library website, which still has substantially the same look and feel as it did when the site launched in April of 2009. The web has changed considerably since then with a huge increase in users on mobile phones or tablets and so the new site uses responsive design techniques to adjust the display for a wide range of screen sizes. Because high-resolution displays are becoming common, this has also involved serving images at larger sizes than in the past — perfectly in keeping with our goal of keeping the focus on the wonderful content provided by WDL partners.

When viewing the actual scanned items, this is a simple technical change to serve larger versions of each but one area posed a significant challenge: the thumbnail or reference image used on the main item page. These images are cropped from a hand-selected master image to provide consistently sized, interesting images which represent the nature of the item – a goal which could not easily be met by an automatic process. Unfortunately the content guidelines used in the past specified a thumbnail size of only 308 by 255 pixels, which increasingly feels cramped as popular web sites feature much larger images and modern operating systems display icons as large as 256×256 or even 512×512 pixels. A “Retina” icon is significantly larger than the thumbnail below:

Icon SizesGoing back to the source

All new items being processed for WDL now include a reference image at the maximum possible resolution, which the web servers can resize as necessary. This left around 10,000 images which had been processed before the policy changed and nobody wanted to take time away from expanding the collection to reprocess old items. The new site design allows flexible image sizes but we wanted to find an automated solution to avoid a second-class presentation for the older items.

Our original master images are much higher resolution and we had a record of the source image for each thumbnail but not the crop or rotation settings which had been used to create the original thumbnail. Researching the options for reconstructing those settings lead me to OpenCV, a popular open-source computer vision toolkit.

At first glance, the OpenCV template matching tutorial appears to be perfect for the job: give it a source image and a template image and it will attempt to locate the latter in the former. Unfortunately, the way it works is by sliding the template image around the source image one pixel at a time until it finds a close match, a common approach but one which fails when the images differ in size or have been rotated or enhanced.

Fortunately, there are far more advanced techniques available for what is known as scale and rotation invariant feature detection and OpenCV has an extensive feature detection suite. Encouragingly, the first example in the documentation shows a much harder variant of our problem: locating a significantly distorted image within a photograph – fortunately we don’t have to worry about matching the 3D distortion of a printed image!

Finding the image

The locate-thumbnail program works in three steps:

  1. Locate distinctive features in each image, where features are simply mathematically interesting points which will hopefully be relatively consistent across different versions of the image – resizing, rotation, lighting changes, etc.
  2. Compare the features found in each image and attempt to identify the points in common
  3. If a significant number of matches were found, replicate any rotation which was applied to the original image
  4. Generate a new thumbnail at full resolution and save the matched coordinates and rotation as a separate data file in case future reprocessing is required

You can see this process in the sample visualizations below which have lines connecting each matched point in the thumbnail and full-sized master image:

The technique even works surprisingly well with relatively low-contrast images such as this 1862 photograph from the Thereza Christina Maria Collection courtesy of the National Library of Brazil where the original thumbnail crop included a great deal of relatively uniform sky or water with few unique points:

Scaling up

After successful test runs on a small number of images, locate-thumbnail was ready to try against the entire collection. We added a thumbnail reconstruction job to our existing task queue system and over the next week each item was processed using idle time on our cloud servers. Based on the results, some items were reprocessed with different parameters to better handle some of the more unusual images in our collection, such as this example where the algorithm matched only a few points in the drawing, producing an interesting but rather different result:

Reviewing the results

Automated comparison

For the first pass of review, we wanted a fast way to compare images which should be very close to identical. For this work, we turned to libphash which attempts to calculate the perceptual difference between two images so we could find gross failures rather than cases where the original thumbnail had been slightly adjusted or was shifted by an insignificant amount. This approach is commonly used to detect copyright violations but it also works well as a way to quickly and automatically compare images or even cluster a large number of images based similarity.

A simple Python program was created and run across all of the reconstructed images, reporting the similarity of each pair for human review. The gross failures were used to correct bugs in the reconstruction routine and a few interesting cases where the thumbnail had been significantly altered, such as this cover page where a stamp added by a previous owner had been digitally removed:

7778 original7778 reconstructed

 

 

 

 

 

 

 

 

http://www.wdl.org/en/item/7778/ now shows that this was corrected to follow the policy of fidelity to the physical item.

Human review

The entire process until this point has been automated but human review was essential before we could use the results. A simple webpage was created which offered fast keyboard navigation and the ability to view sets of images at either the original or larger sizes:

Screen Shot 2014-08-03 at 18.42.23This was used to review items which had been flagged by phash as less than matching below a particular threshold and to randomly sample items to confirm that the phash algorithm wasn’t masking differences which a human would notice.

In some cases where the source image had interacted poorly with the older down-sampling, the results are dramatic – the reviewers reported numerous eye-catching improvements such as this example of an illustration in an Argentinian newspaper:

Illustration from “El Mosquito, March 2, 1879″ (reconstructed).

 

Conclusion

This project completed towards the end of this spring and I hope you will enjoy the results when the new version of WDL.org launches soon. On a wider scale, I also look forward to finding other ways to use computer-vision technology to process large image collections – many groups are used to sophisticated bulk text processing but many of the same approaches are now feasible for image-based collections and there are a number of interesting possibilities such as suggesting items which are visually similar to the one currently being viewed or using clustering or face detection to review incoming archival batches.

Most of the tools referenced above have been released as open-source and are freely available:

Duke’s Legacy: Video Game Source Disc Preservation at the Library of Congress

The following is a guest post from David Gibson, a moving image technician in the Library of Congress. He was previously interviewed about the Library of Congress video games collection. The discovery of that which has been lost or previously unattainable is one of the driving forces behind the archival profession and one of the […]

Making Scanned Content Accessible Using Full-text Search and OCR

The following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library. We live in an age of cheap bits: scanning objects en masse has never been easier, storage has never been cheaper and large-scale digitization has become routine for […]

Web Archiving and Preserving the Performing Arts in the Digital Age

The following is a guest post from Gavin Frome, an intern for the Web Archiving Team at the Library of Congress. Performing artists are by necessity a traveling people. They journey far and wide in the pursuit of their respective crafts, working, learning and weaving a fabric of loose cultural connections that help bind people […]

Recommended Format Specifications from the Library of Congress: An Interview with Ted Westervelt

Continuing the NDSA Insights interview series, I am thrilled to talk about the new Library of Congress Recommended Format Specifications with Ted Westervelt, head of acquisitions and cataloging for U.S. Serials – Arts, Humanities & Sciences at the Library of Congress. Ted has been overseeing the development of the Recommended Format Specifications. While the specifications […]

National Digital Stewardship Residents Conclude Program with Capstone Meeting

The following is a guest post by Kris Nelson, Program Management Specialist at the Library of Congress and Program Coordinator of the National Digital Stewardship Residency. A version of this article was originally published in the Library of Congress weekly staff newspaper The Gazette. The National Digital Stewardship Residency concluded the inaugural year of the […]

Saving the Date: Exploring Calendar and Scheduling Formats

Each January, my family picks out a new wall calendar to hang in our kitchen. Its main appeal these days is nostalgic decoration since we no longer use it to write down our appointments or important dates. Like many people, we now rely on electronic calendar and scheduling tools built into personal information manager software […]

Shaking the Email Format Family Tree

Recently, we’ve started to add email formats to the Sustainability of Digital Formats website. Eventually, when we get a more robust collection, we’d like to split them out into a separate content category but for now, they (mostly) are categorized with their closest cousin, the Textual Content family.  Our genealogical research is still very much […]