A single photograph in a personal collection or archive might be represented by any number of derivative files of varying sizes, in varying formats, all with different sets of embedded metadata. At the bit level, every one of these variations of the photograph is unique. In practice, however, a particular individual or organization might only be interested in holding on to one copy of the image. You can get a sense of the kinds of permutations and variations of digital files we create from Cathy Marshall’s 2010 Code4lib keynote, “People, Their Digital Stuff, and Time” (slides).
An organization can easily have 15 PDFs of the same article, each with a different cover page, but all of which are substantively identical. Again, at the bit level you have 15 unique articles, but if you had a trusted way to identify these 15 PDFs as equivalent, you could store the article once rather than 15 times over the long haul.
How do we go about making these kinds of community-dependent calls on what constitutes equivalent digital objects? How can we better operationalize our ideas about what counts as a significant difference between digital objects for different potential user communities? This is one of my favorite issues identified in the recently released NDSA National Agenda for Digital Stewardship. I thought I would take this opportunity to talk through what I find particularly intriguing about the future of understanding digital equivalence and significance, and to mention some of the approaches that seem promising as ways we might scale up the process of making judgment calls on equivalence.
Here is a bit from the equivalence and significance section of the report that I’m referencing.
Preservation research needs to map out the networks of similarity and equivalence across different instantiations of objects so that they can make better decisions on how to manage content, bearing in mind what properties of a given set of digital objects are significant to their particular community of use. Research is also required in order to characterize quality and fidelity dimensions and create methods for computing format-independent fingerprints of content, so that the fidelity of digital objects can be effectively managed over time.
The report goes on to identify two particularly interesting potential modes for developing ways to identify information equivalence that I thought some readers might like explained in a bit more depth.
Fuzzy Hashing and Degrees of Bit-level Similarity
You may be familiar with the concept of checksums and cryptographic hashes. They are a way to create something like a digital fingerprint for a file or bitstream. With most of the hashing algorithms people use, even very minor differences between two files result in totally different hash values. For instance, two otherwise identical photographs, one cropped by a single pixel, will generate completely different hash values. As a result, these hashes are great at telling us whether two digital objects are exactly the same, but useless at telling us how similar two digital objects are.
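To make that concrete, here is a minimal sketch using Python’s standard hashlib module. The byte strings are stand-ins for real file contents, and the single-byte difference plays the role of the one-pixel crop.

```python
import hashlib

# Stand-ins for two file bitstreams that differ by a single byte,
# playing the role of an original image and a near-identical derivative.
original = b"example image payload" + b"\x10"
variant = b"example image payload" + b"\x11"

print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(variant).hexdigest())
# The two digests have nothing in common: cryptographic hashes can confirm
# exact bit-level identity, but they say nothing about degrees of similarity.
```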
In contrast, there are techniques for fuzzy hashing that attempt to identify the percentage of similarity between the bitstreams of two files. There is considerable potential for applying some of the work on fuzzy hashing to help digital content stewards make decisions about which minor differences between files do and don’t matter. An interview about the National Software Reference Library from last year discussed some of the work going on there on similarity digests that fits into this same area of research. In short, there are already algorithms out there we could be using to better understand, at the bit level, how similar or different a set of files are, and there is considerable potential to apply these (and future algorithms) in curatorial workflows.
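As a rough illustration, here is what such a comparison might look like using the python-ssdeep bindings for the ssdeep fuzzy-hashing tool (the same tool the UK Web Archive mentions in the comments below). The file names are hypothetical placeholders.

```python
import ssdeep  # third-party bindings for the ssdeep fuzzy-hashing library

# Hypothetical example: two PDFs of the same article with different cover pages.
with open("article_cover_a.pdf", "rb") as f:
    hash_a = ssdeep.hash(f.read())
with open("article_cover_b.pdf", "rb") as f:
    hash_b = ssdeep.hash(f.read())

# compare() returns a 0-100 match score rather than a strict yes/no answer,
# giving a rough sense of how much of the two bitstreams overlaps.
print(ssdeep.compare(hash_a, hash_b))
```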
Comparing Rendered Content Algorithmically
Along with looking at bit-level patterns, there are a range of promising approaches to analyzing and interpreting rendered content. For example, some image search systems will now give you the option to view similar or related images based on the qualities of the rendered photo. Beyond image comparison, the same approaches have the potential to identify similarity across audio, video, and text files. Tools that could identify similar digital objects in these ways would be invaluable both for selection and for creating metadata about the relationships and connections between objects. All of this work on similarity has the potential to generate that kind of descriptive metadata and to power visual interfaces for exploring those relationships and connections.
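One simple, well-known example of this kind of rendered-content comparison is an average perceptual hash: shrink the rendered image to a tiny grayscale grid and record which cells are brighter than the mean. The sketch below uses the Pillow imaging library, and the file names are hypothetical; it is one illustrative technique, not a specific tool endorsed in the agenda.

```python
from PIL import Image  # Pillow imaging library


def average_hash(path, size=8):
    """Reduce a rendered image to a tiny grayscale grid and record which
    cells are brighter than the mean. Visually similar images produce
    similar bit patterns even when their files differ at the byte level."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hamming_distance(a, b):
    """Count differing bits; a small distance suggests similar renderings."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical files: a master TIFF and a resized JPEG derivative of it.
distance = hamming_distance(average_hash("master.tif"),
                            average_hash("derivative.jpg"))
print(f"Hamming distance: {distance} (0 means visually near-identical)")
```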
Both approaches, identifying similarity at the bit level and identifying similarity in rendered digital content, offer considerable potential value to stewards of digital content. Beyond continued basic research in these areas, there is a need to begin translating existing work into tools and workflows for stewards of digital collections. In this respect, there is considerable potential for work exploring how to apply these different approaches to similarity in particular collection use cases. Applying these ideas of similarity in different situations will ultimately help us unpack the relationship between content similarity and the significant properties of particular sets of objects in particular stewardship and use contexts.
Comments (2)
The integrity of the piece was lost at the first sentence, “A single photograph in a personal collection or archive might be represented by any number of derivative files of varying sizes, in varying formats, all with different sets of embedded metadata.” I address the reality of Metadata often when I write, clearly in my submission on Orphan Works filed with Maria’s office back in February.
As fast as Authors are putting their Metadata into images/ARTS works, the Metadata is being removed by Technological advancements. HENCE technology is creating the very ORPHAN WORKS that technology wants to profit from….
I am hoping next week my site for my Center focused on PROTECTING PROPERTY rights will be up and running. I would like to link your piece to it as a lesson on how the ARTS experience is not factored in to the reality of Proponents of Preservation etc. So we are clear, ARTS want to work with you but while you are getting a weekly paycheck to do your job, we bet on ourselves and hence ‘delay’ payment that now is being taken without compensation for purposes we did not approve or know of until too late.
So when you are ready to work a week without a paycheck, then we will ponder your request for us to give our ARTS away for free
Carrie Devorah
One of the Original Members of LIMA
An Advocate for Copyright Integrity
http://www.carriedevorah.wordpress.com search ‘ Copyright ‘
“Teaching Legislators One Hearing At A Time the Value of 2D IP to the Arts”
At the UK Web Archive, we’re currently experimenting with using the ssdeep fuzzy-hash of the textual content of web resources as a fingerprint to track significantly similar content and non-identical duplicates. Too early to say how well this works, but we’ll blog about/publish the results as and when we can.
On render comparisons, the SCAPE project is developing a few tools in this area. In particular, the Pagelyzer is intended to characterise and compare renderings of web pages.
Note that, in principle, these two approaches could be combined: the features of the rendered image used for comparison can instead be used to construct a characteristic string which would then be fuzzy-hashed. This should make it possible to find images with similar features, for example.
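A very rough sketch of what I mean, assuming a downscaled grayscale grid as the rendered feature and ssdeep as the fuzzy hash (both purely illustrative choices, as are the file names):

```python
import ssdeep           # fuzzy hashing
from PIL import Image   # rendered features via Pillow


def feature_string(path, size=16):
    """Serialize a coarse grayscale rendering into a string of pixel values.
    This stands in for whatever rendered features are actually chosen."""
    img = Image.open(path).convert("L").resize((size, size))
    return ",".join(str(p) for p in img.getdata())


# Fuzzy-hash the feature strings rather than the raw files, so the comparison
# reflects rendered content instead of byte-level layout or format.
hash_a = ssdeep.hash(feature_string("photo_a.jpg").encode("utf-8"))
hash_b = ssdeep.hash(feature_string("photo_b.png").encode("utf-8"))
print(ssdeep.compare(hash_a, hash_b))
```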