The following is a guest post by Kate Zwaard, a Supervisory Information Technology Specialist in the Library of Congress Office of Strategic Initiatives.
The Library of Congress and its partners continue to work on ways to help users communicate and evaluate trustworthiness of the electronic material they are accessing.
The risk of alteration is higher for some content than for others. It wouldn’t be worth the time and effort it would take to alter this webpage, for example, because the pay-off would be low. But you could imagine some monetary incentives for illegally altering a federal regulation or law. That’s why government agencies, libraries and archives are so invested in thinking of ways to secure the chain of custody for electronic material.
I like to describe content authentication programs as consisting of two parts: 1) tools and evidence we provide to users so they can assure themselves that they’re looking at the “real” thing and 2) tools the stewards of the information use to reduce the risk of content being altered. A cornerstone of both is the methods we use to determine fixity, that is, to confirm that a file has not been accidentally or purposefully changed.
What is a Hash?
To make things more confusing, you’ll hear people use the following terms somewhat interchangeably: digital fingerprint, hash code, hash value, checksum, cyclic redundancy check, cryptographic hash, message digest, hash sum. They don’t all mean the same thing, but they are commonly used to refer to something that lets us check whether a file has changed.
A hash function transforms a string of characters into another (usually shorter) string of characters (a hash value). Thomas Jefferson once wrote to John Adams, “I cannot live without books: but fewer will suffice where amusement and not use, is the only future object.” Using the MD-5 hash function on this sentence gives us the value “5d8c5cc5faa7a59716d9f8f9dbf8de9a.” If I were to change the sentence to “I cannot live without ice cream sandwiches…,” the hash value (also sometimes called the message digest) would be completely different: 107d6c38b26a172bbf3adfba98c8c3b9. You can perform the same operation on a computer file, like a PDF or a JPG, which helps us detect changes to files, intentional or not.
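As a rough illustration of the operation described above, here is a minimal Python sketch using the standard library's hashlib module. The helper names md5_of_text and md5_of_file are made up for this example. A hash value depends on the exact bytes that go in (spacing, punctuation, straight versus curly quotes, text encoding), so running this on a retyped sentence may not reproduce the values quoted above.

```python
import hashlib


def md5_of_text(text: str) -> str:
    """Return the MD5 hash value of a string, encoded as UTF-8 (an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the MD5 hash value of a file, reading it in chunks to save memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Two shortened, illustrative sentences; their hash values will differ.
    print(md5_of_text("I cannot live without books"))
    print(md5_of_text("I cannot live without ice cream sandwiches"))
```

The same md5_of_file helper works for a PDF, a JPG or any other file, because it only looks at the bytes and not at the format.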
What is a Cryptographic Hash?
There are many ways (called algorithms) to calculate a hash value and some are more secure than others. Of course, “secure” can mean lots of things, but in this instance it means that it’s very, very difficult (almost impossible) to
- Figure out the input message for a given hash (pre-image resistance), and
- Find two different messages with the same hash value (collision resistance).
If a hash function meets these criteria, we call it a cryptographic hash function and say that it’s suitable for use in authenticating people, content, systems, etc. This is where the analogy to a digital fingerprint comes from, because it would be impossible to derive a Benjamin Franklin given only his right index fingerprint, and any particular fingerprint corresponds to one and only one human.
Digital Fingerprint, Checksums, CRCs
Checksums and Cyclic Redundancy Checks are hash functions commonly used to detect errors in files, but are not recommended for authenticating content, since they are more prone to attack than are cryptographic hash functions.
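To make the difference concrete, here is a small Python sketch that searches for two different messages with the same CRC-32 value, using zlib.crc32 from the standard library. The function name find_crc32_collision and the way test messages are generated (slices of SHA-256 output, chosen only to get varied bytes) are assumptions for illustration. It demonstrates only that a 32-bit check value is short enough for a simple search to find a match; it is not a description of any particular real-world attack.

```python
import hashlib
import zlib


def find_crc32_collision(limit: int = 5_000_000):
    """Search for two different 16-byte messages that share a CRC-32 value.

    Returns a (first, second) tuple of messages, or None if no match is
    found within `limit` attempts.
    """
    seen = {}  # CRC-32 value -> first message that produced it
    for i in range(limit):
        # Derive a varied, repeatable test message from the counter.
        message = hashlib.sha256(str(i).encode("ascii")).digest()[:16]
        crc = zlib.crc32(message)
        if crc in seen and seen[crc] != message:
            return seen[crc], message
        seen[crc] = message
    return None


if __name__ == "__main__":
    pair = find_crc32_collision()
    if pair is None:
        print("No match found within the limit.")
    else:
        first, second = pair
        print("Messages:", first.hex(), second.hex())
        print("CRC-32:  ", zlib.crc32(first), zlib.crc32(second))
        # The same two messages still get different SHA-256 hash values.
        print("SHA-256: ", hashlib.sha256(first).hexdigest()[:16],
              hashlib.sha256(second).hexdigest()[:16])
```

With only about four billion possible CRC-32 values, a match typically turns up after a modest number of attempts, which is fine for catching accidental errors but not for resisting someone who is deliberately looking for one.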
Understanding the basics of hash functions is important, because they underpin all content integrity and authentication tools, like digital signatures and security certificates. And, ultimately, we cannot say we’ve been successful stewards of the content under our care until we can say that it is exactly what we think it is.
Great run-down of the various types of “hashes.” Based on some of the presentations I heard about this process while involved with the PhotoMetadata project (a LoC, NDIIPP project), I did some research of my own. I wanted some fairly easy way to verify image file transfers (though the same methods can be used for any type of file), and found that MD-5 hashes were sufficient for my purposes.
I summarized my findings on the following post and share it here as it might be of use to others: http://www.controlledvocabulary.com/imagedatabases/file-verification.html
Especially since the tools to do this are freely available (I note a few that are available for both Mac and Windows platforms).
As a side benefit, I found that I can use the same process on the contents that I intend to write to a CD-R or DVD-R, and that this provides an easy means to check on media health. Just pop in the disc and run the MD-5 hash file; each file is hashed again and compared to the stored value on the disc. If any don’t match, it could be a sign that the media is failing, so you know to check your other backup and make a new CD-R or DVD-R.
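For readers who want to try something similar, the routine described in this comment could be sketched in Python roughly as follows. This is not the tool the commenter used: the function names, the manifest file name manifest.md5 and its "hash, two spaces, relative path" line layout are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the MD5 hash value of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(folder: Path, manifest_name: str = "manifest.md5") -> Path:
    """Record one 'hash  relative/path' line for every file under the folder."""
    manifest = folder / manifest_name
    lines = []
    for path in sorted(folder.rglob("*")):
        if path.is_file() and path != manifest:
            relative = path.relative_to(folder).as_posix()
            lines.append(f"{md5_of_file(path)}  {relative}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(folder: Path, manifest_name: str = "manifest.md5") -> list:
    """Return the relative paths that are missing or whose hash has changed."""
    failures = []
    manifest = folder / manifest_name
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        stored_hash, relative = line.split("  ", 1)
        target = folder / relative
        if not target.is_file() or md5_of_file(target) != stored_hash:
            failures.append(relative)
    return failures
```

The idea would be to call write_manifest on the folder before burning it, then later call verify_manifest on the mounted disc; an empty list means every file still matches its stored value.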
Thanks for sharing.