The following is a guest post by Steve McCollum, Digital Media Project Coordinator, Office of Strategic Initiatives.
Central to any digital preservation strategy is making sure that the stuff you have is the right stuff. To that end, the Library of Congress endeavors to make sure that digital image files delivered by contractors in a variety of formats meet the standards for the given format and, in some cases, satisfy institutional requirements over and above those of the standard itself. When digital files meet the standard for their format, they are said to be well-formed and valid (more discussion on these concepts below). This is an important first step in providing for long-term preservation of digital files.
Out of this need, the Configurable Image Validator (CIV) was developed by the Digital Conversion Services group in the Office of Strategic Initiatives. CIV is built on top of the JSTOR/Harvard Object Validation Environment (JHOVE). JHOVE provides an extensible framework to perform format-specific identification, validation, and characterization of digital objects. CIV encompasses all these features of JHOVE and extends functionality through a user interface. The CIV user interface provides easy access to selected formats in the JHOVE application and allows users to create, store, and share custom profiles they can use to validate files. The formats that CIV validates are TIFF, JP2, JPEG, PDF, and GIF. These formats were selected because they are the most common image file formats used at the Library of Congress.
Because it is built on JHOVE, CIV adheres to the criteria laid out in the JHOVE documentation for assessing the validity and well-formedness of file formats. In the case of JHOVE/CIV, the concept of distinguishing between well-formedness (syntactic correctness) and validity (semantic correctness) was taken from XML. The detailed criteria for validity and well-formedness for the above-mentioned file types can be found on the JHOVE web site, but in short, a file is said to be valid if it meets the specification for its indicated format and is said to be well-formed if it meets the “purely syntactic requirements for its format.”
As an aside, a file being valid and well-formed does not guarantee that your digital files are acceptable and that you are home free. CIV will not catch files that are missing visual content, or a directory that is missing files, e.g. pages from a book. That’s why it’s important to make sure you received everything you were expecting in your delivery by supplementing validation with a visual inspection of your files.
Here at the Library of Congress, Digital Conversion Services in the Office of Strategic Initiatives, the Prints and Photographs Division, and the World Digital Library Program all use CIV to assess the validity and well-formedness of digital image files. These service units also create and use custom profiles to check for the presence of additional information in file metadata required by the Library or their particular service unit or program. Workflows vary among these units, but generally CIV is used upon receipt of the files to perform a first level of review before the more labor-intensive task of visual quality checking begins.
How is it used? Let’s say, for example, a vendor is delivering a batch of 2,000 image files in the TIFF format. The Library’s web site lists embedded metadata requirements for TIFF image files. Among those requirements is that, in the embedded metadata of each TIFF file, tag 315 (the field reserved for the artist or creator) must contain the value “The Library of Congress”. Because this is such a large delivery, checking each file manually would be incredibly time-consuming; in fact, you’d probably skip the step because of it. But with CIV such checks can be done quickly and rather easily. You would simply build a custom profile in which you assert that this is the required value for the 315 field. Next, tell CIV where the files are located (mapped drive, hard drive, local drive, etc.) and start the validation process with the click of a button. If the required metadata is not present, CIV will catch the error, inform the user that there is a problem, and identify the file or files that lack the data. The files can then be returned to the vendor with instructions to fix the metadata before any time has been spent on visual inspection.
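To make the tag 315 check concrete, here is a minimal sketch in Python of what such a test amounts to for a single file. It is not how CIV or JHOVE is implemented (both are Java applications, and CIV profiles are built through its user interface); the function names `read_ascii_tag` and `check_artist` are hypothetical, and the sketch assumes a classic (non-BigTIFF) file and looks only at the first image directory.

```python
import struct

ARTIST_TAG = 315  # TIFF "Artist" tag, the field discussed in the post
ASCII_TYPE = 2    # TIFF field type for NUL-terminated ASCII strings


def read_ascii_tag(data: bytes, tag: int):
    """Return the ASCII value of `tag` in the first IFD, or None if absent.

    Hypothetical helper for illustration; assumes a classic TIFF layout.
    """
    if len(data) < 8:
        raise ValueError("too short to be a TIFF file")
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    # The IFD starts with a 2-byte entry count, followed by 12-byte entries.
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry_offset = ifd_offset + 2 + 12 * i
        entry_tag, field_type, count = struct.unpack_from(
            endian + "HHI", data, entry_offset
        )
        if entry_tag != tag:
            continue
        if field_type != ASCII_TYPE:
            return None  # tag present but not stored as text
        if count <= 4:
            start = entry_offset + 8  # short values are stored inline
        else:
            (start,) = struct.unpack_from(endian + "I", data, entry_offset + 8)
        raw = data[start:start + count]
        return raw.rstrip(b"\x00").decode("ascii", errors="replace")
    return None


def check_artist(data: bytes, required: str) -> bool:
    """True if tag 315 is present and exactly equals the required value."""
    return read_ascii_tag(data, ARTIST_TAG) == required
```

Under these assumptions, a call such as `check_artist(open(path, "rb").read(), "The Library of Congress")` would answer, for one file, the question a custom profile answers for the whole delivery.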
CIV informs the user on the status of file validation by providing a visual indicator of its progress as well as creating a log file. If the validation is proceeding successfully, a green status bar is displayed. If a file does not validate, the bar turns red. In these cases, the user can consult the log file to see the error message and provide this information to the vendor to help with fixing the file. Screen captures of the application in use follow at the bottom of this post.
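The batch-and-log pattern described above can be sketched in a few lines of Python. This is an illustration only, not CIV's code: `validate_batch`, the `check` callback, and the log line format are all assumptions made for the example. The returned list of failures plays the role of the red status bar, and the log file holds the messages to pass along to the vendor.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def validate_batch(
    paths: Iterable[Path],
    check: Callable[[Path], Optional[str]],
    log_path: Path,
) -> List[Path]:
    """Run `check` on every path, log each result, and return the failures.

    `check` is a hypothetical callback: it returns None when a file passes,
    or a short error message when it does not.
    """
    failures: List[Path] = []
    with open(log_path, "w", encoding="utf-8") as log:
        for path in paths:
            try:
                error = check(path)
            except Exception as exc:  # an unreadable file counts as a failure
                error = f"could not be checked: {exc}"
            if error is None:
                log.write(f"OK    {path}\n")
            else:
                log.write(f"FAIL  {path}: {error}\n")
                failures.append(path)
    return failures
```

An empty return value corresponds to a run that stayed green; anything else is the list of files to look up in the log.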
CIV is a stand-alone application and as such requires a human operator to perform the validation process. CIV is an example of how organizations can utilize JHOVE for their own purposes. The Library of Congress is happy to share the application and source code with others as one example of how to do this.
Looking ahead, we are exploring the integration of CIV (or something like it) into a workflow tool to automate the receipt and validation task. This next iteration of CIV will likely incorporate the next generation of JHOVE, called JHOVE2, which is being funded by the Library of Congress under the National Digital Information Infrastructure and Preservation Program.
Comments (2)
Wouldn’t it be simpler, instead of returning the files to the vendor, to have the software fix the problem immediately by inserting the missing metadata? If it can find such things, it should be able to fix them.
It would require engineering beyond what we’ve done so far. And JHOVE, so far as I know, doesn’t inherently provide the functionality to edit embedded metadata, so it would at a minimum require the incorporation of a new tool. Your suggestion has a lot of merit though, and is worthy of more discussion.