The following is a guest post from Amy Rudersdorf, Director, Digital Information Management Program at the State Library of North Carolina.
We’ve got high hopes . . .
Just what makes that little ole ant
Think he’ll move that rubber tree plant?
Anyone knows an ant can’t
Move a rubber tree plant
But he’s got high hopes, he’s got high hopes
He’s got high apple pie in the sky hopes…
–Lyrics for “High Hopes” were written by Sammy Cahn with music by Jimmy Van Heu
An automated ingest process for born-digital publications of and by North Carolina state government has seemed completely outside the realm of possibility for the small staff at the State Library, employed to fulfill a legal mandate to preserve these important documents. It was, if you will, our rubber tree plant. It would have remained so, too, without the 2011 IMLS Sparks! Ignition grant that enabled us to dedicate staff, work with a contractor, and obtain resources to make our apple pie hopes a reality in the form of our Capture, Ingest, and Checksum (CINCH) tool.
So, what does CINCH do and why should your institution care? At a basic level, CINCH is a series of micro-services that are designed to locate targeted files on the internet and download them in a “preservation-ready state.” This includes maintaining the files’ integrity by running checksums throughout the process, virus checking, and enhancing the files’ context with automated metadata extraction.
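To make the idea of "running checksums throughout the process" concrete, here is a minimal Python sketch of a fixity check. It is not CINCH's code: the function names are hypothetical, and SHA-256 is an assumption, since the post does not say which hash algorithm the tool uses.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of some file content as a hex string."""
    return hashlib.sha256(data).hexdigest()


def fixity_ok(recorded_checksum: str, data: bytes) -> bool:
    """True if the content still matches the checksum recorded earlier."""
    return recorded_checksum == sha256_hex(data)


# Record a checksum at one step, then verify it at a later step.
original = b"Annual report, 2011"
recorded = sha256_hex(original)

print(fixity_ok(recorded, original))         # content unchanged
print(fixity_ok(recorded, original + b"!"))  # content altered in transit
```

Recomputing the checksum at each stage and comparing it with the recorded value is what lets a tool of this kind claim that a file arrived exactly as it left.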
Think of CINCH as an assembly line, with a series of deliberate actions being performed on a single file in progression (although, in reality, CINCH can process up to 4,500 files per batch). At each step, CINCH attempts to perform a desired action. Whether successful or not, each action and its result are recorded for the user’s review. The final outcome is a zipped file that contains the original files targeted by the user, their associated metadata, and an audit trail document outlining every action taken upon the files.
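The assembly-line idea can be sketched in a few lines of Python. This is an illustration only, not CINCH's implementation: the step functions, the record fields, and the choice to carry on to the next step after a failure are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# A step is a (name, action) pair; the action returns a short detail string
# on success and raises an exception on failure. All names are hypothetical.
Step = Tuple[str, Callable[[Dict], str]]


def check_size(record: Dict) -> str:
    if record["size"] <= 0:
        raise ValueError("file is empty")
    return f"{record['size']} bytes"


def check_extension(record: Dict) -> str:
    ext = record["name"].rsplit(".", 1)[-1].lower()
    if ext not in {"pdf", "txt"}:
        raise ValueError(f"unsupported type: {ext}")
    return ext


def run_pipeline(record: Dict, steps: List[Step]) -> List[Dict]:
    """Run every step on one file and log each outcome, success or failure."""
    audit = []
    for name, action in steps:
        entry = {"file": record["name"], "action": name}
        try:
            entry.update(status="ok", detail=action(record))
        except Exception as exc:
            entry.update(status="failed", detail=str(exc))
        audit.append(entry)
    return audit


STEPS: List[Step] = [("size check", check_size), ("type check", check_extension)]

for entry in run_pipeline({"name": "report.pdf", "size": 0}, STEPS):
    print(entry)
```

The point of the design is that a failed step still leaves a line in the audit trail, so the user reviewing the package can see what was attempted as well as what succeeded.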
Users upload a manifest containing their list of targeted files, which can consist of any combination of up to 11 different file types (.csv, .doc, .docx, .gif, .pdf, .png, .ppt, .pptx, .txt, .xls, .xlsx). Micro-services are first performed on the remote files: hashing, file duplication check, file type verification, assessing filenames for uniqueness, and evaluating file sizes. Files are then downloaded to the server, where the next series of micro-services is run: virus scan (and immediate deletion if something sinister is detected), verification that the last modified date has not changed on download, checksum hash, de-duping based on filename and checksum, and metadata extraction. The files, along with the metadata and any error reports, are then packaged for download. Finally, users receive an email alert that their files have been fully processed and that they have up to 30 days to download their package for local ingest.
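Two of the checks described above, file type verification and de-duping, lend themselves to a short Python sketch. The code is illustrative rather than taken from CINCH: the function names are hypothetical, the type check here looks only at the file extension in the URL, SHA-256 is an assumed hash, and treating two files as duplicates only when both filename and checksum match is one reading of "de-duping based on filename and checksum."

```python
import hashlib
from pathlib import PurePosixPath
from typing import List, Tuple
from urllib.parse import urlparse

# The 11 file types listed in the post.
ALLOWED = {".csv", ".doc", ".docx", ".gif", ".pdf", ".png",
           ".ppt", ".pptx", ".txt", ".xls", ".xlsx"}


def is_supported(url: str) -> bool:
    """Check a manifest entry's extension, ignoring any query string."""
    return PurePosixPath(urlparse(url).path).suffix.lower() in ALLOWED


def dedupe(files: List[Tuple[str, bytes]]) -> List[Tuple[str, bytes]]:
    """Keep the first copy of each (filename, checksum) pair."""
    seen = set()
    kept = []
    for name, data in files:
        key = (name, hashlib.sha256(data).hexdigest())
        if key not in seen:
            seen.add(key)
            kept.append((name, data))
    return kept


manifest = ["http://example.org/docs/Report.PDF",
            "http://example.org/docs/index.html"]
print([url for url in manifest if is_supported(url)])

downloads = [("a.pdf", b"x"), ("a.pdf", b"x"), ("a.pdf", b"y"), ("b.pdf", b"x")]
print([name for name, _ in dedupe(downloads)])
```

Under this reading, a same-named file with different content is kept, since it may be a revised edition of the same publication.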
The beta version of the software is free to download and install locally from GitHub. A stable downloadable release, as well as a hosted version for any North Carolina institution’s use, will be available in August 2012.
Here at the State Library, we’re feeling pretty confident that CINCH has met that huge goal we set for ourselves last year – to push our rubber tree a step or two forward. And, we hope that with this tool, other small and mid-sized institutions will be able to share our sentiment that digital preservation isn’t quite so “apple pie in the sky.”
Thanks to IMLS, our funder, and to our project partners: State Archives of North Carolina, North Carolina Libraries for Virtual Education (NC LIVE), Elon University’s Belk Library, and the J. Murrey Atkins Library at the University of North Carolina at Charlotte.