BitCurator’s Open Source Approach: An Interview With Cal Lee

Cal Lee, Associate Professor at the School of Information and Library Science at the University of North Carolina at Chapel Hill

Open source software is playing an important role in digital stewardship. In an effort to better understand the role open source software is playing, the NDSA infrastructure working group is reaching out to folks working on a range of open source projects. Our goal is to develop a better understanding of their work and how they are thinking about the role of open source software in digital preservation in general.

For background on discussions so far, review our interviews with Bram van der Werf on Open Source Software and Digital Preservation; Peter Van Garderen and Courtney Mumma on Archivematica and the Open Source Mindset for Digital Preservation Systems; and Mark Leggott on Islandora’s Open Source Ecosystem and Digital Preservation. In this interview, we talk with Cal Lee, Associate Professor at the School of Information and Library Science at the University of North Carolina at Chapel Hill, about BitCurator.

Trevor: The title of your talk about BitCurator to the NDSA infrastructure working group explained it as “An Open-Source Project for Libraries and Archives that Takes Bitstreams Seriously.” Could you unpack that a bit for us? What does it mean to take bitstreams seriously and why is it important for archives to do so?

Cal: Computers store and process information through physical mechanisms, such as turning transistors on/off and changing/detecting the magnetic properties of the surface of a disk.  However, software is designed to deal with bitstreams, which are abstractions of those physical properties into sequences of 1s and 0s.  As I’ve expressed elsewhere, the bitstream is a powerful abstraction layer, because it allows any two computer components to reliably exchange data, even if the underlying structure of their physical components is quite different. In other words, even though the bits that make up the bitstream must be manifested through physical properties of computer hardware, the bitstreams are not inextricably tied to any specific physical manifestation.  So the bitstream will be treated the same, regardless of whether it came off a hard drive, solid state drive, CD or floppy disk.

The bitstreams can be (and often are) reproduced with complete accuracy.  By using well-established mechanisms – such as generation and comparison of cryptographic hashes (e.g. MD5 or SHA1) – one can verify that two different instances of a bitstream are exactly the same. This is more fundamental than simply saying that one has made a good copy. If the two hash values are identical, then the two instances are, by definition, the same bitstream.
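A minimal sketch of that comparison in Python, using the standard hashlib module. The function name digest and the sample byte strings are illustrative assumptions, not part of BitCurator. Matching digests are, in practice, very strong evidence that two instances are identical, while a single flipped bit produces a different digest.

```python
import hashlib


def digest(data: bytes, algorithm: str = "sha1") -> str:
    """Return the hex digest of a bitstream held in memory."""
    return hashlib.new(algorithm, data).hexdigest()


# Illustrative data only: an "original", an exact copy, and a copy with one bit flipped.
original = b"An example bitstream"
exact_copy = bytes(original)
damaged = bytearray(original)
damaged[0] ^= 0b00000001  # flip the lowest bit of the first byte
damaged = bytes(damaged)

print(digest(original) == digest(exact_copy))  # True
print(digest(original) == digest(damaged))     # False
```

Passing "md5" or "sha256" as the algorithm argument selects a different hash function, where the local Python build supports it.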

In our everyday use of computers, we luckily don’t need to worry about bitstreams.  We focus on higher-level representations such as documents, pages and programs.  We click on things, copy things and open things, without having to worry about their constituent parts.  But those responsible for the long-term preservation of digital information need to attend to bitstreams.  They need to ensure the integrity of bitstreams over time by generating and then periodically verifying the cryptographic hashes that I mentioned earlier.  They also often need to view files through hex editors, which are programs that allow them to see the underlying bitstreams (presented in 8-bit chunks called bytes), so they can identify file types, extract data from otherwise unreadable files, figure out the underlying contents and structures of files, and even reverse engineer formats in order to bring otherwise obsolete files back to life.
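To give a sense of what a hex editor shows, here is a small Python sketch that prints bytes in hexadecimal and guesses a file type from its leading "magic" bytes. The helper names hex_dump and guess_type are hypothetical, and the signature table is deliberately tiny. Real identification tools use far larger signature registries.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Render bytes as offset, hex values, and printable ASCII, one row per `width` bytes."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Non-printable bytes are shown as dots, as most hex editors do.
        text_part = "".join(chr(byte) if 32 <= byte < 127 else "." for byte in chunk)
        rows.append(f"{offset:08x}  {hex_part.ljust(width * 3)} {text_part}")
    return "\n".join(rows)


# A few well-known leading byte signatures (an illustrative subset only).
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF89a": "GIF image",
    b"PK\x03\x04": "ZIP container (also used by DOCX, ODT, EPUB)",
}


def guess_type(data: bytes) -> str:
    """Guess a format from the first bytes of a file; returns 'unknown' if nothing matches."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"
print(hex_dump(sample))
print(guess_type(sample))
```

The same functions can be applied to the first few hundred bytes read from any file opened in binary mode.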

Bitstreams are also important when it comes to preserving the information acquired on removable media such as hard drives, flash drives, CDs or floppy disks.  Well-established practices in the field of digital forensics involve using a write blocker to ensure that none of the bits on the medium are accidentally changed or overwritten, and then creating a disk image.  A disk image is a perfect copy of the bitstream that is read off the disk through the computer’s input/output equipment.  It essentially allows librarians and archivists to retain all of the contents of a disk without having to rely on the physical medium.  This is important, because the medium will not be readable forever, so the bits need to be “lifted” off and placed in other storage.  It’s also important because there are many forms of data stored on the disk that may not be replicated correctly simply by copying and pasting the files from the disk.  The standard forensics software that creates a disk image also generates a cryptographic hash of the entire disk image (as opposed to the hashes of the individual files), so someone in the future can verify the disk image and ensure that none of the bits have changed.
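The verification step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used by any particular forensics tool: the function names hash_image and verify_image are hypothetical, and the image is assumed to be a raw file on disk whose hash was recorded at acquisition time. Reading in chunks keeps memory use constant even for very large images.

```python
import hashlib


def hash_image(path: str, algorithm: str = "sha1", chunk_size: int = 1024 * 1024) -> str:
    """Hash an entire disk image file, reading it in fixed-size chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as image:
        # iter() with a sentinel keeps calling read() until it returns b"" at end of file.
        for chunk in iter(lambda: image.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_image(path: str, recorded_digest: str, algorithm: str = "sha1") -> bool:
    """Return True if the image still matches the digest recorded when it was created."""
    return hash_image(path, algorithm) == recorded_digest.strip().lower()
```

A periodic integrity check then amounts to calling verify_image with the stored digest and flagging any image for which it returns False.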

The process for creating a disk image begins by being able to read the physical media. For example, a 3.5 inch disk like this.

Trevor: Disk images are an important part of that bitstream focus. At its core, BitCurator functions to help create disk images and then enable a user to carry out a range of operations on disk images. Could you tell us a bit about how your team is thinking about disk images themselves as a format? For example, to what extent is the image the artifact and the process of creating an image a preservation action? Or, conceptually, is the image more akin to a derivative of the artifact?

Cal: As I explained earlier, a bitstream is the same bitstream regardless of how it’s physically stored.  So if you navigate to a file that’s stored on your computer and send it to me as an email attachment, and I then save it to my computer, my copy of the bitstream will be exactly the same as your copy.  The associated metadata, such as the file name and timestamps could be completely different, but the file as a bitstream will not change (assuming there has been no corruption of the file along the way).  We can verify this by generating hashes on the two copies and seeing that they match.
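The distinction between a file's bitstream and its metadata can be demonstrated with a short Python sketch. The file names and the timestamp change below are made-up stand-ins for what happens when an attachment is saved on another machine, and file_digest is a hypothetical helper.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path: str) -> str:
    """Return the SHA-1 hex digest of a file's contents (metadata is not included)."""
    with open(path, "rb") as handle:
        return hashlib.sha1(handle.read()).hexdigest()


with tempfile.TemporaryDirectory() as folder:
    yours = os.path.join(folder, "report.txt")
    mine = os.path.join(folder, "attachment (1).txt")  # different file name

    with open(yours, "wb") as handle:
        handle.write(b"the same bits")

    shutil.copyfile(yours, mine)  # copies contents only, not timestamps
    os.utime(mine, (0, 0))        # force different access/modification times

    same_content = file_digest(yours) == file_digest(mine)
    same_mtime = os.path.getmtime(yours) == os.path.getmtime(mine)

print(same_content, same_mtime)  # True False
```

The name and timestamps differ, yet the digests match, because the hash is computed over the file's contents alone.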

This same set of relationships applies to disk images.  If you create a disk image of a floppy disk and send it to me, I’ll then have the exact same bitstream that you have.  If you create another disk image of that disk, it should also be exactly the same (again, assuming no data loss due to hardware failure).  It is this disk image that we need to treat as the “original” in a digital environment.  This is true for two fundamental reasons.  First, software on your computer doesn’t have access to the underlying physical properties of a disk the same way that a reader has direct access to the physical properties of a printed page.  The bitstreams that computers read, manage and process are always mediated through the computers’ input/output equipment.  So, except in extremely rare cases of heroic recovery, there’s no practical value in treating the contents of a disk as anything other than the stream of bits that can be read through the I/O equipment.  In other words, for practical purposes, the disk image is the disk.

The second reason to treat the disk image as the original is that the physical disk will not be readable forever.  The industry will abandon support for the hardware and low-level software/firmware required to read it.  The performance of the medium (its storage capacity and input/output transfer rate) will become less acceptable over time – ever try to store a terabyte of data on floppy disks?  And the bits will eventually be lost through natural physical aging.

This doesn’t mean that the artifactual properties of hardware are never important.  Understanding the original hardware can be important to knowing what the user experience was like at the time.  And taking pictures of original media in order to reflect things written on them can be a good way to reflect aspects of the creator’s intentions and work habits.

Here you see the interface for Guymager, the tool BitCurator uses to create disk images.

Trevor: How is the BitCurator team approaching interoperability between this tool and other digital preservation tools?

Cal: Probably the most important answer to your question is that all of the BitCurator software is distributed under an open-source license.  This means that people can download, manipulate and redistribute whatever parts they find useful.

We’re also in regular contact and collaborate with people involved in various other development activities.  For example, Courtney Mumma from Artefactual Systems is on the BitCurator Development Advisory Group, and we work closely with Artefactual to ensure that the BitCurator software and its data output are structured and packaged in ways that can be incorporated into Archivematica.  Mark Matienzo is also on the DAG, and we’ve had many discussions with him about how the BitCurator software can play well with ArchivesSpace.  Similarly, we strive to stay abreast of related software development activities being carried out within collecting institutions, such as the valuable work of Peter Chan at Stanford, Don Mennerich at the New York Public Library, Mark Matienzo at Yale, and activities outside the US that are represented well by the documentation that Paul Wheatley has developed for the Open Planets Foundation.

Kam Woods, who is the BitCurator Technical Lead, carries out extremely important liaison activities between our team and not just developers in the cultural heritage sector but also developers of standards and software in the forensics industry.  This is particularly important for BitCurator, because we’re repurposing, adapting and repackaging many existing open-source digital forensics tools.  Identifying and managing software dependencies is an ongoing process.

Viewing reports on a disk image in Bulk Extractor

Trevor: Could you tell us a bit about the design principles at work in the BitCurator project? That is, instead of trying to build things from scratch, you seem to be bringing together a lot of open source software created for somewhat different use cases and making it useful to archives. Why did your team develop this approach and what do you see as its benefits and limitations?

Cal: Almost twenty years ago, in a book called Darwin’s Dangerous Idea, Daniel Dennett argued that complex systems evolve through what he called the “accumulation of design.”  New products, services and theories and various other human products build off of existing ones.  Software development is no different.  Programmers know that it’s usually better to make use of existing code than to build it from scratch.  Why write the code required to write text to the screen, for example, if someone else has already done that?  Open-source software facilitates this process, because reusing someone else’s code doesn’t require the negotiation of permissions or payment.

Code adaptation and reuse is a particularly powerful proposition for the application of digital forensics to digital collections, because there is a great deal of powerful software that has already been developed, and it’s unlikely that collecting institutions would ever have sufficient resources to develop such tools completely on their own.  As someone who has been working with digital archives for many years, I’ve been amazed by how many tools being developed for digital forensics can be applied to the problems we face.  A great place to see leading-edge development in this space is the Digital Forensics Research Workshop, which is an annual conference that publishes its papers in a journal called Digital Investigation.  I’ve been particularly grateful for the open-source (or public domain) software developed by Simson Garfinkel at the Naval Postgraduate School and Brian Carrier of Basis Technology.

Of course, all design decisions involve costs and benefits.  The main challenges of using software developed by others are that your specific use case may not have been the primary priority of those developers, and as I mentioned earlier, you have to stay on top of dependencies with that existing software as their (and your) software evolves over time.  The BitCurator team and I believe strongly that these costs are well worth the numerous benefits.  And we’re working to support the kinds of use cases that are most important to collecting institutions.

Visualizations of some of the file system metadata created through BitCurator reporting functions.

Trevor: Could you tell us a bit about how you are thinking about the sustainability of BitCurator? For example, are you thinking about building a community of users and developers? What kinds of future funding streams are you looking to?

Cal: There are various elements of BitCurator that are designed to build capacity and ensure the sustainability of our activities. I’ve already explained that the software is distributed under an open source license, so diverse constituencies will be able to extend our tools at will.  Members of the BitCurator team have been offering a lot of continuing professional education opportunities (including a module for Rare Book School and classes for the Digital Archives Specialist program of the Society of American Archivists), which help to build and cultivate a community of users.  There’s a BitCurator user group that interested professionals can join, and our project wiki includes an increasing body of documentation to help people to install and use the software.

A significant focus of the second phase (October 2013 to October 2014) of BitCurator is to devise and implement a sustainability plan.  This is being overseen and coordinated by Porter Olsen, who is the Community Lead for BitCurator.  We’re currently exploring a variety of membership models.  We should have a much more detailed answer to your question in the coming year.

Trevor: Could you tell us a bit about how you are trying to engage and build a community around the software? What kinds of approaches are you taking and to what ends are you taking those approaches?

Cal: I’ve already talked about most of them within the context of sustainability.  The two issues (sustainability and community building) are closely related.  The products of the BitCurator project will ultimately be sustainable if there are professionals working in a variety of institutions who value them, use them, and contribute back to their ongoing development through evaluative feedback, bug reports and code revisions/enhancements.  In addition to our educational offerings and guidance resources, we’ve also published many papers/articles about this work and given talks at a variety of conferences and other professional events.

Porter Olsen is taking on many new engagement activities this year.  Among other things, this includes site visits and webinars.  The first two webinars that Porter is offering have filled up within a few days of announcing them, so there seems to be a lot of interest.

Trevor: It strikes me that one of the biggest opportunities and challenges here is that there is a significant literacy gap within the community around how to deal with born digital archival materials. For example, if you were making a tool to turn out finding aids there would be relatively solid requirements within the archives community of practice. In contrast, in working with born digital archival materials there is still an extensive need for developing those practices and a significant lack of knowledge about the issues at hand among many in the archives profession. First off, do you agree with this perspective? Second, if so, how are you approaching designing a tool while the archives community is still bootstrapping its way into working with these materials?

Cal: I agree with you that the landscape is currently undergoing dramatic evolution.  This is what makes the work so fun and so fulfilling.  Professionals in a diverse range of collecting institutions are developing workflows that involve digital forensics tools and methods.  They’re learning from each other and making changes as they go along.

This is also a very exciting situation for an educator.  I don’t know if they always believe me when I tell them this, but today’s students in a program like the one at UNC SILS will be defining and establishing archival practices of the future.  If you want to continuously take on new challenges and creatively develop entirely new ways of working, then this is a great profession to join right now.  If you want a profession that’s safe and predictable, I recommend looking elsewhere.

Trevor: How has your work on BitCurator shaped your general perspective on the role that open source software can and should play in digital preservation? I would be particularly interested in any comments and connections you have to some of the interviews we have already done in this series. For reference, those include Bram van der Werf on Open Source Software and Digital Preservation; Peter Van Garderen and Courtney Mumma on Archivematica and the Open Source Mindset for Digital Preservation Systems; and Mark Leggott on Islandora’s Open Source Ecosystem and Digital Preservation.

Cal: It’s hard for me to argue with much that Bram, Peter, Courtney or Mark have said to you.  I think we are of a like mind on many things.  The curation of digital collections is a collective endeavor, and it can benefit greatly from open-source software development.  But it’s definitely not a panacea.  We have to learn from each other, assist each other, and celebrate each other’s victories.
