Toward a Library of Virtual Machines: Insights interview with Vasanth Bala and Mahadev Satyanarayanan

The following is a guest post by Trevor Owens, Digital Archivist with the Office of Strategic Initiatives.

We are excited to continue our Insights series of interviews, featuring conversations between National Digital Stewardship Alliance Innovation working group members and individuals working on projects related to preservation, access and stewardship of digital information. In this installment, Jane Mandelbaum, IT Project Manager with the Library of Congress’s Information Technology Services division and co-chair of the National Digital Stewardship Alliance Innovation working group, talks with Vasanth Bala (pictured on the right), of IBM T.J. Watson Research Center, and Mahadev Satyanarayanan (pictured on the left), from the School of Computer Science at Carnegie Mellon University.  They offer insights about their work on the Olive library, a project which intends to create a library of virtual machines.

What is it that you are trying to do with the Olive Library and how does it relate to digital preservation?

Virtual Machine (VM) technology can be applied to a wide range of problems. We are currently working on building a digital library for executable software content in the form of VM images. We have prototyped this system and we believe it offers a significant value proposition – it allows for long-term preservation of executable content.

Could you unpack that for us a bit? What kinds of executable content are you talking about?

The challenge is to be able to execute software in the future. Let’s say you have a perfectly working program on a computer. You don’t know what it needs to work; you just know it works. You don’t know what else on the computer affects it or doesn’t affect it. If you could preserve this hardware for posterity, it would continue to work. Our approach would be to virtually preserve the machine: it’s the equivalent of taking a snapshot of everything above the hard disk. Now, for example, if you are a graduate student and you write a piece of software, you don’t have a good way to preserve it once you leave school. As another example, if you have a CD that contains a software distribution, you can’t guarantee it will run in 5 years. Our approach would provide a way to preserve the software and the environment in which it runs. The goal is the ability to re-execute software long after its creation, possibly decades later, with no loss of fidelity. We do not require any changes to the original software in order to achieve this capability. Instead, we achieve it by precisely re-creating the hardware execution environment in the form of a VM. All the original software layers, including the operating system, are preserved unchanged.

Do you have a few examples of what value this approach to software preservation could provide?

For example, NASA has a unique problem. Any interplanetary mission takes a long time to reach its destination. Software loaded and used on Earth and on the probe needs to be able to function together 5-10 years from now, when the probe arrives. As another example, the ACM (Association for Computing Machinery) has a digital library of all of its published papers. They are very interested in enhancing it so that people can publish a VM that includes the software used to collect the data in a paper, so others can reproduce the results. It may also provide a way to publish things that are not easily written about in static text but require an executable environment.

For website publishers and archivists, it is not always easy to reproduce dynamically generated content that was published on a website at a specific time. We could capture the content directly from the publisher’s machine.

More broadly, as all fields of scientific investigation rely on complex simulation and visualization software, the ability to archive these software artifacts in executable form becomes essential for the reproducibility of scientific results. Software preservation also enables long-term data preservation. Today’s data formats may become obsolete tomorrow unless the software applications that process those formats are also preserved.

Could you tell us a bit more about the system you are envisioning here? What would it look like? How would it work?

We are calling this the Olive Library. Our goal is to store millions of these images, in a way that indexes the contents so people can search for what they need. This would involve a form of simple and seamless delivery of executable content over the network to your PC. Something like a YouTube of virtual machines. Your request for a preserved VM would be streamed through the network and made accessible on your computer.

Our idea is to have a library which would provide some basic services, like other digital libraries. It would be a logically central place that maintains the VMs. It would also contain a streaming engine to deliver them. There are some significant issues that will continue to come up in the future. For example, we would need ways to improve the robustness and trustworthiness of this kind of library. One idea is to add safeguards so that you cannot publish something you don’t have rights to (much as YouTube can now detect that a frame sequence was taken from a movie and decline to publish it). Another issue is how to identify and search for a particular program. Many people may have given their programs the same name, or the same program may have multiple names. Let’s say you publish a Python app on your machine. You call it “python123”. Someone else has something called “pythonxyz”. It would be good if the library provided a way to identify the program with a signature or fingerprint.
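
To make the fingerprint idea concrete, here is a minimal sketch of one possible approach: hashing a program's bytes with SHA-256 so that identical content is recognized regardless of its name. This is an illustration only, not a description of how Olive works. The `fingerprint` helper, the `uploads` table and the `other_app` entry are hypothetical, and the sketch assumes that the two differently named uploads happen to contain the same bytes.

```python
import hashlib
import io


def fingerprint(stream, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a binary stream, read in chunks.

    Reading in chunks keeps memory use flat even for very large images.
    """
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Hypothetical uploads: two names for the same bytes, plus one different program.
uploads = {
    "python123": b"print('hello')\n",
    "pythonxyz": b"print('hello')\n",
    "other_app": b"print('goodbye')\n",
}

# Index names by content fingerprint, so identical content shares one entry.
catalog = {}
for name, data in uploads.items():
    catalog.setdefault(fingerprint(io.BytesIO(data)), []).append(name)

for digest, names in catalog.items():
    print(digest[:12], names)
```

A plain cryptographic hash only matches byte-for-byte identical content. Recognizing near-identical programs, such as two builds of the same application, would need a different technique.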

Based on your work and areas of interest, what kinds of work would you like to see the digital preservation and stewardship community take on?

Much attention has been paid to the preservation of digital content like text, audio and video. By contrast, very little effort has been put into the preservation of software applications. It would be nice to have the stewardship community champion this cause and urge other organizations to invest in software preservation. We think of this as a key part of a whole ecosystem. We are very interested in getting more community involvement with this kind of project, and we welcome contact from any interested parties.

4 Comments

  1. lentigogirl
    September 21, 2011 at 12:11 pm

    fascinating – thanks!

  2. Anton Angelo
    September 21, 2011 at 6:24 pm

    This is fantastic way to collect proprietary software images: the question is, are you allowed to?

    These images will provide a wonderful source of material for those doing in depth research on software evolution in the future.

  3. Ria
    September 26, 2011 at 3:00 am

    If this research does work, and become generally available, it will be a huge advantage to help libraries with data curation.

  4. Dianne
    September 27, 2011 at 10:50 am

    I’ve seen a demo of this technology, which is very important and very significant. It allows practitioners and researchers to restore, view and use software and documents that were created on obsoleted operating environments. Thanks to IBM, CMU and the Library of Congress for their leadership and participation in Olive (On-line Interactive Virtual Environment).
