Open source software is playing an important role in digital stewardship. In an effort to better understand the role open source software is playing, the NDSA infrastructure working group is reaching out to folks working on a range of open source projects. Our goal is to develop a better understanding of their work and how they are thinking about the role of open source software in digital preservation in general. For background on discussions so far, review "Open Source Software and Digital Preservation: An Interview with Bram van der Werf of the Open Planets Foundation" and "Archivematica and the Open Source Mindset for Digital Preservation Systems." Continuing this series, I am excited to interview Mark Leggott.
Mark Leggott is University Librarian at the University of Prince Edward Island and the founder of the Islandora open source software project. For those who aren't familiar, Islandora connects the Drupal and Fedora open source software applications, acting as a kind of glue between the content management and presentation capabilities of Drupal and the long-term preservation features of Fedora. Mark recently presented on the Islandora work to the NDSA infrastructure working group, and I've set this interview up as a chance for us to suss out some of the ideas and points he presented.
Trevor: How would you describe Islandora? On the call you referred to it as the glue and you referred to it as an ecosystem. I would be interested to hear you describe the software for those who aren't familiar and also talk a bit about why you are using terms like ecosystem.
Mark: Islandora is probably most often referred to as a framework, but I prefer the term ecosystem, maybe because of my background in Biology. Either way the terms refer to the fact that Islandora consists of a number of separate open source systems that are brought together into a single software context to provide best-practice digital preservation services.
This is illustrated at a high level in the diagram below. The various components are held together (that Islandora glue you referred to) in different ways, for example: the Tuque API provides communication between Drupal and Fedora; the micro-services engine uses a Java messaging service to link disparate software applications into a service oriented architecture; the Taverna science workflow engine is being integrated via WSDL wrappers around Python services. This allows us to be extremely agile with the Islandora ecosystem and has allowed us to integrate a wide range of open source (and proprietary where desirable) software systems in relatively quick order.
Trevor: On the call, you mentioned some of the previous open source software projects you had started and some of the lessons you have learned and applied from those projects to your work on Islandora. Could you walk through those projects and what you took away as the lessons learned?
Mark: My first open source project was an ISO-compliant ILL system called OpenILL. It was very early days in what we currently think of as open source software, so the approach was a little different. We built the system in Cold Fusion and Java, so while it was easy to code, the sustainability factor was not there given where Cold Fusion drifted. (The Java component is still available for download). We attempted to port the system to Java/PHP/Drupal, but never got very far and I got distracted on my next project. It was also a good context in which to learn the lesson of diversity: any ecosystem, whether natural or manmade, needs a diverse set of technologies and players to be a long-term success.
The 2nd one was a system called Martini, which we chose to signify something you take with an Olive to make it taste good: the Olive in this case being the software for accessing the output of the Olive OCR system. We were building a newspaper archive at the University of Winnipeg (still accessible) and Martini was the Cocoon/Lucene-based engine we built to serve up very complex image and XML output. I still think Martini is one of the best newspaper engines around, providing article-level retrieval and some nifty image manipulation, but it was highly tailored to a specific OCR output. Again, ensuring a diverse ecosystem around the code was not a highlight of the project, despite some technically sound code development.
Both experiences taught me that encouraging a wide range of participation in an open source project is key to success; that is evident with the Islandora project in the role both the University of PEI and DiscoveryGarden Inc. have played, and continue to play, in its development. The increasing number of organizations that are coming on side with developers is also part of building a strong and healthy ecosystem. The other key is a flexible and modular architecture, as discussed elsewhere.
Trevor: Could you explain some of the design principles that inform how you are building the software? How are you designing for the particular concerns of digital preservation?
Mark: One of the core design principles with Islandora is the separation of data from presentation. We store all data and information about the data in the Fedora backend, and present those information bits via the Drupal front-end. Our goal has always been to focus on the stewardship of the digital crown jewels of the organization, whether that be providing an open and standards-based framework for the storage of assets and associated metadata, or providing a range of preservation services. Our use of micro services for data transformation also facilitates this focus, as we can switch out best-practice applications easily. For example, we are working with Artefactual on the integration of Archivematica into the stack.
This approach does sometimes lead to concern from the Drupal side of the house, as it is not always easy to leverage the full suite of Drupal modules against the data stored in Fedora. This will change in the latest release (Islandora 7), which will have a new module (currently called Bridge) which will use the new Drupal entities approach to allow for a synchronization between Drupal and metadata stored in Fedora. This will allow a more familiar Drupal-y approach to customizing the front-end display of metadata.
This approach also makes it easier to integrate Islandora (e.g., the Drupal and middleware components) with other non-Fedora backends. Recent efforts showed the ability to integrate Islandora with Databank and Merritt with a modest amount of effort. We recognize that institutions will eventually have to migrate all or parts of their digital library systems and we want to make that process work without losing information or value.
Trevor: Islandora makes extensive use of a range of existing open source software tools. Could you first tell us a bit about the kind of tools you are using and then talk through how you are thinking about tensions between dependencies on these tools and the ability to swap out and change between different components?
Mark: Islandora is more properly a framework, making heavy use as suggested of a wide range of best-practice tools. Our goal is to allow any application (proprietary or open source) to be integrated into the stack if it makes good sense to the institution. A good example is our use of OCR software. In the early days we only had support for ABBYY, as the best practice OCR software of choice. Now, Tesseract is the default OCR engine and it does a great job, although other engines can still be swapped in or out. Tools for extracting technical metadata are another example – we have used EXIFTool, Tika, FITS and others. The ability to choose a tool is critical, as it allows specific use cases to be supported with a minimum of effort. This approach also has downsides, such as complexity based on dependencies of the various components. While this makes the installation of Islandora more challenging, we feel the benefits vastly outweigh the costs.
While these tools and services can be integrated into Islandora in a number of ways, we are moving forward with a standard approach based on our micro services engine. The basic approach here is to develop listeners (using PHP or Python), which know how to communicate with services, and communicate with the engine to make sure that data is passed on to Fedora in the appropriate way. We also recognize that there are other best-practice micro service engines out there, and our current project to integrate Taverna into Islandora is an example of how we hope to facilitate this.
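Editorial aside: the listener pattern Mark describes can be pictured with a minimal Python sketch. This is not Islandora's code. The message format, the handler names (`fake_ocr`, `fake_techmd`), and the `InMemoryRepository` class are all hypothetical stand-ins, and an in-process queue takes the place of the real messaging service. The point is only to show how a listener can map incoming messages to interchangeable tools and pass the results on to a repository.

```python
import queue

# Hypothetical stand-ins for external tools. In a real deployment these
# would call out to an OCR engine or a technical-metadata extractor.
def fake_ocr(payload: bytes) -> str:
    return payload.decode("utf-8").upper()

def fake_techmd(payload: bytes) -> str:
    return f"size={len(payload)}"

class InMemoryRepository:
    """Hypothetical stand-in for the repository backend (e.g. Fedora)."""

    def __init__(self):
        self.datastreams = {}

    def add_datastream(self, pid, dsid, content):
        # Store derived content under (object id, datastream id).
        self.datastreams[(pid, dsid)] = content

def run_listener(messages, handlers, repo):
    """Drain the queue, dispatching each message to its registered handler.

    Messages are dicts with 'pid', 'task', and 'payload' keys (an assumed
    format). Messages with no registered handler are skipped, which is what
    lets tools be added or removed by editing the handlers mapping only.
    Returns the number of messages that were processed.
    """
    processed = 0
    while True:
        try:
            msg = messages.get_nowait()
        except queue.Empty:
            break
        handler = handlers.get(msg["task"])
        if handler is None:
            continue
        result = handler(msg["payload"])
        repo.add_datastream(msg["pid"], msg["task"].upper(), result)
        processed += 1
    return processed
```

Swapping one OCR engine for another in this sketch means replacing a single entry in the `handlers` dictionary; the listener loop and the repository code stay untouched.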
Trevor: You mentioned that one of the goals behind the Islandora approach was that “the crown jewels” (collection content and descriptive, technical, and preservation metadata) would need to be able to be migrated into future systems. Could you tell us a bit about how the system supports that goal?
Mark: The key to this approach is the use of Fedora as the repository for all data and information about it. Fedora provides an open architecture that makes it easy to get information in and out. Each asset is stored in its own subdirectory in the Fedora filesystem, along with an XML file (FoXML) that contains all of the descriptive, technical, administrative and preservation metadata. This information can be accessed in any number of ways to migrate a complete Fedora repository to another system. For example, you could work with the native FoXML files or export them to METS to move to another system.
I think one of the best demonstrations of the power of this approach is our new sandbox coming with the new release of Islandora 6 and 7 (later in February), which allows both Drupal 6 and 7 to be used on the same Fedora instance. Those who work with Drupal will know that migrating from one release to another is fraught with challenges: the fact that Islandora can facilitate this in a seamless way is the best way to demonstrate how our approach contributes to an effective long-term stewardship of digital data.
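Editorial aside: to give a feel for why the FoXML files mentioned above help with migration, here is a minimal Python sketch that reads one with only the standard library. The sample document is hand-written and heavily simplified for illustration (a real FoXML file also carries object properties and the datastream content itself), and the `summarize_object` helper is a hypothetical name. Treating the last `datastreamVersion` element as the most recent one is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Namespace used by FOXML documents.
FOXML_NS = {"foxml": "info:fedora/fedora-system:def/foxml#"}

# Hand-written, simplified sample; not taken from a real repository.
SAMPLE_FOXML = """<foxml:digitalObject
    xmlns:foxml="info:fedora/fedora-system:def/foxml#"
    VERSION="1.1" PID="demo:1">
  <foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X">
    <foxml:datastreamVersion ID="DC.0" MIMETYPE="text/xml"
                             LABEL="Dublin Core Record"/>
  </foxml:datastream>
  <foxml:datastream ID="OBJ" STATE="A" CONTROL_GROUP="M">
    <foxml:datastreamVersion ID="OBJ.0" MIMETYPE="image/tiff"
                             LABEL="Master image"/>
    <foxml:datastreamVersion ID="OBJ.1" MIMETYPE="image/tiff"
                             LABEL="Master image (rescanned)"/>
  </foxml:datastream>
</foxml:digitalObject>
"""

def summarize_object(foxml_text):
    """Return the object's PID and a short summary of each datastream."""
    root = ET.fromstring(foxml_text)
    summary = {"pid": root.get("PID"), "datastreams": []}
    for ds in root.findall("foxml:datastream", FOXML_NS):
        versions = ds.findall("foxml:datastreamVersion", FOXML_NS)
        # Assumption: versions appear oldest first, so the last is current.
        latest = versions[-1] if versions else None
        summary["datastreams"].append({
            "id": ds.get("ID"),
            "versions": len(versions),
            "mimetype": latest.get("MIMETYPE") if latest is not None else None,
            "label": latest.get("LABEL") if latest is not None else None,
        })
    return summary

if __name__ == "__main__":
    info = summarize_object(SAMPLE_FOXML)
    print(info["pid"])
    for ds in info["datastreams"]:
        print(ds["id"], ds["versions"], ds["mimetype"], ds["label"])
```

Because the inventory of an object is plain XML, a migration script along these lines needs no access to the running repository software, only to the files.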
Trevor: Could you tell us a bit about the design and development process? On the call you mentioned that you work on an agile development model and I would be curious to hear a bit about the design and development history in this context. So I suppose the question is where have you been and how did you get here?
Mark: The Islandora project has gone through a range of approaches and processes in our first 5 years when it comes to coding. Our initial team consisted of 3 people, and we managed our code in Sourceforge and used a PM/ticketing system called Redmine. We still use Redmine internally at UPEI for projects, but the project now uses Github for code maintenance and the number of contributors is in the dozens. Most of the code is written and maintained by staff at UPEI and discoverygarden inc., which is a commercial SaaS spin-off from UPEI. discoverygarden inc. writes the majority of the code currently as it works on client projects, and they use an Agile approach for all development. That includes using the OnTime suite, Github, Jira and Jenkins in a scrum environment where daily stand-ups are followed with longer bi-weekly assignment and demo meetings, culminating in ongoing releases, including 2-3 major releases per year. UPEI also puts a considerable effort into enhancing the Islandora code, often in conjunction with internal projects, such as our 2nd generation IR, islandscholar.ca. As Islandora sees greater adoption around the world we are seeing more institutions stepping up to the plate to contribute code to the project. A good recent example is the WARC module Nick Ruest (York University in Canada) is building.
Trevor: When we are talking about digital preservation one of the biggest issues is sustainability. How does the Islandora project support sustainability of the software?
Mark: As mentioned previously this is my 3rd open source project and sustainability has always been one of my primary interests. I think we have been extremely successful in that goal, with a strong and regularly updated codebase, an active community with implementations in over a dozen countries, 3-4 annual Camp events, a primary commercial support and services company as well as an increasing number of non-profit and commercial entities supporting Islandora. The Islandora team at UPEI is in the process of creating a non-profit entity to maintain the project, including a community support model to help sustain the initiative.
I have been a strong proponent of open source for most of my career and I feel it is important for leaders and staff at our institutions to realize the power of open source by contributing to the sustainability of projects in multiple ways, whether by contributing staff expertise, implementing improvements, or providing cash to a project. UPEI is a small institution (28 full-time staff) and we not only implement open source for all our business requirements, but we also contribute cash and resources to projects like Drupal, Fedora Commons, Evergreen and PKP. Even with all this I still think the most important step libraries can take in transforming their organizations is using open source: if we can encourage institutions to use Islandora then I think we have succeeded.
Trevor: You talked a good bit about what Islandora is doing to stimulate an open source community. Could you give us a rundown of the different ways that Islandora sustains and stimulates an open source community and explain the role that each of those ways plays?
Mark: The Islandora project values participation in various ways:
- We have 2 online groups, a general Islandora group and one with a technical focus. Together the 2 groups have over 500 members and activity on the general list in particular is significant. As in any vibrant open source community, the open lists provide a primary means of participation.
- We hold an annual Islandora Camp in PEI and are expanding this in 2013 to have 3-4 annual camps, one each in PEI, the U.S., Europe and Australia/Pacific. These are 3-day events with both traditional and unconference events and continue to be a critical way to provide training and outreach to the global Islandora community.
- Whenever possible we record conferences and other educational events and will be offering additional online training options starting in 2013.
- Our continuous integration framework and Github repositories facilitate more effective code maintenance and testing and will be important to encouraging more developers to contribute directly to Islandora.
- We are developing comprehensive documentation and will be providing printed copies this year.
- We are working with the Hydra and Fedora communities to ensure the systems can work together for the long-term.
Trevor: In the NDSA infrastructure working group we have been discussing the idea of articulating something about some of the key reasons that open source software is particularly important for key parts of digital preservation systems and workflows. I would be curious to hear what thoughts you have that might fit into this line of thinking?
Mark: Open source software is important for any technical requirement, but especially those that steward the core assets of the institution. Some of the key roles an open source system can play in this context include:
- An open approach to storing digital assets, ideally directly in the file system, thereby facilitating management and migration in the long-term.
- A modular and flexible framework for micro-services that handle data transformation, integrity checking and other elements of a strong preservation approach. Modularity in services allows adoption and integration of new best-practice tools without having to wait for a vendor or project to do so and gives an institution the ability to determine what tools best fit its needs.
- Software code that is open to scrutiny and long-term maintenance and improvement, or that can be forked as needed to ensure the interests of the organization can be accommodated. It is critical to be able to determine how software works and what it does to assets in order to ensure effective long-term stewardship.
- Open source projects encourage creativity and experimentation, leading to improvements in software that can respond to the ongoing challenges of managing digital assets.
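Editorial aside: the list above mentions integrity checking and storing assets directly in the file system. One reason those two go well together is that a fixity check over plain files takes very little code. The Python sketch below is a generic illustration using SHA-256 checksums, not a description of how Islandora or Fedora perform integrity checks; the function names and the manifest format (relative path mapped to hex digest) are assumptions made for this example.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file under root (by relative POSIX path) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def verify_manifest(root, manifest):
    """Compare files under root against a previously built manifest.

    Returns a dict of problems: relative path -> 'missing' or 'changed'.
    An empty dict means every listed file is present and unchanged.
    Files added after the manifest was built are not reported.
    """
    root = Path(root)
    problems = {}
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file():
            problems[rel] = "missing"
        elif sha256_of(target) != expected:
            problems[rel] = "changed"
    return problems
```

A typical use would be to call `build_manifest` when content is ingested, store the result alongside the collection, and run `verify_manifest` on a schedule or after a migration.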