Archivematica and the Open Source Mindset for Digital Preservation Systems

Peter Van Garderen & Courtney Mumma

I had the distinct pleasure of hearing about the ongoing development of the free and open-source Archivematica digital preservation system twice this year: first from Peter Van Garderen at the CurateGear conference, and second from Courtney Mumma at a recent briefing on the project for staff at The Library of Congress. Peter and Courtney both hold archives and library graduate degrees and work for an open-source technical services provider. I am pleased to have the chance to talk with both of them in some further detail.

Trevor: Tell us a bit about Archivematica.

Courtney: Archivematica is a software system that is designed to maintain standards-based, long-term access to collections of digital objects. It processes digital objects from ingest to access in compliance with the ISO-OAIS functional model. It uses a micro-services approach to invoke a number of integrated open-source tools that perform granular processing tasks such as virus checking, checksum verification, file format conversions, etc. Users monitor and control the processing workflows via a web-based dashboard. There is a brief screencast at archivematica.org that demonstrates this functionality.

Archivematica 0.9 dashboard

Trevor: Could either or both of you explain some of the design principles that inform how you are building the software? How are you designing for the particular concerns of digital preservation?

Peter: Our design principles are driven by the problem of digital preservation. We have to figure out how to keep existing digital information objects accessible, usable and authentic so that they can be used at some undetermined point in the future, on some yet-to-be-created technology platform. So far our profession has come up with three strategies, or variations on them: emulation, migration, and normalization. Digital preservation is risk management. Since time travel is the only way to judge which of the strategies will be most successful, we have to hedge our bets. Therefore, Archivematica is being designed to implement all three strategies.

Archivematica helps to reduce the risk of technology obsolescence, incompatibility and complexity. It does this by analyzing and monitoring the original digital files as well as making them available for emulation or migration at some later point. It also implements normalization to preservation standards for those file formats where a better open specification exists. One of the most influential factors in selecting preservation formats is community adoption, e.g. whether institutions like the Library of Congress have selected the same format as their preservation format. Therefore, we are currently working on our Format Policy Registry, which is intended to provide an ongoing format watch service to Archivematica users as well as to crowdsource and share empirical information about the format policy decisions being implemented in Archivematica installations: namely, whether and how users customize the default normalization rules, which tells us whether any changes are required to Archivematica's default format policies.
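Conceptually, a format policy entry is just a mapping from an identified source format to a default normalization target that a local installation can override. Here is a minimal Python sketch of that idea; the format names and rules below are illustrative examples, not actual Format Policy Registry entries or Archivematica code:

```python
# Hypothetical sketch of a format policy lookup: it maps an identified
# source format to a default preservation (normalization) target.
# These example rules are illustrative only.
DEFAULT_FORMAT_POLICIES = {
    "image/x-canon-cr2": "image/tiff",        # proprietary raw -> open TIFF
    "audio/x-wav": None,                      # already a preservation format
    "application/msword": "application/pdf",  # closed word processing format
}

def normalization_target(mime_type, local_overrides=None):
    """Return the preservation format for a file, letting a local
    institution override the community default. None means the
    original is kept as-is."""
    policies = dict(DEFAULT_FORMAT_POLICIES)
    policies.update(local_overrides or {})
    return policies.get(mime_type)
```

Sharing the overrides that institutions apply against such defaults is exactly the kind of empirical, crowdsourced evidence Peter describes feeding back into the registry.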

Archivematiica overview

Archivematica’s design also reflects our strong belief in the importance of standards-based systems for curating and managing digital collections over the long term. Therefore, the Archivematica AIP uses BagIt, METS, PREMIS, and Dublin Core. If you are using proprietary or custom data standards, the cost of moving that system data to a newer platform or sharing it with other platforms will be prohibitive, thereby putting the long-term accessibility of those digital objects at risk. We have seen this repeatedly in other legacy data migration projects.
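These standards are deliberately lightweight. A BagIt payload manifest, for instance, is just lines of "checksum  relative/path". The following Python sketch illustrates that idea in simplified form; it is not the full BagIt specification nor an actual Archivematica component:

```python
import hashlib
import os

def make_manifest(payload_dir):
    """Build BagIt-style manifest lines ('checksum  data/relative/path')
    for every file under a payload directory. A simplified sketch of
    the BagIt idea, not a complete implementation of the spec."""
    lines = []
    for root, _dirs, files in os.walk(payload_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            rel = os.path.relpath(path, payload_dir).replace(os.sep, "/")
            lines.append("%s  data/%s" % (digest, rel))
    return "\n".join(lines)
```

Because the manifest is plain text plus widely documented checksums, a future system can verify and unpack the package with no dependence on the software that created it, which is the migration-cost point being made above.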

Similarly, using overly complex or proprietary technology stacks to preserve digital collections can also be a significant risk. We are trying to solve the problem of technological incompatibility, so why introduce an over-engineered or closed-code software stack, with all its dependencies, on top of the digital files you are trying to keep accessible long-term? It just adds another barrier and potential point of failure for future access. Instead, we have designed Archivematica to work from the filesystem up. We use a basic and long-proven UNIX pipeline design pattern wherein the standard output of one processing task becomes the standard input to the next. These can then be grouped into so-called digital curation micro-services and chained into sophisticated and comprehensive workflows.
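The pipeline pattern (each task's output becomes the next task's input) can be sketched in a few lines of Python. The task names below are placeholders for illustration, not Archivematica's actual micro-services:

```python
import hashlib

def run_pipeline(obj, tasks):
    """Chain granular processing tasks: each task's output is the next
    task's input, mirroring a UNIX pipe. A hypothetical sketch, not
    Archivematica's actual job-chaining code."""
    for task in tasks:
        obj = task(obj)
    return obj

# Placeholder micro-services operating on a simple in-memory record.
def verify_checksum(record):
    record["fixity_ok"] = (
        hashlib.sha256(record["bytes"]).hexdigest() == record["expected"]
    )
    return record

def scan_for_viruses(record):
    record["virus_free"] = True  # a real service would call a scanner like ClamAV
    return record

record = {
    "bytes": b"digital object",
    "expected": hashlib.sha256(b"digital object").hexdigest(),
}
result = run_pipeline(record, [verify_checksum, scan_for_viruses])
```

Because each task only depends on the shape of its input and output, any single task can be upgraded or replaced without disturbing the rest of the chain, which is the isolation argument made below.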

Lastly, another Archivematica design principle is to work with existing collections management tools and storage architectures (e.g. network storage devices, LOCKSS, cloud storage). Archivematica is intended to fill the digital preservation services gap for existing repository management applications rather than try to replace them or replicate their functionality. For example, we are working with our pilot project partners to integrate the Archivematica pipeline with systems like DSpace, ContentDM, ICA-AtoM, and Archivists’ Toolkit. They’ll continue to use those tools for collections management, cataloging and public access whereas Archivematica serves as a back-end support system that helps to implement digital preservation services and workflows for the digital materials managed by those other systems.

Trevor: How are you thinking through various issues around scale in Archivematica? What kinds of bottlenecks would you expect to find and how would you think about addressing them?

Peter: The main bottleneck in processing large batches of digital objects is limited computing resources. If the system has only one processor available and it is performing a format conversion on a 30 GB video file, most other system activity is frozen until that task is completed. Therefore, we have implemented a simple client/server and job queuing strategy that allows us to create processing clusters. The Archivematica server routes micro-service tasks to other processors (either bare metal or virtual machines), which report back when the task is completed. In the meantime, the user can continue to work with the server to monitor and process other objects. We are also currently testing distributed filesystems to reduce disk input/output bottlenecks.
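The client/server queuing strategy can be sketched with a simple work queue: the server enqueues micro-service jobs and worker clients pull and execute them independently, so one long-running conversion no longer blocks everything else. This toy Python sketch uses threads in place of separate machines, and the job names are illustrative:

```python
import queue
import threading

def worker(tasks, results):
    """A processing client: pull a job off the queue, run it, report the
    result back. In a real cluster this would be a separate machine or
    VM, not a thread."""
    while True:
        job = tasks.get()
        if job is None:          # sentinel: no more work for this client
            tasks.task_done()
            break
        name, func, arg = job
        results.append((name, func(arg)))
        tasks.task_done()

tasks = queue.Queue()
results = []
clients = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(2)]
for c in clients:
    c.start()

# The "server" enqueues micro-service jobs; clients work in parallel.
tasks.put(("checksum", lambda x: x * 2, 21))
tasks.put(("normalize", str.upper, "video.avi"))
for _ in clients:
    tasks.put(None)
tasks.join()  # block until every queued job has been processed
```

The same shape scales out by replacing the in-process queue with a network message queue, which is essentially what routing tasks to bare-metal or virtual-machine clients amounts to.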

Trevor: The platform makes extensive use of a range of existing open-source software tools. Could you first tell us a bit about the kinds of tools you are using, and then talk through how you think about the tension between dependencies on these tools and the ability to swap out and change between different components?

Courtney: We did a detailed use case and activity diagramming analysis of the OAIS functional model to determine what processes were actually required from a digital preservation system. This functional analysis has been enhanced by two years of pilot project implementations. We look for mature open-source tools with compatible licensing that can perform these functions and then wrap them into Archivematica micro-services. Examples include ClamAV, FFmpeg, ImageMagick and the File Information Tool Set (FITS). We spend a lot of time evaluating and testing tool options.

The goal of a micro-services based system is to eliminate complex application dependencies by selecting granular tools that do one thing really well and letting those tools operate in isolation from the other tools in the pipeline. This means that maintenance, upgrade or replacement of one of those tools remains as simple as possible, without being restricted by software dependencies further up in the software stack.

Trevor: In the NDSA infrastructure working group we have been discussing the idea of articulating something about some of the key reasons that open source software is particularly important for key parts of digital preservation systems and workflows. I would be curious to hear what thoughts you have that might fit into this line of thinking?

Peter: Well, one thought to consider is that we are trying to tackle technology incompatibility, which is often caused by proprietary software and formats. Therefore, it’s a little ironic to implement proprietary software tools to manage this problem. The more practical point is that many of our tools are in their infancy. Our knowledge and practices are evolving. We need to get as many qualified eyeballs and hands on the code as possible, and open source makes this possible. It allows a wide user base to tinker and tweak and collectively improve our knowledge base. The publicly available project management tools of open-source projects like Archivematica (e.g. discussion group, chat room, issue tracker, wiki documentation, code repository) help to further facilitate a culture of collaboration and encourage open dialogue about what works and what needs improvement. We see this in action almost daily now in the Archivematica project, and it’s very exciting and encouraging to be part of.

I also think it is important to consider where the digital preservation community should be investing its limited time and money. Financial sustainability is a key aspect of digital preservation. The open-source development model encourages users to stretch their investments by pooling their technology budgets, whether that means hiring their own technical staff or third-party contractors to work on open-source tools. Rather than each organization paying separately for commercial licenses, the cost to develop new tools and features is incurred only once and then made available under free license for the shared benefit of the community at large.

The Open-Source Ecosystem

Trevor: In conversations about free and open source software we often hear about the difference between “free as in beer” and “free as in speech”. Courtney, in your talk you explained Archivematica as being “free as in kittens.”

Courtney: Peter uses puppies but I use kittens! Nevertheless, the idea is that free beer is gratis. You don’t pay for it. Free speech is not about money; it’s about a freedom or right that’s been granted to you. Similarly, open source software gives you the freedom to study, enhance and redistribute software code. These are the fundamental characteristics which make free and open-source software such a powerful force. However, I like Peter’s idea of reminding us all that there is no free lunch.

Peter: Yes, it always costs money to run technology projects. That said, we do believe that well-supported, open-source projects like Archivematica offer significant total cost of ownership savings by giving away well-documented, high-quality software for free. This then leaves hardware and the cost of hiring your own staff or external contractors to install, integrate, upgrade and provide end-user support. So when someone says, “here, have some free software” it’s kind of like getting a free kitten (or puppy) in that it is cute and exciting but it has to be cared for to grow and thrive.

Trevor: What kinds of things do you see organizations needing to do to feed and take care of these system kittens?

Courtney: Well, as Peter just mentioned, there is a financial commitment anytime an organization decides to implement a software system. Even if the organization uses free software and existing hardware they have to account for the cost of staff to implement and maintain new systems. So the organization needs to allocate resources, if only staff time, to ensure its ‘system kitten’ is fed and happy.

Peter: Hopefully, the organization will go further and recognize that the strength of open-source software lies in its community of users and developers, and make a commitment to contribute to it somehow. The main point I’m trying to get across when I make the kitten or puppy analogy is that using open-source software shouldn’t be a passive activity. The users have a stake in the welfare of the technology, so they should get involved in its ongoing maintenance and design. This will be to the benefit of the systems they are now running, but it goes even deeper, to addressing the serious technical capacity gap we still have in our community. Other professions like mechanics and doctors have to adapt to new technology and learn new tools. Archivists and librarians are no different. Sure, not all of us need to be programmers, but the materials under our care are changing; they are increasingly, overwhelmingly digital. Participation in open-source projects enables a kind of active learning: first by de-mystifying the technology, second by empowering users to influence how that technology is shaped, whether that comes from jumping into a mailing list discussion about design choices, submitting a bug report or contributing code patches.

Courtney: There’s also a ‘feel good’ factor to most open-source projects. It can be tremendously satisfying to participate in community-based initiatives and make contributions you know will be of free benefit to others. There’s a lot to be said for gaining the respect and praise of your colleagues. We try to get people as excited about the possibilities of the project as we are. I tweet about it. We get out to most of the professional conferences, host user groups and participate in and facilitate unconferences like CurateCamp. We also draw a lot of our own inspiration and confidence from these interactions, never forgetting that they are all critical to the exchange of knowledge that moves our technologies forward.

Trevor: I did an interview with Bram van der Werf from the Open Planets Foundation a while back focused on the role of open source software in digital preservation. From your talks about the platform I feel like there are some strong connections between his vision and yours. Are there specific points he makes that you think fit with your approach?

Peter: Firstly, I want to say that we’re fortunate to have a guy with Bram’s industry experience working in our field. I really like his vision for OPF: to nurture open-source culture and practices while providing more stability and quality for the tool portfolio we all rely upon. Aside from his OPF duties, he acts as a kind of mentor to open-source project managers like myself. I really appreciate his insights and thought that he articulated some poignant ideas in that interview. I think the most important one was that building digital preservation systems is less about the technology and more about the human capacity and infrastructure that is put in place to maintain it. That echoes a lot of what we’ve just been talking about.

Trevor: Thanks Peter and Courtney. It is great to have a chance like this to talk through and explicate a lot of the thinking that is going on behind your work on this project. I would encourage any of the readers to post their own reflections and questions in the comments.

6 Comments

  1. Karen Carani
    November 20, 2012 at 3:44 pm

    We just had a great presentation for the Infrastructure working group. Overall the use of open source solutions is to be able to rely on a community for support and improvements while your own needs can also be addressed. The evolution of service providers around open source solutions is another way to get support for use of the technologies without having to do it yourself. And they also contribute back to the community.

    The risks for an open solution vs vendor solution are the same. One allows for building a community rather than paying a fee. If tools are agnostic then even if one part of the community fails (breaks up) you still have the strength of the other tools in the mix.

    The service providers need to balance the needs of the client and the larger community.

  2. Courtney Mumma
    November 20, 2012 at 4:38 pm

    Thanks so much, Karen. Those are exactly some of the essential points Peter and I hoped to get across in today’s NDSA Infrastructure Working Group discussion. ~ Courtney

  3. Gail Truman/Truman Technologies
    December 6, 2012 at 2:07 pm

    To add a little to Karen’s observations (as a very strong supporter of open source versus vendor supplied)… The risks of an open source solution versus one from a vendor are different (they are not the same).

    To expand on this claim…One thing that we get from open source is the source code itself (or access to it and to people who can understand it). Compare that to a vendor solution – in that case we get the object code (the executable). If the vendor goes away we cannot look at the code and find out why something broke, how to access the data stored by that executable code. It’s a huge risk for preservation/long-term access needs.

    I was an archive product manager while at Sun Microsystems. One of my products was the Storage Archive Manager (SAM-FS). Customers use it for long-term preservation (policy based data migration, checksums, etc). A problem for customers was/is that Sun (and now Oracle) only distributes the binary/executable and not the source (there’s more beneath this statement – a separate discussion, since SAM is available open source – “kinda”).

    Customers with long-term access/preservation needs often ask for the source to be put in escrow for this very reason – should Oracle/Sun go away, they absolutely need access to the source.

    Therein lies the biggest difference (pertaining to preservation – which is what NDSA is about) between open source versus vendor-supplied “proprietary” applications.

    Open source, when done with a service provider behind it (such as Archivematica, Islandora (Discovery Garden/Truman Technologies), redhat, etc), has the strength of the crowd (crowd-sourcing!). I love it. And if the service provider goes away, there are hundreds of skilled people in the crowd who can fairly quickly start a new service company, or provide support.

  4. Gail Truman/Truman Technologies
    December 6, 2012 at 2:12 pm

    I should add that the comments are my own opinions and views and do not necessarily reflect those of any company for whom I have worked in the past.

  5. Matt Gorzalski
    December 5, 2013 at 9:22 pm

    Late response from me on an older post. My institution is exploring digital preservation options and I’m interested in Archivematica. It looks like a one-stop solution, but considering the complexities of digital preservation, that seems too good to be true. Is Archivematica intended to operate by itself or in conjunction with the many tools available? This blog post mentions that it incorporates tools like BagIt so I assume Archivematica is intended to be the only tool necessary. The website says Archivematica 1.0 is coming soon…any clearer idea as to when?

  6. Courtney Mumma
    December 6, 2013 at 1:53 pm

    Hi Matt,

    Archivematica is a suite of many open source tools knit together and combined with other functionality via micro-services to achieve OAIS-compliant preservation actions on your digital content. Users have many configuration options and opportunities to make decisions about their content as it goes through Ingest.

    Archivematica comes packaged with AtoM (Access to Memory), which is a free and open source content management and access system where users send their DIPs (consisting of digital access copies and metadata) for archival arrangement, description and access. Archivematica is also integrated with other access systems, and the lead developers are keen on growing this list with development partners.

    Archivematica is also agnostic with regard to archival storage, that is, where you put your AIPs (each consisting of a METS.xml file with PREMIS metadata, plus the original and preservation copies, all bagged). You can choose to use a storage system or your own networked storage, but we do recommend regular integrity checking and georemote backup.

    You can find all this information elaborated upon and more on the archivematica.org wiki, or email the discussion list if you have more questions: https://groups.google.com/forum/#!forum/archivematica.

    1.0 is the first production, non-beta, release, so we’ve tested it publicly and with our development and installation partners far more rigorously than in the past. You can already get the code via github and update it to get bug fixes; however, downloading the packages via an Ubuntu 12.04 OS, either in a VM or a bare metal install, is the best way to go. Those packages are currently being tested internally and will be released within the next couple weeks.
