A few months back, during the Personal Digital Archiving 2013 conference, I was struck by how much interesting research was being done in the field of digital preservation. Everything from digital forensics to gamification, all of it thoughtful, much of it very practical and applicable. Still, I couldn’t help wishing that there was even more going on.
In NDIIPP we often interact with granting organizations and get a peek at the types of things proposers are hoping to get funded. While many useful things are proposed and get funded, I’m struck more by the types of things that I don’t see as often: proposals for practical, applied research that directly address long-time digital stewardship challenges or that build on other stellar research to establish a focused advance towards solutions. Many of the issues that need more focus are the types of things that cause organizations to wait on digital stewardship because the problems aren’t solved yet.
So I started writing down a list of things that might merit further attention from researchers and funders. I haven’t done an exhaustive search to see what’s currently being done in these areas (please point things out in the comments!) nor have I thought through all the challenges of doing these types of research (that’s for the researchers!) but I do think these merit further attention.
My inspiration for encouraging applied research is the work NDIIPP did back in 2005 with the Archive Ingest and Handling Test project. The AIHT was designed to test the interfaces specified in the architectural model for NDIIPP. The researchers ended up discovering that “even seemingly simple events such as the transfer of an archive are fraught with low-level problems, problems that are in the main related to differing institutional cultures and expectations” (from its final report (PDF)).
The observations that came out of these discoveries, rather than being irritating sidebars to the “real research,” actually provide ample practical value to future researchers engaged in similar digital preservation activities.
The GeoMAPP project took a similar approach to try to surface unexpected results by having the participants transfer their geospatial data collections back and forth between the different states, exposing each to new approaches and the challenge of “last mile” transfer, storage and network infrastructures.
This is the kind of unexpected knowledge that can come out of applied research, the kinds of efforts that might be applied to some of the areas below:
Format Migration: What happens to any particular file when you migrate the file from one version of software to another? What happens when you migrate from one software type to another, for example, converting files from one type of word processing software to another? What changes happen to the file and the information inside and can these changes be quantified and measured? How can we quantify the changes that happen and determine if they have any import for digital preservation actions? Is it possible to do all of this at scale and be able to manage the changes in a coherent way?
Format obsolescence, and the need to address it, has become a truism in the digital stewardship community. While it may be a vexing problem, there is still doubt about how acute it is. Still, we’ll need answers to the questions above in order to determine whether addressing format obsolescence through migration is worth the cost of doing so.
Fixity Checking: How often do we need to check the fixity value of any particular digital file to ensure that it remains the same? Is there a risk in touching files too much? Is there an optimal amount of contact that will ensure authenticity while limiting risk and cost? Will regular fixity checking give us more accurate error rates for different types of digital storage? Are there increases in error rates based solely on fixity checking? What are the actual computing costs of checking the fixity of digital files at scale?
Bill Lefurgy described the importance of file fixity in an earlier post as “critical to ensuring that digital files are what they purport to be, principally through using checksum algorithms to verify that the exact digital structure of a file remains unchanged as it comes into and remains in preservation custody.” The NDSA is making efforts to uncover member approaches to file fixity through its regular “storage survey,” while individual members are aware of the value of regularly checking the fixity of the digital materials under their purview. The Scape project is looking at this, as is the computer industry. Still, it’s the digital preservation community that is taking the lead in considering these issues, and much more work needs to be done to get some basic data on what happens when we do these types of activities.
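For readers who haven’t seen a fixity check up close, here is a minimal sketch in Python of what checksum-based verification can look like. It is only an illustration, not a description of any particular institution’s workflow: the function names (file_checksum, build_manifest, audit) are hypothetical, and SHA-256 is assumed as the algorithm simply because it is widely available.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Record a checksum for every file under root (hypothetical helper)."""
    root = Path(root)
    return {
        str(path.relative_to(root)): file_checksum(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def audit(root, manifest):
    """Compare current checksums with the manifest; return only the problems found."""
    root = Path(root)
    problems = {}
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file():
            problems[name] = "missing"
        elif file_checksum(target) != expected:
            problems[name] = "changed"
    return problems
```

A manifest built at ingest could be stored alongside the collection and audit run on whatever schedule an organization chooses. Every audit reads every byte of every file, which is exactly the cost and “touching” trade-off the questions above are asking about.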
Email Archiving: What are the main challenges of email archiving? How can preserved email be made accessible? Is it possible to “weed” irrelevant email messages from those that are archival through automated processes? How can email attachments be preserved along with the messages themselves? How much storage does an average email archive require?
Email archiving is a prime concern for archival institutions, especially those in government. Email archiving solutions are strongly weighted towards the type of email system employed by the organization, and as such, much of the research in the backup and storage of email has been ceded to the information technology industry. It’s uncertain whether the IT approach takes archival concerns into consideration, however, and there remains a shortage of research on email from the archival perspective that might inform IT industry practices. The Collaborative Electronic Records Project focused on the preservation of email, and there has been some research on the archival side into tools that make email archives accessible, such as Muse. Chris Prom’s definitive DPC Technology Watch Report on Preserving Email (PDF) suggests a wide range of potential research paths, but it’s unclear if more practical work has built on his excellent observations.
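As one small illustration of the attachment question, the sketch below uses Python’s standard email library to list a message’s attachments together with a checksum for each, so that the link between message and attachment can be recorded. The function name inventory_attachments and the fields it records are assumptions made for this example, not an established archival schema.

```python
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage


def inventory_attachments(raw_message):
    """List each attachment in a raw RFC 5322 message with its size and checksum."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    records = []
    for part in msg.iter_attachments():
        # decode=True undoes base64/quoted-printable transfer encoding;
        # it returns None for multipart attachments, hence the fallback.
        payload = part.get_payload(decode=True) or b""
        records.append({
            "subject": str(msg["Subject"]),
            "filename": part.get_filename(),
            "content_type": part.get_content_type(),
            "size_bytes": len(payload),
            "sha256": hashlib.sha256(payload).hexdigest(),
        })
    return records


# Usage: build a small test message (addresses are placeholders) and inventory it.
example = EmailMessage()
example["Subject"] = "Budget"
example["From"] = "sender@example.org"
example["To"] = "archive@example.org"
example.set_content("See attached.")
example.add_attachment(b"%PDF-fake", maintype="application",
                       subtype="pdf", filename="budget.pdf")
attachment_records = inventory_attachments(example.as_bytes())
```

An inventory like this says nothing about whether an attachment’s format can still be rendered, but it does give an archive a verifiable record of what arrived with each message.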
Thoughts on the above questions? Areas that you think need further research? We’d love to hear your thoughts in the comments.
Thank you for this post, Butch.
I’d like to point out some more resources that may be useful to readers.
For email preservation, I’d encourage having a look at the final reports from the InterPARES 3 Project’s “Keeping and Preserving E-Mail” and “Guidelines and Recommendations for E-Mail Records Management and Long-Term Preservation” (Team Italy). PDFs can be found here: http://www.interpares.org/ip3/ip3_general_studies.cfm#gs05b
The Archivematica project has also documented our work with e-mail preservation planning here: https://www.archivematica.org/wiki/Email_preservation and here: https://www.archivematica.org/wiki/Email
Good call–though most of your examples seem to focus on storage and its limits. You may be interested in the Guggenheim’s emulation testbed “Seeing Double,” which paired works running original hardware with their emulated counterparts. Visitors to the exhibition were surveyed about the relative success or failure of each emulated work in capturing the spirit of the original.
The results turned out to vary by demographic–a conclusion examined in the forthcoming book Re-collection (http://re-collection.net/).
Thanks Courtney and Jon, this is exactly the type of information I was hoping would be shared!
Curiously enough, my colleague Victoria Lain in the research department at The National Archives (UK) has published this blog post today: Academic engagement and research strategies, which describes our current research priorities, many of which are in this arena.
Butch, this is a good summary, thanks.
One observation is that the community is getting hung up on research when a lot of these things are already available: fixity management, format migration, email, etc. are all in use now in cloud-based solutions like Preservica or in your data center using SDB (as used by David at The UK National Archives). Whilst research is important, so is getting started!
Thanks for the note. I agree that a lot of these solutions are already there. And I guess the word “research” in the title is a little misleading. Perhaps I’m looking for case studies. Something along the lines of “here’s what we did, here’s what we used, here’s what we observed and here’s what we recommend.” I still think of that as “research” but with the goal of exposing solutions for others to use.
Great post Butch!
I’m a strong believer in focusing our limited DP resources on targets that have been prioritised by the users or practitioners at the coal face of digital preservation. I agree very much that practical research and development, working with real data, is the way forward. I would definitely like to see more work in the vein of AIHT. We’ve got to learn by doing.
This page captures the preservation needs (and a variety of tools and solutions to meet those needs) from over a hundred different digital practitioners from many different organisations around the world. Possibly the biggest collection of its kind. Analysis of this data suggests that one of the biggest practitioner needs is for better characterisation. In other words, practitioners want to know more about their data, and they want better automated tools to help them meet that need.
This stuff lines up pretty well to your “here’s what we did, here’s what we used, here’s what we observed and here’s what we recommend” need. We try and capture everything, as the “we used this tool on this data and it didn’t work” is as valuable to share as the successes.
Regarding your migration/obsolescence comments… The consensus is growing that stereotypical format obsolescence is not the problem it was once made out to be. It’s possible to find software to migrate (or emulate) almost any data. However, I think you’re right to highlight the problems with how accurate the migration (or rendering) is. We have migration tools, but we don’t have tools to quality check the migration. This excellent report provides some of the evidence needed on the pitfalls of migration:
And finally, on the email preservation issue, there are plenty of practical approaches explored here (sourced from the first link I provided above):
Thanks so much for taking the time to share those resources! I’ll definitely spend some time with them and I hope our readers will as well.