The following is a guest post from Meg Phillips, Electronic Records Lifecycle Coordinator for the National Archives and Records Administration.
“What’s the bare minimum I can responsibly do with my electronic stuff?” was one of the central questions on the table at CurateCamp Processing. The unconference, focused on Processing Data / Processing Collections, was a great way for a group of thought leaders and practitioners to surface issues keeping them up at night, compare notes, and start charting a path forward. The theme for this CURATEcamp framed a series of discussions on how archivists and librarians think about processing digital collections compared to the ways programmers, software developers, and engineers think about processing data. We worked on a lot of different issues, but I found one particularly interesting: what do recent discussions in the archival community about minimal processing mean for digital materials?
The CURATEcamp format allows all participants to propose, and collectively select and organize, the sessions the group is most interested in discussing. One of the sessions that resulted from this process focused on how the archival principles of “More Product, Less Process” (MPLP), as Mark Greene and Dennis Meissner describe them in “More Product, Less Process: Revamping Traditional Archival Processing,” apply to the processing of digital materials.
The participants in this session wanted to explore whether we could reach a professional consensus around what must be done to all digital files. This question would also reveal what the community considers intensive processing that might be applied to only some collections or files. We wanted to benefit from MPLP’s rational approach to allocating resources. If we could figure out how to apply these concepts, the maximum number of collections would be usable by the greatest number of people in the electronic realm as well as the physical.
There are clearly some differences in managing paper and electronic objects, but there are similarities, too. One important difference is that some processing steps for digital objects can be automated. Even if the actions are performed at the item (or file) level, as is often the case in the electronic world, this kind of data processing is not necessarily a bottleneck that creates backlogs. Similarly, content searching offers a way to locate electronic items that simply isn’t available for physical items. On the other hand, as with physical records, processing steps done by humans, or steps that require a great deal of analysis, can create processing backlogs even if they can be applied to many files at a time.
The participants in the session agreed that the minimum elements may not always be the same for all institutions. Some institutions have legal environments or strong researcher expectations that, for example, the original media will be preserved, or restricted information exempt from Freedom of Information Act requests must not be released to researchers. (There was another session at the CURATEcamp specifically about using automated tools to speed the review of collections for restricted information, an intriguing related topic.)
In spite of the importance of each institution’s particular environment, by the end of the session participants were able to sketch out a preliminary list of minimal processing steps, which follows:
- Establish fixity, for example through hash codes, so changes to files can be detected
- Make a backup copy to reduce the risk of loss
- Provide write-blockers to ensure that files can’t be changed accidentally or intentionally. Document the chain of custody and provenance and provide some archival context for the materials
- Provide some way of discovering that the materials exist and of finding materials within the collection.
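The fixity step at the top of this list is straightforward to automate. Here is a minimal sketch in Python, assuming SHA-256 as the algorithm (the list itself doesn’t prescribe one); the function names are illustrative, not any particular tool’s API:

```python
import hashlib
from pathlib import Path

def hash_file(path, algorithm="sha256", chunk_size=65536):
    """Compute a fixity hash for one file, reading in chunks
    so large files don't need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_fixity(collection_dir):
    """Walk a collection directory and record a hash for every file."""
    return {
        str(p.relative_to(collection_dir)): hash_file(p)
        for p in Path(collection_dir).rglob("*") if p.is_file()
    }

def verify_fixity(collection_dir, recorded):
    """Re-hash the collection and report any files whose hash changed."""
    current = record_fixity(collection_dir)
    return [name for name, h in recorded.items() if current.get(name) != h]
```

Recording hashes at accessioning and re-running the verification on a schedule is enough to detect silent change or corruption.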
One interesting thing about this list is what it doesn’t include. The list does not include identifying the format of the files, validating that files are well-formed, or migrating files to more researcher-friendly formats. The first two topics didn’t even come up. Someone did suggest that providing files to researchers in formats they could use might be essential, but others believed that format migration and emulation were so complicated that they should be considered intensive processing, not minimal processing. However, as MPLP reminds us, minimal processing may not be sufficient for all collections.
The hour available to generate this list at the CURATEcamp went by in a flash, and I was impressed that we were able to generate even a first draft like this. However, we didn’t have time to systematically poke holes in the ideas reflected here or debate a lot of other options. I would be thrilled to hear from blog readers about their thoughts and reactions.
What do you think?
Would it be possible, or even desirable, to have a community definition of minimally acceptable processing for born digital archival content?
What are your opinions about the items we came up with at this session?
What would you add or subtract from the list, and why?
I like the idea of a minimally-acceptable processing workflow!
I do think the notion that file-format detection is arduous and out-of-scope could do with rethinking. Quite a few tools exist to automate this now, such that even if remediating/migrating files isn’t on the table (which I completely understand, though at least some of this work is automatable as well!), recording what file formats you have and what condition files are in should be.
Generally agree with this list as the minimum workflow, and it’s close to what we do with our web archive.
The issue of format profiling is an interesting one. IMO, there’s not much point in doing it before you need the information, or in treating the result as archival-quality information, since the identification tools are constantly being improved. Where possible, it is much preferable to go back and re-identify the formats as required, perhaps piggybacked on the fixity-checking process.
However, making the content discoverable means parsing it, and this generally involves a format identification step. So I would argue that your workflow is doing identification, but as a means to discovery rather than an end in itself. And I would argue that that is precisely the right approach!
I would agree with bit-preservation being the minimum. Many people (from David Rosenthal http://blog.dshr.org/2007/04/format-obsolescence-scenarios.html , to this comment author today http://arstechnica.com/information-technology/2012/08/for-one-cent-a-month-amazon-glacier-stores-your-data-for-centuries/?comments=1&post=23188512#comment-23188512 ) have suggested that obsolescence isn’t really a problem. So as far as ‘digital preservation’ is concerned, the bare minimum ought to be the ability to do just that: serve up the files you were originally given. Almost everything else can be taken care of at the point of serving them up, e.g. identifying formats, finding a rendering environment, etc. In fact you might be better off waiting until later to do format identification, as the tools will likely be more functional and effective then.
An exception to this might be to at least try to capture any metadata that comes along with the files and might help maintain your ability to serve up the content in the files with full integrity in the future. For example, capturing any information about the intended rendering software environment at the point of ingest would be a good idea.
Another exception might be to at least attempt to identify compound digital objects at point of ingest or before so that they can be grouped together and/or so any dependent components can be ingested together and not missed from any transfer. For example linked spreadsheets or movie files that come with subtitles in an additional file. One tool for doing some of this is available here: http://sourceforge.net/projects/officeddt/
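For the spreadsheet example, one kind of dependency can be spotted cheaply because of how the format is packaged. The sketch below, in Python, checks an .xlsx file for the externalLinks parts that OOXML uses to record references to other workbooks; it is only a heuristic for this one format, not a general dependency detector like OfficeDDT:

```python
import zipfile

def has_external_links(xlsx_path):
    """Flag an .xlsx spreadsheet that references other workbooks.
    OOXML packages are zip files, and links to external workbooks
    are stored in parts under xl/externalLinks/, so their presence
    is a cheap signal that dependent files may need to be collected."""
    with zipfile.ZipFile(xlsx_path) as z:
        return any(name.startswith("xl/externalLinks/")
                   for name in z.namelist())
```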
I would add virus checking to that list. An archive that could potentially infect and damage its users’ computers is not a trustworthy resource. Because we can’t depend on users to have up-to-date virus protection installed, the duty should fall on us.
Great comments! Keep them coming.
I like the idea of adding virus scanning to the list, and identification of compound records would be great, although it may be challenging to require for all cases at this point.
Should the minimum standard be something archives could achieve for all records?
Although virus scanning is important, I think it is vital to consider precisely when it should be done. It is tempting to do it early, during the ingest process, and to treat this judgement as archival metadata. However, virus scanners are constantly being improved, and so early scanning may fail to catch new viruses until the detectors catch up. To make sure every detectable virus has been caught, it is necessary to scan upon delivery rather than on ingest. Email can be scanned for viruses as it is sent, and I think the same should apply to any digital output.
It’s not clear whether this should be on this workflow list, as it seems to focus on ingest and care rather than delivery/access/use.
Andy, your point about what can be done at the moment of delivery is a good one. To some extent I think this gets at the heart of the matter for thinking about how computational and archival processing fit together.
Something like virus scanning, which can be run as an automated process at different points, can (to some extent) likely be abstracted out of the archival processing workflow.
It strikes me that file format detection, another computational process, is likely to have some similar issues. In this case, I would read “Provide some way of discovering that the materials exist and of finding materials within the collection” as generating something like a manifest, which would at least list file names and extensions. So even creating a manifest gives you some format information. Beyond that, you are going to want to do file format characterization too, but since the tools for doing that are also likely to improve over time, and since this is the kind of thing that can be run and rerun as a batch process over files, it may be better to pull much of this computational processing out of the actual archival processing steps.
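A manifest of the kind described here takes only a few lines to generate and can be rerun as tools improve. A sketch in Python, with an assumed CSV layout (path, extension, size) rather than any standard manifest format:

```python
import csv
from pathlib import Path

def write_manifest(collection_dir, manifest_path):
    """List every file in a collection with its name, extension,
    and size -- a first, re-runnable pass at discoverability.
    The extension is only a hint at format; fuller characterization
    tools can be run (and rerun) over the same files later."""
    with open(manifest_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "extension", "bytes"])
        for p in sorted(Path(collection_dir).rglob("*")):
            if p.is_file():
                writer.writerow([str(p.relative_to(collection_dir)),
                                 p.suffix.lower(), p.stat().st_size])
```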
The separation of archival processing and other computational processes makes me a little uneasy, because from the user’s perspective things like pre-ingest normalisation versus on-the-fly migration are indistinguishable. I’m keen to consider delivery processes as part of the overall archival service because it helps make it clear that you don’t have to do everything during the ingest phase. YMMV, of course.
Also, I would like to thank Euan (#3 above) for reminding me of the one thing I would consider adding to this core workflow – dependency analysis. Many formats can depend on external resources (fonts, other documents, web pages, plugins, etc.) and it is important to catch these dependencies as early as possible so that collecting them can be part of the ingest workflow. Unfortunately, while OfficeDDT and some PDF tools can help with this, much more tool development work is needed.
One thing that appears to be missing from the list is establishing that what is received from the creator is what was agreed would be transferred. There may be problems with the transfer or the sender may send the wrong material. It happens.
I would say the first three requirements are easier and less time-consuming than the fourth. The first and third points are actually the same thing, while the second sentence of the third point is a separate point altogether. Also, I believe that imaging of removable media should be an explicit part of these recommendations. So I would rewrite as follows:
1) If files are being taken into a repository on removable media (hard drives, floppy disks, CD-Rs, DVDs, thumb drives, SD cards, etc.), create images of the media, with write-blockers in place to ensure that files can’t be changed accidentally or intentionally.
2) Establish fixity through hash codes, so changes to files can be detected. (This can be done as part of the imaging process for removable media. Automated metadata creation can happen during imaging also.)
3) Make a backup copy of the data/images to reduce the risk of loss.
4) Document the chain of custody and provenance and provide some archival context for the materials.
5) Provide some way of discovering that the materials exist and of finding materials within the collection.
I would group together number 5, virus detection, and file format identification under the general term of “analysis.” The first three parts (preservation) can happen first, while some aspects of analysis can be put off until later, for reasons that others have articulated well already. In any event, analysis can become a time-consuming process, even if what you have in mind is “less process.” But the first three parts need not be time-consuming (depending on volume of data and media).
If digital content arrives as part of a collection that also includes paper, then perhaps MPLP means processing the paper portion as you normally would, doing steps one through three as described above for the digital portion, taking an hour (or a day) to skim through the digital content to get a rough idea of what’s there, and adding a description of that content to the scope and content note of the finding aid for the rest of the collection.
This is very, very oversimplified because it does not take into account privacy issues, copyright issues, or the issue of how and whether the data can actually be serviced to researchers. What you can read of the records is also dependent on what software you have (as is imaging, for that matter). But that’s my very rough blueprint.
OCLC Research convened a panel of experts to come up with the minimum requirements for getting control of backlogged born-digital content. The report was published today: You’ve Got to Walk Before You Can Run: First Steps for Managing Born-Digital Content Received on Physical Media.
Great discussion! Is it worth moving to the Curate Camp wiki so we can shape the list based on the excellent suggestions?
Great idea Paul, feel free to take this and fiddle with it on the CurateCamp, or for that matter, the OPF wiki. Wherever you think folks will be game for tweaking and reworking it.
If this discussion is truly about the minimum acceptable processing, why would you want to image the media? This is overkill in terms of preservation and it is a process that does not easily scale.
If you are performing fixity checks and creating copies of the files, why would you need to apply “write blockers?” The former are fairly easy to automate at scale. The latter may not interoperate across multiple file systems/repositories/operating systems and is redundant if you put the former in place.
Ricky, Thanks for the link to the OCLC report! It’s interesting that lots of us saw the need for something similar. OCLC was addressing a slightly different situation than the CURATEcamp session did, but there’s notable overlap in the types of steps we identified.
In answer to Mark Conrad’s question, media imaging (and the use of write-blocking software or hardware used while doing it) is now regarded within the community of archivists as standard, baseline preservation practice, not “overkill.” That’s why the OCLC report issues the same instructions. There’s really no way around it: you want a bit-for-bit representation of what’s on the media in order to fully capture all available data and its context. The write-blocking part is most important at the imaging phase, although it’s probably advisable to create some write-block security around the preservation images. (Processing would of course be done on image copies.) But the write-blocking should have no bearing on working across multiple file systems or repositories: each new environment for the images should have its own write-blocking procedure to protect the image as it was read from the original media. Fixity checks are another way of guaranteeing the data hasn’t changed.
No, imaging is not a process that scales, although where I work (NYPL), we managed to image close to 80 floppy disks (both 3.5″ and some 5.25″) within about two weeks, and that was with several people only doing it when they happened to have some spare time (and learning how to do it, too). Once this step becomes part of every archivist’s regular workflow (we’ve only started establishing our digital processes in the last few months), it will be far quicker.
I’m also leery of disk imaging from a risk management perspective. As a repository holding the papers of two later-identified pedophiles, what if we had images of their hard drives containing deleted but not overwritten illegal graphic content? What would our potential liability be, or the damage to our public image, if this data were unintentionally exposed, stolen, or hacked? And this is only an extreme example.
Archivists traditionally have taken a relatively conservative view about personal privacy and the written record, but I hear little about this topic w.r.t. born digital content.
Institutional context should be a major part of one’s decision making process in this arena.
John, To me this is more of an argument for why it might be better to create logical disk images than to create forensic ones. The logical disks don’t create bit for bit copies but instead simply bring over the structure of the files themselves. So, you wouldn’t have copies of deleted but not yet overwritten things.
With that said, in either case it is unlikely that you are going to go through and read every file on a given disk as part of accessioning it. In this scenario, though, the risks of having a copy of the disk and of having the disk itself, or its hard-copy equivalent, seem to be largely the same. In either case, archivists make relatively coarse, folder-level decisions about what to keep and what levels of security to keep it at. For example, in cases where there was rather sensitive material, you could keep the images on a non-networked system, or on a networked storage system you could encrypt such potentially sensitive materials.
Still, when a disk is imaged there are actually very potent tools available for identifying some categories of illicit content that could be used to expunge that content. (For example, see how the FBI makes use of tools created to work off the hashes provided by the National Software Reference Library.)
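As a sketch of how such hash-based screening works: given a set of known-content hashes (the NSRL reference sets are keyed on SHA-1), flagging matching files is a simple set-membership test. The function names here are illustrative, and loading a real NSRL hash list is not shown:

```python
import hashlib
from pathlib import Path

def sha1_of(path, chunk_size=65536):
    """SHA-1 digest of a file, uppercased to match the convention
    used in published hash lists such as the NSRL's."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest().upper()

def flag_known_files(collection_dir, known_hashes):
    """Return paths of files whose hash appears in a reference set
    of known content -- e.g. one loaded from an NSRL hash list --
    so they can be reviewed or expunged."""
    return [str(p) for p in sorted(Path(collection_dir).rglob("*"))
            if p.is_file() and sha1_of(p) in known_hashes]
```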
My experience is that considerations of privacy are very much in the mix of many organizations’ work in this area. In many cases this can create a kind of paralysis that keeps an organization from starting to take the necessary preservation action of imaging the majority of its disks.