Over the last few months a team of librarians, archivists, curators, engineers and other technologists in the NDSA have been working to draft a simple chart to help prioritize digital preservation work. After iteratively developing this document and workshopping it at Digital Preservation 2012 we are excited to publicly share it for comment.
Why Define Levels?
NDSA members felt that there was great basic digital preservation information, like NDIIPP’s personal digital archiving materials, and that there were extensive, comprehensive requirements for being recognized as a trusted digital repository. However, the working group felt there was a lack of solid guidance on how an organization should prioritize its resource allocation between these two ends of the spectrum.
This is a working draft and we know there are things we haven’t yet addressed, or things that should come before other things, or things that should be in a different box. Please take the time to tell us what you think.
How to read the levels
The overall idea with the document is that all the things in the first level are either necessary prerequisites for things in the second to fourth levels or are themselves the most pressing things to address. To some extent, the goal for this diagram is that you could use it to start getting your proverbial digital boxes off the floor, and then work your way up to level four where you are much more protected against risk of loss.
These levels have been developed with the following ideas in mind.
- This is useful for developing plans — not a plan in itself: This is not a digital preservation cookbook; we believe these are elements that are important but not sufficient for addressing digital preservation requirements.
- These levels are non-judgmental: Organizations have different resources and priorities, and as a result need to think about how to best allocate those resources to meet their specific needs.
- These levels can be applied to collection(s) or system(s): These levels function coherently with everything from individual, case-by-case collection-level decisions to issues for an entire centralized digital library.
- This is designed to be content and system agnostic: This is only about generic issues. Specific kinds of content (e.g., documents, audio interviews, video, etc.) are likely to have their own nuances, but these levels and factors are generic enough that they are intended to apply to any digital preservation situation.
Working Draft of the Levels of Digital Preservation Chart
You can also download a printable PDF draft chart.
How can you get involved?
- Review the document, think about it a bit, see if you think specific things should be moved around, or that the document needs to address some other factor, and then leave a comment here.
- Alternatively, feel free to go and blog about this on your own site and then post a link to your reactions here on this post.
- Send this link out to some of your colleagues or some of the list
What the levels are not meant to address
The levels explicitly do not deal with broader issues related to collection development practices, critical policy framework decisions, general issues involving staffing or particular workflows or life cycle issues. Those are all critical, and in many cases are handled quite well by existing work (like the OAIS model, and the TRAC and TDR standards). The levels do not represent any specific type of hardware, software, system, organization, or product.
Project Team
- Andrea Goethals, Manager of Digital Preservation and Repository Services, Harvard University
- Abbie Grotke, Web Archiving Team Lead, Library of Congress
- Amy Kirchoff, Archive Service Product Manager, ITHAKA
- Kris Klein, Digital Programs Consultant, California State Library
- Jane Mandelbaum, IT Project Manager, Library of Congress
- Trevor Owens, Digital Archivist, Library of Congress
- Meg Phillips, Electronic Records Lifecycle Coordinator, National Archives
- Shawn Rounds, State Archivist, Minnesota Historical Society
- Jefferson Bailey, Strategic Initiatives Manager, Metropolitan New York Library Council
- Linda Tadic, Executive Director, Audiovisual Archive Network
- Robin Ruggaber, Director, Online Library Environment, UVA
Comments (29)
Super work team!
For Information Security, I suggest that by Level 4 logs and audit information should also include “attempts” (to access, delete), with action taken as necessary – even an unsuccessful attempt to remove a file should be noted. [I have designed systems around 17a-4 archives where rule-based action triggers a workflow (an alert to a manager – or even a shutdown of that user’s session if they are attempting a rm -rf * for example).] For Level 3, I suggest that the log includes more than access.
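The idea of logging attempts and triggering a rule-based action can be sketched roughly as follows. This is only an illustration, not a description of any real 17a-4 system: the names `AuditEvent` and `AuditLog`, the threshold of three failed deletes, and the alert text are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditEvent:
    """One logged action against a file (hypothetical structure)."""
    user: str
    action: str       # e.g. "read", "delete", "rename"
    path: str
    succeeded: bool   # False records an attempt that was denied or failed
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class AuditLog:
    """Keeps every event, including failed attempts, and applies one rule."""

    def __init__(self, max_failed_deletes=3):
        self.events = []
        self.alerts = []
        self.max_failed_deletes = max_failed_deletes  # assumed threshold

    def record(self, user, action, path, succeeded):
        event = AuditEvent(user, action, path, succeeded)
        self.events.append(event)
        self._apply_rules(event)
        return event

    def _apply_rules(self, event):
        # Rule: repeated failed delete attempts by one user raise an alert.
        if event.action != "delete" or event.succeeded:
            return
        failed = sum(
            1 for e in self.events
            if e.user == event.user
            and e.action == "delete"
            and not e.succeeded
        )
        if failed >= self.max_failed_deletes:
            self.alerts.append(
                f"{event.user}: {failed} failed delete attempts"
            )
```

In a real system the alert would presumably notify a manager or end the session rather than append to a list; the list just keeps the sketch self-contained.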
Nice job! RE: potential audiences for this document.
The non-professional interested in personal archiving, as well as the non-technically savvy professional will not understand some of the jargon (fixity, ingest, authorization, transformative metadata, obsolescence monitoring process, etc.). Do you have another (“personal archiving”) version of the Chart in the works?
As for the professional reader–I’m surprised not to find any mention of “partner” in the tool/chart. Given that you’re referring to ingest and storage system (which could also be a storage service), it might be good to refer to a data management “partner” explicitly rather than implicitly.
Great work! I really like this way of breaking things down, and I can see how this could be really helpful for those starting off in digital preservation. I have some suggestions for minor additions, and some (perhaps more controversial) thoughts on the file format section.
Firstly, can this go on a wiki somewhere (my usual comment on these things!)? Apart from the obvious benefit of being able to edit collaboratively, it would make it easy to add links to further information (particularly useful to help explain terminology, re: Stephen’s comment).
I’m not sure I like the level headings. I think I’d prefer a scale based on priority (and perhaps ease) rather than a functionally driven one. I think you’re kind of implying the former, but the latter then skews this (eg. see my comment below on migration/emulation).
Storage and Geo Location: It’s also useful to use different storage technologies in at least one copy/node to avoid common mode failure.
Storage / Security: Avoiding any one person having write access to all copies is probably rather sensible, although not necessarily easy to implement.
File Formats: I have all kinds of issues here, although I think they are mainly down to concerns about misinterpretation of the wording used.
“Encourage use of limited set of …formats” this always worries me. When a repository restricts what formats can be submitted or ingested, the user can often be forced to migrate between formats themselves. With no metadata, or QA. This feels really dangerous to me. So who is being encouraged by this statement, and in what way? If it’s simply advice to users to generally choose sensible formats when they can, that’s ok. If it’s going to encourage them to migrate and throw away the original, that’s alarm bells territory for me.
“Validate files against their file formats”. Why? Experience is beginning to show that adherence to a file format spec does not necessarily mean it is free from preservation issues. Equally, a file that does not conform may be perfectly ok, and render correctly in all the relevant software. I’d like to see more focus on identifying preservation risks. Validation can be useful tool in informing risk in some cases, but is not a useful goal in itself.
Again, with “Monitor file format obsolescence threats” is this the most critical concern? Many are starting to question the traditional view of all or nothing file format obsolescence being the main threat to digital collections (eg. Rusbridge, Rosenthal, etc). Personally I don’t think it is in most cases. The real threats are the more subtle ones that distort the content or destroy parts of its meaning. Take the example with PDF, where the main threats that have been described and encountered aren’t directly about obsolescence:
http://libraries.stackexchange.com/questions/964/what-preservation-risks-are-associated-with-the-pdf-file-format
Following on from this, I strongly dislike the “Perform migration, emulations…” part. This implies that to do preservation properly you *have* to do migration or emulation. While the National Archives of Australia normalizes all the digital objects it ingests, I’m not aware of any other organisation that does this! Migration in particular is fraught with difficulties and IMHO should only be undertaken when absolutely necessary. This of course links back to the file format obsolescence question. If none of your files are obsolete (or appear to be heading that way), then why migrate/emulate? Simply adding “where necessary” to the end of the migration/emulation statement would relieve my concerns considerably!
Perhaps it would make sense to include something in the blank box on access and information management record keeping. For instance, who has access to data? Who has write permissions?
Perhaps use records would also be interesting — another way to Know Your Data is to know how it is being used.
Gail: Good point about logging attempts – maybe that could be the start of a level 5 🙂 . When you say that level 3 should include more than accesses – what else do you think it should include?
Stephen: We have thought about producing some documents to supplement this one, including a glossary of terms. Do you think that this would help with the jargon problem you identified? We were also thinking that each cell of the table would be ‘clickable’ and link to a fuller explanation of what’s meant in the cell. Good point about use of partners or outsourcing for some of these functions. One way we could address that is to clarify in the text accompanying the table that these functions do not necessarily need to be performed by your organization. In that case they would need to make sure that their partner or service is doing these things.
Paul, I want to say 2 things re your last point. Normalisation on ingest makes sense because no matter who you are or what your background is you can’t predict the future. NAA believes it’s better to be prepared for the possibility of format obsolescence than not. Just because no one else is doing it yet doesn’t mean the NAA approach isn’t a viable one, irrespective of the issue of format longevity. Only time will tell. In any case NAA preserves all the bitstreams too, so future options are never limited.
I can’t see emulation in itself as anything but an access approach. You don’t ‘do’ emulation as a preservation action because you don’t need to run an emulator unless you want to access an object. So do you start building emulators when you ingest formats or when a consumer requires access?
I second the great comments from Paul.
In reply to Andrew:
I think normalisation is a great thing to do if you can afford it. Unfortunately, until it can be demonstrated that normalisation can successfully preserve target objects (e.g. show that the content that you intended to keep before normalisation was still there after normalisation), it can’t and shouldn’t be relied upon as a **preservation** strategy. The reason I think it’s worth doing anyway is that it can be used to provide access derivatives that may or may not include the same content as the originals. These access derivatives can be useful for servicing a need to quickly get at (for example) some text from a file normalisation is often a good option. However, as a long term preservation strategy normalisation is not only unproven but also provably practically ineffective for many types of objects. The reason for this is that in most cases the cost to verify the process is too high to make it feasible, and if you cannot verify the integrity of your process and the objects it produces then you can’t confirm it is working and can’t rely on it.
As for your comment about emulation. The vision for emulation as a business as usual preservation strategy does have emulation enacted as a just-in-time process not a just-in-case one like normalisation (arguably a much more economic approach over the long term given the low use-rates for most objects in archives/libraries). However that does not mean that is all there is to enacting emulation as a strategy or that there is nothing that needs to be done earlier in the preservation process.
To implement an effective emulation strategy you also need to take steps at point of transfer to ensure you have access to the necessary software environments, or at least ensure that you know which environments will be necessary for maintaining access to (preserving) the transferred objects long term.
Building emulators is an ongoing process that all digital preservation institutions arguably have an imperative to engage with. It is quite a technical process though and most institutions may struggle to find staff able to undertake it. However there are equally important steps that less technically skilled staff can undertake. Building disk images of installed application environments is a relatively straightforward task for less-skilled staff. Booting a raw environment in QEMU and installing software is very similar to installing software on a new PC and does not take a lot of training to be able to undertake. Additionally the community and each institution would benefit greatly from additional documentation of software applications and environments, something that most non-technically trained staff should also be able to undertake successfully. Both of these tasks would greatly help any digital preservation institution that was going to implement an emulation-based digital preservation strategy.
Apologies for the mangled sentence in the middle of the first paragraph of my comment above, I meant to write:
“These access derivatives can be useful for servicing a need to quickly get at (for example) some text from a file, and for this and similar reasons normalisation is often a good option. “
Andrew: Good points, and I don’t want to criticise NAA for what is (as you say) a well prepared approach. But I worry a lot about guidance material that gives the impression that migration has to be performed to ensure effective preservation. In my experience this is simply not the case in the majority of instances (and worse, migration often destroys content as we don’t have effective QA tools). NAA’s “normalise all just in case” approach has one significant downside: it takes up precious resources. Given that we work in a field that’s always struggling to resource its work effectively, there’s a danger that encouraging organisations to needlessly migrate will impact on other preservation activities that they definitely should be doing as a higher priority.
I found this debate particularly interesting (I think you were there?):
http://www.ncdd.nl/blog/?p=2951
Two organisations with very similar aims, and polar opposites in preservation strategy. This suggests to me that we’re still a long way from undisputed digital preservation best practice. We therefore need to be careful what we advocate, as we don’t yet have the evidence to back up the approach.
I won’t respond to your emulation comments as I believe Euan C is going to jump in!
By normalising on ingest you are attempting to predict the future. You’re betting now which file formats are going to be most accessible in the long-term. OK, keeping the original bitstream as well means it doesn’t matter if you bet wrong, but it creates a big overhead at ingest. The figures we’re developing so far at The National Archives (UK) suggest that in practical terms formats are to some extent normalising themselves anyway – the top 20 file formats represent 90% of the total files in our own EDRMS and in, for example, material selected for preservation from London 2012. Also, the vast majority of the material could well never actually be accessed by anyone (the long-tail of the long-tail) – see Oliver Morley’s comments at ICA2012
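For anyone wanting to work out a top-20 share like the one quoted above for their own holdings, a minimal sketch might look like this. It assumes you already have one format label per file (for example from a format identification tool); the function name `top_format_share` and the sample labels are made up for illustration.

```python
from collections import Counter


def top_format_share(formats, n=20):
    """Return the fraction of files that fall within the n most common formats.

    `formats` is any iterable with one format label per file.
    """
    counts = Counter(formats)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common(n) yields (label, count) pairs for the n largest counts.
    top = sum(count for _, count in counts.most_common(n))
    return top / total


# Hypothetical sample: ten files in three formats.
sample = ["pdf"] * 6 + ["docx"] * 3 + ["wpd"] * 1
print(top_format_share(sample, n=2))  # share held by the two most common formats
```

The result is only as good as the format labels going in, so misidentified or unidentified files would need to be counted under their own label rather than dropped.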
Paul, Euan, thanks good discussion. However, no approach has yet demonstrated that it can successfully preserve target objects. We’re all betting after a fashion that a particular approach will work out over time. And I still can’t see that the cost argument favours emulation over migration. Sure, with normalisation the cost is very much upfront, at ingest.
But Euan, the way you’ve described what needs to be done for an emulation based approach seems to suggest that the costs are never ending. I can’t disagree with what you and David say about use and the long tail but we can’t predict what will be used or when so there’s no way of taking a gamble about formats (ie leaving them as they are until some user wants access) that isn’t, to my mind, excessively risky. And keeping all those software environments documented, operating and accessible is an enormous cost, much more expensive than normalisation on ingest. Or so it seems to me. And on top of that you have to build the emulators.
And David, I wasn’t convinced by what Oliver Morley said at ICA2012 – so little will be used and formats last longer than we thought so “we don’t do migration”. I think that’s a complete abrogation of the archival mission, sorry.
Paul, I can’t agree with what you say about migration destroying content. I’d be interested to see evidence for that. Obviously NAA doesn’t agree at all, or they wouldn’t have adopted that approach, and I can’t see that they are wrong to have done so. I am not saying migration doesn’t cause change, but for NAA their conceptual understanding (the performance model) means they have accepted the fact of that change and have undertaken to live with it. If a normalisation results, over time, in an unacceptable loss of content or metadata then NAA would use the preserved bitstream to take a different action, as necessary.
While I certainly agree with you that guides shouldn’t suggest that it’s migration or nothing, I equally can’t be comfortable with the idea that migration/normalisation can’t be a viable part of an armoury of techniques to preserve digital objects over time. Just as I would never say emulation doesn’t have its place, I can’t be happy with views that suggest emulation is the ONLY way to preserve digital objects over time.
We’re probably getting a bit off topic for this post, so Trevor or Bill, please stop us if we should be taking this one somewhere else!
Andrew: Always good to debate this stuff with you!
I’d like to clarify that I wasn’t arguing that emulation is cheaper than migration. I’m suggesting that we should save money by not migrating until it’s absolutely clear that we need to migrate. Do nothing, until we have to do something. The advantage of waiting is that we’ll then know what our migration target is. At the moment (as I understand it) NAA doesn’t have a target so it invented its own XML formats to migrate to. They, presumably, will need to be migrated to another format at a later date to make them usable anyway? Unless they start writing viewers for their XML formats, which would be even more costly. If not, you then have to migrate twice (once now to NAA XML, and once in the future from NAA XML to whatever format is currently usable). I’m then tempted to again make the suggestion that migrating once (when you’re sure it needs to be done) would be cheaper, easier and more accurate.
From my lengthy experience with preservation at the British Library, and working with practitioners from an array of organisations on these preservation challenges…
http://wiki.opf-labs.org/display/REQ/Digital+Preservation+and+Data+Curation+Requirements+and+Solutions
…the biggest preservation concern for me happens when digital objects are altered. As soon as you change the content (rename it, move it, migrate it, etc) you introduce the opportunity to break it in some way (scary). And quite possibly not even realise (very scary). Specifically with regard to migration, we have very few tools to adequately validate that important elements of a digital object are not lost during a file format migration. I’ve seen plenty of examples of this (see Quality Issues and Bit Rot Issues at the URL above).
So I would (cheekily) turn your question around. Where is the evidence that file format migrations have been successful at not losing important properties (at NAA or elsewhere)?
I wouldn’t say that our current approach means we will never do migration (and where for example we are preserving digitised material in JPEG2000 it is highly likely that we will also create some access copies in “ordinary” JPEG, though we may do that on the fly to some extent, rather than as a formal migration or normalisation action). It’s more that at the moment we don’t see that it’s worth the upfront cost, or the possible losses that result during migration. In our earlier experiments with migration we found all sorts of changes in the migrated content which could well have affected how users perceived it.
The question is what are we preserving, and how much assistance do we have to give to the ultimate (potential) users of it. We don’t pre-emptively translate from Norman French or medieval Latin, or transcribe from secretary hand. Is normalisation/migration that different conceptually from such actions?
David, I was quoting Oliver with the “we don’t do migration”. That’s what he said at ICA. I wasn’t sure that was right from what I know of the TNA approach. It would be good to see the results of your migration tests – have they been published?
Paul, I want to clarify the NAA approach. NAA migrate from received formats into formats they perceive as being better candidates for long-term preservation, eg. MS Word to Open Office, audio formats to FLAC, image formats to PNG, etc. The XML component is only a wrapper around the new format. See the Xena documentation on sourceforge.
Sorry, have to go but will continue this discussion later. I do want to address a few more of your points Paul, but still ask the same question: what approach has had time to show it works or not? We haven’t been doing this long enough to know.
From what I remember of Oliver’s talk when he gave it here before going to Australia, in context it meant “we’re not going to do migration by default, or in anticipation”, not that it’s ruled out entirely. Normally we will simply accession in the original format, and that will also serve as the presentation copy.
There was no formal methodology behind the migration tests, more just try a few candidates out, look at the results and realise they looked awful and were not fit to be displayed to the public. As a result there was no published report equivalent to the Kiwi “Rendering Matters” report.
David
I should point out that Oliver’s complete slides are at http://www.nationalarchives.gov.uk/documents/the-national-archives-digital-strategy.pdf and Tim Gollins’s early “Parsimonious Preservation” paper from 2009, which first started to articulate the view that migration is unlikely to be necessary as a BAU approach, is at http://www.nationalarchives.gov.uk/documents/parsimonious-preservation.pdf
I believe that a critical issue has been missed, and that is the quality, longevity and interface to the media. Putting 2 copies on, say, DVD likely provides less value than 1 copy on enterprise tape. The group has not addressed things like optical media hard error rates, which are VERY poor in comparison to, say, enterprise tape or even LTO tape. This chart, in my opinion, from a technology point of view gives the librarian a false sense of security. Without dealing with and appropriately understanding the implications of media reliability, including media failure and silent corruption, the data is at risk.
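Silent corruption of the kind raised above is usually caught with fixity checks: record a checksum for each file, then recompute and compare later. The following is a minimal sketch using SHA-256 from the Python standard library. The function names and the in-memory manifest are assumptions for illustration, and it says nothing about media quality itself, only about detecting that bits have changed.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return relative paths that are missing or whose checksum changed."""
    root = Path(root)
    problems = []
    for relative, expected in manifest.items():
        target = root / relative
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(relative)
    return problems
```

A manifest built at ingest and stored separately from the content can then be re-verified on a schedule, and against each copy, so that a damaged copy can be replaced from a good one.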
Thanks for these comments everybody! I’m going to try and roll up the discussion so far into some concrete suggestions for improving the chart. So here is my list. Please use this as another moment to further open up the discussion again. Do these ideas address your concerns? If not, how would you suggest refining this?
1. Storage Media Section
*Include points on quality, longevity and interface to storage media via: Henry Newman: Great point on the quality, longevity and interface to the media. The levels refer to a “storage system” and talk about getting content off heterogeneous media. I could see value in adding that the level one requirement explain “storage system” as a nearline or online storage system (Spinning disk or Magnetic Tape).
2. Information Security Section
* Move the maintain logs of who performed what actions on files down to level three and add “attempts” (to access, delete) to that point via Gail.
* Add “Avoiding any one person having write access to all copies” into level four via Paul Wheatley.
3. File Formats Section
*Add “When possible” to the beginning of “encourage use of limited set of known and open file formats and codecs.” The idea here was that “encourage” should be really light weight and this might help clarify that. Regarding Paul’s question on this, we put this point in to deal with the wide variety of contexts in which partner organizations are doing digital preservation. For example, a lot of the digital material that folks are working to preserve is actually material that is being digitized (images, recorded sound and moving image). In which case you have a lot of control you can exercise in deciding on file formats. At the same time, there are born digital projects that might decide upfront that the significant properties for a given collection will allow them to batch process or otherwise extract the information that they care about, do some QA, and leave them with something that is just far easier to ensure access to in the future.
* Drop “Validate files against their file formats” from the file formats section. I think Paul makes a good case that it isn’t particularly useful here. We could change this to something like “generate file format characterization metadata to inform monitoring file format obsolescence issues”.
* Per Paul’s suggestion, add “when necessary” to the end of the migration/emulation statement. I don’t think there was any desire in the group that folks would migrate or emulate everything.
Additional points:
* Add a glossary of terms to make the jargon more accessible. Re Stephen Chapman
* Re: Paul’s point on seeing the chart priority driven instead of functionality driven and the headings. That is how this was originally designed and the functionality labels were appended when it looked like that is how things were shaking themselves out.
Good summary Trevor. I can stand behind most all of these suggestions.
The only one I’m hesitant about is the removal of validation from the file format section. While I agree with Paul that determining that a file is valid according to its format specification doesn’t mean it’s without preservation risks, it still does seem like very useful information to me that could be combined with other information about the content to assess preservation risks.
These suggestions look good, thanks for the great summarizing, Trevor! As Andrea said previously, we are definitely interested in expanding out a glossary and providing more information behind each part of the chart, as we move this along, to help educate and guide folks interested in using the levels.
This draft has been circulated to a USGS group attempting to establish policy regarding best practices for preserving our science data. The group agreed that this draft will be very useful. We will be adding an additional column called level 0, Baseline, Current Status, allowing our science centers to document where they are as well as a roadmap for progress.
Well done!
Andrea: for Level 3, my comment about looking at more than access – I suggest that any action against a file/object, such as removal, rename, or edit (i.e. if a file has been opened, changed, and re-saved), is an event to monitor and capture in logs.
This generated some great comments! One of the things I think this chart should do is document where there is a professional consensus about useful preservation actions for the sake of guiding institutions that are just getting started. Where there is no consensus, I would prefer for the chart to be silent rather than take sides in a debate that’s ongoing.
If a responsible preservation repository might choose to skip a step (like format validation), maybe we should leave it off and let more comprehensive documents than this cover the pros and cons.
For the possible actions to address format obsolescence, adding the language Paul suggested (and Trevor accepted) should solve the problem. Does everyone agree that monitoring for format obsolescence is generally a good step to take? Then, if your monitoring tells you that you have a problem, you would probably do *something* but you might not do anything prospectively – there are valid differences of opinion about that.
I have a couple of suggestions of my own, too (now late in the game – sorry!). Levels 3 and 4 on Information Security seem to be in the opposite order from what I’d want to do, but perhaps I’m misunderstanding them. (I’d want to detect and log actions that could change records before I’d want to track any access of the records.)
The highest level of “Storage and Geographic Location” recommends “all copies” in geographic locations with different disaster threats. That seems excessive. What if I’m actually maintaining 7 copies, including backups of various kinds? Every single copy must be in a different geographic region, and if I happen to create an 8th copy at some point, I have to send it to Antarctica? My guess is that the actual recommendation is more like “At least 3 copies in 3 geographic locations.” Again, what is the professional consensus about how many is enough?
BTW, thanks to Trevor for the great summary!
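The difference between “all copies in different locations” and “at least 3 copies in 3 geographic locations” is easy to state as a check. The sketch below encodes only the second reading, which is a guess at the intended rule rather than anything the chart states; the `Copy` type, the region labels, and the default thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    region: str  # hypothetical label for an area sharing the same disaster threats


def meets_copy_rule(copies, min_copies=3, min_regions=3):
    """True if there are enough copies spread over enough distinct regions.

    Extra copies in an already-used region do not count against the rule.
    """
    regions = {c.region for c in copies}
    return len(copies) >= min_copies and len(regions) >= min_regions


# Seven copies spread over three regions satisfy this reading of the rule.
holdings = [
    Copy("primary", "east"),
    Copy("nightly-backup", "east"),
    Copy("weekly-backup", "east"),
    Copy("replica-1", "midwest"),
    Copy("replica-2", "midwest"),
    Copy("tape-1", "west"),
    Copy("tape-2", "west"),
]
print(meets_copy_rule(holdings))
```

Under this reading an 8th copy can live anywhere, since the rule only sets a floor on the number of distinct regions.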
Trevor, were you planning on updating this post with the revised table, or publishing it somewhere else online? It might be nice to have it in a wiki somewhere, with version history, etc.
Hi Ed, I think the current plan is to roll in all the changes discussed above and then go present it to a few targeted groups that we haven’t heard much from. (Specifically folks at some smaller cultural heritage organizations.) Then with that input to firm it up and put out a “Version 1.” We would then do a blog post when we put out that version. With that said, everyone should feel free to take this and tweak it or change it on their own and link back to that here. NDSA doesn’t have a public wiki, so we don’t necessarily have a good wiki place to work on this, but if anyone wants to work on this, or make their own version they should feel free to.
Why not post on this website an actual HTML table instead of an image of a table? That would improve readability and accessibility. (Remember those Federal accessibility standards!)
I’ve posted an updated copy of the NDSA levels of digital preservation on the blog. So please move any continued discussion over to there. http://blogs.loc.gov/digitalpreservation/2012/11/ndsa-levels-of-digital-preservation-release-candidate-one/
Hi. I’m only finding this draft now, but would like to offer a few comments.
* [Applause!] Developing a common set of preservation service levels is terrifically helpful. Preservation services providers need to be honest and explicit about the sorts of risks they can mitigate. In other departments within our institutions, putting a little spin on how we characterize our accomplishments and commitments is acceptable. Preservation is different, and needs to be careful to never over sell its capabilities.
* Not only does this Levels of Digital Preservation help to communicate honestly and explicitly with our clients and constituents, it also helps drive home the point that if data is made in certain ways (encoded in preferred file formats), organized in certain ways, and packaged with specific bits of information (discovery, technical, and administrative metadata), preservation organizations can offer more services and higher levels of protection.
* Suggestion: I would like to see a preservation level that includes content rendering services. Many repositories are coupled with content delivery services: page-turners, image viewers, audio streaming services, video streaming services. While it isn’t necessary, or desirable, to make content available through a single delivery system, institutions that commit to providing a core set of delivery systems (content rendering applications) greatly enhance the preservation of certain kinds of digital resources.
A user might prefer accessing a collection through her preferred image viewer or page turner that calls up content through an API to the digital repository, but on the dark day when that user’s application fails, disappears, or can no longer render a format held by the repository, the organization’s core delivery systems can ensure that preserved content can be rendered and made useful.
I’m not suggesting that organizations will be able to provide delivery systems for all of the content types they collect and store, but the existence of or absence of an integrated delivery system could be a factor that distinguishes between preservation service levels.
Thanks,
Bill