NDSA Levels of Digital Preservation: Release Candidate One

November 20, 2012

In software development a release candidate is a beta version with the potential to be the final product. Welcome to the release candidate for the NDSA Levels of Digital Preservation. After some fantastic commentary on the blog, and presentations at a series of conferences to solicit feedback, I’m excited to share this revised version of the levels for further commentary and discussion. Based on all the feedback we received, the small NDSA action team have worked up these tweaks and revisions.

I’ve provided some of the context for this after the grid. For more background on the project, I would suggest reading over the original post on the levels of digital preservation project.

NDSA Levels of Digital Preservation

	Level One (Protect Your Data)	Level Two (Know Your Data)	Level Three (Monitor Your Data)	Level Four (Repair Your Data)
Storage and Geographic Location	• Two complete copies that are not collocated • For data on heterogeneous media (optical disks, hard drives, etc.) get the content off the medium and into your storage system	• At least three complete copies • At least one copy in a different geographic location • Document your storage system(s) and storage media and what you need to use them	• At least one copy in a geographic location with a different disaster threat • Obsolescence monitoring process for your storage system(s) and media	• At least 3 copies in geographic locations with different disaster threats • Have a comprehensive plan in place that will keep files and metadata on currently accessible media or systems
File Fixity and Data Integrity	• Check file fixity on ingest if it has been provided with the content • Create fixity info if it wasn’t provided with the content	• Check fixity on all ingests • Use write-blockers when working with original media • Virus-check high risk content	• Check fixity of content at fixed intervals • Maintain logs of fixity info; supply audit on demand • Ability to detect corrupt data • Virus-check all content	• Check fixity of all content in response to specific events or activities • Ability to replace/repair corrupted data • Ensure no one person has write access to all copies
Information Security	• Identify who has read, write, move, and delete authorization to individual files • Restrict who has those authorizations to individual files	• Document access restrictions for content	• Maintain logs of who performed what actions on files, including deletions and preservation actions	• Perform audit of logs
Metadata	• Inventory of content and its storage location • Ensure backup and non-collocation of inventory	• Store administrative metadata • Store transformative metadata and log events	• Store standard technical and descriptive metadata	• Store standard preservation metadata
File Formats	• When you can give input into the creation of digital files encourage use of a limited set of known open file formats and codecs	• Inventory of file formats in use	• Monitor file format obsolescence issues	• Perform format migrations, emulation and similar activities as needed
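
As an illustration of the Level One fixity cell, here is a minimal Python sketch of checking fixity on ingest when a checksum came with the content, and creating one when it did not. The function names, the choice of SHA-256, and the layout of the returned record are assumptions made for this example; the levels themselves do not prescribe a tool or an algorithm.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest_fixity(path: Path, supplied: Optional[str] = None) -> dict:
    """Check a supplied checksum on ingest, or create one if none was provided."""
    computed = sha256_of(path)
    if supplied is None:
        status = "created"    # no fixity info came with the content
    elif supplied.strip().lower() == computed:
        status = "verified"   # supplied checksum matches the file
    else:
        status = "mismatch"   # file or supplied checksum is wrong
    return {"file": str(path), "sha256": computed, "status": status}
```

The returned record could be appended to whatever inventory or log an organization already keeps, which is also where the later "maintain logs of fixity info" cell would build on it.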

NDSA Levels of Digital Preservation Goals

The goal of this document is to provide a basic tool for helping organizations manage and mitigate digital preservation risks. This document does not deal with broader issues related to collection development practices, critical policy framework decisions, general issues involving staffing or particular workflows or life cycle issues. Those are all critical, and in many cases are handled quite well by existing work (like the OAIS model, and the TRAC and TDR standards).

This is useful for developing plans, not a plan in itself: This is not a digital preservation cookbook. What we detail here is necessary but not sufficient for ensuring digital preservation.
These levels are non-judgmental: Organizations have different resources and priorities, and as a result need to think about how to best allocate those resources to meet their specific needs.
These levels can be applied to collection(s) or system(s): These levels function coherently with everything from individual, case-by-case, collection-level decisions to issues for an entire centralized repository.
This is designed to be content and system agnostic: This is only about generic issues. Specific kinds of content (e.g., documents, audio interviews, video, etc.) are likely to have their own nuances, but these levels and factors are generic enough that they are intended to apply to any digital preservation situation.

Each level begins to address a new area. Level 1 addresses the most likely risks in the short term. As you progress through the levels, they address mitigation of risks over the long term.

Project Background:

There is both very basic digital preservation information, like NDIIPP’s personal archiving materials, and extensive and substantial requirements for building and managing a digital repository. However, the working group felt there was a lack of solid guidance on how an organization should prioritize its resource allocation between these two ends of the spectrum. The goal of this project has been to develop a tiered set of recommendations for prioritizing enhancements to digital preservation systems (defined broadly to include organizational and technical infrastructure). The project is defining targets for at least three levels of criteria for digital preservation systems: the bottom level provides guidance to “get the boxes off the floor,” and each escalating level offers prioritized suggestions for how organizations can get the most out of their resources.

Your Chance for Additional Input:

It’s our intention to leave this up here for a bit as a release candidate, get some more feedback on how it is all hanging together, and in a little while come back, make any final tweaks, and then call this version one. So, right now is your chance to give another round of input.

What Would You Link to in the Cells?

As a next step in this project, our group had discussed adding links out to relevant web-accessible materials on the topics and terms in each of these cells. We would be thrilled to hear any suggestions for material to link to that would help people act on the activities in each of the cells.

Comments (16)

Ed Summers says:
November 20, 2012 at 11:01 am

I like the way this is shaping up. One (perhaps contentious) question: are the lines between the various flavors of metadata well defined?

I see:

administrative metadata
transformative metadata
technical metadata
descriptive metadata
preservation metadata

When I think about them, they kind of blur together in an unsatisfying way. But perhaps others in the community are clear on the distinctions? One downside to this grid presentation is that since Metadata is a row, you need to fill in the boxes, or else it will look funky.

In a way, I kind of see metadata as being a cross cutting concern that supports the other activities. But I also can see how NDSA members could balk at the idea of not having a special row for their precious metadata.
Paul Wheatley says:
November 20, 2012 at 11:24 am

This is looking great. I don’t think I can criticise much, given that I think almost all of my previous feedback (and that which I was party to at CurateCAMP) has been incorporated!

I’d like to see at least some of the links going to locations that can be amended and improved by the crowd. Links to Stack Exchange questions or appropriate wiki pages would be great. Most of the entries could be posed as questions and answered pretty well on Stack, I think.

I’ll certainly be linking to this from web resources I’m putting together.
Paul Wheatley says:
November 20, 2012 at 11:26 am

I agree with Ed, but I assume this will be rectified when links are added?
Courtney Mumma says:
November 20, 2012 at 6:11 pm

At iPres 2012 Toronto, we discussed this document during one of the breakout sessions. In summary, the group saw a need for a “Level Zero”, one identifying “something you can point at that you suspect you are responsible for.” Content identified at this level may not even necessarily be preserved; the point is that an organisation needs to identify that it has content that needs to be evaluated for preservation.

Also, the group felt that the level captions (such as “Know Your Data”) are not very useful and recommends removing them from the chart. Or, if they are essential, defining them better (in which case Level Zero might be something like “Claim Your Data”?)

Finally, contrary to the general, perceived pattern of each successive level requiring more resources than the previous one, the group felt that this was not the case in the “File Formats” functional area, but that a case could be made for making file format requirements more substantial/robust as preservation levels increase.

We worked within a google spreadsheet and made edits in the course of our discussion, including populating the Level Zero column and adding a Rights functional area. To see our work, follow this link: https://docs.google.com/spreadsheet/ccc?key=0Aow9_JA5GUnYdEZvZkRpSFVGU21nSXJKTk1ITktMVUE

Best regards on making this document something useful for the community. We look forward to watching its development over time.

~Courtney
Courtney Mumma says:
November 20, 2012 at 7:04 pm

Oops, in previous comment, I meant to say one of our “CurateCamp” breakout sessions! Sorry all, Courtney
Nelson Valente says:
November 20, 2012 at 8:36 pm

I like the way this is shaping up.
Nelson Valente says:
November 20, 2012 at 8:37 pm

I like the way this is shaping up.

Nelson Valente
Chris Dietrich (US National Park Service) says:
November 21, 2012 at 12:53 pm

This looks really good. It’s the kind of thing we need at the NPS to be able to demonstrate to practitioners and managers that we aren’t just “making things up” when it comes to preservation recommendations. I’m glad to be part of the NDSA when straightforward practical guidance like this is produced!
Gary McGath says:
November 22, 2012 at 6:33 am

The content is good, but I have doubts about the subtitles. Everyone is concerned with protecting, knowing, and when necessary repairing data. In some writing I’ve been working on I defined levels of backup and archiving strategy by organizational level and importance, using the terms “personal data collections,” “small businesses and organizations,” “larger businesses and organizations,” and “critical data collections.” That doesn’t sound as snappy, but something of that flavor might better convey the intent.
Trevor Owens says:
November 27, 2012 at 9:38 am

@Ed: I think we should probably work up some text that explains what we mean by each of those kinds of metadata. Early in the group’s discussions we decided that we wanted to make metadata a row to try to disaggregate and prioritize different types of metadata. As an aside, I once heard a co-worker refer to something as being part of “the metadata problem,” to which I responded, “Metadata is a genre of problems in our world, not A problem; it’s like 30% of our problems.” So, in general, I see the goal of the metadata row as articulating what you absolutely need to be keeping track of and moving up toward some of the more exhaustive structured ways of thinking about that metadata as you get further along the row. I like that because it helps to solve what I’ve seen as a problem, where some small orgs that might have a bunch of content they care about on one staff member’s desktop machine get interested in digital preservation and start by learning about PREMIS.

@Gary: I similarly have doubts about the subtitles. We started with this just being levels 1-4 with no titles, but several folks in the group thought it would be helpful to have them, and in discussion it seemed that when we prioritized our thoughts about what one should do first, they ended up largely falling into these kinds of buckets. Of course, you are right, at least the first 3 are things that everyone is doing to some extent. With that said, there ends up being a shift in emphasis from the former to the latter as you move through the chart (largely because the former gets prioritized, as the latter is only possible when the former is already largely taken care of). So, as I see it, there is value in thinking of the four column titles not as rigid categories of activity, but as areas of emphasis that take focus at the different levels.
Paul Wheatley says:
February 12, 2013 at 8:37 am

Hi Trevor,

Today’s posting of the new NDSA Glossary prompted me to return to this post and review where things are up to. What is the status of the levels document with regard to the CurateCAMP work that Courtney referred to above? Thanks
Trevor says:
February 12, 2013 at 9:46 am

Thanks for the question Paul. The levels group has a paper in on this for the Archiving conference. It does a good bit in explicating the levels and gives examples of how members have started using them. So once that goes up we will distribute it more broadly and continue this discussion.
Paul Wheatley says:
February 12, 2013 at 10:44 am

Hi Trevor,

Thanks for responding. So are you saying that (for example) the Level 0 suggestion wasn’t incorporated? The long and unclear feedback loop between contributions and new versions is not going to help to encourage more community interactions. This is a shame, as there has been a lot of enthusiasm and interest in this excellent doc (as I witnessed first hand at the iPRES CurateCAMP).

I’ve posted a more detailed comment on the approach on Butch’s glossary blogpost.

Cheers

Paul
Nick Krabbenhoeft says:
March 21, 2013 at 3:41 pm

Has the paper on using the levels been published? It’s not clear what “standard” metadata is in Levels 3 and 4. And what technical and preservation metadata is not present in Level 2 administrative metadata, but present in Levels 3 and 4?
Rex Sanders, USGS says:
May 1, 2013 at 5:46 pm

Overall, great work. Thanks.

We’re modifying this document to meet USGS needs, changing the jargon to something we might recognize (e.g. check fixity -> verify checksum), and adding USGS-specific stuff.

Suggested addition for Data Integrity, Level 4 (using USGS wording):

• Create, store, and verify a second, different checksum for all content.

Technically, this isn’t hard to do. If you have a SHA-1 checksum from ingest, compute, store, and verify a SHA-256 checksum. Helps address this problem:

Newly computed checksum and stored checksum don’t match. Which is corrupted: the file, or the stored checksum?

If *both* stored checksums don’t match, chances are very high that the file has changed unexpectedly. If only one stored checksum doesn’t match, chances are very high that the stored checksum was changed.

We’ve implemented this scheme on a couple of internal systems, didn’t take much work.
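
A minimal Python sketch of the two-checksum idea described above, assuming the content fits in memory and that the SHA-1 and SHA-256 digests were stored as lowercase hex strings. The function name `diagnose` and its return strings are hypothetical, chosen for this example; this is not the USGS implementation.

```python
import hashlib


def diagnose(data: bytes, stored_sha1: str, stored_sha256: str) -> str:
    """Compare freshly computed SHA-1 and SHA-256 digests with the stored values."""
    sha1_ok = hashlib.sha1(data).hexdigest() == stored_sha1
    sha256_ok = hashlib.sha256(data).hexdigest() == stored_sha256
    if sha1_ok and sha256_ok:
        return "ok"
    if not sha1_ok and not sha256_ok:
        # Both independent checksums disagree: the file itself most likely changed.
        return "file likely changed"
    # Only one disagrees: the stored checksum that failed is the likelier culprit.
    return "stored checksum likely changed"
```

For large files the digests would need to be computed by streaming the file in chunks rather than loading it whole.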

We also capture, store, and check the operating system’s file modification time, as an additional check on file integrity, but that might be overkill.

Also, there are some tricky time-related traps in automatically maintaining several data copies. For example: Every day, your system copies data from the primary server to secondary servers, and you verify checksums. Checksum verification fails on the primary server, so now you want a copy from a secondary server. You need to (a) recognize that you have a problem to fix and (b) get the copies from the secondary servers, *before* the next scheduled copy. Otherwise, you are copying garbage to the secondary servers. There are several solutions to this problem, but the first step is recognizing the problem. I’ve experienced this exact problem for decades with traditional backup tape schemes. Don’t know how to capture that issue, or a recommendation, into a pithy cell-length guideline.
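
One possible guard against this trap, sketched in Python: verify the primary against its stored checksum before each scheduled copy, and skip the copy when verification fails so that the secondaries stay intact for repair. The function names and the one-file-per-copy layout are assumptions for illustration; this is only one of the several possible solutions, not a recommendation from the levels document.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Sequence


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a (small) file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def replicate_if_intact(primary: Path, secondaries: Sequence[Path],
                        expected_sha256: str) -> bool:
    """Copy the primary to each secondary only if it still matches its stored checksum."""
    if sha256_of(primary) != expected_sha256:
        # Do not propagate a corrupted file; leave the secondaries untouched
        # so that one of them can be used to repair the primary.
        return False
    for target in secondaries:
        shutil.copy2(primary, target)
    return True
```

A `False` return is the signal that someone, or some other process, needs to restore the primary from a secondary before the next scheduled run.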
Rex Sanders, USGS says:
May 1, 2013 at 5:57 pm

Under Storage & Geographic Location, Level 2:

• Document your storage system(s) and storage media and what you need to use them

We could not understand the phrase “and what you need to use them”, so we dropped it from the USGS document.

Can someone explain, or point to an explanation?
