In software development a release candidate is a beta version with the potential to be the final product. Welcome to the release candidate for the NDSA Levels of Digital Preservation. After some fantastic commentary on the blog, and presentations at a series of conferences to solicit feedback, I’m excited to share this revised version of the levels for further commentary and discussion. Based on all the feedback we received, the small NDSA action team have worked up these tweaks and revisions.
I’ve provided some of the context for this after the grid. For more background on the project, I would suggest reading over the original post on the levels of digital preservation project.
NDSA Levels of Digital Preservation
(Protect Your Data)
(Know Your Data)
(Monitor Your Data)
(Repair Your Data)
|Storage and Geographic Location||
|File Fixity and Data Integrity||
NDSA Levels of Digital Preservation Goals
The goal of this document is to provide a basic tool for helping organizations manage and mitigate digital preservation risks. This document does not deal with broader issues related to collection development practices, critical policy framework decisions, general issues involving staffing or particular workflows or life cycle issues. Those are all critical, and in many cases are handled quite well by existing work (like the OAIS model, and the TRAC and TDR standards).
- This is useful for developing plans, not a plan in itself: This is not a digital preservation cookbook. What we detail here is necessary but not sufficient for ensuring digital preservation.
- These levels are non-judgmental: Organizations have different resources and priorities, and as a result need to think about how to best allocate those resources to meet their specific needs.
- These levels can be applied to collection(s) or system(s): These levels function coherently with everything from individual, case-by-case, collection-level decisions to issues for an entire centralized repository.
- This is designed to be content and system agnostic: This is only about generic issues. Specific kinds of content (e.g., documents, audio interviews, video, etc.) are likely to have their own nuances, but these levels and factors are generic enough that they are intended to apply to any digital preservation situation.
Each level begins to address a new area. Level 1 addresses the most likely risks in the short term. The later levels progressively address the mitigation of risks over the long term.
There are both very basic digital preservation resources, like NDIIPP’s personal archiving materials, and extensive and substantial requirements for building and managing a digital repository. However, the working group felt there was a lack of solid guidance on how an organization should prioritize its resource allocation between these two ends of the spectrum. The goal of this project has been to develop a tiered set of recommendations for prioritizing enhancements to digital preservation systems (defined broadly to include organizational and technical infrastructure). This means defining targets for at least three levels of criteria for digital preservation systems: the bottom level provides guidance to “get the boxes off the floor,” and each escalating level offers prioritized suggestions for how organizations can get the most out of their resources.
Your Chance for Additional Input:
It’s our intention to leave this up here for a bit as a release candidate, get some more feedback on how it is all hanging together, and in a little while come back, make any final tweaks, and then call this version one. So, right now is your chance to give another round of input.
What Would you Link to in the Cells?
As a next step in this project, our group has discussed adding links out to relevant web-accessible materials on the topics and terms in each of these cells. We would be thrilled to hear any suggestions for material to link to, to help people act on the activities in each of the cells.
I like the way this is shaping up. One (perhaps contentious) question: are the lines between the various flavors of metadata well defined?
When I think about them, they kind of blur together in an unsatisfying way. But perhaps others in the community are clear on the distinctions? One downside to this grid presentation is that since Metadata is a row, you need to fill in the boxes, or else it will look funky.
In a way, I kind of see metadata as being a cross cutting concern that supports the other activities. But I also can see how NDSA members could balk at the idea of not having a special row for their precious metadata.
This is looking great. I don’t think I can criticise much, given that I think almost all of my previous feedback (and that which I was party to at CurateCAMP) has been incorporated!
I’d like to see at least some of the links going to locations that can be amended and improved by the crowd. Links to Stack Exchange questions or appropriate wiki pages would be great. Most of the entries could be posed as questions and answered pretty well on Stack, I think.
I’ll certainly be linking to this from web resources I’m putting together.
I agree with Ed, but I assume this will be rectified when links are added?
At iPres 2012 Toronto, we discussed this document during one of the breakout sessions. In summary, the group saw a need for a “Level Zero”, one identifying “something you can point at that you suspect you are responsible for.” Content identified at this level may not even necessarily be preserved; the point is that an organisation needs to identify that it has content that needs to be evaluated for preservation.
Also, the group felt that the level captions (such as “Know Your Data”) are not very useful and recommends removing them from the chart. Or, if they are essential, defining them better (in which case Level Zero might be something like “Claim Your Data”?)
Finally, contrary to the general, perceived pattern of each successive level requiring more resources than the previous one, the group felt that this was not the case in the “File Formats” functional area, but that a case could be made for making file format requirements more substantial/robust as preservation levels increase.
We worked within a Google spreadsheet and made edits in the course of our discussion, including populating the Level Zero column and adding a Rights functional area. To see our work, follow this link: https://docs.google.com/spreadsheet/ccc?key=0Aow9_JA5GUnYdEZvZkRpSFVGU21nSXJKTk1ITktMVUE
Best regards on making this document something useful for the community. We look forward to watching its development over time.
Oops, in previous comment, I meant to say one of our “CurateCamp” breakout sessions! Sorry all, Courtney
This looks really good. It’s the kind of thing we need at the NPS to be able to demonstrate to practitioners and managers that we aren’t just “making things up” when it comes to preservation recommendations. I’m glad to be part of the NDSA when straightforward practical guidance like this is produced!
The content is good, but I have doubts about the subtitles. Everyone is concerned with protecting, knowing, and when necessary repairing data. In some writing I’ve been working on I defined levels of backup and archiving strategy by organizational level and importance, using the terms “personal data collections,” “small businesses and organizations,” “larger businesses and organizations,” and “critical data collections.” That doesn’t sound as snappy, but something of that flavor might better convey the intent.
@Ed: I think we should probably work up some text that explains what we mean by each of those kinds of metadata. Early in the group’s discussions we decided that we wanted to make metadata a row to try to disaggregate and prioritize the different types of metadata. As an aside, I once heard a co-worker refer to something as being part of “the metadata problem,” to which I responded, “Metadata is a genre of problems in our world, not A problem; it’s like 30% of our problems.” So, in general, I see the goal of the metadata row as articulating what you absolutely need to be keeping track of, and moving up toward some of the more exhaustive, structured ways of thinking about that metadata as you get further along the row. I like that because it helps to solve what I’ve seen as a problem: some small orgs, which might have a bunch of content they care about sitting on one staff member’s desktop machine, get interested in digital preservation and start by learning about PREMIS.
@Gary: I similarly have doubts about the subtitles. We started with this just being levels 1-4 with no titles, but several folks in the group thought it would be helpful to have them, and in discussion it seemed that when we prioritized our thoughts about what one should do first, they ended up largely falling into these kinds of buckets. Of course, you are right, at least the first 3 are things that everyone is doing to some extent. With that said, there ends up being a shift in emphasis from the former to the latter as you move through the chart (largely because the former gets prioritized, as the latter is only possible when the former is already largely taken care of). So, as I see it, there is value in thinking of the four column titles not as rigid categories of activity, but as areas of emphasis that take focus at the different levels.
Today’s posting of the new NDSA Glossary prompted me to return to this post and review where things are up to. What is the status of the levels document with regard to the CurateCAMP work that Courtney referred to above? Thanks
Thanks for the question Paul. The levels group has a paper in on this for the Archiving conference. It does a good bit in explicating the levels and gives examples of how members have started using them. So once that goes up we will distribute it more broadly and continue this discussion.
Thanks for responding. So are you saying that (for example) the Level 0 suggestion wasn’t incorporated? The long and unclear feedback loop between contributions and new versions is not going to help to encourage more community interactions. This is a shame, as there has been a lot of enthusiasm and interest in this excellent doc (as I witnessed first hand at the iPRES CurateCAMP).
I’ve posted a more detailed comment on the approach on Butch’s glossary blogpost.
Has the paper on using the levels been published? It’s not clear what “standard” metadata is in Level 3 and 4? And what technical and preservation metadata is not present in Level 2 administrative metadata, but present in Level 3 and 4?
Overall, great work. Thanks.
We’re modifying this document to meet USGS needs, changing the jargon to something we might recognize (e.g. check fixity -> verify checksum), and adding USGS-specific stuff.
Suggested addition for Data Integrity, Level 4 (using USGS wording):
• Create, store, and verify a second, different checksum for all content.
Technically, this isn’t hard to do. If you have a SHA-1 checksum from ingest, compute, store, and verify a SHA-256 checksum. Helps address this problem:
Newly computed checksum and stored checksum don’t match. Which is corrupted: the file, or the stored checksum?
If *both* stored checksums don’t match, chances are very high that the file has changed unexpectedly. If only one stored checksum doesn’t match, chances are very high that the stored checksum was changed.
We’ve implemented this scheme on a couple of internal systems, didn’t take much work.
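For readers who want to try the two-checksum scheme described above, here is a minimal Python sketch. It is not the USGS implementation; the function names `compute_digests` and `diagnose` and the returned labels are hypothetical, and it assumes the SHA-1 and SHA-256 values are stored somewhere separate from the file itself.

```python
import hashlib


def compute_digests(path, chunk_size=1 << 20):
    """Return (sha1_hex, sha256_hex) for a file, reading it only once."""
    sha1 = hashlib.sha1()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha1.update(chunk)
            sha256.update(chunk)
    return sha1.hexdigest(), sha256.hexdigest()


def diagnose(path, stored_sha1, stored_sha256):
    """Compare fresh digests with both stored digests (hypothetical helper)."""
    fresh_sha1, fresh_sha256 = compute_digests(path)
    sha1_ok = fresh_sha1 == stored_sha1
    sha256_ok = fresh_sha256 == stored_sha256
    if sha1_ok and sha256_ok:
        return "ok"
    if not sha1_ok and not sha256_ok:
        # Both stored values disagree: the file itself most likely changed.
        return "file probably changed"
    # Only one stored value disagrees: that stored checksum is the suspect.
    return "stored checksum probably changed"
```

The result is a likelihood, not a proof, so a mismatch of either kind probably still deserves a human look before anything is overwritten.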
We also capture, store, and check the operating system’s file modification time, as an additional check on file integrity, but that might be overkill.
Also, there are some tricky time-related traps in automatically maintaining several data copies. For example: Every day, your system copies data from the primary server to secondary servers, and you verify checksums. Checksum verification fails on the primary server, so now you want a copy from a secondary server. You need to (a) recognize that you have a problem to fix and (b) get the copies from the secondary servers, *before* the next scheduled copy. Otherwise, you are copying garbage to the secondary servers. There are several solutions to this problem, but the first step is recognizing the problem. I’ve experienced this exact problem for decades with traditional backup tape schemes. Don’t know how to capture that issue, or a recommendation, into a pithy cell-length guideline.
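One possible way to avoid the trap described above is to verify the primary copy before every scheduled copy, and to restore from a verified secondary instead of replicating when verification fails. The sketch below is only an illustration under simplifying assumptions: a single file, local paths standing in for servers, and a trusted stored SHA-256 value. The names `sha256_of` and `sync_one` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_one(primary, secondaries, stored_sha256):
    """Run one scheduled copy of one file without spreading corruption."""
    primary = Path(primary)
    secondaries = [Path(s) for s in secondaries]

    # Only replicate outward when the primary copy verifies.
    if primary.exists() and sha256_of(primary) == stored_sha256:
        for target in secondaries:
            shutil.copy2(primary, target)
        return "replicated"

    # Primary failed verification: skip the outward copy and look for
    # a secondary that still matches the stored checksum.
    for candidate in secondaries:
        if candidate.exists() and sha256_of(candidate) == stored_sha256:
            shutil.copy2(candidate, primary)
            return "primary restored from " + candidate.name

    # Nothing verifies; leave every copy untouched for manual review.
    return "no good copy found"
```

The design choice here is simply ordering: verification gates the copy, so a failed check can never be followed by an automatic overwrite of the good secondaries.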
Under Storage & Geographic Location, Level 2:
• Document your storage system(s) and storage media and what you need to use them
We could not understand the phrase “and what you need to use them”, so we dropped it from the USGS document.
Can someone explain, or point to an explanation?