The following is a guest post from Megan Phillips, NARA’s Electronic Records Lifecycle Coordinator and an elected member of the NDSA coordinating committee, and Andrea Goethals, Harvard Library’s Manager of Digital Preservation and Repository Services and co-chair of the NDSA Standards and Practices Working Group.
As part of the effort to publicize the NDSA Levels of Digital Preservation and as a way to continue to invite community comment on it, several members of the Levels group wrote a paper about it for the IS&T Archiving 2013 conference. The paper, The NDSA Levels of Digital Preservation: Explanation and Uses, is available online.
At the conference, we got interesting comments and one significant suggestion to improve the paper from Christoph Becker, Senior Scientist at the Department of Software Technology and Interactive Systems, Vienna University of Technology. We wanted to present the suggestion he made here and ask for help from all of you to resolve it.
Christoph wrote that the major aspect of the levels that he would adjust is the label for the last function, “file formats.” You can see the table here. He pointed out that file formats are just one aspect of a larger preservation challenge related to how data (the bitstream) and computation (the software) collaborate in creating the “performances” that we really care about. New content is often not even file based. Format is just one element out of many that could be significant in preservation, and in some cases the format itself is almost meaningless. Often the real issues are related to specific features or feature sets (e.g. encryption), invalidities and sizes. (Petar Petrov tried to include part of this problem into his blog post about content profiling.) If you consider research data, for example, the format could be known to be XML-based but have no schema available. The real preservation challenge might be that the data requires a certain analysis module (found here) running on a certain platform, which is dependent on distributed resources — a certain metadata schema (found there), and certain understanding of semantics (found over here).
Christoph’s suggestion is that the overly specific label “file formats” in the levels puts forward too narrow a view of the problem in question. The label could skew the real challenge since it excludes part of the problem (and part of the potential community). He suggested possible replacements for the “file formats” label. “Diagnosis and action”? “Issue detection and preservation actions”? “Understandability”? For him, in fact, this is the heart of preservation, and if you look at the SHAMAN/SCAPE capability model that Christoph works on, the preservation capability really is all about the last two rows (operations include metadata), assuming that the bitstream is securely stored and managed.
We (Andrea and Meg) think that Christoph has a valid point, but we’re still not sure of the best label to capture the suite of interrelated issues that need to be addressed in the last row of the Levels chart. Christoph’s suggestions make sense in isolation, but they would overlap with activities in other rows of the chart, and don’t quite convey the concept we originally intended.
- Do you think “file formats” is clear enough as shorthand for these kinds of issues, given where most of us are in our practical digital preservation efforts, or does this need to be changed?
- What label would you use for the last row of the chart? (Content characteristics? Usability? Just plain formats, without “file”?)
- Are there other changes you think we should make to improve that row?
- Any changes you’d recommend to other parts of the chart?
In the Archiving 2013 paper, we said that any comments received by August 31, 2013 would influence the next version of the Levels of Digital Preservation, so please suggest improvements! We may come back to you again over the summer to help resolve other issues.
Comments (12)
Hi Andrea and Meg,
I agree with you that Christoph Becker has a very good point. Thank you for the opportunity to comment publicly.
My take on it is that whether it is migration, emulation or any other chosen *representation* method of the record, it is the *continuity* of the information content, form and/or context that is achieved or expected. So, the different techniques appear more like a few means to an end.
Hence, Digital Continuity appears appropriate, in my humble opinion.
The logical background to my suggestion is to be found in the concepts of *Disaster preparedness, readiness or recovery* and *Business continuity*, which is the outcome projected by the use of any and all of the prevention or cure activities. Also, as another justification for the suggestion (I wouldn’t say that the idea came to me ex nihilo), due credit should be paid to the Archives of New Zealand in their advertisement of a position in Digital Continuity, which, to my amazement, sums up what Digital Archivists are all about in general, and in particular represents very well, I hope, the last column about perennisation of the digital object and content.
In sum, Digital Continuity or Digital Permanence are possible options. Digital Continuity wins my vote because it implies a continuum of actions to make the records accessible in the future, which is exactly what it is: a constant struggle against inaccessibility.
Christoph raises a good point. Understandability or Usability seems to be more on point, especially when we get to the last row/last column where there might be the need for external software or migration actions/format transformations in order to make the data stored meaningful (and understandable/usable). I always think of digital preservation encompassing the managed activities necessary for ensuring the long-term retention and usability of data. But let’s hear from others too – good issue to raise!
Tibaut’s comment came in at the same time – Digital Continuity is rather good!
While Christoph Becker has valid concerns, I would argue that the label is OK the way it is. Or, the label could be generalized to simply “Formats”.
I see the Levels of Digital Preservation table as a valuable practical guide for a broad audience that includes nascent digital preservationists, and people from other professional disciplines who manage digital assets as a collateral duty. “Smartening up” the label risks alienating those who are new to digital preservation or are trying to simply understand what actions they can take to preserve digital content.
Perhaps Christoph’s well-founded concerns could be addressed in another table row, or in the narrative, of an evolved document.
Thanks for the opportunity to comment!
Gail is right about usability, which is the purpose of digital continuity, as I, too, pointed out in the last sentence of my first comment. We do agree Gail!
Since the means used to achieve the objective of accessibility now and later could differ, let’s, perhaps, focus on the concept of paradigm. Each means will represent a paradigm, such that migration is a paradigm just as valid as emulation. The addition of the term *Paradigm(s)* would indicate the choice and the possibility of using multiple paradigms.
Now, based on how one perceives the concept, the following terms can be combined with *Digital* to create the meaning most would agree with: Continuity, Accessibility, Perpetual and Useability. I will start with my own set of combinations, knowing that many more can be obtained:
Continuity Paradigm(s) ; (digital may be omitted)
Digital Continuity Paradigm(s)
Perpetual Useability Paradigm(s)
Continual Useability Paradigm(s)
Permanent Useability Paradigm(s)
Useability Continuum (Paradigm(s) may also be omitted)
Which is catchy?
Tibaut
——-
NB: *Useability* is the spelling encountered in the ISO standard not to be confused with Usability (in computing).
As David Rosenthal has been arguing for years, formats don’t really ever go obsolete. However software does, or at least it becomes unavailable. So perhaps “Software Obsolescence” would be a better label for that row. It is the obsolescence of compatible software that makes files in a particular format inaccessible, and migration and emulation are ways to resolve that issue.
The other topics covered in that row: ensuring files are created using formats (e.g. open formats) that are either:
1. likely to always have software available to interact with them, or
2. likely to be able, in the future, to have software created from scratch to interact with them,
are also ways to avoid the inaccessibility problem associated with software obsolescence.
In addition to using that label it might also be useful to include an action to “monitor file formats in use to ensure compatible software is available for interacting with the files”, and “monitor software available within the archive’s context and within the designated community”.
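The monitoring action suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the format identifiers, the `SOFTWARE_REGISTRY` mapping, and the `formats_at_risk` helper are all hypothetical, and a real archive would draw this information from its own format identification tool and software inventory.

```python
from collections import Counter

# Hypothetical registry: format identifier -> software the archive (or its
# designated community) is known to have available for that format.
SOFTWARE_REGISTRY = {
    "application/pdf": ["ExampleViewer 3"],
    "image/tiff": ["ExampleImager 1"],
    "application/x-wordperfect": [],  # format is known, but no software is held
}


def formats_at_risk(holdings, registry):
    """Return {format: file_count} for formats in use that have no known
    compatible software in the registry (missing or empty entries)."""
    counts = Counter(fmt for _path, fmt in holdings)
    return {fmt: n for fmt, n in counts.items() if not registry.get(fmt)}


# Hypothetical holdings: (path, identified format) pairs.
holdings = [
    ("reports/a.pdf", "application/pdf"),
    ("letters/b.wpd", "application/x-wordperfect"),
    ("letters/c.wpd", "application/x-wordperfect"),
    ("data/d.xyz", "application/x-unknown"),
]

print(formats_at_risk(holdings, SOFTWARE_REGISTRY))
```

Run periodically, a report like this would flag formats that are still in use but for which no compatible software is recorded, which is the situation the monitoring action is meant to catch.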
Thrilled to see all this discussion! There are already a lot of great ideas in here and I hope to see more pop in. While I like some of the ideas behind the use of continuity and usability as terms, I think they just aren’t specific enough. I mean, the whole thing is about ensuring long term access, so couldn’t those terms describe all of the rows? I think Chris’ suggestion to just make it “formats” is interesting. I could imagine someone might think it is about actual physical media in that case, so it could always be “digital formats.” To Euan’s point, in my mind format obsolescence would be somewhat interchangeable with software obsolescence. In both situations we have a file in a format that we can no longer render. I think it still makes sense to focus on the format of the file in that the file is the thing with the content that I actually care about.
What I like about the idea of formats (and hence the focus on file formats) as a basis of work in digital preservation is that it gives us something to standardize around. Most folks have files that they want to access in the future. While the functioning of a system, or interaction between a range of components is itself important, I think that quickly gets away from some of the principal goals of the Levels of Digital Preservation. Namely, that the goal is to get people up and running.
Lastly, I think there is an important question for different organizations to answer that makes some of this a bit of a contested space. That is, is your digital preservation operation more like a butterfly garden or a natural history museum? If you are running a digital butterfly garden you need to spend a lot of time thinking about how you are going to keep all these digital butterflies happy and healthy, there are a lot more environmental conditions to take care of. You are maintaining a menagerie of living things. In contrast, if you are running your project like a natural history collection of butterflies you pin them down, spray them with stuff and try and put them in a place where they won’t get too dusty. Both of these approaches are about keeping butterflies around, the former gives you a much richer experience of butterflies but takes a ton of time, the latter lets you keep a lot more butterflies around and attend to them a lot less. To attempt to land this, when you focus on files you can pin them down, record some format information and get fixity info and move on and enhance your approach to do more sophisticated things later. If you aren’t focusing on files I think you start getting into this much more complex maintaining a butterfly garden space.
(I should qualify this by saying I know little about butterfly collections, but it works as a metaphor in my head, so hopefully it does in yours too.)
Great discussion. I find myself largely in agreement with Chris Dietrich, in that keeping the labels as basic as possible is important. If we get too far into the weeds (and they are thick and deep) the practicality of the model declines. I could see a label such as “format sustainability,” which covers (implicitly in my mind, anyway) all the issues discussed. To Trevor’s point, I’d say “sustainability” would mean something different to the gardener and museum person, but the concept stands.
Also, I wonder if the description in Level 1 for “formats” could be improved. Not every institution can have input into file creation, etc. The wording might be better as: “Establish flexible ingest criteria for formats; encourage the use of preferred formats and codecs.”
Great to see this well-informed discussion. I will try to comment a bit on the diverse issues raised. The woods are indeed thick and deep!
Continuity of information content is exactly what matters in my opinion too, but that is a goal-oriented view (which I strongly support and which is the perspective of our capability model as well). Digital continuity and permanence in this sense are great overarching goals (and much better targets than “preservation”, which is really the task to do, not the goal to achieve). But the levels table follows a different structure, one about what to do first and what to look at next, and continuity might also apply to other rows and columns, so I am not sure if it fits.
On the other hand, there is goal-orientation in other rows too, concerned with quality – information security, integrity, fixity. Trying to structure a row like this, understandability would arise as a natural label.
While I enjoy thinking of the butterfly garden and see a lot of truth in the metaphor, there is a danger of forgetting the computational “magic” that is actually going on. Trying to pick up the metaphor, with a digital butterfly collection, you do not really hold a butterfly at all. You hold sets of molecules (DNA?) and (hopefully) instructions on how to “make butterflies happen”, how to “perform butterflies” using these molecules. Only if you add the magic of life to these things you are keeping – only when you start the computation – does it become a butterfly as we know it, rather than a set of molecular structures in a coffin with a stamp on it saying “to be resurrected using X”. This counts not just for the garden, but also for the museum – the visitors want to look at butterflies, but in its digital incarnation, there are no butterflies in the drawers. They are just performed when someone requests to see one. (I apologise, I am not a butterfly collector.)
In that sense, I agree with a lot of what Euan says, in particular about monitoring. But the entire discussion on obsolescence seems quite misdirected sometimes, and exactly by labels such as “file formats”.
The immediate next step from saying “file + format + obsolescence” is the inherently misled assumption that obsolescence would be a binary yes/no property of a format or a file. Of course, if you see it like that, you could argue that a format is never “obsolete” – there is always somewhere, somehow a way of rendering the objects. (Also if the objects are software.) But this really seems to miss the point.
The rabbit holes here are deep and I don’t want to wander too much. In the end, if the costs of accessing an object are in a reasonable relation to its (expected) value and its verifiable quality, we are good. If not… we might have failed our goal of providing continued access. Instead of arguing whether obsolescence exists, we need to ask how well connected the artifacts (of whatever nature) we are preserving are to the contemporary computing ecosystem: How expensive is it for the users to access the information content successfully? (The SCAPE project has more to say about continuous monitoring and continued access.)
Back to the question: Calling the row “formats” or maybe “format” seems better to me than “file formats”, since it gets rid of one of the possible misunderstandings. And it follows the very worthy goal of keeping it simple and basic. “Content characteristics” was something Andrea suggested in her response email to comments, which I think merits consideration too, but might be mistaken for significant properties (which are absent so far…).
In general, I like the idea of complementing the table with one or more well-founded narratives, which could serve well to aid understanding and avoid misunderstandings. Understandability or Interpretability might be the best storyline in the end.
What do you think?
Finally, if you are interested in a goal-oriented perspective that is independent of formats and files and whether you believe in obsolescence or not, consider joining our efforts to create an open, systematic capability maturity model for preservation assessment, as part of our project BenchmarkDP. The discussions happening here certainly influence that model too.
We are right now running a survey on how to assess and improve preservation capabilities in organisations, which is still open and waiting for your response: Link to the survey
We are organising a workshop at IPRES this year, for which we are still accepting contributions: workshop webpage
and we will come forward with an open capability model, first case studies and requests for comments soon.
The project website is at http://benchmark-dp.org/
Christoph
(University of Toronto and Vienna University of Technology)
I agree we need to keep the labels as straightforward as possible, and of course that obsolescence is a sliding scale – a proxy for the economic cost of access. With this in mind I’ve been trying to determine what I think is most likely to be missed when focussing only on format.
I’ve come to the conclusion that when concentrating on the bitstream itself, the biggest risk is underestimating the degree to which the rendering of that bitstream depends on other contextual information. Format is effectively a single contextual hint as to how to interpret a bitstream – a link between the data and the software that it needs. For mature, well specified formats, it is often sufficient, but in other cases it is not.
Therefore, my suggestion would be that the row is called ‘Data Formats and Runtime Requirements’, and that the cells should make it clear that we need to check if a bitstream has significant external dependencies beyond those implied by format (e.g. special software, external fonts, hardware dongles, etc.). If these dependencies exist, it could recommend that we attempt to compensate (e.g. procure software/hardware, gather or even embed fonts, etc.).
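The dependency check described above might look something like the following sketch. Everything named here is an assumption made for illustration: the `Bitstream` record, the `IMPLIED_BY_FORMAT` table, and the dependency labels are hypothetical, and deciding what a format actually implies would be a matter of local policy.

```python
from dataclasses import dataclass, field


@dataclass
class Bitstream:
    name: str
    fmt: str
    # Everything the object is recorded as needing in order to render.
    dependencies: set = field(default_factory=set)


# Hypothetical table of what each format is assumed to imply by itself.
IMPLIED_BY_FORMAT = {
    "application/pdf": {"pdf-renderer"},
}


def uncovered_dependencies(item, implied, held):
    """Return dependencies that go beyond those implied by the format
    and that the archive does not currently hold, sorted by name."""
    extra = item.dependencies - implied.get(item.fmt, set())
    return sorted(extra - held)


# Hypothetical example: a PDF relying on an external font and a hardware dongle.
doc = Bitstream(
    name="plan.pdf",
    fmt="application/pdf",
    dependencies={"pdf-renderer", "font:ExampleSans", "hardware-dongle"},
)
held_by_archive = {"font:ExampleSans"}

print(uncovered_dependencies(doc, IMPLIED_BY_FORMAT, held_by_archive))
```

Anything the function returns is a candidate for the compensating actions mentioned above, such as procuring the software or hardware, or gathering and embedding fonts.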
If we attempt to collect these dependencies, we are butterfly wranglers. If we discard them, we are butterfly collectors.
Finally, in my opinion, any information about how an individual should interpret or understand an authentic rendering should be kept distinct from how we ensure that authenticity, and therefore that aspect belongs in a separate row.
I particularly appreciate the additional comments of Christoph and Andrew.
What does not seem obvious to me is the focus on format. Granted, that we will have a format at the beginning and in the end is a truism. That we will also have bit streams at the beginning and in the end is almost a permanent truth. But the process we choose, or, sorry to repeat, the paradigm we decide on, is what makes the difference: perpetual format migration, encapsulation of digital objects in a platform-independent application for permanent access, or even preservation of the record-creating/reading application through emulation? That’s a choice to make based on economics and policy decisions. No matter the choice, the outcome, or at least the intention of the choice/action, is to provide access or some type of continual access or life of the object in relationship to our understanding of it.
I am not suggesting copying ideas without the due diligence of constructive criticism and reasonable adoption, but perhaps it helps to glance at what’s happening elsewhere: the U.K. has labeled it somewhere: Use DROID to manage digital continuity (http://www.nationalarchives.gov.uk/information-management/our-services/dc-file-profiling-tool.htm).
If I may ask, whenever the means to record the information change (as biological DNA can retain stored information), should we change the term format? When medical devices are used to read the information, will we speak in terms of software and computer hardware?
Digital continuity, or, better, Access continuum (or some expression to that effect) perhaps, in my humble opinion, settles those thorny issues.
Hi Andrea and Meg,
Most of what I wanted to say has already been said by others so I’ll try not to reiterate that.
In my understanding, the main issue that Christoph raises could be summarised in the OAIS terms as that of “representation information”, of which file format may only be one of the aspects. It is a good point, however, I am afraid that simply changing the label won’t fix the problem since the individual levels will still be described in terms of formats. (I consciously used ‘a good point’ and not ‘a valid point’ because what is valid is a subjective measure and depends on the answer to ‘valid for what?’. It might be a good point but not valid in the context of what you’re doing and why you’re doing it.)
From the discussion above, it seems you’ve got three options:
1. Re-label the row to expand the scope, in which case I think you may need to re-work the cells as well to make them broader/less format centric. OAIS actually does talk about understandability when explaining representation information, so I think it might be a good option to go with as the new label.
2. Keep it as it is, possibly with a minor modification of the label without changing the current scope, and adding an explanatory note in the text.
3. Combine the previous two – keep the row as is but add a new one.
Any of them will do. What you choose in the end will probably to an extent depend on the answer to what the intended purpose of the document is and possibly also who the intended audience is. Either way, I agree that the main benefit of the framework is to help people get started and it should therefore be kept simple.
I personally prefer #2. To answer your questions:
Do you think “file formats” is clear enough as shorthand for these kinds of issues, given where most of us are in our practical digital preservation efforts, or does this need to be changed?
I do. Many of us are used to thinking in terms of file formats, simply because given the nature of the data we deal with and the designated community’s knowledge base it’s all that matters or because it’s a pragmatic choice that allows us to keep things manageable. Having said that, I acknowledge that for many others who deal with complex data such as research data it won’t be sufficient. However, as a starting point it should be clear and would do.
What label would you use for the last row of the chart? (Content characteristics? Usability? Just plain formats, without “file”?)
I prefer ‘understandability’. ‘Usability’ or ‘format’ would be fine, too. I see ‘digital continuity’ as synonymous with ‘digital preservation’ and therefore relevant to the whole table, not only the last row.
Are there other changes you think we should make to improve that row?
If you re-label the row along the lines that Christoph suggests then the scope of all the levels should be expanded.
Any changes you’d recommend to other parts of the chart?
Add an explanatory note in the text.
Libor