
Born Digital Minimum Processing and Access


The following guest post from Kathleen O’Neill, Archives Specialist in the Library of Congress Manuscript Division, continues our series of posts reflecting on CurateCamp Processing.

Meg Phillips’s earlier post on More Product, Less Process for Born Digital Collections focused on developing minimum standards for ingest and processing, with the goal of making the maximum number of records available to the greatest number of users. The output of a minimum processing workflow would be a bitstream copy of a file with accompanying metadata that is discoverable and available to researchers. But available does not necessarily mean accessible, which led me to wonder: is there a sufficient minimum standard for access to born digital materials?
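For concreteness, here is a minimal sketch of what that output might look like in practice: every file copied verbatim, with a simple manifest recording each file’s relative path, size, and a SHA-256 checksum. It is an illustration only; the function names, directory paths, and manifest fields below are placeholders rather than a prescribed workflow.

    # Illustrative only: one way a bitstream copy plus metadata might be produced.
    # Directory names and manifest fields are placeholders, not a prescribed workflow.
    import csv
    import hashlib
    import shutil
    from pathlib import Path

    def sha256(path: Path) -> str:
        """Compute a SHA-256 fixity value, reading the file in chunks."""
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def copy_with_manifest(source_dir: Path, dest_dir: Path) -> None:
        """Copy every file verbatim and record a simple manifest alongside the copies."""
        dest_dir.mkdir(parents=True, exist_ok=True)
        with (dest_dir / "manifest.csv").open("w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(["relative_path", "size_bytes", "sha256"])
            for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
                rel = src.relative_to(source_dir)
                dst = dest_dir / "objects" / rel
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dst)  # copy2 keeps timestamps with the bitstream copy
                writer.writerow([str(rel), src.stat().st_size, sha256(src)])

    # Hypothetical accession paths, for illustration only:
    # copy_with_manifest(Path("/transfer/accession_2012_001"), Path("/storage/accession_2012_001"))

Even something this small gives a researcher a list of what exists and gives the repository a fixity baseline to check against later.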

[Photo: Participants considering and admiring the emergent CurateCamp schedule]

Does a copy of the bitstream constitute a sufficient minimum level of access? It could, when files are in a readable format. A bitstream of a file in an obsolete format, however, might not provide access to the content. Are institutions obligated to provide software and tools to enable the researcher to access the content? Are institutions obligated to migrate file formats?
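To make that distinction a little more concrete, one could imagine a rough triage pass over a bitstream copy that flags which files are in formats everyday software can probably open and which need closer attention. The sketch below guesses by a handful of magic-byte signatures and then by file extension; the signature table and paths are purely illustrative, and serious identification work would lean on PRONOM-based tools such as DROID or Siegfried.

    # A rough triage sketch, not a real identification tool: it guesses by a few
    # well-known magic-byte signatures and then by extension via the standard library.
    # Production workflows would use PRONOM-based tools such as DROID or Siegfried.
    import mimetypes
    from pathlib import Path

    SIGNATURES = {  # tiny, purely illustrative signature table
        b"%PDF": "application/pdf",
        b"\x89PNG": "image/png",
        b"\xff\xd8\xff": "image/jpeg",
        b"PK\x03\x04": "zip container (e.g. docx, odt)",
    }

    def triage(path: Path) -> str:
        """Return a best guess at the format, or flag the file for closer review."""
        with path.open("rb") as f:
            head = f.read(8)
        for magic, label in SIGNATURES.items():
            if head.startswith(magic):
                return label
        guess, _ = mimetypes.guess_type(path.name)
        return guess or "unidentified -- may need migration, emulation, or research"

    # Hypothetical path, matching the manifest sketch above:
    for f in sorted(Path("/storage/accession_2012_001/objects").rglob("*")):
        if f.is_file():
            print(f, "->", triage(f))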

Regarding migration, one CurateCamp participant remarked, “migration sounds like a lot of process” and compared migration to translating texts for researchers (a comparison I heard echoed elsewhere at the conference). If you are a Martin Heidegger scholar, you would be expected to read German; reference staff would not translate text for you. Should users be expected to have a minimum level of technical expertise? Is that even a fair comparison?

And what of emulation, virtualization and disk images as means to access obsolete file formats and software? The technical and legal challenges associated with emulation and virtualization put those solutions out of reach for the majority of institutions. An informal show of hands revealed that few institutions were capturing disk images and none were serving them to users, due to concerns regarding PII and donor restrictions.

The discussion about access was complicated by the fact that born digital humanities research is in its infancy. Archival institutions understand how researchers use paper records and can process and provide access accordingly, but they know very little about researchers of born digital records. Who are our users? And how will they be using the material? What level of technical expertise should be expected?

Several participants urged institutions to partner with researchers, leveraging researchers’ technical expertise to make born digital collections more accessible, discoverable, and usable. The Accessible Visualization session demonstrated that there is a dizzying array of tools for metadata extraction, visualization, and text analysis. A good place to begin is Bamboo DiRT, a registry of digital research tools.

In the meantime, is providing a copy of the bitstream to users a sufficient minimum level of access? Yes…and maybe. It depends on the file, and it depends on the user. So in addition to the bitstream copy of the file, I would think a minimum standard for access should include some type of file viewer, to increase the possibility that the researcher could at least read the content.
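That viewer need not be anything elaborate. Even a crude pass that extracts runs of printable characters from a file, in the spirit of the Unix strings utility, would sometimes let a researcher read part of the content after the native software is long gone. The sketch below is purely illustrative (the script and file names in the usage comment are made up), and it only helps when the file contains text at all.

    # A crude stand-in for a "file viewer": pull runs of printable characters out of
    # any file, in the spirit of the Unix strings utility. It helps only when the
    # content includes plain text, but it costs almost nothing to offer.
    import re
    import sys
    from pathlib import Path

    def readable_runs(path: Path, min_length: int = 6):
        """Yield runs of printable ASCII at least min_length characters long."""
        data = path.read_bytes()
        for match in re.finditer(rb"[ -~]{%d,}" % min_length, data):
            yield match.group().decode("ascii")

    if __name__ == "__main__":
        # Usage (hypothetical): python rough_viewer.py wordperfect_letter_1991.doc
        for run in readable_runs(Path(sys.argv[1])):
            print(run)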

What would your institution’s minimum standard for access include?

Comments (6)

  1. “If you are a Martin Heidegger scholar, you would be expected to read German; reference staff would not translate text for you. Should users be expected to have a minimum level of technical expertise? Is that even a fair comparison?”

    Yes, yes, yes! As I’ve written before, digital preservation has expected far too little of the users of its end products. If you are a scholar working on old materials there are skills you must have in your armoury: not just Heidegger’s German, but perhaps the skills to read older writing in manuscript, and so on. No one will spoon-feed you, much less the archivists, for whom resources applied to spoon-feeding possible future users come at the expense of preserving more content for them.

    Many (most?) documents sit in archives for decades or more without being touched. The right person to put in the investment to understand a document or digital resource is the end user, especially the scholar.

    If there are digital resources in our archives that are of interest to the public at large who do not have those skills, someone will find a way to make them available. I am not equipped to read the Domesday Book to find information about my local town, but I can still get that information from a variety of sources, some for a fee. But if I want some arbitrary piece of ancient information, I may have to put in some serious effort!

  2. At The National Archives here in the UK, we’ve articulated this idea as Parsimonious Preservation: http://www.nationalarchives.gov.uk/documents/parsimonious-preservation.pdf

    At ICA 2012 in Brisbane our CEO also talked about the necessity or otherwise of migration, expressing the view that being able to read older formats is akin to palaeography http://www.nationalarchives.gov.uk/documents/the-national-archives-digital-strategy.pdf

    (I’ve mentioned these papers previously in comments on another Signal post: //blogs.loc.gov/digitalpreservation/2012/09/help-define-levels-for-digital-preservation-request-for-public-comments/#comment-7166.)

  3. Yes, researchers need some minimal level of technical expertise (I’ve seen too many open reel audio tapes mucked up by carelessness or lack of knowledge), but I’m not sure that the translation comparison is a fair one. Digital is not a language. It’s many languages, softwares, hardwares, operating systems, etc. Technical savvy with certain of these does not necessarily translate to competency with all of them. If you are a Cory Arcangel scholar, yes you should be able to understand the medium he deals in, but if you are a Salman Rushdie scholar do you necessarily need to be an expert on word processing programs?

    There also seems to be an impediment here in the idea that an interested scholar will figure out a way. This assumes that the files have enough information associated with them to be discovered, or at least selected for some kind of migration/emulation work. As I argued in my take on MPLP and audiovisual collections, What’s Your Product, there is a greater degree of processing required to get these materials to accessibility. I’m not sure if that level of processing goes down to migration, but I feel it’s deeper (and needs to be more aware of providing/informing researchers of equipment/software requirements) than what is involved with letting researchers browse minimally processed manuscripts.

  4. Joshua, I can agree with you about the importance of metadata for discovery. Clearly there is a need to make digital manuscript collection content discoverable. I think something like a listing of files and directories goes a good way in this direction.

    With that said, I think a key point about this content is that if you hand someone a collection of someone’s personal files there is a great bit they will be able to do with them with the basic stuff that comes for free on most computers. While the imagined Salman Rushdie scholar wouldn’t be able to do all the things they could with the emulated computer they would be able to open most of the image files and open a wide range of document files in either newer versions of programs or text editors.

    I like the comparison that Chris made to learning to read different kinds of script from different time periods. After this thin layer of minimum work (getting the bits off the disk and making them available), you can then start to think about ways to batch process what you end up with, as a reaction to demand and future format obsolescence problems.

  5. This is a really late reply, but I wanted to respond to Trevor’s point:

    “if you hand someone a collection of someone’s personal files there is a great bit they will be able to do with them with the basic stuff that comes for free on most computers.”

    This is true of very basic files like standard image formats, text, and rich text. It’s not true of *any* proprietary format. If I open a Microsoft Word document in a text editor, it looks like this (direct C&P):

    N‹l×^CšãÅÕÏÝ[êÒZèOt›

    There’s no access to the content. Similarly, while it’s reasonable to expect users to open *contemporary* filetypes, archivists have to think in the long term, and today’s software often has a hard time with files created 20 or more years ago. (PC Gamer has a good article highlighting the challenge of working with software created for Windows 3.1: http://www.pcgamer-magazine.com/pcgamer/201208?sub_id=VPpJNaDh5sKz#pg48)

    So while SOME competency can be expected, I think we have a responsibility to create access copies in open, sustainable formats whenever possible. I also think some guidance regarding emulation or virtualization is in order if we’re serving up disk images. Keeping a Mac Classic working over the long haul requires time and expertise, but pointing a researcher to vMac only takes a few sentences.

    The language metaphor, while initially appealing, falls short. It’s fair to expect researchers to speak the language a manuscript is written in. It’s not fair to give them the INK it was written with and expect them to suss out how that ink was originally laid on the page, which is what we’re really doing if we give someone a bitstream with no access information.

  6. Thanks Rachel, I can get behind you on the value of offering guidance on emulation and virtualization for researchers. With that said, I think that guidance is likely best approached as a set of skills and knowledge that scholars should be building up, not as part of the minimum level of processing archives should do to make something accessible to scholars.

    The word minimum there is really critical. If you bring creating access copies into the minimum level of work to be done with materials, then you get something that is going to be a much slower process. Do you try to do this on a file-by-file basis? Do you try to batch process derivatives? Something else? I’d be interested to know what kind of step you imagine adding to Kathleen’s list.

    I think the language metaphor works better than the ink one. At the most basic level, files come with filenames and file extensions, and if you have a manifest you have some contextual information. While, yes, you can’t get much from opening an old .doc file in a text editor, if you have the file extension and a date you are likely only a few steps away from knowing of a piece of software that can open it. (In the case of the .doc file, it wouldn’t take too long to find a copy of Open Office that would likely at least let you read what was inside it.) I would rather have that level of hunting and pecking with a lot of archival material than have much more fine-grained work completed to make a much smaller number of collections available with file-by-file access copies.
