Where are the Born-Digital Archives Test Data Sets?

By Butch Lazorchak and Trevor Owens

We’ve talked in the past on the Signal about the need for more applied research in digital preservation and stewardship. This is a key issue addressed by the 2014 National Agenda for Digital Stewardship, which goes a little deeper and suggests that there’s a great need to strengthen the evidence base for digital preservation.

But what does this mean exactly?

Scientific fields have a long tradition of applied research and have amassed common bodies of evidence that can be used to systematically advance the evaluation of differing approaches, tools and services.

This approach is common in some areas of library and archives study, such as the Text Retrieval Conferences and their common data pools, but is less common in the digital preservation community.

As the Agenda mentions, there’s a need for some open test sets of digital archival material that folks can use to benchmark and evaluate tools, but the first step should be to establish the criteria for data collections. What would make a good digital preservation test data set?

1. Needs to be real-world messy stuff: The whole point of establishing digital preservation test data sets is to have actual data to be able to run jobs against. An ideal set would be sanitized, processed or normalized to the least extent possible. Ideally, these data sets would come with some degree of clearly stated provenance and a set of checksums to allow researchers to validate that they are working on real stuff.

2. Needs to be public: The data needs to be publicly-accessible in order to encourage the widest use, and should be available via a URL without having to ask permission. This will allow anyone (even inspired amateurs) to take cracks at the data.

3. Needs to be legal to work with: There are many exciting honey pots of data out there that satisfy the first two requirements but live in legal grey areas. Many of the people working with these data sets will operate in government agencies and academia where clear legality is key.
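The checksum idea in the first criterion can be sketched in a few lines of Python. This is only a minimal illustration, assuming a data set ships with a plain-text manifest in the common "digest, whitespace, relative path" layout that tools like sha256sum produce; the function names and the manifest file name below are hypothetical and not tied to any of the data sets discussed here.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, data_dir):
    """Check files under data_dir against a '<hex digest> <relative path>' manifest.

    The manifest layout is an assumption for this sketch.
    Returns a dict mapping each listed path to "ok", "mismatch" or "missing".
    """
    results = {}
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        expected, relative = line.split(maxsplit=1)
        relative = relative.lstrip("*")  # sha256sum marks binary mode with '*'
        target = Path(data_dir) / relative
        if not target.is_file():
            results[relative] = "missing"
        elif sha256_of(target) == expected.lower():
            results[relative] = "ok"
        else:
            results[relative] = "mismatch"
    return results
```

A researcher who has downloaded a set could call verify_manifest("manifest-sha256.txt", "dataset/") and look for anything not marked "ok" before running tools against the files.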

There are some data sets currently available that meet most of the above criteria, though most are not designed specifically as digital preservation testbeds. Still, these provide a beginning to building a more comprehensive list of available research data, on the way to tailor-made digital preservation testbeds.

Some Initial Data Set Suggestions:

The social life of email at Enron – a new study from user chieftech on Flickr.

Enron Email Dataset:  This dataset consists of a large set of email messages that was made public during the legal investigation concerning the Enron corporation. It was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes) under the auspices of the Defense Advanced Research Projects Agency. The collection contains a total of about ½ million messages and was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.

NASA NEX: The NASA Earth Exchange is a platform for scientific collaboration and research for the earth science community. NEX users can explore and analyze large Earth science data sets, run and share modeling algorithms, collaborate on new or existing projects and exchange workflows. NEX has a number of datasets available, but three large sets have been made readily available to public users. The first, the NEX downscaled climate simulations, provides high-resolution climate change projections for the 48 contiguous U.S. states. The second, the MODIS (Moderate Resolution Imaging Spectroradiometer) data, offers a global view of Earth’s surface, while the third, the Landsat data record from the U.S. Geological Survey, provides the longest existing continuous space-based record of Earth’s land.

GeoCities Special Collection 2009: GeoCities was an important outlet for personal expression on the Web for almost 15 years, but was discontinued on October 26, 2009. This partial collection of GeoCities personal websites was rescued by the Archive Team and is about a terabyte of data. For more on the GeoCities collection see our interview with Dragan Espenschied from March 24.

There are other collections, such as the September 11 Digital Archive hosted by the Center for History and New Media at George Mason University, that have been used as testbeds in the past, most notably in the NDIIPP-supported Archive Ingest and Handling Test, but the data is not readily available for bulk download.

There are also entities that host public data sets that anyone can access for free, but further investigation is needed to see whether they meet all the criteria above.

We need testbeds like this to explore digital preservation solutions. Let us know about other available testbeds in the comments.

14 Comments

  1. Cal Lee
    March 26, 2014 at 11:51 am

    Thank you for this very thoughtful summary of an important issue. We raised similar ideas in “From Bitstreams to Heritage: Putting Digital Forensics into Practice in Collecting Institutions.” See the section called “Develop and Share Corpora for Education, Research and Tool Development” (p.33) for this discussion, along with descriptions of a variety of other existing test data sets.

    http://www.bitcurator.net/docs/bitstreams-to-heritage.pdf

  2. Alexander Duryee
    March 26, 2014 at 11:58 am

    One sample set for audiovisual material is the mplayer samples collection (samples.mplayerhq.hu). It’s nearly ideal, meeting the requirements above (being collected from live samples for software testing), and provides a rather broad scope of AV formats/quirks.

  3. Euan Cochrane
    March 26, 2014 at 12:51 pm

    Great post.

    Unfortunately requirement (2.) is especially difficult to meet. Often the best examples you come across when working in digital preservation have access restrictions for any number of reasons.

  4. Kim Tryka
    March 26, 2014 at 12:54 pm

    For lots of xml-based STM article text, the PMC (PubMed Central) Open Access Subset:
    https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/

  5. Mick Crouch
    March 26, 2014 at 3:43 pm

    This is something we’ve discussed a lot here. We’ve got stuff in-house we can play with, but we can’t share it with others. We want a set that’s enormous, complex and heterogeneous — lots of files, lots of different sizes of files, lots of different kinds of files, lots of examples of each file and lots of different versions of the same sort of file – MS Word files created by the different versions of word, all versions of PDF or TIFF – that sort of thing.

  6. Butch Lazorchak
    March 26, 2014 at 4:10 pm

    Thanks for all the great additions!

    I strongly recommend the Bitcurator report that Cal references above. It’s well worth your time.

    For ease of use I pulled out some of the dataset resources they mention in the report:

    Digital Forensics Tool Testing Images
    http://dftt.sourceforge.net/

    Computer Forensic Reference Data Sets (CFReDS)
    http://www.cfreds.nist.gov/

    Simson Garfinkel’s Digital Corpora Site
    http://digitalcorpora.org/

    M57-Patents Scenario
    http://digitalcorpora.org/corpora/scenarios/m57-patents-scenario

  7. Courtney Mumma
    March 26, 2014 at 4:14 pm

    Thanks for this. As open source digital preservation software developers (Archivematica and AtoM), we need sample content for testing features. It can be difficult to obtain sample content from memory institutions who sponsor feature development because of privacy and confidentiality issues.

    We had high hopes for the Open Planets format corpus (see https://github.com/openplanets/format-corpus) and provide it as a transfer/SIP source directory for Ingest on our online Archivematica demo site (sankofa.archivematica.org user/pass demo@example.com/demodemo) and as part of our download. Perhaps this post will serve as a reminder for the community to grow this and the other resources you mentioned.

    –Courtney C. Mumma, MAS/MLIS
    Archivematica Product Manager

  8. Andrew Perti
    March 27, 2014 at 1:15 am

    I would also add that if you’re looking to crunch a dataset in some novel way, Amazon Web Services offers grants to Universities and Nonprofits for space and cpu cycles.

    https://aws.amazon.com/grants/

    Can we count on LC to publish their ongoing Twitter dataset in a raw format at some point in the future?

    This article begs the question, what would the preferred format be for a publicly released dataset?

  9. Andy Jackson
    March 27, 2014 at 6:55 am

    I’ve been approaching this in a couple of different ways. Firstly, I was involved in setting up the aforementioned Open Planets format corpus (and a related sketch for a web-archiving-specific corpus: https://github.com/ukwa/webarchive-test-suite). I’d love to see more contributions to that kind of thing, but it seems that if we want significant volumes of openly licensed resources, we’ll need to fund their generation somehow.

    The second strategy has been to exploit what is available on the web, and index it based on features of interest, rather like a “digital preservation search engine”, so to speak. Specifically, we ran format tools over the *.uk part of the Internet Archive’s collection (which we hold a copy of for JISC) and made the results available (via a rather slow and clunky UI that will be replaced before too long I hope) here: http://www.webarchive.org.uk/aadda-discovery/formats

    It lets you slice and dice the 295,991,283 resources by format (MIME type, extension) and some other features of potential interest (e.g. generating software).

    We’re currently working on improving the indexer and the UI, as part of the BUDDAH project (http://buddah.projects.history.ac.uk/). That project is more concerned with full-text indexing and link analysis, but given that this involves parsing almost every byte anyway, it seems sensible to take the opportunity to run some kind of digital preservation scan at the same time.

  10. Richard Lehane
    March 27, 2014 at 9:33 pm

    I’ve been working on a file format identifier lately and I’ve found I’ve really needed two different kinds of test sets.

    The kind of benchmarking suite described in this post would be great as a means of testing real world performance and matching up against other tools. Like some other commenters, I’d love to see such a suite hosted in AWS (perhaps in a requester pays bucket – http://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html).

    But, in terms of the initial development, I’ve found that what I’ve really needed is a conformance suite (along the lines of the XML Conformance Test Suite: http://www.w3.org/XML/Test/) to make sure my tool covers off every feature of the PRONOM spec, including all the edge cases. This kind of suite doesn’t have to be big or even representative, it needs to be comprehensive. For a file format identifier, I’ve found Ross Spencer’s skeleton test suite (https://github.com/exponential-decay/skeleton-test-suite-generator) to be invaluable (though obviously other types of tools would need different suites).

  11. Sharon Leon
    March 28, 2014 at 11:12 am

    Just a note on the ongoing upgrade of the September 11th Digital Archive…. When we roll over to the new version in late June 2014, we’ll have a fully enabled API for the site that makes all of the public materials available to those who want to work with them.

  12. Butch Lazorchak
    March 28, 2014 at 11:18 am

    Sharon, this is great!

    And not to diminish your excellent news, but I do have an open question for all our researchers.

    In drafting the post we had considered referencing API availability as sufficient, but then had second thoughts. I wasn’t sure whether an API alone was enough for our criteria, and this reflects my own lack of knowledge as much as anything.

    Any thoughts on API vs. bulk download?

  13. Sharon Leon
    March 28, 2014 at 1:11 pm

    I’m certainly not an expert on these issues, but it seems to me that the sufficiency of the API access will depend on the kinds of things that your researchers want to do with it. 911DA certainly has the messiness of file types and content that you mention.

  14. Jim Safley
    March 28, 2014 at 2:21 pm

    @Butch Lazorchak

    Regarding whether an API meets the criteria of a digital preservation test data set, I agree that it does not meet your first criterion, “Needs to be real-world messy stuff.” While real-world, APIs regularly serve structured data, which is not messy.

    However, I wonder if messiness is an absolutely necessary requirement. It is certainly an important consideration, given that unstructured, heterogeneous data sets are most problematic for digital preservation. But structured data sets are not primed for preservation simply because they are structured.

    All too often, messiness emerges from structure. This is true because producers don’t typically have preservation in mind, and consumers demand structures that are intended for immediate use, resisting extraneous metadata such as provenance and checksum. (Not to mention that some APIs are just plain messy.)

    I think, at the very least, there is value in assessing the preservation readiness of an API and making recommendations on how to restructure and what metadata to include.
