Insights is an occasional series of posts in which members of the National Digital Stewardship Alliance Innovation Working Group take a bit of time to chat with people doing novel, exciting and innovative work in and around digital preservation and stewardship. In this post, I am thrilled to have a chance to hear from Doug White, project leader for the National Institute of Standards and Technology National Software Reference Library. I heard Doug give a fantastic talk about his work at the CurateGear Workshop (see slides from the talk here).
Before we dig into the details of the project, you mentioned that the NSRL has already resulted in saving at least one person’s life. Could you walk us through exactly how that came about? I think it makes for a really compelling story for why software preservation matters.
Doug: Certainly; it was an unintentional circumstance. To begin, we were often asked if software could be borrowed from the NSRL, and the response was, “No, we are a reference library, not a lending library.” But then we received a call from a Food and Drug Administration agent on a Friday afternoon in December 2004.
A medical supply company in Miami had received a delivery of botulin, which was to be processed into Botox and distributed. However, it was misprocessed, and a dangerous concentrate was distributed. The FDA had all of the information needed to identify the recipients, but the information was in a file created with a 2003 version of a popular business software application. The 2004 version available to the FDA could not open the data file. The manufacturer of the software was also unable to supply the relevant version.
It so happened that one of the agents involved in the case was familiar with the NSRL, and had in fact provided software to us earlier in the year. He called, explained the situation, and asked if we had the 2003 version of the software. We did! The agent then arranged for an FDA contact to come to NIST, get the software, and put it on a jet to Miami. The people working the case in Miami were able to install the old version, open the data file, and trace the paths of the botulin.
Several fortunate events occurred to enable this story to end on a positive note. We have a process in place should this occur again, though we consider the NSRL to be a “last resource.”
Trevor: I have heard you describe the National Software Reference Library as a library of software, a database of metadata, a NIST publication and a research environment. Could you give us a little background on the project and explain how NSRL serves these different functions?
Doug: The diagram below is an overview showing several facets of the NSRL. The path using red arrows involves our core operations, green arrows designate “derivative” operations, and blue illustrates some collaborative research.
The physical library is our foundation. At the inception of the project, in 2000, organizations were creating and sharing metadata describing computer files on a very ad hoc basis. If the metadata were questioned, it was highly unlikely that the original media were available to resolve the issue. The NSRL operates in the same fashion as an evidentiary locker, with the original media available in the event of a question.
The physical library has a parallel virtual library. NSRL has created bit-for-bit copies of the original media and images of packaging materials that are kept on a network storage device. I need to point out that the NSRL runs on a network disconnected from the Internet, and in fact, also disconnected from the NIST network infrastructure, using equipment and cables we installed. The media copies can be manipulated automatically and used by multiple processes, and repeated physical contact with the original objects is minimized.
From the packaging and media, we collect metadata from every application, from every file. We store the metadata in a PostgreSQL database. The database has several schemas, which act as conceptual boundaries around accession processes, the collection of software application descriptions by manual processes, the collection of content metadata by automated processes, storage processes and publication processes. The work processes and the technology are modular components that are easy to test, maintain, train, or reuse. The database metadata (with the exception of staff information) is available on request.
There is a subset of the collected metadata which is of use to investigators and researchers in the community in which NSRL participates, and the subset is published quarterly as NIST Special Database #28. The specific data includes:
- Manufacturer Name
- Operating System Name
- Operating System Version
- Product Name
- Product Version
- Product Language
- Application Type
- SHA-1 of file (digital fingerprint)
- MD5 of file
- CRC32 of file
- File Name
- File Size
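As a rough illustration of the file-level fields in that list, the sketch below computes the SHA-1, MD5, CRC32, file name and file size for a single file using only Python's standard library. It is a minimal example for readers unfamiliar with these fingerprints, not the NSRL's actual tooling.

```python
import hashlib
import os
import zlib

def file_fingerprint(path):
    """Compute NSRL-style file-level fields for one file.

    Illustrative sketch only -- not the NSRL's actual software.
    Reads the file in chunks so large files need not fit in memory.
    """
    sha1 = hashlib.sha1()
    md5 = hashlib.md5()
    crc = 0
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            sha1.update(chunk)
            md5.update(chunk)
            crc = zlib.crc32(chunk, crc)
            size += len(chunk)
    return {
        "SHA-1": sha1.hexdigest().upper(),
        "MD5": md5.hexdigest().upper(),
        "CRC32": format(crc & 0xFFFFFFFF, "08X"),
        "FileName": os.path.basename(path),
        "FileSize": size,
    }
```

Because the digests are computed deterministically from the bytes alone, two investigators hashing the same file independently will always produce the same record, which is what makes a published hash set useful as a shared reference.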
The research environment allows NSRL to collaborate with researchers who wish to access the contents of the virtual library. Researchers may perform tasks on the NSRL isolated network that involve access to the copies of media, to individual files, or to “snapshots” of software installations. In addition to the media copies, NSRL has compiled a corpus of the 25,000,000 unique files found on the media, and examples of software installation and execution in virtual machines.
Trevor: Could you give us a brief overview of what exactly is the content of the library? What data and metadata do you collect and how do you work with it?
Doug: The library contains commercial software, both off-the-shelf shrink-wrapped physical packages and download-only “click-wrapped” digital objects. This includes computer operating systems, business software, games, mobile device apps, multimedia collections and malicious software tools.
Most of the software in the NSRL is purchased. We try to acquire everything on the top-selling lists. Some software we hear about by word of mouth, some by schedule (like tax programs each tax year, security and antivirus products) and some by requests from law enforcement and other agencies. We accept donations from manufacturers and have paperwork to state that we will not use the software license. We accept donations of used software as long as it is in usable condition, but there is no guarantee that it will make it into the NSRL.
The data and metadata are detailed in documents on the NSRL website. To summarize, we collect accession data familiar to your readers: the information about the manufacturer and publisher, the minimum requirements listed, the number and types of media, etc. We also process the contents of the media to obtain metadata about the file system(s), directory structure, file types (based on signature strings) and the file-level metadata I mentioned in the previous question.
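For readers unfamiliar with signature strings: a file's type can often be recognized from "magic" bytes at its start, independent of its name or extension. The sketch below is a hypothetical, heavily abbreviated illustration; real identification tools carry signature tables with hundreds of entries.

```python
# A tiny, illustrative magic-number table. Production file-type
# identifiers (e.g. the Unix `file` utility) use far larger tables.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive (also DOCX/JAR/etc.)"),
    (b"MZ", "Windows executable"),
]

def identify(data: bytes) -> str:
    """Return a coarse file type based on leading signature bytes."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"
```

Note that ordering matters in such tables: more specific signatures must be checked before shorter, more generic ones.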
NSRL makes minimal use of this metadata. We perform mock investigations using the metadata to measure its applicability. We investigate the randomness of the cryptographic algorithm results. We are constantly seeking related collections with which we could combine an index or translate a taxonomy, to cross-reference NSRL data with other sets.
Trevor: In the context of thinking about NSRL as a research environment it seems that the key value there is the corpus of software, the 23,809,431 unique files, that you have identified. Could you tell us about some of the research uses these have served so far? The audience for the blog varies widely in technical knowledge so it would be ideal if you could unpack these concepts a bit too.
Doug: The highest value, in my opinion, is the provenance and persistence of the collection. Given the virtual library, it is easy to apply new technology and new algorithms to the entire set or to specific content automatically, while maintaining the relationship to previous work and the original media.
NSRL has applied several cryptographic algorithms against the corpus, and statistically analyzed the results. This is an interesting measurement of the algorithm properties within the relatively small scope of binary executable file types. NSRL found that indeed there were no collisions among the 25 million files.
Working with a collaborator, we are able to define precise, static content sections of executable files, obtain a digital fingerprint of those sections, then identify those sections when they are present on a running computer. This can allow an investigator to determine that a program was running, even though the files do not exist on the computer.
Working with a collaborator, we are able to provide practical feedback on the development of an algorithm called a similarity digest. Currently, if you have two digital copies of the Gettysburg Address text, one which begins “Five score and …”, the two cryptographic hashes of the differing files will be extremely dissimilar, as intended. Two similarity digest results on the two Address files will be similar, and the similarity can be measured. Algorithms of this kind are also known as “fuzzy” hashes, and they tend to be impractical for very large sets. We are assisting in developing a practical implementation.
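To make that contrast concrete, here is a toy sketch: SHA-1 shows how a one-word change produces a completely different cryptographic hash, while a simple 4-gram Jaccard overlap stands in for a similarity digest. The Jaccard measure is only a stand-in of my choosing; real similarity-digest algorithms (such as ssdeep or sdhash) are far more sophisticated.

```python
import hashlib

def ngrams(text: str, n: int = 4) -> set:
    """All overlapping character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: str, b: str) -> float:
    """Toy similarity score in [0, 1]: shared fraction of 4-grams.
    A crude stand-in for a real similarity digest, for illustration only."""
    sa, sb = ngrams(a), ngrams(b)
    return len(sa & sb) / len(sa | sb)

original = "Four score and seven years ago our fathers brought forth"
altered  = "Five score and seven years ago our fathers brought forth"

# A one-word change yields two completely unrelated SHA-1 values...
h1 = hashlib.sha1(original.encode()).hexdigest()
h2 = hashlib.sha1(altered.encode()).hexdigest()

# ...while the similarity score stays close to 1.0, revealing that
# the two texts are nearly identical.
sim = jaccard(original, altered)
```

This is exactly the trade-off in the Gettysburg Address example: the cryptographic hash is intentionally useless for measuring closeness, which is what makes a separate similarity digest worth developing.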
NSRL has in the past limited metadata collection to the content of the application media. We have now acquired the resources and defined the processes to automatically install an operating system on a virtual machine, run the OS, perform noteworthy tasks, install applications, generate content, uninstall applications, etc. This enables the collection of metadata on dynamic system files, registries, log files, memory, and various versions of user-generated files. We can use some of this metadata as feedback into our core process, and we have some research opportunities.
Another imminent collaboration is the creation of many word processing documents, created with different applications and multiple versions, that contain the same text. A corpus of document tags or codes spanning versions and products has generated some interest.
Trevor: Could you tell us a little bit about the NSRL environment? What kinds of technologies and software are you currently using, and what are you exploring for use in the future?
Doug: We are fortunate to have three contiguous rooms, one that houses the physical library, one that houses the data entry workstations, and one that houses servers and storage. The proximity of the rooms allowed us to pull our own cables, which makes that level of our infrastructure a controlled, known quantity.
The physical library has an alarmed, multi-factor entry control. The shelf system is a powered collapsing system which defaults to a closed, fire-retardant position. The environment is not kept within the recommended practices for archives; this was considered, but not implemented. Heat, fire, humidity and other risks are minimized as best we can.
NSRL has strived to keep infrastructure implementations to hardware and technologies that can be quickly obtained and made functional in the event of a disaster. I would prefer to not name manufacturers at this time, but am willing to discuss those details with individuals.
In the second room, core work is performed using OpenSuSE Linux workstations for browser-based data entry and media copying. The Linux machines can be created in bulk or ad hoc using a net boot image. This room also contains a system used to perform software installations, so the NSRL can collect installed files, registry information and other artifacts of a running application. This room contains a computer attached to the Internet on which NSRL downloads digital-only distributions of software. A photography stand and flatbed scanning stations are also in this room, used to create digital photos of packaging, so those photos can be used for data entry and research instead of the shelved material.
Movement of original packages and media is limited to the previous two rooms.
The third room is a computer server room with racks of equipment. The media copies are stored on a commercial, expandable network storage system (currently 42 TB) that is capable of access by Windows, Apple and Linux computers. We have several quad-core rack mounted servers that perform the automated distributed metadata collection tasks. A PostgreSQL database and an Apache webserver reside on one of the rack servers which is dedicated to these functions. The database is on local storage in that server.
The equipment described in the previous paragraph is duplicated, and that is the research environment. Media images, individual files, virtual machine slices and all databases are backed up across a dedicated fiber connection to storage several buildings distant. Verification of critical files is performed nightly. We also periodically ship copies of the critical files to NIST Boulder, CO, campus.
The software we use is mostly written in Perl, with some PHP for the browser-based data entry. Reuse is key, as is flexibility; the NSRL code is essentially a wrapper or application interface which calls third-party tools to manipulate media, files or systems.
In each publishing cycle, our quality assurance process involves loading the NSRL quarterly candidate releases into several third-party digital forensics tools.
We don’t anticipate substantial changes to our technology or software in the near future. If anything, we would revisit our internal database design, and address some issues that did not scale up as well as we expected.
Trevor: If other organizations have special collections, would NSRL be interested in adding those collections to the reference library? If yes, what process would you suggest to someone interested in discussing such an arrangement?
Doug: NSRL is very interested in pursuing loan arrangements with other institutions. Transfer of materials to NIST need not be a requirement. Please contact me, or any NSRL staff, via email@example.com or 301-975-3262.
Trevor: Are there more research uses or ways that you think the NSRL could play a role in digital preservation work and research? Further, if any of the folks who follow this blog are interested in exploring research involving the software corpus, what should they put together and how should they go about getting in touch with your team?
Doug: We are new participants in the community, so I believe we are still at the point of introducing ourselves. I am hopeful that uses may be identified as our capabilities and activities are made known. This blog is a step in that direction, and I thank you for this opportunity. Anyone with questions regarding research access should contact me.
Trevor: As a final question, could you tell us a bit about how the NSRL came about? One of the tricky parts of digital stewardship is establishing the value of, and need for, building and maintaining collections, and I think the story of the needs and uses that the NSRL serves offers a powerful frame for thinking about the kinds of coalitions and common needs that digital stewardship initiatives work to support.
Doug: Prior to NIST involvement in digital forensics, law enforcement identified the need for automated methods to review the large number of files in investigations involving computers. The FBI “Known File Filter” project supplied hash values of known files, and the NDIC “Hashkeeper” project supplied hash values of installed files and of “known malicious” data files. Several commercial and open source tools existed that each used different hash values (CRC32, MD4, MD5, SHA-1).
Hash values were exchanged informally throughout the entire community via email, FTP sites, etc. Investigators had to know where to find hash sets; investigators had to judge the quality of the hash sets. There was no central, trusted repository, and there were open avenues for conflicts of interest.
NIST was contacted because of its history of impartiality in research and standards development. Among the benefits of this involvement were:
- NIST is an unbiased organization, not in law enforcement, not a vendor
- NIST can control quality of data
- NIST can provide traceability by retaining original software
- NIST can provide data in formats useful by many existing tools
- NIST has a distribution mechanism in the Standard Reference Data service
The result of this is a data set that is court-admissible, a process that is transparent, and a collection open to researchers.