What Does it Take to Be a Well-rounded Digital Archivist?

The following is a guest post from Peter Chan, a Digital Archivist at the Stanford University Libraries.

Peter Chan

I am a digital archivist at Stanford University. A couple of years ago, Stanford was involved in the AIMS project, which jump-started Stanford’s thinking about the role of a “digital archivist.” The project ended in 2011, and I am the only digital archivist hired as part of the project who is still on the job full time. I recently had discussions with my supervisors about the roles and responsibilities of a digital archivist. This inspired me to look at job postings for “digital archivists” to see what skills and qualifications organizations are currently looking for.

I looked at eight job advertisements for digital archivists published in the past 12 months. The responsibilities and qualifications varied widely across these organizations, but all of them required formal training in archival theory and practice. Some institutions placed more emphasis on computer skills and preferred applicants with programming skills such as Perl, XSLT, Ruby and HTML, along with experience working with SQL databases and repositories such as DSpace and Fedora. Others required knowledge of a variety of metadata standards. A few even desired knowledge of computer forensic tools such as FTK Imager, AccessData Forensic Toolkit and write blockers. Most of these tools are at least somewhat familiar to digital archivists and librarians.

Screenshot from the ePADD project.

In my own career, however, I have found other skills useful as well. Working on two projects (ePADD and GAMECIP), I found knowledge of Natural Language Processing and Linked Open Data/Semantic Web/Ontology extremely useful. For the ePADD project, I became familiar with the Stanford Named Entity Recognizer (NER) and the Apache OpenNLP library in order to extract personal names, organizational names and locations from email archives. For the GAMECIP project, familiarity with SKOS, Open Metadata Registry and Protégé helped me publish controlled vocabularies as linked open data and model the relationships among concepts related to video game consoles.
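
To make the linked-data piece a little more concrete, here is a minimal Python sketch of how a small controlled vocabulary might be written out as SKOS in Turtle syntax. The base URI `http://example.org/gamecip/`, the concept identifiers and the labels are hypothetical placeholders, not the actual GAMECIP vocabulary, and the helper name `to_skos_turtle` is invented for this illustration.

```python
# Hypothetical example: serialize a tiny controlled vocabulary as SKOS (Turtle).
# The base URI, concept IDs and labels are made up for illustration only.

BASE = "http://example.org/gamecip/"  # assumed namespace, not a real registry

# concept id -> (preferred label, id of broader concept or None)
CONCEPTS = {
    "console": ("Video game console", None),
    "handheld": ("Handheld console", "console"),
    "home": ("Home console", "console"),
}


def to_skos_turtle(concepts, base=BASE):
    """Return a Turtle document describing each entry as a skos:Concept.

    Labels are assumed to contain no double quotes or backslashes;
    this sketch does not escape them.
    """
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        f"@prefix ex: <{base}> .",
        "",
    ]
    for concept_id, (label, broader) in concepts.items():
        statements = ["a skos:Concept", f'skos:prefLabel "{label}"@en']
        if broader is not None:
            # skos:broader links a narrower concept to its parent
            statements.append(f"skos:broader ex:{broader}")
        lines.append(f"ex:{concept_id} " + " ;\n    ".join(statements) + " .")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_skos_turtle(CONCEPTS))
```

In practice, a vocabulary like this would more likely be built in a SKOS editor or an RDF library and published through a registry. Hand-assembled strings are used here only to show what the resulting statements look like.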

The table below summarizes the tasks I encountered during the past six years working in the field as well as the skills and tools useful to address each task.

Tasks which may fall under the responsibilities of Digital Archivists | Knowledge / Skills / Software / Tools needed to work on the Tasks
Collection Development (Interact with donors, creators, dealers, curators – hereafter “creators.”)
Gain overall knowledge (computing habits of creators, varieties of digital material, hardware/software used, etc.) of the digital component of a collection. | In-depth knowledge of computing habits, varieties of digital material, hardware/software for all formats (PC, Mac, devices, cloud, etc.). Tool: AIMS Born-Digital Material Survey.
Explain to creators the topic of digital preservation, including differences between “bit preservation” and “preserving the abstract content encoded into bits”; migration / emulation / virtualization; “Trusted Repository”; levels of preservation when necessary. | In-depth knowledge of digital preservation. Background: “Ensuring the Longevity of Digital Information” by Jeff Rothenberg, January 1995 edition of Scientific American (Vol. 272, Number 1, pp. 42-7) (PDF); Reference Model for an Open Archival Information System (OAIS) (PDF); Preserving.exe: Toward a National Strategy for Software Preservation (PDF); Library of Congress Recommended Format Specifications; NDSA Levels of Preservation; Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC) (PDF).
Explain to creators how forensic software is used to accession, process and deliver born-digital collections when necessary – especially regarding sensitive/restricted materials. | Special knowledge of making use of forensic software in an archival context. Tools: AccessData FTK, EnCase Forensic, etc.
Explain to creators the use of natural language processing/data mining/visualization tools to process and deliver born-digital collections when necessary. | General knowledge of tools used in processing and delivering born-digital archives such as entity extraction, networking and visualization software. Tools: Stanford Named Entity Recognizer (NER), Apache OpenNLP, Gephi, D3.js, HTML 5 PivotViewer, etc.
Explain to creators about publishing born-digital collection metadata and/or contents in semantic web/linked open data vs. Encoded Archival Description finding aids/other HTML-based web publishing methods when necessary. | Knowledge of linked data/semantic web/EAD finding aids/HTML-based web publishing methods.
Explain web archiving to creators. | General knowledge of web archiving, cataloging, delivery and preservation of web sites. Knowledge of web archiving software such as Heritrix and HTTrack. Knowledge of the Wayback Machine from the Internet Archive.
Explain to creators about the archives profession in general. | Knowledge of establishing and maintaining control, arranging and describing born-digital archival materials in accordance with accepted standards and practices to ensure the long-term preservation of collections.
Copy files contained on storage media including obsolete formats such as 5.25-inch floppy disks, computer punch cards, etc. | Knowledge of onboard 5.25-inch floppy disk controllers and hardware interfaces and tools, including IDE, SCSI, Firewire, SATA, FC5025, KryoFlux, Catweasel, Zip drives, computer tapes, etc. Knowledge of file systems such as FAT, NTFS, HFS, etc.
Ensure source data in storage media will not be erased/changed accidentally during accessioning while maintaining a proper audit trail in copying files from storage media. | Knowledge of the write-protect notch/slide switch on floppy disks and of hardware write blockers. Knowledge of forensic software (e.g., FTK Imager for PC and command-line FTK Imager for Mac).
Get file count, file size and file category of collections. | Knowledge of forensic software (e.g., AccessData FTK, EnCase Forensic, BitCurator, etc.), JHOVE, DROID, PRONOM, etc.
Ensure computer viruses, if they exist in collection materials, are under control during accessioning. | Knowledge of the unique nature of archival materials (no replacement, etc.), behavior of viruses stored in file containers and special procedures in using antivirus software for archival materials.
Accession email archives. | Knowledge of Internet email protocols (POP, IMAP) and email formats (Outlook, mbox). Knowledge of commercial software packages to archive and reformat email (Emailchemy, MailStore). Knowledge of open source software such as ePADD (Email: Process, Accession, Discover and Deliver) to archive emails.
Archive web sites. | Knowledge of web archiving software such as Heritrix and HTTrack. Knowledge of legal issues in archiving web sites. Knowledge of web archiving services such as Archive-It.
Create accession records for born-digital archives. | Knowledge of archival data management systems such as Archivists’ Toolkit (AT) with the Multiple Extent Plugin, etc.
Arrangement and Description / Processing
Screen out restricted, personal, classified and proprietary information such as social security numbers, credit card numbers, classified data, medical records, etc. in archives. | Knowledge of the sensitivity of personally identifiable information (PII) and tools to locate PII (e.g., AccessData FTK, Identity Finder). Knowledge of legal restrictions on access to data (DMCA, FERPA, etc.).
Classify text elements in born-digital materials into predefined categories such as the names of persons, organizations and locations when appropriate. | Knowledge of entity extraction software and tools to perform entity extraction (such as Open Calais, Stanford Named Entity Recognizer, Apache OpenNLP).
Show the network relationship of people in collections when appropriate. | Knowledge of network graphs and tools such as Gephi and NodeXL.
Create controlled vocabularies to facilitate arrangement and description when appropriate. | Knowledge of the concepts of controlled vocabularies. Knowledge of the W3C standard for publishing controlled vocabularies (SKOS). Knowledge of software for creating controlled vocabularies in SKOS such as SKOSjs and SKOS Editor. Knowledge of platforms for hosting SKOS controlled vocabularies such as Linked Media Framework and Apache Marmotta. Knowledge of services for publishing SKOS such as Open Metadata Registry and PoolParty.
Model data in archives in RDF (Resource Description Framework). | Knowledge of semantic web/linked data. Knowledge of commonly used vocabularies/schemas such as DC, Schema.org and FOAF, etc. Knowledge of vocabulary repositories such as Linked Open Vocabularies (LOV). Knowledge of tools to generate RDF/XML and RDF/JSON such as LODRefine and Karma, etc.
Model concepts and relationships between them in archives (e.g., video game consoles) using ontology when appropriate. | Knowledge of the W3C standard OWL (Web Ontology Language) and software to create ontologies using OWL such as Protégé and WebProtégé.
Describe files with special formats (e.g., born-digital photographic images). | Knowledge of image metadata schema standards (IPTC, EXIF) and software to create/modify such metadata (Adobe Bridge, Photo Mechanic, etc.).
Describe image files by names of persons in images with the help of software when appropriate. | Knowledge of facial recognition functions in software such as Picasa and Photoshop Elements.
Use visualization tools to represent data in archives when appropriate. | Knowledge of open source JavaScript libraries for manipulating documents such as D3.js and HTML 5 PivotViewer, and of commercial tools such as IBM ManyEyes and Cooliris.
Assign metadata to archived web sites. | Knowledge of cataloging options available in web archiving services such as Archive-It or in web archiving software such as HTTrack.
Create EAD finding aids. | Knowledge of accepted standards and practices in creating finding aids. Knowledge of XML editors or other software (such as Archivists’ Toolkit) to create EAD finding aids.
Discovery and Access
Deliver born-digital archives. | Knowledge of copyright laws and privacy issues.
Deliver born-digital archives in reading room computers. | Knowledge of security measures required for workstations in reading rooms, such as disabling Internet access and USB ports, to prevent unintentional transfer of collection materials. Knowledge of software to deliver images in collections such as Adobe Bridge. Knowledge of software to read files with obsolete file formats such as QuickView Plus.
Deliver born-digital archives using institutions’ catalog system. | Knowledge of the interface required by the institutions’ catalog system to make the delivery.
Deliver born-digital archives using institution repository systems. | Knowledge of DSpace, Fedora, Hydra and the interfaces developed to facilitate such delivery.
Publish born-digital archives using linked data/semantic web. | Knowledge of linked data publishing platforms such as Linked Media Framework, Apache Marmotta and OntoWiki, and of linked data publishing services such as Open Metadata Registry.
Deliver born-digital archives using exhibition software. | Knowledge of open source exhibition software such as Omeka.
Deliver archived web sites. | Knowledge of delivery options available in web archiving services such as Archive-It or in web archiving software such as HTTrack.
Deliver email archives. | Knowledge of commercial software such as MailStore. Knowledge of open source software such as ePADD (Email: Process, Accession, Discover and Deliver).
Deliver software collections using emulation/virtualization. | Knowledge of emulation/virtualization tools such as KEEP, JSMESS, MESS, VMNetX and XenServer.
Deliver finding aids of born-digital archives using union catalogs such as OAC. | Knowledge of uploading procedures for the respective union catalogs such as OAC.
Prepare the technical metadata (checksum, creation, modification and last access dates, file format, file size, etc.) of files in archives for transfer to preservation repository. | Knowledge of forensic software such as AccessData FTK, EnCase Forensic and BitCurator, etc. Programming skill in XSLT to extract the information when appropriate from reports generated by the software.
Use emulation / virtualization strategy to preserve software collections. | Knowledge of emulation/virtualization tools such as KEEP, JSMESS, MESS, VMNetX and XenServer.
Use migration strategies to preserve digital objects. | Knowledge of Library of Congress Recommended Format Specifications. Knowledge of migration tools such as Xena and Adobe Acrobat Professional, and of the CURL Exemplars in Digital Archives (Cedars) and Creative Archiving at Michigan and Leeds: Emulating the Old on the New (CAMiLEON) projects.
Submit items to preservation repository. | Knowledge of preservation systems such as Archivematica and LOCKSS and preservation services such as Portico, Tessella and DuraSpace. Knowledge of preservation repository interfaces. Advanced knowledge of Excel for batch input to the repository when appropriate.
Preserve archived web sites. | Knowledge of preservation options available in web archiving services such as Archive-It. Knowledge of preserving web sites in a preservation repository.
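
As a small illustration of the technical-metadata and file-count tasks in the table, the following Python sketch walks a directory of already-copied files and records a checksum, size and modification time for each, then tallies files by extension. It is a simplified stand-in for what forensic tools report, not a substitute for them. The function names and the choice of SHA-256 are assumptions made for this example, and it should only be run against a working copy, never against original media.

```python
import hashlib
import os
from collections import Counter
from datetime import datetime, timezone


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a SHA-256 checksum by reading the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory(root):
    """Return one metadata record per file found under root."""
    records = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(folder, name)
            info = os.stat(path)
            records.append({
                "path": os.path.relpath(path, root),
                "size_bytes": info.st_size,
                # Modification time as reported by the file system, in UTC
                "modified_utc": datetime.fromtimestamp(
                    info.st_mtime, tz=timezone.utc
                ).isoformat(),
                "sha256": sha256_of(path),
            })
    return records


def count_by_extension(records):
    """Count files per lowercase extension (a rough proxy for file category)."""
    return Counter(
        os.path.splitext(r["path"])[1].lower() or "(none)" for r in records
    )
```

A call such as `count_by_extension(inventory("working_copy"))`, where `working_copy` is a hypothetical folder name, gives a quick tally. File extensions are only a rough guide to format, which is why signature-based tools such as DROID and JHOVE appear in the table.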

This list may seem dishearteningly comprehensive, but I acquired these skills over years of experience working as a digital archivist on a number of challenging projects. I didn’t start off knowing everything on this list. I learned by going to conferences and workshops, taking the Natural Language Processing MOOC classes and studying on my own with resources available online. A digital archivist starting out in this field does not need to have all these skills right off the bat, but does need to be open to, and capable of, continually learning and applying new knowledge.

Of course, digital archivists in different institutions will have different responsibilities according to their particular situations. I hope this article will generate discussion of the work expected from digital archivists and the knowledge required for them to succeed. Finally, I would like to thank Glynn Edwards, my supervisor, who supports my exploratory investigation into areas which some organizations may consider irrelevant to the job of a digital archivist. As a reminder, my opinions aren’t necessarily those of my employer or any other organization.
