What Does it Take to Be a Well-rounded Digital Archivist?

The following is a guest post from Peter Chan, a Digital Archivist at the Stanford University Libraries.

Peter Chan

I am a digital archivist at Stanford University. A couple of years ago, Stanford was involved in the AIMS project, which jump-started Stanford’s thinking about the role of a “digital archivist.” The project ended in 2011 and I am the only digital archivist hired as part of the project who is still on the job on a full-time basis. I recently had discussions with my supervisors about the roles and responsibilities of a digital archivist. This inspired me to look at job postings for “digital archivists” and at the skills and qualifications organizations are currently looking for.

I looked at eight job advertisements for digital archivists that were published in the past 12 months. The responsibilities and qualifications required of digital archivists were very diverse in these organizations. However, all of them required formal training in archival theory and practice. Some institutions placed more emphasis on computer skills and preferred applicants with programming skills such as Perl, XSLT, Ruby and HTML, and experience working with SQL databases and repositories such as DSpace and Fedora. Others required knowledge of a variety of metadata standards. A few even desired knowledge of computer forensic tools such as FTK Imager, AccessData Forensic Toolkit and write blockers. Most of these tools are at least somewhat familiar to digital archivists/librarians.

Screenshot from the ePADD project.

In my career, however, I have also found other skills useful to the job. While working on two projects (ePADD and GAMECIP), I found that knowledge of Natural Language Processing and Linked Open Data/Semantic Web/Ontology was extremely useful. Because of those needs, I became familiar with the Stanford Named Entity Recognizer (NER) and the Apache OpenNLP library, which I used to extract personal names, organizational names and locations from email archives in the ePADD project. Additionally, familiarity with SKOS, Open Metadata Registry and Protégé helped me publish controlled vocabularies as linked open data and model the relationships among concepts in video game consoles in the GAMECIP project.

The table below summarizes the tasks I encountered during the past six years working in the field as well as the skills and tools useful to address each task.

Tasks which may fall under the responsibilities of Digital Archivists | Knowledge / Skills / Software / Tools needed to work on the Tasks

Collection Development (Interact with donors, creators, dealers, curators; hereafter “creators.”)
Gain overall knowledge (computing habits of creators, varieties of digital material, hardware/software used, etc.) of the digital component of a collection. | In-depth knowledge of computing habits, varieties of digital material, hardware/software for all formats (PC, Mac, devices, cloud, etc.). Tool: AIMS Born-Digital Material Survey.
Explain to creators the topic of digital preservation, including differences between “bit preservation” and “preserving the abstract content encoded into bits”; migration / emulation / virtualization; “Trusted Repository”; levels of preservation when necessary. | In-depth knowledge of digital preservation. Background: “Ensuring the Longevity of Digital Information” by Jeff Rothenberg, January 1995 edition of Scientific American (Vol. 272, Number 1, pp. 42-7) (PDF); Reference Model for an Open Archival Information System (OAIS) (PDF); Preserving.exe: Toward a National Strategy for Software Preservation (PDF); Library of Congress Recommended Format Specifications; NDSA Levels of Preservation; Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC) (PDF).
Explain to creators how forensic software is used to accession, process and deliver born-digital collections when necessary, especially regarding sensitive/restricted materials. | Special knowledge of making use of forensic software in an archival context. Tools: AccessData FTK, EnCase Forensic, etc.
Explain to creators the use of natural language processing/data mining/visualization tools to process and deliver born-digital collections when necessary. | General knowledge of tools used in processing and delivering born-digital archives such as entity extraction, networking and visualization software. Tools: Stanford Named Entity Recognizer (NER), Apache OpenNLP, Gephi, D3.js, HTML 5 PivotViewer, etc.
Explain to creators about publishing born-digital collection metadata and/or contents in semantic web/linked open data vs. Encoded Archival Description finding aids/other HTML-based web publishing methods when necessary. | Knowledge of linked data/semantic web/EAD finding aids/HTML-based web publishing methods.
Explain web archiving to creators. | General knowledge of web archiving, cataloging, delivery and preservation of web sites. Knowledge of web archiving software such as Heritrix and HTTrack. Knowledge of the Wayback Machine from the Internet Archive.
Explain to creators about the archives profession in general. | Knowledge of establishing and maintaining control, arranging and describing born-digital archival materials in accordance with accepted standards and practices to ensure the long-term preservation of collections.

Accessioning
Copy files contained on storage media including obsolete formats such as 5.25 inch floppy disks, computer punch cards, etc. | Knowledge of onboard 5.25 inch floppy disk controllers and hardware interfaces and tools, including IDE, SCSI, FireWire, SATA, FC5025, KryoFlux, Catweasel, Zip drives, computer tapes, etc. Knowledge of file systems such as FAT, NTFS, HFS, etc.
Ensure source data in storage media will not be erased/changed accidentally during accessioning while maintaining a proper audit trail in copying files from storage media. | Knowledge of the write-protect notch/slide switch in floppy disks and hardware write blockers. Knowledge of forensic software (e.g., FTK Imager for PC and Command Line FTK Imager for Mac).
Get file count, file size and file category of collections. | Knowledge of forensic software (e.g. AccessData FTK, EnCase Forensic, BitCurator, etc.), JHOVE, DROID, PRONOM, etc.
Ensure computer viruses, if they exist in collection materials, are under control during accessioning. | Knowledge of the unique nature of archival materials (no replacement, etc.), behavior of viruses stored in file containers and special procedures in using antivirus software for archival materials.
Accession email archives. | Knowledge of Internet protocols (POP, IMAP) and email formats (Outlook, mbox). Knowledge of commercial software packages to archive and reformat email (Emailchemy, MailStore). Knowledge of open source software such as ePADD (Email: Process, Accession, Discover and Deliver) to archive emails.
Archive web sites. | Knowledge of web archiving software such as Heritrix and HTTrack. Knowledge of legal issues in archiving web sites. Knowledge of web archiving services such as Archive-It.
Create accession records for born-digital archives. | Knowledge of archival data management systems such as Archivists’ Toolkit (AT) with the Multiple Extent Plugin, etc.

Arrangement and Description / Processing
Screen out restricted, personal, classified and proprietary information such as social security numbers, credit card numbers, classified data, medical records, etc. in archives. | Knowledge of the sensitivity of personally identifiable information (PII) and tools to locate PII (e.g. AccessData FTK, Identity Finder). Knowledge of legal restrictions on access to data (DMCA, FERPA, etc.).
Classify text elements in born-digital materials into predefined categories such as the names of persons, organizations and locations when appropriate. | Knowledge of entity extraction software and tools to perform entity extraction (such as Open Calais, Stanford Named Entity Recognizer, Apache OpenNLP).
Show the network relationship of people in collections when appropriate. | Knowledge of network graphs and tools such as Gephi and NodeXL.
Create controlled vocabularies to facilitate arrangement and description when appropriate. | Knowledge of the concepts of controlled vocabularies. Knowledge of the W3C standard for publishing controlled vocabularies (SKOS). Knowledge of software for creating controlled vocabularies in SKOS such as SKOSjs and SKOS Editor. Knowledge of platforms for hosting SKOS controlled vocabularies such as Linked Media Framework and Apache Marmotta. Knowledge of services for publishing SKOS such as Open Metadata Registry and Poolparty, Inc.
Model data in archives in RDF (Resource Description Framework). | Knowledge of semantic web/linked data. Knowledge of commonly used vocabularies/schemas such as DC, Schema.org and FOAF, etc. Knowledge of vocabulary repositories such as Linked Open Vocabularies (LOV). Knowledge of tools to generate RDF/XML and RDF/JSON such as LODRefine and Karma, etc.
Model concepts and relationships between them in archives (e.g. video game consoles) using ontology when appropriate. | Knowledge of the W3C standard OWL (Web Ontology Language) and software to create ontologies using OWL such as Protégé and WebProtégé.
Describe files with special formats (e.g. born-digital photographic images). | Knowledge of image metadata schema standards (IPTC, EXIF) and software to create/modify such metadata (Adobe Bridge, Photo Mechanic, etc.).
Describe image files by names of persons in images with the help of software when appropriate. | Knowledge of facial recognition functions in software such as Picasa and Photoshop Elements.
Use visualization tools to represent data in archives when appropriate. | Knowledge of open source JavaScript libraries for manipulating documents such as D3.js and HTML 5 PivotViewer, and commercial tools such as IBM ManyEyes and Cooliris.
Assign metadata to archived web sites. | Knowledge of cataloging options available in web archiving services such as Archive-It or in web archiving software such as HTTrack.
Create EAD finding aids. | Knowledge of accepted standards and practices in creating finding aids. Knowledge of XML editors or other software (such as Archivists’ Toolkit) to create EAD finding aids.

Discovery and Access
Deliver born-digital archives. | Knowledge of copyright laws and privacy issues.
Deliver born-digital archives on reading room computers. | Knowledge of security measures required for workstations in reading rooms, such as disabling Internet access and USB ports, to prevent unintentional transfer of collection materials. Knowledge of software to deliver images in collections such as Adobe Bridge. Knowledge of software to read files with obsolete file formats such as QuickView Plus.
Deliver born-digital archives using institutions’ catalog system. | Knowledge of the interface required by the institutions’ catalog system to make the delivery.
Deliver born-digital archives using institutional repository systems. | Knowledge of DSpace, Fedora, Hydra and the interfaces developed to facilitate such delivery.
Publish born-digital archives using linked data/semantic web. | Knowledge of linked data publishing platforms such as Linked Media Framework, Apache Marmotta and OntoWiki, and linked data publishing services such as Open Metadata Registry.
Deliver born-digital archives using exhibition software. | Knowledge of open source exhibition software such as Omeka.
Deliver archived web sites. | Knowledge of delivery options available in web archiving services such as Archive-It or in web archiving software such as HTTrack.
Deliver email archives. | Knowledge of commercial software such as MailStore. Knowledge of open source software such as ePADD (Email: Process, Accession, Discover and Deliver).
Deliver software collections using emulation/virtualization. | Knowledge of emulation/virtualization tools such as KEEP, JSMESS, MESS, VMNetX and XenServer.
Deliver finding aids of born-digital archives using union catalogs such as OAC. | Knowledge of uploading procedures to respective union catalogs such as OAC.

Preservation
Prepare the technical metadata (checksum, creation, modification and last access dates, file format, file size, etc.) of files in archives for transfer to preservation repository. | Knowledge of forensic software such as AccessData FTK, EnCase Forensic and BitCurator, etc. Programming skill in XSLT to extract the information when appropriate from reports generated by the software.
Use emulation / virtualization strategy to preserve software collections. | Knowledge of emulation/virtualization tools such as KEEP, JSMESS, MESS, VMNetX and XenServer.
Use migration strategies to preserve digital objects. | Knowledge of Library of Congress Recommended Format Specifications. Knowledge of migration tools such as Xena and Adobe Acrobat Professional, and of the CURL Exemplars in Digital Archives (Cedars) and the Creative Archiving at Michigan and Leeds: Emulating the Old on the New (CAMiLEON) projects.
Submit items to preservation repository. | Knowledge of preservation systems such as Archivematica and LOCKSS, and preservation services such as Portico, Tessella and DuraSpace. Knowledge of preservation repository interfaces. Advanced knowledge of Excel for batch input to the repository when appropriate.
Preserve archived web sites. | Knowledge of preservation options available in web archiving services such as Archive-It. Knowledge of preserving web sites in preservation repository.
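
To make one row of the table concrete, the Preservation task of preparing technical metadata (checksum, dates, file size) can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the output of AccessData FTK, EnCase or BitCurator, and not a description of Stanford's workflow. The function names (`file_checksum`, `technical_metadata`, `survey`), the record layout and the choice of SHA-256 are all assumptions made for the example. Creation dates are left out because `os.stat` does not report them consistently across operating systems.

```python
import hashlib
import os
from datetime import datetime, timezone


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def technical_metadata(path):
    """Collect a small, hypothetical technical-metadata record for one file."""
    info = os.stat(path)

    def as_iso(timestamp):
        # Express file system timestamps as ISO 8601 strings in UTC.
        return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()

    return {
        "path": path,
        "size_bytes": info.st_size,
        "modified": as_iso(info.st_mtime),
        "last_accessed": as_iso(info.st_atime),
        "sha256": file_checksum(path),
    }


def survey(root):
    """Walk a directory tree and return one record per file found."""
    records = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            records.append(technical_metadata(os.path.join(folder, name)))
    return records
```

Calling `survey` on a working copy of an accession (never on the original media, which should sit behind a write blocker) returns a list of dictionaries that could then be written to CSV or XML for batch input to a repository. Note that reading a file to compute its checksum may itself update the last access date on some file systems, which is one reason forensic tools work from disk images.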

This list may seem dishearteningly comprehensive, but I attained these skills over years of experience working as a digital archivist on a number of challenging projects. I didn’t start off knowing everything on this list. I learned these skills and this knowledge by going to conferences and workshops, attending Natural Language Processing MOOC classes and studying on my own with resources available online. A digital archivist starting out in this field does not need to have all these skills right off the bat, but does need to be open to learning and able to consistently pick up and apply new knowledge.

Of course, digital archivists in different institutions will have different responsibilities according to their particular situations. I hope this article will generate discussion of the work expected from digital archivists and the knowledge required for them to succeed. Finally, I would like to thank Glynn Edwards, my supervisor, who supports my exploratory investigation into areas which some organizations may consider irrelevant to the job of a digital archivist. As a reminder, my opinions aren’t necessarily those of my employer or any other organization.

9 Comments

  1. Analu lopez
    October 7, 2014 at 4:01 pm

    Hello Mr. Chan,

    Thank you for your very enlightening and positive post. I come from a relatively diverse background in regards to my education and interests. I went to school for photography (BA in Photography) but I have always been interested in collections and archives.

    As an undergraduate I pursued jobs outside of school within libraries and museums. My first encounter with the world of collections and archives was in 2005 when I got the opportunity to help photograph and digitize the Edward E. Ayer Collection at The Newberry Library of Chicago. It was very very very much fun and I was hooked. From there I pursued jobs in other institutions doing similar work. Fast forward to the present day: I am working on a grant to digitize the permanent collection at the museum I currently work for in Chicago, and I just started going back to school for my Master in Library and Information Science with a focus on Digital Archiving and digital libraries.

    It is a new field but such a great and lush one with so much to learn.

    Thank you for sharing all this information. It encourages me quite a lot and I am glad to see light shed on this relatively new type of position.

    Many thanks again!

    best,
    Analú

  2. Peter Chan
    October 7, 2014 at 5:37 pm

    Hi Analu,

    I am glad you find the article useful.

    Peter

  3. Lori Richards
    October 7, 2014 at 7:47 pm

    Thank you so much for this great article! I am sharing it with all my Digital Preservation students!

  4. Joseph Koivisto
    October 8, 2014 at 11:01 am

    Competencies associated with digital media, platforms, librarianship, and curation are quickly becoming the new core competencies that will be needed to remain competitive. Thank you for an excellent breakdown of where the profession is going and what skills are needed to get there.

  5. Merv Richter
    October 8, 2014 at 1:21 pm

    We have been implementing software for archivists, librarians, and records managers for many years. So often all they really require is the ability to link digital content to their metadata records. They can continue using the system they are familiar with. Their users will have only one place to look for physical as well as digital material.

  6. Paul Weatherall
    October 9, 2014 at 9:35 am

    Merv, that may be true for digital files created from existing archival materials, but Peter’s list is a very comprehensive reflection of the skills required for archivists increasingly having to get to grips with born-digital archives, which may be ingested in a wide variety of formats (some already long superseded) and which may contain deleted files (have they been acknowledged as deposited?). Peter mentions some issues concerning data protection and privacy, but another major issue missing from his list is copyright, some aspects of which are peculiar to the digital world and may be even more contentiously policed than music and literary copyright.

  7. Frederic J. Grevin
    October 14, 2014 at 3:31 pm

    Several comments:

    1. I would most definitely NOT recommend Rothenberg’s article—by itself—for background. I always recommend “Preserving Digital Information – Report of the Task Force on Archiving of Digital Information”, commissioned by The Commission on Preservation and Access and The Research Libraries Group (1996). This report is far more comprehensive, less tendentious and, despite its age, quite relevant today.

    2. Your comment “I didn’t start off knowing everything on this list”, which you posted at the bottom of the blog, would have been more à propos at the beginning of the list which you yourself describe as “dishearteningly comprehensive.”

    3. Under Accessioning, you write “Ensure computer viruses, if they exist in collection materials, are under control during accessioning.” I wonder what you mean by “under control”? I know there has been some discussion about preserving malware in general (not just viruses), but I would be very leery of doing so in contaminated files.

    4. Finally, you seem to have left out the extremely important issue of preserving functional databases (see SIARD, the database preservation software developed by the Swiss National Archives).

    Best regards,

    Fred

  8. Caroline Campbell
    October 17, 2014 at 12:40 pm

    Peter’s list is certainly intimidating for a newb like me. My only experience in archives to date is an introductory archives course & digitizing a book as part of my undergrad studies in library service. I understand I’d need years of on-the-job experience to master the skill set presented here.

    I’m debating whether to pursue a master’s in digital archives and/or records management. Can anyone recommend a graduate program to help me gain a solid foundation in digital archives? There seem to be several programs available and I’d appreciate input on any of them.

  9. julinta
    May 12, 2016 at 9:29 am

    Dear Mr. Chan,

    I would like to pursue a career as a digital archivist. Right now, my knowledge is limited to working as a traditional archivist. I would like to know where to get basic training or courses for digital archivists. From my search, the courses or training are mostly offered online. Is there any educational institution that holds this specific training or these courses?
