IEEE Big Data Conference 2016: Computational Archival Science

This is a guest post by Meredith Claire Broadway, a consultant for the World Bank.

Jason Baron, Drinker Biddle & Reath LLP, “Opening Up Dark Digital Archives Through The Use of Analytics To Identify Sensitive Content,” 2016. Photo by Meredith Claire Broadway.

Computational Archival Science can be regarded as the intersection between the archival profession and “hard” technical fields, such as computer science and engineering. CAS applies computational methods and resources to large-scale records and archives processing, analysis, storage, long-term preservation and access. In short: big data is a big deal for archivists, particularly because old-school pen-and-paper methodologies don’t apply to digital records. To keep up with big data, the archival profession is called upon to open itself up to new ideas and collaborate with technological professionals.

Naturally, collaboration was the theme of the IEEE Big Data Conference ’16: Computational Archival Science workshop. Many speakers presented projects that drew on the spirit of collaboration by applying computational methods (such as machine learning, visualization and natural language processing) to archival problems. Subjects ranged from improving optical-character-recognition efforts with topic modeling to using vector-space models so that archives can better anonymize PII and other sensitive content.

For example, “Content-based Comparison for Collections Identification” was presented by a team led by Maria Esteva of the Texas Advanced Computing Center. Maria and her team created an automated method of discovering the identity of datasets that appear to be similar or identical but may be housed in two different repositories or parts of different collections. This service is important to archivists because datasets often exist in multiple formats and versions and in different stages of completion. Traditionally, archives determine issues such as these through manual effort and metadata entry. A shift to automation of content-based comparison allows archivists to identify changes, connections and differences between digital records with greater accuracy and efficiency.

The team’s algorithm operates in a straightforward manner. First, two collections are analyzed to determine the types of records they contain, and a list is generated for each collection. Next, the records analysis creates a list of pairs, one record from each collection, for comparison. Finally, a summary report is created to show differences between the collections.
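To make those three steps concrete, here is a rough sketch in Python. It is only an illustration under my own assumptions, not the team’s actual implementation: it treats a collection as a dictionary of file names and raw bytes, guesses a record’s type from its file extension, and counts two records as identical only when their SHA-256 digests match. Every function name here (inventory, candidate_pairs, compare) is hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import product
from pathlib import PurePosixPath


def inventory(collection):
    """Step 1: list the records in a collection, grouped by type.

    Assumption: the lowercase file extension stands in for record type.
    """
    by_type = defaultdict(list)
    for name in sorted(collection):
        by_type[PurePosixPath(name).suffix.lower()].append(name)
    return dict(by_type)


def candidate_pairs(inv_a, inv_b):
    """Step 2: pair each record in A with every same-type record in B."""
    pairs = []
    for record_type in sorted(set(inv_a) & set(inv_b)):
        pairs.extend(product(inv_a[record_type], inv_b[record_type]))
    return pairs


def digest(content):
    """Fingerprint a record by its content, ignoring its name and location."""
    return hashlib.sha256(content).hexdigest()


def compare(coll_a, coll_b):
    """Step 3: build a summary report of matches and differences."""
    pairs = candidate_pairs(inventory(coll_a), inventory(coll_b))
    identical = [
        (a, b) for a, b in pairs if digest(coll_a[a]) == digest(coll_b[b])
    ]
    matched_a = {a for a, _ in identical}
    matched_b = {b for _, b in identical}
    return {
        "identical": identical,
        "only_in_a": sorted(set(coll_a) - matched_a),
        "only_in_b": sorted(set(coll_b) - matched_b),
    }


# Two made-up repositories holding the same dataset under different names.
repo_a = {
    "survey/data.csv": b"id,value\n1,10\n",
    "survey/readme.txt": b"version 1",
}
repo_b = {
    "archive/results.csv": b"id,value\n1,10\n",
    "archive/readme.txt": b"version 2",
    "archive/plot.png": b"\x89PNG",
}

report = compare(repo_a, repo_b)
for section, entries in report.items():
    print(section, entries)
```

In this toy example the two CSV files are reported as identical even though their names and folders differ, which is the point of comparing content rather than labels. An exact hash match is the crudest possible test; it would miss the same dataset saved in a different format or at a different stage of completion, which is where a richer comparison is needed.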

Weijia Xu, Ruizhu Huang, Maria Esteva, Jawon Song, Ramona Walls, “Content-based Comparison for Collections Identification,” 2015.

To briefly summarize Maria’s findings, metadata alone isn’t enough for the content-based comparison algorithm to determine whether a dataset is unique. The algorithm needs more information from datasets to make improved comparisons.

Automated content-based comparison is the future of digital archives. Naturally, it raises questions, among them, “What is the best way for archivists to meet automated methods?” and “How can current archival workflows be aligned with computational efforts?”

The IEEE Computational Archival Science session ended on a contemplative note. Keynote speaker Mark Conrad, of the National Archives and Records Administration, asked the gathering what skills they thought the new generation of computational archival scientists should be taught. Topping the list were answers such as “coding,” “text mining” and “history of archival practice.”

What interested me most was the ensuing conversation about how CAS deserves its own academic track. The assembly agreed that CAS differs enough from the traditional Library and Information Science and Archival tracks, in both the United States and Canada, that it qualifies as a new area of study.

CAS differs from the LIS and Archival fields in large part due to its technology-centric nature. “Hard” technical skills take more than two years (the usual time it takes to complete an LIS master’s program) to develop, a fact I can personally attest to as a former LIS student and R beginner. It makes sense, then, that for CAS students to receive a robust education they should have a unique curriculum.

If CAS, LIS and the Archival Science fields merge, there’s an assumption that they will run the risk of taking an “inch-deep, mile-wide” approach to studies. Our assembly agreed that, in this case, “less is more” if it allows students to cultivate fully developed skills.

Of course these were just the opinions of those present at the IEEE workshop. As the session emphasized, CAS encourages collaboration, discussion and differing opinions. If you have something to add to any of my points, please leave a comment.

The University of Richmond’s Digital Scholarship Lab

In November 2016, staff from the Library of Congress’s National Digital Initiatives division visited the University of Richmond’s Digital Scholarship Lab as part of NDI’s efforts to explore data librarianship, computational research and digital scholarship at other libraries and cultural institutions. Like many university digital labs, the DSL is based in the library, which DSL […]