Introducing the Computing Cultural Heritage in the Cloud Project

LC Labs has an exciting new project, with support from the Andrew W. Mellon Foundation! Learn about the grant on the experiments page, and see the press release here. This is a guest post from LC Labs Senior Innovation Specialist Laurie Allen. 

Apply in late 2019 (very soon):

    – Two Digital Scholarship Librarians, each for a two-year position working with the Labs team on the CCHC project.

Apply in early 2020:

    – Four research experts (or teams) who will explore the Library’s digital collections through direct access to large quantities of data and funds to compute against that data.

Introduction

Hello. I’m Laurie Allen, the newest member of the Labs team here at the Library of Congress. In addition to other duties, one of my first big tasks as I join the Library is to steward the Computing Cultural Heritage in the Cloud project. This post is the first in a series of updates about the project. It is designed to give some background and overview. Look for future posts about the collections, services, and approaches we’ll be working with. Here goes:

We are delighted to share that, with support from the Andrew W. Mellon Foundation, we will pilot ways to combine cutting-edge technology with the collections of the largest library in the world to support creative new uses of those collections. We’ll explore service models to support researchers accessing Library of Congress collections in the cloud, and share what we learn along the way.

History and Context

This project follows in the long history of digitizing collections and collecting born-digital materials at the Library of Congress. It is also connected to the work of other National Libraries whose large scale data projects have inspired and informed this group, and to the work of other library, archives, and museum practitioners who are exploring collections as data. At its most basic, however, this project is an extension of the work that libraries have always done: helping people learn from cultural heritage materials, and helping communities connect with and share knowledge.

The proposal was drafted by the collective brilliance of Kate Zwaard, Jaime Mears, Abbey Potter, and Meghan Ferriter, and is deeply connected to the Library’s Digital Strategy, which calls on us to “Throw Open the Treasure Chest,” “Connect,” and “Invest in the Future.” The proposal also grew, in part, out of a report written in 2018 (currently unpublished) by a group of Library of Congress staff who formed a Working Group to examine the kinds of projects that researchers want to do with our collections, the readiness of our collections for that work, and the readiness of staff to provide these new services.

Libraries create workflows and systems to enable people to identify and access materials, so that they can engage with and analyze those materials, and eventually build or report on what they’ve learned, often in the form of a text that the library will collect. An abstracted version of the process is pictured below.

Libraries have generally invested heavily in building systems and expertise to help people identify and access materials, though of course the lamps, chairs, and tables in our reading rooms are good evidence that we also support users in engaging with our materials through analysis and reporting. The stages I have laid out are an oversimplification, and researchers regularly move back and forth through them. At each step, insights and biases are introduced or uncovered, and context is gained and lost.

A diagram depicting the flow from identify to access to analyze to report, and then back to identify. The first two are shaded to indicate that the library invests heavily in systems to support these parts of the process.

Diagram of a simplified researcher process with library materials.

The model above is stretched a bit when the analyses that people are doing change from those best supported by a table, chairs, and some lamps to those that require computing power, access to large datasets, and software. As computational methods and approaches have continued to gain steam over the past few decades, and tools have emerged to support those methods, the Library of Congress has been making some collections data available for download in relatively small datasets that can be managed on a personal computer or with simple visualization and analysis tools. Generally, these newer methods require the data user to do some work wrangling the downloaded data into a form that works for their analysis.

Some very cool things have been done with these data, though they have also raised important questions about what kinds of infrastructure and services libraries might provide to help people use these materials.

In addition to downloadable datasets, the Library of Congress has been providing tools to help people get manageable slices of the data through APIs, such as the popular Chronicling America API, that allow users to subset data before downloading. These APIs also call for new forms of reference and research support from library staff, and they allow only some kinds of access to materials, based on the ways those materials have been catalogued and organized.
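The two orderings described above, download then subset versus subset then download, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, field names, and the `bulk_download` and `api_query` functions are hypothetical stand-ins, not real Library of Congress data or the Chronicling America API.

```python
from collections import Counter

# Hypothetical stand-in for a small collection; not real Library data.
COLLECTION = [
    {"title": "Daily Gazette", "year": 1905, "state": "Ohio", "text": "Rail strike ends"},
    {"title": "Evening Star", "year": 1912, "state": "Ohio", "text": "Flood relief fund grows"},
    {"title": "Morning Call", "year": 1912, "state": "Utah", "text": "Mine strike begins"},
    {"title": "Weekly Post", "year": 1920, "state": "Ohio", "text": "Strike vote delayed"},
]


def bulk_download():
    """Bulk model (hypothetical): the researcher receives every record."""
    return list(COLLECTION)


def api_query(state):
    """API model (hypothetical): subset on a catalogued field before transfer."""
    return [record for record in COLLECTION if record["state"] == state]


def reformat(records):
    """Reformat: reduce each record to (year, lowercase tokens)."""
    return [(record["year"], record["text"].lower().split()) for record in records]


def analyze(rows, term):
    """Analyze: count occurrences of a term per year."""
    counts = Counter()
    for year, tokens in rows:
        counts[year] += tokens.count(term)
    return dict(counts)


# Bulk download: transfer everything, then subset locally.
downloaded = bulk_download()
local_subset = [record for record in downloaded if record["state"] == "Ohio"]
bulk_result = analyze(reformat(local_subset), "strike")

# API: subset remotely, then transfer only the slice.
api_slice = api_query("Ohio")
api_result = analyze(reformat(api_slice), "strike")

print(len(downloaded), len(api_slice), bulk_result == api_result)
```

Both routes reach the same analysis, but the API route transfers fewer records. Note that the hypothetical `api_query` can only slice on a field that was catalogued (`state` here); a researcher who wanted to subset by something found only in the text would still need the full download.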

A diagram depicting a simplified research process for small downloadable datasets, moving from identify to download, then to subset/slice, reformat, analyze and then report.

Diagram of a simplified researcher process with small, downloadable library datasets.

A diagram depicting a simplified research process for datasets with an api, moving from identify to subset/slice to download, then reformat, analyze and then report.

Diagram of a simplified researcher process with an API into library datasets.

The models above, for bulk download or API access, work relatively well for smaller data. When the methods become more complex or the data and file sizes get larger, we find that the seemingly simple work of “supporting access” to our materials becomes too difficult. In some cases, we lack the tools, expertise, and staff time to create custom subsets for researchers. Sometimes the dataset is simply too large, or the points of access that a researcher wishes to use do not match the points of access laid out in the metadata.

The Working Group identified a number of cases where researchers wanted to analyze collections that they could not access through the Library’s systems. For example, a graduate student whose research question called for text mining the words on candidate websites to see how they changed over the course of a campaign could not complete the project because the Library of Congress has neither the infrastructure nor the staff programming time to subset the United States Elections Web Archive, extract the text, and re-package it for his purposes.
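To make the graduate student’s example concrete, here is a minimal sketch of the analysis step alone. It assumes the hard part described above (subsetting the archive and extracting the text) has already been done. The capture dates and text are invented for illustration, and `tokenize` and `term_counts_by_month` are hypothetical helper names, not part of any Library service.

```python
import re
from collections import Counter, defaultdict

# Hypothetical captures of one candidate site: (capture date, extracted text).
CAPTURES = [
    ("2016-02-10", "Jobs and schools. Jobs for every town."),
    ("2016-02-24", "Schools first. Better roads."),
    ("2016-10-05", "Taxes, taxes, and jobs."),
]


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts_by_month(captures):
    """Group captures by YYYY-MM and count the words seen in each month."""
    by_month = defaultdict(Counter)
    for date, text in captures:
        month = date[:7]  # "2016-02-10" -> "2016-02"
        by_month[month].update(tokenize(text))
    return by_month


counts = term_counts_by_month(CAPTURES)
for month in sorted(counts):
    print(month, counts[month]["jobs"], counts[month]["taxes"])
```

The counting itself is simple. What blocks the research is everything upstream of `CAPTURES`: selecting the right sites and dates from a very large web archive and turning archived pages into plain text.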

A diagram showing potential blockages at each stage along the way, such that the new knowledge and reporting section is fully blacked out.

The current state of affairs.

A Proposed Approach

Computing Cultural Heritage in the Cloud will allow the Library to learn about the tools, costs, and potential methods we can use to make these collections available in an environment of shared responsibility. The grant will support us in testing a model where subsetting, reformatting, and analysis can all be done with access to the collections in a controlled environment. The funds will be used to contract with four research experts (or teams) who will bring their own research projects to our collections. In carrying out their computational research in this test environment, the researchers will help us learn about new ways our collections can be used, and about the risks, costs, and potential of different methods of support. The grant will also provide funds for the Library of Congress to hire staff members to help the researchers do their work, to help prepare collections for the cloud, and to help us develop services and infrastructure models in the Library. Finally, the grant will fund a contract with someone who will document the process and bring a fresh set of eyes to the work of preparing collections for this environment, supporting its use, and understanding the kinds of research it enables.

A diagram showing the proposed grant-funded model, where accessing, subsetting, reformatting, and analysis all happen in a zone of shared responsibility between the library and the researcher.

The proposed new model.

Keep an eye on the Signal for further updates about the process and about how you can become involved. We’ll be hiring and posting a call for research projects very soon!

In the meantime, we will be hosting quarterly web calls to share updates about the grant as we go. The first one is scheduled for December 13, 1-2pm Eastern. Sign up for the web call and share your questions in advance!
