LC for Robots in Action: using the API to access the Federal Theatre Project collection

The following is a guest post by Derek Miller, Harvard University, and Elizabeth Brown, a reference librarian in the Main Reading Room at the Library of Congress. In it, they discuss how Brown helped Miller access LC for Robots resources that helped him gain enhanced access to Library of Congress digital collections used in his research.

Sometimes, the simplest-seeming aspects of digital research stop us in our tracks. In this example, we share a story of how a researcher discovered a new set of resources, but needed them in a different format than how the Library of Congress website presented them. Working with reference librarian Elizabeth Brown through the Library’s Ask A Librarian service, Prof. Derek Miller of Harvard University learned about and adapted tools made available by LC Labs on its LC for Robots resources page to solve his access problem.

The Federal Theatre Project Collection

Miller:

The Federal Theatre Project (1935–39) was a New Deal program both to support theater artists and to bring art to the masses during the Great Depression. This unprecedented government investment in the arts, though it fell short in many respects, had a lasting impact on theater throughout the country. Marvelously, it was also an unusually well-documented arts-producing program, particularly compared to today’s under-resourced non-profits or privately owned businesses. The piecemeal or proprietary archives they leave behind mean that theater historians have limited access to the processes and decisions that turn ideas into scripts, and scripts into theatrical performances. The Federal Theatre Project’s (FTP) records, by contrast, are outstandingly thorough and detailed, offering a unique view of how theater was made in the 1930s.

Originally cataloged by a team at George Mason University, the FTP collection is now held at the Library of Congress. Over the years, The Library has digitized thousands of pieces from that collection—costume sketches, production photographs, scripts, advertising plans, play synopses, and more. It’s an outstanding example of online curation and I thoroughly recommend the collection to anyone interested in federal arts support or American theater.

The Problem

Miller:

I became intrigued by the FTP collection, which has been well mined by many scholars of that program, as part of my current research project, a quantitative history of Broadway in the twentieth century. My attempts to describe Broadway by counting its elements—theaters, performers, ticket prices, awards, etc.—have forced me to think carefully about how my story about Broadway fits into the story of American theater more generally. Yes, Broadway is the largest commercial theater sector in the United States. But how does the work done there compare (in volume, scale, influence) to other theater-making around the country?

Tracking down records of theater in America brought me to the Library of Congress’s FTP collections. Among the many treasures here, two in particular caught my eye: weekly summaries of the FTP’s performances in New York City, and descriptive catalogues of the plays FTP produced or explored for production. The weekly summaries provide an overview of the huge variety of performances offered by the Federal Theatre Project in New York City: in any given week between January, 1937, and June, 1939 the FTP presented dozens of performances in a slew of genres. Some were large-scale, paid theatrical events (a stage version of Sinclair Lewis’ It Can’t Happen Here, or FTP’s Living Newspaper productions, for example). Others were free marionette shows, vaudeville bills, poetry readings, and even radio performances. The weekly summaries, designed to publicize FTP’s performances, offer a view into the rich variety of performance available in New York during the 1930s.

In one sense, the weekly summaries help shift the scale of theater considered in my study—a shift further encouraged by the descriptive catalogues. While FTP worked extensively to make and find plays for their own performances, the program also encouraged theater-making by Americans outside of their project. To support community-driven theater, FTP’s National Service Bureau “prepared lists of outstanding plays for use by producing groups of all types.” These catalogs include collections of Catholic, children’s, marionette, early American, and rural dramas, among many other thematic groupings. In general, the catalogs provide brief assessments of each listed play, along with a synopsis, a description of the set, costume, and performers required, and where to find the script. While many of the named play texts exist only in scattered collections across American libraries, the cataloged synopses themselves gather into one place an unparalleled overview of American drama in the 1930s. Furthermore, because the FTP has already summarized and collected these plays according to shared themes and/or audiences, we can see not only a list of extant titles—which the Library’s digitized copyright records provide—but also get a sense of the styles and ideas that mattered within American dramatic production outside of the most prominent urban theatrical communities.

Given how exciting these records are, I was eager to download them so that I could convert them to formats more amenable to my research, such as spreadsheets or more complex database structures that allow additional sorting and interpretation. However, I was having trouble downloading the documents because they were digitized as individual images of multi-page items. For instance, some weekly New York performance summaries contain 15 images in their sequence. Being able to download each document as a single file, such as a PDF, was the first hurdle in working more closely with the materials.

The Solution

Brown:

Miller contacted the Ask A Librarian service during the live online chat sessions offered each afternoon.  There, he inquired of a colleague, Peter Armenti, about downloading whole documents, rather than single page images, from the Federal Theatre Project collection. He cited the production notes from the play, Power, described in this essay, as the item he wanted to download in full. Knowing I had handled similar inquiries in the past, Armenti converted the chat conversation into a “ticket” in our Ask A Librarian system and assigned that ticket to me for follow-up.  As such requests are usually for single volumes, I was prepared to download the page images via file transfer protocol, then bundle the pages into a PDF or ZIP file, and send them to Miller.

Before doing that, I asked Miller by email if he had other titles in mind, since I could easily collect and bundle a few at the same time. When he replied with a spreadsheet of some forty titles, however, I realized that the bulk methods built on APIs originally developed by the Library’s programming teams and facilitated by LC Labs on its website, would be more appropriate. I sent him a message asking if he was comfortable using APIs and Jupyter Notebooks—digital tools that facilitate computer-to-computer access to online files. To my delight he replied in the affirmative. Next, I pointed him to the LC for Robots section of the LC Labs website and to one specific tool there. LC Labs is a small team at the Library of Congress whose work supports the Digital Strategy. As described on the LC for Robots page, LC Labs has developed a series of prototype examples that can be adapted for accessing data through the API. One such Jupyter Notebook, developed by former researcher in residence Laura Wrubel, walks researchers through using the API to download multiple files, such as the many individual pages of the forty volumes that Miller needed for his research.  I believed that this was just the tool Miller needed, as it would give him direct control to both automating and customizing his downloads. With it, he could save the files himself en masse without the need to click, display, and then click again to save each page, over and over, in order.

Miller:

The solution was as straightforward as one would expect from LOC’s posted examples. Each individual item can be accessed in a JSON format—a simple data structure—by adding a suffix. So the page //www.loc.gov/item/fpraf.09640014 can be viewed as JSON by going to this address: //www.loc.gov/item/fpraf.09640014?fo=json. Note that I’ve added “?fo=json” to the end of the URL. This tells the server to return data in the format (fo) JSON.

JSON information can look overwhelming and complicated; a detailed explanation can be found in the Library’s documentation of the API on Github. But basically, we’re looking at a kind of dictionary, with a set of keys and values, separated by a colon. The images I need are keyed here as “resources” and, within those resources, as “files.” Each file comes in a number of different types and sizes. For my part, I wanted the TIFF image files, which are the largest offered size of the image.

By writing some Python code much like that in LOC’s examples, I was able to get the URL for each TIFF file and download it to a folder.

The essential function looks like this:

response = requests.get(f”//www.loc.gov/item/{url}?fo=json”).json()

for file in response[‘resources’][0][‘files’]:

for image_type in file:

if image_type[‘mimetype’] == ‘image/tiff’:

image_response = requests.get(image_type[‘url’])

image_name = re.search(r'(\d*r)\.tif’,image_type[‘url’]).group(1)

with open(f'{image_name}.tiff’, ‘wb’) as fd:

fd.write(image_response.content)

break

Voilà! Thanks to the existing documentation, the essential assistance of LOC’s reference librarians, and a slightly modified approach to the sample code, I now have my own copies of FTP’s multipage documents, scanned by the Library of Congress, and ready to be transformed into structured data for my research.

One odd mystery remained. TIFF files are usually exceptionally large because they’re “lossless,” i.e., they use no data compression to store image information. Thus, usually, one wants to download JPEG files, a compressed format that is still more than good enough for data transcription. In this case, the FTP TIFF files were the same size as the JPEG files or smaller and, of course, in higher resolution. What was going on here?

Brown:

Miller asked about the actual digital files, so we had a follow-up conversation by email about the files themselves.  The files in the FTP collection present a combination of small one bit (bitonal, black/white) TIFF files created for use back in the day when download speeds relied on slow modem connections, and more-recently-scanned larger TIFF files that use eight bit (256 colors) and supply more image detail.  The JPEG files presented in this collection are derivatives of the latter.

I have worked with the Library’s online materials since their early days in the mid-1990s, and it is gratifying that the Library of Congress is now able to offer increasingly sophisticated digital tools for researchers. These include the APIs developed by the Library’s programmers a good eight years ago, and the subsequent work by the LC Labs team and others to support and extend their use. Our researchers are getting more sophisticated too! The availability of tools that open access to new research through computational means to the Library’s collections, combined with the rise of digital studies, digital humanities, and digital scholarship across the disciplines, add up to new opportunities to collaborate between Library of Congress staff and our researchers. Another lesson of this collaboration with Derek Miller is that while the term “computational access” can sound complicated it simply means that researchers can use computers for what they do best—in this case, automating a repetitive download process by copying and customizing pre-written code. Now that he has the files, we look forward to eventually learning how Miller has been able to use their content within his larger project—a history of Broadway and American theater in the Twentieth Century.

When you have questions about navigating or using the Library of Congress online resources, do reach out to our Ask A Librarian service, and we’ll do our best to assist you or find someone who can.

 

Newspaper Navigator Search Application Now Live!

On September 15, 2020, the Library of Congress announced the release of Newspaper Navigator, an experimental web application which makes 1.5 million photographs from the dataset from Chronicling America available to the public to explore for the first time. Read more about the design and features of the project below or jump straight to the newly launched application at //news-navigator.labs.loc.gov/search !

Citizen DJ at the virtual National Book Festival

This post was originally featured on the Minerva’s Kaleidoscope blog for kids and families. We’re excited and grateful to be able to re-share about this opportunity to experience Citizen DJ at the virtual National Book Festival next week!

Metaphors for Understanding Born Digital Collection Access: Part III

Kathleen O’Neill is currently serving as one of two Staff Innovators at the Library of Congress. Their 2020 project, Born Digital Access Now!, explores existing pathways for accessing born digital materials in the Manuscript Division. In this series of blog posts, Kathleen describes the complexities of gaining access to born digital materials through the lens of three different metaphors. Up first was “Media Format, or, Have Fun Storming the Castle!” The second blog post discussed “Legacy File Formats and Operating Systems or Lost in Translation.” This is the third and final post in the series and Kathleen carefully explains the process of emulation and makes it feel less like “strange magic.”

Metaphors for Understanding Born Digital Collection Access: Part I

The following is a guest post by Senior Archivist Kathleen O’Neill. Kathleen and her colleague Chad Conrady are currently working on a project called Born Digital Access Now! as the 2020 Staff Innovators in LC Labs. Their first blog post introduces the project, which aims to provide greater access to born digital materials held in the Manuscript Division, in greater detail. Today’s post is the first in a series of three blog posts in which Kathleen will discuss different challenges or barriers to born digital collection access through the lens of three different metaphors. Up first is: “Media Format, or, Have Fun Storming the Castle!”