The following is a guest post by Derek Miller, Harvard University, and Elizabeth Brown, a reference librarian in the Main Reading Room at the Library of Congress. In it, they discuss how Brown introduced Miller to LC for Robots resources that gave him enhanced access to the Library of Congress digital collections he uses in his research.
Sometimes, the simplest-seeming aspects of digital research stop us in our tracks. In this example, we share the story of a researcher who discovered a new set of resources but needed them in a format other than the one the Library of Congress website presented. Working with reference librarian Elizabeth Brown through the Library’s Ask A Librarian service, Prof. Derek Miller of Harvard University learned about and adapted tools made available by LC Labs on its LC for Robots resources page to solve his access problem.
The Federal Theatre Project Collection
Miller:
The Federal Theatre Project (1935–39) was a New Deal program both to support theater artists and to bring art to the masses during the Great Depression. This unprecedented government investment in the arts, though it fell short in many respects, had a lasting impact on theater throughout the country. Marvelously, it was also an unusually well-documented arts-producing program, particularly compared to today’s under-resourced non-profits or privately owned businesses. The piecemeal or proprietary archives those organizations leave behind mean that theater historians have limited access to the processes and decisions that turn ideas into scripts, and scripts into theatrical performances. The Federal Theatre Project’s (FTP) records, by contrast, are outstandingly thorough and detailed, offering a unique view of how theater was made in the 1930s.
Originally cataloged by a team at George Mason University, the FTP collection is now held at the Library of Congress. Over the years, the Library has digitized thousands of pieces from that collection—costume sketches, production photographs, scripts, advertising plans, play synopses, and more. It’s an outstanding example of online curation and I thoroughly recommend the collection to anyone interested in federal arts support or American theater.
The Problem
Miller:
As part of my current research project, a quantitative history of Broadway in the twentieth century, I became intrigued by the FTP collection, which has been well mined by many scholars of that program. My attempts to describe Broadway by counting its elements—theaters, performers, ticket prices, awards, etc.—have forced me to think carefully about how my story about Broadway fits into the story of American theater more generally. Yes, Broadway is the largest commercial theater sector in the United States. But how does the work done there compare (in volume, scale, influence) to other theater-making around the country?
Tracking down records of theater in America brought me to the Library of Congress’s FTP collections. Among the many treasures here, two in particular caught my eye: weekly summaries of the FTP’s performances in New York City, and descriptive catalogues of the plays FTP produced or explored for production. The weekly summaries provide an overview of the huge variety of performances offered by the Federal Theatre Project in New York City: in any given week between January 1937 and June 1939, the FTP presented dozens of performances in a slew of genres. Some were large-scale, paid theatrical events (a stage version of Sinclair Lewis’ It Can’t Happen Here, or FTP’s Living Newspaper productions, for example). Others were free marionette shows, vaudeville bills, poetry readings, and even radio performances. The weekly summaries, designed to publicize FTP’s performances, offer a view into the rich variety of performance available in New York during the 1930s.
In one sense, the weekly summaries help shift the scale of theater considered in my study—a shift further encouraged by the descriptive catalogues. While FTP worked extensively to make and find plays for their own performances, the program also encouraged theater-making by Americans outside of their project. To support community-driven theater, FTP’s National Service Bureau “prepared lists of outstanding plays for use by producing groups of all types.” These catalogues include collections of Catholic, children’s, marionette, early American, and rural dramas, among many other thematic groupings. In general, the catalogues provide brief assessments of each listed play, along with a synopsis, a description of the set, costume, and performers required, and where to find the script. While many of the named play texts exist only in scattered collections across American libraries, the cataloged synopses themselves gather into one place an unparalleled overview of American drama in the 1930s. Furthermore, because the FTP has already summarized and collected these plays according to shared themes and/or audiences, we can not only see a list of extant titles—which the Library’s digitized copyright records provide—but also get a sense of the styles and ideas that mattered within American dramatic production outside of the most prominent urban theatrical communities.
Given how exciting these records are, I was eager to download them so that I could convert them to formats more amenable to my research, such as spreadsheets or more complex database structures that allow additional sorting and interpretation. However, I was having trouble downloading the documents because they were digitized as individual images of multi-page items. For instance, some weekly New York performance summaries contain 15 images in their sequence. Being able to download each document as a single file, such as a PDF, was the first hurdle in working more closely with the materials.
The Solution
Brown:
Miller contacted the Ask A Librarian service during the live online chat sessions offered each afternoon. There, he inquired of a colleague, Peter Armenti, about downloading whole documents, rather than single page images, from the Federal Theatre Project collection. He cited the production notes for the play Power, described in this essay, as the item he wanted to download in full. Knowing I had handled similar inquiries in the past, Armenti converted the chat conversation into a “ticket” in our Ask A Librarian system and assigned that ticket to me for follow-up. As such requests are usually for single volumes, I was prepared to download the page images via file transfer protocol, then bundle the pages into a PDF or ZIP file, and send them to Miller.
Before doing that, I asked Miller by email if he had other titles in mind, since I could easily collect and bundle a few at the same time. When he replied with a spreadsheet of some forty titles, however, I realized that the bulk methods built on APIs originally developed by the Library’s programming teams and facilitated by LC Labs on its website would be more appropriate. I sent him a message asking if he was comfortable using APIs and Jupyter Notebooks—digital tools that facilitate computer-to-computer access to online files. To my delight he replied in the affirmative. Next, I pointed him to the LC for Robots section of the LC Labs website and to one specific tool there. LC Labs is a small team at the Library of Congress whose work supports the Digital Strategy. As described on the LC for Robots page, LC Labs has developed a series of prototype examples that can be adapted for accessing data through the API. One such Jupyter Notebook, developed by former researcher in residence Laura Wrubel, walks researchers through using the API to download multiple files, such as the many individual pages of the forty volumes that Miller needed for his research. I believed that this was just the tool Miller needed, as it would give him direct control over both automating and customizing his downloads. With it, he could save the files himself en masse without the need to click, display, and then click again to save each page, over and over, in order.
Miller:
The solution was as straightforward as one would expect from LOC’s posted examples. Each individual item can be accessed in a JSON format—a simple data structure—by adding a suffix. So the page https://www.loc.gov/item/fpraf.09640014 can be viewed as JSON by going to this address: https://www.loc.gov/item/fpraf.09640014?fo=json. Note that I’ve added “?fo=json” to the end of the URL. This tells the server to return data in the format (fo) JSON.
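For instance, a couple of lines of Python are enough to fetch that JSON directly (a minimal sketch, assuming the requests library is installed; the item identifier is the one from the example above):

import requests

item_id = "fpraf.09640014"
metadata = requests.get(f"https://www.loc.gov/item/{item_id}?fo=json").json()
print(sorted(metadata.keys()))  # the response includes a "resources" key, used below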
JSON information can look overwhelming and complicated; a detailed explanation can be found in the Library’s documentation of the API. But basically, we’re looking at a kind of dictionary: a set of keys and values, each key separated from its value by a colon. The images I need are keyed here as “resources” and, within those resources, as “files.” Each file comes in a number of different types and sizes. For my part, I wanted the TIFF image files, which are the largest size offered.
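A quick way to see that nesting is to drill into the item’s metadata (a sketch; aside from "mimetype" and "url", which the code below relies on, the exact fields can vary from item to item):

import requests

metadata = requests.get("https://www.loc.gov/item/fpraf.09640014?fo=json").json()
files = metadata["resources"][0]["files"]
print(len(files))  # how many pages the item has
for variant in files[0]:  # the formats offered for the first page
    print(variant["mimetype"], variant["url"])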
By writing some Python code much like that in LOC’s examples, I was able to get the URL for each TIFF file and download it to a folder.
The essential function looks like this:
import re
import requests

url = "fpraf.09640014"  # the item identifier for the document to download
response = requests.get(f"https://www.loc.gov/item/{url}?fo=json").json()
for file in response['resources'][0]['files']:  # one entry per page of the document
    for image_type in file:  # each format available for that page
        if image_type['mimetype'] == 'image/tiff':
            image_response = requests.get(image_type['url'])
            # name the file after the digits-plus-'r' portion of the image's URL
            image_name = re.search(r'(\d*r)\.tif', image_type['url']).group(1)
            with open(f'{image_name}.tiff', 'wb') as fd:
                fd.write(image_response.content)
            break  # TIFF found and saved; move on to the next page
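To repeat this for each item on a longer list, the same logic can be wrapped in a small function and driven from a spreadsheet saved as CSV. A rough sketch, where the filename ftp_items.csv and the column header item_id are placeholders rather than the actual spreadsheet’s layout:

import csv

def download_item_tiffs(item_id):
    # body: the download code shown above, with item_id replacing the
    # hard-coded identifier in the url variable
    ...

with open("ftp_items.csv", newline="") as handle:
    for row in csv.DictReader(handle):
        download_item_tiffs(row["item_id"])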
Voilà! Thanks to the existing documentation, the essential assistance of LOC’s reference librarians, and a slightly modified approach to the sample code, I now have my own copies of FTP’s multipage documents, scanned by the Library of Congress, and ready to be transformed into structured data for my research.
One odd mystery remained. TIFF files are usually exceptionally large because they’re “lossless,” i.e., they use no data compression to store image information. Thus, usually, one wants to download JPEG files, a compressed format that is still more than good enough for data transcription. In this case, the FTP TIFF files were the same size as the JPEG files or smaller and, of course, in higher resolution. What was going on here?
Brown:
Miller asked about the digital files themselves, so we had a follow-up conversation by email. The files in the FTP collection are a combination of small one-bit (bitonal, black/white) TIFF files created for use back in the day when download speeds relied on slow modem connections, and larger, more recently scanned eight-bit (256 colors) TIFF files that supply more image detail. The JPEG files presented in this collection are derivatives of the latter.
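For anyone curious which kind of TIFF a given download is, the Pillow imaging library will report the pixel mode (a quick sketch; Pillow is an assumption, and the filename is just a placeholder for one of the downloaded pages):

from PIL import Image

with Image.open("0001r.tiff") as img:
    # mode "1" is a bitonal (one-bit) scan; "L" is 8-bit grayscale; "RGB" is color
    print(img.mode, img.size)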
I have worked with the Library’s online materials since their early days in the mid-1990s, and it is gratifying that the Library of Congress is now able to offer increasingly sophisticated digital tools for researchers. These include the APIs developed by the Library’s programmers a good eight years ago, and the subsequent work by the LC Labs team and others to support and extend their use. Our researchers are getting more sophisticated too! The availability of tools that open computational access to the Library’s collections, combined with the rise of digital studies, digital humanities, and digital scholarship across the disciplines, adds up to new opportunities for collaboration between Library of Congress staff and our researchers. Another lesson of this collaboration with Derek Miller is that while the term “computational access” can sound complicated, it simply means that researchers can use computers for what they do best—in this case, automating a repetitive download process by copying and customizing pre-written code. Now that he has the files, we look forward to eventually learning how Miller has been able to use their content within his larger project—a history of Broadway and American theater in the twentieth century.
When you have questions about navigating or using the Library of Congress online resources, do reach out to our Ask A Librarian service, and we’ll do our best to assist you or find someone who can.
Comments
This was fun to read. I write as a retired LC employee who can remember the one-off special project that gave some IBM folks a chance to scan the Federal Theater Collection, associated with what was called American Memory, maybe 1991 or so (?). IBM’s image scientists were experimenting with new ideas and formats (and color-gamut representation) and we all sought to find ways to have their output fit into our still-developing format preferences and systems. This blog does a nice job of reminding us how digital systems and formats remain open and malleable, with each generation of tinkering better able to serve the needs of researchers. Bravo for the continuing exploration!