At the Library of Congress, digital librarians, software developers, and outreach specialists frequently work together. This is especially true when it comes to the application programming interface (API) that delivers content to the loc.gov website. The loc.gov JSON API is the connective tissue between digitized and digital content and its metadata, an extract-transform-load (ETL) process, and the end user interface. The API was designed by people at the Library who created the look and functionality of the loc.gov website as a way to grab and structure data from intermediary systems for the web interface.
The redesigned loc.gov website launched in 2012 and the API itself was unveiled publicly in 2017 as a resource for members of the public to access structured data about Library of Congress collections. Accompanying this announcement was an exploratory GitHub repository and site developed by software development librarian Laura Wrubel (figure 1). The Library recently published several updates to loc.gov/apis, the new home of LC API documentation (figure 2).
To celebrate this next step in the documentation’s evolution, I sat down with two people who have worked extensively with LC APIs: Laura Wrubel, now a digital library software developer at Stanford Libraries, and Patrick Rourke, a software engineer at the Library of Congress with whom I collaborated on the recent loc.gov/apis update. In our conversation, we discussed the approaches each of them has taken to creating documentation to help people use this powerful and complicated tool.
Eileen J. Manchester (EJM): Hi Laura, hi Patrick. Before we jump in, could you please introduce yourselves and tell our readers a bit more about your backgrounds?
Laura Wrubel (LW): I am currently a software developer in the Libraries at Stanford University, supporting the Stanford Digital Repository. Before I joined Stanford last year, I was at George Washington University, developing open-source applications and teaching coding. During that time, I was able to spend three months of research leave here at the Library of Congress with the LC Labs team.
Patrick Rourke (PR): I’m a developer who works on loc.gov, both on the API and on the processes we use to add data to the loc.gov search engine. I have been at the Library for almost nine years. Before that, I worked in various government contracting jobs. And before that, I worked in academic publishing.
EJM: Thanks so much. I understand that over the course of the last five years, you both have been deeply involved with working with and documenting the Library of Congress’ application programming interface(s). Can you tell me more about that?
LW: My history with the Library of Congress JSON API goes back even further than 2017. I was on research leave at the Maryland Institute for Technology in the Humanities in 2015 to pick up some programming skills and apply them in the cultural heritage space. As part of a project, I was trying to collect and aggregate historical images of catastrophe, specifically natural disasters, and present them in a web application. I used the API of the Digital Public Library of America (DPLA), relatively new then, and I was looking for another source of images.
The LOC API was not documented publicly, but a colleague showed me how, by tweaking the URL for a collection or object page, you could get all this JSON data. It was amazing. Since it was not documented, and I had some questions about the data I was seeing, I reached out to the Library’s Information Technology Design & Development (ITDD) office. The staff were very helpful, and generous with advice and knowledge about the API data.
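To make the URL tweak concrete, here is a minimal Python sketch. It assumes the `fo=json` query parameter, which is how loc.gov pages are asked for their JSON representation; the helper name `as_json_url` and the example URLs are illustrative only, not part of any Library tooling.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def as_json_url(page_url: str) -> str:
    """Return page_url with fo=json set, keeping any other query parameters."""
    parts = urlparse(page_url)
    # Drop any existing "fo" value so the result carries exactly one.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "fo"
    ]
    pairs.append(("fo", "json"))
    return urlunparse(parts._replace(query=urlencode(pairs)))


# Illustrative collection URL; any loc.gov collection or item page is handled the same way.
print(as_json_url("https://www.loc.gov/collections/baseball-cards/"))
# prints https://www.loc.gov/collections/baseball-cards/?fo=json
```

The function only rewrites the URL. Fetching it with any HTTP client would then return the JSON that Laura describes.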
Fast forward a few years to 2017/2018, and I spent another few months on research leave here with the LC Labs team contributing to the team’s efforts to make collections more computationally accessible. Drafting documentation about the JSON API was part of that. I published initial documentation on a GitHub Pages site, showing how to make API requests and understand responses. This was a prototype of “documentation”: a first step in trying to describe what would need to be there. I also worked on Jupyter notebooks with example code and demonstration projects using the API.
EJM: That’s wonderful. So you began as a user of the API before you were the “documenter.” What about you, Patrick?
PR: I first came to the Library to work on a project called Viewshare. One of the things Viewshare did was create visualizations based on various kinds of inputs, one of which was JSON documents. I was working a little with the web archiving team and the team that works on loc.gov. They showed me how to access the API representation and I was able to use that to demonstrate some things you could do with Viewshare using the API. When Viewshare came to an end, the loc.gov team welcomed me to work with them.
The API actually is fundamental to how we develop loc.gov. We use it for testing and to populate the webpages. As I began working with ETL and the loc.gov application, I had to learn how the API worked, especially its different fields, to understand the data structures in various collections, since collections tend to be very heterogeneous.
EJM: What does ETL stand for?
PR: Extract-Transform-Load. It’s a process where you read data from one data source in one or more formats; that’s the extract part. Then you transform it into the format you want to use. And then you load it into a database or search engine. The way that works with loc.gov is: we extract data from lots of different sources (METS files, MODS files, MARC files, and more). We transform what we receive into our schema; whatever fields don’t match are put into the “items section,” which gets represented as the bibliographic item on the webpage. Next, we save that information to a SOLR search engine.
The API performs lookups against the SOLR search engine, though it’s not the only source. In the process of pulling that data from SOLR, we conduct several additional computational operations before it’s displayed as the API. And ultimately the goal is to create code that can assemble the data in a way that builds a webpage.
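The three ETL steps Patrick describes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Library’s pipeline: plain dictionaries stand in for parsed METS, MODS, or MARC records, a list stands in for the search engine, and the schema field names and sample records are hypothetical.

```python
# Hypothetical target schema; these field names are illustrative only.
SCHEMA_FIELDS = {"title", "date", "subject"}


def extract(sources):
    """Extract: yield (format, record) pairs from every source."""
    for source_format, records in sources.items():
        for record in records:
            yield source_format, record


def transform(source_format, record):
    """Transform: copy schema fields to the top level, park the rest under 'item'."""
    doc = {"source_format": source_format, "item": {}}
    for field, value in record.items():
        if field in SCHEMA_FIELDS:
            doc[field] = value
        else:
            doc["item"][field] = value  # fields that do not match the schema
    return doc


def load(docs, index):
    """Load: append each document to the index (a list standing in for a search engine)."""
    for doc in docs:
        index.append(doc)
    return index


def run_etl(sources):
    """Run extract, transform, and load in order and return the filled index."""
    docs = (transform(fmt, rec) for fmt, rec in extract(sources))
    return load(docs, [])


# Made-up sample records keyed by source format.
sources = {
    "MARC": [{"title": "Flood map", "date": "1889", "call_number": "G3824"}],
    "MODS": [{"title": "Earthquake photo", "medium": "1 photographic print"}],
}
index = run_etl(sources)
print(index[0]["item"])  # prints {'call_number': 'G3824'}
```

Keeping unmatched fields in a nested section instead of discarding them mirrors the idea that nothing from the source record is lost, even when it does not fit the shared schema.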
EJM: Hearing both of your experiences working directly with the API is really helpful. It does sound like a lot of work, though, to document the Library of Congress’ API and make it intelligible for members of the public. Can you walk me through your process for tackling this endeavor?
PR: Laura’s work really was foundational! You and I used her prototype as the basis for what you see at loc.gov/apis and from there, really worked to augment and rearrange the content. That was a lot of it.
One of my primary interests in working with the API documentation is to teach people how to use the API in ways that have the least impact on them and on other users. For example, that’s why we now have a whole section on how to work around the various limits imposed on API use. We don’t have API keys for loc.gov, which means that we can’t control the number of people hitting the API at any one time, so we’ve had to add rate limiting.
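One way a script can stay within rate limits is to pause between requests and back off when the server says it is receiving too many. The Python sketch below assumes the server signals this with HTTP status 429 (“Too Many Requests”); the function name, the one-second default delay, and the retry count are illustrative choices, not published loc.gov values. The `fetch` argument is any caller-supplied function that returns a `(status_code, body)` pair.

```python
import time


def fetch_politely(urls, fetch, delay=1.0, max_retries=3, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests and backing off on 429.

    `fetch(url)` must return a (status_code, body) tuple.
    """
    results = {}
    for url in urls:
        wait = delay
        for attempt in range(max_retries + 1):
            status, body = fetch(url)
            if status != 429:
                results[url] = (status, body)
                break
            if attempt < max_retries:
                sleep(wait)  # rate limited: wait, then retry with a longer pause
                wait *= 2
        else:
            results[url] = (429, None)  # still rate limited after all retries
        sleep(delay)  # fixed pause before moving on to the next URL
    return results
```

Passing `fetch` and `sleep` in as arguments keeps the sketch independent of any particular HTTP library and makes it easy to try out without touching the network.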
LW: I did some experimenting first. For example, I tried building up a collection of thumbnail images and compiling URLs for items in a collection. I wanted to explore the data and get a sense of what might be most useful to people outside the Library–researchers, or perhaps other librarians or software developers. I tried to wear an “outsider” hat as much as possible even though I am a librarian and do software development.
In terms of structure of the documentation, I was inspired by what the folks at the Digital Public Library of America had been doing. They were trying to make their API documentation welcoming to people who were newer to working with code. They gave some contextual information about working with APIs and lots of examples. This model helped provide me a way to scope this effort. It wasn’t going to be ideal to document every possible field in the JSON responses. I had to be somewhat selective: the JSON data is quite extensive and represents an aggregation of data from different sources, created by different units and automated processes. So considering its origins, it was natural there were some inconsistencies. These do put a burden on someone trying to write code so it’s important to acknowledge.
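For anyone writing code against data like this, a little defensive handling goes a long way. The Python sketch below assumes, purely for illustration, that the same piece of information might be missing, stored as a single string, or stored as a list, and might sit under different field names in different records. The helper names and sample records are hypothetical.

```python
def as_list(value):
    """Normalize a field that may be missing, a single value, or a list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def first_text(record, *field_names, default=""):
    """Return the first non-empty string found under any of the given field names."""
    for name in field_names:
        for value in as_list(record.get(name)):
            if isinstance(value, str) and value.strip():
                return value.strip()
    return default


# Made-up records showing three shapes the same information might take.
records = [
    {"title": "Flood map", "date": "1889"},
    {"title": ["Earthquake photo"], "dates": ["1906", "1907"]},
    {"title": "Untitled"},
]
for record in records:
    print(first_text(record, "date", "dates", default="unknown"))
# prints 1889, then 1906, then unknown (one per line)
```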
EJM: That has definitely been our guiding principle as well. We’d like to alert people to the fact that things are in flux but that there are such things as the core parts of the response that are “most likely” to remain stable.
PR: That’s a really good point. Different disciplines have different cataloging practices. They’re interested in different characteristics and the way that you distinguish between two objects of study in a particular discipline may differ. Those differences are reflected in the API.
EJM: Totally. I have two follow-up questions: why is it important for cultural heritage organizations, and especially the Library of Congress, to openly document their APIs? Why is that a worthwhile undertaking?
LW: I have heard Dr. Hayden, the Librarian of Congress, talk about “throwing open the treasure chest” of the Library’s amazing collections. Digitizing them has been one way the Library is making the resources available. Making the data about these resources available is another way. It reaches a different audience and also affords new uses.
The other reason is that if you want people to use this access point, you have to provide documentation and keep it up to date. I think the Library has been taking that promise seriously in moving in the direction it’s moving with the API docs.
EJM: What is the importance of publicly documenting the API from your perspective, Patrick?
PR: Having an API available makes it much easier to get the data you really want. Because the API is the basis of the website, the data on the website is in the API. The only difference is that resources like PDFs are linked from the API instead of being directly displayed, and there are also some HTML documents that are incorporated into the website, particularly the collections descriptions and program portals, that are not obviously accessible in the API. But all the metadata is in the API.
Using a machine-readable API makes it easier for users to get and rearrange the information in ways they can use, like for aggregation. Documenting the API makes it easier for people to write the code they need to write to get the data. Our purpose is to make this information available, especially to the United States but also to the whole world. An API is an additional way of providing that data, and it can deliver it at higher rates than a UI.
It also provides an example API for people who want to learn how to work with APIs, or as a teaching resource. I think that’s something we will see people using it for — having the documentation makes that possible. And I also want the API documented because it draws users who are very experienced data users and can give us feedback that we can take back to our teams and discuss. Finally, it gives us a structure to help us adhere to certain standards for intelligibility — otherwise it would be more difficult for people to use the website and the API.
EJM: What kinds of skills or attitudes are helpful for coming to this work? It seems like multiple orientations can be helpful.
LW: For librarians, I think experimenting yourself with the data is a great place to start. People in libraries already have great metadata skills; the JSON data is the same metadata people have been producing and working with all along. It’s just in a different format. Maybe that means exploring it in a browser or another tool. It’s also fun to take an interesting research question that came in and try to play around with it.
EJM: That’s a great segue into my final question. If you had one wish for the future of loc.gov/apis, what would it be?
PR: Your question reminds me of a recent talk at the Digital Library Forum given by Christa Maher, Dave Woodward, Krisztina Thatcher and you, Eileen, about how the ETL process determines what data is represented in the API. I would love to see something like that on the documentation website, explaining in detail what winds up in the API.
LW: I think the community aspect would be great to continue building. In all my experience working with the API, I have gotten great guidance from staff and it would be helpful to learn from other API users. I would love to see showcase projects or research that you’ve seen accomplished through the use of APIs, especially when there are methods or code people can re-use.
EJM: Thank you for your time!
Comments (2)
As a frequent user of the LOC APIs I want to say THANK YOU to Laura and Patrick for developing these APIs and documenting them in an understandable manner. Keep up the good work!
Also, JSON is great, but don’t give up on XML and RDF!!
I’d heard that the loc.gov API emerged organically–it’s really interesting to hear the details. I’d love to see Laura and Patrick’s ideas come to fruition, along with versioning of the API and a change log (while we’re making wishes 🙂).