Using data from historic newspapers

This post is derived from a talk David Brunton, current Chief of Repository Development at the Library of Congress, gave to a group of librarians in 2015. 

I am going to make a single point this morning, followed by a quick live demonstration of some interfaces. I have no slides, but I will be visiting the following addresses on the web:

The current Chronicling America API: //chroniclingamerica.loc.gov/about/api/

The bulk data endpoint: //chroniclingamerica.loc.gov/batches/

The text only endpoint: //chroniclingamerica.loc.gov/ocr/

As you can probably tell already, there is a theme to my talk this morning, which is old news. I’ve participated in some projects at the Library of Congress that involve more recent publications, but this one is my favorite. I will add, at this point, that I am not offering you an official position of the Library of Congress, but rather some personal observations about the past ten years of this newspaper website.

For starters, complicated APIs are trouble.

You may be surprised to hear that from a programmer, but it’s true. They’re prone to breakage, misunderstanding, small changes breaking existing code, backward-incompatible changes (or forward-incompatible features), and they inevitably leave out something that researchers will want.

I don’t mean to suggest that nobody has gotten good use out of our complicated APIs; many people have. But over time it has been my unscientific observation that researchers are, in general, subject to at least three constraints that make simplification of APIs a priority:

  • Most researchers are gathering data from multiple sources
  • Most researchers don’t have unlimited access to a group of professional developers
  • Most researchers already possess a tool of choice for modeling or visualization

I’m not going to belabor the point, because I think anyone in this room who is a researcher will probably agree immediately. There is an even more important constraint in the case where researchers are using data as a secondary or corollary source, which is that they may not be able to pay for it, and they may (or may not) be able to agree to any given licensing terms of the data. But I digress.

Multiple sources, no professional developers, and a preferred tool for modeling and visualization.

Interestingly, there is some research about that last point that we may come back to if there is time. So on to the demonstration.

The first URL, the API. This is an extremely rich set of interfaces. As you can tell from the length of this help page (which is far from exhaustive), we have spent a lot of effort creating a set of endpoints that can provide a very rich experience. You can’t blame us, right? We’re programmers, so we made something for programmers to love!

Chronicling America API

Now, lest anyone misconstrue my description of this application programming interface, I want to stress this point: it is a truly wonderful Application Programming Interface. Unfortunately, an Application Programming Interface isn’t exactly what researchers want most of the time. This is not to say that folks haven’t written some lovely applications with this API as a backend, because they have. But any time there is lots of network latency between their servers and ours, or any time our site is (gasp!) slow, it slows down those applications.

Over time, it has been my unscientific observation that when it is an option, it’s generally better for all parties involved to simply have their own copy of the data. This lets them do at least three cool things:

  • Mix the data with data from other sources.
  • Interact with the data without intermediation of custom software.
  • Use their tools of choice for modeling the data.

Sound familiar?

I’ll continue by directing everyone’s attention to the next two endpoints, which seem to be getting an increasingly large share of our use. The first is the place where someone can simply download all our data in bulk.

Chronicling America batch data download

So, the only problem we’ve discovered with this particular endpoint is that researchers would just as soon not pore through everything, which leads me to the next one, where researchers can download the newspaper text only, but still in bulk.

It’s perfectly reasonable to go to these pages and tell some poor undergraduate to click on all the links and download the files, maybe put them on a thumb drive. But we’ve also made a very minimal layer on top of this, which makes them available as a feed. Since I’ve just finished saying how important it is to keep these things simple, I won’t belabor this addition too much, but I will point out that there is support in nearly every platform for the “feed” format.

Chronicling America text download
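
As a rough sketch of what "support in nearly every platform" can look like, the Python below reads a feed using only the standard library. It assumes the feed is in Atom format and that each entry carries a title, an updated timestamp, and a link. The sample entries, file names, and example.org URLs are made up for illustration and do not describe the live service; `fetch_feed` takes whatever feed address you actually use.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Namespace used by Atom feeds (assumption: the feed is Atom, not RSS).
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample data so the sketch runs without a network connection.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example bulk text feed</title>
  <entry>
    <title>part-000001.tar.bz2</title>
    <updated>2015-01-01T00:00:00Z</updated>
    <link href="https://example.org/ocr/part-000001.tar.bz2"/>
  </entry>
  <entry>
    <title>part-000002.tar.bz2</title>
    <updated>2015-02-01T00:00:00Z</updated>
    <link href="https://example.org/ocr/part-000002.tar.bz2"/>
  </entry>
</feed>"""


def fetch_feed(url):
    """Download the raw feed bytes from a URL supplied by the caller."""
    with urlopen(url) as response:
        return response.read()


def parse_feed(xml_data):
    """Return one dict per Atom entry with its title, timestamp, and link."""
    root = ET.fromstring(xml_data)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        entries.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
            "href": link.get("href") if link is not None else None,
        })
    return entries


def entries_since(entries, timestamp):
    """Keep entries updated after `timestamp`.

    Assumes every timestamp uses the same UTC ISO 8601 layout, so plain
    string comparison gives chronological order.
    """
    return [e for e in entries if e["updated"] > timestamp]


if __name__ == "__main__":
    for item in entries_since(parse_feed(SAMPLE_FEED), "2015-01-15T00:00:00Z"):
        print(item["updated"], item["href"])
```

The point of a helper like `entries_since` is that a researcher who already holds a copy of the data only needs to fetch what has been added since the last visit, rather than clicking through every link again.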

The last point I will make is that for a library, in particular, three questions about any copy of the data are critical: when was it made, where was it obtained, and has it maintained fixity (that is, is it still bit-for-bit what was originally obtained)?
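
One way a researcher holding a local copy might keep track of those three questions is sketched below in Python. This is an illustration, not a description of how the Library records provenance: the record layout, the field names, and the choice of SHA-256 are all assumptions, and the `created` value is simply whatever date the source reported for the file.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_record(path, source_url, created):
    """Build a provenance record for a downloaded file.

    The field names are hypothetical; they answer the three questions:
    when was it made, where was it obtained, and what was its checksum.
    """
    return {
        "file": str(path),
        "created": created,            # when it was made, as reported by the source
        "source_url": source_url,      # where it was obtained
        "retrieved": datetime.now(timezone.utc).isoformat(),
        "sha256": sha256_of(path),     # baseline for later fixity checks
    }


def fixity_ok(record):
    """Return True if the file still matches the checksum in its record."""
    return sha256_of(record["file"]) == record["sha256"]
```

Storing the record next to the file (as JSON, say) means a later run of `fixity_ok` can show whether the copy has changed since it was downloaded.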
