(The following is a guest post by Hispanic Division Junior Fellow Matthew Bova.)
My assignment as a Junior Fellow was to make a visualization showing the digitized data related to the Hispanic Division that is available via the Library’s website. During my first few days, I found the LOC API (Application Programming Interface). This API was developed by LOC Labs and allows computer programs to access data about all the items found in a search of the Library’s website. You can try this for yourself – just search for something like “dogs” on the Library’s website and add “&fo=json” to the end of the URL of the results page. The page then displays most of the data you would find on that webpage in a format called JSON, which computers can interpret far more easily than a normal webpage.
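For readers who would rather try this from a script than a browser, here is a minimal Python sketch of the same trick. It is an illustration only, not the code used for this project: the function names are made up for the example, and the `q` search parameter is an assumption based on what the search page's URL looks like.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.loc.gov/search/"


def build_search_url(query: str) -> str:
    """Return a loc.gov search URL that asks for JSON instead of a webpage."""
    # "fo=json" is the switch described above; "q" is assumed to carry the search text.
    return BASE_URL + "?" + urlencode({"q": query, "fo": "json"})


def fetch_search(query: str) -> dict:
    """Download one page of search results and parse it (needs network access)."""
    with urlopen(build_search_url(query), timeout=30) as response:
        return json.load(response)


print(build_search_url("dogs"))
```

Calling `fetch_search("dogs")` should return a Python dictionary holding the same data you would see in the browser, though the exact keys inside it are best checked against a live response rather than taken from this sketch.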
I found I was able to count and display the results of my searches, and eventually discovered a method for counting every item by format (books, audio, manuscripts, etc.). I then came up with a method of finding every digitized item with Hispanic metadata, meaning it contained a subject, location or language that could be considered “Hispanic.” (Within the Library of Congress, the Hispanic Division recommends the acquisition of materials and provides reference services related to Spain, Portugal, Latin America, the Caribbean, and US Latinx communities.) This allowed me to create a simple visualization (shown below), using open-source graphing software.
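The idea of flagging items by their metadata and then counting them by format might be sketched in Python as follows. Everything specific here is an assumption made for illustration: the short term list is hypothetical (the real list used for the project is not given in this post), the field names `subject`, `location`, `language` and `original_format` are assumed to appear on each item record, and exact matching is a simplification.

```python
from collections import Counter

# Hypothetical terms; the actual list used for the project is not given in the post.
HISPANIC_TERMS = {"spain", "mexico", "spanish", "portuguese"}
# Field names assumed to be present on each item record.
METADATA_FIELDS = ("subject", "location", "language")


def as_list(value):
    """Normalize a field that may be missing, a single string, or a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def has_hispanic_metadata(item: dict) -> bool:
    """True if any subject, location or language value exactly matches a term."""
    for field in METADATA_FIELDS:
        for value in as_list(item.get(field)):
            if str(value).lower() in HISPANIC_TERMS:
                return True
    return False


def count_by_format(items) -> Counter:
    """Tally the matching items under each of their formats."""
    counts = Counter()
    for item in items:
        if has_hispanic_metadata(item):
            for fmt in as_list(item.get("original_format")) or ["unknown"]:
                counts[fmt] += 1
    return counts


# Made-up records, only to show the shape of the input.
sample_items = [
    {"original_format": ["book"], "subject": ["poetry"], "language": ["spanish"]},
    {"original_format": ["map"], "location": ["mexico"]},
    {"original_format": ["book"], "subject": ["dogs"], "language": ["english"]},
]
print(count_by_format(sample_items))
```

A tally like this gives one number per format, which is enough to draw a bar chart but keeps nothing else about the items.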
This method has its limitations. I was only extracting counts of items rather than the records of the items themselves, which limited how much I could tell from this graph. My computer only grabbed what it needed to make this chart. For example, my dataset can’t tell you how many of these “Hispanic” books were published before 1900. To do that, you would need a dataset containing every single item in the Library, and I set out to create that dataset.
This proved to be more difficult than expected. The API isn’t designed for mass data collection, and only makes the first 100,000 results of a search available. To solve this, I chunked up the format searches into individual years, and downloaded the results as .csv files, a very simple tabular data format. Using this method, I sorted out the items with Hispanic metadata and was able to make a chart that shows the growth of the Library’s digital catalog, both in total and for items with Hispanic metadata.
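The chunking idea can be sketched in Python along these lines. Again, this is a rough illustration and not the code used for the project: the per-format URL path, the `dates` parameter and the helper names are all assumptions, and the sketch only builds the per-year URLs and writes records to CSV without downloading anything.

```python
import csv
import io
from urllib.parse import urlencode

RESULT_CAP = 100_000  # the per-search limit described above


def year_chunk_urls(format_slug: str, first_year: int, last_year: int):
    """Yield (year, url) pairs, one search per year for one format.

    The per-format path and the "dates" parameter are assumptions about
    how loc.gov URLs are laid out; check them against the live site.
    """
    for year in range(first_year, last_year + 1):
        params = urlencode({"dates": f"{year}/{year}", "fo": "json"}, safe="/")
        yield year, f"https://www.loc.gov/{format_slug}/?{params}"


def write_csv(items, fieldnames, out) -> None:
    """Write item records as CSV rows, joining list values with '; '."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for item in items:
        row = {}
        for name in fieldnames:
            value = item.get(name, "")
            if isinstance(value, (list, tuple)):
                value = "; ".join(str(v) for v in value)
            row[name] = value
        writer.writerow(row)


for year, url in year_chunk_urls("books", 1900, 1902):
    print(year, url)

# Made-up record, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_csv(
    [{"title": "A", "date": "1900", "subject": ["x", "y"]}],
    ["title", "date", "subject"],
    buffer,
)
print(buffer.getvalue())
```

Splitting by year only helps while each year stays under the limit; a year with more results than `RESULT_CAP` would need to be split further, for example by adding another filter.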
This chart allows the Hispanic Division staff to see the Hispanic resources that are accessible through a loc.gov search and to use that data to plan for the future.