Making a Newspaperbot

The following is a guest post from Library of Congress Labs Innovation Intern, Aditya Jain. While exploring the possibilities of digital collections, Aditya created @newspaperbot. Below he shares his process, some of the challenges he encountered, along with the code.

The Chronicling America API provides access to historical newspapers from the first half of the 20th century, drawn from geographically diverse sources. Such a collection presents a unique opportunity to retrospectively study the zeitgeist of a nation. Towards that end, “Newspaperbot” is a Twitterbot that tweets out historical newspapers from the Chronicling America API. Every day, in the early hours of the morning, the Twitterbot finds all the historical newspapers published on that day exactly 100 years ago. The bot then tweets the front page of each newspaper, accompanied by the title of the journal and the place of publication, along with a link to the item’s location on the Chronicling America website, where the reader can access high-resolution images of the newspaper.

Inspiration

The idea of a Twitterbot was inspired by Kate Rabinowitz’s @LoCMapBot which tweets out maps from the Library of Congress Geography & Maps digital collections.

Image of Tweet featuring Sanborn Fire Insurance map from Nebraska

LoCMapBot Tweet, created by Kate Rabinowitz and sharing Library of Congress Geography & Maps digital collections

Building on Kate’s idea, “Newspaperbot” tweets newspapers from 100 years ago each day. By doing so newspaperbot attempts to create a narrative that reflects the slow quotidian pace of history. It is hoped that such a narrative encourages the reader to reflect on American progress and history.

Data

Finding the data for such a project was tremendously easy thanks to the simple and intuitive Chronicling America API. The public endpoint that yields the front page for a given date such as the 12th of January, 1918 is the following:

//chroniclingamerica.loc.gov/frontpages/1918-1-12.json

A simple GET request to the endpoint above yields a JSON object such as the following:

[
   {
     "place_of_publication": "Hopkinsville, Ky.",
     "url": "/lccn/sn86069395/1918-01-12/ed-1/seq-1/",
     "label": "Hopkinsville Kentuckian.",
     "medium_url": "/lccn/sn86069395/1918-01-12/ed-1/seq-1/medium.jpg",
     "thumbnail_url": "/lccn/sn86069395/1918-01-12/ed-1/seq-1/thumbnail.jpg",
     "pages": 8
   }
 ]

Each newspaper in the returned JSON is represented by an object. For instance, the snippet above represents the ‘Hopkinsville Kentuckian’ that was published in ‘Hopkinsville, Ky.’

For “Newspaperbot,” four fields were of relevance: the label, representing the name of the newspaper; the place of publication; the medium URL, which returns a JPG of the front page that can be tweeted out; and the URL, which leads the reader to a high-resolution version of the newspaper in case they wish to explore it further.
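Pulling those four fields out of an API record takes only a few lines. The sketch below is illustrative, not the bot’s published code; the helper name and the dictionary keys are mine, and the base URL is prepended because the API returns site-relative paths:

```python
BASE = "https://chroniclingamerica.loc.gov"

def extract_fields(item):
    """Pull the four fields Newspaperbot needs from one front-page record."""
    return {
        "title": item["label"],                 # name of the newspaper
        "place": item["place_of_publication"],  # where it was published
        "image": BASE + item["medium_url"],     # JPG of the front page to tweet
        "link": BASE + item["url"],             # high-resolution page for the reader
    }
```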

Pre-preppin’

To make a Twitterbot you need to create a Twitter account that will tweet for you! Choose a Twitter handle to your liking and then proceed to apps.twitter.com.

Create a new app and head over to the ‘Keys and Access Tokens’ tab. This is where you will generate the four credentials our Python script requires to interface with the Twitter API:

  1. Consumer Key
  2. Consumer Secret
  3. Access Token
  4. Access Token Secret

Note that these strings are NOT to be shared with anybody. Remember to delete the strings from your code before you push it to GitHub.
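One simple way to keep the four strings out of your source entirely is to read them from environment variables at startup. This is a sketch, not part of the bot’s published code; the variable names are my own choice:

```python
import os

def load_twitter_keys():
    """Read the four Twitter credentials from the environment,
    so they never need to appear in the code you push to GitHub."""
    names = ["CONSUMER_KEY", "CONSUMER_SECRET",
             "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"]
    return {name.lower(): os.environ["TWITTER_" + name] for name in names}
```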

Screenshot of Twitter account Application settings

Establishing the Keys and Access Tokens in the Application Settings for the Twitter account

Code

The code for this project was initially written in Node.js before being ported to Python. We will only discuss the Python version here, but the reasons for porting the code to Python were two-fold:

  1. Possible future features that make use of the wealth of Natural Language Processing (NLP) libraries that exist in the Python ecosystem
  2. Multithreading to speed up the process of getting newspaper images from the Chronicling America API.

The code for the project is fairly straightforward and less than 100 lines. If you want to jump ahead and just see the code you can do so here. I’ll give a high-level walkthrough of the code down below.

import math
import threading
import time

import arrow
import requests
import schedule

def getPictures():
    # deleteExistingFiles, worker, startTweetin and base_url are
    # defined elsewhere in the script
    deleteExistingFiles()
    # Tomorrow's date, exactly 100 years ago
    today_date = arrow.now().shift(years=-100).shift(days=+1).format('YYYY-MM-DD')
    r = requests.get("%s/frontpages/%s.json" % (base_url, today_date))
    r_json = r.json()
    # Split the list of newspapers into up to 8 chunks, one per worker thread
    num_partitions = math.ceil(len(r_json) / 8)
    chunks = [r_json[x:x + num_partitions] for x in range(0, len(r_json), num_partitions)]
    threads = []
    for i, chunk in enumerate(chunks):
        t = threading.Thread(target=worker, args=(chunk, i, num_partitions, today_date))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    print('\x1b[1;37;44m' + "FINISHED DOWNLOADING IMAGES" + '\x1b[0m')
    # Tweeting begins at 4 AM the next morning
    schedule.every().day.at("04:00").do(startTweetin, r_json, len(r_json), today_date)

# Pre-fetch the next day's images at 11:15 PM local time
schedule.every().day.at("23:15").do(getPictures)

while True:
    schedule.run_pending()
    time.sleep(1)

The snippet above describes the actions that occur in the main thread. We use the schedule Python library to pre-fetch all newspaper images for the next day at 11:15 PM local time. The function that gets called then, getPictures, fetches the next day’s list of newspapers in the JSON format discussed above and launches 8 threads to divide the workload of downloading all the newspaper images. Once all the newspapers have been downloaded, the script schedules the tweeting of the pictures to commence at 4 AM the next day.

Newspaperbot tweets The Seattle Star front page on January 05, 1918.   //chroniclingamerica.loc.gov/lccn/sn87093407/1918-01-05/ed-1/seq-1/


A function is run at 4 AM and is responsible for tweeting out all the images. It loops over the JSON object, whose length represents the number of newspapers on a given day. In order not to spam readers’ timelines, we must ensure that all tweets on a given date are tweeted out in a single thread. To do so using the Twitter API, one must do two things:

  1. Maintain the ID of the tweet that our tweet needs to be threaded to. We can do so using a previous variable that keeps track of the ID of the previous tweet in our loop.
  2. The tweet in question must begin with the username of the account that posted the tweet being replied to.

Note that the first tweet in a thread will neither have a previous tweet nor begin with a username.
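The two rules above can be sketched as a small helper. This is illustrative rather than the bot’s published code: the `post` callable stands in for the actual Twitter API call and is assumed to return the new tweet’s ID and the posting account’s username:

```python
def tweet_as_thread(texts, post):
    """Tweet a list of texts as a single thread.

    `post(text, reply_to_id)` is a stand-in for the real API call;
    it must return (tweet_id, username) for the tweet it just posted.
    """
    previous_id = None
    previous_user = None
    for text in texts:
        if previous_id is not None:
            # A reply must begin with the @username of the tweet it replies to
            text = "@%s %s" % (previous_user, text)
        previous_id, previous_user = post(text, previous_id)
    return previous_id
```

Note that the first iteration passes no reply ID and no @username prefix, matching the special case of the first tweet in a thread.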

We must also catch any errors the Twitter API throws at us; not doing so will result in our program crashing.
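A minimal pattern for this is to wrap each API call so a single failure doesn’t kill the whole run. This is a sketch with names of my own choosing; `tweet_one` stands in for whatever function actually posts a tweet:

```python
def tweet_safely(tweet_one, items):
    """Call `tweet_one` for each item, skipping any that raise.

    Returns the items that failed so they can be inspected or retried.
    """
    failed = []
    for item in items:
        try:
            tweet_one(item)
        except Exception as e:
            # Log the error and move on instead of letting the bot crash
            print("skipping %r: %s" % (item, e))
            failed.append(item)
    return failed
```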

Lessons learned

  1. It is possible Twitter may prevent you from tweeting because it flags your tweets as automated behavior. There are some ways to prevent such flagging: make sure there is a reasonable time delay between two consecutive tweets, and make sure your tweets are as unique as possible. Tweets that have similar content or format may be flagged.
  2. It’s important to think about your bot’s User Experience! Be mindful of the number of times your bot tweets out in a day, and make sure your bot doesn’t spam a user’s feed. If you have a lot of content you wish to tweet out in a day, consider threading your tweets.
  3. If you do choose to tweet a large number of tweets, consider doing so at a time when everyone is likely to be asleep. “Newspaperbot” works at 4 AM for this specific reason. Even with threading, a large number of tweets strains the Twitter UX: your thread will sit at the top of your followers’ timelines for the duration of the tweeting, which can be annoying to users who are looking for new tweets from other sources.
  4. You may get 503 errors when you query the Chronicling America API. In that case you’re probably querying too often; place a couple of seconds of time.sleep between your requests.

So what’s next? I’m currently investigating the Lomax collections and the Amazing Grace collection as potential content to map and visualize. Stay tuned!
