Innovator Ben Lee and LC Labs Host “Data Jam” with 100 Million Historic Newspaper Images

A gallery of historic moustaches, a wall of 12,000 photos, a collage of First World War-era “damn the Kaiser” cartoons, and more were on display May 7, when 135 people attended a virtual “data jam” to dig into a massive new collection of historic newspaper images.  LC Labs hosted the event to showcase Library of Congress Innovator in Residence Ben Lee’s work extracting visual content from the Chronicling America digital newspaper collection.  After a behind-the-scenes look at Lee’s project, Newspaper Navigator, “jammers” had the chance to work with the images themselves for the first time.

A "gallery of moustaches" created by Newspaper Navigator data jam participant Mary Feeney

The Chronicling America Historic American Newspapers site currently holds over 16 million pages of newspapers.  Researchers, family historians, and the curious have long sought its images, including photographs, maps, cartoons, and ads.  Until recently, however, the only way to collect that visual content was to comb through page after page of text—a project that would take decades.  Last week’s event showcased a solution that uses computing power and new machine learning techniques to do that work in days.

Framing his research, Lee said last week, was this question: “Can we throw open the treasure chest of Chronicling America by training a machine learning algorithm to process all of Chronicling America?”

A collection of images created by Newspaper Navigator data jam participant Brian Foo

During his introduction to the new collection, Lee explained that his work was made possible by a paradigm shift in machine learning techniques.  Instead of training models from scratch, machine learning practitioners now pre-train their models on community datasets, such as the Common Objects in Context (COCO) dataset, and then fine-tune them on smaller datasets.  In this case, the fine-tuning data consisted of annotations from the Beyond Words crowdsourcing initiative, along with some additional annotations.  This method streamlines and speeds up the process of creating and training models while still allowing them to be tailored to a specific task, for example, teaching a model to differentiate photographs from illustrations on newspaper pages.

Lee was able to process 16,368,041 newspaper pages, or 99.998% of the Chronicling America collection, which includes nearly two centuries of historic newspapers contributed by 48 states and territories.  Using 96 CPU cores and 8 GPUs, the process took only 19 days, harnessing years of compute time as efficiently as possible.  You can learn more about Lee’s model, training set, and pipeline in “The Newspaper Navigator Dataset: Extracting and Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America” and in the project’s GitHub repo.
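To give a sense of scale, the figures above can be turned into rough throughput numbers. The short Python sketch below is only back-of-the-envelope arithmetic: it assumes the pipeline ran continuously for the full 19 days and that work was spread evenly across the 8 GPUs, neither of which is stated in the post. Because 99.998% is a rounded figure, the implied number of unprocessed pages is an estimate as well.

```python
# Figures reported in the post.
PAGES_PROCESSED = 16_368_041
SHARE_PROCESSED = 0.99998   # 99.998%, a rounded figure
RUN_DAYS = 19
GPU_COUNT = 8

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: the run was continuous and evenly balanced across GPUs.
pages_per_day = PAGES_PROCESSED / RUN_DAYS
pages_per_second = PAGES_PROCESSED / (RUN_DAYS * SECONDS_PER_DAY)
pages_per_gpu_second = pages_per_second / GPU_COUNT

# Collection size implied by the rounded percentage, and the remainder.
implied_total = PAGES_PROCESSED / SHARE_PROCESSED
implied_unprocessed = implied_total - PAGES_PROCESSED

print(f"{pages_per_day:,.0f} pages per day")
print(f"{pages_per_second:.2f} pages per second overall")
print(f"{pages_per_gpu_second:.2f} pages per second per GPU")
print(f"about {implied_unprocessed:,.0f} pages left unprocessed")
```

Under those assumptions, the run works out to roughly ten pages per second overall, with only a few hundred pages left unprocessed.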

What did he find?  An astounding 100 million photographs, illustrations, maps, comics, editorial cartoons, headlines, and advertisements.

All of those images are now available for download at news-navigator.labs.loc.gov.  There, you can also find prepackaged files sorted by year and by type (photos, comics, etc.).  The prepackaged image sets are available as zip files, with metadata in JSON and CSV (spreadsheet-friendly) formats.  Smaller “sample packs,” designed for quick download, each contain 1,000 random images of one type for a given year.
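For readers who want to work with a downloaded pack programmatically, here is a minimal Python sketch of one way to read the JSON metadata out of a zip file using only the standard library. The helper name `load_pack_metadata` and the file name in the usage note are hypothetical. The sketch also assumes that the JSON metadata sits inside the zip archive, which the post does not spell out, so check the actual download before relying on it.

```python
import json
import zipfile


def load_pack_metadata(zip_source):
    """Return the parsed contents of the first .json file in a zip archive.

    zip_source may be a filesystem path or a binary file-like object.
    Assumption: the archive contains at least one JSON metadata file.
    """
    with zipfile.ZipFile(zip_source) as archive:
        # Collect every member whose name ends in .json (case-insensitive).
        json_names = [
            name for name in archive.namelist()
            if name.lower().endswith(".json")
        ]
        if not json_names:
            raise ValueError("no JSON metadata file found in archive")
        # Parse the first JSON member without extracting it to disk.
        with archive.open(json_names[0]) as handle:
            return json.load(handle)
```

With a pack saved locally under a placeholder name such as `sample_pack.zip`, calling `load_pack_metadata("sample_pack.zip")` would return whatever structure the metadata file holds, ready to inspect or filter.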

Innovator in Residence Ben Lee's list of curated datasets of images from Chronicling America

The data jam invited people from all over the world to view and use these free images, which are all in the public domain.  Lee and the Library envision everything from collages to large-scale computing projects on the images.  “We want to see an active public conversation” about these images and what people are doing with them, said Senior Innovation Specialist Jaime Mears, who coordinates the Innovator in Residence program.

Aside from their sheer number, why are the images in Chronicling America so intriguing?  Deb Thomas, manager of the program, explained that newspapers were the main form of communication in the 19th and 20th centuries, the source that people looked to first.  Nearly every town and community in the United States began publishing its own newspaper, providing a remarkable record of the things people knew at the time: “history as it happened,” so to speak.  Chronicling America and the National Digital Newspaper Program have operated as a partnership between the Library and the National Endowment for the Humanities since 2005, building on a decades-long effort to find and preserve America’s historic newspapers on microfilm.  In 2010, the Library made Chronicling America’s data open to all.

A "D--n the Kaiser" cartoon from a collection assembled by data jam participant Jeremy Guillette using Newspaper Navigator

After Lee, Thomas, and Mears spoke about the project, data jammers were ready to dig in and see what they could find and create.  They showed off visualizations, collages, queries, and even a few challenges.  Some of their creations appear in the images here, and you can follow the #NewspaperNavigator hashtag on Twitter to see more.  A public user interface for the data will be available in late summer, so stay tuned!

Collage created by data jam participant and LC Labs Senior Innovation Specialist Meghan Ferriter using images found by Newspaper Navigator

Missed the data jam this time around?  You can download the images at news-navigator.labs.loc.gov.  And if you do create something interesting, we’d love to see it!  Email us or tweet @LC_Labs using the #NewspaperNavigator hashtag.  We can’t wait to see what you find!
