Newspaper Navigator

Benjamin Charles Germain Lee

This page contains a number of resources for Newspaper Navigator, the project that I am carrying out as an Innovator-in-Residence at the Library of Congress. The goal of Newspaper Navigator is to re-imagine searching over the visual content in Chronicling America. The project consists of two steps: All deliverables resulting from the Newspaper Navigator project (including code, data, etc.) are placed into the public domain for unrestricted re-use. An enormous thank you to LC Labs, the National Digital Newspaper Program, and IT Design & Development at the Library of Congress, as well as my Ph.D. advisor, Professor Daniel Weld!

The Newspaper Navigator Dataset

This is the landing page for the Newspaper Navigator dataset. Here, you will find details about how to query the data over HTTPS and S3. There are also hundreds of pre-packaged datasets for immediate use, with no coding necessary!

This paper details the construction of the Newspaper Navigator dataset. It covers training the visual content recognition model, statistics from running the pipeline, and visualizations of the dataset itself.

This GitHub repo contains all of the code for the Newspaper Navigator project, as well as the fine-tuned visual content recognition model weights, the dataset used for fine-tuning (annotations from the Beyond Words project augmented with additional annotations), and demos. The entire contents of the repo are placed into the public domain.


The Librarians of the Future Will Be AI Archivists via Popular Mechanics (Courtney Linder, 05/13/2020)

An Archive Unearthed via The Batch (Nick Stockton, 05/13/2020)

Millions of historic newspaper images get the machine learning treatment at the Library of Congress via TechCrunch (Devin Coldewey, 05/07/2020)

Machine Learning: The Library of Congress “Newspaper Navigator” Dataset is Now Available; Over 16 Million Pages From “Chronicling America” Processed via InfoDocket (Gary Price, 05/07/2020)

U.S. Library of Congress Processes over 16 Million Historic Newspaper Pages Using AI via NVIDIA Developer News Center (05/06/2020)

Library of Congress Innovator in Residence Ben Lee Discusses His Newspaper Navigator Project That Uses Machine Learning to Extract Visual Content From Chronicling America & Announces Upcoming “Data Jam” to Preview Dataset via InfoDocket (Gary Price, 04/21/2020)

Ph.D. student Benjamin Lee named Library of Congress Innovator in Residence via Allen School News (11/25/2019)

Hip Hop and Human-Computer Interaction Focus of 2020 Innovators in Residence via the Library of Congress (11/18/2019)


Chronicling America and Navigating Newspapers via From Our Corner: Washington Secretary of State Blog

Blog Posts from The Signal

Innovator Ben Lee and LC Labs Host “Data Jam” with 100 Million Historic Newspaper Images via The Signal (Leah Weinryb-Grohsgal, 04/21/20)

Newspaper Navigator Surfaces Treasure Trove of Historic Images – Get a Sneak Peek at Upcoming Data Jam! via The Signal (Eileen Jakeway, 04/21/20)

Introducing Ben and Brian, the Library’s new Innovators in Residence! via The Signal (Eileen Jakeway, 11/18/19)

Events & Talks

"Newspaper Navigator: Reimagining Digitized Newspapers with Machine Learning"
2019-20 Collections as Data Discussion Series
Princeton Center for Digital Humanities & Princeton University Library
View a recording of the talk here

Newspaper Navigator Data Jam
Public Event
Library of Congress
Read more about the data jam here!

Newspaper Navigator on the Web

Mega Mideast Map Collage via The Afternoon Map


Check out tweets about Newspaper Navigator and artifacts from the Newspaper Navigator Data Jam by searching Twitter for #NewspaperNavigator!