Newspaper Navigator
Benjamin Charles Germain Lee
This page contains a number of resources for Newspaper Navigator, my
Innovator in Residence project at the
Library of Congress. The goal of Newspaper Navigator is to re-imagine
searching over the visual content in
Chronicling America using machine learning. The project consists of two steps:
- Extracting the headlines, photographs, illustrations, comics,
  cartoons, and advertisements from the 16.3 million pages in
  Chronicling America.
  This step is complete, and the Newspaper Navigator dataset has
  been released!
- Re-imagining exploratory search over this extracted visual content, including with visual similarity search.
  The Newspaper Navigator search app has been released! You can find it
  here.
The goal of Newspaper Navigator is to engage the American public with the Library of
Congress's collections, as well as to advance research in computer science,
digital libraries, and the digital humanities.
All deliverables resulting from the Newspaper Navigator project
(including all code, the dataset, etc.) are placed into
the public domain for unrestricted re-use. An enormous thank you
to LC Labs,
the National Digital Newspaper Program, and IT Design & Development at the Library of Congress, as well as
Dan Weld!
The Newspaper Navigator Dataset
Website:
https://news-navigator.labs.loc.gov/
This is the landing page for the Newspaper Navigator dataset. Here, you
will find details about how to query the data over https and S3. There
are also hundreds of pre-packaged datasets for immediate use - no coding
necessary!
Paper:
CIKM 2020
DOI: https://dl.acm.org/doi/10.1145/3340531.3412767
ArXiv: https://arxiv.org/abs/2005.01583
This paper details the construction of the Newspaper Navigator dataset.
This includes details on training the visual content recognition model,
statistics on running the pipeline, and visualizations of the dataset
itself.
*This paper was named Best Resource Paper Runner-up at
CIKM 2020.*
*The dataset has been named Best Digital Humanities Dataset at the
2020 Digital Humanities Awards.*
Code:
https://github.com/LibraryOfCongress/newspaper-navigator
This GitHub repo contains all of the code for the Newspaper Navigator
Project, as well as the finetuned visual content recognition model
weights, the dataset used for finetuning (annotations from the Beyond
Words project augmented with additional annotations), and demos. The
entire contents of the repo are placed into the public domain.
The Newspaper Navigator Search Application
Website:
https://news-navigator.labs.loc.gov/search
The Newspaper Navigator search application enables visitors to search over 1.5 million
Newspaper Navigator photos. In addition to providing faceted and keyword search affordances,
the application empowers visitors to train their own AI navigators to search these 1.5 million
photos by visual similarity. Visitors train an AI navigator by labeling positive and
negative training examples, and they can tune it on the fly because training and predicting on all 1.5
million photos takes just a couple of seconds. The AI navigators are powered by ResNet-18 image
embeddings.
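To make the train-and-rank loop described above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the application's actual code: the page above only says that the navigators are powered by ResNet-18 image embeddings, so the choice of a logistic regression classifier, the toy 3-dimensional vectors (standing in for real image embeddings), and the names `train_navigator` and `rank_photos` are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_navigator(embeddings, positives, negatives, epochs=200, lr=0.5):
    """Fit a small logistic-regression classifier on labeled embeddings.

    embeddings: dict mapping photo id -> list of floats (toy data here)
    positives, negatives: photo ids the visitor labeled as wanted / unwanted
    Returns (weights, bias).
    """
    dim = len(next(iter(embeddings.values())))
    weights = [0.0] * dim
    bias = 0.0
    labeled = [(embeddings[pid], 1.0) for pid in positives]
    labeled += [(embeddings[pid], 0.0) for pid in negatives]
    for _ in range(epochs):
        for x, y in labeled:
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def rank_photos(embeddings, weights, bias):
    """Return all photo ids sorted from most to least likely match."""
    def score(pid):
        x = embeddings[pid]
        return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return sorted(embeddings, key=score, reverse=True)


# Hypothetical embeddings: "a", "b", "e" resemble each other, as do "c", "d", "f".
embeddings = {
    "a": [0.90, 0.10, 0.00],
    "b": [0.80, 0.20, 0.10],
    "c": [0.10, 0.90, 0.00],
    "d": [0.00, 0.80, 0.20],
    "e": [0.85, 0.15, 0.05],  # unlabeled
    "f": [0.05, 0.95, 0.10],  # unlabeled
}

weights, bias = train_navigator(embeddings, positives=["a", "b"], negatives=["c", "d"])
ranking = rank_photos(embeddings, weights, bias)
print(ranking)
```

Because the embeddings are fixed and only a small linear model is fit on top of them, each round of relabeling and re-ranking is cheap in a sketch like this, which is consistent with the interactive, on-the-fly tuning described above.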
Demo: UIST 2020
Paper DOI: https://dl.acm.org/doi/10.1145/3379350.3416143
Preview Video: https://www.youtube.com/watch?v=1WfTFVXx1fg
Short Talk Video: https://www.youtube.com/watch?v=9w7ippuo3Gk
This demo presents open faceted search, the new mode of search launched in the
Newspaper Navigator Search Application. From the computer science perspective, open faceted
search empowers users to define their own facets in an open-domain fashion.
Newspaper Navigator Data Archaeology
In this Digital Humanities Quarterly paper, which I call a "data archaeology," I consider the digitization journeys of
four different pages in Black newspapers in Chronicling America that reproduce the same photograph
of W.E.B. Du Bois. In doing so, I unpack how each step in the pipelines, such as the imaging process and the
construction of training data, not only imprints bias on the resulting Newspaper Navigator dataset
but also propagates the bias via the machine learning algorithms employed. I investigate the limitations of machine
learning as it relates to cultural heritage, from marginalization and erasure via algorithmic bias to
unfair labor practices in the construction of commonly-used datasets. I argue that any use of machine learning with cultural heritage must be done with an understanding
of the broader socio-technical ecosystems in which the algorithms have been utilized.
Data Archaeology:
http://www.digitalhumanities.org/dhq/vol/15/4/000578/000578.html
Newspaper Navigator Organizational Overview
To document an overview of Newspaper Navigator from the organizational perspective, we contributed an article to the
EuropeanaTech Insight
special issue on newspapers. You can find more details below:
Paper: https://pro.europeana.eu/page/issue-16-newspapers
Newspaper Navigator and the Ladino Press
I have also applied Newspaper Navigator to study visual content embedded within the Ladino press, as digitized by the University of Washington's
Sephardic Studies Digital Library. I have presented this work at #DHJewish - Jewish Studies in the Digital Age.
The DOI for the abstract, as well as the video recording of our presentation, can be found here.
I have also written a chapter for the edited volume Jewish Studies in the Digital Age, which was published with De Gruyter Press as part of the Studies in Digital History and Hermeneutics Series.
This work is supported by the Richard Willner Memorial Fellowship as part of the Stroum Center for Jewish Studies's
Graduate Fellowship program. More information on this project can be found in a blog post
that I wrote for the Stroum Center for Jewish Studies.
Jewish Studies in the Digital Age Book Chapter: https://doi.org/10.1515/9783110744828-010
Newspaper Navigator and the Visual Layouts of Ethnic Periodicals
With periodicals scholars Jim Casey, Sarah Salter, and Joshua Ortiz Baco, I am studying the evolution of visual layouts of ethnic periodicals
within Chronicling America. Using the Newspaper Navigator dataset, it is possible to directly quantify the similarity of
layouts across millions of newspaper pages, enabling us not only to trace the technological developments of printing presses
but also to uncover the hidden editorial practices embedded within layouts themselves. For example, we have identified
clusters of newspaper titles with similar visual layouts, such as networks of African-American titles that feature
illustrations and photographs of members of their communities in portrait poses in the center of their front pages. The
editors’ choice of a shared visual grammar speaks to the ways in which visual culture figured prominently in editorial
practices. Our first paper detailing this collaboration was accepted at Computational Humanities Research (CHR) 2021.
You can find more details below:
First Paper: http://ceur-ws.org/Vol-2989/short_paper3.pdf
Our second paper has appeared in the journal Criticism in the special issue "New Approaches to Critical Bibliography and the Material Text."
Second Paper: http://ceur-ws.org/Vol-2989/short_paper3.pdf
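As a rough illustration of what "quantifying the similarity of layouts" can mean in practice, the Python sketch below rasterizes each page's bounding boxes onto a coarse grid and compares two pages by cosine similarity. This is only one plausible approach and is not taken from the papers above: the grid size, the assumption that boxes are given as `(x1, y1, x2, y2)` normalized to the range 0 to 1, and the names `layout_vector` and `cosine_similarity` are all hypothetical.

```python
import math

GRID = 4  # hypothetical grid resolution (4 x 4 cells per page)


def layout_vector(boxes, grid=GRID):
    """Turn bounding boxes into a grid * grid vector of per-cell coverage.

    boxes: iterable of (x1, y1, x2, y2) with coordinates normalized to [0, 1].
    """
    cells = [0.0] * (grid * grid)
    step = 1.0 / grid
    for x1, y1, x2, y2 in boxes:
        for row in range(grid):
            for col in range(grid):
                # Width and height of the overlap between the box and this cell.
                ox = max(0.0, min(x2, (col + 1) * step) - max(x1, col * step))
                oy = max(0.0, min(y2, (row + 1) * step) - max(y1, row * step))
                cells[row * grid + col] += (ox * oy) / (step * step)
    return cells


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy front pages: two with a centered portrait near the top, one with a bottom banner.
page_a = [(0.30, 0.10, 0.70, 0.50)]
page_b = [(0.32, 0.12, 0.68, 0.50)]
page_c = [(0.00, 0.75, 1.00, 1.00)]

sim_ab = cosine_similarity(layout_vector(page_a), layout_vector(page_b))
sim_ac = cosine_similarity(layout_vector(page_a), layout_vector(page_c))
print(sim_ab, sim_ac)
```

In this toy setup, the two pages with a centered portrait score as far more similar to each other than either does to the page with a bottom banner. Pairwise scores of this kind could then feed a clustering step to group titles with similar layouts.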
Newspaper Navigator in the Classroom
To document the ways in which Newspaper Navigator and machine learning writ large can play a role in social studies education,
Ilene Berson, Michael Berson, and I wrote an article for Social Education. You can find more details below:
Paper: https://www.socialstudies.org/social-education/85/2/machine-learning-and-social-studies
Newspaper Navigator and Layout Parser
The Newspaper Navigator visual content recognition model is now part of
Layout Parser's
Model Zoo! Layout Parser is a unified toolkit for Deep Learning Based Document Image Analysis. You can pip install the library and process newspaper pages in a few lines of code.
Our paper on Layout Parser, led by Zejiang Shen and Melissa Dell, was presented at ICDAR 2021. You can find more details below.
Paper: https://arxiv.org/abs/2103.15348
Preview Video: https://www.youtube.com/watch?v=zmr9NOYPKHo
Presentation Video: https://www.youtube.com/watch?v=ASe4X7fSRa4
Press
The Librarians of the Future Will Be AI Archivists
Popular Mechanics
(Courtney Linder, 05/13/2020)
An Archive Unearthed
The Batch
(Nick Stockton, 05/13/2020)
Millions of historic newspaper images get the machine learning
treatment at the Library of Congress
TechCrunch
(Devin Coldewey, 05/07/2020)
Machine Learning: The Library of Congress “Newspaper Navigator”
Dataset is Now Available; Over 16 Million Pages From “Chronicling
America” Processed
InfoDocket
(Gary Price, 05/07/2020)
U.S. Library of Congress Processes over 16 Million Historic Newspaper Pages Using AI
NVIDIA Developer News Center
(05/06/2020)
Library of Congress Innovator in Residence Ben Lee Discusses His
Newspaper Navigator Project That Uses Machine Learning to Extract
Visual Content From Chronicling America & Announces Upcoming “Data
Jam” to Preview Dataset
InfoDocket
(Gary Price, 04/21/2020)
Ph.D. student Benjamin Lee named Library of Congress Innovator in
Residence
Allen School News
(11/25/2019)
Hip Hop and Human-Computer Interaction Focus of 2020 Innovators in
Residence
the Library of Congress
(11/18/19)
Interviews
Preserving Sephardic History through Interdisciplinary Collaboration: An Interview with Makena Mezistrano and Ben Lee
EuropeNow
(Taylor Soja, 04/17/22)
Navigating Collections of Digitised Historical Newspapers: A Conversation with Ben Lee
NewsEye Blog
(Amanda Maunoury, 10/05/21)
Ladino Newspapers Are the New Wave in “Uncharted Waters” of Digital History
The Stroum Center for Jewish Studies at the University of Washington
(Hannah Pressman, 12/01/20)
Chronicling America and Navigating Newspapers
From Our Corner: Washington Secretary of State Blog (06/17/2020)
Blog Posts from The Signal and the NEH Blog
Newspaper Navigator Search Application Now Live!
The Signal (Eileen Jakeway, 09/21/20)
Reimagining Searching in Chronicling America
The NEH Blog
(Joshua Ortiz Baco, 07/17/20)
Innovator Ben Lee and LC Labs Host “Data Jam” with 100 Million
Historic Newspaper Images
The Signal
(Leah Weinryb-Grohsgal, 04/21/20)
Newspaper Navigator Surfaces Treasure Trove of Historic Images – Get
a Sneak Peek at Upcoming Data Jam!
The Signal
(Eileen Jakeway, 04/21/20)
Introducing Ben and Brian, the Library’s new Innovators in
Residence!
The Signal
(Eileen Jakeway, 11/18/19)
Reviews
06/23/2022
Reviews in DH: A Review of Newspaper Navigator
By Lorella Viola
Selected Events & Talks
*For a more comprehensive list of talks that I've given on
Newspaper Navigator, please see my talks page.*
10/31/2023
Deutsche Nationalbibliothek
Newspaper Portals Meet DH Conference
Panel with Torsten Roeder
"Visual Content: Reimagining Digitized Newspapers with Machine Learning" (virtual)
09/08/2023
Virginia Tech
Computer Science Seminar Series
"Reimagining Search and Discovery for Digital Collections with Machine Learning"
06/08/2023
University College London
Sloane Lab Symposium Series: (Re)connecting Heritage Collections as Data, Infrastructure, and Participatory Engagement: Big Dreams, Big Challenges
"Reimagining Search and Discovery for Digital Collections with Machine Learning" (virtual)
04/21/2023
Mellon-Rare Book School Society of Fellows in Critical Bibliography
Lecture Series
"Preserving and Analyzing Digital Texts" (virtual)
With James Hodges, Ryan Cordell, and Emily Maemura
Event description available here, recording available here
03/21/2023
United States Naval Academy
"Data Science for Jewish Studies" Guest Lecture (virtual)
02/21/2023
University of Washington
Information School Seminar
"Reimagining Search and Discovery for Digital Collections with Machine Learning"
01/20/2023
European Society for Periodical Research
Seminar
"New Computational Approaches to Periodical Studies" (virtual)
With Thomas Smits and Kaspar Beelen
Event description available here, recording available here
01/18/2023
Humboldt University of Berlin
Digital History Research Colloquium
"Newspaper Navigator: Reimagining Digitized Newspapers with Machine Learning" (virtual)
01/05/2023
American Historical Association 2023
Panel
"Computing on Cultural Heritage: Reports from an LC Labs Experiment"
With Benjamin Schmidt, Meghan Ferriter, Jessica Mack, Lauren Tilton, and Taylor Arnold
10/12/2022
Digital Library Federation Forum 2022
"Newspaper Navigator: Hosting the Dataset and Deploying the Search Application" (virtual)
With Chris Adams
07/27/2022
Digital Humanities 2022
Long Presentation
"Newspaper Navigator: Reimagining Digitized Newspapers with Machine Learning" (virtual)
06/04/2022
German Historical Institute Washington
Fifth Annual GHI Conference on Digital Humanities and Digital History
Datafication in the Historical Humanities
"Compounded Mediation: Excavating the Newspaper Navigator Dataset"
05/19/2022
Research Society for American Periodicals 2022
Symposium
"New Directions for Interdisciplinary Collaborations in Periodical Studies" (virtual)
Event description available here
05/18/2022
DH Unbound 2022
"A Computational Periodicals Unconference: Exploring New Opportunities for Critical and Collaborative Inquiry" (virtual)
(with Sarah Salter, Jim Casey, and Joshua Ortiz Baco)
04/11/2022
2022 UCLadino Conference
"The Digital Humanities and the Ladino Press: Unlocking Historic Ladino Newspapers with & Machine Learning" (virtual)
Recording available here
03/15/2022
Fantastic Futures 2021
"Newspaper Navigator: Reimagining Digitized Newspapers with Machine Learning" (virtual)
03/14/2022
Designing Storage Architectures for Digital Collections 2022
Library of Congress
"Newspaper Navigator: Hosting the Dataset and Deploying the Search Application" (with Chris Adams, virtual)
Event description available here; slides available here
02/23/2022
University of Wrocław
Workshop: Studying Advertisements in pre-1939 Jewish Press: Methods and Challenges
"Using Machine Learning to Extract and Analyze Advertisements in Historic Ladino Newspapers, 1890-1948"
02/22/2022
Princeton University
Center for Digital Humanities
"Novel Machine Learning Methods for Computing Cultural Heritage: An Interdisciplinary Approach" (virtual)
Event description available here
02/02/2022
University of Illinois Urbana-Champaign
Digital Humanities + iSchool (DHIS) Collective Meeting
"Compounded Mediation: A Data Archaeology of the Newspaper Navigator Dataset" (virtual)
Event description available here
01/28/2022
The University of Washington
Computer Science & Engineering HCI Seminar
"Newspaper Navigator: Open Faceted Search for 1.5 Million Images" (virtual)
11/10/2021
Texas A&M University-Corpus Christi
Honors Program Speaker Series
"Newspaper Navigator: Reimagining Digitized Newspapers with Machine Learning" (virtual)
11/09/2021
University of London School of Advanced Study
Institute of Historical Research
Digital History Seminar Series
"Newspaper Navigator: Reimagining Digitized Newspapers with Machine Learning" (virtual)
Recording available here
10/06/2021
Consortium for the History of Science, Technology, and Medicine
Digital History of Science Working Group
"Excavating the Newspaper Navigator Dataset from a Critical Data Studies Perspective" (virtual)
06/25/2021
Association for Documentary Editing
Annual Meeting
"Cowboys, Computers, and Cartoons: Excavating and Explicating America’s Political Cartoons"
A Panel Session with Clay Jenkinson and Sharon Kilzer (virtual)
05/29/2021
The University of Washington
The Stroum Center for Jewish Studies
Colloquium
"Sephardic Experiences of Modernity: Newspapers, Migrants and Midwives" (virtual)
A Panel Session with Busra Demirkol, Oya Aktas, and Oscar Aguirre-Mandujano (respondent)
Recording available here
03/24/2021
Harvard University
Discovery Series
"Newspaper Navigator: Re-Imagining Digitized Newspapers with Machine Learning" (virtual)
Sponsored by the Cabot Science Library and the Harvard University Digital Scholarship Group
03/13/2021
The International NewsEye Conference
"From Chronicling America to Newspaper Navigator: Improving Access to Historic Newspaper Photos at the Library of Congress through Machine Learning"
(with Nathan Yarasavage)
Recording available here; slides available here
01/13/2021
#DHJewish - Jewish Studies in the Digital Age
"The Digital Humanities and the Ladino Press: Using Machine Learning to Extract and Analyze Visual Content in Historic Ladino Newspapers" (virtual)
(with Devin Naar)
Abstract DOI & presentation recording available here
12/09/2020
Association of College & Research Libraries
ULS Technology in University Libraries Committee Tech Forum
"Newspaper Navigator: Re-imagining Library Search and Discovery with Machine Learning" (virtual)
Recording available here
11/12/2020
The University of Washington
Stroum Center for Jewish Studies
"Teaching Computers to Read Ladino" (virtual)
Recording available here
10/30/2020
The Johns Hopkins University
Digital History Workshop
"Newspaper Navigator: Reimagining Digitized Newspapers with Machine Learning" (virtual)
10/09/2020
Drexel University
History Department
"Newspaper Navigator: Reimagining Digitized Newspapers with Machine Learning" (virtual)
09/18/2020
Duke University
Data Dialogue Series
"Newspaper Navigator: Reimagining Historic Newspapers with Machine Learning" (virtual)
09/15/2020
National Endowment for the Humanities and the Library of Congress
NEH Division of Preservation and Access and LOC Serial and Government Publications Division
"Seeing Editors: Metadata, Machine Learning, and the Shapes of Social Justice" (virtual)
A panel with Jim Casey, Sarah Salter, and Joshua Ortiz Baco
09/11/2020
The Alan Turing Institute
Computer Vision for Digital Heritage
"Newspaper Navigator: Reimagining Digitized Newspapers with Machine Learning" (virtual)
07/12/2020
The Alan Turing Institute & British Library
Living with Machines Group
"Newspaper Navigator: An Introduction and Demo" (virtual)
05/15/2020
Princeton University
Princeton Center for Digital Humanities & Princeton University Library
2019-20 Collections as Data Discussion Series
"Newspaper Navigator: Reimagining Digitized Newspapers with Machine Learning" (virtual)
Recording available here
05/07/2020
Library of Congress
Newspaper Navigator Data Jam
Public Event
View a recording of the data jam here
Read more about the data jam here
Newspaper Navigator around the Web
Innovations with Digitized Newspapers
2022-2023 Library of Congress Albert Einstein Distinguished Educator Fellow Jacqueline Katz's talk describing uses of Newspaper Navigator in the classroom!
nnanno: A Collection of Tools for working with the Newspaper Navigator Dataset
Daniel van Strien's nnanno toolkit is an invaluable resource for sampling
Newspaper Navigator data, adding additional labels, and training models to make new predictions using IIIF!
The Collective Wisdom Handbook: Perspectives on Crowdsourcing in Cultural Heritage
Newspaper Navigator has been featured as a case study in Chapter 10: Working with Crowdsourcing Data
of The Collective Wisdom Handbook! The authors of the volume are Mia Ridge, Samantha Blickhan, Meghan Ferriter, Austin Mast, Ben Brumfield, Brendon Wilkins, Daria Cybulska, Denise Burgher, Jim Casey, Kurt Luther, Michael Haley Goldman, Nick White, Pip Willcox, Sara Carlstead Brumfield, Sonya J. Coleman, and Ylva Berglund Prytz.
Working with Maps at Scale Using Computer Vision and Jupyter Notebooks
Delivered by Daniel van Strien as part of the
Digital Humanities and Digital Archives workshop at the National Library of Estonia
Daniel's dataset derived from Newspaper Navigator can be found here
The Impact of Artificial Intelligence on Genealogy
Elevenses with Lisa: A Genealogy Show (Episode 32)
by Lisa Louise Cooke
Show notes here
How to Use Chronicling America's Newspaper Navigator to Find Photos and Images
Elevenses with Lisa: A Genealogy Show (Episode 26)
by Lisa Louise Cooke
Show notes here
The Newspaper Navigator Search App, an Educator's View
The Primary Source Podcast
(Season 1, Episode 3)
by Tom Bober
Genealogy Quick Start (September 24, 2020 episode)
by Shamele Jordon
Machine Learning + Libraries: A Report on the State of the Field
by Ryan Cordell
Sanborn Maps Navigator (with Newspaper Navigator data as well!)
by Selena Qian (learn more
about the project in Selena's 2020 Junior Fellow presentation!)
Mega Mideast Map Collage
via Nick Danforth at The Afternoon Map
Twitter
Check out tweets about Newspaper Navigator and artifacts from the
Newspaper Navigator Data Jam by searching Twitter for
#NewspaperNavigator!