How have the topics covered by The Atlantic changed over time?


We used Natural Language Processing (NLP) to identify, and compare, the themes of our archival articles and our current journalism.

By: Michael Roman

The Atlantic / Oliver Munday

The release of The Atlantic archive, published online in its entirety for the first time in July 2022, provides our readers access to our journalism, starting with the very first issue published in November 1857. The digitization of 165 years of journalism also granted us on the data science team an unprecedented opportunity to explore the topics we’ve historically written about.

We set out to answer a simple question: How much do the themes in our most recently published stories overlap with the themes in our archival collection? Do we write about the same topics now?

To learn about our archive collection, we built a topic model. A topic model takes two inputs: a collection of articles and the number of topics to discover. Then, based on patterns of co-occurring words across the articles, the model groups articles that share words and phrases. We then label these groupings for easy interpretation (e.g., grouping 1 = “Politics,” grouping 2 = “Healthcare,” grouping 3 = “Technology,” etc.). Topic models also have the neat property of quantifying how strongly each article expresses each discovered topic. For example, article A might be about “Politics,” “Healthcare,” and “Technology” in equal parts, but article B might be entirely about “Healthcare.”
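To make the article A and article B example concrete, here is a minimal sketch of how per-article topic weights might be represented and read. The weights, the dictionary layout, and the `dominant_topics` helper are illustrative assumptions, not the output format of any particular topic-modeling library.

```python
# Hypothetical per-article topic weights, shaped like what a topic model
# might report. Each article's weights sum to 1.
article_topics = {
    "article_a": {"Politics": 1 / 3, "Healthcare": 1 / 3, "Technology": 1 / 3},
    "article_b": {"Politics": 0.0, "Healthcare": 1.0, "Technology": 0.0},
}


def dominant_topics(weights, tolerance=1e-9):
    """Return every topic tied for the highest weight, sorted by name."""
    top = max(weights.values())
    return sorted(t for t, w in weights.items() if abs(w - top) <= tolerance)


for name, weights in article_topics.items():
    # Sanity check: the weights form a proportion over topics.
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    print(name, dominant_topics(weights))
```

Article A comes back with all three topics tied, while article B comes back with “Healthcare” alone.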

With these models, we can begin to not only answer questions about what The Atlantic has covered — our unique universe of topics and how a particular article blends that universe — but also gain insight into how often we wrote about each topic at scale.

We compared the outputs from this newly built archive topic model to the outputs of an existing production topic model trained on just our most recent collection (defined as articles published on or after 2017-01-01).

Here’s what we found.

Archive topics overlap highly with recent topics

We trained both topic models — archive and most recently published — to each discover 50 topics. However, it’s common for these models to generate topics that aren’t directly usable, such as chimerical topics (a mash-up of two distinct topics, like “Elon Musk” and “Gardening”) or topics that are incoherent. Incoherence refers to articles that are grouped together by the model but only share overly common or uninformative words such as “year”, “make,” and “long”, which don’t actually indicate what the article is about. We report on just the usable topics discovered by each model.
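How usable topics were separated from incoherent ones is not detailed here, and it may well have been done by hand. Purely as an illustration, one simple heuristic is to flag a topic when most of its top words are generic. The word list, the 0.5 threshold, and both function names below are assumptions, and this check would not catch chimerical topics.

```python
# Hypothetical list of overly common words; "year", "make", and "long"
# come from the text, the rest are illustrative.
GENERIC_WORDS = {"year", "make", "long", "time", "way", "thing"}


def generic_share(top_words):
    """Fraction of a topic's top words that are generic."""
    if not top_words:
        return 0.0
    return sum(word in GENERIC_WORDS for word in top_words) / len(top_words)


def is_incoherent(top_words, threshold=0.5):
    """Flag a topic when at least `threshold` of its top words are generic."""
    return generic_share(top_words) >= threshold


# Made-up top words for two topics.
vague_topic = ["year", "make", "long", "thing", "house"]
congress_topic = ["congress", "senate", "bill", "vote", "year"]

print(is_incoherent(vague_topic))
print(is_incoherent(congress_topic))
```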

Our most recently published topic model discovered a coherent universe of 43 distinct topics. These reflect coverage areas such as “Health, healthcare & disease,” “Congress,” and “UK & EU”.

The archive topic model, trained on a collection of documents published pre-1996, discovered a coherent universe of 34 distinct topics.

We found that the two sets of topics overlap heavily: 65 of the 77 topics across both models (84%) have a match in the other collection.
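The 84% figure follows from the counts reported in this article: 43 recent topics, 34 archive topics, and the 9 recent and 3 archive topics with no solid match. A short check of the arithmetic (the variable names are ours):

```python
recent_topics = 43    # usable topics from the recent model
archive_topics = 34   # usable topics from the archive model
recent_only = 9       # recent topics with no solid archive match
archive_only = 3      # archive topics with no solid recent match

total = recent_topics + archive_topics
overlapping = total - recent_only - archive_only
share = overlapping / total

print(f"{overlapping} of {total} topics overlap ({share:.0%})")
```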

Some of the themes observed in both are:
- Geography, cities, and demography
- Technology
- Religion
- Music
- War & military
- Animals, nature
- Race, racism & slavery
- Government & politics
- Family & marriages
- Labor / Industry / Economy
- Middle East coverage
- Sports
- Movies, TV, and radio

This first finding suggests that The Atlantic has, across its history, continued to provide commentary, analysis, and reporting on a wide range of topics that are crucial to the American idea — and that our archive still adds context to what’s happening in our world today.

Topics we write about today not observed in the archive

There are nine topics discovered by the most recently published topic model (out of 43) for which there was no solid match in the archive.

Topics in this list include:
- Media, News, & Journalism
- Gender & Sexuality
- Violence, Guns, & Shootings
- Sexual Misconduct
- Social Media
- Pregnancy & Reproductive Debate
- Trump Scandals & Crises

Some of the topics on this list (though not all) are here because certain technologies, people, or phenomena came into existence or notoriety for the first time after the archive publication date cut-off. The diversity of new topics covered within our pages and on our website demonstrates how The Atlantic has evolved to cover the most important issues of our day.
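What counts as a “solid match” between a recent topic and an archive topic is not specified here. One common way to compare topics from two models, shown only as an assumption, is cosine similarity between each topic's word weights, with a cut-off below which a topic is treated as unmatched. The topics, weights, and 0.5 threshold below are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {word: weight} dictionaries."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_matches(recent, archive, threshold=0.5):
    """Map each recent topic to its closest archive topic, or None."""
    matches = {}
    for name, words in recent.items():
        best_name, best_score = None, 0.0
        for archive_name, archive_words in archive.items():
            score = cosine(words, archive_words)
            if score > best_score:
                best_name, best_score = archive_name, score
        matches[name] = best_name if best_score >= threshold else None
    return matches


# Made-up word weights for a few topics from each model.
recent = {
    "Technology": {"machine": 0.4, "computer": 0.4, "internet": 0.2},
    "Social Media": {"twitter": 0.6, "facebook": 0.4},
}
archive = {
    "Technology": {"machine": 0.5, "computer": 0.3, "railroad": 0.2},
    "Ships": {"boat": 0.7, "sail": 0.3},
}

print(best_matches(recent, archive))
```

With these toy weights, the two “Technology” topics pair up, and “Social Media” has no archive counterpart.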

Topics we write about in the archive collection not observed in today’s collection

There are three topics discovered by the archive topic model (out of 34) for which there was no solid match in the most recently published collection.

These were:
- Articles containing highly vernacular (local language or dialect) speech
- Historical Perspectives in Visual Art
- Ships, boats, sailing, fishing

“Articles containing highly vernacular speech” was a particularly interesting category. Our archive contains a number of vernaculars spoken in the 19th and early 20th century, and this topic housed texts where vernacular language was prominently used. For example, “A Harbor Feud,” a piece about a feud between boatmen, greatly activated this topic. A character in this story says, “11 lave that sudden-like.” If you can translate that, let us know! A separate piece, “Jack the Robber,” a fictional work set in small-town Ireland, also strongly activated this topic.

Comparing language drift over time

Not only do the topics we write about evolve over time, but so does the language we use when referring to a specific topic. This is a well-understood phenomenon, and we decided to look at how it is reflected in our collection. Below are word clouds showing the most commonly occurring words for the “Technology” topic in two different time periods: before 1900 and from 2021 onward. In the pre-1900 archive, the words we used to talk about “Technology” were dominated by terms like “train,” “electricity,” “locomotive,” and “railroad.” However, in our “Technology” articles published in 2021 or later, we see words like “ai,” “chatbot,” “internet,” and “search” used prominently.

Most Frequent Words in a Topic for Archive Articles:

Most Frequent Words in a Topic for Modern Articles:
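Word clouds like these come down to counting words within one topic's articles for each time period. A minimal sketch of that counting step follows; the sample sentences, the stopword list, and the `top_words` helper are made up for illustration and are not drawn from the archive.

```python
import re
from collections import Counter


def top_words(texts, stopwords, n=3):
    """Return the n most frequent non-stopword words across the texts."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)


STOPWORDS = {"the", "and", "to", "a", "of"}

# Invented example sentences standing in for "Technology" articles.
archive_texts = [
    "The railroad and the locomotive changed travel.",
    "Electricity and the railroad reached the town.",
]
modern_texts = [
    "The chatbot uses ai to search the internet.",
    "Search and ai reshape the internet.",
]

print(top_words(archive_texts, STOPWORDS))
print(top_words(modern_texts, STOPWORDS))
```

The same counting, run separately on each period's articles, is what makes the vocabulary shift visible.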

Topic modeling has made a substantial impact for us at The Atlantic. It let us approach the ~50,000 newly digitized archive documents in a structured way, making sense of them and understanding how they fit together. Topic assignments have also become indispensable metadata that we generate for newly published pieces. The static universe of topics generated by the most recently published model has become a widely used nomenclature, akin to a lingua franca, for the data team and business stakeholders when discussing traffic performance and audience preferences. We’ve also leveraged the topic model to recommend content to our readers. The use cases go on and on!

There are many other NLP techniques we are researching here at The Atlantic. We are actively exploring the potential of Large Language Models (LLMs) and chatbots. We’re also excited about the explosion of other interesting new Generative AI technologies and their applications. Following our recent hackathon, we deployed AI-narrated articles to meet a longstanding reader request for more audio content. Even with these advancements, topic models hold a special, enduring place in our stack and in our hearts!
