Topic Modeling with LDA

Zaheer
7 min read · May 3, 2023

Hi everyone, welcome to my blog! Today I’m going to talk about a very interesting and useful topic in natural language processing (NLP) called topic modeling.

1. Introduction

💌 Topic modelling is a technique that helps to identify the underlying themes or topics present in a collection of documents.

Topic modeling is a way of finding out what are the main themes or topics in a large collection of texts, such as books, articles, reviews, tweets, etc. For example, if you have a bunch of news articles, you might want to know what are the most common topics they cover, such as politics, sports, entertainment, etc. Topic modeling can help you do that automatically and efficiently.

But how does topic modeling work?🤔

  • Well, there are different methods and algorithms for topic modeling, but the basic idea is that each text can be represented as a mixture of topics, and each topic can be represented as a distribution of words.
  • For example, a text about politics might have a high proportion of the topic “government”, which in turn might have a high probability of words like “president”, “election”, “policy”, etc.
  • A text about sports might have a high proportion of the topic “soccer”, which might have a high probability of words like “goal”, “player”, “match”, etc.

The goal of topic modeling is to discover these topics and their word distributions from the texts themselves, without any prior knowledge or labels.
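
To make that concrete, here is a tiny hand-written illustration of these two representations; the topic names and all of the numbers are invented for this example, not learned from any real data:

```python
# A document is a mixture of topics (its topic proportions sum to 1),
# and each topic is a distribution over words. All numbers here are invented.
doc_about_politics = {"government": 0.7, "soccer": 0.1, "entertainment": 0.2}

# Top words of two topics (only a few words shown, so these don't sum to 1).
topic_government = {"president": 0.30, "election": 0.25, "policy": 0.20}
topic_soccer = {"goal": 0.35, "player": 0.30, "match": 0.20}

# Topic modeling's job is to recover numbers like these automatically,
# given only the raw texts.
```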

Different Methods of Topic Modeling

🔹This highly important process can be performed by various algorithms or methods. Some of them are:

  1. Latent Dirichlet Allocation (LDA)
  2. Non-Negative Matrix Factorization (NMF)
  3. Latent Semantic Analysis (LSA)
  4. Parallel Latent Dirichlet Allocation (PLDA)
  5. Pachinko Allocation Model (PAM)

Topic modeling has many applications and benefits in NLP and beyond. For example, topic modeling can help you:

  • Summarize and organize large collections of texts
  • Find similar or related texts based on their topics
  • Explore and discover new insights and trends from texts
  • Enhance other NLP tasks such as text classification, sentiment analysis, information retrieval, etc.

Importance of Topic Modelling

💜Large amounts of data are collected every day.

💜As more information becomes available, it becomes difficult to find what we are looking for.

💜So, we need tools and techniques to organize, search, and understand huge quantities of information.

💜Hence, topic modelling helps us organize, understand, and summarize large collections of textual information.

💜It can be used to extract hidden patterns that may not be immediately visible to the reader.

💜Once these topics have been identified, the documents can be grouped or labeled according to the topics they cover.

💜This makes it easier to organize, search and summarize the texts.

💜By using topic modelling, we can gain a deeper understanding of the content of the documents and the relationships between them.

How Topic Modeling Works

Latent Dirichlet Allocation

To understand how topic modeling works, we’ll look at an approach called Latent Dirichlet Allocation (LDA).

LDA was developed in 2003 by researchers David Blei, Andrew Ng and Michael Jordan. Its simplicity, intuitive appeal and effectiveness have led to strong support for its use.

🎀 Latent Dirichlet Allocation (LDA) is a popular topic modeling technique to extract topics from a given corpus.

❤️LDA assumes that each document is a mixture of topics and that each topic is a mixture of words.

❤️It allows us to identify the hidden topics present in a set of documents and the probabilities of each document belonging to those topics.

❤️The word “latent” in LDA refers to the hidden topics that are present in the document collection,

❤️while “Dirichlet” refers to the Dirichlet distribution, which is used to model how likely certain topics and words are to be found together in a document collection.

❤️And the word “allocation”: we allocate topics to each document and the words of each document to those topics.

🎀The LDA algorithm — Step by step:

1️⃣ After importing the required libraries, we will compile all the documents into one list to have the corpus.

2️⃣ We will perform the following text preprocessing steps (a minimal sketch follows the list):

  1. Convert the text into lowercase.
  2. Split text into words.
  3. Remove the stop words.
  4. Remove punctuation, symbols, and special characters.
  5. Normalize the words (I’ll be using lemmatization for normalization).
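
Here is a minimal sketch of these preprocessing steps, assuming NLTK (any tokenizer and lemmatizer would work just as well), with a couple of made-up example documents:

```python
# A minimal preprocessing sketch using NLTK (an assumption; any tokenizer and
# lemmatizer would do). The download calls are one-time setup for NLTK data.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")       # tokenizer models
nltk.download("stopwords")   # English stop word list
nltk.download("wordnet")     # lemmatizer dictionary

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(doc):
    tokens = nltk.word_tokenize(doc.lower())           # 1. lowercase, 2. split into words
    return [
        lemmatizer.lemmatize(tok)                      # 5. normalize (lemmatize)
        for tok in tokens
        if tok.isalpha() and tok not in stop_words     # 3. stop words, 4. punctuation/symbols
    ]

corpus = [
    "I love machine learning and natural language processing.",
    "The president announced a new election policy.",
]
cleaned = [preprocess(doc) for doc in corpus]
print(cleaned)
```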

3️⃣ The next step is to convert the cleaned text into a numerical representation.

  • For sklearn: use either CountVectorizer or TfidfVectorizer to transform the cleaned documents into a Document-Term Matrix (DTM) of numerical values.
  • For gensim: we don’t need to create the DTM explicitly; the gensim library builds its bag-of-words representation internally from the tokenized documents.

The only requirement for the gensim package is that we need to pass the cleaned data in the form of tokenized words.
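
A small sketch of both options, assuming the `cleaned` list of token lists produced by the preprocessing sketch above:

```python
# Two common ways to get a numerical representation of the cleaned text.
# Assumes `cleaned` is the list of token lists produced above.

# Option A (sklearn): CountVectorizer (or TfidfVectorizer) builds the
# document-term matrix from plain strings, so we join the tokens back up.
from sklearn.feature_extraction.text import CountVectorizer

docs_as_strings = [" ".join(tokens) for tokens in cleaned]
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs_as_strings)   # sparse DTM: documents x vocabulary
print(dtm.shape)

# Option B (gensim): no explicit DTM; a Dictionary maps words to ids and
# doc2bow turns each token list into (word_id, count) pairs.
from gensim.corpora import Dictionary

dictionary = Dictionary(cleaned)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in cleaned]
print(bow_corpus[0])
```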

4️⃣ Choosing how many topics we want to discover (K).

  • This is a parameter that we have to specify before running the algorithm.
  • There is no definitive answer to how many topics we should choose, but we can try different values and see which one gives us the best results (one common heuristic is sketched below).
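
One such heuristic is topic coherence: train a model for several values of K and keep the one with the highest coherence score. The sketch below assumes the `bow_corpus`, `dictionary`, and `cleaned` objects from the gensim sketch above:

```python
# Try several topic counts and compare coherence scores; higher is roughly better.
# With a real corpus you would scan a wider range of K.
from gensim.models import LdaModel, CoherenceModel

for k in (2, 4, 6, 8, 10):
    lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
                   num_topics=k, random_state=42, passes=10)
    cm = CoherenceModel(model=lda, texts=cleaned,
                        dictionary=dictionary, coherence="c_v")
    print(f"K={k}: coherence={cm.get_coherence():.3f}")
```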

5️⃣ In the LDA algorithm, we assume that each document is composed of a mixture of topics, and each topic is composed of a distribution over words.

  • However, we don’t know exactly which topics and words are present in each document, so we treat these as hidden or “latent” variables.
  • To make inferences about these hidden variables, we assume that they follow certain probability distributions.
  • We assume that the distribution of topics in each document (theta) follows a Dirichlet distribution with a parameter called alpha, and the distribution of words in each topic (phi) follows a Dirichlet distribution with a parameter called beta.
  • These alpha and beta values are called hyperparameters because they control the shape of the prior distributions.

So what is a Dirichlet distribution?

  • In short, a Dirichlet distribution is a distribution over probability vectors: each draw from it is a set of proportions that sum to one, such as the topic mix of a document or the word mix of a topic (a small sketch follows below).
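
A quick way to build intuition is to draw samples from a Dirichlet distribution with NumPy: each draw is a probability vector over K topics, and alpha controls how concentrated or spread out the mixture is.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of topics

# Small alpha: sparse mixtures, i.e. a document dominated by one or two topics.
print(rng.dirichlet(alpha=[0.1] * K, size=3))

# Large alpha: even mixtures, i.e. every topic gets a similar share.
print(rng.dirichlet(alpha=[10.0] * K, size=3))
```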

6️⃣ Then, we need to assign each word in each document to a random topic. This is our initial guess of which topic each word belongs to.

7️⃣ After that, we need to update our guess of the topics by looking at two things:

  1. How often each word appears in each topic (phi)
  2. How often each topic appears in each document (theta)
  • We use a sampling procedure called Gibbs sampling to calculate these probabilities and update the topic assignments.

8️⃣ We repeat step 7 until we reach a stable state where the topics don’t change much anymore.

9️⃣ We then use the final topic assignments to estimate the values of theta and phi. These are our outputs: theta tells us what topics each document has, and phi tells us what words each topic has.
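
To make steps 6 to 9 concrete, here is a toy collapsed Gibbs sampler written from scratch. It is only a teaching sketch with my own function and variable names; in practice you would let gensim or scikit-learn do this for you.

```python
# A toy collapsed Gibbs sampler for LDA, spelling out steps 6-9 above.
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of documents, each a list of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # how often each topic appears in each document
    n_kw = np.zeros((K, V))           # how often each word appears in each topic
    n_k = np.zeros(K)                 # total words assigned to each topic
    z = []                            # current topic assignment of every word

    # Step 6: assign each word in each document to a random topic.
    for d, doc in enumerate(docs):
        z_d = rng.integers(K, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    # Steps 7-8: repeatedly resample each word's topic until assignments settle.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # word-in-topic part (phi) times topic-in-document part (theta)
                p = (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    # Step 9: estimate theta (doc-topic mix) and phi (topic-word mix) from counts.
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)
    return theta, phi

# Tiny usage example: two "documents" over a vocabulary of 4 word ids.
theta, phi = lda_gibbs([[0, 1, 0, 2], [2, 3, 3, 2]], K=2, V=4)
print(theta.round(2))
print(phi.round(2))
```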

Explanation with example

  • Here are the main steps of the algorithm, illustrated with a small worked example (a runnable sketch follows the list):
  1. First, we need to decide how many topics we want to find in our documents. This is a parameter that we have to choose before running the algorithm. Let’s say we want to find 10 topics.
  2. Next, we need to represent each document as a bag of words, which means we ignore the order of the words and just count how many times each word appears in the document. For example, the sentence “I love machine learning and natural language processing” would be represented as {I: 1, love: 1, machine: 1, learning: 1, and: 1, natural: 1, language: 1, processing: 1}.
  3. Then, we need to assign each word in each document to a random topic. For example, we might assign “I” to topic 3, “love” to topic 7, “machine” to topic 2, and so on. This is our initial guess of the topics of the words.
  4. Now comes the fun part. We need to update our guess of the topics by looking at two things: how often each word appears in each topic, and how often each topic appears in each document. For example, if we see that “machine” and “learning” often appear together in topic 2, and that topic 2 often appears in document A, then we can increase the probability that “machine” and “learning” belong to topic 2 in document A. On the other hand, if we see that “natural” and “language” rarely appear in topic 2, and that topic 2 rarely appears in document B, then we can decrease the probability that “natural” and “language” belong to topic 2 in document B.
  5. We repeat step 4 until we reach a stable state where the topics don’t change much anymore. This means we have found the best topics for our documents.
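
Putting the example together, here is a hedged end-to-end run with scikit-learn's LatentDirichletAllocation; the tiny corpus and the choice of two topics are invented purely for illustration:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the president won the election after a long campaign",
    "the government announced a new policy on taxes",
    "the striker scored a goal in the final match",
    "the player was injured during the soccer match",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)              # bag-of-words counts (step 2)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)               # theta: topic mix per document

words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):       # phi-like topic-word weights
    top_words = [words[i] for i in topic.argsort()[-4:][::-1]]
    print(f"Topic {k}: {top_words}")
print(doc_topics.round(2))
```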

Some advantages of LDA are:

  • It can discover hidden or latent topics that are not explicitly labeled or categorized by humans.
  • It can handle large and diverse collections of documents, such as news articles, scientific papers, social media posts, etc.
  • It can provide insights into the content and structure of documents, such as what are the main themes, how they are related, how they change over time, etc.

Some disadvantages of LDA are:

  • It requires choosing the number of topics (K) beforehand, which can be difficult or arbitrary.
  • It can produce topics that are not meaningful or interpretable by humans, such as topics that mix unrelated words or topics that are too broad or too specific.
  • It can be sensitive to noise or outliers in the data, such as spelling errors, slang words, abbreviations, etc.

I hope this blog post gave you a simple and intuitive explanation of what LDA is and how it works. If you want to learn more about LDA and NLP in general, a good starting point is the original 2003 paper by Blei, Ng, and Jordan.

Thanks for reading and stay tuned for more!

NOTE: Check out my blog on Topic Modeling with Gensim (Python), with source code.
