NLP / Lesson 5

Topic Modeling

Discover hidden themes and patterns in document collections

What is Topic Modeling?

Topic modeling is an unsupervised machine learning technique for discovering abstract "topics" that occur in a collection of documents. It automatically identifies patterns of word co-occurrences and groups them into topics, helping us understand the main themes in large text corpora without manual labeling.

Key Insight: Topic models assume documents are mixtures of topics, and topics are probability distributions over words. This allows documents to belong to multiple topics simultaneously.


Traditional Topic Modeling Methods

Latent Dirichlet Allocation (LDA)

The most popular traditional topic modeling algorithm, using Bayesian inference:

How it works:

  1. Assumes each document is a mixture of topics
  2. Each topic is a distribution over words
  3. Uses Dirichlet priors for both distributions
  4. Inference via variational Bayes or Gibbs sampling

Key parameters:

  • α (alpha): Document-topic density
  • β (beta): Topic-word density
  • K: Number of topics
  • Iterations: Sampling iterations

Strengths: Interpretable, well-studied, handles sparse data well

Limitations: Assumes bag-of-words, ignores word order, requires tuning K
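The generative story above can be sketched directly with NumPy: sample each topic's word distribution from a Dirichlet with concentration β, sample a document's topic mixture from a Dirichlet with concentration α, then draw a topic and a word for each position. A minimal sketch with toy sizes (the symmetric priors and small vocabulary are illustrative choices, not prescribed values):

```python
import numpy as np

rng = np.random.default_rng(42)

K, V, doc_len = 3, 8, 20          # topics, vocabulary size, words per document
alpha, beta = 0.5, 0.1            # Dirichlet concentration parameters

# Topic-word distributions: each topic is a distribution over the vocabulary
phi = rng.dirichlet([beta] * V, size=K)        # shape (K, V), rows sum to 1

def generate_document(doc_len):
    """Generate one document under the LDA generative story."""
    theta = rng.dirichlet([alpha] * K)          # document's topic mixture
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)              # pick a topic for this word
        w = rng.choice(V, p=phi[z])             # pick a word from that topic
        words.append(w)
    return theta, words

theta, doc = generate_document(doc_len)
```

Inference (variational Bayes or Gibbs sampling) runs this story in reverse: given only the words, it recovers plausible θ and φ.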

Latent Semantic Analysis (LSA)

Uses SVD to reduce dimensionality and find latent semantic structure:

LSA process:

  1. Build a term-document matrix (e.g. TF-IDF weighted)
  2. Apply SVD: A = UΣV^T
  3. Truncate to the top k singular values
  4. Rows of V give document representations in topic space
  5. Rows of U give term representations in topic space

✓ Fast and deterministic

✗ Topics can have negative values
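The whole pipeline is a single SVD call in NumPy. A minimal sketch with a toy term-document matrix (the numbers stand in for TF-IDF weights and are illustrative only):

```python
import numpy as np

# Toy term-document matrix (rows = terms, cols = documents), e.g. TF-IDF weights
A = np.array([
    [1.0, 0.9, 0.0, 0.0],   # "beach"
    [0.8, 1.0, 0.1, 0.0],   # "sunset"
    [0.0, 0.1, 1.0, 0.8],   # "stock"
    [0.0, 0.0, 0.9, 1.0],   # "market"
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                 # keep the top-k latent "topics"
term_topic = U[:, :k]                 # rows of U: terms in topic space
doc_topic = Vt[:k, :].T               # rows of V: documents in topic space
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation
```

Note the determinism: the same matrix always yields the same factors, but nothing constrains them to be non-negative, which is exactly the interpretability problem flagged above.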

Non-negative Matrix Factorization (NMF)

Decomposes document-term matrix into non-negative factors:

V ≈ W × H

V: documents×terms, W: documents×topics, H: topics×terms

✓ Interpretable parts-based representation

✗ Sensitive to initialization
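The factorization can be sketched with the classic Lee-Seung multiplicative updates for the Frobenius objective; the toy counts and iteration budget below are illustrative, and in practice a library implementation (e.g. scikit-learn's NMF) would be used:

```python
import numpy as np

rng = np.random.default_rng(0)

# V: documents x terms (non-negative counts)
V = np.array([
    [3., 2., 0., 0.],
    [2., 3., 1., 0.],
    [0., 1., 3., 2.],
    [0., 0., 2., 3.],
])

n_topics = 2
W = rng.random((V.shape[0], n_topics)) + 0.1   # documents x topics
H = rng.random((n_topics, V.shape[1])) + 0.1   # topics x terms

eps = 1e-9
for _ in range(500):
    # Multiplicative updates keep W and H non-negative at every step
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

error = np.linalg.norm(V - W @ H)
```

Because the random initialization of W and H determines which local minimum the updates reach, different seeds can yield different topics, the sensitivity noted above.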

Modern Neural Topic Models

BERTopic

State-of-the-art topic modeling using transformer embeddings:

Pipeline:

  1. Embed: BERT/Sentence-BERT
  2. Reduce: UMAP
  3. Cluster: HDBSCAN
  4. Tokenize: c-TF-IDF

Advantages:

  • Captures semantic meaning
  • Handles short texts well
  • Dynamic topic modeling
  • No need to specify K

Limitations:

  • Computationally expensive
  • Requires pre-trained models
  • Less interpretable parameters
  • Can create many small topics
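The final pipeline step, c-TF-IDF, scores terms per cluster rather than per document. A minimal sketch of one common formulation (term frequency in a class, scaled by log(1 + A/f_t), where A is the average words per class and f_t the term's total frequency; the counts below are toy numbers):

```python
import numpy as np

# Term counts per cluster: rows = clusters, cols = terms (toy counts)
tf = np.array([
    [10., 2., 0., 1.],
    [ 1., 8., 9., 0.],
    [ 0., 1., 2., 7.],
])

A = tf.sum() / tf.shape[0]           # average number of words per cluster
term_total = tf.sum(axis=0)          # total frequency of each term
ctfidf = tf * np.log1p(A / (term_total + 1e-9))

# The highest-scoring terms become each cluster's topic description
top_terms = ctfidf.argmax(axis=1)
```

Terms that dominate one cluster but are rare overall get the highest scores, which is what turns raw clusters into readable topic labels.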

Top2Vec

Jointly learns word, document, and topic vectors:

  • Uses Doc2Vec for embeddings
  • Finds dense areas in embedding space as topics
  • Automatically determines number of topics
  • Topics represented as centroids in vector space
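The centroid idea can be sketched with toy 2-D vectors: a topic vector is the mean of a dense document cluster, and the topic's words are the word vectors nearest to it by cosine similarity. The vectors, words, and the hand-picked cluster below are all illustrative stand-ins for what Doc2Vec and the density clustering would produce:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy joint embedding space: document and word vectors (assumed pre-trained)
doc_vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
words = ["beach", "sunset", "stocks", "market"]
word_vecs = np.array([[1.0, 0.0], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]])

# A topic vector is the centroid of a dense cluster of document vectors
cluster = doc_vecs[:2]               # pretend the clustering found this group
topic_vec = cluster.mean(axis=0)

# Topic words = word vectors closest to the topic vector
sims = [cosine(topic_vec, w) for w in word_vecs]
ranked = [words[i] for i in np.argsort(sims)[::-1]]
```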

Neural Variational Topic Models

Combines neural networks with traditional topic modeling:

  • NVDM: Neural Variational Document Model
  • ProdLDA: Product-of-Experts LDA
  • ETM: Embedded Topic Model (uses word embeddings)

Multimodal Topic Modeling

Beyond Text: Multimodal Approaches

Modern topic modeling extends beyond text to handle multiple modalities:

Text + Image Topics

Discovers topics across textual and visual content:

  • mmLDA: Multi-modal LDA
  • CLIP-based: Uses CLIP embeddings
  • Applications: Social media analysis, news

Hierarchical Multimodal

Builds topic hierarchies across modalities:

  • Nested topic structures
  • Cross-modal topic alignment
  • Applications: Scientific literature

Example: Social Media Topic Modeling

  • Text: "Amazing sunset at the beach"
  • Image: [Sunset photo]
  • Topic: Travel/Nature

Comparing Topic Modeling Approaches

| Method | Type | Strengths | Limitations | Best For |
|---|---|---|---|---|
| LDA | Probabilistic | Interpretable, proven | Bag-of-words, fixed K | Long documents |
| BERTopic | Neural | Semantic, flexible K | Computational cost | Short texts, tweets |
| NMF | Matrix factorization | Fast, deterministic | Requires good preprocessing | Well-structured text |
| Top2Vec | Neural | Automatic K, semantic | Black box | Exploratory analysis |

Practical Considerations

Preprocessing Steps

  1. Text Cleaning:
    • Remove stopwords
    • Lemmatization/stemming
    • Handle special characters
  2. Feature Engineering:
    • TF-IDF weighting
    • N-gram extraction
    • Minimum document frequency
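Two of these steps, stopword removal and minimum document frequency, can be sketched in plain Python. The stopword set and threshold below are toy choices; real pipelines use a library list (e.g. NLTK or spaCy) and tune min_df to the corpus:

```python
from collections import Counter

STOPWORDS = {"the", "a", "at", "is", "of", "and"}   # toy stopword list

docs = [
    "the sunset at the beach is amazing",
    "a beautiful sunset and a calm beach",
    "the stock market is volatile",
]

# 1. Lowercase, tokenize, and drop stopwords
tokenized = [[w for w in d.lower().split() if w not in STOPWORDS] for d in docs]

# 2. Enforce a minimum document frequency (term must appear in >= 2 docs)
df = Counter(w for doc in tokenized for w in set(doc))
min_df = 2
filtered = [[w for w in doc if df[w] >= min_df] for doc in tokenized]
```

Note how aggressive the min_df cut can be on tiny corpora: the third document loses every token, a reminder to tune thresholds against corpus size.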

Evaluation Metrics

  • Coherence Score: Measures topic quality based on word co-occurrence
  • Perplexity: Model's ability to predict held-out documents
  • Topic Diversity: Uniqueness of words across topics
  • Human Evaluation: Manual assessment of topic interpretability
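Topic diversity is the simplest of these to compute: the fraction of unique words among the top-k words of all topics. A minimal sketch (the two toy topics are illustrative):

```python
def topic_diversity(topics, top_k=5):
    """Fraction of unique words among the top-k words of all topics (0..1)."""
    top_words = [w for topic in topics for w in topic[:top_k]]
    return len(set(top_words)) / len(top_words)

topics = [
    ["beach", "sunset", "sand", "wave", "sun"],
    ["stock", "market", "price", "trade", "sun"],   # "sun" repeats
]

diversity = topic_diversity(topics)   # 9 unique words out of 10 -> 0.9
```

A score near 1.0 means topics use distinct vocabularies; a low score signals redundant, overlapping topics.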

Choosing the Right Method

If you have long, well-structured documents: Start with LDA

If you have short texts (tweets, reviews): Use BERTopic or Top2Vec

If you need deterministic results: Choose NMF or LSA

If you have multimodal data: Consider CLIP-based approaches

If interpretability is crucial: Stick with traditional methods (LDA, NMF)

Real-World Applications

Content Recommendation

Discover user interests and recommend similar content based on topic preferences

Trend Analysis

Track emerging topics in social media, news, or scientific literature over time

Document Organization

Automatically categorize and organize large document collections

Customer Feedback

Analyze reviews and support tickets to identify common themes and issues

Research Discovery

Find related research papers and identify research trends in academic literature

Market Intelligence

Analyze competitor content and industry discussions to identify market trends

Implementation Example

Quick Start with Python

# Traditional LDA with Gensim
from gensim import corpora, models

# `tokenized_docs` is assumed to already exist: a list of token lists,
# e.g. [["topic", "modeling", "finds", "themes"], ...]
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Train LDA model
lda_model = models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,          # K: number of topics
    random_state=42,       # for reproducibility
    passes=10,             # full passes over the corpus
    alpha='auto',          # learn document-topic density from data
    per_word_topics=True
)

# Modern BERTopic (`documents` is a list of raw text strings)
from bertopic import BERTopic

model = BERTopic(
    language="english",
    calculate_probabilities=True,
    verbose=True
)
topics, probs = model.fit_transform(documents)

# Visualize topics (interactive inter-topic distance map)
model.visualize_topics()

Both approaches have their place; choose based on your specific requirements for interpretability, performance, and document characteristics.

Next Steps

Explore related NLP concepts: