Topic modeling

Methods for discovering and representing latent topics (semantic themes) in large collections of text documents. Topic models provide unsupervised organization of documents and enable analysis of thematic content over time or across documents.

Approaches

Probabilistic generative models:
Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA) model each document as a mixture of discrete topics and each topic as a distribution over words. Strengths: interpretability and a principled probabilistic framework. Weaknesses: they discretize a continuous semantic space, generalize poorly to corpora that differ from the training data, and require the number of topics to be specified in advance.
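
To make the mixture view concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. The toy corpus, the function name `lda_gibbs`, and the hyperparameter values (`alpha`, `beta`, `iters`) are assumptions chosen for illustration, not taken from any paper cited here; a real analysis would normally rely on an established library.

```python
import random

def lda_gibbs(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (vocab, theta, phi): theta[d] is the topic mixture of document d,
    phi[k] is the word distribution of topic k (indexed like vocab).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [[0] * V for _ in range(num_topics)]     # tokens of word v in topic k
    topic_total = [0] * num_topics                        # tokens assigned to topic k
    assign = []                                           # topic of every token

    # Random initial assignment of each token to a topic.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assign[d][i]
                # Remove the token, then resample its topic from the conditional.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][v] + beta) / (topic_total[t] + V * beta) for v in range(V)]
        for t in range(num_topics)
    ]
    return vocab, theta, phi

# Hypothetical toy corpus: two health documents, two election documents.
docs = [
    "vaccine virus vaccine health".split(),
    "virus health vaccine".split(),
    "election ballot vote election".split(),
    "vote ballot election".split(),
]
vocab, theta, phi = lda_gibbs(docs, num_topics=2)
```

Each row of `theta` is a document's topic mixture and each row of `phi` is a topic's word distribution. Note that `num_topics` has to be supplied by hand, which is the weakness noted above.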

Embedding-based methods:
Methods such as Top2Vec use semantic embeddings (word2vec, doc2vec, transformer embeddings) to represent documents and words in a shared continuous semantic space. Topics emerge as dense regions of similar documents, and topic words are identified as the nearest neighbors of each topic in embedding space. Strengths: automatic discovery of the topic count, no stop-word removal required, and topics that are more informative than those of LDA/PLSA when measured by information gain. Weaknesses: high memory requirements for large corpora and the computational cost of embedding entire documents.
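
The idea of "dense regions plus nearest-neighbor words" can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 2-D vectors are invented toy embeddings, and a greedy cosine-threshold grouping stands in for the UMAP and HDBSCAN steps that Top2Vec actually uses. The function names and the `threshold` value are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cluster_documents(doc_vecs, threshold=0.9):
    """Greedy grouping (a simple stand-in for density-based clustering):
    a document joins the first cluster whose centroid is similar enough,
    otherwise it starts a new cluster. The number of clusters is not fixed."""
    clusters = []
    for i, vec in enumerate(doc_vecs):
        for members in clusters:
            if cosine(vec, centroid([doc_vecs[j] for j in members])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def topic_words(topic_vec, word_vecs, top_n=2):
    """Words whose embeddings are closest to the topic vector."""
    ranked = sorted(word_vecs, key=lambda w: cosine(topic_vec, word_vecs[w]), reverse=True)
    return ranked[:top_n]

# Invented toy embeddings; documents and words share one vector space.
word_vecs = {
    "vaccine": [0.9, 0.1], "virus": [0.8, 0.2],
    "election": [0.1, 0.9], "ballot": [0.2, 0.8],
}
doc_vecs = [[0.85, 0.15], [0.9, 0.2], [0.15, 0.85], [0.1, 0.95]]

clusters = cluster_documents(doc_vecs)
topics = [
    topic_words(centroid([doc_vecs[j] for j in members]), word_vecs)
    for members in clusters
]
```

Here the topic count falls out of the grouping step rather than being set in advance, and each topic is described by the words nearest to the centroid of its documents.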

Neural topic models:
Variational autoencoders (VAEs) and other neural approaches combine neural networks with probabilistic frameworks to give flexible, differentiable topic models. They bridge discrete and continuous representations.

Applications in misinformation research

Topic modeling is foundational for analyzing the thematic structure of misinformation:

  • Propaganda detection: Identifying dominant narrative frames and themes in propaganda or coordinated inauthentic behavior.
  • Content characterization: Understanding what topics dominate false, misleading, or authentic news sources (e.g., which topics are more vulnerable to misinformation).
  • Temporal analysis: Tracking how misinformation themes evolve and shift over time.
  • Source profiling: Characterizing organizations or sources by their topical emphasis and consistency.
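
For the temporal analysis use case, one simple approach is to average each topic's weight over the documents published in each period. The sketch below assumes that per-document topic weights are already available from some topic model; the function name `topic_prevalence_by_period`, the period labels, and the weights are all hypothetical illustration data.

```python
from collections import defaultdict

def topic_prevalence_by_period(records, num_topics):
    """Mean topic weight per period.

    records: iterable of (period_label, topic_weights) pairs, one per document.
    Returns {period_label: [mean weight of topic 0, topic 1, ...]}.
    """
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for period, weights in records:
        for k, w in enumerate(weights):
            sums[period][k] += w
        counts[period] += 1
    return {p: [s / counts[p] for s in sums[p]] for p in sorted(sums)}

# Hypothetical document-topic weights with month labels.
records = [
    ("2020-01", [0.8, 0.2]),
    ("2020-01", [0.6, 0.4]),
    ("2020-02", [0.3, 0.7]),
    ("2020-02", [0.1, 0.9]),
]
prevalence = topic_prevalence_by_period(records, num_topics=2)
```

Comparing the rows of `prevalence` across periods shows which themes gain or lose share over time; in this toy data the second topic overtakes the first between the two months.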

Key papers in this wiki

  • [[2020-angelov-top2vec]] — Unsupervised topic discovery via semantic embeddings, UMAP dimension reduction, and HDBSCAN clustering; automatically determines topic count and produces more informative topics than LDA/PLSA.
  • Corporate funding and ideological polarization about climate change — Applies Structural Topic Modeling to 40,785 texts on climate change; demonstrates how corporate funding influences thematic content of polarization campaigns.