Text classification¶

Text classification is the task of automatically assigning one or more labels or categories to a text document or sequence (e.g., a headline, article, tweet, or comment). It is a foundational NLP task and underlies most misinformation detection systems.

Classification approaches¶

Hand-crafted features + traditional ML: Feature engineering (word frequencies, sentiment lexicons, linguistic patterns) combined with classifiers like SVM, logistic regression, or random forests. Effective but labor-intensive and domain-specific.
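As a minimal sketch of this approach, the example below computes three hand-crafted features for a headline and fits a logistic regression with plain gradient descent. The lexicon `SENSATIONAL`, the feature choices, the toy training headlines, and the hyperparameters are all illustrative assumptions, not taken from any particular system; in practice a library classifier and a curated lexicon would be used.

```python
import math
import re

# Hypothetical mini-lexicon; a real system would use a curated sentiment or clickbait lexicon.
SENSATIONAL = {"shocking", "unbelievable", "secret", "miracle", "exposed"}

def extract_features(text):
    """Map a headline to a small vector of hand-crafted features."""
    tokens = re.findall(r"[A-Za-z']+", text)
    n = max(len(tokens), 1)
    return [
        1.0,                                                   # bias term
        text.count("!") / n,                                   # exclamation marks per token
        sum(t.isupper() and len(t) > 1 for t in tokens) / n,   # share of ALL-CAPS tokens
        sum(t.lower() in SENSATIONAL for t in tokens) / n,     # lexicon hits per token
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights with batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - label
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def predict(w, text):
    """Return the estimated probability that the headline is sensationalized."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, extract_features(text))))

# Toy training set (invented for illustration): 1 = sensationalized, 0 = ordinary.
TRAIN = [
    ("SHOCKING secret doctors don't want you to know!!!", 1),
    ("Unbelievable miracle cure EXPOSED!", 1),
    ("You will not believe this SHOCKING video!", 1),
    ("City council approves new transit budget", 0),
    ("Researchers publish study on sleep and memory", 0),
    ("Central bank holds interest rates steady", 0),
]
X = [extract_features(text) for text, _ in TRAIN]
y = [label for _, label in TRAIN]
weights = train_logistic_regression(X, y)

print(predict(weights, "SHOCKING miracle exposed!"))
print(predict(weights, "Parliament debates housing bill"))
```

The sketch also shows why the approach is labor-intensive: every signal the classifier can use has to be anticipated and coded by hand in `extract_features`.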

Word embeddings + shallow networks: Pre-trained embeddings (word2vec, GloVe) fed to fully connected networks or simple RNNs. Reduces feature engineering while capturing semantic information.
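The pipeline can be sketched as: look up an embedding per token, average the vectors, and pass the result through a small fully connected network. In the example below, the 3-dimensional `EMBEDDINGS` table is a hypothetical stand-in for word2vec or GloVe vectors, and the layer weights are set by hand purely for illustration; in a real model they would be learned by backpropagation.

```python
import math

# Toy vectors standing in for pre-trained embeddings (hypothetical values).
EMBEDDINGS = {
    "shocking": [0.9, 0.1, 0.0],
    "miracle":  [0.8, 0.2, 0.1],
    "cure":     [0.6, 0.3, 0.1],
    "council":  [0.0, 0.8, 0.3],
    "approves": [0.1, 0.7, 0.1],
    "budget":   [0.1, 0.9, 0.2],
}
LABELS = ["clickbait", "news"]

# Hand-set weights for illustration only; normally learned from labeled data.
W1 = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0]]   # hidden layer: 3 inputs -> 2 units
B1 = [0.0, 0.0]
W2 = [[2.0, 0.0], [0.0, 2.0]]               # output layer: 2 units -> 2 classes
B2 = [0.0, 0.0]

def embed_mean(text):
    """Average the embeddings of known tokens; unknown tokens are skipped."""
    vectors = [EMBEDDINGS[t] for t in text.lower().split() if t in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def dense(weights, bias, x):
    """Fully connected layer: one dot product plus bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(text):
    hidden = [max(0.0, h) for h in dense(W1, B1, embed_mean(text))]  # ReLU
    probs = softmax(dense(W2, B2, hidden))
    return LABELS[probs.index(max(probs))], probs

print(classify("shocking miracle cure"))
print(classify("council approves budget"))
```

Mean pooling discards word order, which is one reason this family of models is usually outperformed by architectures that model the sequence.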

Deep neural networks: CNNs capture local n-gram patterns, RNNs/LSTMs model sequential dependencies, and attention mechanisms weight the most informative tokens, which can also aid interpretability. These models learn hierarchical feature representations automatically.
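To illustrate how a CNN picks up local n-gram patterns, the sketch below slides a single width-2 filter over a sequence of token embeddings and applies max-over-time pooling. The 2-dimensional embeddings and the filter weights are hand-set assumptions chosen so that the filter fires on the bigram "not good"; a trained CNN would learn many such filters from data.

```python
# Toy 2-dimensional embeddings (hypothetical); unknown tokens map to the zero vector.
EMBEDDINGS = {"not": [1.0, 0.0], "good": [0.0, 1.0]}

# One width-2 filter: row 0 matches "not", row 1 matches "good".
# With a bias of -1 the ReLU output is positive only when both match in order.
KERNEL = [[1.0, 0.0], [0.0, 1.0]]
BIAS = -1.0

def embed(text):
    return [EMBEDDINGS.get(token, [0.0, 0.0]) for token in text.lower().split()]

def conv1d(embedded, kernel, bias=0.0):
    """Slide a width-k filter over the sequence (stride 1, no padding), then apply ReLU."""
    k = len(kernel)
    feature_map = []
    for start in range(len(embedded) - k + 1):
        window = embedded[start:start + k]
        activation = bias + sum(
            w * x
            for kernel_row, token_vec in zip(kernel, window)
            for w, x in zip(kernel_row, token_vec)
        )
        feature_map.append(max(0.0, activation))
    return feature_map

def max_over_time(feature_map):
    """Keep the strongest response, regardless of where in the text it occurred."""
    return max(feature_map) if feature_map else 0.0

for sentence in ["the film was not good", "not good at all", "good film not bad"]:
    feature = max_over_time(conv1d(embed(sentence), KERNEL, BIAS))
    print(sentence, "->", feature)
```

Because of the pooling step, the filter responds to the bigram wherever it appears, but not to the same two words in a different order or separated by other tokens.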

Transformer-based models: Pre-trained language models such as BERT and GPT achieve state-of-the-art results when fine-tuned on task-specific data. Encoder models like BERT capture bidirectional context, while decoder models like GPT process text left to right.
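What "bidirectional context" means can be sketched with a stripped-down self-attention function. The example below contrasts bidirectional attention, as used in encoder models like BERT, with the causal (left-to-right) variant used by decoder-only models. It is a simplification under stated assumptions: queries, keys, and values are the input vectors themselves (no learned projections), there is a single head, and the 2-dimensional token vectors are invented.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors, causal=False):
    """Scaled dot-product self-attention with identity projections (a simplification)."""
    d = len(vectors[0])
    outputs = []
    for i, query in enumerate(vectors):
        # Causal attention lets position i see positions 0..i only;
        # bidirectional attention lets it see the whole sequence.
        visible = vectors[: i + 1] if causal else vectors
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in visible]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)])
    return outputs

# Two toy sequences that differ only in their last token.
tokens_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]

causal_a = self_attention(tokens_a, causal=True)
causal_b = self_attention(tokens_b, causal=True)
bidir_a = self_attention(tokens_a)
bidir_b = self_attention(tokens_b)

# Does the representation of the FIRST token change when the LAST token changes?
print("causal, first token unchanged:", causal_a[0] == causal_b[0])
print("bidirectional, first token unchanged:", bidir_a[0] == bidir_b[0])
```

With the causal mask the first position never sees the later token, so its output is identical for both sequences. With bidirectional attention every position is conditioned on the full text, which suits classification, where the whole input is available up front.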

Applications in misinformation detection¶

Fake news detection: Classifying articles or headlines as true/false or misleading/legitimate
Clickbait detection: Identifying sensationalized or misleading headlines
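Rumor stance classification: Categorizing replies as supporting, denying, querying, or commenting on a rumor
Satire detection: Distinguishing satirical news, which is written to entertain or criticize rather than to deceive, from fabricated news it can be mistaken for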
Propaganda detection: Detecting propaganda techniques in text

Common datasets and benchmarks¶

LIAR: 12.8K short political statements from PolitiFact with six labels: true, mostly-true, half-true, barely-true, false, pants-fire
FakeNewsNet: Multi-domain fake news dataset with articles and social context
Claim verification: Datasets that pair claims with evidence for retrieval and natural language inference, such as FEVER
Stance detection: Twitter- and Reddit-based datasets, such as those from the SemEval RumourEval shared tasks

Key papers¶

We used Neural Networks to Detect Clickbaits: You won't believe what happened Next! — Bidirectional LSTM for clickbait detection; demonstrates deep learning effectiveness without hand-crafted features
A Survey on Hate Speech Detection using Natural Language Processing — Comprehensive survey of feature engineering and classification methods for hate speech detection; systematic taxonomy of surface features, embeddings, linguistic features, and knowledge-based approaches
Liar, Liar Pants on Fire: A New Benchmark Dataset for Fake News Detection — LIAR dataset for claim veracity classification
FakeNewsNet: A Data Repository with News Content, Social Context and Spatiotemporal Information for Studying Fake News on Social Media — FakeNewsNet dataset with social context for fake news detection
A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities — Comprehensive survey of fake news detection methods including classification approaches

Related topics¶

Deep learning: neural architectures for classification
Natural Language Processing: field encompassing text classification
Fake content detection: problem category
Recurrent Neural Networks: sequential models for text
