Content-based fake news detection

Content-based methods detect fake news by analyzing the news article itself (its textual and/or visual content), without relying on how the article propagated on social media. They are the natural choice when social context is unavailable (e.g., immediately after publication), and they serve as the standard baselines against which social-context approaches are compared.

Three families:

Textual feature engineering: Manually designed features extracted from news text, including LIWC psycholinguistic categories, n-gram models, syntactic features (context-free grammars), readability scores, and rhetorical structure (RST). Classical classifiers (SVM, random forest, logistic regression) are then trained on these features. See feature engineering.
Neural textual methods: End-to-end deep models applied to news text, such as Text-CNN, bi-LSTMs, and attention mechanisms. These avoid manual feature design and learn representations directly from raw text.
Multimodal methods: Extend neural approaches to news images or other non-textual content. See multimodal detection.
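As a rough illustration of the feature-engineering family above, the sketch below computes a few hand-crafted features from raw text: unigram and bigram counts plus some crude style statistics. The function names, the naive regex tokenizer, and the particular statistics are assumptions made for this example; they are not taken from any of the cited papers, and real systems use richer resources such as LIWC or parser output.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (a deliberately naive regex, for illustration)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def style_features(text):
    """Hypothetical feature extractor: n-gram counts plus simple style statistics."""
    tokens = tokenize(text)
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens = max(len(tokens), 1)  # guard against division by zero
    features = {
        "avg_sentence_len": len(tokens) / max(len(sentences), 1),
        "avg_word_len": sum(len(t) for t in tokens) / n_tokens,
        "type_token_ratio": len(set(tokens)) / n_tokens,
        "exclamation_rate": text.count("!") / n_tokens,
    }
    # Unigram and bigram counts, keyed by the joined n-gram string.
    for n in (1, 2):
        for gram, count in Counter(ngrams(tokens, n)).items():
            features[f"{n}gram={' '.join(gram)}"] = count
    return features


if __name__ == "__main__":
    example = "Shocking news! You will not believe this news."
    for name, value in sorted(style_features(example).items()):
        print(name, value)
```

The resulting dictionary can be vectorized and passed to any classical classifier (SVM, random forest, logistic regression); the sparse, named features are what make this family comparatively easy to inspect.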

Source-credibility features (author metadata, publication history) straddle the content/context boundary: they describe the news source rather than article content, but are available without social propagation data.

Key papers

Nan et al. (2021) — MDFEND: Multi-domain fake news detection using mixture-of-experts with adaptive domain gating; addresses domain shift in word usage and propagation patterns across 9 Weibo domains; achieves F₁ 0.9137, outperforming single-domain and cross-domain baselines; introduces the Weibo21 benchmark dataset.
Zhou et al. (2019) — WSDM Tutorial on Fake News Detection: surveys content-based (knowledge and style) perspectives for detection; organizes textual features into knowledge-based (relational fact extraction) and style-based (linguistic, forensic) families.
Zhou et al. (2020) — Fake News Early Detection: An Interdisciplinary Study: theory-driven multi-level feature engineering (lexicon, syntax, semantic, discourse) for early detection without propagation data; ≈89% accuracy and F₁ 0.892/0.879 on PolitiFact/BuzzFeed, outperforming propagation-based and hybrid baselines.
Sitaula et al. (2019) — Credibility-based Fake News Detection: 3 source features outperform 23 content features; macro-F₁ 0.77–0.83 on PolitiFact + BuzzFeed.
Zhou et al. (2020) — SAFE: Multi-modal (text + image) content; cross-modal similarity; surpasses LIWC text baseline by large margins (F₁ +0.081 on PolitiFact).
Khattar et al. (2019) — MVAE: Multimodal VAE learning shared text-image representations; jointly trains reconstruction (VAE) with classification; 74.5% / 82.4% accuracy on Twitter / Weibo.
Zhou et al. (2023) — HERO: Hierarchical Recursive Neural Network; integrates constituency and discourse parsing into a unified linguistic tree; Bi-GRU aggregation bottom-up; 0.866–0.896 AUC on ReCOVery and MM-COVID, outperforming all neural baselines.
Shu et al. (2019) — dEFEND: Hierarchical attention on news content sentences; also jointly models user comments; produces explainable fake news predictions via top-k sentences; achieves 0.904 accuracy on PolitiFact.
Singhania et al. (2023) — 3HAN: A Deep Neural Network for Fake News Detection: Three-level hierarchical attention network treating headlines, sentences, and words as separate hierarchical levels. Word-level attention extracts relevant words within sentences; sentence-level attention identifies informative sentences in article bodies; headline-body attention captures stance between headline and body. Achieves 96.77% accuracy with headline-based pre-training; provides interpretable attention visualizations for human fact-checkers.
Mayank, Sharma & Sharma (2021) — DEAP-FAKED: Combines biLSTM title encoding with knowledge graph entity embeddings (Wikidata + ComplEx) to detect fake news from titles alone. Addresses dataset bias through systematic removal of biased terms; achieves 88% and 78% F1 on two datasets; demonstrates entity-based features provide complementary signal to text.
Kaliyar, Goswami & Narang (2021) — FakeBERT: Neural content-based method combining BERT embeddings with parallel 1D CNNs for multi-scale feature extraction. Achieves 98.90% accuracy on a real-world 2016 election dataset, demonstrating the effectiveness of contextualized transformer embeddings for capturing semantic differences between true and fake news on social media.
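Several entries above rely on learned weighting schemes. As one example, the mixture-of-experts idea with domain gating mentioned for MDFEND can be sketched as follows. This is a simplified, stdlib-only illustration, not the authors' implementation: the toy linear experts, the `gated_mixture` name, and the choice to concatenate the domain and text vectors as gate input are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vector):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def gated_mixture(text_vec, domain_vec, expert_weights, gate_weights):
    """Hypothetical helper: mix expert outputs using a domain-aware gate."""
    # Each expert is a toy linear map over the text representation.
    expert_outputs = [linear(w, text_vec) for w in expert_weights]
    # The gate sees the domain embedding concatenated with the text vector
    # and produces one weight per expert (weights sum to 1).
    gate = softmax(linear(gate_weights, domain_vec + text_vec))
    dim = len(expert_outputs[0])
    mixed = [
        sum(g * out[d] for g, out in zip(gate, expert_outputs))
        for d in range(dim)
    ]
    return mixed, gate


if __name__ == "__main__":
    experts = [[[1.0, 0.0], [0.0, 1.0]],   # expert 0: identity map
               [[0.0, 0.0], [0.0, 0.0]]]   # expert 1: zero map
    gate_w = [[10.0, 0.0, 0.0, 0.0],       # favors expert 0 for domain [1, 0]
              [0.0, 0.0, 0.0, 0.0]]
    print(gated_mixture([1.0, 2.0], [1.0, 0.0], experts, gate_w))
```

The point of the gate is that the same article representation can be routed to different experts depending on its domain, which is one way to handle the cross-domain shift in word usage noted above. In a trained model the expert and gate weights would be learned jointly with the classifier.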

Connections

Social-context detection is the complementary paradigm; it requires propagation and user data, which are not yet available at publication time.
Multimodal detection is the subset of content-based methods exploiting visual as well as textual content.
Feature engineering is the dominant methodology within text-only content-based detection.