Multimodal fake news detection

Multimodal approaches to fake news detection exploit more than one type of information in news content, most commonly the textual (headline, article body) and visual (attached images) modalities. The motivation is that fake news creators frequently pair misleading claims with irrelevant or emotionally manipulative images, which creates a cross-modal mismatch signal that text-only approaches cannot observe.

Two broad families:

Concatenation-based (early fusion): Extract independent representations of each modality and concatenate them before a shared classifier. Examples: att-RNN (Jin et al., 2017), TI-CNN (Yang et al., 2018), MVAE (Khattar et al., 2019). These models do not explicitly model the relationship between modalities.
Similarity-aware (cross-modal): Explicitly model the relationship between modalities as an additional feature, on the intuition that low text-image similarity is itself a fake signal. SAFE (Zhou et al., 2020) is the canonical example.
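
The difference between the two families can be sketched in a few lines. This is a minimal illustration, not any cited paper's implementation: it assumes the text and image encoders have already produced vectors in a shared space of equal dimension, and the function names (`concat_fusion`, `similarity_aware_features`) and toy vectors are hypothetical.

```python
import math


def concat_fusion(text_vec, image_vec):
    """Early fusion: join the two modality vectors end to end."""
    return list(text_vec) + list(image_vec)


def cosine(u, v):
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_aware_features(text_vec, image_vec):
    """Concatenated features plus one explicit text-image similarity feature."""
    return concat_fusion(text_vec, image_vec) + [cosine(text_vec, image_vec)]


# Toy vectors (assumed to live in a shared embedding space).
text_vec = [0.9, 0.1, 0.0]
matching_image = [0.8, 0.2, 0.0]
mismatched_image = [0.0, 0.1, 0.9]

# The last feature is high for the matching pair and near zero for the mismatch.
print(similarity_aware_features(text_vec, matching_image)[-1])
print(similarity_aware_features(text_vec, mismatched_image)[-1])
```

A classifier trained on the concatenated vector alone has to discover any mismatch signal implicitly, whereas the extra similarity feature hands it over directly.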

A recurring architectural choice is whether to feed images directly to a standard pre-trained vision CNN (e.g., VGG-19), or to first map them into an embedding space comparable with text via an image-captioning model, which makes the cross-modal comparison more principled.

Key papers

Zlatkova et al. (2019) — Fact-Checking Meets Fauxtography: Image-claim pair verification via reverse image search; extracts image features (Google tags, URL domains, media source credibility), claim features (TF-IDF), and relationship features (claim-article similarity via cosine and embedding methods); 1,233 image-claim pairs from Snopes and Reuters; achieves 80.1% accuracy; demonstrates that web-based features (source credibility, URL domains) outperform image forensic features (splice detection, EXIF metadata, ELA).
A Deep Learning Approach for Multimodal Deception Detection: Detects deception in courtroom videos using multimodal fusion (3D-CNN video, openSMILE audio, CNN-based text, micro-expressions); demonstrates that neural feature extraction outperforms hand-crafted features; achieves 96.14% accuracy on 121 courtroom videos. Included here as related foundational work: its feature fusion strategies are applicable to multimodal fake news detection.
Mining Disinformation and Fake News: Concepts, Methods, and Recent Advancements: Comprehensive tutorial covering multimodal approaches to fake news detection; reviews visual features (statistical, content-based, neural CNN features) and multimodal fusion strategies; discusses how fake news creators pair misleading claims with irrelevant images and how multimodal detectors exploit text-image mismatches.
TI-CNN: Convolutional Neural Networks for Fake News Detection: Early concatenation-based approach extracting explicit text features (word counts, question marks, capital letters, negations, pronouns) and explicit image features (resolution, face count), then learning latent text and image representations via parallel CNNs; F₁ 0.9210 on election news dataset.
r/Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection: Introduces Fakeddit, a large-scale multimodal dataset with 1.06M Reddit submissions, 64% text+image pairs, and 2-way/3-way/6-way labels; demonstrates that multimodal models (BERT + ResNet50 with maximum fusion) achieve 85.88% 6-way accuracy, ~10 percentage points above text-only baselines; identifies satire and imposter content as hardest categories.
Li et al. (2020) — MM-COVID: Multilingual and multimodal COVID-19 dataset with 3,981 fake and 7,192 trustworthy news pieces in six languages, plus the associated social context; benchmarks text-only (dEFEND'C), social-only (dEFEND'N), and combined (dEFEND) approaches; demonstrates that multimodal (text+social) features achieve 0.91–0.96 accuracy and enable cross-lingual transfer.
Hameleers et al. (2020) — A Picture Paints a Thousand Lies?: Experimental evidence that visual disinformation (text+image tweets about refugees, school shootings) is perceived as significantly more credible than text-only disinformation; fact-checkers effectively reduce credibility regardless of modality; motivated reasoning moderates effectiveness.
Wang et al. (2018) — EANN: Event Adversarial Neural Networks using a minimax game to learn event-invariant features; feature extractor couples with a fake news detector and tries to fool an event discriminator; achieves 71.5% / 82.7% accuracy on Twitter / Weibo; first to formulate fake news detection on new events as a transfer learning problem.
Khattar et al. (2019) — MVAE: Variational autoencoder that learns shared text-image representations via joint VAE training (reconstruction + classification); achieves 74.5% / 82.4% accuracy on Twitter / Weibo; demonstrates that explicit reconstruction loss helps discover cross-modal correlations beyond attention-based fusion.
Zhou et al. (2020) — SAFE: Similarity-aware multi-modal detection; modified cosine similarity between Text-CNN text and image2sentence visual representations; F₁ 0.896 / 0.895 on PolitiFact / GossipCop; first to show cross-modal similarity outperforms fusion alone.
Zhou et al. (2020) — ReCOVery: introduces the ReCOVery dataset with 2,029 COVID-19 news articles + 140,820 tweets; benchmarks multimodal SAFE (F₁ 0.833 reliable / 0.672 unreliable) against single-modal baselines, confirming multimodal features outperform text-only approaches in a pandemic-specific domain.
Yang et al. (2020) — CHECKED: Chinese COVID-19 Weibo dataset with image URLs (up to 18 per post), video URLs, and 1.87M repost threads; multimodal benchmarks are explicitly deferred as future work, making this the primary open target for applying SAFE-like methods to Chinese-language content.
Silva et al. (2021) — Cross-domain Multimodal Detection: Addresses the practical problem that multimodal models trained on one domain (politics, entertainment) fail on others; proposes unsupervised domain discovery via propagation networks and supervised domain-agnostic classification preserving both domain-specific and cross-domain knowledge; LSH-based instance selection reduces labeling cost; achieves 7.55% F₁ improvement on rarely-appearing domains.
Wang et al. (2021) — MetaFEND: Combines meta-learning with neural processes for few-shot fake news detection on emergent events; proposes hard attention (Straight-Through Gumbel SoftMax) to select the most informative post despite class imbalance, and label embedding treating categorical labels as semantic vectors; achieves 4–5% accuracy improvements on Twitter and Weibo datasets in 5-shot and 10-shot settings.
Cao et al. (2025) — SLIM: introduces SLIM_MULTIMODAL, which combines keyword sets with named-entity tags or article metadata (title, author) rather than images; shows that keyword+title fusion outperforms keywords alone on ReCOVery, establishing a text-only multimodal paradigm.
A Multi-Modal Method for Satire Detection using Textual and Visual Cues: Applies ViLBERT (Vision & Language BERT) to detect satirical news articles using headline-image pairs; achieves 93.8% accuracy on a dataset of 10,000 articles (4,000 satirical, 6,000 mainstream); demonstrates that early fusion via co-attention outperforms simple concatenation; notably, image forensics (ELA+CNN) alone fails at the task, highlighting the importance of joint multimodal reasoning.
Vo & Lee (2020) — Where Are the Facts? Searching for Fact-checked Information: Applies multimodal retrieval to find fact-checking articles matching original tweets; proposes a Multimodal Attention Network (MAN) that jointly models text-image interactions via GloVe/ELMo embeddings and ResNet50 visual features; achieves a 4.7% NDCG@1 improvement over text-only baselines on Snopes; demonstrates that the multimodal signal improves ranking quality for fact-checked evidence retrieval.
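
Two of the operations named in the entries above are simple enough to sketch directly: the "modified cosine similarity" in the SAFE entry and the "maximum fusion" in the Fakeddit entry. The sketch below is an illustration under stated assumptions, not either paper's code. It assumes the modification is a rescaling of cosine similarity from [-1, 1] into [0, 1], and that maximum fusion means an element-wise maximum over equal-length text and image vectors; the function names are hypothetical.

```python
import math


def rescaled_cosine(u, v):
    """Cosine similarity shifted into [0, 1]: (u.v + |u||v|) / (2 |u||v|).

    Assumption: this is one way to read "modified cosine similarity".
    Returns the neutral value 0.5 when either vector has zero norm.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    denom = 2.0 * norm_u * norm_v
    if denom == 0.0:
        return 0.5
    return (dot + norm_u * norm_v) / denom


def max_fusion(text_vec, image_vec):
    """Element-wise maximum of two equal-length modality vectors."""
    if len(text_vec) != len(image_vec):
        raise ValueError("max fusion needs vectors of equal length")
    return [max(t, i) for t, i in zip(text_vec, image_vec)]


print(rescaled_cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(max_fusion([0.2, 0.9], [0.5, 0.1]))
```

A value in [0, 1] is convenient because it can be treated like a probability that the text and image agree, which fits naturally into a cross-entropy style loss. Unlike concatenation, element-wise maximum keeps the fused vector the same size as each input.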

Connections

Content-based detection is the parent category: all multimodal methods are content-based but not vice versa.
Feature engineering provides the text-only baselines (LIWC, n-gram) that multimodal systems typically compare against.
FakeNewsNet (PolitiFact + GossipCop) is a widely used benchmark for this line of work, with both text and image fields available; SAFE reports its headline results on it.