Multilingual fake news detection¶

Multilingual fake news detection addresses the challenge of identifying misinformation in languages beyond English. The key motivation is that misinformation spreads globally and across linguistic communities — COVID-19 conspiracy theories, political disinformation, and health falsehoods propagate in Spanish, Portuguese, Hindi, French, Italian, Mandarin, and other languages — yet most existing detection datasets and models focus exclusively on English.

Two complementary research directions:

Parallel multilingual datasets: Creating parallel corpora of fake and real news in multiple languages simultaneously, enabling direct comparison of language-invariant vs. language-specific misinformation patterns (e.g., MM-COVID)
Cross-lingual transfer learning: Training detection models on high-resource languages (English) and transferring to low-resource languages that lack large labeled datasets
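
The cross-lingual transfer direction is commonly evaluated zero-shot: a model is fit on the high-resource language only and scored on languages it never saw during training. The following is a minimal sketch of that data split, not a procedure taken from any of the papers below. The `(text, language, label)` record layout, the function name `zero_shot_split`, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (text, ISO language code, label) with 1 = fake, 0 = real.
Example = Tuple[str, str, int]


def zero_shot_split(
    examples: Iterable[Example], source_lang: str
) -> Tuple[List[Example], Dict[str, List[Example]]]:
    """Return (training set in source_lang, one held-out test set per other language)."""
    train: List[Example] = []
    test_by_lang: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        lang = example[1]
        if lang == source_lang:
            train.append(example)
        else:
            # Target-language items are never used for fitting, only for scoring.
            test_by_lang[lang].append(example)
    return train, dict(test_by_lang)


# Toy records invented for illustration; they are not drawn from any dataset.
corpus = [
    ("Vaccine contains microchips", "en", 1),
    ("Health agency publishes trial results", "en", 0),
    ("La vacuna contiene microchips", "es", 1),
    ("टीके के परीक्षण के नतीजे जारी", "hi", 0),
]
train, test_by_lang = zero_shot_split(corpus, source_lang="en")
```

Reporting one score per target language, rather than a pooled score, keeps a strong result on one language from hiding a weak result on another.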

Distinctive challenges¶

Language-specific linguistic patterns: Fake news may employ different rhetorical devices, emotional language, and syntactic structures across languages. Patterns learned from English text may not transfer.
Fact-checking source bias: Most fact-checking agencies are English-based (e.g., PolitiFact, Snopes). Labeled datasets for other languages are sparse and may be biased toward what English-language fact-checkers prioritize.
Social media platform variation: Misinformation spread patterns differ by platform and language community. Russian disinformation on Facebook/Twitter may behave differently than Hindi misinformation on WhatsApp.
Code-switching and transliteration: In multilingual communities, users mix languages (e.g., Spanish-English in US Latino communities) or transliterate text (e.g., Hindi text written in Latin script), complicating feature extraction.
Cultural context: The credibility of health claims, political narratives, and conspiracy theories varies by cultural context and prior beliefs, making transfer across language groups non-trivial.
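
The code-switching and transliteration challenge can be made concrete with a small script check. The sketch below counts letters per Unicode script (taken from the first word of each character's Unicode name) and flags text in which two scripts both carry a meaningful share. The function names and the `min_share` threshold of 0.2 are assumptions chosen for illustration, not values from the literature.

```python
import unicodedata
from collections import Counter


def script_profile(text: str) -> Counter:
    """Count letters per script, keyed by the first word of the Unicode character name."""
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():  # skips digits, punctuation, spaces, and combining marks
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1  # e.g. "LATIN", "DEVANAGARI", "CJK"
    return counts


def is_script_mixed(text: str, min_share: float = 0.2) -> bool:
    """True when at least two scripts each account for min_share of the letters."""
    counts = script_profile(text)
    total = sum(counts.values())
    if total == 0:
        return False
    return sum(1 for n in counts.values() if n / total >= min_share) >= 2


mixed = is_script_mixed("यह खबर fake है")           # Devanagari with an English word
romanized = is_script_mixed("yeh khabar fake hai")  # Hindi written in Latin script
```

The second call illustrates the limitation: transliterated Hindi is entirely Latin script, so a script count alone cannot detect it, and a language-identification step would be needed as well.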

Key papers¶

Quelle & Bovet (2023) — The Perils & Promises of Fact-checking with Large Language Models — First large-scale evaluation of LLM-based fact-checking across 16+ languages, using the Data Commons dataset with 78+ fact-checking organizations. Demonstrates severe performance drops for non-English languages when claims are tested in their original language (Turkish 84→81%, Thai 48→54%). Shows that translating claims to English before verification improves accuracy by 5–20+ percentage points, which points to the dominance of English in the training data of GPT models.
Li et al. (2020) — MM-COVID: multilingual and multimodal COVID-19 dataset with 3,981 fake news pieces and 7,192 pieces of trustworthy news in six languages (English, Spanish, Portuguese, Hindi, French, Italian). Demonstrates that social context (user profiles, engagement patterns) provides language-invariant signals even in zero-resource cross-lingual transfer settings. dEFEND, which combines text and social context, achieves 0.91–0.96 accuracy in high-resource settings and 0.76–0.90 in low-resource settings.
Du et al. (2021) — CrossFake: Addresses the challenge of COVID-19 misinformation in Chinese, a language with limited fact-checked datasets. Trains on English COVID-19 news and applies the model to Chinese news via machine translation. Proposes a BERT-based architecture with sub-text slicing to preserve information across long documents. Achieves 75% accuracy on 200 manually annotated Chinese articles, demonstrating practical cross-lingual transfer. Identifies translation quality and information location as key bottlenecks for cross-lingual detection.
Yang et al. (2020) — CHECKED: the first Chinese-language COVID-19 fake news dataset (2,104 Weibo posts, 344 fake / 1,760 real). Addresses the gap that existing COVID-19 misinformation datasets are either English-centric (ReCOVery) or do not cover Chinese (MM-COVID), despite the pandemic's global spread and significance in China.
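
The sub-text slicing idea mentioned for CrossFake can be illustrated as follows. This is a generic sketch of one common way to handle documents longer than an encoder's input limit, not the authors' exact procedure: the article is cut into overlapping token windows, each window is scored separately, and the scores are averaged. The window of 510 tokens assumes a 512-token encoder with two positions reserved for special tokens. The stride, the averaging rule, and the `score_fn` callable are all assumptions.

```python
from typing import Callable, List, Sequence


def slice_tokens(
    tokens: Sequence[str], window: int = 510, stride: int = 255
) -> List[List[str]]:
    """Split tokens into overlapping windows so that every token falls in some window."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    slices: List[List[str]] = []
    start = 0
    while True:
        slices.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the current window already reaches the end of the document
        start += stride
    return slices


def document_score(
    tokens: Sequence[str],
    score_fn: Callable[[List[str]], float],
    window: int = 510,
    stride: int = 255,
) -> float:
    """Average the per-slice scores; score_fn stands in for a trained classifier."""
    slices = slice_tokens(tokens, window, stride)
    return sum(score_fn(s) for s in slices) / len(slices)
```

Averaging treats every part of the article equally. If the decisive claim tends to sit in one place, such as the headline or the final paragraph, a maximum or position-weighted aggregate may work better, which relates to the information-location bottleneck noted above.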

Key datasets¶

MM-COVID — 3,981 fake news pieces in six languages (English, Spanish, Portuguese, Hindi, French, Italian) with Twitter social context and propagation timelines; primary benchmark for multilingual COVID-19 detection.
CHECKED — 2,104 Weibo posts (Chinese) with per-item fake/real labels, images, video, and full propagation threads (1.87M reposts, 1.19M comments).
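
The CHECKED label counts quoted above (344 fake, 1,760 real) are imbalanced, which affects how accuracy figures should be read. The sketch below computes the accuracy of a trivial majority-class predictor and inverse-frequency class weights from those counts. The helper name `class_stats` is hypothetical, and the weighting rule (total divided by number of classes times class count) is one common heuristic, not something prescribed by the dataset's authors.

```python
from typing import Dict, Tuple


def class_stats(counts: Dict[str, int]) -> Tuple[int, float, Dict[str, float]]:
    """Return (total items, majority-class accuracy, inverse-frequency class weights)."""
    total = sum(counts.values())
    n_classes = len(counts)
    # Accuracy obtained by always predicting the most frequent label.
    majority_baseline = max(counts.values()) / total
    # Rarer classes receive proportionally larger weights.
    weights = {label: total / (n_classes * n) for label, n in counts.items()}
    return total, majority_baseline, weights


checked_counts = {"fake": 344, "real": 1760}  # counts as stated for CHECKED
total, baseline, weights = class_stats(checked_counts)
print(f"total={total}, majority baseline={baseline:.3f}, weights={weights}")
```

Because always predicting "real" is already correct for 1,760 of the 2,104 posts, per-class precision and recall on the fake class are more informative than raw accuracy on this dataset.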

Related topics and connections¶

Cross-lingual detection and transfer learning — the complementary problem of training on one language and testing on another without explicit parallel data
COVID-19 misinformation — the primary domain where multilingual detection has been applied; MM-COVID and CHECKED both focus on the pandemic
Multimodal detection — MM-COVID and CHECKED both include text+image+social features, showing that multimodal and multilingual challenges often co-occur
Feature engineering — language-specific and language-invariant feature design is central to multilingual transfer
