Multilingual fake news detection

Multilingual fake news detection addresses the challenge of identifying misinformation in languages beyond English. The key motivation is that misinformation spreads globally and across linguistic communities — COVID-19 conspiracy theories, political disinformation, and health falsehoods propagate in Spanish, Portuguese, Hindi, French, Italian, Mandarin, and other languages — yet most existing detection datasets and models focus exclusively on English.

Two complementary research directions:

  • Parallel multilingual datasets: Creating parallel corpora of fake and real news in multiple languages simultaneously, enabling direct comparison of language-invariant vs. language-specific misinformation patterns (e.g., MM-COVID)
  • Cross-lingual transfer learning: Training detection models on high-resource languages (e.g., English) and transferring them to low-resource languages that lack large labeled datasets
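
The evaluation setup behind cross-lingual transfer can be sketched as a data split: train on records in the source language(s), then test on each held-out target language in turn. This is a minimal sketch under stated assumptions. The record schema (`lang`, `text`, `label`), the function name `zero_shot_splits`, and the toy records are hypothetical and are not taken from any of the datasets discussed here.

```python
from collections import defaultdict


def zero_shot_splits(records, source_langs):
    """Yield (target_lang, train, test) for every language not in source_langs.

    train: all records written in a source language
    test:  all records written in one held-out target language
    """
    source_langs = set(source_langs)
    by_lang = defaultdict(list)
    for rec in records:
        by_lang[rec["lang"]].append(rec)

    # The training pool is the same for every target language.
    train = [rec for lang in sorted(source_langs) for rec in by_lang.get(lang, [])]
    for lang in sorted(by_lang):
        if lang not in source_langs:
            yield lang, train, by_lang[lang]


# Hypothetical toy records, for illustration only.
records = [
    {"lang": "en", "text": "Vaccine approved after trials", "label": "real"},
    {"lang": "en", "text": "Miracle cure hidden by doctors", "label": "fake"},
    {"lang": "es", "text": "Cura milagrosa oculta", "label": "fake"},
    {"lang": "hi", "text": "टीका परीक्षण के बाद स्वीकृत", "label": "real"},
]

splits = list(zero_shot_splits(records, source_langs=["en"]))
for target, train, test in splits:
    print(target, len(train), len(test))
```

Keeping the target language entirely out of the training pool is what makes the setting "zero-resource"; a few-shot variant would move a small number of target-language records into `train`.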

Distinctive challenges

  • Language-specific linguistic patterns: Fake news may employ different rhetorical devices, emotional language, and syntactic structures across languages. Patterns learned from English text may not transfer.
  • Fact-checking source bias: Most fact-checking agencies are English-based (e.g., PolitiFact, Snopes). Labeled datasets for other languages are sparse and may be biased toward what English-language fact-checkers prioritize.
  • Social media platform variation: Misinformation spread patterns differ by platform and language community. Russian disinformation on Facebook/Twitter may behave differently than Hindi misinformation on WhatsApp.
  • Code-switching and transliteration: In multilingual communities, users mix languages (e.g., Spanish-English in US Latino communities) or transliterate text (e.g., Hindi text written in Latin script), complicating feature extraction.
  • Cultural context: The credibility of health claims, political narratives, and conspiracy theories varies by cultural context and prior beliefs, making transfer across language groups non-trivial.
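
To make the code-switching and transliteration challenge concrete, the sketch below labels each token with its dominant Unicode script and flags text that mixes scripts. It is a rough heuristic under stated assumptions: the script is approximated by the first word of the character's Unicode name (Python's standard library does not expose the Unicode Script property directly), and the helper names `char_script`, `token_scripts`, and `is_mixed_script` are hypothetical.

```python
import unicodedata
from collections import Counter


def char_script(ch):
    """Approximate a character's script from the first word of its Unicode name."""
    # Keep letters and combining marks; skip digits, punctuation, whitespace.
    if not ch.isalpha() and unicodedata.category(ch)[0] != "M":
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def token_scripts(text):
    """Return (token, dominant_script) for each whitespace-separated token."""
    result = []
    for token in text.split():
        counts = Counter(s for s in map(char_script, token) if s)
        if counts:
            result.append((token, counts.most_common(1)[0][0]))
    return result


def is_mixed_script(text):
    """True if the tokens of `text` use more than one script."""
    return len({script for _, script in token_scripts(text)}) > 1


# Hindi in Devanagari with an English word: detected as mixed-script.
print(is_mixed_script("मैं office जा रहा हूँ"))
# The same sentence romanized: every token is Latin, so nothing is flagged.
print(is_mixed_script("main office ja raha hoon"))
# Spanish-English code-switching is also all Latin script.
print(is_mixed_script("Voy al store mañana"))
```

The second and third examples show the limit of script-based checks: transliterated Hindi and Spanish-English code-switching share the Latin script with English, so separating them requires token-level language identification instead of script inspection.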

Key papers

  • Quelle & Bovet (2023) — "The Perils & Promises of Fact-checking with Large Language Models": First large-scale evaluation of LLM-based fact-checking across 16+ languages, using the Data Commons dataset with claims from 78+ fact-checking organizations. Demonstrates performance drops for non-English languages when claims are tested in their original language (reported figures: Turkish 84→81%, Thai 48→54%). Shows that translating claims to English before verification improves accuracy by 5–20+ percentage points, pointing to the dominance of English in the training data of GPT models.
  • Li et al. (2020) — MM-COVID: multilingual and multimodal COVID-19 dataset with 3,981 fake news pieces and 7,192 trustworthy news pieces in six languages (English, Spanish, Portuguese, Hindi, French, Italian), together with their social context. Demonstrates that social context (user profiles, engagement patterns) provides language-invariant signals even in zero-resource cross-lingual transfer settings. dEFEND, which combines text with social context, achieves 0.91–0.96 accuracy in high-resource settings and 0.76–0.90 in low-resource settings.
  • Du et al. (2021) — CrossFake: Addresses the challenge of COVID-19 misinformation in Chinese, a language with limited fact-checked datasets. Trains on English COVID-19 news and applies the model to Chinese news via machine translation. Proposes a BERT-based architecture with sub-text slicing to preserve information across long documents. Achieves 75% accuracy on 200 manually annotated Chinese articles, demonstrating practical cross-lingual transfer. Identifies translation quality and the location of information within a document as key bottlenecks for cross-lingual detection.
  • Yang et al. (2020) — CHECKED: first Chinese-language COVID-19 fake news dataset (2,104 Weibo posts, 344 fake / 1,760 real). Addresses the gap that most COVID-19 misinformation datasets (e.g., ReCOVery, MM-COVID) do not cover Chinese, despite the pandemic's global spread and significance in China.
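
Two ideas from the papers above, translating to English before classification and slicing long documents into sub-texts, can be sketched together. This is a minimal illustration and not the authors' implementation. Whitespace tokenization stands in for a BERT tokenizer, averaging the per-slice scores is an assumed aggregation rule, and `translate_fn` and `score_fn` are hypothetical callables supplied by the caller (for example, a machine translation client and a fine-tuned classifier).

```python
def slice_tokens(tokens, window=128, stride=64):
    """Split a token list into overlapping windows of at most `window` tokens."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("require 0 < stride <= window so that no token is skipped")
    if not tokens:
        return []
    slices = []
    start = 0
    while True:
        slices.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window reached the end of the document
        start += stride
    return slices


def classify_long_document(text, score_fn, translate_fn=None, window=128, stride=64):
    """Optionally translate, slice into sub-texts, score each, and average.

    score_fn(sub_text) -> float is assumed to return a fake-news score in [0, 1].
    Returns (mean_score, per_slice_scores).
    """
    if translate_fn is not None:
        text = translate_fn(text)
    slices = slice_tokens(text.split(), window, stride)
    if not slices:
        raise ValueError("empty document")
    scores = [score_fn(" ".join(s)) for s in slices]
    return sum(scores) / len(scores), scores


# Hypothetical keyword scorer standing in for a trained classifier.
def toy_score(sub_text):
    return 1.0 if "miracle" in sub_text else 0.0


mean_score, slice_scores = classify_long_document(
    "a b c d e miracle", toy_score, window=4, stride=2
)
print(slice_scores, mean_score)
```

In the toy run the suspicious word sits at the end of the document, so only the second slice scores high. A model that truncated the input to the first window would miss it entirely, which is the "information location" problem that slicing is meant to mitigate.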

Key datasets

  • MM-COVID — 3,981 fake news pieces in six languages (English, Spanish, Portuguese, Hindi, French, Italian) with Twitter social context and propagation timelines; primary benchmark for multilingual COVID-19 detection.
  • CHECKED — 2,104 Weibo posts (Chinese) with per-item fake/real labels, images, video, and full propagation threads (1.87M reposts, 1.19M comments).

Related topics

  • Cross-lingual detection and transfer learning — the complementary problem of training on one language and testing on another without explicit parallel data
  • COVID-19 misinformation — the primary domain where multilingual detection has been applied; MM-COVID and CHECKED both focus on the pandemic
  • Multimodal detection — MM-COVID and CHECKED both include text+image+social features, showing that multimodal and multilingual challenges often co-occur
  • Feature engineering — language-specific and language-invariant feature design is central to multilingual transfer