
MM-COVID

Full name: MM-COVID: A Multilingual and Multimodal Data Repository for Combating COVID-19 Disinformation

Authors: Yichuan Li, Bohan Jiang, Kai Shu, Huan Liu

Paper: Li et al. (2020)

Access: https://github.com/bigheadxu/X-COVID (GitHub repository with dataset and baseline code)

Description

MM-COVID is a multilingual and multimodal dataset specifically designed for COVID-19 fake news detection research. It addresses a critical gap in existing COVID-19 misinformation datasets by providing parallel multilingual content with rich social context features. The dataset includes COVID-19-related fake news alongside real news in six languages, with both news content and associated Twitter social engagements.

The multilingual aspect is essential because COVID-19 misinformation spreads rapidly across linguistic boundaries, yet most existing fake news datasets focus on monolingual English content. MM-COVID enables cross-lingual fake news detection, allowing researchers to study language-invariant and language-specific characteristics of misinformation and develop transfer learning approaches.

Statistics

Component | Count
Fake news pieces | 3,981
Real news pieces | (unlabeled comparison set)
Tweets (social engagements) | 7,192
Languages | 6: English, Spanish, Portuguese, Hindi, French, Italian
Temporal span | Early 2020 (pandemic emergence)

Coverage by language

Language | Code | Fake news | Tweets | Real news
English | en | ~660 | ~1,200 | paired
Spanish | es | ~660 | ~1,200 | paired
Portuguese | pt | ~660 | ~1,200 | paired
Hindi | hi | ~660 | ~1,200 | paired
French | fr | ~660 | ~1,200 | paired
Italian | it | ~660 | ~1,200 | paired

Schema and features

News content features

  • Fact-checking reviews: Label, debunked explanation, verification source query (from PolitiFact, Snopes, Poynter)
  • Source content: URL, language, location, release date, text content, image metadata
  • Veracity: Binary (fake/real) or graded labels from fact-checking agencies

Social context features

  • Tweet-level: Text, creation time, retweet/reply/like counts, emoji analysis
  • User profiles: Follower count, friend count, account creation date, follower-friend ratio
  • Network structure: Reply and retweet cascades showing propagation
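
As a rough illustration of how the social context features above could be derived, the sketch below computes a follower-friend ratio and the depth of each reply/retweet cascade. The `Tweet` record and its field names are hypothetical and are not taken from the MM-COVID release; the sketch also assumes parent links form a tree (no cycles).

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tweet:
    # Hypothetical record layout; field names are illustrative only.
    tweet_id: str
    parent_id: Optional[str]  # tweet being replied to or retweeted; None for a root
    followers: int
    friends: int


def follower_friend_ratio(followers: int, friends: int) -> float:
    """Followers divided by friends; friends is floored at 1 to avoid dividing by zero."""
    return followers / max(friends, 1)


def cascade_depths(tweets: list[Tweet]) -> dict[str, int]:
    """Map each root tweet id to the depth of its cascade (a root with no children is 0)."""
    children: dict[str, list[str]] = defaultdict(list)
    for t in tweets:
        if t.parent_id is not None:
            children[t.parent_id].append(t.tweet_id)

    depths: dict[str, int] = {}
    for root in (t.tweet_id for t in tweets if t.parent_id is None):
        depth, frontier = 0, [root]
        # Breadth-first walk: each loop iteration moves one level down the cascade.
        while True:
            next_level = [c for node in frontier for c in children[node]]
            if not next_level:
                break
            depth += 1
            frontier = next_level
        depths[root] = depth
    return depths
```

The floor of 1 on the friend count is a design choice for this sketch; a different smoothing would change the ratio for accounts that follow nobody.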

Temporal information

  • Propagation timeline: Timestamps of tweets and replies tracking how misinformation spreads
  • Language-specific timelines: How the same misinformation evolves across languages
  • Debunking delays: Temporal lag between fake news emergence and fact-checking
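
The debunking delay described above is a simple timestamp difference. A minimal sketch, assuming the earliest collected tweet stands in for the emergence of the claim and that both timestamps are available as timezone-aware `datetime` values (the function names are hypothetical):

```python
from datetime import datetime, timezone


def first_seen(tweet_times: list[datetime]) -> datetime:
    """Earliest tweet timestamp, used here as a proxy for when the claim emerged."""
    return min(tweet_times)


def debunking_delay_hours(tweet_times: list[datetime], fact_checked_at: datetime) -> float:
    """Hours between the first observed tweet and the fact-check publication.

    A negative value means the fact-check predates every collected tweet.
    """
    delta = fact_checked_at - first_seen(tweet_times)
    return delta.total_seconds() / 3600.0
```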

Data collection methodology

  1. Source selection: Initial claims were drawn from reliable fact-checking websites (PolitiFact, Snopes, Poynter)
  2. Language filtering: Multilingual claims were selected; content was translated or natively multilingual
  3. Twitter query: Claims were searched on Twitter using the headline and keywords; tweets matching the claim were collected
  4. Social engagement collection: Reply, retweet, and user profile data were gathered via Twitter API
  5. Archive retrieval: Archived versions of URLs were retrieved when original sources became unavailable
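
Step 3 above turns a claim headline into a search query. The paper's exact query construction is not described here, so the following is only a sketch of one plausible approach: lowercase the headline, drop a small hypothetical stopword list, and keep the first few distinct terms. The stopword list is English-only, and the `\w+` tokenizer is a simplification that suits space-delimited Latin-script text; other scripts would need a proper tokenizer.

```python
import re

# Hypothetical, deliberately tiny English stopword list for illustration.
STOPWORDS = {
    "a", "an", "the", "of", "to", "in", "on", "for",
    "and", "or", "is", "are", "that", "with",
}


def build_search_query(headline: str, max_terms: int = 6) -> str:
    """Return up to max_terms distinct non-stopword tokens, in headline order."""
    terms: list[str] = []
    seen: set[str] = set()
    for token in re.findall(r"\w+", headline.lower()):
        if token in STOPWORDS or token in seen:
            continue
        seen.add(token)
        terms.append(token)
        if len(terms) == max_terms:
            break
    return " ".join(terms)
```

The resulting string would then be passed to whatever search endpoint is in use; that call is intentionally left out because it depends on API access and version.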

Key characteristics revealed by dataset analysis

Language-invariant patterns

  • Fake news consistently uses medical-related keywords (doctor, hospital, patient, vaccine) across languages
  • Sentiment analysis shows more emotional language in fake tweets (laughing, angry emojis)
  • Bot likelihood (based on user profile characteristics) is elevated for some languages (es, pt, hi) but not others (en, it)

Language-specific patterns

  • Italian and English fake news contain higher bot engagement; Spanish and Portuguese show lower bot-likelihood ratios
  • Temporal propagation patterns differ: English fake news reaches maximum engagement quickly; Hindi fake news spreads more gradually
  • Topic differences: English focuses on conspiracy theories (#vaccine, #hydroxychloroquine); Spanish emphasizes economic impact; Portuguese stresses health authority credibility

Social context signals

  • Real news attracts more replies and retweets on average, but fake news shows higher emotional emoji usage
  • Users who share fake news often have fewer followers, but this effect is weaker in languages with smaller user populations (hi, fr)

Benchmark results

The paper evaluates several baseline methods across three resource scenarios:

Scenario | Best method | Accuracy (average) | Notes
Enough resource (80% training data) | dEFEND (text+social) | 0.91–0.96 | Social context provides 3–6% improvement over text-only
Low resource (20% training data) | dEFEND (text+social) | 0.76–0.90 | Social features remain beneficial even with limited labels
No resource (zero-shot cross-lingual) | dEFEND multilingual | 0.41–0.85 | Transfer from English to other languages shows variable success

Key finding: Social context (user replies and engagement patterns) provides language-invariant features beneficial for cross-lingual transfer, enabling better-than-baseline performance even in zero-resource settings.
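
The three resource scenarios can be approximated with simple data splits. The sketch below is not the paper's evaluation code: it assumes each record is a dict with a hypothetical `lang` key, treats the non-training remainder as the test set, and models the no-resource case as training on one source language and testing on another.

```python
import random


def split_by_fraction(records: list[dict], train_fraction: float, seed: int = 0):
    """Shuffle deterministically and return (train, test) with the given train share."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def zero_shot_split(records: list[dict], source_lang: str = "en", target_lang: str = "es"):
    """Train only on the source language and test only on the target language."""
    train = [r for r in records if r["lang"] == source_lang]
    test = [r for r in records if r["lang"] == target_lang]
    return train, test


# Usage with toy records: 80% training, 20% training, and English-to-Spanish transfer.
records = [{"lang": "en" if i % 2 == 0 else "es", "fake": i % 3 == 0} for i in range(10)]
enough_train, enough_test = split_by_fraction(records, 0.8)
low_train, low_test = split_by_fraction(records, 0.2)
zs_train, zs_test = zero_shot_split(records)
```

Fixing the seed keeps the enough-resource and low-resource splits reproducible across runs, which matters when comparing methods on the same partition.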

Intended use

  • Multilingual fake news detection: Train and evaluate detection models across multiple languages
  • Cross-lingual transfer learning: Test whether detectors trained on English transfer to under-resourced languages
  • Language-invariant feature discovery: Identify shared misinformation patterns across linguistic boundaries
  • Early detection: Leverage temporal social engagement patterns to detect misinformation before widespread propagation
  • Fact-checking assistance: Prioritize claims for human fact-checkers based on language and engagement signals
  • Mitigation strategy research: Study propagation networks to design intervention strategies (bot removal, influential user targeting)

Limitations

  • Languages: Covers only six languages; excludes Chinese, Russian, and other languages in which misinformation is widespread
  • Fact-checking source bias: Reliance on English-majority fact-checking agencies (PolitiFact, Snopes) introduces language bias in what gets labeled
  • Temporal boundary: Data collection focused on early 2020; later pandemic phases (vaccination period) underrepresented
  • Real news baseline: Real news is not explicitly fact-checked; assumes unverified tweets from official health sources are accurate
  • Tweet deletions: Historical tweet retrieval may be incomplete due to account suspensions or deletions
  • Monolingual fact-checks: Many fact-checks exist only in English; translated fact-checks for non-English languages may introduce translation artifacts

Dataset Year Languages Multimodal Social context COVID-19 focused
FakeNewsNet 2018 English No Yes
ReCOVery 2020 English Yes Yes
CHECKED 2020 Chinese (Weibo) Yes Yes
MM-COVID 2020 6 languages Yes Yes Yes
NELA-GT-2020 2020 English No No

MM-COVID is unique in combining multilingual coverage with multimodal features (text + image + social context) and temporal information, filling a gap between monolingual English datasets (FakeNewsNet, NELA-GT) and non-English datasets (CHECKED).

Connections