MM-COVID¶
Full name: MM-COVID: A Multilingual and Multimodal Data Repository for Combating COVID-19 Disinformation
Authors: Yichuan Li, Bohan Jiang, Kai Shu, Huan Liu
Paper: Li et al. (2020)
Access: https://github.com/bigheadxu/X-COVID (GitHub repository with dataset and baseline code)
Description¶
MM-COVID is a multilingual and multimodal dataset specifically designed for COVID-19 fake news detection research. It addresses a critical gap in existing COVID-19 misinformation datasets by providing parallel multilingual content with rich social context features. The dataset includes COVID-19-related fake news alongside real news in six languages, with both news content and associated Twitter social engagements.
The multilingual aspect is essential because COVID-19 misinformation spreads rapidly across linguistic boundaries, yet most existing fake news datasets focus on monolingual English content. MM-COVID enables cross-lingual fake news detection, allowing researchers to study language-invariant and language-specific characteristics of misinformation and develop transfer learning approaches.
Statistics¶
| Component | Count |
|---|---|
| Fake news pieces | 3,981 |
| Real news pieces | (unlabeled comparison set) |
| Tweets (social engagements) | 7,192 |
| Languages | 6: English, Spanish, Portuguese, Hindi, French, Italian |
| Temporal span | Early 2020 (pandemic emergence) |
Coverage by language¶
| Language | Code | Fake news | Tweets | Real news |
|---|---|---|---|---|
| English | en | ~660 | ~1,200 | paired |
| Spanish | es | ~660 | ~1,200 | paired |
| Portuguese | pt | ~660 | ~1,200 | paired |
| Hindi | hi | ~660 | ~1,200 | paired |
| French | fr | ~660 | ~1,200 | paired |
| Italian | it | ~660 | ~1,200 | paired |
Schema and features¶
News content features¶
- Fact-checking reviews: Label, debunked explanation, verification source query (from PolitiFact, Snopes, Poynter)
- Source content: URL, language, location, release date, text content, image metadata
- Veracity: Binary (fake/real) or graded labels from fact-checking agencies
Social context features¶
- Tweet-level: Text, creation time, retweet/reply/like counts, emoji analysis
- User profiles: Follower count, friend count, account creation date, follower-friend ratio
- Network structure: Reply and retweet cascades showing propagation
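As a rough illustration of how the news content and social context fields above could be held in memory, here is a minimal sketch. The class and field names (`UserProfile`, `Tweet`, `NewsItem`, `follower_friend_ratio`) are assumptions made for this example and are not the column names of the released files.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class UserProfile:
    followers: int
    friends: int
    created_at: datetime

    @property
    def follower_friend_ratio(self) -> float:
        # Guard against division by zero for accounts that follow nobody.
        return self.followers / max(self.friends, 1)


@dataclass
class Tweet:
    tweet_id: str
    text: str
    created_at: datetime
    user: UserProfile
    retweets: int = 0
    replies: int = 0
    likes: int = 0
    in_reply_to: Optional[str] = None  # parent tweet id when this is a reply


@dataclass
class NewsItem:
    news_id: str
    language: str  # language code such as "en" or "es"
    label: str     # "fake" or "real"
    text: str
    url: Optional[str] = None
    tweets: List[Tweet] = field(default_factory=list)


# Made-up values, only to show how the pieces fit together.
profile = UserProfile(followers=120, friends=480, created_at=datetime(2015, 3, 1))
tweet = Tweet("t1", "example text", datetime(2020, 3, 20, 12, 0), profile, retweets=4)
item = NewsItem("n1", "es", "fake", "example claim", tweets=[tweet])
print(item.language, len(item.tweets), profile.follower_friend_ratio)
```

Keeping the user profile nested inside each tweet makes profile-derived features, such as the follower-friend ratio, available wherever a tweet is processed.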
Temporal information¶
- Propagation timeline: Timestamps of tweets and replies tracking how misinformation spreads
- Language-specific timelines: How the same misinformation evolves across languages
- Debunking delays: Temporal lag between fake news emergence and fact-checking
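The debunking delay can be approximated from timestamps alone. The sketch below assumes that the earliest collected tweet is a usable proxy for when the fake news emerged; that proxy and the function name `debunking_delay` are assumptions of this example, not definitions taken from the paper.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def debunking_delay(
    tweet_times: Iterable[datetime], fact_check_time: datetime
) -> Optional[timedelta]:
    """Lag between the earliest observed tweet and the fact-check publication."""
    times = list(tweet_times)
    if not times:
        return None  # no social engagement collected for this claim
    return fact_check_time - min(times)


# Made-up timestamps for illustration.
tweets = [datetime(2020, 3, 2, 9, 30), datetime(2020, 3, 1, 8, 0)]
delay = debunking_delay(tweets, datetime(2020, 3, 4, 8, 0))
print(delay)
```

A negative result would mean the fact-check predates the first collected tweet, which can happen when early tweets have been deleted.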
Data collection methodology¶
- Source selection: Reliable fact-checking websites (PolitiFact, Snopes, Poynter) curated the initial claims
- Language filtering: Multilingual claims were selected; content was translated or natively multilingual
- Twitter query: Claims were searched on Twitter using the headline and keywords; tweets matching the claim were collected
- Social engagement collection: Reply, retweet, and user profile data were gathered via Twitter API
- Archive retrieval: Archived versions of URLs were retrieved when original sources became unavailable
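The Twitter query step can be pictured as turning a headline into a short keyword query. The sketch below is a hypothetical helper, not the authors' collection code: the stopword list, the `max_terms` cutoff, and the function name `build_query` are all assumptions, and no Twitter API call is made.

```python
import re

# Tiny illustrative English stopword list; a real pipeline would need one per language.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "is", "and", "that", "for", "on"}


def build_query(headline: str, max_terms: int = 6) -> str:
    """Reduce a headline to at most `max_terms` distinct non-stopword keywords."""
    tokens = re.findall(r"\w+", headline.lower())  # \w is Unicode-aware for str input
    keywords = []
    for tok in tokens:
        if tok not in STOPWORDS and tok not in keywords:
            keywords.append(tok)
    return " ".join(keywords[:max_terms])


print(build_query("Drinking hot water is a cure for the coronavirus"))
```

The resulting string would then be passed to whatever search endpoint is available; tweets returned for it still need to be checked against the claim before being attached to a news item.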
Key characteristics revealed by dataset analysis¶
Language-invariant patterns:
- Fake news consistently uses medical-related keywords (doctor, hospital, patient, vaccine) across languages
- Sentiment analysis shows more emotional language in fake tweets (laughing, angry emojis)
- Bot likelihood (based on user profile characteristics) is elevated for some languages (es, pt, hi) but not others (en, it)
Language-specific patterns:
- Italian and English fake news contain higher bot engagement; Spanish and Portuguese show lower bot-likelihood ratios
- Temporal propagation patterns differ: English fake news reaches maximum engagement quickly; Hindi fake news spreads more gradually
- Topic differences: English focuses on conspiracy theories (#vaccine, #hydroxychloroquine); Spanish emphasizes economic impact; Portuguese stresses health authority credibility
Social context signals:
- Real news attracts more replies and retweets on average, but fake news shows higher emotional emoji usage
- User profiles sharing fake news often have fewer followers, but this effect is weaker in languages with smaller user populations (hi, fr)
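A comparison of emoji usage between fake and real tweets could be computed along the following lines. This is a heuristic sketch under stated assumptions: it counts characters in the Unicode category `So` (Symbol, other), which covers most emoji but also symbols such as `©`, and it is not the analysis procedure used in the paper. The sample tweets are made up.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def emoji_count(text: str) -> int:
    """Count characters in Unicode category 'So' as a rough emoji proxy."""
    return sum(1 for ch in text if unicodedata.category(ch) == "So")


def mean_emoji_per_label(tweets: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Average emoji count per tweet, grouped by veracity label."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for label, text in tweets:
        totals[label] += emoji_count(text)
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in counts}


sample = [
    ("fake", "No way \U0001F602\U0001F602"),         # two laughing emoji
    ("fake", "This is outrageous \U0001F621"),       # one angry emoji
    ("real", "Health agency publishes new guidance"),
]
print(mean_emoji_per_label(sample))
```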
Benchmark results¶
The paper evaluates several baseline methods across three resource scenarios:
| Scenario | Best method | Accuracy (average) | Notes |
|---|---|---|---|
| Enough resource (80% training data) | dEFEND (text+social) | 0.91–0.96 | Social context provides 3–6% improvement over text-only |
| Low resource (20% training data) | dEFEND (text+social) | 0.76–0.90 | Social features remain beneficial even with limited labels |
| No resource (zero-shot cross-lingual) | dEFEND multilingual | 0.41–0.85 | Transfer from English to other languages shows variable success |
Key finding: Social context (user replies and engagement patterns) provides language-invariant features beneficial for cross-lingual transfer, enabling better-than-baseline performance even in zero-resource settings.
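The three scenarios differ mainly in how training and test data are drawn. The sketch below follows the percentages given in the table (80% and 20% training data within a language, and training on one language with testing on another for the zero-shot case). The function names, the random seed, and the toy items are assumptions of this example, and the paper's exact protocol may differ.

```python
import random
from typing import List, Tuple

Item = Tuple[str, str]  # (language code, label)


def split_in_language(
    items: List[Item], train_fraction: float, seed: int = 0
) -> Tuple[List[Item], List[Item]]:
    """Shuffle the items and cut them at `train_fraction`."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def split_zero_shot(
    items: List[Item], source_lang: str, target_lang: str
) -> Tuple[List[Item], List[Item]]:
    """Train on the source language only; test on the target language only."""
    train = [it for it in items if it[0] == source_lang]
    test = [it for it in items if it[0] == target_lang]
    return train, test


# Toy data: ten Hindi and ten English items with alternating labels.
hindi = [("hi", "fake" if i % 2 else "real") for i in range(10)]
english = [("en", "fake" if i % 2 else "real") for i in range(10)]

enough_train, enough_test = split_in_language(hindi, 0.8)
low_train, low_test = split_in_language(hindi, 0.2)
zero_train, zero_test = split_zero_shot(hindi + english, "en", "hi")
print(len(enough_train), len(low_train), len(zero_train), len(zero_test))
```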
Intended use¶
- Multilingual fake news detection: Train and evaluate detection models across multiple languages
- Cross-lingual transfer learning: Test whether detectors trained on English transfer to under-resourced languages
- Language-invariant feature discovery: Identify shared misinformation patterns across linguistic boundaries
- Early detection: Leverage temporal social engagement patterns to detect misinformation before widespread propagation
- Fact-checking assistance: Prioritize claims for human fact-checkers based on language and engagement signals
- Mitigation strategy research: Study propagation networks to design intervention strategies (bot removal, influential user targeting)
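For the early detection use case, a common setup is to restrict each news item to the engagements seen shortly after its first tweet. The sketch below shows one way to do that truncation; the function name `early_window` and the choice of a window measured in hours from the first tweet are assumptions of this example.

```python
from datetime import datetime, timedelta
from typing import List


def early_window(timestamps: List[datetime], hours: float) -> List[datetime]:
    """Keep only engagements within `hours` of the earliest one, in time order."""
    if not timestamps:
        return []
    cutoff = min(timestamps) + timedelta(hours=hours)
    return sorted(t for t in timestamps if t <= cutoff)


# Made-up engagement times for a single news item.
t0 = datetime(2020, 3, 20, 12, 0)
engagements = [t0 + timedelta(hours=5), t0, t0 + timedelta(minutes=30)]
print(early_window(engagements, hours=1))
```

A detector evaluated on such truncated inputs only sees what would have been available at that point in the propagation.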
Limitations¶
- Languages: Covers only six languages; excludes Chinese, Russian, and other languages spoken in regions with heavy misinformation activity
- Fact-checking source bias: Reliance on English-majority fact-checking agencies (PolitiFact, Snopes) introduces language bias in what gets labeled
- Temporal boundary: Data collection focused on early 2020; later pandemic phases (vaccination period) underrepresented
- Real news baseline: Real news is not explicitly fact-checked; assumes unverified tweets from official health sources are accurate
- Tweet deletions: Historical tweet retrieval may be incomplete due to account suspensions or deletions
- Monolingual fact-checks: Many fact-checks exist only in English; translated fact-checks for non-English languages may introduce translation artifacts
Related datasets and comparisons¶
| Dataset | Year | Languages | Multimodal | Social context | COVID-19 focused |
|---|---|---|---|---|---|
| FakeNewsNet | 2018 | English | No | Yes | No |
| ReCOVery | 2020 | English | Yes | Yes | Yes |
| CHECKED | 2020 | Chinese (Weibo) | Yes | Yes | Yes |
| MM-COVID | 2020 | 6 languages | Yes | Yes | Yes |
| NELA-GT-2020 | 2020 | English | No | No | No |
MM-COVID is unique in combining multilingual coverage with multimodal features (text + image + social context) and temporal information, filling a gap between monolingual English datasets (FakeNewsNet, NELA-GT) and non-English datasets (CHECKED).
Connections¶
- ReCOVery (Zhou et al., 2020) covers COVID-19 credibility but in English only; MM-COVID extends to multiple languages
- HERO (Zhou et al., 2023) uses MM-COVID as a benchmark for linguistic-style-based fake news detection
- dEFEND (Shu et al., 2019) is adapted and evaluated on MM-COVID; the social-context variant (dEFEND\N) shows benefits for multilingual detection
- Multimodal fake news detection: MM-COVID provides a testbed for combining text, image, and social signals
- COVID-19 misinformation: MM-COVID is a foundational dataset for pandemic-specific detection research