MM-COVID¶

Full name: MM-COVID: A Multilingual and Multimodal Data Repository for Combating COVID-19 Disinformation

Authors: Yichuan Li, Bohan Jiang, Kai Shu, Huan Liu

Paper: Li et al. (2020)

Access: https://github.com/bigheadxu/X-COVID (GitHub repository with dataset and baseline code)

Description¶

MM-COVID is a multilingual and multimodal dataset specifically designed for COVID-19 fake news detection research. It addresses a critical gap in existing COVID-19 misinformation datasets by providing parallel multilingual content with rich social context features. The dataset includes COVID-19-related fake news alongside real news in six languages, with both news content and associated Twitter social engagements.

The multilingual aspect is essential because COVID-19 misinformation spreads rapidly across linguistic boundaries, yet most existing fake news datasets focus on monolingual English content. MM-COVID enables cross-lingual fake news detection, allowing researchers to study language-invariant and language-specific characteristics of misinformation and develop transfer learning approaches.

Statistics¶

Component	Count
Fake news pieces	3,981
Real news pieces	(unlabeled comparison set)
Tweets (social engagements)	7,192
Languages	6: English, Spanish, Portuguese, Hindi, French, Italian
Temporal span	Early 2020 (pandemic emergence)

Coverage by language¶

Language	Code	Fake news	Tweets	Real news
English	en	~660	~1,200	paired
Spanish	es	~660	~1,200	paired
Portuguese	pt	~660	~1,200	paired
Hindi	hi	~660	~1,200	paired
French	fr	~660	~1,200	paired
Italian	it	~660	~1,200	paired

Schema and features¶

News content features¶

Fact-checking reviews: Label, debunked explanation, verification source query (from PolitiFact, Snopes, Poynter)
Source content: URL, language, location, release date, text content, image metadata
Veracity: Binary (fake/real) or graded labels from fact-checking agencies

Social context features¶

Tweet-level: Text, creation time, retweet/reply/like counts, emoji analysis
User profiles: Follower count, friend count, account creation date, follower-friend ratio
Network structure: Reply and retweet cascades showing propagation
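
As a rough illustration of the follower-friend ratio listed above, the sketch below derives it from a single user profile record. The field names `followers_count` and `friends_count` and the +1 smoothing in the denominator are assumptions made for this example; the released files may use different names and the paper may define the ratio differently.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    # Hypothetical field names; check the released files for the real ones.
    followers_count: int
    friends_count: int


def follower_friend_ratio(profile: UserProfile) -> float:
    """Followers divided by friends.

    The +1 in the denominator is an assumption of this sketch: it avoids
    division by zero for accounts that follow nobody.
    """
    return profile.followers_count / (profile.friends_count + 1)


# Example: 100 followers and 49 friends give 100 / 50.
example_ratio = follower_friend_ratio(UserProfile(followers_count=100, friends_count=49))
```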

Temporal information¶

Propagation timeline: Timestamps of tweets and replies tracking how misinformation spreads
Language-specific timelines: How the same misinformation evolves across languages
Debunking delays: Temporal lag between fake news emergence and fact-checking
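
The temporal signals above can be derived from timestamps alone. The following is a minimal sketch, assuming tweet times and the fact-check publication time have already been parsed into `datetime` objects; the function names and the hourly bucketing are illustrative choices, not part of the dataset.

```python
from datetime import datetime, timedelta


def debunking_delay(tweet_times, fact_check_time):
    """Lag between the earliest observed tweet and the fact-check.

    A negative result means the fact-check predates the first collected tweet.
    """
    return fact_check_time - min(tweet_times)


def hourly_timeline(tweet_times):
    """Count tweets per whole hour elapsed since the first tweet."""
    start = min(tweet_times)
    counts = {}
    for t in tweet_times:
        hour = int((t - start).total_seconds() // 3600)
        counts[hour] = counts.get(hour, 0) + 1
    return dict(sorted(counts.items()))


# Made-up timestamps for illustration only.
times = [
    datetime(2020, 3, 1, 12, 0),
    datetime(2020, 3, 1, 12, 30),
    datetime(2020, 3, 1, 14, 5),
]
delay = debunking_delay(times, datetime(2020, 3, 3, 12, 0))
timeline = hourly_timeline(times)
```

With these made-up timestamps, `delay` is a two-day `timedelta` and `timeline` maps hour 0 to two tweets and hour 2 to one.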

Data collection methodology¶

Source selection: Initial claims were drawn from reliable fact-checking websites (PolitiFact, Snopes, Poynter)
Language filtering: Claims in the covered languages were selected; their content was either natively written in those languages or translated
Twitter query: Claims were searched on Twitter using the headline and keywords; tweets matching the claim were collected
Social engagement collection: Reply, retweet, and user profile data were gathered via Twitter API
Archive retrieval: Archived versions of URLs were retrieved when original sources became unavailable
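
Two of the steps above, turning a headline into a search query and looking up an archived copy of a dead URL, can be sketched as follows. This is not the authors' pipeline: the stopword list, the `max_terms` cutoff, and both function names are hypothetical, and the Internet Archive's Wayback Machine availability endpoint is assumed here only as one possible archive, since the description above does not name the service used. The sketch only builds strings and performs no network calls.

```python
import re
from urllib.parse import urlencode

# Small illustrative English stopword list (an assumption of this sketch);
# other languages would need their own lists and tokenization.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "for", "on"}


def build_query(headline, max_terms=6):
    """Keep the first few distinct non-stopword tokens of a headline."""
    terms = []
    for token in re.findall(r"\w+", headline.lower()):
        if token not in STOPWORDS and token not in terms:
            terms.append(token)
    return " ".join(terms[:max_terms])


def wayback_lookup_url(original_url):
    """Build a Wayback Machine availability lookup URL for a source URL."""
    return "https://archive.org/wayback/available?" + urlencode({"url": original_url})


query = build_query("The vaccine is a hoax in the US")
lookup = wayback_lookup_url("http://example.com/a?b=1")
```

The query string would then be passed to whichever Twitter search client is available, and the lookup URL fetched only when the original source no longer resolves.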

Key characteristics revealed by dataset analysis¶

Language-invariant patterns:
- Fake news consistently uses medical-related keywords (doctor, hospital, patient, vaccine) across languages
- Sentiment analysis shows more emotional language in fake tweets (laughing, angry emojis)
- Bot likelihood (based on user profile characteristics) is elevated for some languages (es, pt, hi) but not others (en, it)

Language-specific patterns:
- Italian and English fake news contain higher bot engagement; Spanish and Portuguese show lower bot-likelihood ratios
- Temporal propagation patterns differ: English fake news reaches maximum engagement quickly; Hindi fake news spreads more gradually
- Topic differences: English focuses on conspiracy theories (#vaccine, #hydroxychloroquine); Spanish emphasizes economic impact; Portuguese stresses health authority credibility

Social context signals:
- Real news attracts more replies and retweets on average, but fake news shows higher emotional emoji usage
- User profiles sharing fake news often have fewer followers, but this effect is weaker in languages with smaller user populations (hi, fr)

Benchmark results¶

The paper evaluates several baseline methods across three resource scenarios:

Scenario	Best method	Accuracy (average)	Notes
Enough resource (80% training data)	dEFEND (text+social)	0.91–0.96	Social context provides 3–6% improvement over text-only
Low resource (20% training data)	dEFEND (text+social)	0.76–0.90	Social features remain beneficial even with limited labels
No resource (zero-shot cross-lingual)	dEFEND multilingual	0.41–0.85	Transfer from English to other languages shows variable success

Key finding: Social context (user replies and engagement patterns) provides language-invariant features beneficial for cross-lingual transfer, enabling better-than-baseline performance even in zero-resource settings.
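
The three resource scenarios can be pictured as different ways of splitting the same labeled items. The sketch below is one possible reading, not the paper's exact protocol: it assumes each item is a dict with a `lang` key, that the non-training portion serves as the test set, and that English is the source language in the zero-shot case. The scenario keys and the function name are hypothetical.

```python
import random


def split_by_scenario(items, scenario, source_lang="en", seed=0):
    """Return (train, test) for one of the three resource scenarios."""
    if scenario == "no_resource":
        # Zero-shot cross-lingual: train on the source language only,
        # evaluate on every other language.
        train = [x for x in items if x["lang"] == source_lang]
        test = [x for x in items if x["lang"] != source_lang]
        return train, test
    # Training fractions taken from the table above.
    fraction = {"enough_resource": 0.8, "low_resource": 0.2}[scenario]
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy items for illustration only.
toy = [{"id": i, "lang": "en" if i < 6 else "es"} for i in range(10)]
enough_train, enough_test = split_by_scenario(toy, "enough_resource")
low_train, low_test = split_by_scenario(toy, "low_resource")
zero_train, zero_test = split_by_scenario(toy, "no_resource")
```

A fixed seed keeps the shuffled splits reproducible, which matters when comparing text-only and text+social variants on the same partition.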

Intended use¶

Multilingual fake news detection: Train and evaluate detection models across multiple languages
Cross-lingual transfer learning: Test whether detectors trained on English transfer to under-resourced languages
Language-invariant feature discovery: Identify shared misinformation patterns across linguistic boundaries
Early detection: Leverage temporal social engagement patterns to detect misinformation before widespread propagation
Fact-checking assistance: Prioritize claims for human fact-checkers based on language and engagement signals
Mitigation strategy research: Study propagation networks to design intervention strategies (bot removal, influential user targeting)

Limitations¶

Languages: Covers only six languages; excludes Chinese, Russian, and other languages in which COVID-19 misinformation circulates widely
Fact-checking source bias: Reliance on English-majority fact-checking agencies (PolitiFact, Snopes) introduces language bias in what gets labeled
Temporal boundary: Data collection focused on early 2020; later pandemic phases (vaccination period) underrepresented
Real news baseline: Real news is not explicitly fact-checked; assumes unverified tweets from official health sources are accurate
Tweet deletions: Historical tweet retrieval may be incomplete due to account suspensions or deletions
Monolingual fact-checks: Many fact-checks exist only in English; translated fact-checks for non-English languages may introduce translation artifacts

Comparison with related datasets¶

Dataset	Year	Languages	Multimodal	Social context	COVID-19 focused
FakeNewsNet	2018	English	No	Yes
ReCOVery	2020	English	Yes	Yes
CHECKED	2020	Chinese (Weibo)	Yes	Yes
MM-COVID	2020	6 languages	Yes	Yes	Yes
NELA-GT-2020	2020	English	No	No

MM-COVID is unique in combining multilingual coverage with multimodal features (text + image + social context) and temporal information, filling a gap between monolingual English datasets (FakeNewsNet, NELA-GT) and non-English datasets (CHECKED).

Connections¶

ReCOVery (Zhou et al., 2020) covers COVID-19 credibility but in English only; MM-COVID extends to multiple languages
HERO (Zhou et al., 2023) uses MM-COVID as a benchmark for linguistic-style-based fake news detection
dEFEND (Shu et al., 2019) is adapted and evaluated on MM-COVID; its social-context-only variant (dEFEND\N) shows benefits for multilingual detection
Multimodal fake news detection: MM-COVID provides a testbed for combining text, image, and social signals
COVID-19 misinformation: MM-COVID is a foundational dataset for pandemic-specific detection research