ReCOVery

Full name: ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research
Authors: Zhou Xinyi, Mulay Apurva, Ferrara Emilio, Zafarani Reza
Paper: Zhou et al. (2020), CIKM '20
Access: http://coronavirus-fakenews.com (tweet IDs and article IDs; full text via provided instructions)

Description

ReCOVery is a multimodal dataset of COVID-19 news articles labeled for credibility. Labels are assigned at the publisher level using NewsGuard and Media Bias/Fact Check (MBFC) rather than at the per-article level, enabling scalable collection without manual annotation. This design allows continuous extension as publishers release more COVID-19 content.

Statistics

Class	Articles	With images	With social data	Tweets	Users
Reliable	1,364	1,354	1,219	114,402	78,659
Unreliable	665	663	528	26,418	17,323
Total	2,029	2,017	1,747	140,820	93,761

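Class ratio: approximately 2:1 reliable to unreliable (1,364 vs. 665 articles). The Users total is lower than the sum of the two class rows, presumably because some users shared both reliable and unreliable articles.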
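The class ratio follows directly from the article counts in the table. The short check below reproduces it and also derives inverse-frequency class weights, one common way to compensate for this kind of imbalance. The weighting scheme is an illustrative assumption, not something the dataset prescribes.

```python
# Article counts copied from the statistics table.
counts = {"reliable": 1364, "unreliable": 665}

total = sum(counts.values())
ratio = counts["reliable"] / counts["unreliable"]

# Inverse-frequency weights: total / (n_classes * class_count).
# Assumption: this scheme is only an example; ReCOVery does not mandate it.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"total articles: {total}")
print(f"reliable:unreliable = {ratio:.2f}:1")
print({label: round(w, 3) for label, w in weights.items()})
```

The minority (unreliable) class receives the larger weight, so errors on it count more during training.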

Schema

Each news article has 12 components:

Field	Description
News ID	Unique article identifier
URL	Source URL
Publisher	Name of the publishing media outlet
Publication date	yyyy-mm-dd format
Author(s)	Byline (may be blank or fictional)
Title	News headline
Body text	Full article text
Main image	URL of the primary/head image
Country	Country of the publisher
Political bias	One of: extremely left / left / left-center / center / right-center / right / extremely right (from AllSides + MBFC)
NewsGuard score	0–100 credibility score
MBFC factual level	very high / high / mostly factual / mixed / low / very low

Social data fields: tweet ID, tweet text, language, creation time, retweet/reply/like counts; posting user ID, follower count, friend count.
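As a minimal sketch of how these records could be represented in code, the dataclasses below mirror the 12 article components and the social data fields. All attribute names and types are hypothetical choices for illustration (the released files may use different column names and types), and the example values are placeholders, not real dataset entries.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Tweet:
    """One tweet sharing an article, plus basic info on the posting user."""
    tweet_id: str
    text: str
    language: str
    created_at: str
    retweet_count: int
    reply_count: int
    like_count: int
    user_id: str
    follower_count: int
    friend_count: int


@dataclass
class NewsArticle:
    """The 12 article components; attribute names are assumptions."""
    news_id: str
    url: str
    publisher: str
    publication_date: Optional[str]  # yyyy-mm-dd
    authors: List[str]               # may be empty or fictional
    title: str
    body_text: str
    main_image_url: Optional[str]
    country: str
    political_bias: Optional[str]
    newsguard_score: Optional[float]  # 0-100
    mbfc_factual_level: Optional[str]
    tweets: List[Tweet] = field(default_factory=list)  # social data, if any


# The 12 components, excluding the attached social data.
ARTICLE_COMPONENTS = [f.name for f in fields(NewsArticle) if f.name != "tweets"]

# Placeholder record for illustration only.
example = NewsArticle(
    news_id="0", url="https://example.com/story", publisher="Example Outlet",
    publication_date="2020-03-15", authors=[], title="Placeholder headline",
    body_text="Placeholder body.", main_image_url=None, country="US",
    political_bias="center", newsguard_score=95.0, mbfc_factual_level="high",
)
print(len(ARTICLE_COMPONENTS), len(example.tweets))
```

Keeping tweets as an optional list reflects that only a subset of articles has social data.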

Labeling methodology

Publishers are classified using strict thresholds:

- Reliable: NewsGuard score > 90 AND MBFC factual level = "very high" or "high"
- Unreliable: NewsGuard score < 30 AND MBFC factual level = "low" or "very low"

The 90/30 thresholds are stricter than NewsGuard's default cutoff of 60 and are intended to reduce false positives and false negatives in the publisher labels. The dataset includes 22 reliable publishers (e.g., NPR, Reuters) and 38 unreliable publishers (e.g., Humans Are Free, Natural News). Publisher countries covered: US, Russia, UK, Iran, Cyprus, Canada.
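The labeling rule can be sketched as a small function. This is a minimal illustration of the thresholds stated above, not the authors' code; the function name, the lowercase normalization, and the choice to return `None` for publishers that meet neither rule are assumptions.

```python
from typing import Optional

RELIABLE_LEVELS = {"very high", "high"}
UNRELIABLE_LEVELS = {"low", "very low"}


def label_publisher(newsguard_score: float, mbfc_level: str) -> Optional[str]:
    """Return 'reliable', 'unreliable', or None if neither rule applies."""
    level = mbfc_level.strip().lower()
    # Both sources must agree; the score comparisons are strict (>90, <30).
    if newsguard_score > 90 and level in RELIABLE_LEVELS:
        return "reliable"
    if newsguard_score < 30 and level in UNRELIABLE_LEVELS:
        return "unreliable"
    return None  # publisher falls between the thresholds and is excluded


# Made-up scores for illustration; not actual publisher ratings.
print(label_publisher(95, "High"))
print(label_publisher(20, "very low"))
print(label_publisher(75, "high"))
print(label_publisher(95, "mixed"))
```

Requiring agreement between both rating sources means a publisher with a high NewsGuard score but a "mixed" MBFC level is left unlabeled rather than counted as reliable.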

Intended use

Credibility classification of COVID-19 news
Multimodal fake news detection (text + image + social)
Analysis of misinformation propagation dynamics during a pandemic
Study of political bias in COVID-19 news coverage

Limitations

Publisher-level labels introduce noise: a credible publisher may occasionally publish an error; an unreliable publisher may republish a factual story. The corpus is US-heavy and English-language only. Tweet data is released as IDs only; historical tweet retrieval may be incomplete due to tweet deletions.

Connections

FakeNewsNet is the closest structural analogue (news + Twitter spreading), covering general fake news with per-article fact-check labels rather than publisher-level labels.
SAFE (Zhou et al., 2020) is evaluated on ReCOVery as a multimodal baseline and achieves the best F₁ among the benchmarked methods.
Cao et al. (2025): SLIM uses ReCOVery as a primary benchmark for limited-information detection, with the train/val/test split provided by the dataset (966/278/120 reliable and 487/114/64 unreliable articles). It reports 95.55% accuracy for the full-text XLNet baseline and an accuracy ratio of about 99% with 30% keyword extraction.
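The split sizes quoted for SLIM can be checked against the class totals in the Statistics table. The snippet below is a simple arithmetic check using only the numbers given on this page; the dictionary layout and the "unreliable" key are illustrative choices.

```python
# Split sizes as quoted for SLIM (train/val/test per class).
splits = {
    "reliable": {"train": 966, "val": 278, "test": 120},
    "unreliable": {"train": 487, "val": 114, "test": 64},
}
# Class totals from the Statistics table.
class_totals = {"reliable": 1364, "unreliable": 665}

# Each class's splits should add up to its total article count.
for label, parts in splits.items():
    assert sum(parts.values()) == class_totals[label], label

# Share of all articles that falls in each split.
total = sum(class_totals.values())
shares = {
    name: sum(parts[name] for parts in splits.values()) / total
    for name in ("train", "val", "test")
}
print({name: f"{share:.1%}" for name, share in shares.items()})
```

The splits account for all 2,029 articles, with roughly seven in ten used for training.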