ReCOVery

Full name: ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research
Authors: Zhou Xinyi, Mulay Apurva, Ferrara Emilio, Zafarani Reza
Paper: Zhou et al. (2020), CIKM '20
Access: http://coronavirus-fakenews.com (tweet IDs and article IDs; full text via provided instructions)

Description

ReCOVery is a multimodal dataset of COVID-19 news articles labeled for credibility. Labels are assigned at the publisher level using NewsGuard and Media Bias/Fact Check (MBFC) rather than at the per-article level, enabling scalable collection without manual annotation. This design allows continuous extension as publishers release more COVID-19 content.

Statistics

Split       Articles  With images  With social data  Tweets   Users
Reliable    1,364     1,354        1,219             114,402  78,659
Unreliable  665       663          528               26,418   17,323
Total       2,029     2,017        1,747             140,820  93,761

Class ratio: approximately 2:1 reliable to unreliable.
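The ratio can be reproduced from the article counts in the table. The sketch below also derives inverse-frequency class weights, which are one common way to compensate for the imbalance during training; the weighting scheme is an assumption for illustration and is not prescribed by the dataset.

```python
# Article counts copied from the statistics table.
counts = {"reliable": 1364, "unreliable": 665}

total = sum(counts.values())
ratio = counts["reliable"] / counts["unreliable"]

# Inverse-frequency weights: total / (number of classes * class count).
# This convention is an assumption, not part of the dataset release.
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"total articles: {total}")
print(f"reliable:unreliable = {ratio:.2f}:1")
print(class_weights)
```

The minority (unreliable) class receives the larger weight, so each misclassified unreliable article contributes more to a weighted loss.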

Schema

Each news article has 12 components:

Field               Description
News ID             Unique article identifier
URL                 Source URL
Publisher           Name of the publishing media outlet
Publication date    yyyy-mm-dd format
Author(s)           Byline (may be blank or fictional)
Title               News headline
Body text           Full article text
Main image          URL of the primary/head image
Country             Country of the publisher
Political bias      One of: extremely left / left / left-center / center / right-center / right / extremely right (from AllSides + MBFC)
NewsGuard score     0–100 credibility score
MBFC factual level  very high / high / mostly factual / mixed / low / very low

Social data fields: tweet ID, tweet text, language, creation time, retweet/reply/like counts; posting user ID, follower count, friend count.
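As a rough illustration of the article schema, the 12 components could be held in a typed record like the one below. The class name, attribute names, and validation rules are hypothetical; the released files may use different column names and types, and the example values are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Bias labels as listed in the schema.
POLITICAL_BIAS = (
    "extremely left", "left", "left-center", "center",
    "right-center", "right", "extremely right",
)


@dataclass
class NewsArticle:
    """One article record; attribute names are illustrative assumptions."""
    news_id: str
    url: str
    publisher: str
    publication_date: str          # yyyy-mm-dd
    authors: List[str]             # may be empty; bylines can be blank or fictional
    title: str
    body_text: str
    main_image_url: Optional[str]  # None when the article has no image
    country: str
    political_bias: str
    newsguard_score: float         # 0-100
    mbfc_factual_level: str

    def __post_init__(self) -> None:
        # Raises ValueError when the string is not a valid ISO 8601 date.
        date.fromisoformat(self.publication_date)
        if not 0 <= self.newsguard_score <= 100:
            raise ValueError("NewsGuard score must be within 0-100")
        if self.political_bias not in POLITICAL_BIAS:
            raise ValueError(f"unknown political bias: {self.political_bias!r}")


# Invented example values, for illustration only.
article = NewsArticle(
    news_id="0",
    url="https://example.com/story",
    publisher="Example News",
    publication_date="2020-03-15",
    authors=[],
    title="Example headline",
    body_text="Example body.",
    main_image_url=None,
    country="US",
    political_bias="center",
    newsguard_score=95,
    mbfc_factual_level="high",
)
```

Validating at construction time surfaces malformed dates or out-of-range scores early, before they reach a training pipeline.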

Labeling methodology

Publishers are classified using strict thresholds:

  • Reliable: NewsGuard score > 90 AND MBFC factual level = "very high" or "high"
  • Unreliable: NewsGuard score < 30 AND MBFC factual level = "low" or "very low"

The 90/30 thresholds (vs. NewsGuard's default cutoff of 60) are intended to reduce false positives and false negatives. The dataset includes 22 reliable publishers (e.g., NPR, Reuters) and 38 unreliable publishers (e.g., Humans Are Free, Natural News). Countries covered: US, Russia, UK, Iran, Cyprus, Canada.
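The two-source rule can be sketched as a small function. The function name, the string labels it returns, and the choice to return None for publishers that satisfy neither rule are assumptions made for illustration; only the thresholds and factual levels come from the description above.

```python
from typing import Optional

RELIABLE_MBFC = {"very high", "high"}
UNRELIABLE_MBFC = {"low", "very low"}


def label_publisher(newsguard_score: float, mbfc_level: str) -> Optional[str]:
    """Apply the strict thresholds; both sources must agree."""
    level = mbfc_level.strip().lower()
    if newsguard_score > 90 and level in RELIABLE_MBFC:
        return "reliable"
    if newsguard_score < 30 and level in UNRELIABLE_MBFC:
        return "unreliable"
    # Neither rule is satisfied (assumed handling: no label).
    return None


# Hypothetical scores, not taken from any real publisher.
print(label_publisher(95, "High"))      # both sources agree: reliable
print(label_publisher(20, "very low"))  # both sources agree: unreliable
print(label_publisher(90, "high"))      # 90 is not strictly above 90: no label
print(label_publisher(95, "low"))       # sources disagree: no label
```

Because the comparisons are strict and both conditions must hold, a publisher with a borderline score or with conflicting ratings falls outside both classes.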

Intended use

  • Credibility classification of COVID-19 news
  • Multimodal fake news detection (text + image + social)
  • Analysis of misinformation propagation dynamics during a pandemic
  • Study of political bias in COVID-19 news coverage

Limitations

Publisher-level labels introduce noise: a credible publisher may occasionally publish an error; an unreliable publisher may republish a factual story. The corpus is US-heavy and English-language only. Tweet data is released as IDs only; historical tweet retrieval may be incomplete due to tweet deletions.
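Because tweets are distributed as IDs, it can be useful to measure how much of the social data is still retrievable before running experiments. The sketch below assumes the IDs have already been looked up by some external tool and simply compares the released IDs against the retrieved ones; the function names and the toy IDs are hypothetical.

```python
from typing import Dict, Iterable


def hydration_coverage(released_ids: Iterable[str], retrieved_ids: Iterable[str]) -> float:
    """Fraction of released tweet IDs that could still be retrieved."""
    released = set(released_ids)
    if not released:
        raise ValueError("released_ids is empty")
    return len(released & set(retrieved_ids)) / len(released)


def coverage_by_label(
    released: Dict[str, Iterable[str]], retrieved_ids: Iterable[str]
) -> Dict[str, float]:
    """Coverage per credibility label, to spot uneven tweet loss."""
    retrieved = set(retrieved_ids)  # materialize once so iterators are not exhausted
    return {label: hydration_coverage(ids, retrieved) for label, ids in released.items()}


# Toy IDs for illustration only; these are not real tweet IDs.
released = {"reliable": ["1", "2", "3", "4"], "unreliable": ["5", "6"]}
retrieved = ["1", "2", "3", "5"]
coverage = coverage_by_label(released, retrieved)
print(coverage)
```

Reporting coverage separately per label matters because uneven loss between the two classes would shift the social-data distribution away from the published statistics.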

Connections

  • FakeNewsNet is the closest structural analogue (news + Twitter spreading), covering general fake news with per-article fact-check labels rather than publisher-level labels.
  • SAFE (Zhou et al., 2020) is evaluated on ReCOVery as a multimodal baseline and achieves the best benchmark F₁.
  • Cao et al. (2025) — SLIM uses ReCOVery (train/val/test as provided by the dataset, 966/278/120 reliable and 487/114/64 fake) as a primary benchmark for limited-information detection; reports 95.55% accuracy with the full-text XLNet baseline and ~99% accuracy ratio with 30% keyword extraction.
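As a quick consistency check, the split sizes quoted in the last item can be compared against the article totals in the statistics table. The snippet below is only arithmetic on numbers already given here; the dictionary layout and variable names are illustrative.

```python
# Split sizes as quoted for Cao et al. (2025), in train/val/test order.
splits = {
    "reliable": {"train": 966, "val": 278, "test": 120},
    "fake": {"train": 487, "val": 114, "test": 64},
}
# Per-class article totals from the statistics table.
expected_totals = {"reliable": 1364, "fake": 665}

for label, parts in splits.items():
    assert sum(parts.values()) == expected_totals[label], label

n_total = sum(expected_totals.values())
split_share = {
    name: sum(splits[label][name] for label in splits) / n_total
    for name in ("train", "val", "test")
}
print({name: round(share, 3) for name, share in split_share.items()})
```

The per-class sums match the table, so the quoted splits cover every article in the dataset.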