ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research¶

Authors: Xinyi Zhou, Apurva Mulay, Emilio Ferrara, Reza Zafarani Venue: CIKM '20, October 19–23, 2020, Virtual Event — DOI

TL;DR¶

The COVID-19 pandemic triggered a parallel "infodemic" of low-credibility news; this paper constructs ReCOVery, a multimodal repository of 2,029 COVID-19 news articles with binary credibility labels inherited from their publisher's NewsGuard/MBFC ratings, paired with 140,820 tweets tracking social spread. The key design trade-off is scalability vs. per-article accuracy: labeling at the publisher level (NewsGuard >90 and MBFC very high/high = reliable; NewsGuard <30 and MBFC below mixed = unreliable) avoids costly per-article annotation while likely introducing some noise. Baseline experiments across LIWC, RST, Text-CNN, and SAFE reveal that multimodal features consistently outperform single-modal ones, with SAFE achieving F₁ 0.833 (reliable) and 0.672 (unreliable).

Contributions¶

Constructs ReCOVery: 2,029 COVID-19 news articles (1,364 reliable, 665 unreliable) published January–May 2020, collected from 60 pre-screened publishers across six countries, with 140,820 associated tweets from 93,761 unique users.
Introduces a site-level credibility labeling approach using dual external authorities (NewsGuard and MBFC) with strict thresholds (>90/very high–high for reliable; <30/below mixed for unreliable), enabling scalable ground-truth construction without per-article annotation.
Provides 12 components per article — textual (title, body), visual (head image URL), temporal (publication date), social (tweet IDs, user metadata), country, and political bias — enabling multimodal research.
Reports benchmark results for four baselines (LIWC+DT, RST+DT, Text-CNN, SAFE) as a reference for future methods; codes and data at http://coronavirus-fakenews.com.

Method¶

Publisher selection. ~2,000 news publishers listed in MBFC were searched. A publisher is classified reliable if its NewsGuard score >90 AND its MBFC factual level is "very high" or "high"; unreliable if NewsGuard <30 AND MBFC is "low" or "very low" (below mixed). These thresholds are deliberately more extreme than NewsGuard's default (60) to reduce false positives and negatives. This yielded 22 reliable and 38 unreliable publishers; several known-fake outlets were excluded because their domains had gone offline.

Article collection. Articles from selected publishers are crawled using the Newspaper Python library, filtered by keywords "SARS-CoV-2", "COVID-19", or "Coronavirus" (case-insensitive). These are the WHO-official names, chosen to avoid naming biases and to target fake news that mimics professional terminology — considered more dangerous than obviously fringe content. Each article is stored with 12 fields: ID, URL, publisher, publication date, author(s), title, body text, main image, country, political bias (AllSides + MBFC), NewsGuard score, and MBFC factual level.

Social tracking. Twitter Premium Search API is queried by news article URL to find all tweets posted after the article's publication date (up to May 26, 2020). Per-tweet data includes ID, text, language, creation time, retweet/reply/like counts, and the posting user's ID, follower count, and friend count. Only tweet IDs are publicly released (Twitter Terms of Service compliance).

Dataset characteristics. The corpus is imbalanced at ~2:1 reliable-to-unreliable. Most articles have both textual and visual information (2,017/2,029) and have appeared on Twitter (1,747/2,029). Author counts follow a long-tail distribution; coauthorship network among 1,095 authors shows >90% with ≤2 collaborators. Publication volume grows exponentially from January to May 2020 (tracking COVID-19 case growth). Six countries are represented, dominated by US publishers.

Results¶

Experiments use an 80/20 train/test split with precision, recall, and F₁ evaluation (reported because of the ~2:1 class imbalance).

Method	Reliable Pre.	Reliable Rec.	Reliable F₁	Unreliable Pre.	Unreliable Rec.	Unreliable F₁
LIWC+DT	0.779	0.771	0.775	0.540	0.552	0.545
RST+DT	0.721	0.705	0.712	0.421	0.441	0.430
Text-CNN	0.746	0.782	0.764	0.522	0.472	0.496
SAFE	0.836	0.829	0.833	0.667	0.677	0.672

Key observations: (1) all four baselines are content-based; social-media-mining approaches are explicitly flagged as a promising direction. (2) Multimodal SAFE consistently outperforms single-modal methods, echoing findings from SAFE's original evaluation. (3) Unreliable news is harder to classify than reliable news across all methods — the class imbalance and the label-inheritance noise at the article level likely both contribute.

Connections¶

The ReCOVery dataset is the primary artifact; see that page for schema details and download instructions.
SAFE (Zhou et al., 2020) is one of the four baselines used here; its cross-modal text-image similarity approach achieves the best benchmark performance on ReCOVery.
FakeNewsNet is the closest antecedent in design — fact-check-labeled news articles + Twitter spreading data — but covers general fake news, not COVID-19, and uses per-article fact-checking rather than publisher-level labeling.
The credibility assessment paradigm is operationalized here at the publisher level; compare with Sitaula et al. (2019), which applies credibility reasoning at the author and article level.
ReCOVery's dual NewsGuard+MBFC labeling methodology is adapted from FakeNewsNet's use of PolitiFact and GossipCop fact-check labels — both sidestep per-article human annotation in favor of a structured external authority.
See COVID-19 misinformation for related work on the infodemic.
Yang et al. (2020) — CHECKED is the Chinese-language complement: where ReCOVery provides English news articles with publisher-level labels, CHECKED provides Chinese Weibo microblogs with per-item expert labels, enabling cross-lingual infodemic research.
Zhou et al. (2023) — HERO uses ReCOVery as a primary evaluation benchmark and surpasses all baselines reported here, achieving AUC 0.866 using hierarchical linguistic trees on the same corpus.

Notes¶

The publisher-level labeling is the central design tension: it makes the corpus scalable and continuously extensible (add a publisher, get articles for free), but it conflates publisher-level reliability with article-level truth. A high-credibility outlet can publish a mistake; a low-credibility outlet can republish a factual wire story. The authors acknowledge this and frame it as a "trade-off between dataset scalability and label accuracy" — a principled position, but researchers training classifiers on this data should expect noisy labels, particularly in the unreliable class (where 665 articles come from 38 diverse sources, many with very low NewsGuard scores averaging ~15).

The political-bias field (AllSides + MBFC) is a valuable secondary annotation largely absent in other COVID-19 misinformation datasets. Figure 11 shows that right-leaning content is more evenly distributed across credibility levels than left-leaning content — an intriguing finding the paper notes but does not analyze in depth.

The four baselines are all content-based by design, which the authors explicitly flag as a limitation. Given that the repository includes rich social graph data (140,820 tweets, follower/friend networks), a propagation-based or user-profile-based approach similar to Shu et al. seems a natural next step that the baselines section deliberately leaves open.