CoAID: COVID-19 Healthcare Misinformation Dataset

Affiliation: The Pennsylvania State University, University Park, PA 16802

Venue: arXiv:2006.00885 (2020)

TL;DR

CoAID is a benchmark dataset for COVID-19 healthcare misinformation detection comprising 4,251 news items (3,769 news articles, of which 204 are fake and 3,565 true, plus 482 claims, of which 28 are fake and 454 true), 296,000 user engagements from social platforms, and 926 social media posts with ground-truth labels. The dataset combines text content, user engagement features, and multi-platform coverage, enabling evaluation of misinformation detection methods in the healthcare context.

Contributions

Introduces CoAID, a comprehensive COVID-19 misinformation dataset integrating diverse information types: news articles from reliable health websites and misinformation sites, true and false health claims, tweets and replies from Twitter, and social platform posts (Facebook, Twitter, Instagram, YouTube, TikTok) identified through fact-checking sites such as LeadStories.
Provides 296,000 user engagement records (tweets, replies) linked to news, enabling detection methods that leverage social context signals such as user sentiment, retweet patterns, and conversation structure.
Compares CoAID's coverage and feature richness to existing datasets (LIAR, FA-KES, FakeNewsNet, FakeHealth), demonstrating uniqueness: CoAID is the only dataset integrating claims, news articles, tweets, and platform-level posts simultaneously.
Benchmarks multiple detection methods (SVM, LR, RF, CNN, BiGRU, CSI, SAMEv, HAN, dEFEND) on CoAID, establishing baselines and revealing that state-of-the-art multimodal and contextual methods outperform simple baselines.
Documents automatic update mechanisms to keep the dataset current with newly published COVID-19 misinformation.

Method

Data collection: The dataset integrates information from multiple sources and modalities:

News articles: Crawled from 9 reliable health news outlets (HealthlineNews, Science Daily, Health Medicine, Medical News Today, Mayo Clinic, Cleveland Clinic, WebMD, WHO, CDC) and from misinformation sources identified through fact-checkers and prior work. Newspaper3k was used for web scraping, followed by keyword filtering.
Claims: Extracted one- to two-sentence claims from the WHO official website and Twitter, then manually separated them into true and false claims (28 false, 454 true).
User engagement - Tweets: Retrieved using Twitter API with news article titles as search queries; specified date ranges to capture tweets discussing each news item. Collected tweet ID, text, user engagement metrics (likes, retweets, replies).
User engagement - Replies: Obtained tweet reply threads using tweet IDs, capturing conversational context around news articles.
Social platform posts: Collected both fake and true posts on social platforms (Facebook, Twitter, Instagram, YouTube, TikTok) as identified by five fact-checking sites (LeadStories, FactCheck.org, CheckYourFact, AFP Fact Check, Health Feedback). Posts were deduplicated by removing multiple reposts of the same item by different users.
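The deduplication of social platform posts could look roughly like the sketch below. It assumes each post is a dict with hypothetical "user" and "text" fields, and that two posts count as the same item when their text matches after lowercasing and whitespace normalization. The matching rule actually used for CoAID is not described here, so treat this as an illustration only.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate_posts(posts):
    """Keep the first occurrence of each distinct post text.

    `posts` is a list of dicts with hypothetical keys "user" and "text".
    """
    seen = set()
    unique = []
    for post in posts:
        key = normalize(post["text"])
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique

# Invented example posts; the second is a repost of the first by another user.
posts = [
    {"user": "a", "text": "5G towers spread the virus"},
    {"user": "b", "text": "5G  towers spread the VIRUS "},
    {"user": "c", "text": "Masks reduce transmission"},
]
unique_posts = deduplicate_posts(posts)
print([p["user"] for p in unique_posts])
```

Exact-match deduplication is the simplest choice; near-duplicate detection (for example, by edit distance) would catch lightly edited reposts at the cost of more false merges.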

Timeline: December 1, 2019 to September 1, 2020 (9 months).

Dataset composition statistics (Version 0.3):

Information Type	Fake	True	Total
Website claims	28	454	482
News articles	204	3,565	3,769
Tweets	484	8,092	8,576
Replies	626	12,451	13,077
Social platform posts	650	42	692
Total (excluding news articles)	1,788	21,039	22,827
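As a quick consistency check on the table above, the sketch below recomputes the row totals and the true-to-fake ratio for news articles. The counts are copied from the table; the dictionary layout and variable names are illustrative only.

```python
# (fake, true) counts copied from the Version 0.3 table.
counts = {
    "Website claims": (28, 454),
    "News articles": (204, 3565),
    "Tweets": (484, 8092),
    "Replies": (626, 12451),
    "Social platform posts": (650, 42),
}

# Per-row totals.
totals = {name: fake + true for name, (fake, true) in counts.items()}

# The table's final row equals the sum of every row except news articles.
non_article_rows = [name for name in counts if name != "News articles"]
fake_sum = sum(counts[name][0] for name in non_article_rows)
true_sum = sum(counts[name][1] for name in non_article_rows)

# True-to-fake ratio for news articles.
article_ratio = counts["News articles"][1] / counts["News articles"][0]

print(totals)
print(fake_sum, true_sum, fake_sum + true_sum)
print(round(article_ratio, 1))
```

Note that claims and news articles together account for 482 + 3,769 = 4,251 items, and that social platform posts are the only information type in which fake items outnumber true ones.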

Topics covered include: COVID-19, coronavirus, pneumonia, lockdown, stay-home, quarantine, ventilator, and related health terminology.
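A minimal sketch of keyword filtering over article titles, using the topic terms listed above. The case-insensitive substring rule and the function name are assumptions; the exact filter used to build CoAID may differ.

```python
# Topic terms taken from the list above; matching is case-insensitive.
KEYWORDS = [
    "covid-19", "coronavirus", "pneumonia", "lockdown",
    "stay-home", "quarantine", "ventilator",
]

def is_covid_related(title: str, keywords=KEYWORDS) -> bool:
    """Return True if any keyword occurs as a substring of the title."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in keywords)

# Invented example titles.
titles = [
    "Coronavirus: what we know so far",
    "Ten tips for better sleep",
    "How ventilators help patients breathe",
]
kept = [title for title in titles if is_covid_related(title)]
print(kept)
```

Substring matching is deliberately permissive (for example, "ventilator" also matches "ventilators"); a stricter filter could match on word boundaries instead.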

Results

Misinformation detection experiments: The authors evaluated nine baseline and state-of-the-art methods:

Simple baselines: SVM (bag-of-words), Logistic Regression (LR), Random Forest (RF)
Deep learning: CNN, BiGRU
Content + context: CSI (combines article content with user comment sentiment), SAMEv (uses image, content, and metadata), HAN (hierarchical attention), dEFEND (hierarchical attention + co-attention)

Setup: 75% training, 25% testing on CoAID Version 0.1. Word embeddings (100d GloVe) used; training with Adam optimizer, batch size 50, 10 epochs, cross-entropy loss.
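The 75/25 split could be reproduced along the lines of the sketch below. The summary does not say whether the split was stratified or which random seed was used, so this version assumes a seeded, per-class shuffle that keeps the rare fake class present in both partitions. The function name and toy data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, train_fraction=0.75, seed=0):
    """Split items into train/test lists of (item, label), class by class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)  # seeded shuffle, so the split is repeatable
        cut = int(round(len(group) * train_fraction))
        train += [(item, label) for item in group[:cut]]
        test += [(item, label) for item in group[cut:]]
    return train, test

# Toy data with a 1:9 fake-to-true ratio.
items = [f"article_{i}" for i in range(200)]
labels = ["fake"] * 20 + ["true"] * 180
train, test = stratified_split(items, labels)
print(len(train), len(test))
```

With so few fake items, an unstratified random split can leave the test set with almost no positives, which makes per-class metrics unstable.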

Findings: State-of-the-art methods (SAMEv, HAN, dEFEND), including those that incorporate user engagement signals, substantially outperform simple baselines. However, the COVID-19 detection task exhibits severe class imbalance (many more true than false cases), so models can achieve high overall accuracy while generating excessive false positives. This is a critical problem for real-world deployment, where false alarms erode user trust.
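To see why overall accuracy is a weak signal under this imbalance, the sketch below scores a trivial classifier that labels every news article as true. It uses the Version 0.3 article counts (204 fake, 3,565 true) purely as an illustration; the experiments themselves were run on Version 0.1, and the helper `binary_metrics` is a hypothetical name.

```python
def binary_metrics(y_true, y_pred, positive="fake"):
    """Accuracy, precision, recall, and F1 with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Majority-class baseline: predict "true" for every article.
y_true = ["fake"] * 204 + ["true"] * 3565
y_pred = ["true"] * len(y_true)
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The baseline is right on every true article and wrong on every fake one, so its accuracy equals the share of true articles while its recall and F1 on the fake class are zero. Per-class metrics are therefore the more informative comparison on this dataset.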

Sentiment analysis of user engagement: Using VADER sentiment analysis on Twitter engagement, the authors find that tweets discussing false news show more negative sentiment than those discussing true news, which suggests that users tend to distrust false claims even before seeing fact-checks.
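A comparison of this kind can be sketched as averaging a per-tweet sentiment score within each label. The aggregation below takes the scorer as a parameter: `vader_score_fn` wraps the third-party `vaderSentiment` package (an optional dependency), while `toy_score` is an invented stand-in so the example runs without it. The tweets are made up for illustration.

```python
from statistics import mean

def mean_score_by_label(tweets, score_fn):
    """Average score_fn(text) over tweets, grouped by their news label.

    `tweets` is an iterable of (text, label) pairs.
    """
    scores = {}
    for text, label in tweets:
        scores.setdefault(label, []).append(score_fn(text))
    return {label: mean(values) for label, values in scores.items()}

def vader_score_fn():
    """Build a scorer returning VADER's compound score in [-1, 1].

    Requires the optional third-party package `vaderSentiment`.
    """
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]

# Invented stand-in scorer: minus the fraction of words in a tiny negative list.
NEGATIVE_WORDS = {"hoax", "lie", "scam"}

def toy_score(text):
    words = text.lower().split()
    return -sum(word in NEGATIVE_WORDS for word in words) / max(len(words), 1)

tweets = [
    ("this is a hoax", "fake"),
    ("total scam and lie", "fake"),
    ("wash your hands often", "true"),
]
result = mean_score_by_label(tweets, toy_score)
print(result)
```

To use VADER instead of the toy scorer, pass `vader_score_fn()` as the second argument.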

Common misinformation claims: Tracked frequency of recurring false claims over time (e.g., "COVID-19 is just like the flu," "5G mobile networks spread COVID-19"). Peak claim volumes correspond to real-world events (e.g., "5G spreads COVID" peaked April 10, 2020 when 5G towers were being set on fire globally).
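Tracking a recurring claim over time amounts to counting matching posts per day and locating the peak. The sketch below assumes posts arrive as (date, text) pairs and that a claim is matched by a simple case-insensitive phrase; both are assumptions, and the example posts are invented rather than taken from CoAID.

```python
from collections import Counter
from datetime import date

def daily_claim_counts(posts, phrase):
    """Count posts per day whose text contains `phrase` (case-insensitive)."""
    phrase = phrase.lower()
    return Counter(day for day, text in posts if phrase in text.lower())

# Invented example posts.
posts = [
    (date(2020, 4, 9), "Do 5G networks spread COVID-19?"),
    (date(2020, 4, 10), "5G towers cause the virus"),
    (date(2020, 4, 10), "Stop saying 5G spreads it"),
    (date(2020, 4, 10), "COVID-19 is just like the flu"),
]
counts = daily_claim_counts(posts, "5g")
peak_day, peak_count = counts.most_common(1)[0]
print(peak_day, peak_count)
```

A single phrase is a crude proxy for a claim; grouping several paraphrases under one claim would give a more faithful count.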

Connections

CHECKED is a complementary dataset: Chinese-language, Weibo-only, multimedia-rich, explicitly labeled for fake/real with expert verification; CoAID is English, multi-platform, primarily text + engagement, broader in source diversity.
The COVID-19 Social Media Infodemic studies COVID-19 information diffusion across platforms using epidemic models; CoAID provides the underlying labeled ground truth for such studies.
FakeNewsNet is the largest general fake news dataset with news-level labels; CoAID brings that paradigm specifically to health misinformation with user engagement features.
LIAR introduced the claim-level misinformation dataset; CoAID combines claims with articles and social engagement, extending LIAR to the health domain.
Misinformation and fake news detection is the containing topic; CoAID is a central benchmark for COVID-19-specific detection work.
Datasets and benchmarks is the broader category; CoAID is used by many downstream studies on COVID-19 misinformation.

Notes

Strengths: First dataset to integrate multiple misinformation modalities (articles, claims, social tweets/replies, platform posts) in a single resource. Multimodal coverage enables evaluation of methods that combine textual, engagement, and platform-level signals. Automatic update mechanism makes it evolving rather than static. Large-scale user engagement data (296K tweets/replies) is a rich resource for social-context detection methods. Temporal collection span (9 months) allows studying how misinformation narratives evolve.

Limitations: Class imbalance is severe: true news vastly outnumbers false news (more than 10:1; 3,565 to 204 for news articles), making this an inherently imbalanced classification problem. A classifier that labels every news article as true would already reach about 94.6% accuracy (3,565/3,769) while detecting no fake news, so accuracy above 90% says little on its own. Detection baselines show that state-of-the-art methods are needed to outperform simpler approaches, but the paper does not deeply diagnose why simple methods fail (feature analysis would help). Social media posts are collected only once per URL/user pair, potentially missing duplicate/repost dynamics that are important in understanding viral misinformation. Detection results are reported at the article level, but the user engagement data is post-level; this mismatch in granularity limits joint model evaluation. The dataset is English-only, limiting generalizability to non-English COVID-19 discourse.

Follow-up opportunities: Multimodal methods exploiting the image/video fields (collected but not benchmarked). Temporal detection: predicting which articles will become viral sources of misinformation before engagement occurs. Cross-platform propagation tracking: linking the same misinformation across Twitter, Facebook, and other platforms. User-level credibility modeling combining engagement patterns with article truth.

Availability: The dataset is public at https://github.com/cuilimeng/CoAID. Versioned releases support reproducibility.