CoAID

Paper: Cui & Lee, 2020

Repository: https://github.com/cuilimeng/CoAID

Language: English

Collection period: December 1, 2019 — September 1, 2020

Overview

CoAID (COVID-19 Healthcare Misinformation Dataset) is a multimodal, multi-source benchmark for evaluating COVID-19 misinformation detection methods. It integrates information from diverse origins and modalities:

News articles: 3,769 articles (204 fake, 3,565 true) from fact-checking outlets and misinformation sources; combined with the 482 claims below, this gives 4,251 news items
Claims: 482 statements (28 false, 454 true) extracted from WHO and Twitter
Social engagement: 296,000 user engagements (tweets and replies) linked to news and claims
Platform posts: 926 fact-checked posts from Facebook, Twitter, Instagram, YouTube, TikTok, and specialized fact-checking platforms

Schema

Information on Website

ID
Fact-checking URL
Information URLs (the actual claim/news source)
Title (of the article)
Article Title (the crawled article title)
Content
Abstract
Publish date
Keywords

User Engagement: Tweets

Tweet ID
Tweet ID (repeated for reference)
Reply ID

User Engagement: Replies

Tweet ID
Reply ID
User ID

Social Platform Posts

Post ID
Fact-checking URL
Post URLs
Title
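
The field lists above map naturally onto typed records. The following is a minimal sketch, assuming the CSV columns can be mapped onto these names. The class names `WebsiteRecord` and `PlatformPost`, the snake_case attribute names, and the choice of which fields are optional are illustrative assumptions, not taken from the repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WebsiteRecord:
    """One news article or claim, mirroring the website information fields."""
    id: str
    fact_checking_url: str
    information_urls: str                 # the actual claim/news source
    title: str
    article_title: Optional[str] = None   # crawled article title (assumed optional)
    content: Optional[str] = None
    abstract: Optional[str] = None
    publish_date: Optional[str] = None    # kept as raw text; no date format assumed
    keywords: Optional[str] = None


@dataclass
class PlatformPost:
    """One fact-checked social platform post."""
    post_id: str
    fact_checking_url: str
    post_urls: str
    title: str


# Placeholder values for illustration only; these are not real dataset rows.
record = WebsiteRecord(
    id="0",
    fact_checking_url="https://example.org/fact-check",
    information_urls="https://example.org/article",
    title="Example title",
)
print(record)
```

Keeping the crawled fields optional lets a record be created even when crawling an article failed, which is a design choice of this sketch rather than a documented property of the data.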

Statistics (Version 0.3)

Type	Fake	True	Total
Website claims	28	454	482
News articles	204	3,565	3,769
Tweets	484	8,092	8,576
Replies	626	12,451	13,077
Social platform posts	650	42	692
Total	1,992	24,604	26,596
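
To make the class balance in the table easier to read, the per-type share of fake items can be computed directly from the fake and true counts. This is a small sketch that copies the per-type rows of the table above; the helper name `fake_share` is hypothetical.

```python
# (fake, true) counts copied from the per-type rows of the Version 0.3 table.
counts = {
    "Website claims": (28, 454),
    "News articles": (204, 3565),
    "Tweets": (484, 8092),
    "Replies": (626, 12451),
    "Social platform posts": (650, 42),
}


def fake_share(fake: int, true: int) -> float:
    """Fraction of items in a category that are labeled fake."""
    return fake / (fake + true)


shares = {name: fake_share(fake, true) for name, (fake, true) in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} fake")
```

With these counts, claims, news articles, tweets, and replies each come out at roughly 5 to 6% fake, while social platform posts skew the other way at about 94% fake.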

Data Access

The dataset is publicly available at: https://github.com/cuilimeng/CoAID

The repository includes:

- Raw CSV files for each information type
- Automatic update scripts to fetch the latest COVID-19 misinformation
- Baseline detection code for multiple methods
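
Once the CSV files are downloaded, fake and true items typically need to be combined into one labeled collection. The following is a minimal sketch using only the Python standard library. It assumes fake and true items are stored in separate CSV files with a header row; the function name `load_labeled`, the `label` key, and the demo file names and columns are hypothetical, so check the repository for the actual layout.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def load_labeled(fake_path: Path, real_path: Path) -> List[Dict[str, str]]:
    """Read two CSV files and tag every row with a 'label' of 'fake' or 'true'."""
    rows: List[Dict[str, str]] = []
    for path, label in ((fake_path, "fake"), (real_path, "true")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["label"] = label
                rows.append(row)
    return rows


# Demo with two tiny stand-in files; the contents are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    fake_csv = Path(tmp) / "fake.csv"
    real_csv = Path(tmp) / "real.csv"
    fake_csv.write_text("id,title\n1,Example fake headline\n", encoding="utf-8")
    real_csv.write_text("id,title\n2,Example true headline\n", encoding="utf-8")
    rows = load_labeled(fake_csv, real_csv)

print([(row["id"], row["label"]) for row in rows])
```

Attaching the label at load time keeps the two sources distinguishable after they are merged, which is convenient for the stratified splits that an imbalanced dataset usually calls for.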

Benchmark Results

The paper evaluates the following detection methods:

Method	Category	Notes
SVM	Baseline	Bag-of-words representation
Logistic Regression	Baseline	Linear classifier
Random Forest	Baseline	Tree ensemble
CNN	Deep Learning	Convolutional over word embeddings
BiGRU	Deep Learning	Bidirectional GRU sequence model
CSI	Context + User	Incorporates article content and user comment sentiment
SAMEv	Multimodal	Uses image, content, and metadata
HAN	Attention	Hierarchical attention over words and sentences
dEFEND	Attention + Context	Hierarchical attention + co-attention with user comments

Models that incorporate user engagement signals substantially outperform the simple baselines. However, severe class imbalance makes practical deployment challenging: among the news articles, for example, true items outnumber fake ones by roughly 17 to 1 (3,565 vs. 204).
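
The effect of this imbalance on evaluation can be illustrated with a trivial classifier that labels every news article as true. This is a sketch using the news article counts from the statistics table; it is not one of the methods evaluated in the paper, and the helper name `precision_recall_f1` is hypothetical.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


fake, true = 204, 3565  # news article counts from the statistics table

# Trivial classifier: predict "true" for every article.
accuracy = true / (fake + true)

# For the fake class, nothing is ever predicted fake, so every fake article is missed.
precision, recall, f1 = precision_recall_f1(tp=0, fp=0, fn=fake)

print(f"accuracy={accuracy:.3f}, fake-class F1={f1:.3f}")
# prints: accuracy=0.946, fake-class F1=0.000
```

Accuracy alone therefore says little on this dataset; per-class precision, recall, and F1 on the fake class give a more informative picture.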

Related Datasets

CHECKED: Chinese-language COVID-19 dataset on Weibo with expert labels and multimedia
ReCOVery: English COVID-19 news with NewsGuard/MBFC publisher credibility labels
MM-COVID: Multilingual COVID-19 dataset across six languages

Use in Research

CoAID has been used for:

- Misinformation detection model evaluation
- User engagement pattern analysis
- Sentiment analysis of social responses to false claims
- Temporal trend analysis of COVID-19 narratives
- Benchmarking contextual and multimodal detection methods