CoAID¶
Paper: Cui & Lee, 2020
Repository: https://github.com/cuilimeng/CoAID
Language: English
Collection period: December 1, 2019 to September 1, 2020
Overview¶
CoAID (COVID-19 Healthcare Misinformation Dataset) is a multimodal, multi-source benchmark for evaluating COVID-19 misinformation detection methods. It integrates information from diverse origins and modalities:
- News articles: 3,769 articles (204 fake, 3,565 true) from fact-checking outlets and misinformation sources; together with the 482 claims below, these make up the 4,251 news items in the dataset
- Claims: 482 statements (28 false, 454 true) extracted from WHO and Twitter
- Social engagement: 296,000 user engagements (tweets and replies) linked to news and claims
- Platform posts: 926 fact-checked posts from Facebook, Twitter, Instagram, YouTube, TikTok, and specialized fact-checking platforms
Schema¶
Information on Website¶
- ID
- Fact-checking URL
- Information URLs (the actual claim/news source)
- Title (of the article)
- Article Title (the crawled article title)
- Content
- Abstract
- Publish date
- Keywords
User Engagement: Tweets¶
- Tweet ID
- Reply ID
User Engagement: Replies¶
- Tweet ID
- Reply ID
- User ID
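As a rough illustration of how the reply records above can be used, the sketch below groups replies by the tweet they respond to and counts replies and distinct repliers per tweet. The column names (`tweet_id`, `reply_id`, `user_id`) and the sample rows are assumptions made for this example; check the header row of the actual CSV files before relying on them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data. The column names are assumptions and may not
# match the headers used in the repository's CSV files.
SAMPLE_REPLIES = """tweet_id,reply_id,user_id
1001,5001,u1
1001,5002,u2
1001,5003,u1
1003,5004,u3
"""


def summarize_replies(csv_file):
    """Return, per tweet ID, the number of replies and of distinct repliers."""
    reply_ids = defaultdict(list)
    user_ids = defaultdict(set)
    for row in csv.DictReader(csv_file):
        reply_ids[row["tweet_id"]].append(row["reply_id"])
        user_ids[row["tweet_id"]].add(row["user_id"])
    return {
        tweet_id: {
            "replies": len(ids),
            "distinct_users": len(user_ids[tweet_id]),
        }
        for tweet_id, ids in reply_ids.items()
    }


summary = summarize_replies(io.StringIO(SAMPLE_REPLIES))
print(summary)
```

IDs are kept as strings here because tweet IDs are large integers that can lose precision if a tool parses them as floating-point numbers.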
Social Platform Posts¶
- Post ID
- Fact-checking URL
- Post URLs
- Title
Statistics (Version 0.3)¶
| Type | Fake | True | Total |
|---|---|---|---|
| Website claims | 28 | 454 | 482 |
| News articles | 204 | 3,565 | 3,769 |
| Tweets | 484 | 8,092 | 8,576 |
| Replies | 626 | 12,451 | 13,077 |
| Social platform posts | 650 | 42 | 692 |
| Total | 1,992 | 24,604 | 26,596 |
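The per-type rows of the table make the label skew easy to quantify. The following sketch copies the fake and true counts from the rows above and computes the share of fake items for each type; the dictionary and function names are illustrative only.

```python
# (fake, true) counts copied from the per-type rows of the Version 0.3 table.
COUNTS = {
    "Website claims": (28, 454),
    "News articles": (204, 3565),
    "Tweets": (484, 8092),
    "Replies": (626, 12451),
    "Social platform posts": (650, 42),
}


def fake_share(fake, true):
    """Fraction of items labeled fake among all labeled items of one type."""
    return fake / (fake + true)


shares = {name: fake_share(fake, true) for name, (fake, true) in COUNTS.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%} fake")
```

With these counts, fake items are a small minority for every type except social platform posts, where the skew runs the other way.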
Data Access¶
The dataset is publicly available at: https://github.com/cuilimeng/CoAID
The repository includes:
- Raw CSV files for each information type
- Automatic update scripts to fetch the latest COVID-19 misinformation
- Baseline detection code for multiple methods
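After cloning the repository, a quick inventory of the CSV files is a reasonable first step. The sketch below walks a local directory and counts the data rows in every CSV file it finds. It assumes nothing about the repository's folder layout or file names; the `CoAID` path in the usage note is a hypothetical local clone location.

```python
import csv
from pathlib import Path


def count_rows(csv_path):
    """Count data rows (excluding the header) in one CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return sum(1 for _ in csv.DictReader(handle))


def inventory(root):
    """Map each CSV file under `root` (as a relative path) to its row count."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): count_rows(path)
        for path in sorted(root.rglob("*.csv"))
    }
```

For example, `inventory("CoAID")` would list every CSV file under a clone in the current directory. If a file contains very long article text, `csv.field_size_limit` may need to be raised before reading.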
Benchmark Results¶
The paper evaluates the following detection methods:
| Method | Category | Notes |
|---|---|---|
| SVM | Baseline | Bag-of-words representation |
| Logistic Regression | Baseline | Linear classifier |
| Random Forest | Baseline | Tree ensemble |
| CNN | Deep Learning | Convolutional over word embeddings |
| BiGRU | Deep Learning | Bidirectional GRU sequence model |
| CSI | Context + User | Incorporates article content and user comment sentiment |
| SAME\v | Multimodal | SAME, which uses image, content, and metadata, run without the visual component |
| HAN | Attention | Hierarchical attention over words and sentences |
| dEFEND | Attention + Context | Hierarchical attention + co-attention with user comments |
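State-of-the-art models that incorporate user engagement signals substantially outperform the simple baselines. However, the classes are severely imbalanced (true >> fake): only 204 of the 3,769 news articles, about 5%, are fake, so accuracy alone is misleading and practical deployment remains challenging.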
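To see why the imbalance matters for evaluation, consider a degenerate classifier that labels every news article as true. The sketch below uses the Version 0.3 news article counts (204 fake, 3,565 true) and computes accuracy alongside precision, recall, and F1 for the fake class. It is an illustration of the metric behavior, not the paper's evaluation code.

```python
def binary_metrics(y_true, y_pred, positive="fake"):
    """Accuracy plus precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Label distribution of the news articles in the Version 0.3 statistics.
y_true = ["fake"] * 204 + ["true"] * 3565
# A degenerate classifier that always predicts the majority class.
y_pred = ["true"] * len(y_true)

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The always-true classifier reaches roughly 95% accuracy while detecting no fake article at all (recall and F1 of 0 for the fake class), which is why metrics computed on the minority class are more informative here.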
Related Datasets¶
- CHECKED: Chinese-language COVID-19 dataset on Weibo with expert labels and multimedia
- ReCOVery: English COVID-19 news with NewsGuard/MBFC publisher credibility labels
- MM-COVID: Multilingual COVID-19 dataset across six languages
Use in Research¶
CoAID has been used for:
- Misinformation detection model evaluation
- User engagement pattern analysis
- Sentiment analysis of social responses to false claims
- Temporal trend analysis of COVID-19 narratives
- Benchmarking contextual and multimodal detection methods