
CoAID

Paper: Cui & Lee, 2020

Repository: https://github.com/cuilimeng/CoAID

Language: English

Collection period: December 1, 2019 — September 1, 2020

Overview

CoAID (COVID-19 Healthcare Misinformation Dataset) is a multimodal, multi-source benchmark for evaluating COVID-19 misinformation detection methods. It integrates information from diverse origins and modalities:

  • News articles: 4,251 articles from fact-checking outlets and misinformation sources (the Version 0.3 snapshot lists 204 fake and 3,565 true, 3,769 in total)
  • Claims: 482 statements (28 false, 454 true) extracted from WHO and Twitter
  • Social engagement: 296,000 user engagements (tweets and replies) linked to news and claims
  • Platform posts: 926 fact-checked posts from Facebook, Twitter, Instagram, YouTube, TikTok, and specialized fact-checking platforms (the Version 0.3 snapshot lists 650 fake and 42 true, 692 in total)

Schema

Information on Website

  • ID
  • Fact-checking URL
  • Information URLs (the actual claim/news source)
  • Title (of the article)
  • Article Title (the crawled article title)
  • Content
  • Abstract
  • Publish date
  • Keywords

User Engagement: Tweets

  • Tweet ID
  • Tweet ID (repeated for reference)
  • Reply ID

User Engagement: Replies

  • Tweet ID
  • Reply ID
  • User ID

Social Platform Posts

  • Post ID
  • Fact-checking URL
  • Post URLs
  • Title
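
The tables in this schema can be linked through their identifier fields. The following is a minimal sketch, assuming the tweet and reply tables carry the ID of the article they engage with. The column names `id`, `news_id`, `tweet_id`, and `reply_id` are illustrative and may not match the CSV headers in the repository, and the rows are made up rather than taken from the dataset.

```python
import csv
import io

# Made-up rows for illustration; the column names are assumptions,
# not the headers used by the CSV files in the repository.
NEWS_CSV = """id,title
0,Example article A
1,Example article B
"""
TWEETS_CSV = """news_id,tweet_id
0,1001
0,1002
1,1003
"""
REPLIES_CSV = """news_id,tweet_id,reply_id
0,1001,5001
0,1001,5002
1,1003,5003
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def engagement_by_article(news, tweets, replies):
    """Count tweets and replies per article ID, skipping unknown IDs."""
    counts = {
        row["id"]: {"title": row["title"], "tweets": 0, "replies": 0}
        for row in news
    }
    for row in tweets:
        if row["news_id"] in counts:
            counts[row["news_id"]]["tweets"] += 1
    for row in replies:
        if row["news_id"] in counts:
            counts[row["news_id"]]["replies"] += 1
    return counts


summary = engagement_by_article(
    read_rows(NEWS_CSV), read_rows(TWEETS_CSV), read_rows(REPLIES_CSV)
)
for article_id, info in summary.items():
    print(article_id, info)
```

Only IDs are stored for tweets and replies in this schema, so any text or user metadata would have to be retrieved separately.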

Statistics (Version 0.3)

Type                      Fake    True   Total
Website claims              28     454     482
News articles              204   3,565   3,769
Tweets                     484   8,092   8,576
Replies                    626  12,451  13,077
Social platform posts      650      42     692
Total                    1,988  24,084  26,072
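
The per-type rows of the Version 0.3 statistics can be turned into a fake share for each information type. This is a small sketch that hard-codes the counts from the rows above; the names `COUNTS` and `fake_share` are illustrative.

```python
# (fake, true) counts copied from the per-type rows of the Version 0.3 table.
COUNTS = {
    "Website claims": (28, 454),
    "News articles": (204, 3565),
    "Tweets": (484, 8092),
    "Replies": (626, 12451),
    "Social platform posts": (650, 42),
}


def fake_share(fake, true):
    """Fraction of items labeled fake within one information type."""
    return fake / (fake + true)


for name, (fake, true) in COUNTS.items():
    total = fake + true
    print(f"{name:<22} total={total:>6,}  fake share={fake_share(fake, true):.1%}")
```

With these counts, the fake share is roughly 5 to 6 percent for claims, news articles, tweets, and replies, and about 94 percent for social platform posts.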

Data Access

The dataset is publicly available at: https://github.com/cuilimeng/CoAID

The repository includes:

  • Raw CSV files for each information type
  • Automatic update scripts to fetch the latest COVID-19 misinformation
  • Baseline detection code for multiple methods

Benchmark Results

The paper evaluates the following detection methods:

Method               Category             Notes
SVM                  Baseline             Bag-of-words representation
Logistic Regression  Baseline             Linear classifier
Random Forest        Baseline             Tree ensemble
CNN                  Deep Learning        Convolutional over word embeddings
BiGRU                Deep Learning        Bidirectional GRU sequence model
CSI                  Context + User       Incorporates article content and user comment sentiment
SAMEv                Multimodal           Uses image, content, and metadata
HAN                  Attention            Hierarchical attention over words and sentences
dEFEND               Attention + Context  Hierarchical attention + co-attention with user comments
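
The SVM row mentions a bag-of-words representation. The following is a minimal pure-Python sketch of that featurization on a made-up two-document corpus. The tokenizer and the function names are illustrative assumptions and are not taken from the paper's baseline code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map each distinct token in the corpus to a column index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Made-up example sentences, not drawn from the dataset.
docs = ["Masks reduce transmission", "Garlic cures the virus, garlic works"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Vectors of this form discard word order, which is one reason the sequence and attention models in the table are treated as separate categories.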

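State-of-the-art models incorporating user engagement signals substantially outperform simple baselines. However, severe class imbalance makes practical deployment challenging: true items far outnumber fake ones for claims, news articles, tweets, and replies, while the ratio is reversed for social platform posts.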
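
To illustrate why the imbalance matters for evaluation, the sketch below scores a trivial classifier that labels every news article as true, using the Version 0.3 news article counts (204 fake, 3,565 true). Treating fake as the positive class is a choice made for this illustration, and the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class, with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


FAKE, TRUE = 204, 3565  # news article counts from the Version 0.3 statistics

# An "always true" classifier never predicts the fake (positive) class.
tp, fp = 0, 0          # no article is predicted fake
fn, tn = FAKE, TRUE    # every fake article is missed; every true one is correct

accuracy = (tp + tn) / (tp + fp + fn + tn)
_, _, fake_f1 = precision_recall_f1(tp, fp, fn)
print(f"accuracy={accuracy:.3f}  fake-class F1={fake_f1:.3f}")
```

Accuracy comes out near 0.946 even though no fake article is detected, so per-class precision, recall, and F1 are more informative than accuracy on this data.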

Related Datasets

  • CHECKED: Chinese-language COVID-19 dataset on Weibo with expert labels and multimedia
  • ReCOVery: English COVID-19 news with NewsGuard/MBFC publisher credibility labels
  • MM-COVID: Multilingual COVID-19 dataset across six languages

Use in Research

CoAID has been used for:

  • Misinformation detection model evaluation
  • User engagement pattern analysis
  • Sentiment analysis of social responses to false claims
  • Temporal trend analysis of COVID-19 narratives
  • Benchmarking contextual and multimodal detection methods
