CMU-MisCOV19¶
Paper: Memon & Carley, 2020
Repository: https://zenodo.org/record/4024154
Language: English
Collection period: March 29 and June 15–24, 2020
Overview¶
CMU-MisCOV19 is a manually annotated COVID-19 Twitter dataset designed to characterize misinformed and informed online communities. Unlike many COVID-19 datasets focused solely on false claims, this dataset explicitly includes categories for true information, corrections, and sarcasm, enabling comparison of competing communities.
Core statistics:
- Annotated tweets: 4,573
- Unique users: 3,629
- Average annotated tweets per user: 1.24
- Augmented dataset: 330,609 COVID-19 related tweets extracted from user timelines (about 91 per user)
- Categories: 17 distinct classes
Categories and Distribution¶
Organized by information type:
Misinformation (n=1,420; 31%)¶
- Conspiracy — 924 tweets: Conspiracy theories (5G, bioweapons, lab-created)
- Fake Cure — 141 tweets: Claims of non-medical treatments (bleach, essential oils, colloidal silver)
- Fake Treatment — 34 tweets: False medical treatments
- False Fact or Prevention — 321 tweets: Incorrect prevention advice or scientific claims
True Information (n=1,669; 37%)¶
- True Prevention — 175 tweets: Correct prevention guidance
- True Treatment — 0 tweets: (not present in sample)
- Calling out / Correction — 1,331 tweets: Explicit correction of misinformation or fact-checking
- True Public Health Response — 163 tweets: Legitimate public health measures and responses
Other (n=1,484; 32%)¶
- Sarcasm / Satire — 476 tweets: Sarcastic critiques of misinformation or conspiracy theories
- Irrelevant — 131 tweets: Not related to COVID-19 misinformation
- Politics — 512 tweets: Political content tangentially related
- Ambiguous / Difficult to Classify — 143 tweets
- Commercial Activity or Promotion — 37 tweets
- Emergency Response — 17 tweets
- News — 95 tweets
- Panic Buying — 70 tweets
- False Public Health Response — 3 tweets
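The group totals in the headings above can be checked against the 4,573 annotated tweets with a few lines of arithmetic. This is a minimal sketch. The variable names are illustrative, and the counts are copied from this page rather than read from the released files.

```python
# Group totals as stated in the section headings above.
group_totals = {
    "Misinformation": 1420,
    "True Information": 1669,
    "Other": 1484,
}

# The three groups should account for every annotated tweet.
total = sum(group_totals.values())

# Share of each group as a percentage of all annotated tweets.
shares = {name: 100 * count / total for name, count in group_totals.items()}

for name, share in shares.items():
    print(f"{name}: {group_totals[name]} tweets ({share:.1f}%)")

# Within the misinformation group, conspiracy tweets dominate.
conspiracy_share = 100 * 924 / group_totals["Misinformation"]
print(f"Conspiracy share of misinformation: {conspiracy_share:.1f}%")
```

Printing one decimal place makes rounding in the headline percentages visible.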
Annotation Methodology¶
Phase 1 (Primary annotation):
- 4,573 tweets annotated by a single annotator
- Tweets sampled uniformly at random to maintain topic diversity
Phase 2 (Inter-rater reliability):
- 651 tweets (14.2%) re-annotated by six additional annotators
- Cross-annotator agreement computed for reliability assessment
Data representation:
- For each tweet: status_id (tweet ID), status_created_at (timestamp), annotation1 (primary annotator's class), annotation2 (second annotator's class, if the tweet was re-annotated)
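This page does not say which agreement statistic was used. As one common choice for two annotators, the sketch below computes Cohen's kappa over paired `annotation1` and `annotation2` labels. The function name `cohen_kappa` and the label values are hypothetical, and the label strings in the released CSV may be encoded differently.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical label pairs for six re-annotated tweets.
annotation1 = ["conspiracy", "conspiracy", "news", "politics", "news", "conspiracy"]
annotation2 = ["conspiracy", "news", "news", "politics", "news", "conspiracy"]

kappa = cohen_kappa(annotation1, annotation2)
print(f"kappa = {kappa:.3f}")
```

Only rows with a non-blank `annotation2` should be passed in, since the other tweets have a single label.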
Data Collection¶
Keywords (used with "coronavirus" and "covid"):
- Treatments/prevention: bleach, vaccine, acetic acid, steroids, essential oil, saltwater, ethanol, garlic, chlorine, sesame oil, hydroxychloroquine, chloroquine
- Conspiracy: 5G, bioweapon, cocaine, gates, conspiracy
- Other health: poison, cure, treat, immune, doctor, colloidal silver
- Behavioral: children, kids, panic, fake
- Miscellaneous: dryer, senna makki, senna tea
Hashtags: #nCoV20199, #CoronaOutbreak, #CoronaVirus, #CoronavirusCoverup, #COVID19, #Coronavirus, #WuhanCoronavirus, #coronaviris, #Wuhan
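To illustrate how the keywords pair with the two base terms, the sketch below builds one query string per combination. The exact query syntax used for collection is not stated here, so the function name `build_queries`, the space-separated format, and the quoting of multi-word keywords are all assumptions.

```python
from itertools import product

# Base terms and a small subset of the keywords listed above.
base_terms = ["coronavirus", "covid"]
keywords = ["bleach", "5G", "colloidal silver"]


def build_queries(base_terms, keywords):
    """Pair every base term with every keyword (hypothetical query format)."""
    queries = []
    for base, keyword in product(base_terms, keywords):
        # Assumption: quote multi-word keywords so they match as a phrase.
        term = f'"{keyword}"' if " " in keyword else keyword
        queries.append(f"{base} {term}")
    return queries


queries = build_queries(base_terms, keywords)
for query in queries:
    print(query)
```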
Data Schema¶
Tweet Annotations CSV¶
Each row contains:
- status_id: Tweet ID (for rehydration per Twitter ToS)
- status_created_at: Tweet creation timestamp
- annotation1: Primary annotator's category
- annotation2: Secondary annotator's category (if re-annotated, otherwise blank)
Access: Tweet IDs provided; full tweet JSONs not included to comply with Twitter Terms of Service. Tweets can be rehydrated using tools like twarc.
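A minimal sketch of reading the annotations CSV with the standard library and preparing an ID list for rehydration. The sample rows, timestamp format, and label strings are placeholders rather than real data, and `load_annotations` is a hypothetical helper name. Only the four column names come from the schema above.

```python
import csv
import io
from collections import Counter

# Placeholder rows in the documented column layout (not real tweets).
sample_csv = """status_id,status_created_at,annotation1,annotation2
1000000000000000001,2020-03-29T10:00:00Z,conspiracy,conspiracy
1000000000000000002,2020-06-15T12:30:00Z,news,
1000000000000000003,2020-06-24T08:15:00Z,politics,news
"""


def load_annotations(handle):
    """Read annotation rows; a blank annotation2 becomes None."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            # Keep IDs as strings so large values are never rounded.
            "status_id": row["status_id"],
            "status_created_at": row["status_created_at"],
            "annotation1": row["annotation1"],
            "annotation2": row["annotation2"] or None,
        })
    return rows


rows = load_annotations(io.StringIO(sample_csv))

# Distribution of primary labels.
label_counts = Counter(r["annotation1"] for r in rows)

# Subset that received a second annotation.
reannotated = [r for r in rows if r["annotation2"] is not None]

# One tweet ID per line, the usual input format for hydration tools.
id_lines = "\n".join(r["status_id"] for r in rows) + "\n"
print(label_counts)
print(id_lines, end="")
```

Writing `id_lines` to a text file gives an input that a hydration tool such as twarc can consume. Check the twarc documentation for the current command and credential setup, since both depend on the installed version and on API access.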
Codebook¶
The paper provides a comprehensive codebook with:
- Detailed definitions for each of the 17 categories
- Example tweets for each category
- Decision rules for ambiguous cases
- Guidance for handling multi-topic tweets
Use in Research¶
The dataset and accompanying codebook have enabled:
- Community-level characterization of COVID-19 misinformation spreaders
- Bot prevalence analysis (19% of misinformed users vs 11% of informed users)
- Sociolinguistic comparison via LIWC (narrative use, formality, emotional tone)
- Network topology analysis (echo-chamber density, cross-group communication patterns)
- Vaccination stance prediction and cross-community analysis (41% of misinformed users are anti-vaxxers)
Data Access and FAIR Compliance¶
Repository: https://zenodo.org/record/4024154
License & Attribution: Public release in compliance with FAIR principles
Reproducibility:
- Annotations and timestamps provided
- Tweet IDs allow rehydration via public APIs
- Codebook and decision rules documented in the full paper
- Collection methodology and parameters reproducible
Related Datasets¶
- CoAID: Multimodal COVID-19 dataset with news articles, claims, social posts, and engagement metrics
- ReCOVery: COVID-19 news with NewsGuard publisher credibility labels
- CHECKED: Chinese-language COVID-19 dataset with expert labels
- MM-COVID: Multilingual COVID-19 dataset