CMU-MisCOV19

Paper: Memon & Carley, 2020

Repository: https://zenodo.org/record/4024154

Language: English

Collection period: March 29, June 15–24, 2020

Overview

CMU-MisCOV19 is a manually annotated COVID-19 Twitter dataset designed to characterize misinformed and informed online communities. Unlike many COVID-19 datasets focused solely on false claims, this dataset explicitly includes categories for true information, corrections, and sarcasm, enabling comparison of competing communities.

Core statistics:
- Annotated tweets: 4,573
- Unique users: 3,629
- Average annotated tweets per user: 1.24
- Augmented dataset: 330,609 COVID-19-related tweets extracted from user timelines (about 91 per user)
- Categories: 17 distinct classes

Categories and Distribution

Organized by information type:

Misinformation (n=1,420; 31%)

Conspiracy — 924 tweets: Conspiracy theories (5G, bioweapons, lab-created)
Fake Cure — 141 tweets: Claims of non-medical treatments (bleach, essential oils, colloidal silver)
Fake Treatment — 34 tweets: False medical treatments
False Fact or Prevention — 321 tweets: Incorrect prevention advice or scientific claims

True Information (n=1,669; 37%)

True Prevention — 175 tweets: Correct prevention guidance
True Treatment — 0 tweets: (not present in sample)
Calling out / Correction — 1,331 tweets: Explicit correction of misinformation or fact-checking
True Public Health Response — 163 tweets: Legitimate public health measures and responses

Other (n=1,484; 32%)

Sarcasm / Satire — 476 tweets: Sarcastic critiques of misinformation or conspiracy theories
Irrelevant — 131 tweets: Not related to COVID-19 misinformation
Politics — 512 tweets: Political content tangentially related
Ambiguous / Difficult to Classify — 143 tweets
Commercial Activity or Promotion — 37 tweets
Emergency Response — 17 tweets
News — 95 tweets
Panic Buying — 70 tweets
False Public Health Response — 3 tweets
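
The per-category counts listed above can be cross-checked against the overall total with a few lines of Python. This is only a sketch for readers who want to recompute shares: the dictionary is transcribed from this page, and the names `CATEGORY_COUNTS` and `category_shares` are illustrative rather than part of the dataset release.

```python
# Per-category tweet counts, transcribed from the category listing on this page.
CATEGORY_COUNTS = {
    "Conspiracy": 924,
    "Fake Cure": 141,
    "Fake Treatment": 34,
    "False Fact or Prevention": 321,
    "True Prevention": 175,
    "True Treatment": 0,
    "Calling out / Correction": 1331,
    "Sarcasm / Satire": 476,
    "True Public Health Response": 163,
    "Irrelevant": 131,
    "Politics": 512,
    "Ambiguous / Difficult to Classify": 143,
    "Commercial Activity or Promotion": 37,
    "Emergency Response": 17,
    "News": 95,
    "Panic Buying": 70,
    "False Public Health Response": 3,
}


def category_shares(counts):
    """Return each category's share of all annotated tweets, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


total_tweets = sum(CATEGORY_COUNTS.values())
shares = category_shares(CATEGORY_COUNTS)

# Print categories from most to least frequent.
for name, n in sorted(CATEGORY_COUNTS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<36}{n:>6}  {shares[name]:5.1f}%")
print(f"{'Total':<36}{total_tweets:>6}")
```

The 17 counts sum to 4,573, which matches the number of annotated tweets reported in the overview.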

Annotation Methodology

Phase 1 (Primary annotation):
- 4,573 tweets annotated by a single annotator
- Randomly and uniformly sampled to maintain topic diversity

Phase 2 (Inter-rater reliability):
- 651 tweets (14.2%) re-annotated by six additional annotators
- Cross-annotator agreement computed for reliability assessment
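
This page does not say which agreement statistic was used. As one common option, the sketch below computes Cohen's kappa between the primary label and the second label of each re-annotated tweet. The choice of kappa, the function name `cohens_kappa`, and the toy labels are all assumptions made for illustration, not the authors' procedure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items that received the same label twice.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical, not taken from the dataset).
primary = ["Conspiracy", "Conspiracy", "Politics", "News", "Politics", "Conspiracy"]
secondary = ["Conspiracy", "Politics", "Politics", "News", "Politics", "Conspiracy"]

kappa = cohens_kappa(primary, secondary)
print(f"kappa = {kappa:.3f}")  # 17/23, about 0.739
```

On the real data, `primary` and `secondary` would be the annotation1 and annotation2 values of the rows where annotation2 is present.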

Data representation (one record per tweet):
- status_id: tweet ID
- status_created_at: timestamp
- annotation1: primary class
- annotation2: class assigned by the second annotator, if the tweet was re-annotated

Data Collection

Keywords (used with "coronavirus" and "covid"):
- Treatments/prevention: bleach, vaccine, acetic acid, steroids, essential oil, saltwater, ethanol, garlic, chlorine, sesame oil, hydroxychloroquine, chloroquine
- Conspiracy: 5G, bioweapon, cocaine, gates, conspiracy
- Other health: poison, cure, treat, immune, doctor, colloidal silver
- Behavioral: children, kids, panic, fake
- Miscellaneous: dryer, senna makki, senna tea

Hashtags: #nCoV20199, #CoronaOutbreak, #CoronaVirus, #CoronavirusCoverup, #COVID19, #Coronavirus, #WuhanCoronavirus, #coronaviris, #Wuhan
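
To make the selection criteria concrete, here is a rough sketch of a local filter that mimics them. It assumes that "used with" means a tweet must contain a base term ("coronavirus" or "covid") together with at least one keyword, or else contain one of the hashtags, and it uses plain case-insensitive substring matching. The original collection went through the Twitter API, whose matching rules may differ, and the function name `matches_collection_filter` is hypothetical.

```python
# Terms transcribed from the keyword and hashtag lists on this page.
BASE_TERMS = ("coronavirus", "covid")
KEYWORDS = (
    "bleach", "vaccine", "acetic acid", "steroids", "essential oil", "saltwater",
    "ethanol", "garlic", "chlorine", "sesame oil", "hydroxychloroquine", "chloroquine",
    "5g", "bioweapon", "cocaine", "gates", "conspiracy",
    "poison", "cure", "treat", "immune", "doctor", "colloidal silver",
    "children", "kids", "panic", "fake",
    "dryer", "senna makki", "senna tea",
)
HASHTAGS = (
    "#nCoV20199", "#CoronaOutbreak", "#CoronaVirus", "#CoronavirusCoverup",
    "#COVID19", "#Coronavirus", "#WuhanCoronavirus", "#coronaviris", "#Wuhan",
)


def matches_collection_filter(text):
    """Return True if the text would pass this simplified keyword/hashtag filter.

    Assumption: substring matching, case-insensitive; not the Twitter API's rules.
    """
    lowered = text.lower()
    if any(tag.lower() in lowered for tag in HASHTAGS):
        return True
    has_base = any(term in lowered for term in BASE_TERMS)
    has_keyword = any(keyword in lowered for keyword in KEYWORDS)
    return has_base and has_keyword


examples = [
    "Does garlic cure COVID?",
    "Garlic bread recipe",
    "Stay safe everyone #WuhanCoronavirus",
]
results = [matches_collection_filter(t) for t in examples]
print(results)
```

Substring matching is deliberately loose (for example, "treat" also matches "treatment"); a stricter variant could tokenize the text first.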

Data Schema

Tweet Annotations CSV

Each row contains:
- status_id: Tweet ID (for rehydration per Twitter ToS)
- status_created_at: Tweet creation timestamp
- annotation1: Primary annotator's category
- annotation2: Secondary annotator's category (if re-annotated, otherwise blank)
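
A minimal sketch of reading this schema with Python's standard `csv` module follows. The four column names come from the list above; the sample rows, the timestamp format, and the label spellings are placeholders invented for the example, and `load_annotations` is a hypothetical helper name.

```python
import csv
import io

# Placeholder rows: IDs, timestamps, and label spellings are illustrative only.
SAMPLE_CSV = """status_id,status_created_at,annotation1,annotation2
1000000000000000001,2020-03-29 12:00:00,conspiracy,conspiracy
1000000000000000002,2020-06-15 08:30:00,calling out or correction,
1000000000000000003,2020-06-20 17:45:00,politics,news
"""


def load_annotations(handle):
    """Parse annotation rows; a blank annotation2 becomes None."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            # Keep IDs as strings so large values are never rounded or reformatted.
            "status_id": row["status_id"].strip(),
            "status_created_at": row["status_created_at"].strip(),
            "annotation1": row["annotation1"].strip(),
            "annotation2": (row.get("annotation2") or "").strip() or None,
        })
    return rows


annotations = load_annotations(io.StringIO(SAMPLE_CSV))
reannotated = [r for r in annotations if r["annotation2"] is not None]
print(f"{len(annotations)} rows, {len(reannotated)} with a second annotation")
```

For the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by `open(path, newline="", encoding="utf-8")`.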

Access: Tweet IDs provided; full tweet JSONs not included to comply with Twitter Terms of Service. Tweets can be rehydrated using tools like twarc.
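
Hydration tools such as twarc take a plain-text file with one tweet ID per line. The sketch below writes such a file from a list of status_id values; the helper name `write_id_file`, the file name, and the sample IDs are hypothetical, and running the hydration step itself requires Twitter API credentials.

```python
import os
import tempfile


def write_id_file(status_ids, path):
    """Write unique tweet IDs to `path`, one per line, in first-seen order."""
    seen = set()
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for status_id in status_ids:
            status_id = str(status_id).strip()
            if not status_id or status_id in seen:
                continue  # skip blanks and duplicates
            if not status_id.isdigit():
                raise ValueError(f"not a numeric tweet ID: {status_id!r}")
            seen.add(status_id)
            out.write(status_id + "\n")
            count += 1
    return count


# Placeholder IDs, including one duplicate and one blank entry.
sample_ids = ["1000000000000000001", "1000000000000000002", "1000000000000000001", ""]

with tempfile.TemporaryDirectory() as tmp:
    id_path = os.path.join(tmp, "tweet_ids.txt")
    written = write_id_file(sample_ids, id_path)
    with open(id_path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(written, lines)
```

The resulting file can then be passed to the twarc command line, for example `twarc hydrate tweet_ids.txt > tweets.jsonl` with twarc v1 or `twarc2 hydrate tweet_ids.txt tweets.jsonl` with twarc2. Tweets that have since been deleted or made private will not be returned.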

Codebook

The paper provides a comprehensive codebook with:
- Detailed definitions for each of the 17 categories
- Example tweets for each category
- Decision rules for ambiguous cases
- Guidance for handling multi-topic tweets

Use in Research

The dataset and accompanying codebook have enabled:
- Community-level characterization of COVID-19 misinformation spreaders
- Bot prevalence analysis (19% of misinformed users vs 11% of informed users)
- Sociolinguistic comparison via LIWC (narrative use, formality, emotional tone)
- Network topology analysis (echo-chamber density, cross-group communication patterns)
- Vaccination stance prediction and cross-community analysis (41% of misinformed users are anti-vaxxers)

Data Access and FAIR Compliance