CMU-MisCOV19

Paper: Memon & Carley, 2020

Repository: https://zenodo.org/record/4024154

Language: English

Collection period: March 29 and June 15–24, 2020

Overview

CMU-MisCOV19 is a manually annotated COVID-19 Twitter dataset designed to characterize misinformed and informed online communities. Unlike many COVID-19 datasets focused solely on false claims, this dataset explicitly includes categories for true information, corrections, and sarcasm, enabling comparison of competing communities.

Core statistics:

  • Annotated tweets: 4,573
  • Unique users: 3,629
  • Average annotated tweets per user: 1.24
  • Augmented dataset: 330,609 COVID-19-related tweets extracted from user timelines (about 91 per user)
  • Categories: 17 distinct classes

Categories and Distribution

Organized by information type:

Misinformation (n=1,420; 31%)

  • Conspiracy — 924 tweets: Conspiracy theories (5G, bioweapons, lab-created)
  • Fake Cure — 141 tweets: Claims of non-medical treatments (bleach, essential oils, colloidal silver)
  • Fake Treatment — 34 tweets: False medical treatments
  • False Fact or Prevention — 321 tweets: Incorrect prevention advice or scientific claims

True Information (n=1,669; 37%)

  • True Prevention — 175 tweets: Correct prevention guidance
  • True Treatment — 0 tweets: (not present in sample)
  • Calling out / Correction — 1,331 tweets: Explicit correction of misinformation or fact-checking
  • True Public Health Response — 163 tweets: Legitimate public health measures and responses

Other (n=1,484; 32%)

  • Sarcasm / Satire — 476 tweets: Sarcastic critiques of misinformation or conspiracy theories
  • Irrelevant — 131 tweets: Not related to COVID-19 misinformation
  • Politics — 512 tweets: Political content tangentially related
  • Ambiguous / Difficult to Classify — 143 tweets
  • Commercial Activity or Promotion — 37 tweets
  • Emergency Response — 17 tweets
  • News — 95 tweets
  • Panic Buying — 70 tweets
  • False Public Health Response — 3 tweets
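The group totals can be checked against the per-category counts with a short tally. The sketch below copies the counts from the list above. The grouping is an assumption chosen because it reproduces the stated totals: Sarcasm / Satire is counted under Other, which is the only assignment that yields n=1,669 and n=1,484. The names `COUNTS`, `GROUPS`, and `group_totals` are illustrative.

```python
# Per-category tweet counts, copied from the distribution listed above.
COUNTS = {
    "Conspiracy": 924,
    "Fake Cure": 141,
    "Fake Treatment": 34,
    "False Fact or Prevention": 321,
    "True Prevention": 175,
    "True Treatment": 0,
    "Calling out / Correction": 1331,
    "True Public Health Response": 163,
    "Sarcasm / Satire": 476,
    "Irrelevant": 131,
    "Politics": 512,
    "Ambiguous / Difficult to Classify": 143,
    "Commercial Activity or Promotion": 37,
    "Emergency Response": 17,
    "News": 95,
    "Panic Buying": 70,
    "False Public Health Response": 3,
}

# Assumed grouping: Sarcasm / Satire sits under "Other" so that the
# sums match the group totals stated in this section.
GROUPS = {
    "Misinformation": [
        "Conspiracy", "Fake Cure", "Fake Treatment", "False Fact or Prevention",
    ],
    "True Information": [
        "True Prevention", "True Treatment",
        "Calling out / Correction", "True Public Health Response",
    ],
    "Other": [
        "Sarcasm / Satire", "Irrelevant", "Politics",
        "Ambiguous / Difficult to Classify", "Commercial Activity or Promotion",
        "Emergency Response", "News", "Panic Buying",
        "False Public Health Response",
    ],
}


def group_totals(counts, groups):
    """Sum the category counts inside each group."""
    return {name: sum(counts[c] for c in cats) for name, cats in groups.items()}


totals = group_totals(COUNTS, GROUPS)
grand_total = sum(COUNTS.values())

for name, n in totals.items():
    print(f"{name}: n={n} ({n / grand_total:.1%} of {grand_total})")
```

The three group sums come to 1,420, 1,669, and 1,484, which add up to the 4,573 annotated tweets.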

Annotation Methodology

Phase 1 (Primary annotation):

  • 4,573 tweets annotated by a single annotator
  • Tweets sampled uniformly at random to maintain topic diversity

Phase 2 (Inter-rater reliability):

  • 651 tweets (14.2%) re-annotated by six additional annotators
  • Cross-annotator agreement computed to assess reliability

Data representation: each tweet is recorded with status_id (tweet ID), status_created_at (timestamp), annotation1 (the primary annotator's class), and annotation2 (the second annotator's class, where one exists).
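This page does not say which agreement statistic was used. As one common option, the sketch below computes Cohen's kappa between the annotation1 and annotation2 labels of re-annotated tweets. The choice of kappa, the function name `cohens_kappa`, and the toy labels are all assumptions for illustration, not the authors' procedure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels (hypothetical, not taken from the dataset).
primary = ["Conspiracy", "Conspiracy", "Politics", "News"]
secondary = ["Conspiracy", "Politics", "Politics", "News"]
kappa = cohens_kappa(primary, secondary)
print(f"kappa = {kappa:.3f}")
```

With real data, only rows that have a non-blank annotation2 would be passed in, since the remaining tweets have a single label.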

Data Collection

Keywords (used with "coronavirus" and "covid"):

  • Treatments/prevention: bleach, vaccine, acetic acid, steroids, essential oil, saltwater, ethanol, garlic, chlorine, sesame oil, hydroxychloroquine, chloroquine
  • Conspiracy: 5G, bioweapon, cocaine, gates, conspiracy
  • Other health: poison, cure, treat, immune, doctor, colloidal silver
  • Behavioral: children, kids, panic, fake
  • Miscellaneous: dryer, senna makki, senna tea

Hashtags: #nCoV20199, #CoronaOutbreak, #CoronaVirus, #CoronavirusCoverup, #COVID19, #Coronavirus, #WuhanCoronavirus, #coronaviris, #Wuhan
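To make the keyword pairing concrete, the sketch below combines each keyword with each of the two base terms. The exact query syntax used during collection is not given here, so the space-separated form, the `KEYWORDS` grouping keys, and the function `build_queries` are assumptions for illustration only.

```python
from itertools import product

BASE_TERMS = ["coronavirus", "covid"]

# Keywords copied from the list above; the dictionary keys are illustrative.
KEYWORDS = {
    "treatments_prevention": [
        "bleach", "vaccine", "acetic acid", "steroids", "essential oil",
        "saltwater", "ethanol", "garlic", "chlorine", "sesame oil",
        "hydroxychloroquine", "chloroquine",
    ],
    "conspiracy": ["5G", "bioweapon", "cocaine", "gates", "conspiracy"],
    "other_health": ["poison", "cure", "treat", "immune", "doctor", "colloidal silver"],
    "behavioral": ["children", "kids", "panic", "fake"],
    "miscellaneous": ["dryer", "senna makki", "senna tea"],
}


def build_queries(base_terms, keywords):
    """Pair every base term with every keyword (assumed query form)."""
    flat = [kw for group in keywords.values() for kw in group]
    return [f"{base} {kw}" for base, kw in product(base_terms, flat)]


queries = build_queries(BASE_TERMS, KEYWORDS)
print(len(queries), queries[:3])
```

The 30 keywords and 2 base terms give 60 pairs. The hashtags would be tracked separately as standalone terms.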

Data Schema

Tweet Annotations CSV

Each row contains:

  • status_id: Tweet ID (for rehydration, per Twitter ToS)
  • status_created_at: Tweet creation timestamp
  • annotation1: Primary annotator's category
  • annotation2: Secondary annotator's category (blank if the tweet was not re-annotated)
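A minimal sketch of reading a file with these four columns is shown below. The sample rows are made up (the IDs, timestamps, timestamp format, and exact label spellings are assumptions), and `load_annotations` is an illustrative helper. Tweet IDs are kept as strings because they are 64-bit values that lose precision if a tool parses them as floats.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the documented column layout; not real dataset rows.
SAMPLE_CSV = """status_id,status_created_at,annotation1,annotation2
1000000000000000001,2020-03-29 12:00:00,Conspiracy,Conspiracy
1000000000000000002,2020-06-15 08:30:00,Politics,
1000000000000000003,2020-06-20 17:45:00,Calling out / Correction,Sarcasm / Satire
"""


def load_annotations(handle):
    """Read annotation rows; a blank annotation2 becomes None."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "status_id": row["status_id"].strip(),  # keep as string
            "status_created_at": row["status_created_at"].strip(),
            "annotation1": row["annotation1"].strip(),
            "annotation2": (row["annotation2"] or "").strip() or None,
        })
    return rows


rows = load_annotations(io.StringIO(SAMPLE_CSV))
reannotated = [r for r in rows if r["annotation2"] is not None]
primary_counts = Counter(r["annotation1"] for r in rows)

print(f"{len(rows)} rows, {len(reannotated)} re-annotated")
print(primary_counts.most_common())
```

For the real file, the `io.StringIO` handle would be replaced by `open(path, newline="", encoding="utf-8")`.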

Access: Tweet IDs provided; full tweet JSONs not included to comply with Twitter Terms of Service. Tweets can be rehydrated using tools like twarc.
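Hydration tools such as twarc take a plain text file with one tweet ID per line. The sketch below extracts the status_id column into such a file. The helper name `write_id_file` and the file names are hypothetical.

```python
import csv
from pathlib import Path


def write_id_file(csv_path, ids_path, id_column="status_id"):
    """Write unique tweet IDs from the annotations CSV, one per line.

    Returns the number of IDs written. Order of first appearance is kept.
    """
    seen = set()
    ids = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            tweet_id = row[id_column].strip()
            if tweet_id and tweet_id not in seen:
                seen.add(tweet_id)
                ids.append(tweet_id)
    Path(ids_path).write_text("".join(f"{i}\n" for i in ids), encoding="utf-8")
    return len(ids)
```

With twarc 2.x installed and API credentials configured, the resulting file could then be passed to `twarc2 hydrate ids.txt tweets.jsonl`. Tweets that have since been deleted or made private will not be returned, so a rehydrated copy is likely to be smaller than the original 4,573 tweets.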

Codebook

The paper provides a comprehensive codebook with:

  • Detailed definitions for each of the 17 categories
  • Example tweets for each category
  • Decision rules for ambiguous cases
  • Guidance for handling multi-topic tweets

Use in Research

The dataset and accompanying codebook have enabled:

  • Community-level characterization of COVID-19 misinformation spreaders
  • Bot prevalence analysis (19% of misinformed users vs 11% of informed users)
  • Sociolinguistic comparison via LIWC (narrative use, formality, emotional tone)
  • Network topology analysis (echo-chamber density, cross-group communication patterns)
  • Vaccination stance prediction and cross-community analysis (41% of misinformed users are anti-vaxxers)

Data Access and FAIR Compliance

Repository: https://zenodo.org/record/4024154

License & Attribution: Public release in compliance with FAIR principles

Reproducibility:

  • Annotations and timestamps provided
  • Tweet IDs allow rehydration via public APIs
  • Codebook and decision rules documented in the full paper
  • Collection methodology and parameters reproducible

Related Datasets

  • CoAID: Multimodal COVID-19 dataset with news articles, claims, social posts, and engagement metrics
  • ReCOVery: COVID-19 news with NewsGuard publisher credibility labels
  • CHECKED: Chinese-language COVID-19 dataset with expert labels
  • MM-COVID: Multilingual COVID-19 dataset
