Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset

Authors: Shahan Ali Memon, Kathleen M. Carley
Affiliation: Carnegie Mellon University
Venue: arXiv, 2020 — arXiv:2008.00791

TL;DR

This paper presents CMU-MisCOV19, a manually annotated Twitter dataset of 4,573 COVID-19 tweets across 17 categories, and uses it to characterize misinformed versus informed online communities. Key findings: misinformed communities are denser (network density 9.7e-4 vs 6.5e-4), contain more bots (19% vs 11%), and show stronger echo-chamber effects. The largest share of misinformed users (41%) take anti-vaccination stances, suggesting that COVID-19 misinformation overlaps with broader health-skeptic communities.

Contributions

  • CMU-MisCOV19 dataset: 4,573 manually annotated tweets from 3,629 users collected across March and June 2020, with a comprehensive 17-category codebook distinguishing misinformation types (conspiracy, fake cures/treatments, false prevention/facts) from true information and corrective responses.
  • Community structure analysis: Demonstrates that COVID-19 misinformed communities have higher network density (9.7e-4 vs 6.5e-4) and stronger echo-chamber effects than informed communities, although both show echo-chamber patterns.
  • Bot detection and disinformation campaigns: Identifies 19% of misinformed users as bots (vs 11% for informed), with an even higher concentration (22%) among anti-vax segments, suggesting organized disinformation campaigns.
  • Sociolinguistic characterization: Shows informed users employ more narrative structures (higher pronoun use, lower analytical language) while both communities display predominantly negative emotional tone.
  • Vaccination stance analysis: Finds that 41% of COVID-19 misinformed users (423 of 1,027) are anti-vaxxers, indicating coalescence of health-skeptic ideologies.

Method

The paper pairs a three-phase annotation process with downstream community analysis:

  1. Data collection: Twitter search API queries using 40+ keywords and 10 hashtags related to COVID-19 treatments, prevention, conspiracy theories, and vaccines. Data were collected on three dates (March 29, June 15, and June 24, 2020) to capture different time periods.

  2. Annotation codebook: 17-category taxonomy covering:

     • Misinformation: Conspiracy, Fake Cure/Treatment, False Fact or Prevention, False Public Health Response
     • True information: True Prevention, True Treatment, True Public Health Response, Calling out/Correction, Sarcasm/Satire
     • Other: Irrelevant, Politics, Commercial Activity, News, Emergency Response, Panic Buying, Ambiguous

     First phase: all 4,573 tweets annotated by the primary annotator. Second phase: 651 tweets (14%) re-annotated by six additional annotators to assess inter-rater reliability.

  3. Community detection: Assign valence scores to annotation categories (+1 for true/corrective, -1 for false information). Compute per-user valence as a weighted sum; classify users as informed (positive), misinformed (negative), or ambiguous.

  4. Data augmentation: Retrieve the full timelines of community members and keep only COVID-19-related tweets via keyword filtering, yielding 330,609 total tweets (about 91 per user on average).

  5. Network analysis: Extract retweet, mention, and reply networks; compute network density. Bot detection via the Bot-Hunter tool (threshold ≥ 0.75 confidence). Linguistic analysis via LIWC2015.
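
The community-detection step can be illustrated with a minimal sketch. The category groupings follow the codebook listing in these notes; the weights are not given here, so the sketch assumes a weight of 1 per annotated tweet and a valence of 0 for every "Other" category. The function names are hypothetical.

```python
from collections import defaultdict

# Groupings taken from the codebook listing in these notes.
POSITIVE = {
    "True Prevention", "True Treatment", "True Public Health Response",
    "Calling out/Correction", "Sarcasm/Satire",
}
NEGATIVE = {
    "Conspiracy", "Fake Cure/Treatment", "False Fact or Prevention",
    "False Public Health Response",
}


def category_valence(category):
    """Return +1 for true/corrective, -1 for false, 0 otherwise (assumed)."""
    if category in POSITIVE:
        return 1
    if category in NEGATIVE:
        return -1
    return 0


def classify_users(annotations):
    """Sum per-tweet valences per user and label each user by the sign.

    `annotations` is an iterable of (user_id, category) pairs.
    Unit weights are an assumption; the paper's weighting may differ.
    """
    totals = defaultdict(int)
    for user, category in annotations:
        totals[user] += category_valence(category)

    labels = {}
    for user, score in totals.items():
        if score > 0:
            labels[user] = "informed"
        elif score < 0:
            labels[user] = "misinformed"
        else:
            labels[user] = "ambiguous"
    return labels


example = [
    ("a", "Conspiracy"), ("a", "News"),
    ("b", "True Treatment"),
    ("c", "Conspiracy"), ("c", "Calling out/Correction"),
]
labels = classify_users(example)
print(labels)
```

In this toy input, user "c" posts one false and one corrective tweet, so the summed valence is zero and the user lands in the ambiguous group.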

Results

Network structure:
  • Misinformed users: 923 nodes, 826 links, density 9.7e-4
  • Informed users: 1,515 nodes, 1,489 links, density 6.5e-4
  • Despite the denser misinformed network, the reply network shows more cross-group engagement, hypothesized to reflect corrective behavior.
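
The reported densities can be reproduced from the node and link counts if density is taken to be the directed-graph ratio of links to ordered node pairs, E / (N(N - 1)). That definition is an assumption inferred from the numbers, not something these notes state; the function name is hypothetical.

```python
def directed_density(num_nodes, num_links):
    """Density of a directed graph without self-loops: E / (N * (N - 1))."""
    if num_nodes < 2:
        return 0.0
    return num_links / (num_nodes * (num_nodes - 1))


# Node and link counts as reported in the results.
misinformed = directed_density(923, 826)
informed = directed_density(1515, 1489)

print(f"misinformed: {misinformed:.1e}")  # misinformed: 9.7e-04
print(f"informed:    {informed:.1e}")     # informed:    6.5e-04
```

Both networks are very sparse in absolute terms; the comparison is relative, with the misinformed network roughly 1.5 times as dense as the informed one.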

Bot prevalence:
  • Overall: 14% bots (505 of 3,629 users)
  • Misinformed: 19% bots (202 of 1,043 users)
  • Informed: 11% bots (184 of 1,697 users)
  • Difference statistically significant (p < 0.001, z = −6.23)
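
The reported z statistic is consistent with a pooled two-proportion z-test on the bot counts. The sketch below uses only the standard library; the choice of the pooled test, the informed-minus-misinformed ordering (which yields the negative sign), and the function name are assumptions rather than details confirmed by these notes.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal tail.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Informed (184 bots of 1,697 users) vs misinformed (202 bots of 1,043 users).
z, p = two_proportion_z(184, 1697, 202, 1043)
print(f"z = {z:.2f}, p < 0.001: {p < 0.001}")  # z = -6.23, p < 0.001: True
```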

Linguistic patterns (LIWC analysis, independent z-tests, α = 0.05):
  • Function words & pronouns: Informed users use significantly more (M=33.90 vs 29.32 for function words, p<.001; M=7.97 vs 6.53 for pronouns, p<.001), suggesting narrative discourse.
  • Analytic language: Misinformed users are more analytical (M=76.01 vs 69.83, p<.001), contrary to typical misinformation patterns; the authors note that the COVID-19 misinformed community differs from other health-skeptic communities.
  • Authenticity: Informed users score higher (M=25.12 vs 16.43, p<.001).
  • Tone: No significant difference (p=.15); both communities are predominantly negative.

Vaccination stance (among 1,027 COVID-19 misinformed users with vaccine-related tweets):
  • Anti-vaxxers: 423 (41%)
  • Pro-vaxxers: 224 (22%)
  • Ambiguous: 380 (37%)
  • Anti-vax segments have a higher bot concentration (22% vs 17%), suggesting organized amplification.

Notes

Strengths:
  • Comprehensive 17-category codebook balancing misinformation and corrective information, addressing a gap in many COVID-19 datasets that focus only on false claims.
  • Clear methodology for separating informed vs misinformed communities via annotation-based valence scoring rather than external credibility labels.
  • Bot analysis using the Bot-Hunter tool at high precision (0.957) strengthens the evidence for organized disinformation.
  • Dataset made publicly available on Zenodo in compliance with FAIR principles and Twitter ToS.

Limitations:
  • Most annotations (86%) were made by a single annotator; inter-rater reliability is available for only 14% of the data. The authors mitigate this by including multi-annotator data in community membership calculations.
  • The analysis is correlational; causation cannot be inferred from the observed patterns.
  • Collection on only three dates (March 29, June 15, and June 24, 2020), even with timeline augmentation, limits the ability to assess temporal change in misinformation dynamics.
  • Bot detection relies on second-level model inference; the authors mitigate this with a high confidence threshold (≥0.75).
  • COVID-19 lacks clear user stances (unlike vaccination), complicating stance-based analysis; the authors categorize users by misinformation rather than by underlying stance.

Technical insight: The finding that misinformed COVID-19 users are more analytical than informed users contradicts prior work on anti-vaccination communities and suggests COVID-19 misinformation attracts a different demographic or narrative framing than other health-skeptic movements. Authors hypothesize this reflects their informed group's composition from corrective tweets, which use personal narratives rather than analytical arguments.

Methodological contribution: The paper's approach to building balanced datasets that include both negative (misinformation) and positive (true information, corrections, sarcasm) categories is now standard practice in misinformation research; this paper was early to emphasize this distinction.