Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset

Authors: Shahan Ali Memon, Kathleen M. Carley
Affiliation: Carnegie Mellon University
Venue: arXiv, 2020 — arXiv:2008.00791

TL;DR

This paper presents CMU-MisCOV19, a manually annotated Twitter dataset of 4,573 COVID-19 tweets across 17 categories, and uses it to characterize misinformed versus informed online communities. Key findings: misinformed communities are denser (network density 9.7e-4 vs 6.5e-4), contain more bots (19% vs 11%), and show stronger echo-chamber effects. The largest share of misinformed users with vaccine-related tweets (41%) hold anti-vaccination stances, suggesting that COVID-19 misinformation overlaps with broader health-skeptic communities.

Contributions

CMU-MisCOV19 dataset: 4,573 manually annotated tweets from 3,629 users collected across March and June 2020, with a comprehensive 17-category codebook distinguishing misinformation types (conspiracy, fake cures/treatments, false prevention/facts) from true information and corrective responses.
Community structure analysis: Demonstrates that COVID-19 misinformed communities have higher network density (9.7e-4 vs 6.5e-4) and stronger echo-chamber effects than informed communities, despite both showing echo-chamber patterns.
Bot detection and disinformation campaigns: Identifies 19% of misinformed users as bots (vs 11% for informed), with an even higher concentration (22%) among anti-vax segments, suggesting organized disinformation campaigns.
Sociolinguistic characterization: Shows informed users employ more narrative structures (higher pronoun use, lower analytical language) while both communities display predominantly negative emotional tone.
Vaccination stance analysis: Finds that 41% of COVID-19 misinformed users with vaccine-related tweets hold anti-vaccination stances, indicating a coalescence of health-skeptic ideologies.

Method

The pipeline runs from data collection and annotation through community-level analysis:

Data collection: Twitter Search API queries using 40+ keywords and 10 hashtags related to COVID-19 treatments, prevention, conspiracy theories, and vaccines. Data were collected on three dates (March 29, June 15, and June 24, 2020) to capture different time periods.
Annotation codebook: 17-category taxonomy covering:
Misinformation: Conspiracy, Fake Cure/Treatment, False Fact or Prevention, False Public Health Response
True information: True Prevention, True Treatment, True Public Health Response, Calling out/Correction, Sarcasm/Satire
Other: Irrelevant, Politics, Commercial Activity, News, Emergency Response, Panic Buying, Ambiguous

First phase: 4,573 tweets annotated by a primary annotator. Second phase: 651 of those tweets (14%) re-annotated by six additional annotators to measure inter-rater reliability.

Community detection: Assign valence scores to annotation categories (+1 for true/corrective, -1 for false information). Compute per-user valence as weighted sum; classify users as informed (positive), misinformed (negative), or ambiguous.
Data augmentation: Retrieve full timelines of community members, filter to COVID-19-related tweets using keyword filtering, yielding 330,609 total tweets (91 per user average).
Network analysis: Extract retweet, mention, and reply networks; compute network density. Bot detection via Bot-Hunter tool (threshold ≥ 0.75 confidence). Linguistic analysis via LIWC2015.
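The valence-based community assignment described above can be sketched as follows. This is a minimal illustration, not the authors' code: the summary does not state the weights used in the "weighted sum", so the sketch assumes each annotated tweet contributes its category valence once, that Sarcasm/Satire counts as +1 because it is grouped with true information, and that all "Other" categories contribute 0. The function name `classify_users` and the input format are hypothetical.

```python
from collections import defaultdict

# Assumed valence per annotation category (+1 true/corrective, -1 false).
# Categories not listed here (e.g. News, Politics) are assumed to score 0.
VALENCE = {
    "Conspiracy": -1,
    "Fake Cure/Treatment": -1,
    "False Fact or Prevention": -1,
    "False Public Health Response": -1,
    "True Prevention": 1,
    "True Treatment": 1,
    "True Public Health Response": 1,
    "Calling out/Correction": 1,
    "Sarcasm/Satire": 1,  # assumption: grouped with true information
}


def classify_users(annotated_tweets):
    """Label each user from (user_id, category) pairs.

    Sums the valence of every annotated tweet per user, then maps the
    sign of the total to informed / misinformed / ambiguous.
    """
    scores = defaultdict(int)
    for user_id, category in annotated_tweets:
        scores[user_id] += VALENCE.get(category, 0)

    labels = {}
    for user_id, score in scores.items():
        if score > 0:
            labels[user_id] = "informed"
        elif score < 0:
            labels[user_id] = "misinformed"
        else:
            labels[user_id] = "ambiguous"
    return labels


if __name__ == "__main__":
    sample = [
        ("u1", "Conspiracy"),
        ("u1", "Conspiracy"),
        ("u1", "True Prevention"),
        ("u2", "Calling out/Correction"),
        ("u3", "News"),
    ]
    print(classify_users(sample))
```

Under these assumptions a user with mixed tweets is placed by the sign of the net score, so one corrective tweet does not outweigh two conspiracy tweets.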

Results

Network structure:
- Misinformed users: 923 nodes, 826 links, density 9.7e-4
- Informed users: 1,515 nodes, 1,489 links, density 6.5e-4
- Although the misinformed network is denser, the reply network shows more cross-group engagement, which the authors hypothesize reflects corrective behavior.
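The density figures above can be reproduced from the node and link counts. The sketch below assumes the standard density formula for a directed graph without self-loops, links / (n * (n - 1)); the summary does not say which formula the authors used, but this one matches both reported values, whereas the undirected variant (twice as large) does not. The function name `directed_density` is illustrative.

```python
def directed_density(n_nodes: int, n_links: int) -> float:
    """Density of a directed graph without self-loops.

    A directed graph on n nodes has n * (n - 1) possible links, so the
    density is the fraction of those that are present.
    """
    if n_nodes < 2:
        return 0.0
    return n_links / (n_nodes * (n_nodes - 1))


misinformed = directed_density(923, 826)
informed = directed_density(1515, 1489)

print(f"misinformed: {misinformed:.1e}")  # 9.7e-04
print(f"informed:    {informed:.1e}")     # 6.5e-04
```

Both networks are very sparse in absolute terms; the comparison is about the ratio, with the misinformed network roughly 1.5 times as dense.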

Bot prevalence:
- 14% overall bot prevalence (505 of 3,629 users)
- Misinformed: 19% bots (202 of 1,043 users)
- Informed: 11% bots (184 of 1,697 users)
- Difference statistically significant (p < 0.001, z = −6.23)
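The reported test statistic can be checked with a pooled two-proportion z-test on the counts above. This is a sketch under the assumption that the authors used the pooled form and ordered the groups as informed minus misinformed, which is what reproduces the negative sign; the function name `two_proportion_z` is illustrative.

```python
from math import erfc, sqrt


def two_proportion_z(x1: int, n1: int, x2: int, n2: int):
    """Pooled two-proportion z-test.

    Returns (z, two-sided p-value) for H0: p1 == p2, where x/n are the
    success counts and sample sizes of the two groups.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Informed (184 bots of 1,697) minus misinformed (202 bots of 1,043).
z, p = two_proportion_z(184, 1697, 202, 1043)
print(f"z = {z:.2f}, p < 0.001: {p < 0.001}")
```

With these counts the statistic comes out at about -6.23, in line with the value reported in the results.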

Linguistic patterns (LIWC analysis, independent z-tests, α = 0.05):
- Function words & pronouns: Informed users use significantly more (M=33.90 vs 29.32 for function words, p<.001; M=7.97 vs 6.53 for pronouns, p<.001), suggesting narrative discourse.
- Analytic language: Misinformed users more analytical (M=76.01 vs 69.83, p<.001), contrary to typical misinformation patterns; authors note COVID-19 misinformed community differs from other health-skeptic communities.
- Authenticity: Informed users higher (M=25.12 vs 16.43, p<.001).
- Tone: No significant difference (p=.15); both communities predominantly negative.

Vaccination stance (among 1,027 COVID-19 misinformed users with vaccine-related tweets):
- Anti-vaxxers: 423 (41%)
- Pro-vaxxers: 224 (22%)
- Ambiguous: 380 (37%)
- Anti-vax segments have higher bot concentration (22% vs 17%), indicating organized amplification.

Connections

Related to ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research via shared focus on COVID-19 misinformation dataset construction.
Related to The Rise of Social Bots and The spread of low-credibility content by social bots via methodology for bot detection and characterization.
Related to The echo chamber effect on social media via network analysis of echo chambers.
Extends vaccination misinformation analysis from The Rise of Social Bots.
Cited by and contributes methodology to community characterization literature alongside The COVID-19 Social Media Infodemic.

Notes

Strengths:
- Comprehensive 17-category codebook balancing misinformation and corrective information, addressing a gap in many COVID-19 datasets that focus only on false claims.
- Clear methodology for separating informed vs misinformed communities via annotation-based valence scoring rather than external credibility labels.
- Bot analysis with high precision (0.957) using the Bot-Hunter tool provides strong evidence for organized disinformation.
- Dataset made publicly available on Zenodo in compliance with FAIR principles and Twitter ToS.

Limitations:
- Majority of annotations (86%) conducted by a single annotator; inter-rater reliability is only available for 14% of the data. Authors mitigate by including multi-annotator data in community membership calculations.
- Correlational analysis; cannot infer causation from observed patterns.
- Collection on only three dates (March 29, June 15, and June 24, 2020), even with timeline augmentation, limits the ability to assess temporal change in misinformation dynamics.
- Bot detection relies on second-level model inference; authors mitigate with high confidence threshold (≥0.75).
- COVID-19 lacks clear user stances (unlike vaccination), complicating stance-based analysis; authors categorize on misinformation rather than underlying stances.

Technical insight: The finding that misinformed COVID-19 users are more analytical than informed users contradicts prior work on anti-vaccination communities and suggests COVID-19 misinformation attracts a different demographic or narrative framing than other health-skeptic movements. The authors hypothesize that this reflects how the informed group was assembled: largely from corrective tweets, which rely on personal narratives rather than analytical arguments.

Methodological contribution: The paper builds a balanced dataset that includes both negative (misinformation) and positive (true information, corrections, sarcasm) categories. This approach has since become common in misinformation research, and this paper was early to emphasize the distinction.