The COVID-19 Social Media Infodemic

Authors: Matteo Cinelli, Walter Quattrociocchi, Alessandro Galeazzi, Carlo Michele Valensise, Emanuele Brugnoli, Ana Lucia Schmidt, Paola Zola, Fabiana Zollo, Antonio Scala

Venue: arXiv preprint — arxiv:2003.05004

TL;DR

Comparative analysis of COVID-19 information diffusion across five social media platforms (Twitter, Instagram, YouTube, Reddit, Gab) using epidemic models. All platforms exhibit infodemic characteristics (basic reproduction number R₀ > 1), with platform-dependent differences in how much engagement unreliable sources receive relative to reliable ones. Gab amplifies unreliable sources roughly fourfold (α ≈ 400%), while on YouTube unreliable sources receive only about 35% of the engagement of reliable ones (α ≈ 35%).

Contributions

- Large-scale empirical study of COVID-19 discourse on five major social media platforms with 1.3M+ posts and 7.5M+ comments from 3.7M+ users over 45 days (January–February 2020)
- Epidemic modeling framework (EXP and SIR models) to characterize platform-specific information spreading with basic reproduction numbers (R₀)
- Comparative analysis of misinformation amplification across platforms with novel metrics for rumor amplification and relative amplification coefficients
- Empirical characterization of platform-dependent interaction patterns and content consumption dynamics

Method

The authors collected COVID-19 content from January 1 to February 14, 2020 across five platforms using different collection methods appropriate to each: Twitter's standard API and stream endpoints, manual collection with visual inspection for Instagram, YouTube Data API, Reddit's Pushshift.io archive, and Gab's proprietary API. Content was filtered using keywords derived from Google Trends queries (coronavirus, pandemic, wuhan, etc.).
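The keyword filtering step can be pictured with a minimal sketch. It assumes posts are dicts with a hypothetical `text` field and uses only the three keywords named above; the authors' full keyword list and matching rules are not given here, so the whole-word, case-insensitive matching is an illustrative choice.

```python
import re

# Illustrative keyword list: only the three terms named in the text.
KEYWORDS = ["coronavirus", "pandemic", "wuhan"]

# Case-insensitive, whole-word match on any keyword (an assumed matching rule).
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b", re.IGNORECASE
)

def matches_keywords(text: str) -> bool:
    """Return True if the text mentions at least one keyword."""
    return _PATTERN.search(text) is not None

def filter_posts(posts):
    """Keep only posts whose hypothetical 'text' field matches a keyword."""
    return [p for p in posts if matches_keywords(p["text"])]

posts = [{"text": "Updates from #Wuhan"}, {"text": "Lunch photos"}]
print(filter_posts(posts))
```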

They model information spreading as an epidemic process where individuals "get infected" by encountering and sharing content. The EXP (exponential) model captures early-stage monotonic growth, while the SIR (Susceptible-Infected-Recovered) compartmental model captures more realistic saturation. For each platform and the entire dataset, they fit both models to estimate the basic reproduction number R₀ — the average number of secondary cases (users sharing content) generated by one initially infected user.
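To make the two models concrete, here is a minimal sketch using the textbook SIR equations, where R₀ = β/γ, and the standard early-phase approximation I(t) ≈ I₀·exp((β − γ)t), which gives R₀ = 1 + r/γ for an observed growth rate r. This is not the authors' fitting procedure: the parameter values are hypothetical, the data are simulated rather than taken from any platform, and γ is assumed to be known.

```python
import math

def simulate_sir(beta, gamma, n=1_000_000, i0=1.0, days=20, dt=0.1):
    """Euler-integrate dS/dt = -beta*S*I/N and dI/dt = beta*S*I/N - gamma*I.

    Returns the number of currently 'infected' users at the end of each day
    (index 0 is the initial state). All parameter values are hypothetical.
    """
    s, i = n - i0, i0
    steps_per_day = round(1 / dt)
    infected = [i]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
        infected.append(i)
    return infected

def estimate_r0_exp(infected, gamma):
    """Estimate R0 = 1 + r/gamma, where r is the least-squares slope of
    log(infected) against time. Valid only in the early exponential phase."""
    days = range(len(infected))
    logs = [math.log(v) for v in infected]
    mean_t = sum(days) / len(infected)
    mean_y = sum(logs) / len(logs)
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(days, logs)) / sum(
        (t - mean_t) ** 2 for t in days
    )
    return 1 + slope / gamma

# Simulated example with a true R0 of beta/gamma = 2.
infected = simulate_sir(beta=0.5, gamma=0.25)
r0_est = estimate_r0_exp(infected, gamma=0.25)
print(f"true R0 = {0.5 / 0.25:.2f}, estimated R0 = {r0_est:.2f}")
```

The exponential estimate only holds while almost everyone is still susceptible, which is why a saturating model such as SIR is needed once growth slows.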

To compare misinformation with reliable information, they label posts and comments by source, using Media Bias/Fact Check ratings. The engagement amplification factor ξ is the average engagement (reactions, shares, comments) per post, computed separately for unreliable (ξ^U) and reliable (ξ^R) sources. The relative amplification coefficient α = ξ^U / ξ^R indicates whether a platform amplifies unreliable (α > 1) or reliable (α < 1) content.
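The amplification metrics follow directly from these definitions. The sketch below assumes each post is a dict with two hypothetical fields, `reliable` (a source-level label) and `engagement` (total reactions, shares, and comments); the toy numbers are invented for illustration and are not from the paper.

```python
from statistics import mean

def amplification(posts):
    """Return (xi_U, xi_R, alpha) for a list of labeled posts.

    Each post is a dict with hypothetical fields:
      'reliable'   -> bool, taken from a source-level rating
      'engagement' -> int, reactions + shares + comments
    """
    xi_u = mean([p["engagement"] for p in posts if not p["reliable"]])
    xi_r = mean([p["engagement"] for p in posts if p["reliable"]])
    return xi_u, xi_r, xi_u / xi_r

# Toy data, invented for illustration only.
posts = [
    {"reliable": False, "engagement": 6},
    {"reliable": False, "engagement": 2},
    {"reliable": True, "engagement": 3},
    {"reliable": True, "engagement": 1},
]
xi_u, xi_r, alpha = amplification(posts)
print(f"xi_U = {xi_u}, xi_R = {xi_r}, alpha = {alpha}")
```

Because α is a ratio of per-post averages, it is insensitive to how many unreliable posts a platform hosts; it only reflects how much engagement each one attracts.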

Results

Basic reproduction numbers by platform (95% CI from bootstrapping):

- Gab: R₀^EXP = 1.42–1.52, R₀^SIR = 2.2–2.5
- Reddit: R₀^EXP = 1.44–1.51, R₀^SIR = 2.4–2.8
- YouTube: R₀^EXP = 1.56–1.70, R₀^SIR = 3.2–3.5
- Instagram: R₀^EXP = 2.02–2.64, R₀^SIR = 1.1×10²–1.6×10²
- Twitter: R₀^EXP = 1.65–2.06, R₀^SIR = 4.0–5.1
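For readers unfamiliar with bootstrapped intervals, the following is a generic percentile-bootstrap sketch. How the authors resampled their data is not described here, so the resampling unit, the number of resamples, and the sample values are all assumptions made for illustration.

```python
import random
from statistics import mean

def bootstrap_ci(data, statistic, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap: resample with replacement, recompute the
    statistic, and take the central `level` share of the sorted estimates."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]

# Hypothetical sample values, not data from the paper.
sample = [1.4, 1.5, 1.6, 1.3, 1.7, 1.5, 1.4, 1.6, 1.5, 1.8]
low, high = bootstrap_ci(sample, mean)
print(f"95% CI for the mean: [{low:.2f}, {high:.2f}]")
```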

All platforms exceed the epidemic threshold R₀ > 1, indicating infodemic conditions: COVID-19 content, reliable or not, spreads in a self-sustaining way on every platform.

Amplification factors by platform:

- Gab: ξ^U = 5.6, ξ^R = 1.4, α ≈ 400%
- Reddit: ξ^U = 22.7, ξ^R = 40.1, α ≈ 50%
- YouTube: ξ^U = 1.4×10⁴, ξ^R = 3.9×10⁴, α ≈ 35%
- Twitter: ξ^U = 15.1, ξ^R = 15.6, α ≈ 97%
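As a quick check, α can be recomputed from the ξ values listed above. The inputs are already rounded, so the recomputed ratios can differ somewhat from the approximate percentages given (Reddit, for example, comes out closer to 0.57 than 0.50).

```python
# (xi_U, xi_R) pairs as listed in this summary.
REPORTED = {
    "Gab": (5.6, 1.4),
    "Reddit": (22.7, 40.1),
    "YouTube": (1.4e4, 3.9e4),
    "Twitter": (15.1, 15.6),
}

def relative_amplification(xi_u, xi_r):
    """alpha = xi_U / xi_R; alpha > 1 means unreliable sources are amplified."""
    return xi_u / xi_r

alphas = {name: relative_amplification(u, r) for name, (u, r) in REPORTED.items()}

# Print platforms from most to least amplification of unreliable sources.
for name, a in sorted(alphas.items(), key=lambda kv: kv[1], reverse=True):
    label = "amplifies unreliable" if a > 1 else "amplifies reliable"
    print(f"{name:8s} alpha = {a:.2f} ({label})")
```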

Twitter shows the most neutral behavior (nearly equal amplification of reliable and unreliable sources), while Gab strongly amplifies unreliable sources and YouTube suppresses them. Posts from questionable and reliable sources show nearly identical spreading dynamics on most platforms, but the fraction of unreliable posts differs significantly by platform.

Connections

- Complements Cinelli et al.'s later work on echo chambers with a pandemic-era empirical focus
- Related to Guess et al. and Pennycook et al. on COVID-19 misinformation interventions
- Foundational reference for COVID-19 misinformation and Infodemic research
- Contributes to Multi Platform Analysis methodology by demonstrating platform-specific epidemic dynamics

Notes

Strengths: large-scale comparative dataset, rigorous epidemic modeling, clear operationalization of amplification metrics, and analysis conducted in real time during the early pandemic, capturing the information environment as it unfolded. Provides platform-level baselines for misinformation diffusion.

Limitations: classification of reliable/unreliable sources depends on external fact-checking services (Media Bias/Fact Check), which may have incomplete or inconsistent coverage. Instagram data collection via manual visual inspection may not be fully representative. Twitter's stream API represents only ~1% of tweets, limiting generalizability of Twitter findings. Early-stage pandemic dataset may not reflect stabilized information dynamics. Does not directly measure truth or falsity of individual claims, only source reputation.

Notably, the finding that reliable and unreliable posts spread at similar rates contradicts simpler narratives about "fake news spreads faster." The mechanism is more nuanced: platform affordances and audience composition interact with source reputation to determine actual amplification.