A Survey of Information Cascade Analysis: Models, Predictions, and Recent Advances

Authors: Fan Zhou, Xovee Xu, Goce Trajcevski, Kunpeng Zhang

Venue: ACM Computing Surveys, Vol. 54, No. 2, Article 27, March 2021

arXiv: 2005.11041

TL;DR

This comprehensive survey systematizes research on predicting information cascade growth and structure across social networks, covering feature-based models, generative models, and deep learning approaches. It provides a taxonomy of prediction methods across temporal, structural, user/item, and content features; evaluation metrics and datasets; and identifies open challenges in cross-platform, cross-domain, and explainable cascade prediction.

Contributions

General perspective: Extends cascade prediction beyond single-platform and single-content-type studies to cover user-generated content on microblogs, scientific publications, online sharing networks, and citation cascades.
Wider network coverage: Reviews information cascades across diverse platforms (Twitter, Weibo, Facebook, Flickr, YouTube, Reddit, DBLP) rather than focusing on a single social network.
Comprehensive methodology taxonomy: Provides fine-grained analysis of prediction methodologies using three independent classification dimensions:
Formulation: Classification, regression, or both
Strategy: Ex-ante (before publication), peeking (early observation), or both
Prediction granularity: Macro-level (collective cascade behavior), micro-level (individual user behavior), meso-level (community-level dynamics)
Balanced methodological review: Covers feature-based models, generative models (Poisson processes, survival analysis, epidemic models), and deep learning approaches with systematic comparison of trade-offs, advantages, and limitations across all three categories.
Recent advances: Incorporates advances in graph representation learning and deep learning techniques including graph convolutional networks (GCNs), attention mechanisms, and variational autoencoders (VAEs) for cascade modeling.

Method

Problem Definition

Information cascade prediction is framed as predicting the popularity P(t_p) of an information item at a future prediction time t_p, given a partial cascade observed up to time t_o, with t_o < t_p. The survey distinguishes:

Classification vs. Regression: Whether to predict a binary outcome (outbreak/viral or not) or the exact volume of future popularity
Ex-ante vs. Peeking: Ex-ante prediction uses only pre-publication information (content, publisher profile); peeking uses early-stage cascade observations
Macro- vs. Micro-level: Macro-level models cascade growth holistically; micro-level models individual user activation and adoption; meso-level focuses on community-level spreading behavior
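
As a minimal sketch of the formulation above, the snippet below derives a regression target and a classification label from one cascade's adoption timestamps. The function names, the `viral_threshold` parameter, and the example timestamps are illustrative assumptions, not notation or data from the survey. Setting `t_o = 0` corresponds to the ex-ante setting, in which nothing of the cascade has been observed yet.

```python
from bisect import bisect_right


def popularity(timestamps, t):
    """Number of adoptions (e.g., retweets) with time <= t; timestamps must be sorted."""
    return bisect_right(timestamps, t)


def make_targets(timestamps, t_o, t_p, viral_threshold):
    """Build prediction targets from a cascade observed up to t_o and evaluated at t_p."""
    if not t_o < t_p:
        raise ValueError("observation time must precede prediction time")
    observed = popularity(timestamps, t_o)   # what a peeking model may see
    final = popularity(timestamps, t_p)      # P(t_p), the quantity to predict
    return {
        "observed": observed,
        "regression_target": final - observed,            # incremental popularity
        "classification_label": final >= viral_threshold,  # hypothetical viral cutoff
    }


# Hypothetical cascade: adoption times in hours since publication.
times = [0.1, 0.4, 0.9, 2.5, 3.0, 7.5, 20.0, 30.0]
targets = make_targets(times, t_o=1.0, t_p=24.0, viral_threshold=5)
ex_ante = make_targets(times, t_o=0.0, t_p=24.0, viral_threshold=5)
```

Predicting the increment `final - observed` rather than `final` is one common choice for the regression target; either fits the definition as long as it is applied consistently.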

Taxonomy of Feature Classes

The survey identifies five complementary feature groups extracted from cascades and network structure:

Temporal features (22.8% of reviewed papers): change rate, dormant period, publication time, early popularity growth velocity, observation time window properties
Cascade structure (14.8%): depth, breadth, edges, branching factor, direct/indirect connectivity
Global graph features (17.4%): degree distribution, centrality, clustering, and other properties of the underlying network
User/item features (23.7%): user activity, follower counts, influence, historical success, account age, content properties
Content features (21.9%): textual, visual, semantic properties of the shared item
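
To make the temporal and cascade-structure groups concrete, here is a small sketch that computes a few such features from a cascade given as `(parent, child, time)` adoption edges. The input representation, the function name, and the exact feature definitions are assumptions chosen for illustration; the surveyed papers define these quantities in varying ways.

```python
from collections import Counter


def cascade_features(root, edges):
    """Compute simple features from (parent, child, time) adoption edges.

    Assumes each child adopts once and every parent adopted strictly before its child.
    """
    depth = {root: 0}
    # Visit edges in time order so a parent's depth is known before its children's.
    for parent, child, _ in sorted(edges, key=lambda e: e[2]):
        depth[child] = depth[parent] + 1

    times = sorted(t for _, _, t in edges)
    gaps = [b - a for a, b in zip(times, times[1:])]
    per_level = Counter(depth.values())

    return {
        "size": len(depth),                  # nodes, including the root
        "depth": max(depth.values()),        # longest root-to-leaf path
        "breadth": max(per_level.values()),  # widest level of the cascade tree
        "first_adoption_time": times[0] if times else None,
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,  # mean inter-adoption time
    }


# Hypothetical five-node cascade rooted at "a".
edges = [("a", "b", 1.0), ("a", "c", 2.0), ("b", "d", 3.0), ("d", "e", 6.0)]
features = cascade_features("a", edges)
```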

Methodological Taxonomy

Feature-based models employ traditional machine learning on hand-crafted features:
Regression and classification learners (linear regression, logistic regression, SVM, decision trees, random forests, naïve Bayes)
Peeking strategies (peeking is more effective than ex-ante prediction; most papers adopt peeking)

Generative models assume explicit stochastic processes governing cascade evolution:
Poisson processes and variants (self-exciting point processes such as Hawkes processes)
Survival analysis and hazard models
Epidemic models (SIR, SEIR variants)
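
As an illustration of the self-exciting (Hawkes) processes named above, the sketch below evaluates a conditional intensity with an exponential kernel, λ(t) = μ + Σ α·exp(−β(t − t_i)) over past events t_i < t. The exponential kernel and the parameter names `mu`, `alpha`, and `beta` are assumptions made for this example; the surveyed models use a range of kernels and parameterizations.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity at time t given past events (exponential kernel)."""
    excitation = sum(
        alpha * math.exp(-beta * (t - ti)) for ti in event_times if ti < t
    )
    return mu + excitation


def branching_ratio(alpha, beta):
    """Expected number of direct offspring per event: the integral of the kernel."""
    return alpha / beta


def expected_cluster_size(alpha, beta):
    """Expected total events triggered by one event, itself included (needs ratio < 1)."""
    n = branching_ratio(alpha, beta)
    if n >= 1.0:
        raise ValueError("supercritical process: expected cluster size is unbounded")
    return 1.0 / (1.0 - n)


# Hypothetical parameters and two past adoptions at t = 0 and t = 1.
lam = hawkes_intensity(2.0, [0.0, 1.0], mu=0.1, alpha=0.5, beta=1.0)
cluster = expected_cluster_size(alpha=0.5, beta=1.0)
```

With these parameters each event triggers 0.5 further events on average, so one seed event leads to an expected 2 events in total. This is the kind of closed-form size estimate that makes generative models attractive for interpretation.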

Deep learning models learn end-to-end representations without explicit feature engineering:
MLP-based architectures
RNNs (LSTM, GRU) for temporal sequences
CNNs for local structure
GNNs (GCNs, attention-based) for graph-level and node-level prediction
Attention mechanisms and variational autoencoders (VAEs)
Reinforcement learning for sequential prediction

Evaluation Metrics and Datasets

The survey documents standard evaluation metrics (accuracy, precision, recall, F1, RMSE, coefficient of determination R², ROC-AUC) and three major benchmark datasets:

Twitter (1.3M cascades, 595k nodes, 14.4M edges): Hashtag cascades from Twitter; avg. depth 2.1, avg. popularity 21.4
Weibo (139k cascades, 6.7M nodes, 15.2M edges): Sina Weibo retweet cascades; avg. depth 2.3, avg. popularity 56.5
APS papers (514k cascades, 616k nodes, 54.6M edges): Citation cascades; avg. depth 4.1, avg. popularity 4.6
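
Several of the evaluation metrics named at the start of this section can be written out in a few lines. The sketch below implements precision, recall, and F1 for the classification formulation, and RMSE and the coefficient of determination for the regression formulation. The function names and the toy labels and values are illustrative assumptions, not results from the survey.

```python
import math


def precision_recall_f1(y_true, y_pred):
    """Binary classification metrics; labels are truthy for the positive (viral) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted popularity."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (undefined if y_true is constant)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy example values, made up for illustration.
p, r, f = precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
err = rmse([10, 20, 30], [12, 18, 33])
fit = r2([10, 20, 30], [12, 18, 33])
```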

Results

Key findings from the literature review:

Growth dynamics: Cascade growth follows power-law distributions (exponent α ≈ 1.9–2.8). Early growth rate is highly predictive of final size; strong correlation between early cascade shape and eventual popularity.
Feature effectiveness: Temporal features demonstrate consistently high predictiveness across datasets; structural features (cascade depth, breadth) and user-item features also critical. Content features alone perform poorly.
Strategy comparison: Peeking strategies substantially outperform ex-ante prediction because early cascade dynamics reveal underlying propagation patterns unavailable at publication time. Observation window of 1 hour to 1 day typically optimal.
Methodology trends: Feature-based approaches dominated 2009–2014; generative models gained traction 2014–2016; deep learning (especially RNNs, GNNs) has dominated since 2017 with superior performance.
Predictability plateau: For very large cascades, predictability saturates—major viral events enter a regime where additional observations yield diminishing improvements. Early detection of "will go viral" is easier than precise size estimation.
Platform variation: Twitter, Weibo, and citation cascades exhibit distinct growth signatures. Weibo cascades gain attention faster from early adopters; Twitter hashtag cascades show slower but more sustained growth; citation cascades reach size asymptotically over months.
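
For the power-law exponent mentioned under growth dynamics, a rough estimate can be obtained with the standard continuous maximum-likelihood estimator, α̂ = 1 + n / Σ ln(x_i / x_min). The sketch below assumes the exponent refers to the distribution of cascade sizes and checks the estimator on synthetic samples. The function names, the choice of `x_min`, and the synthetic data are assumptions for illustration. Real cascade sizes are discrete, so the continuous estimator is only an approximation for them.

```python
import math
import random


def powerlaw_alpha_mle(values, x_min):
    """Continuous power-law MLE for the exponent, using only values >= x_min."""
    tail = [x for x in values if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(alpha, x_min, n, rng):
    """Inverse-transform sampling: x = x_min * (1 - u) ** (-1 / (alpha - 1))."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic "cascade sizes" drawn with a known exponent, to sanity-check the estimator.
rng = random.Random(0)
sizes = sample_powerlaw(alpha=2.5, x_min=1.0, n=20000, rng=rng)
alpha_hat = powerlaw_alpha_mle(sizes, x_min=1.0)
```

The estimate should land close to the true exponent of 2.5, with a standard error of roughly (α − 1)/√n for samples of this kind.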

Connections

Related to Cascade Prediction which specifically focuses on prediction methods and performance comparisons
Provides methodological foundation for Propagation-based fake news detection which uses cascade features for misinformation classification
Complements Information diffusion in social networks which studies broader message and network properties affecting spread
Cited by and extends work on Deep learning applications to network analysis and sequential prediction
Shares benchmark datasets and evaluation protocols with Temporal Prediction and Graph Neural Networks

Notes

Strengths:

Exceptionally comprehensive scope: spans two decades of literature (2000–2019) across 252 papers, covering multiple platforms, content types, and methodological approaches
Clear, formal problem definitions and mathematical notation enabling comparison across heterogeneous work
Systematic taxonomy that organizes work along multiple independent dimensions (formulation, strategy, granularity, features, methodologies) rather than ad-hoc categorization
Extensive tables (8–12) documenting paper-by-paper feature usage, venue distribution, and dataset statistics, facilitating meta-analysis
Honest discussion of trade-offs: e.g., feature-based models offer interpretability, while deep learning sacrifices it for better predictive accuracy; ex-ante prediction is difficult but valuable for early intervention, whereas peeking is easier but less actionable
Recognition of class imbalance, data sparsity, and other practical challenges in real deployments

Weaknesses:

Submitted May 2020, revised/accepted October 2020—misses very recent (2020+) advances in vision transformers, contrastive learning, and multimodal cascades
Limited cross-platform generalization analysis: most papers train and test on single datasets; transfer learning and domain adaptation largely unexplored in the reviewed literature
Explainability largely absent from reviewed work: few papers connect prediction decisions to underlying user behavior or network properties; opens opportunity for interpretable cascade models
Early detection emphasis but limited coverage of intervention strategies: knowing a cascade will go viral is useful for platforms only if paired with actionable interventions (content promotion, throttling, labeling)
Citation cascades (DBLP papers) underrepresented in review despite representing a distinct, high-quality data source; social media dominance may limit generalizability to other domains (e.g., scientific impact prediction, viral discovery, commercial product adoption)

Follow-up opportunities:

Zero-shot and few-shot cascade prediction across platforms and content types
Explainable prediction via attention mechanisms, LIME, SHAP, or causal inference connecting cascade features to user decisions
Integration of multimodal features (image+text, video+comments) for richer cascade representations
Real-time, streaming prediction under budget constraints and cold-start conditions
Combining micro-, meso-, and macro-level predictions in a unified framework
Adversarial cascade manipulation detection: identifying synthetic or manipulated cascade signatures