A Survey of Information Cascade Analysis: Models, Predictions, and Recent Advances

Authors: Fan Zhou, Xovee Xu, Goce Trajcevski, Kunpeng Zhang

Venue: ACM Computing Surveys, Vol. 54, No. 2, Article 27, March 2021

DOI: 10.1145/3433000

arXiv: 2005.11041

TL;DR

This comprehensive survey systematizes research on predicting information cascade growth and structure across social networks, covering feature-based models, generative models, and deep learning approaches. It provides a taxonomy of prediction methods across temporal, structural, user/item, and content features; evaluation metrics and datasets; and identifies open challenges in cross-platform, cross-domain, and explainable cascade prediction.

Contributions

  • General perspective: Extends cascade prediction beyond single-platform and single-content-type studies to cover user-generated microblog content, scientific publications, online sharing networks, and citation cascades.
  • Wider network coverage: Reviews information cascades across diverse platforms (Twitter, Weibo, Facebook, Flickr, YouTube, Reddit, DBLP) rather than focusing on a single social network.
  • Comprehensive methodology taxonomy: Provides fine-grained analysis of prediction methodologies along three independent classification dimensions:
      • Formulation: classification, regression, or both
      • Strategy: ex-ante (before publication), peeking (early observation), or both
      • Prediction granularity: macro-level (collective cascade behavior), micro-level (individual user behavior), meso-level (community-level dynamics)
  • Balanced methodological review: Covers feature-based models, generative models (Poisson processes, survival analysis, epidemic models), and deep learning approaches with systematic comparison of trade-offs, advantages, and limitations across all three categories.
  • Recent advances: Incorporates advances in graph representation learning and deep learning techniques including graph convolutional networks (GCNs), attention mechanisms, and variational autoencoders (VAEs) for cascade modeling.

Method

Problem Definition

Information cascade prediction is framed as predicting popularity P(t_p) of an information item at future prediction time t_p given partial cascade observation up to time t_o. The survey distinguishes:

  • Classification vs. Regression: Whether to predict a binary outcome (outbreak/viral or not) or the exact volume of future popularity
  • Ex-ante vs. Peeking: Ex-ante prediction uses only pre-publication information (content, publisher profile); peeking uses early-stage cascade observations
  • Macro-, micro-, and meso-level: Macro-level models treat cascade growth holistically; micro-level models capture individual user activation and adoption; meso-level models focus on community-level spreading behavior
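
As a rough illustration of the problem setup above, the sketch below turns one cascade into a training example. It assumes a cascade is stored as a list of (user, timestamp) adoption events with timestamps measured from publication; the function names, the dictionary keys, and the outbreak threshold are hypothetical and do not come from the survey.

```python
def popularity(events, t):
    """P(t): number of adoption events with timestamp <= t."""
    return sum(1 for _, ts in events if ts <= t)


def make_example(events, t_o, t_p, outbreak_threshold):
    """Split a cascade into an observed prefix and prediction targets.

    events: list of (user_id, timestamp) pairs (assumed layout).
    t_o: observation time; only events up to t_o may be used as input.
    t_p: prediction time, which must lie after t_o.
    outbreak_threshold: assumed cutoff for the binary "viral" label.
    """
    if t_p <= t_o:
        raise ValueError("prediction time must be later than observation time")
    observed = [(user, ts) for user, ts in events if ts <= t_o]
    target = popularity(events, t_p)
    return {
        "observed": observed,                                  # model input (peeking)
        "regression_target": target,                           # exact future volume
        "classification_target": target >= outbreak_threshold, # outbreak or not
    }
```

Under this layout, setting t_o to the publication time leaves essentially no cascade events in the input, which roughly corresponds to the ex-ante setting; a real ex-ante model would instead rely on content and publisher features.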

Taxonomy of Feature Classes

The survey identifies five complementary feature groups extracted from cascades and network structure:

  1. Temporal features (22.8% of reviewed papers): change rate, dormant period, publication time, early popularity growth velocity, observation time window properties
  2. Cascade structure (14.8%): depth, breadth, edges, branching factor, direct/indirect connectivity
  3. Global graph features (17.4%): degree distribution, centrality, clustering, global graph properties
  4. User/item features (23.7%): user activity, follower counts, influence, historical success, account age, content properties
  5. Content features (21.9%): textual, visual, semantic properties of the shared item
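
To make the temporal and cascade-structure groups more concrete, here is a minimal sketch that computes a few such features from an observed cascade. It assumes each adoption is recorded as a (user, parent, timestamp) triple and that the root user is known; this record layout and the feature names are illustrative assumptions, not definitions taken from the survey.

```python
from collections import Counter


def cascade_features(root, adoptions, t_o):
    """Compute simple structural and temporal features up to time t_o.

    adoptions: list of (user, parent, timestamp) triples (assumed layout),
    where parent is the user the item was adopted from.
    """
    observed = sorted((a for a in adoptions if a[2] <= t_o), key=lambda a: a[2])

    # Depth of each user in the cascade tree; the root sits at depth 0.
    depth_of = {root: 0}
    for user, parent, _ in observed:
        # Adoptions whose parent was never observed are skipped.
        if parent in depth_of and user not in depth_of:
            depth_of[user] = depth_of[parent] + 1

    level_sizes = Counter(depth_of.values())
    times = [ts for _, _, ts in observed]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]

    return {
        "size": len(observed),                        # adoptions, root excluded
        "depth": max(depth_of.values()),              # longest root-to-leaf path
        "breadth": max(level_sizes.values()),         # widest tree level
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
        "growth_rate": len(observed) / t_o if t_o > 0 else 0.0,
    }
```

Global graph, user/item, and content features would need the follower graph, user profiles, and the item itself, so they are left out of this sketch.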

Methodological Taxonomy

Feature-based models employ traditional machine learning on hand-crafted features:

  • Regression and classification models (linear, logistic, SVM, decision trees, random forests, naïve Bayes)
  • Peeking strategies (peeking is more effective than ex-ante prediction; most papers adopt peeking)
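
A feature-based peeking model can be as small as a single hand-crafted feature fed to a linear regressor. The sketch below fits an ordinary least-squares line between the log of early popularity and the log of final popularity. The choice of this single feature and the log-log form are assumptions made for illustration; the survey does not prescribe this particular model.

```python
import math


def fit_log_linear(early_sizes, final_sizes):
    """Fit log(final) = intercept + slope * log(early) by least squares.

    Both inputs must contain strictly positive sizes of equal length.
    """
    xs = [math.log(s) for s in early_sizes]
    ys = [math.log(s) for s in final_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_final_size(early_size, slope, intercept):
    """Map an observed early size to a predicted final size."""
    return math.exp(intercept + slope * math.log(early_size))
```

In practice the feature vector would include several of the temporal, structural, and user features, and the regressor would be swapped for one of the learners listed above.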

Generative models assume explicit stochastic processes governing cascade evolution:

  • Poisson processes and variants (Hawkes processes and other self-exciting point processes)
  • Survival analysis and hazard models
  • Epidemic models (SIR, SEIR variants)
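
To show what a self-exciting process looks like in code, the sketch below evaluates a Hawkes intensity with an exponential kernel: a constant base rate plus a decaying boost from every earlier event. The exponential kernel and the parameter names (base_rate, excitation, decay) are assumptions chosen for illustration; individual papers in the survey use a variety of kernels.

```python
import math


def hawkes_intensity(t, event_times, base_rate, excitation, decay):
    """Event rate at time t: base_rate + sum of excitation * exp(-decay * (t - t_i)).

    Only events strictly before t contribute to the sum.
    """
    boost = sum(
        excitation * math.exp(-decay * (t - t_i))
        for t_i in event_times
        if t_i < t
    )
    return base_rate + boost


def branching_ratio(excitation, decay):
    """Expected number of direct follow-up events triggered by one event.

    For the exponential kernel this is the kernel's integral, excitation / decay.
    A value below 1 means the cascade is expected to die out.
    """
    return excitation / decay
```

Fitting the parameters to observed adoption times (for example by maximum likelihood) is the part that turns this into a predictor, and is not shown here.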

Deep learning models learn end-to-end representations without explicit feature engineering:

  • MLP-based architectures
  • RNNs (LSTM, GRU) for temporal sequences
  • CNNs for local structure
  • GNNs (GCNs, attention-based) for graph-level and node-level prediction
  • Attention mechanisms and variational autoencoders (VAEs)
  • Reinforcement learning for sequential prediction

Evaluation Metrics and Datasets

The survey documents standard evaluation metrics (Accuracy, Precision, Recall, F1, RMSE, Coefficient of Determination, ROC-AUC) and three major benchmark datasets:

  • Twitter (1.3M cascades, 595k nodes, 14.4M edges): Hashtag cascades from Twitter; avg. depth 2.1, avg. popularity 21.4
  • Weibo (139k cascades, 6.7M nodes, 15.2M edges): Sina Weibo retweet cascades; avg. depth 2.3, avg. popularity 56.5
  • APS papers (514k cascades, 616k nodes, 54.6M edges): Citation cascades; avg. depth 4.1, avg. popularity 4.6
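
For reference, most of the metrics named above can be computed in a few lines. This is a plain-Python sketch using the standard textbook definitions; ROC-AUC is omitted because it needs ranked scores rather than hard labels, and the function names are hypothetical.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (truthy = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def regression_metrics(y_true, y_pred):
    """RMSE and coefficient of determination (R^2) for predicted volumes."""
    n = len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    sst = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(sse / n),
        # R^2 is undefined when all true values are identical.
        "r2": 1.0 - sse / sst if sst > 0 else float("nan"),
    }
```

Because cascade sizes are heavy-tailed, regression errors are often computed on log-transformed popularity; that transformation would be applied to both arguments before calling regression_metrics.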

Results

Key findings from the literature review:

  • Growth dynamics: Cascade growth follows power-law distributions (exponent α ≈ 1.9–2.8). Early growth rate is highly predictive of final size; strong correlation between early cascade shape and eventual popularity.
  • Feature effectiveness: Temporal features are consistently highly predictive across datasets; structural features (cascade depth, breadth) and user/item features are also critical. Content features alone perform poorly.
  • Strategy comparison: Peeking strategies substantially outperform ex-ante prediction because early cascade dynamics reveal underlying propagation patterns unavailable at publication time. An observation window of 1 hour to 1 day is typically optimal.
  • Methodology trends: Feature-based approaches dominated 2009–2014; generative models gained traction 2014–2016; deep learning (especially RNNs, GNNs) has dominated since 2017 with superior performance.
  • Predictability plateau: For very large cascades, predictability saturates: major viral events enter a regime where additional observations yield diminishing improvements. Early detection of "will go viral" is easier than precise size estimation.
  • Platform variation: Twitter, Weibo, and citation cascades exhibit distinct growth signatures. Weibo cascades gain attention faster from early adopters; Twitter hashtag cascades show slower but more sustained growth; citation cascades reach size asymptotically over months.
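
The power-law exponent mentioned under growth dynamics can be estimated from data. The sketch below uses the standard continuous maximum-likelihood estimator, 1 + n / Σ ln(x_i / x_min), and assumes the exponent describes the tail of the cascade-size distribution; the cutoff x_min must be chosen by the analyst, and the function name is hypothetical.

```python
import math


def powerlaw_exponent(sizes, x_min):
    """Continuous MLE of a power-law exponent for values >= x_min.

    For integer cascade sizes this is only an approximation, which tends
    to be reasonable when x_min is not too small.
    """
    tail = [s for s in sizes if s >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(s / x_min) for s in tail)
    if log_sum == 0:
        raise ValueError("all tail observations equal x_min")
    return 1.0 + len(tail) / log_sum
```

An estimate in the 1.9–2.8 range reported above would be consistent with the findings summarized here, but checking goodness of fit requires more than a point estimate of the exponent.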

Connections

Notes

Strengths:

  • Exceptionally comprehensive scope: spans roughly 20 years of literature (2000–2019) across 252 papers, covering multiple platforms, content types, and methodological approaches
  • Clear, formal problem definitions and mathematical notation enabling comparison across heterogeneous work
  • Systematic taxonomy that organizes work along multiple independent dimensions (formulation, strategy, granularity, features, methodologies) rather than ad-hoc categorization
  • Extensive tables (8–12) documenting paper-by-paper feature usage, venue distribution, and dataset statistics, facilitating meta-analysis
  • Honest discussion of trade-offs: e.g., feature-based models offer interpretability, while deep learning sacrifices interpretability for better predictive accuracy; ex-ante prediction is difficult but valuable for early intervention, whereas peeking is easier but less actionable
  • Recognition of class imbalance, data sparsity, and other practical challenges in real deployments

Weaknesses:

  • Submitted May 2020 and revised/accepted October 2020, so it misses very recent (2020+) advances in vision transformers, contrastive learning, and multimodal cascades
  • Limited cross-platform generalization analysis: most papers train and test on single datasets; transfer learning and domain adaptation largely unexplored in the reviewed literature
  • Explainability largely absent from reviewed work: few papers connect prediction decisions to underlying user behavior or network properties; opens opportunity for interpretable cascade models
  • Early detection emphasis but limited coverage of intervention strategies: knowing a cascade will go viral is useful for platforms only if paired with actionable interventions (content promotion, throttling, labeling)
  • Citation cascades (DBLP papers) underrepresented in review despite representing a distinct, high-quality data source; social media dominance may limit generalizability to other domains (e.g., scientific impact prediction, viral discovery, commercial product adoption)

Follow-up opportunities:

  • Zero-shot and few-shot cascade prediction across platforms and content types
  • Explainable prediction via attention mechanisms, LIME, SHAP, or causal inference connecting cascade features to user decisions
  • Integration of multimodal features (image+text, video+comments) for richer cascade representations
  • Real-time, streaming prediction under budget constraints and cold-start conditions
  • Combining micro-, meso-, and macro-level predictions in a unified framework
  • Adversarial cascade manipulation detection: identifying synthetic or manipulated cascade signatures