Can Cascades be Predicted?

Authors: Justin Cheng, Lada A. Adamic, P. Alex Dow, Jon Kleinberg, Jure Leskovec

Venue: International World Wide Web Conference (WWW), 2014; also available on arXiv

TL;DR

Information cascades on social networks can be predicted with reasonable accuracy (~80%) using temporal and structural features extracted from early observations. Performance improves substantially as more cascade stages are observed. Temporal properties are the strongest predictor, while content features alone are weak, and user-initiated cascades are inherently more predictable than page-initiated ones.

Contributions

Proposes a cascade growth prediction framework that views cascades as dynamic objects evolving through stages, predicting the next stage from current observations.
Demonstrates that multiple distinct feature classes (content, temporal, structural, user) contribute meaningfully to prediction, with temporal features most predictive.
Shows that prediction accuracy improves as more of a cascade is observed, with gains tapering off for larger cascades.
Reveals fundamental differences in predictability between user-initiated and page-initiated cascades due to structural properties.
Achieves 79.5% accuracy and 0.877 AUC predicting whether photo reshare cascades on Facebook will reach median size or larger.

Method

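The authors formulate cascade prediction as a sequence of binary classification problems. Given the first k reshares of a cascade, they predict whether it will eventually reach at least 2k reshares (doubling-in-size prediction). Since 2k is roughly the median final size of cascades that reach k reshares, the two classes are balanced and random guessing yields about 50% accuracy. They extract features from five classes: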

Content features: properties of the original photo, such as whether it contains food, nature, people, animals, or outdoor/indoor scenes; whether it has a caption; and emotion words in the caption.

Root features: properties of the original poster, such as account age, gender, page vs. user account, number of friends/followers, and viewing activity.

Resharer features: properties of the users who reshared, such as number of viewers until each reshare, pages responsible for each reshare, percentile friend/subscriber counts, and activity levels.

Structural features: graph properties of the cascade, such as out-degree in the induced subgraph, number of connections to the root, border nodes, tree depth, and subgraph size and edge count. The authors use the Wiener index (average pairwise distance between nodes) to measure structural virality.

Temporal features: timing properties, such as elapsed time since the original post, time between reshares, change in time between reshares, views per unit time, and number of new viewers per reshare.
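As a rough illustration of the temporal feature class, the sketch below derives a few timing features from the timestamps of the first k reshares. The function name, the feature names, and the choice to compare the first and second half of the gaps are assumptions made for this example; the paper's exact feature definitions may differ.

```python
def _mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def temporal_features(post_time, reshare_times):
    """Compute simple timing features from reshare timestamps (in seconds).

    post_time:      time of the original post
    reshare_times:  times of the first k observed reshares
    """
    if not reshare_times:
        raise ValueError("need at least one reshare")
    times = sorted(reshare_times)
    # Gaps between consecutive events, starting from the original post.
    gaps = [b - a for a, b in zip([post_time] + times[:-1], times)]
    half = len(gaps) // 2
    first_half, second_half = gaps[:half], gaps[half:]
    return {
        "elapsed": times[-1] - post_time,
        "mean_gap": _mean(gaps),
        # Positive value: reshares are slowing down; negative: speeding up.
        "gap_change": _mean(second_half) - _mean(first_half),
    }


# Hypothetical cascade: posted at t=0, reshared at 10s, 30s, 60s, 100s.
features = temporal_features(0, [10, 30, 60, 100])
```

Here the gaps grow from 10s to 40s, so `gap_change` is positive, which signals a cascade that is losing momentum.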

The authors use logistic regression as the primary classifier and also evaluate SVMs, decision trees, and random forests. They perform 10-fold cross-validation and report classification accuracy, F1 score, and AUC.
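The doubling task formulated at the start of this section can be sketched as a labeling step. This is a minimal illustration, assuming each cascade is summarized only by its final reshare count; `doubling_label` and `build_dataset` are hypothetical helper names, not code from the paper.

```python
from typing import List, Optional, Tuple


def doubling_label(final_size: int, k: int) -> Optional[bool]:
    """Label a cascade observed at k reshares.

    None  -> the cascade never reached k reshares, so it is excluded.
    True  -> the cascade eventually reached at least 2k reshares.
    False -> the cascade reached k reshares but stopped short of 2k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if final_size < k:
        return None
    return final_size >= 2 * k


def build_dataset(final_sizes: List[int], k: int) -> List[Tuple[int, bool]]:
    """Pair the index of each eligible cascade with its binary label."""
    labeled = []
    for idx, size in enumerate(final_sizes):
        label = doubling_label(size, k)
        if label is not None:
            labeled.append((idx, label))
    return labeled


# Hypothetical final reshare counts for six cascades.
sizes = [3, 5, 9, 10, 25, 4]
dataset = build_dataset(sizes, k=5)
```

Repeating `build_dataset` for increasing values of k yields the sequence of classification problems, each restricted to cascades that survived to that stage.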

Results

Overall performance: Using all feature classes, the model achieves 79.5% accuracy, 0.795 F1 score, and 0.877 AUC. Performance is robust in the sense that multiple distinct feature classes achieve similar performance (~78% using temporal alone).
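For readers unfamiliar with the three reported metrics, the following pure-Python sketch shows how each is computed for binary labels. It is an illustrative reimplementation, not the authors' evaluation code, and the example labels and scores are made up. AUC is computed here as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and classifier scores for five cascades.
y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5
```

Accuracy and F1 depend on the chosen threshold, while AUC depends only on how the scores rank the examples, which is why the two can diverge.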

Feature importance ranking (by accuracy):

- Temporal features alone: 78.0% accuracy, 0.870 AUC (most predictive)
- Resharer features: 73.0% accuracy, 0.797 AUC
- Structural features: 67.1% accuracy, 0.735 AUC
- Root features: 63.7% accuracy, 0.707 AUC
- Content features alone: 55.8% accuracy, 0.582 AUC (only slightly better than the 50% random baseline)

Observation window effects: Accuracy improves nonlinearly with the number of observed reshares. Early observations have high impact; performance plateaus for large cascades. At k=25 observed reshares, accuracy reaches ~81% and rises to ~83% at k=100.

Cascade size stratification: Performance differs substantially by cascade size. For cascades with only 10-20 total reshares, prediction is harder (~72% accuracy); for cascades with 100+ reshares, accuracy is ~80%.

Cascade structure: The model can predict cascade shape (whether it will have a shallow or deep structure) with 72.5% accuracy and 0.790 AUC when predicting the Wiener index. User-initiated cascades form deeper, more tree-like structures. Page-initiated cascades spread broader and shallower, and are less predictable in structure.
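The shape measure referred to here, average pairwise distance between nodes, can be computed directly on a small cascade tree. The sketch below assumes the cascade is given as a mapping from each node to the node it reshared from, with the root mapped to `None`; that representation and the function name are choices made for this example. It runs a breadth-first search from every node, so it is only suitable for small cascades.

```python
from collections import deque


def average_pairwise_distance(parent):
    """Mean shortest-path distance over all unordered node pairs of a tree.

    parent: dict mapping each node to its parent (root maps to None).
    """
    # Build an undirected adjacency list from the parent pointers.
    adj = {}
    for child, par in parent.items():
        adj.setdefault(child, [])
        if par is not None:
            adj.setdefault(par, [])
            adj[child].append(par)
            adj[par].append(child)
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0
    for source in adj:
        # Breadth-first search gives hop distances from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    # Every unordered pair was counted twice (once from each endpoint).
    return total / (n * (n - 1))


# Broadcast-like star: three users reshare directly from the root.
star = {"root": None, "a": "root", "b": "root", "c": "root"}
# Chain: each user reshares from the previous resharer.
chain = {"root": None, "a": "root", "b": "a", "c": "b"}
```

With four nodes each, the star gives 1.5 and the chain gives about 1.67, so deeper, person-to-person spread scores higher than shallow broadcast spread.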

Feature correlation shifts: The relative importance of features changes as more of the cascade is observed. Initially, structural virality (Wiener index) is predictive; as the cascade grows, temporal features become dominant. Content and root features remain stable in importance.

Connections

Related to The Spread of True and False News Online via shared focus on understanding information spread dynamics and temporal patterns in cascade growth.
Addresses similar questions to Hierarchical Propagation Networks for Fake News Detection: Investigation and Exploitation on how network structure influences information diffusion.
Provides methodology applicable to Anatomy of an online misinformation network and misinformation cascade analysis.
Foundational work on cascade prediction referenced by Beyond News Contents: The Role of Social Context for Fake News Detection and related diffusion models.

Notes

This work is foundational for understanding information cascade predictability on social networks. The key insight, that temporal features are far more predictive than content, has implications for misinformation spread: viral false content may be driven more by user activity patterns and network structure than by the falsity or novelty of the claim itself. The distinction between user-initiated and page-initiated cascades suggests that the apparent source of disinformation (an individual user vs. an organization) affects both spread patterns and predictability. The paper's finding that predictability gains saturate for large cascades is important: viral content may enter a regime where structural and temporal dynamics dominate, making human judgment or intervention timing critical. One limitation is that the work uses only reshares (not comments, reactions, or other signals) and covers a single platform and content type (Facebook photos), so how well the findings generalize is untested.