On the Origins of Memes by Means of Fringe Web Communities¶

Authors: Savvas Zannettou, Tristan Caulfield, Jeremy Blackburn, Emiliano De Cristofaro, Michael Sirivianos, Gianluca Stringhini, Guillermo Suarez-Tangil

Venue: ACM IMC 2018 — arXiv:1805.12512

TL;DR¶

Large-scale empirical study of how image memes originate and propagate across fringe and mainstream Web communities. Analyzes 160M images posted to Twitter, Reddit, /pol/, and Gab (2016–2017) using perceptual hashing, clustering, and a custom distance metric. Finds that /pol/ and The_Donald substantially influence the meme ecosystem despite their modest size, with influence between communities quantified using Hawkes processes; documents phylogenetic relationships between meme variants using the custom distance metric; shows that hateful memes (anti-semitic, racist) are disproportionately common on fringe communities.

Contributions¶

Large-scale visual analysis framework: Custom processing pipeline combining perceptual hashing (pHash) with density-based clustering (DBSCAN) to identify 12.6K meme clusters from 160M images; introduces custom distance metric balancing visual similarity (pHash) and metadata (Know Your Meme taxonomy)
Ground-truth meme annotation: Draws on Know Your Meme (KYM), a crowdsourced meme encyclopedia, to annotate clusters; manually validates 200 clusters to assess the quality of the automated clustering (89% clustering accuracy after majority vote among three annotators)
Community influence quantification: Uses Hawkes processes to model temporal influence between Web communities; finds /pol/ and The_Donald substantially influence mainstream meme ecosystems by posting large volumes early
Cross-platform ecosystem analysis: Traces meme propagation across four heterogeneous communities (Twitter: 1.47B posts; Reddit: 681M posts; /pol/: 48.7M posts; Gab: 12.4M posts) revealing how fringe communities seed memes that later appear in mainstream networks
Political content characterization: Documents prevalence of hateful, controversial, and political memes; ranks top memes by community (e.g., Donald Trump, Happy Merchant, Smug Frog across /pol/, T_D, Gab); reveals racist memes extremely common in fringe Web communities
Screenshot classifier: Develops CNN for detecting screenshots of social media posts within meme images; achieves 91.3% accuracy, 94.3% precision, 93.5% recall, 93.9% F1; enables detection of meme variants originating from different platforms

Method¶

Data sources:

Twitter: 1.47B posts (242.7M with images, 114.5M unique images) via 1% streaming (Jan–Jul 2017)
Reddit: 681.7M posts (62.1M with images, 40.5M unique images) via Reddit API (Jul 2016–Jul 2017)
/pol/ (4chan Politically Incorrect): 48.7M posts (13.4M with images, 4.3M unique images) via archives (Jul 2016–Jul 2017)
Gab: 12.4M posts (1.2M with images, 235K unique images); community launched August 2016
Know Your Meme (KYM): 15.6K meme entries with 597K unique pHashes; covers meme origins, variants, keywords, subcategories, cultures, people, events, sites

Processing pipeline:

pHash extraction (Step 1): Extracts Discrete Cosine Transform–based 64-element perceptual hash fingerprint for each image
Pairwise clustering (Steps 2–3): Computes Hamming distance between all pHashes; uses DBSCAN with distance threshold θ=8 (balances diversity vs. false positives)
Cluster annotation (Steps 4–5): Compares clusters with KYM entries using Hamming distance; selects representative KYM meme per cluster via minimum average distance
Custom distance metric (Step 6): Defines hybrid distance combining perceptual similarity and metadata:
Features: {perceptual, meme, people, culture}
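Distance: \(d(c_i, c_j) = 1 - \sum_{f} w_f \, r_f(c_i, c_j)\), where \(f\) ranges over the features above, \(w_f\) is the weight of feature \(f\), and \(r_f\) denotes the feature-specific similarity between clusters \(c_i\) and \(c_j\)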
Full-mode weights: \(w_{perceptual}=0.4, w_{meme}=0.4, w_{people}=0.1, w_{culture}=0.1\) (equal weight to the perceptual and meme features; lower weight to people and culture; weights sum to 1)
Partial-mode weights: Used when only one cluster is annotated; sets \(w_{perceptual}=1.0\)
Influence estimation (Step 7): Uses Hawkes processes to model temporal influence between communities; quantifies per-community reciprocal influence on meme dissemination
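
Steps 1–3 above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes pHashes are already available as 64-bit integers, uses a plain O(n²) neighbour search, and takes `min_samples=5` as an assumed DBSCAN parameter (these notes only state θ=8). The hash values are made up for the example.

```python
from collections import deque

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

def dbscan(hashes, eps=8, min_samples=5):
    """Minimal DBSCAN over integer hashes; returns one label per hash (-1 = noise)."""
    n = len(hashes)
    # Neighbour lists within eps (each point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if hamming(hashes[i], hashes[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbours[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels

# Two groups of near-duplicate hashes (1 bit flipped each) plus one outlier.
base_a, base_b = 0xF0F0F0F0F0F0F0F0, 0x0123456789ABCDEF
hashes = [base_a ^ (1 << k) for k in range(5)]
hashes += [base_b ^ (1 << k) for k in range(5)]
hashes.append(0xFFFFFFFF00000000)

labels = dbscan(hashes, eps=8, min_samples=5)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, -1]
```

The outlier receives label -1, which is what the noise percentages in these notes count: images that DBSCAN leaves outside every cluster.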

Datasets:

12.6K annotated clusters in total, out of 38.9K clusters identified on /pol/, 21K on The_Donald, and 3.1K on Gab; 9.2K of the /pol/ clusters had KYM coverage
Noise assessment: Of clusters with distance ≤8, 62.8% are "clean" meme clusters; 24% noise; 14% edge cases

Results¶

Cluster statistics:

/pol/: 4.3M images → 38.9K clusters (24% coverage with KYM tags); 63% noise
The_Donald: 1.2M images → 21K clusters (13% coverage); 64% noise
Gab: 235K images → 3.1K clusters (15% coverage); 69% noise
Noise distribution: Strongly dependent on clustering distance; distances 2–6 show 62.8%–73% noise; distance 10 has lowest noise (27%)
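
The "coverage with KYM tags" figures come from the cluster annotation step. A rough sketch of that step, under stated assumptions: each cluster is represented by its medoid, a KYM entry matches when at least one of its gallery pHashes lies within θ=8 of the medoid, and the representative entry is the match with the smallest average distance. The entry names (`meme_a`, `meme_b`, `meme_c`) and hash values are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

def medoid(cluster):
    """Hash with the smallest total distance to all others in the cluster."""
    return min(cluster, key=lambda h: sum(hamming(h, o) for o in cluster))

def annotate(cluster, kym, theta=8):
    """Return (representative entry or None, {entry: mean distance of matches})."""
    m = medoid(cluster)
    scores = {}
    for entry, gallery in kym.items():
        matches = [d for d in (hamming(m, g) for g in gallery) if d <= theta]
        if matches:
            scores[entry] = sum(matches) / len(matches)
    if not scores:
        return None, scores  # cluster stays unannotated (no KYM coverage)
    return min(scores, key=scores.get), scores

base = 0xF0F0F0F0F0F0F0F0
cluster = [base, base ^ 0b1, base ^ 0b10, base ^ 0b100]
kym = {
    "meme_a": [base ^ 0b111, base ^ 0xFF],   # distances 3 and 8: both match
    "meme_b": [base ^ 0b1, base ^ 0xFFFF],   # distances 1 and 16: one match
    "meme_c": [base ^ 0xFFFFFFFF],           # distance 32: no match
}

representative, scores = annotate(cluster, kym)
print(representative, scores)  # meme_b {'meme_a': 5.5, 'meme_b': 1.0}
```

Coverage for a community is then the fraction of its clusters for which `annotate` returns a non-`None` representative.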

Top KYM entries:

Top 20 by frequency: Donald Trump (207 /pol/ clusters, 177 T_D clusters, 25 Gab), Happy Merchant (124 /pol/, 84 T_D, 10 Gab), Smug Frog (63 T_D, 35 /pol/), Pepe the Frog (49 T_D), Dubs Guy (51 /pol/). Anti-semitic memes are disproportionately common on /pol/
Hateful content prevalence: /pol/ shows high concentration of hateful and controversial memes; anti-semitic Happy Merchant meme extremely prevalent on /pol/, less so on mainstream networks
Community specialization: Each fringe community has distinctive meme repertoire; racist/hateful memes dominant on /pol/; political Trump memes dominant on T_D; Gab attracts right-wing users and conspiracy-related content

Meme evolution & branching:

Phylogenetic relationships: Smug Frog shows 525 clusters representing 23 distinct meme variants; related frog memes include Pepe the Frog (152 clusters), Apu Apustaja, Sad Frog, and Savepepe variants
Cluster evolution patterns: Meme variants branch hierarchically; phylogenetic distance metric reveals parent-child relationships; examples include "Arbeit macht frei" Pepe (anti-semitic variant combining Auschwitz imagery with Pepe), Happy Merchant variants showing Nazi symbology
Branching dynamics: Two clusters that are perceptually similar but disseminate different messages are considered separate meme variants (e.g., Smug Frog appears in two clusters: one showing the character as a dinosaur-like creature, another showing it hiding behind a kitchen counter)
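
The relationships above rest on the custom cluster distance \(d(c_i, c_j) = 1 - \sum_f w_f \, r_f(c_i, c_j)\) from the Method section. A minimal sketch follows, with assumptions flagged: these notes do not give the form of each similarity \(r_f\), so the example uses a linearly normalised Hamming similarity for the perceptual feature and Jaccard similarity for the annotation sets, and it falls back to partial mode whenever either cluster lacks annotations. The dictionary layout and the culture tag are hypothetical.

```python
FULL_WEIGHTS = {"perceptual": 0.4, "meme": 0.4, "people": 0.1, "culture": 0.1}
PARTIAL_WEIGHTS = {"perceptual": 1.0}

def jaccard(a, b):
    """Jaccard similarity of two sets (assumed 0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def perceptual_similarity(h1, h2, bits=64):
    """Assumed form: 1 minus the normalised Hamming distance of the medoids."""
    return 1.0 - bin(h1 ^ h2).count("1") / bits

def cluster_distance(ci, cj):
    """d(ci, cj) = 1 - sum_f w_f * r_f(ci, cj)."""
    both_annotated = bool(ci["meme"]) and bool(cj["meme"])
    weights = FULL_WEIGHTS if both_annotated else PARTIAL_WEIGHTS
    sims = {
        "perceptual": perceptual_similarity(ci["medoid"], cj["medoid"]),
        "meme": jaccard(ci["meme"], cj["meme"]),
        "people": jaccard(ci["people"], cj["people"]),
        "culture": jaccard(ci["culture"], cj["culture"]),
    }
    return 1.0 - sum(w * sims[f] for f, w in weights.items())

base = 0xF0F0F0F0F0F0F0F0
a = {"medoid": base, "meme": {"Smug Frog"}, "people": set(), "culture": {"frogs"}}
b = {"medoid": base ^ 0xFF, "meme": {"Smug Frog"},
     "people": {"Donald Trump"}, "culture": {"frogs"}}
c = {"medoid": base ^ 0xFF, "meme": set(), "people": set(), "culture": set()}

print(round(cluster_distance(a, b), 3))  # full mode: 0.15
print(round(cluster_distance(a, c), 3))  # partial mode: 0.125
```

In full mode two clusters with the same meme annotation stay close even when their medoids differ by 8 bits, which is what lets annotated variants group into branches.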

Influence estimation (Hawkes processes):

/pol/ influence: Most influential community; 4.1% mean increase in event probability for Reddit; 0.01% for Twitter (minimal direct influence); substantial influence on /pol/ ecosystem itself
The_Donald: Highly efficient at pushing memes to both fringe and mainstream communities despite modest size
Community dynamics: Fringe communities (/pol/, T_D, Gab) serve as seed sources for meme variants later appearing on mainstream platforms; meme propagation shows clear directional flow from fringe → mainstream
Temporal patterns: Memes originating in fringe communities often appear on Twitter/Reddit weeks later, suggesting temporal lag in mainstream adoption
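
To make the Hawkes-process idea concrete: each community has a background rate of posting a meme, and every earlier post on any community temporarily raises the rate on the others. The sketch below assumes a continuous-time model with an exponential kernel; these notes do not specify the kernel or the fitting procedure, and all parameter values (`mu`, `W`, `beta`) and event times are hypothetical.

```python
import math

def intensity(t, target, events, mu, W, beta=1.0):
    """Rate on `target` at time t: background plus decaying excitation.

    events: list of (time, source_community); W[src][dst] is the assumed
    expected number of events on dst triggered by one event on src.
    """
    rate = mu[target]
    for t_i, src in events:
        if t_i < t:
            rate += W[src][target] * beta * math.exp(-beta * (t - t_i))
    return rate

def excitation_shares(t, target, events, mu, W, beta=1.0):
    """Split the intensity at time t into background and per-source shares."""
    parts = {"background": mu[target]}
    for t_i, src in events:
        if t_i < t:
            contribution = W[src][target] * beta * math.exp(-beta * (t - t_i))
            parts[src] = parts.get(src, 0.0) + contribution
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}

mu = {"pol": 0.2, "twitter": 0.1}
W = {
    "pol": {"pol": 0.3, "twitter": 0.05},
    "twitter": {"pol": 0.0, "twitter": 0.2},
}
events = [(0.0, "pol"), (0.5, "pol"), (0.8, "twitter")]

rate = intensity(1.0, "twitter", events, mu, W)
shares = excitation_shares(1.0, "twitter", events, mu, W)
print(rate)
print(shares)
```

Attributing each event to the background or to an earlier event on some community, as in `excitation_shares`, is the intuition behind per-community influence estimates; as noted under the limitations, such attribution reflects timing and does not by itself prove causation.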

Clustering quality assessment:

Annotator agreement: 89% clustering accuracy after majority vote among three annotators
1.85% "bad" clusters: Determined by checking whether KYM annotations match cluster images and whether labels are "appropriate"
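
The accuracy-after-majority-vote figure can be computed as sketched below. The label names and the toy votes are hypothetical; with three annotators and a binary label a strict majority always exists. At the 200 validated clusters mentioned under Contributions, 89% corresponds to 178 clusters.

```python
from collections import Counter

def majority(labels):
    """Most common label among one cluster's annotator votes."""
    return Counter(labels).most_common(1)[0][0]

def accuracy_after_majority(votes, positive="correct"):
    """Fraction of clusters whose majority label is `positive`."""
    decided = [majority(v) for v in votes]
    return sum(label == positive for label in decided) / len(decided)

# One tuple of three annotator votes per cluster (toy data).
votes = [
    ("correct", "correct", "incorrect"),
    ("correct", "correct", "correct"),
    ("incorrect", "incorrect", "correct"),
    ("correct", "incorrect", "correct"),
]

print(accuracy_after_majority(votes))  # 0.75
print(round(0.89 * 200))               # 178
```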

Screenshot classifier performance:

Architecture: Two Conv-MaxPool blocks followed by fully-connected layer (512 units) with Dropout(d=0.5)
Dataset: 28.8K images from Twitter, 4chan, Reddit, Facebook, Instagram with binary screenshot/random label
Performance: AUC = 0.96; Accuracy = 91.3%; Precision = 94.3%; Recall = 93.5%; F1 = 93.9%
Application: Enables detection of meme variants that appear as screenshots of social media posts
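
As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and recall, so the stated precision and recall should reproduce the stated F1. The snippet below only performs that arithmetic; it does not reproduce the classifier.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.943, 0.935)
print(round(f1, 3))  # 0.939
```

The result matches the reported F1 of 93.9%.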

Connections¶

Propagation models: Uses Hawkes processes to quantify community influence, extending temporal cascade analysis
Social media analysis: Large-scale empirical study of content dissemination across heterogeneous platforms
Extremism and radicalization: Documents disproportionate prevalence of hateful and anti-semitic memes on fringe communities
Political extremism: Analyzes role of /pol/ and T_D in coordinating and amplifying political memes
Meme culture and politics: Foundational study of how memes originate and propagate in context of 2016 US election and political polarization
Zannettou et al. (2018) — Disinformation Warfare: Parallel study by same authors examining state-sponsored troll influence; both papers characterize fringe community dynamics
Co-author Blackburn's work on 4chan toxicity and identity provides complementary perspective on platform dynamics

Notes¶

Strengths:

Large-scale visual dataset: 160M images from four diverse platforms over 13-month period provides unprecedented empirical scope for studying meme ecosystems
Methodological rigor: Custom distance metric balancing visual and semantic similarity; validation against crowdsourced KYM encyclopedia; manual annotation with inter-rater agreement assessment
Cross-platform comparison: Heterogeneous community analysis (Twitter, Reddit, /pol/, Gab) reveals systematic differences in meme types and prevalence; shows how fringe communities specialize in offensive content
Temporal influence modeling: Hawkes processes quantify directed influence between communities, providing evidence that fringe communities seed mainstream memes
Practical output: Screenshot classifier enables downstream analysis of meme variants originating from different sources; valuable tool for understanding cross-platform propagation

Limitations & caveats:

Clustering noise: DBSCAN clustering produces 62.8%–73% noise at distances 2–6 (27% at distance 10), so results depend strongly on the distance threshold; manual assessment shows "bad" cluster annotations at 1.85%, but the full false positive rate is unclear
KYM coverage limitations: Only 15.6K out of 18.3K KYM entries available at crawl time; coverage varies by community (24% for /pol/, 13% for T_D); misses newer memes and community-specific variants not in KYM
Annotation subjectivity: Cluster appropriateness determined by three annotators with guidelines, but KYM galleries themselves are crowdsourced and may contain errors; no assessment of label correctness independent of KYM
Temporal truncation: Dataset collected Jul 2016–Jul 2017; misses the period before July 2016 (including the 2016 primary season), when initial meme ecosystems formed, and post-2017 evolution
Platform selection bias: Focuses on four communities with known political content; excludes mainstream news, YouTube, TikTok, Instagram (which may host memes in different forms)
Causality vs. correlation: Hawkes process modeling shows timing relationships but doesn't prove causal influence; temporal precedence doesn't rule out independent creation of similar variants
Ethical considerations: Paper documents prevalence of hateful memes (anti-semitic, racist) on /pol/ without fully analyzing why offensive content dominates these communities or policy implications

Significance:

This paper provides quantitative evidence that fringe Web communities (/pol/, The_Donald, Gab) play an outsized role in originating and seeding meme variants that propagate to mainstream social media. The key finding, that /pol/ and T_D influence the broader meme ecosystem despite their relatively small size, suggests these communities function as innovation hubs for internet culture. The temporal lag between fringe and mainstream meme appearance is consistent with memes being filtered or curated as they move between communities, although the lag alone does not establish this.

The prevalence of hateful memes on fringe platforms raises critical questions about platform moderation, algorithmic amplification, and the role of anonymous imageboards in coordinating offensive content. The paper demonstrates that studying internet culture requires visual analysis and cross-platform comparison—text-only approaches miss how memes encode meaning through visual variation and how their meanings evolve as they propagate.

The distance metric and clustering approach provide a template for large-scale visual meme analysis; the screenshot classifier enables distinguishing meme variants that reference specific platforms. This work opens research directions in meme phylogenetics, community-specific content moderation, and understanding how fringe communities influence broader cultural narratives.