Skip to content
Scalable and Generalizable Social Bot Detection through Data Selection

Scalable and Generalizable Social Bot Detection through Data Selection

Authors: Kai-Cheng Yang, Onur Varol, Pik-Mai Hui, Filippo Menczer

Venue: arXiv, November 2019 — arXiv:1911.09179

TL;DR

This paper addresses scalability and generalization challenges in social bot detection by proposing a framework using only 20 user metadata features, enabling real-time analysis of the full Twitter stream. The key insight is that strategic data selection—training on a carefully chosen subset of 13 compiled datasets—achieves better generalization and consistency than exhaustively training on all available data.

Contributions

  • Framework for scalable bot detection using minimal metadata: achieves 900M tweets/day classification speed, exceeding Twitter Firehose capacity
  • Compilation and systematic analysis of 13 labeled bot/human datasets (94,124 bots, 43,396 humans) from literature and three newly-created sources
  • Quantitative analysis of dataset characteristics: shows separability and generalizability are uncorrelated; some datasets have contradictory labels suggesting labeling noise and fundamental feature limitations
  • Data selection methodology: exhaustive search over 247 dataset combinations to identify training sets optimizing cross-validation accuracy, cross-domain generalization, and consistency with Botometer
  • Evidence that careful subset selection (model M196) outperforms training on all data (model M246) in generalization tests
  • Interpretability: with only 20 features, SHAP explains which characteristics signal bot behavior (e.g., high friends growth, low favorites—bots expand networks without organic engagement)

Method

Features: 20 user metadata features extracted from Twitter API user object: statuses count, followers count, friends count, favourites count, listed count, verified status, default profile, profile background image, and derived features (growth rates, ratios, screen name characteristics). The screen name likelihood feature uses 2M+ unique screen names to detect random naming patterns via bigram likelihood.

Datasets: Compiled 13 datasets: caverlee (honeypot-based), varol-icwsm (manually labeled from bot score deciles), cresci-17 (fine-grained: spambots, social spambots, fake followers), pronbots (scam bots), celebrity, vendor-purchased (paid followers), botometer-feedback (flagged by tool users), political-bots, gilani-17 (undergrad-annotated), cresci-rtbust (Italian retweets), cresci-stock (stock-related), midterm-18 (2018 U.S. election), botwiki (self-identified), verified (balance class).

Dataset Analysis: PCA visualization and homogeneity scoring show five datasets with clear bot/human separation; others (cresci-stock, varol-icwsm) are difficult due to coordination-based labeling or low inter-annotator agreement (75%). Cross-dataset AUC matrix reveals contradictory patterns: training on dataset X and testing on Y sometimes yields AUC < 0.5, indicating datasets with opposite labels and highlighting the noise and feature-space gaps in bot detection.

Data Selection: Trained 247 random forest models (one per dataset combination). Evaluated each via five-fold cross-validation, cross-domain tests on four holdout datasets (botwiki-verified, midterm-18, gilani-17, cresci-rtbust), and correlation with Botometer on 100k random accounts. Selected the model with minimal product of ranks across six evaluation metrics—balancing performance rather than maximizing any single metric.

Evaluation: Strict validation system: cross-validation accuracy, cross-domain generalization (unseen datasets), and consistency with reference tool (Botometer).

Results

Data Selection Effectiveness: - Model M196 (best model): trained on caverlee, varol-icwsm, cresci-17, celebrity, botometer-feedback, political-bots with AUC 0.99 on botwiki-verified and midterm-18 (vs. Botometer's 0.92 and 0.96) - M196 achieves higher precision across holdout datasets at 0.5 threshold compared to Botometer - M246 (all data): AUC 0.91 and 0.83 on same tests—selective training beats exhaustive training - M196 five-fold cross-validation AUC: 0.98; correlation with Botometer on 100k random accounts: Spearman's r = 0.60

Scalability: - Classification: 9.612 ± 0.00006 seconds per account, enabling 900M tweets/day (exceeds Firehose's 500M/day) - 200× faster than Botometer (which requires 200 tweets per account and user mentions) - Users lookup API rate: 8.6M accounts/day (vs. 43.2k/day for feature-rich methods)

Interpretability (SHAP): - Long screen names, high friends count, high friends growth rate → bot-like - Verified status, high favorites count, high followers count, high followers/friends ratio → human-like - Bots eagerly add friends but don't attract followers; humans show organic engagement patterns

Connections

Notes

Strengths: - First systematic analysis of contradictions across bot datasets—reveals fundamental labeling noise and feature-space limitations - Practical insight: data selection beats exhaustive training, with implications for other noisy domains (speech recognition, emotion detection) - Scalable to real-time stream analysis, enabling large-scale studies - Transparent evaluation: cross-domain validation, holdout testing, consistency with established tools - Highly interpretable 20-feature model; SHAP analysis provides actionable insights - Reproducible: curated datasets published in bot repository

Weaknesses: - 20 features capture only a "tiny portion" of account characteristics (authors' words); methods requiring timeline, content, or social network still outperform on some datasets (e.g., cresci-rtbust) - Cross-domain performance drops on two difficult datasets: gilani-17 and cresci-rtbust (both AUC ~0.68 vs. 0.92 on others), suggesting limits to metadata-only approaches - Feature selection not optimized; threshold choice (0.48 vs. 0.32 for cross-validation vs. cross-domain) requires dataset-specific tuning - Doesn't detect coordinated behavior (explicitly out of scope), only individual account classification

Future directions: - Smarter data selection algorithms beyond exhaustive search - Adaptive thresholds or per-dataset calibration - Integration with network/temporal methods to detect coordinated inauthentic behavior - Generalization analysis on non-Twitter platforms