Machine learning
Machine learning methods, particularly supervised learning with feature engineering and ensemble approaches, are widely used in misinformation and bot detection systems. Researchers train classifiers on labeled datasets of authentic and inauthentic accounts or content. Features are extracted from user profiles (metadata, posting patterns, temporal signatures, content), social networks (follower structure, retweet cascades), and text (linguistic patterns, sentiment), and the trained models are then deployed to score new accounts or posts.
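As a minimal sketch of the feature-extraction step, the function below turns one raw account record into a flat set of numeric features. The record layout (`created_at`, `observed_at`, `followers`, `friends`, `post_times`) and the four feature names are assumptions made for illustration; they are not taken from any particular dataset or detection system.

```python
from datetime import datetime
from statistics import mean, pstdev


def extract_features(account):
    """Turn one raw account record into a flat dict of numeric features.

    `account` is a hypothetical dict with keys created_at, observed_at
    (datetime), followers, friends (int), and post_times (list of datetime).
    """
    times = sorted(account["post_times"])
    # Gaps between consecutive posts, in seconds.
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    # Clamp the account age to at least one day to avoid division by zero.
    age_days = max((account["observed_at"] - account["created_at"]).days, 1)
    return {
        "followers_per_friend": account["followers"] / max(account["friends"], 1),
        "posts_per_day": len(times) / age_days,
        "mean_gap_s": mean(gaps) if gaps else 0.0,
        # A very small spread means very regular posting, one possible
        # hint of automation.
        "gap_std_s": pstdev(gaps) if gaps else 0.0,
    }


example = {
    "created_at": datetime(2024, 1, 1),
    "observed_at": datetime(2024, 1, 11),
    "followers": 50,
    "friends": 200,
    "post_times": [datetime(2024, 1, 5, 12, 0, s) for s in (0, 10, 20, 30)],
}
print(extract_features(example))
```

A real pipeline would add network and text features alongside these profile and timing features, and would typically scale them before training.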
Common approaches
Supervised classification: Train a classifier (random forest, logistic regression, SVM, neural networks) to distinguish bots from humans, false claims from true ones, or misinformation sources from credible ones. Features are hand-engineered from metadata, network structure, and text.
Ensemble methods: Combine multiple classifiers (bagging, boosting, stacking) to improve robustness and generalization. Example: Botometer v4 uses an ensemble of specialized classifiers, one per bot type.
Unsupervised clustering: Group accounts or content by behavioral similarity without labeled training data. Useful for discovering bot networks and coordinated behavior.
Deep learning: Use neural networks (CNNs, RNNs, transformers) to learn representations of text and network structure end-to-end, without hand-engineered features.
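The idea of an ensemble of specialized classifiers, one per bot type, can be sketched as follows. This is not Botometer's implementation: the nearest-centroid scorer is a deliberately simple stand-in for a real classifier, the bot type names and feature vectors are invented, and combining scores by taking the maximum is one aggregation choice assumed here.

```python
import math


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


class CentroidScorer:
    """Toy binary classifier scoring closeness to a bot centroid in [0, 1]."""

    def __init__(self, human_rows, bot_rows):
        self.human = centroid(human_rows)
        self.bot = centroid(bot_rows)

    def score(self, x):
        d_human = math.dist(x, self.human)
        d_bot = math.dist(x, self.bot)
        total = d_human + d_bot
        # 1.0 means "at the bot centroid", 0.0 means "at the human centroid".
        return 0.5 if total == 0 else d_human / total


def train_specialized(humans, bots_by_type):
    """Fit one scorer per bot type, each against the same human examples."""
    return {name: CentroidScorer(humans, rows) for name, rows in bots_by_type.items()}


def ensemble_score(scorers, x):
    """Return (bot_type, score) for the specialized scorer that fires hardest."""
    return max(((name, m.score(x)) for name, m in scorers.items()),
               key=lambda pair: pair[1])


# Hypothetical features: [posts_per_day, followers_per_friend].
humans = [[2.0, 1.0], [4.0, 1.2], [3.0, 0.8]]
bots_by_type = {
    "spammer": [[200.0, 0.1], [180.0, 0.1]],
    "fake_follower": [[0.1, 0.01], [0.3, 0.03]],
}
scorers = train_specialized(humans, bots_by_type)
print(ensemble_score(scorers, [195.0, 0.1]))
```

Because each scorer only has to separate humans from one kind of bot, adding a newly observed bot type means training one more scorer rather than retraining a single monolithic model. The sketch uses unscaled features, so the feature with the largest range dominates the distances; a practical version would normalize first.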
Challenges
Data quality and bias: Labeled datasets are expensive and subject to annotation error; ground truth is often ambiguous (e.g., what counts as "disinformation"?).
Concept drift: Bots and disinformation tactics evolve faster than models are retrained; systems that perform well on historical data degrade on new data.
Cross-domain generalization: Models trained on one domain (e.g., the 2016 U.S. election) often fail to generalize to different contexts or time periods.
Fairness: Classifiers may exhibit disparate error rates across demographic groups, languages, or regions.
Interpretability: Black-box models (e.g., deep neural networks) make predictions hard to explain; practitioners need to understand why an account is flagged as a bot before they can act on the decision.
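Several of these challenges can be checked with the same basic audit: break a model's errors down by subgroup instead of reporting one aggregate number. The sketch below computes false positive and false negative rates per group. The record fields (`label`, `pred`, `lang`, `year`) are hypothetical; grouping by language or region probes disparate error rates, while grouping by time period gives a rough view of degradation on newer data.

```python
from collections import defaultdict


def error_rates_by_group(records, group_key):
    """Return {group: {"fpr": ..., "fnr": ..., "n": ...}} for binary labels.

    Each record is a hypothetical dict with boolean "label" (True = bot),
    boolean "pred", and whatever field `group_key` names.
    """
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for r in records:
        c = counts[r[group_key]]
        if r["label"]:
            c["pos"] += 1
            c["fn"] += int(not r["pred"])  # bot that was missed
        else:
            c["neg"] += 1
            c["fp"] += int(r["pred"])      # human wrongly flagged
    return {
        g: {
            # None marks a rate that is undefined for lack of examples.
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
            "n": c["neg"] + c["pos"],
        }
        for g, c in counts.items()
    }


records = [
    {"lang": "en", "year": 2016, "label": False, "pred": False},
    {"lang": "en", "year": 2016, "label": False, "pred": False},
    {"lang": "en", "year": 2020, "label": True, "pred": True},
    {"lang": "pt", "year": 2020, "label": False, "pred": True},
    {"lang": "pt", "year": 2020, "label": False, "pred": False},
    {"lang": "pt", "year": 2016, "label": True, "pred": True},
]
print(error_rates_by_group(records, "lang"))
print(error_rates_by_group(records, "year"))
```

With groups this small the rates are only illustrative; a real audit would need enough examples per group for the differences to be meaningful.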
Key papers in this wiki
- Causal Machine Learning: A Survey and Open Problems — Foundational survey of causal machine learning methods for robust, transferable, and fair predictions; covers invariant feature learning, counterfactual fairness, and causal generative modeling
- Detection of Novel Social Bots by Ensembles of Specialized Classifiers — Ensemble of specialized classifiers; addresses cross-domain generalization by training separate classifiers for bot types that exhibit heterogeneous features
- Scalable and Generalizable Social Bot Detection through Data Selection — Demonstrates that data selection (training on a carefully chosen subset of the available data) yields better generalization and consistency than training on all available data; shows how noisy and contradictory labeled datasets can be managed via combinatorial analysis of 247 training set combinations
- A Benchmark Study of Machine Learning Models for Online Fake News Detection — Benchmark study comparing 8 traditional ML models (SVM, logistic regression, decision trees, naive Bayes, k-NN, boosting) against deep learning and pre-trained transformers on fake news datasets; finds that traditional approaches require feature engineering but can match deep learning performance on large, diverse datasets