Feature engineering for fake news detection
Feature engineering covers approaches that manually design or extract structured feature representations from news content, social context, or knowledge sources, as opposed to end-to-end deep learning methods that learn representations directly from raw input. Classical classifiers (logistic regression, random forests, SVMs) are typically applied to these hand-crafted features.
Common feature families in fake news detection include:
- Linguistic / stylometric: RST discourse features, LIWC psycholinguistic categories, n-gram representations, readability scores, lexical/syntactic features.
- Social-context / user-profile: account metadata, behavioral features (post frequency, follower ratios), inferred demographics (age, personality, political bias).
- Knowledge-based: fact-check signals from external knowledge bases.
- Network: propagation graph statistics, stance distribution.
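To make the linguistic / stylometric family concrete, the following is a minimal sketch of hand-crafted lexical features computed from raw text. The function name `stylometric_features` and the particular features chosen (type-token ratio, exclamation rate, all-caps rate, and so on) are illustrative assumptions, not the feature set of any paper listed on this page; real pipelines typically add LIWC categories, readability scores, and n-gram counts.

```python
import re
from collections import Counter


def stylometric_features(text: str) -> dict:
    """Compute a few simple lexical features (hypothetical feature set)."""
    # Words in original case, so that all-caps words can be detected.
    words = re.findall(r"[A-Za-z']+", text)
    # Naive sentence split on terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(words)
    vocab = Counter(w.lower() for w in words)

    def per_token(x: float) -> float:
        # Guard against division by zero on empty input.
        return x / n if n else 0.0

    return {
        "n_tokens": float(n),
        "avg_word_len": per_token(sum(len(w) for w in words)),
        "type_token_ratio": per_token(len(vocab)),
        "avg_sentence_len": n / len(sentences) if sentences else 0.0,
        "exclamation_rate": per_token(text.count("!")),
        "all_caps_rate": per_token(
            sum(1 for w in words if len(w) > 1 and w.isupper())
        ),
    }


features = stylometric_features(
    "SHOCKING news! You won't believe this. Read it now!"
)
```

The resulting dictionary can be turned into one row of a feature matrix and passed to a classical classifier such as logistic regression or a random forest.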
Feature importance analysis (e.g., Gini importance, the mean decrease in Gini impurity, in a random forest) is frequently used to identify which features drive classification, providing interpretability that black-box neural models lack.
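The quantity behind Gini-based importance can be sketched for a single split. A random forest's importance score for a feature aggregates impurity decreases like the one below over every split on that feature and over all trees, weighted by the fraction of samples reaching each node. The code covers only the single-split step, and the toy data and feature names (`account_age_days`, `noise`) are hypothetical.

```python
from collections import Counter


def gini_impurity(labels) -> float:
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(values, labels, threshold) -> float:
    """Decrease in Gini impurity when splitting on `values <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    # Child impurities are weighted by the share of samples in each child.
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Toy data (hypothetical): 1 = fake, 0 = real.
labels = [1, 1, 1, 0, 0, 0]
account_age_days = [10, 20, 30, 400, 500, 600]  # separates the classes
noise = [1, 2, 1, 2, 1, 2]                      # nearly uninformative

informative = impurity_decrease(account_age_days, labels, threshold=100)
uninformative = impurity_decrease(noise, labels, threshold=1.5)
```

Here the informative feature yields a much larger decrease than the noisy one, which is the pattern an importance ranking surfaces. Impurity-based importance is known to favor high-cardinality features, so it is often cross-checked with permutation importance.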
Key papers
- Zhou & Zafarani (2019) — Network-based Fake News Detection: A Pattern-driven Approach: 138 network-structural features across five levels (node, ego, triad, community, network); pattern-based feature groups are individually ablatable — More-Spreader and Stronger-Engagement dominate; effective distance and community density features are theoretically motivated but empirically modest.
- Zhou et al. (2020) — Fake News Early Detection: An Interdisciplinary Study: the most comprehensive theory-grounded feature engineering paper in the wiki; 116+ features spanning four linguistic levels (sBOW, POS, CFG, DIA, CBA, RST); reveals that BOW and deep-syntax CFG features each individually exceed 80% accuracy; empirically maps fake news to deception and clickbait patterns; XGBoost/RF classifiers.
- Shu et al. (2019) — The Role of User Profiles for Fake News Detection: UPF feature vector; Gini-based importance ranking with RegisterTime and political bias as top features.
- Sitaula et al. (2019) — Credibility-based Fake News Detection: 26 credibility features (source + content); source features (author count, past history) dominate content features; LR achieves F1-macro 0.80.
- Cao et al. (2025) — SLIM: replaces full-text input with strategically selected limited features — keywords (MMR-extracted), POS tags, NER tags, or metadata — and uses information-theoretic measures (normalised Shannon entropy, token counts) to verify sparsity; 30% keyword extraction recovers ~99% of full-text accuracy on ReCOVery.
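As a rough illustration of the information-theoretic measures mentioned in the SLIM entry, the sketch below computes a normalised Shannon entropy over a token sequence. Dividing by the log of the number of distinct tokens is one common normalisation and is an assumption here; the paper's exact definition may differ. The function name `normalised_entropy` is likewise hypothetical.

```python
import math
from collections import Counter


def normalised_entropy(tokens) -> float:
    """Shannon entropy of the token distribution, scaled to [0, 1].

    Assumption: normalise by log2 of the number of distinct tokens,
    so a uniform distribution over the vocabulary scores 1.0.
    """
    counts = Counter(tokens)
    if len(counts) <= 1:
        # Empty input or a single repeated token carries no uncertainty.
        return 0.0
    total = sum(counts.values())
    entropy = -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
    return entropy / math.log2(len(counts))


uniform = normalised_entropy(["vaccine", "hoax", "cure", "study"])
skewed = normalised_entropy(["vaccine", "vaccine", "vaccine", "hoax"])
```

Comparing this score, together with token counts, between the full text and a reduced input (for example, extracted keywords) gives a simple way to quantify how sparse the reduced representation is.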
Connections
- User profiles is the specific application of feature engineering to social media account attributes.
- Social-context detection relies heavily on feature-engineered representations of user and network signals.