Content-based features for misinformation detection
The structured extraction and use of linguistic, stylistic, and structural features from news articles and claims for the computational detection of misinformation. The topic covers features based on syntax (POS tags, dependency structures), semantics (word embeddings, sentiment, emotion), pragmatics (writing style, readability), and rhetorical signals (clickbait, bias lexicons).
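As a rough illustration of what surface-level content features look like in practice, the sketch below computes a handful of structural and stylistic signals with the Python standard library only. The function name `extract_content_features`, the regex-based tokenization, and the particular feature set are assumptions made for this example; they are not the feature definitions used by any of the papers listed on this page. Readability is approximated with the Automated Readability Index (ARI), chosen here because it needs only character, word, and sentence counts.

```python
import re

def extract_content_features(text: str) -> dict:
    """Compute a few surface-level content features for one document.

    Illustrative only: tokenization and sentence splitting are regex
    heuristics, not a linguistically accurate pipeline.
    """
    # Words: runs of letters/digits, optionally followed by an apostrophe suffix.
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)
    # Sentences: split on runs of ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]

    features = {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "avg_word_len": 0.0,
        "type_token_ratio": 0.0,
        "all_caps_ratio": 0.0,
        "ari": 0.0,
    }
    if not words:
        return features  # avoid division by zero on empty input

    n_words = len(words)
    n_sentences = max(len(sentences), 1)
    n_chars = sum(ch.isalnum() for w in words for ch in w)

    features["avg_word_len"] = n_chars / n_words
    # Lexical diversity: distinct lowercased words over total words.
    features["type_token_ratio"] = len({w.lower() for w in words}) / n_words
    # Share of fully upper-case words of length >= 2 (a crude "shouting" signal).
    features["all_caps_ratio"] = sum(len(w) >= 2 and w.isupper() for w in words) / n_words
    # Automated Readability Index.
    features["ari"] = 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sentences) - 21.43
    return features
```

Calling `extract_content_features(article_text)` on each article yields one fixed-length feature dictionary per document, which can then be fed to any standard classifier. Syntactic features such as POS tags or dependency structures would require an NLP library and are deliberately left out of this sketch.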
Key papers
- Horne et al. (2018) — Sampling the News Producers (NELA2017): Defines and extracts 130 content features spanning structure (POS, clickbait detection, readability), sentiment, emotion, engagement signals, bias lexicons, and morality indicators. Shows that systematic feature differences exist between source types and that combinations of features enable source characterization.
- Potthast et al. (2017) — A Stylometric Inquiry: Applies stylometric features (lexical, syntactic, and structural patterns) to distinguish hyperpartisan from mainstream news; demonstrates that style-based detection achieves moderate performance (F1=0.78 for hyperpartisan).
- Sharma et al. (2018) — Combating Fake News: A Survey: Systematic survey of content-based detection methods including POS tags, PCFG features, CNNs on word embeddings, and RNN architectures; discusses trade-offs between feature engineering and learned representations.
- Rashkin et al. (2017) — Truth of Varying Shades: Linguistic analysis of fake news across satire, hoax, and propaganda; identifies distinctive lexical and syntactic patterns that characterize different misinformation genres.
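Stylometric approaches such as those summarized above typically describe a text by frequency profiles rather than by its topic. The following is a minimal sketch of two such profiles, character n-gram frequencies and function-word rates, using only the standard library. The helper names and the tiny `FUNCTION_WORDS` list are hypothetical and chosen for illustration; they do not reproduce the feature sets of any listed paper.

```python
from collections import Counter
import re

# Tiny illustrative function-word list (an assumption, not taken from any paper).
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "is", "it")

def char_ngram_profile(text: str, n: int = 3, top_k: int = 5) -> list:
    """Return the top_k most frequent character n-grams with relative frequencies."""
    # Lowercase and collapse whitespace so layout does not affect the profile.
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    grams = [normalized[i:i + n] for i in range(len(normalized) - n + 1)]
    if not grams:
        return []  # text shorter than n characters
    counts = Counter(grams)
    total = len(grams)
    return [(gram, count / total) for gram, count in counts.most_common(top_k)]

def function_word_rates(text: str) -> dict:
    """Return the relative frequency of each function word among all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in FUNCTION_WORDS}
```

Because both profiles are largely independent of subject matter, they are one plausible way to compare writing style across sources or genres; in a real study the n-gram vocabulary and word lists would be fixed on training data before being applied to unseen articles.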
Related topics
- Fake news detection uses content features as a primary signal
- Media characterization employs similar feature sets for source-level analysis
- Natural language processing provides foundational techniques for feature extraction
- Neural networks and deep learning approaches learn feature representations end-to-end rather than relying on hand-engineered features
Notes
The transition from hand-engineered content features to learned representations (embeddings, transformers) represents a methodological shift in the field. However, interpretable content features remain valuable for understanding why models make their decisions and for detecting misinformation in low-resource settings, where deep learning is less practical.
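To make the interpretability point concrete, the sketch below scores a document with a linear (logistic) model over named content features and reports how much each feature contributed to the decision. The feature names, the values in `WEIGHTS`, and `BIAS` are hypothetical and invented purely for illustration; a real model would learn them from labeled data.

```python
import math

# Hypothetical weights for illustration only; not learned from any dataset.
WEIGHTS = {"all_caps_ratio": 2.0, "exclamations": 0.8, "type_token_ratio": -1.5}
BIAS = -0.5

def explain(features: dict) -> tuple:
    """Return (probability, contributions ranked by absolute size)."""
    # In a linear model each feature's contribution is simply weight * value.
    contributions = {name: WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS}
    logit = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return probability, ranked

probability, ranked = explain(
    {"all_caps_ratio": 0.5, "exclamations": 2, "type_token_ratio": 0.6}
)
for name, value in ranked:
    print(f"{name}: {value:+.2f}")
```

Each printed line attributes part of the score to a human-readable feature, which is the kind of explanation that is harder to obtain directly from an end-to-end learned representation.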