Is Less Really More? Fake News Detection with Limited Information

Authors: Zhaoyang Cao, John Nguyen, Reza Zafarani
Venue: arXiv preprint, April 2025 (arXiv:2504.01922)
Code: https://github.com/kappakant/SLIM

TL;DR

SLIM (Systematically-selected Limited Information) replaces full news article text with strategically chosen subsets (extracted keywords, POS/NER sequence tags, or textual metadata) to address the computational cost and data-sparsity limitations of full-text fake news detection. Information-theoretic measures (normalized Shannon entropy, average token count) confirm these subsets are substantially sparser than full text. With XLNet_base as the backbone, the full-text configuration reaches 95.55% accuracy on ReCOVery and 97.60% on Fake_And_Real_News, matching or beating MisRoBÆRTA and BiLSTM_CapsNet, while the keyword and keyword + title variants reach 92.76 to 93.72% at 25% keywords. Retaining keywords amounting to 30% of the article's word count recovers ~99% of full-text accuracy, and combining keywords with the title or NER tags can yield additional gains.

Contributions

First framework to systematically quantify and compare multiple limited-information strategies for fake news detection using information-theoretic measures (normalized Shannon entropy and average token count).
Demonstrated that MMR-selected keywords at 30% of full-text length are sufficient to achieve near-full-text accuracy (~99% accuracy ratio) across multiple benchmarks.
Showed that multimodal combinations of limited information (keywords + title, keywords + NER) can outperform single-modality subsets, although the gain is not uniform across datasets.
Established that metadata alone (title, author) cannot substitute for the text body, with accuracy falling more than 10 percentage points below the full-text baseline, but that it serves as a useful complement when combined with keywords.

Method

SLIM formalises the news article as an ordered sequence \(A = \{w_1, w_2, \ldots, w_p\}\) and defines four variants based on the type of limited input:

SLIM_KEYWORD. BERT computes a document embedding \(e_d \in \mathbb{R}^n\). N-gram word embeddings \(e_{w_i}\) are computed for each word; candidates must have positive cosine similarity to \(e_d\). MMR (Maximal Marginal Relevance, \(\lambda = 0.5\)) then iteratively selects, from the remaining candidate set \(C\), the word \(w^*\) that best trades off relevance to the document against redundancy with the already-selected set \(R\):

\[w^* = \arg\max_{w_i \in C}\left[\lambda \cdot \text{sim}(e_d, e_{w_i}) - (1-\lambda)\max_{w_j \in R}\text{sim}(e_{w_i}, e_{w_j})\right]\]

The keyword proportion \(k\) is expressed as a fraction of total word count and is swept from 10% to 35% in experiments.
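
The MMR selection loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names `cosine` and `mmr_select` are hypothetical, the 2-D vectors are toy stand-ins for BERT embeddings, and `k` is given as a count rather than as a proportion of the article length.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mmr_select(doc_emb, cand_embs, k, lam=0.5):
    """Pick up to k words by Maximal Marginal Relevance.

    cand_embs maps each candidate word to its embedding (toy vectors here).
    """
    relevance = {w: cosine(doc_emb, e) for w, e in cand_embs.items()}
    # Candidates must have positive similarity to the document embedding.
    candidates = [w for w, s in relevance.items() if s > 0]
    selected = []
    while candidates and len(selected) < k:
        def score(w):
            # Redundancy is the highest similarity to any already-selected word.
            redundancy = max(
                (cosine(cand_embs[w], cand_embs[r]) for r in selected),
                default=0.0,
            )
            return lam * relevance[w] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data: "vaccines" is a near-duplicate of "vaccine"; "weather" points away
# from the document and is filtered out by the positive-similarity rule.
doc = [1.0, 1.0]
cands = {
    "vaccine": [1.0, 0.2],
    "vaccines": [1.0, 0.15],
    "hoax": [0.1, 1.0],
    "weather": [-1.0, -1.0],
}
top2 = mmr_select(doc, cands, k=2)
```

With these toy vectors the second pick is "hoax" rather than the near-duplicate "vaccines", which is the diversity effect that the \((1-\lambda)\) term is meant to produce.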

SLIM_SEQUENCE. Tokenised text is POS-tagged; adjectives and adverbs are retained (SLIM_POS). Named-entity chunks are extracted without count limits (SLIM_NER), since named entities are already sparse.
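
A minimal sketch of the two sequence filters, assuming the text has already been tagged by some external tool. The summary does not name the tagger or tagset, so the Penn Treebank adjective/adverb tags used here, the input shapes, and the function names `slim_pos` and `slim_ner` are all assumptions for illustration.

```python
# Assumed tagset: Penn Treebank adjective (JJ*) and adverb (RB*) tags.
ADJ_ADV_TAGS = {"JJ", "JJR", "JJS", "RB", "RBR", "RBS"}

def slim_pos(tagged_tokens):
    """Keep only tokens tagged as adjectives or adverbs.

    tagged_tokens is a list of (token, pos_tag) pairs from any POS tagger.
    """
    return [tok for tok, tag in tagged_tokens if tag in ADJ_ADV_TAGS]

def slim_ner(chunks):
    """Keep every named-entity chunk, with no cap on how many are retained.

    chunks is a list of (text, entity_label) pairs; label is None for
    spans that are not named entities.
    """
    return [text for text, label in chunks if label is not None]

# Made-up example sentence, pre-tagged by hand.
tagged = [
    ("Shocking", "JJ"), ("report", "NN"), ("falsely", "RB"),
    ("claims", "VBZ"), ("new", "JJ"), ("cure", "NN"),
]
chunks = [("WHO", "ORG"), ("denies", None), ("Geneva", "GPE"), ("report", None)]

pos_words = slim_pos(tagged)
ner_words = slim_ner(chunks)
```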

SLIM_METADATA. Only title and/or author fields are passed to the encoder, bypassing the body text entirely.

SLIM_MULTIMODAL. Concatenation (\(\oplus\)) of keyword sets with NER words (\(\text{SLIM}^I\)), author (\(\text{SLIM}^{II}\)), or title (\(\text{SLIM}^{III}\)).
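
The concatenation step might look like the sketch below. How the pieces are joined before tokenisation is not stated in this summary, so the space separator, the helper name `slim_multimodal`, and the example strings are assumptions.

```python
def slim_multimodal(keywords, extra, sep=" "):
    """Concatenate a keyword list with one extra signal.

    extra may be a list of NER words or a single string (author or title).
    The separator is an assumption; the paper only specifies concatenation.
    """
    extra_tokens = extra.split() if isinstance(extra, str) else list(extra)
    return sep.join(list(keywords) + extra_tokens)

# Made-up inputs for illustration only.
keywords = ["vaccine", "hoax", "microchip"]
slim_i = slim_multimodal(keywords, ["Pfizer", "CDC"])        # keywords + NER words
slim_ii = slim_multimodal(keywords, "Jane Doe")              # keywords + author
slim_iii = slim_multimodal(keywords, "Vaccine hoax spreads") # keywords + title
```

Each resulting string would then be tokenised and passed to the encoder in place of the full article text.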

All variants share the same backbone: XLNet_base, pre-trained with XLNet's permutation language modelling objective \(\max_\theta \mathcal{F}(\theta)\), where \(\mathcal{F}(\theta) = \mathbb{E}_{z \sim \mathcal{Z}_T}\left[\sum_{t=1}^{T} \log p(x_{z_t} \mid \mathbf{x}_{z_{<t}};\theta)\right]\) and \(\mathcal{Z}_T\) is the set of permutations of a length-\(T\) index sequence, and fine-tuned with cross-entropy loss using Adam (\(\eta = 5 \times 10^{-5}\)). Prediction is \(\hat{y} = \text{argmax}_i(\mathbf{z}_i)\) over the XLNet output logits.
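
As a worked illustration of the fine-tuning loss and the prediction rule, the sketch below computes softmax cross-entropy and the argmax over a pair of logits in plain Python. The logit values and function names are made up; in practice the logits would come from the XLNet classification head and the loss would be minimised with Adam.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[label])

def predict(logits):
    """Return the index of the largest logit (the argmax rule)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits for one article, ordered as [real, fake].
logits = [0.2, 1.7]
y_hat = predict(logits)
loss_if_fake = cross_entropy(logits, 1)  # true label agrees with the prediction
loss_if_real = cross_entropy(logits, 0)  # true label disagrees
```

The loss is smaller when the true label matches the class with the larger logit, which is what gradient updates during fine-tuning push towards.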

Information density of each input type is measured via normalised Shannon entropy \(S_\text{normalized} = \sum_{w \in A} H(w)/\text{sig}(w)\) where significance \(\text{sig}(w) = f_w^{\mathcal{T}} / |\mathcal{T}|\) normalises by relative word frequency in the full article.
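
One possible reading of the entropy formula is sketched below. The summary does not define \(H(w)\), so this sketch assumes \(H(w) = -p_w \log_2 p_w\) with \(p_w\) the relative frequency of \(w\) in the limited input, sums over distinct words, and takes \(\text{sig}(w)\) as the relative frequency of \(w\) in the full article. The function names are hypothetical, and the paper's exact definition may differ.

```python
import math
from collections import Counter

def normalized_entropy(subset_tokens, full_tokens):
    """Sum of H(w) / sig(w) over distinct words of the limited input.

    Assumed: H(w) = -p_w * log2(p_w), p_w measured in subset_tokens;
    sig(w) = count of w in full_tokens / len(full_tokens).
    """
    sub_counts = Counter(subset_tokens)
    full_counts = Counter(full_tokens)
    total = 0.0
    for word, count in sub_counts.items():
        p = count / len(subset_tokens)
        h = -p * math.log2(p)
        sig = full_counts[word] / len(full_tokens)
        if sig > 0:  # skip words that never occur in the full article
            total += h / sig
    return total

def average_token_count(documents):
    """Mean number of tokens per document in a tokenised corpus."""
    return sum(len(doc) for doc in documents) / len(documents)

# Tiny worked example: full article "a a b c", limited input "a b".
score = normalized_entropy(["a", "b"], ["a", "a", "b", "c"])
avg_len = average_token_count([["a", "b"], ["a", "a", "b", "c"]])
```

In the worked example each word has \(p_w = 0.5\) and \(H(w) = 0.5\); dividing by significances 0.5 and 0.25 gives 1 + 2 = 3.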

Results

Full-text SLIM baseline (Table 3):

Dataset	Accuracy	Macro-F₁	AUC
ReCOVery	95.55 ± 0.0046	94.71	95.53
Fake_And_Real_News	97.60 ± 0.0031	97.60	97.62

Comparison with baselines (Table 1, 25% keywords; accuracy in %):

Method	ReCOVery	Fake_And_Real_News
DocEmb_TFIDF BiLSTM	89.56	92.26
MisRoBÆRTA	91.35	97.34
BiLSTM_CapsNet	95.49	95.56
SLIM	95.55	97.60
SLIM_KEYWORD	92.86	92.76
SLIM_MULTIMODAL^III	93.72	93.72

Key findings by research question:

RQ2 (keywords): 30% keyword extraction achieves ~99% accuracy ratio on ReCOVery. Accuracy ratio increases monotonically with keyword proportion on both datasets.
RQ3 (sequences): POS tagging achieves ~94% accuracy ratio at 10–20% of full text. NER alone reaches 86.82% on ReCOVery (significantly below baseline) and 90.08% on Fake_And_Real_News.
RQ4 (metadata): Title alone reaches 82.25% (ReCOVery) and 85.21% (Fake_And_Real_News), a statistically significant degradation of more than 10 percentage points against the full-text baseline. Author alone: 76.99% (ReCOVery).
RQ5 (multimodal): Keywords + title outperforms keywords alone on ReCOVery. Keywords + NER slightly reduces accuracy (~0.5%) on Fake_And_Real_News due to heterogeneous NER effects.
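
The accuracy ratios referred to above can be recomputed from the tabulated accuracies. The sketch assumes the ratio is simply limited-input accuracy divided by full-text accuracy on the same dataset; the helper name `accuracy_ratio` is hypothetical, and the 30% keyword figures are not tabulated in this summary, so only the 25% and title-only rows are used.

```python
# Accuracies (%) copied from the tables and findings above.
full_text = {"ReCOVery": 95.55, "Fake_And_Real_News": 97.60}
limited = {
    "SLIM_KEYWORD (25%)": {"ReCOVery": 92.86, "Fake_And_Real_News": 92.76},
    "SLIM_MULTIMODAL^III (25%)": {"ReCOVery": 93.72, "Fake_And_Real_News": 93.72},
    "Title only": {"ReCOVery": 82.25, "Fake_And_Real_News": 85.21},
}

def accuracy_ratio(limited_acc, full_acc):
    """Fraction of full-text accuracy retained by a limited-input variant."""
    return limited_acc / full_acc

# Ratio table: variant -> dataset -> retained fraction, rounded to 4 places.
ratios = {
    variant: {
        dataset: round(accuracy_ratio(acc, full_text[dataset]), 4)
        for dataset, acc in per_dataset.items()
    }
    for variant, per_dataset in limited.items()
}

# Absolute drop in percentage points for the title-only input on ReCOVery.
title_drop_recovery = round(full_text["ReCOVery"] - limited["Title only"]["ReCOVery"], 2)
```

Under this definition, 25% keywords retain about 97% of full-text accuracy on ReCOVery and about 95% on Fake_And_Real_News, while the title alone retains about 86 to 87%.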

Connections

Uses ReCOVery as a primary benchmark; the dataset was introduced in Zhou et al. (2020).
SLIM_MULTIMODAL extends the multi-signal approach of SAFE (Zhou et al., 2020) from image+text fusion to keyword+metadata fusion.
The strategic feature-selection philosophy complements the hand-crafted feature tradition surveyed in feature engineering, but uses a language-model-guided MMR rather than manual selection.
The multimodal detection topic covers the broader context of combining heterogeneous information sources.
HERO (Zhou et al., 2023) is concurrent work from the Syracuse group that uses hierarchical linguistic trees; SLIM takes the opposite direction, minimising rather than enriching the input representation.

Notes

The central empirical insight, that MMR-selected keywords amounting to 30% of the article's word count recover ~99% of full-text accuracy, is practically significant for deployment in sparse-data or real-time contexts. However, this claim rests on two benchmarks (ReCOVery, Fake_And_Real_News), both English-language and relatively balanced. Generalisation to highly imbalanced, multilingual, or domain-shifted datasets remains open.

The metadata result is instructive: titles alone yield ~82–85% accuracy, which suggests that clickbait-style linguistic patterns are detectable from headlines. The author feature is weaker (76.99% on ReCOVery), suggesting author identity is either noisy or absent in these datasets.

Future work flagged by the authors: syntactic-semantic augmentation (controlled paraphrase generation, dependency shuffling) to further close the gap between limited-information and full-text performance.