Is Less Really More? Fake News Detection with Limited Information

Authors: Zhaoyang Cao, John Nguyen, Reza Zafarani
Venue: arXiv preprint, April 2025 (arXiv:2504.01922)
Code: https://github.com/kappakant/SLIM

TL;DR

SLIM (Systematically-selected Limited Information) replaces full news article text with strategically chosen subsets (extracted keywords, POS/NER sequence tags, or textual metadata) to address the computational cost and data-sparsity limitations of full-text fake news detection. Information-theoretic measures (normalized Shannon entropy, average token count) confirm these subsets are substantially sparser than full text. Fine-tuned on full text, the XLNet_base backbone reaches 95.55% accuracy on ReCOVery and 97.60% on Fake_And_Real_News, matching or beating MisRoBÆRTA and BiLSTM_CapsNet. Retaining keywords amounting to just 30% of the text recovers ~99% of that full-text accuracy, and combining keywords with the title or NER tags can yield additional gains.

Contributions

  • First framework to systematically quantify and compare multiple limited-information strategies for fake news detection using information-theoretic measures (normalized Shannon entropy and average token count).
  • Demonstrated that MMR-selected keywords at 30% of full-text length are sufficient to achieve near-full-text accuracy (~99% accuracy ratio) across multiple benchmarks.
  • Showed that multi-modal combination of limited information (keywords + title, keywords + NER) generally outperforms single-modality subsets, although keywords + NER slightly reduces accuracy on Fake_And_Real_News.
  • Established that metadata alone (title, author) cannot substitute for the text body (title-only accuracy falls roughly 12 to 13 points below the full-text baseline) but serves as a useful complement when combined with keywords.

Method

SLIM formalises the news article as an ordered sequence \(A = \{w_1, w_2, \ldots, w_p\}\) and defines four variants based on the type of limited input:

SLIM_KEYWORD. BERT computes a document embedding \(e_d \in \mathbb{R}^n\). N-gram word embeddings \(e_{w_i}\) are computed for each candidate word, and only candidates with positive cosine similarity to \(e_d\) are kept in the candidate set \(C\). MMR (Maximal Marginal Relevance, \(\lambda = 0.5\)) then iteratively moves into the already-selected set \(R\) the word \(w^*\) that maximises:

\[w^* = \arg\max_{w_i \in C}\left[\lambda \cdot \text{sim}(e_d, e_{w_i}) - (1-\lambda)\max_{w_j \in R}\text{sim}(e_{w_i}, e_{w_j})\right]\]

The keyword proportion \(k\) is expressed as a fraction of total word count and is swept from 10% to 35% in experiments.
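The MMR selection rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `cosine` and `mmr_select`, the toy 2-d embeddings, and the convention that the redundancy term is 0 while \(R\) is empty are all assumptions made for the example. Passing an absolute count `k` instead of a proportion is also a simplification; a proportion could be converted by multiplying it by the article's word count.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(doc_emb, word_embs, k, lam=0.5):
    """Pick up to k words by Maximal Marginal Relevance.

    doc_emb   : document embedding e_d
    word_embs : dict mapping candidate word -> embedding e_w (hypothetical input)
    lam       : relevance/diversity trade-off (lambda in the formula)
    """
    relevance = {w: cosine(doc_emb, e) for w, e in word_embs.items()}
    # Candidate set C: only words with positive similarity to the document.
    candidates = [w for w, s in relevance.items() if s > 0]
    selected = []  # R, the already-selected set

    while candidates and len(selected) < k:
        def score(w):
            # Assumed convention: redundancy is 0 while R is still empty.
            redundancy = max(
                (cosine(word_embs[w], word_embs[r]) for r in selected),
                default=0.0,
            )
            return lam * relevance[w] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy embeddings, purely illustrative.
doc = [1.0, 1.0]
words = {
    "covid": [1.0, 0.8],
    "coronavirus": [1.0, 0.7],   # near-duplicate of "covid"
    "hoax": [0.2, 1.0],          # less relevant, but adds diversity
    "weather": [-1.0, -0.5],     # negative similarity, never a candidate
}
keywords = mmr_select(doc, words, k=2)
print(keywords)
```

With \(\lambda = 0.5\) the second pick favours the diverse word over the near-duplicate; setting `lam=1.0` reduces the rule to plain relevance ranking.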

SLIM_SEQUENCE. Tokenised text is POS-tagged; adjectives and adverbs are retained (SLIM_POS). Named-entity chunks are extracted without count limits (SLIM_NER), since named entities are already sparse.
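As a rough sketch of the two sequence filters, the snippet below works on already-tagged input so that it stays independent of any particular tagger. The Penn Treebank tag prefixes (`JJ` for adjectives, `RB` for adverbs), the function names, and the input shapes are assumptions; the source does not say which tagger or tagset is used.

```python
# Assumed Penn Treebank prefixes: JJ, JJR, JJS = adjectives; RB, RBR, RBS = adverbs.
KEEP_PREFIXES = ("JJ", "RB")

def slim_pos(tagged_tokens):
    """Keep only adjectives and adverbs from (token, tag) pairs, preserving order."""
    return [tok for tok, tag in tagged_tokens if tag.startswith(KEEP_PREFIXES)]

def slim_ner(entity_chunks):
    """Flatten (entity_text, label) chunks into a word list, with no count limit."""
    return [word for text, _label in entity_chunks for word in text.split()]

# Hypothetical pre-tagged sentence and entity chunks.
tagged = [
    ("Shocking", "JJ"), ("report", "NN"), ("falsely", "RB"),
    ("claims", "VBZ"), ("new", "JJ"), ("cure", "NN"),
]
chunks = [("World Health Organization", "ORG"), ("Geneva", "GPE")]

pos_words = slim_pos(tagged)
ner_words = slim_ner(chunks)
print(pos_words, ner_words)
```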

SLIM_METADATA. Only title and/or author fields are passed to the encoder, bypassing the body text entirely.

SLIM_MULTIMODAL. Concatenation (\(\oplus\)) of keyword sets with NER words (\(\text{SLIM}^I\)), author (\(\text{SLIM}^{II}\)), or title (\(\text{SLIM}^{III}\)).
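The three concatenations can be pictured as simple string joins. In this sketch the helper names, the space separator, and the keywords-first ordering are assumptions; the source only states that the parts are concatenated.

```python
def concat_inputs(*parts, sep=" "):
    """Join the non-empty text parts into a single encoder input string."""
    return sep.join(p for p in parts if p)

def build_multimodal(keywords, ner_words=None, author=None, title=None):
    """Return the three multimodal inputs keyed by variant number."""
    kw = " ".join(keywords)
    return {
        "I": concat_inputs(kw, " ".join(ner_words or [])),  # keywords + NER words
        "II": concat_inputs(kw, author),                    # keywords + author
        "III": concat_inputs(kw, title),                    # keywords + title
    }

# Hypothetical article fields.
variants = build_multimodal(
    ["covid", "hoax"],
    ner_words=["WHO"],
    author="Jane Doe",
    title="Miracle cure found",
)
print(variants["III"])
```

Missing fields simply drop out of the join, so an article without an author yields the keyword string alone for variant II.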

All variants share the same backbone: XLNet_base, pre-trained with XLNet's permutation language modelling objective \(\mathcal{F}(\theta) = \max\mathbb{E}[\sum_t \log p(x_{z_t} \mid \mathbf{x}_{z_{<t}};\theta)]\) and fine-tuned with cross-entropy loss using Adam (\(\eta = 5 \times 10^{-5}\)). Prediction is \(\hat{y} = \text{argmax}_i(\mathbf{z}_i)\) over XLNet logits.
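The loss and prediction rule can be illustrated without loading the model. The sketch below computes softmax cross-entropy and the argmax prediction from a pair of logits; the logit values and the label mapping (0 = real, 1 = fake) are assumptions for the example, and the optimiser step itself is not shown.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

def predict(logits):
    """Index of the largest logit, i.e. argmax_i z_i."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits for one article; assumed mapping 0 = real, 1 = fake.
logits = [0.3, 2.1]
loss_if_fake = cross_entropy(logits, label=1)
loss_if_real = cross_entropy(logits, label=0)
print(predict(logits), round(loss_if_fake, 4), round(loss_if_real, 4))
```

The loss is small when the true label matches the larger logit and grows when it does not, which is the signal the fine-tuning step minimises.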

Information density of each input type is measured via normalised Shannon entropy \(S_\text{normalized} = \sum_{w \in A} H(w)/\text{sig}(w)\) where significance \(\text{sig}(w) = f_w^{\mathcal{T}} / |\mathcal{T}|\) normalises by relative word frequency in the full article.
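A small sketch may help make the normalisation concrete. The summary above does not define \(H(w)\), so the code assumes the per-word term \(H(w) = -p(w)\log_2 p(w)\), with \(p(w)\) the relative frequency of \(w\) inside the limited input, and it sums over distinct words. Both choices are assumptions, as is the function name; only \(\text{sig}(w) = f_w^{\mathcal{T}} / |\mathcal{T}|\) comes from the text.

```python
import math
from collections import Counter

def normalized_entropy(subset_tokens, full_tokens):
    """Sum of H(w) / sig(w) over the distinct words of the limited input.

    Assumed: H(w) = -p(w) * log2 p(w), p(w) = relative frequency in the subset.
    From the text: sig(w) = frequency of w in the full article / article length.
    """
    sub_counts = Counter(subset_tokens)
    full_counts = Counter(full_tokens)
    n_sub, n_full = len(subset_tokens), len(full_tokens)

    total = 0.0
    for word, count in sub_counts.items():
        p = count / n_sub
        h = -p * math.log2(p)
        sig = full_counts[word] / n_full
        if sig > 0:  # skip words that never occur in the full article
            total += h / sig
    return total

# Toy article and a two-word subset of it.
full = ["a", "a", "b", "c"]
subset = ["a", "b"]
score = normalized_entropy(subset, full)
print(score)
```

In the toy case each subset word has \(H(w) = 0.5\); dividing by significances 0.5 and 0.25 gives 1 and 2, so rarer words in the full article contribute more to the total.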

Results

Full-text SLIM baseline (Table 3):

| Dataset | Accuracy | Macro-F₁ | AUC |
| --- | --- | --- | --- |
| ReCOVery | 95.55 ± 0.0046 | 94.71 | 95.53 |
| Fake_And_Real_News | 97.60 ± 0.0031 | 97.60 | 97.62 |

Comparison with baselines (Table 1, 25% keywords):

| Method | ReCOVery | Fake_And_Real_News |
| --- | --- | --- |
| DocEmb_TFIDF BiLSTM | 89.56 | 92.26 |
| MisRoBÆRTA | 91.35 | 97.34 |
| BiLSTM_CapsNet | 95.49 | 95.56 |
| SLIM | 95.55 | 97.60 |
| SLIM_KEYWORD | 92.86 | 92.76 |
| SLIM_MULTIMODAL^III | 93.72 | 93.72 |

Key findings by research question:

  • RQ2 (keywords): 30% keyword extraction achieves ~99% accuracy ratio on ReCOVery. Accuracy ratio increases monotonically with keyword proportion on both datasets.
  • RQ3 (sequences): POS tagging achieves ~94% accuracy ratio at 10–20% of full text. NER alone reaches 86.82% on ReCOVery (significantly below baseline) and 90.08% on Fake_And_Real_News.
  • RQ4 (metadata): Title alone reaches 82.25% (ReCOVery) and 85.21% (Fake_And_Real_News), a statistically significant drop of roughly 12 to 13 points from the full-text baseline. Author alone reaches 76.99% (ReCOVery).
  • RQ5 (multimodal): Keywords + title outperforms keywords alone on ReCOVery. Keywords + NER slightly reduces accuracy (~0.5%) on Fake_And_Real_News due to heterogeneous NER effects.
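The accuracy ratio used in these findings can be reproduced from the reported numbers. The sketch below assumes the ratio is limited-input accuracy divided by full-text accuracy, which is how "recovers ~99% of full-text accuracy" reads; the dictionary layout and function name are illustrative.

```python
# Accuracies (%) copied from the results reported above.
FULL_TEXT = {"ReCOVery": 95.55, "Fake_And_Real_News": 97.60}
LIMITED = {
    "SLIM_KEYWORD (25%)": {"ReCOVery": 92.86, "Fake_And_Real_News": 92.76},
    "Title alone": {"ReCOVery": 82.25, "Fake_And_Real_News": 85.21},
}

def accuracy_ratio(limited_acc, full_acc):
    """Fraction of full-text accuracy retained by a limited input (assumed definition)."""
    return limited_acc / full_acc

rows = []
for name, accs in LIMITED.items():
    for dataset, acc in accs.items():
        ratio = accuracy_ratio(acc, FULL_TEXT[dataset])
        drop = FULL_TEXT[dataset] - acc  # absolute drop in percentage points
        rows.append((name, dataset, ratio, drop))
        print(f"{name:20s} {dataset:20s} ratio={ratio:.3f} drop={drop:.2f} pts")
```

Under this definition, 25% keywords retain about 97% of full-text accuracy on ReCOVery, while title-only input retains about 86%.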

Connections

  • Uses ReCOVery as a primary benchmark; the dataset was introduced in Zhou et al. (2020).
  • SLIM_MULTIMODAL extends the multi-signal approach of SAFE (Zhou et al., 2020) from image+text fusion to keyword+metadata fusion.
  • The strategic feature-selection philosophy complements the hand-crafted feature tradition surveyed in feature engineering, but uses a language-model-guided MMR rather than manual selection.
  • The multimodal detection topic covers the broader context of combining heterogeneous information sources.
  • Zhou et al. (2023): HERO is concurrent work from the Syracuse group using hierarchical linguistic trees. SLIM takes the opposite direction, minimising rather than enriching the input representation.

Notes

The central empirical insight is that MMR-selected keywords amounting to 30% of the full text recover ~99% of full-text accuracy, which is practically significant for deployment in sparse-data or real-time contexts. However, this claim rests on two benchmarks (ReCOVery, Fake_And_Real_News), both English-language and relatively balanced. Generalisation to highly imbalanced, multilingual, or domain-shifted datasets remains open.

The metadata result is instructive: titles alone yield ~82–85% accuracy, suggesting that clickbait-style linguistic patterns are detectable from headlines. The author feature is weaker (76.99% on ReCOVery), suggesting author identity is either noisy or absent in these datasets.

Future work flagged by the authors: syntactic-semantic augmentation (controlled paraphrase generation, dependency shuffling) to further close the gap between limited-information and full-text performance.