User Preference-aware Fake News Detection

Authors: Yingtong Dou, Kai Shu, Congying Xia, Philip S. Yu, Lichao Sun
Venue: SIGIR 2021 (arXiv:2104.12259)

TL;DR

Proposes UPFD, which detects fake news by jointly modeling users' endogenous preferences (inferred from their historical posts) and exogenous context (news propagation graphs) with GNNs. Argues that confirmation bias drives news-sharing decisions, and shows that historical user behavior provides complementary signals that improve detection over content-only and graph-only baselines.

Contributions

  • Identifies that users' endogenous preferences, implicit in their historical social media posts, help predict news-sharing behavior and improve fake news detection
  • Proposes UPFD framework integrating user preference encoding, propagation graph construction, and hierarchical information fusion via GNNs
  • Releases code and augmented FakeNewsNet benchmark with crawled user historical tweets
  • Demonstrates roughly 1% improvement over the GCNFN graph-based baseline on the Politifact and Gossipcop datasets

Method

Endogenous Preference Encoding: For each user who engaged with a news piece, crawls ~200 of their most recent tweets. Encodes the tweets and the news content using pretrained word2vec (spaCy, 680k vectors) or BERT, then averages the embeddings to obtain the user preference and news textual representations.
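The averaging step can be illustrated with a minimal sketch. It assumes a toy three-dimensional embedding table (`TOY_VECTORS`, hypothetical values standing in for pretrained word2vec or BERT vectors) and assumes that embeddings are averaged per tweet first and then across tweets; the summary only says that embeddings are averaged, so the exact order is an assumption. The helper names are illustrative, not taken from the released code.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical toy embeddings; a real pipeline would load pretrained vectors.
TOY_VECTORS: Dict[str, Vector] = {
    "vaccine": [1.0, 0.0, 0.0],
    "hoax": [0.0, 1.0, 0.0],
    "election": [0.0, 0.0, 1.0],
}
DIM = 3


def mean_vector(vectors: List[Vector], dim: int = DIM) -> Vector:
    """Element-wise mean; returns a zero vector when the input is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def encode_text(text: str) -> Vector:
    """Average the vectors of the known tokens in one text."""
    tokens = text.lower().split()
    return mean_vector([TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS])


def encode_user(tweets: List[str], max_tweets: int = 200) -> Vector:
    """Average per-tweet embeddings over at most max_tweets tweets.

    Assumes the list is ordered most recent first.
    """
    return mean_vector([encode_text(t) for t in tweets[:max_tweets]])


user_pref = encode_user(["vaccine hoax", "election"])
news_repr = encode_text("election hoax")
```

Here `user_pref` plays the role of the user preference representation and `news_repr` the news textual representation; both live in the same vector space, which is what lets them serve as node features in one graph.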

Exogenous Context Extraction: Builds a tree-structured news propagation graph where the root is the news piece and edges represent retweet cascades on Twitter. Uses timestamp and follower count heuristics to infer edges when direct reply links are unavailable.
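One way such a tree could be assembled is sketched below. The specific rule is an assumption made for illustration: each retweeter is attached to the latest earlier retweeter they follow, falling back to the earlier retweeter with the most followers, and to the news root if nobody retweeted before them. The `Retweet` record and its `follows` field are hypothetical; the summary states only that timestamp and follower-count heuristics are used.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

ROOT = "news"  # hypothetical id for the root node (the news piece)


@dataclass
class Retweet:
    user: str
    timestamp: int                      # e.g. seconds since epoch
    followers: int                      # follower count of the user
    follows: Set[str] = field(default_factory=set)  # accounts the user follows


def build_tree(retweets: List[Retweet]) -> List[Tuple[str, str]]:
    """Return (parent, child) edges of a propagation tree rooted at ROOT."""
    edges: List[Tuple[str, str]] = []
    seen: List[Retweet] = []            # earlier retweeters, in time order
    for rt in sorted(retweets, key=lambda r: r.timestamp):
        followed = [p for p in seen if p.user in rt.follows]
        if followed:
            # Timestamp heuristic: latest earlier retweeter the user follows.
            parent = followed[-1].user
        elif seen:
            # Follower-count heuristic: most-followed earlier retweeter.
            parent = max(seen, key=lambda p: p.followers).user
        else:
            parent = ROOT
        edges.append((parent, rt.user))
        seen.append(rt)
    return edges


edges = build_tree([
    Retweet("b", timestamp=2, followers=10, follows={"a"}),
    Retweet("a", timestamp=1, followers=100),
    Retweet("c", timestamp=3, followers=5),
])
```

In this toy cascade, `a` attaches to the root, `b` attaches to `a` because `b` follows `a`, and `c` follows no one in the cascade and so falls back to the most-followed earlier retweeter.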

Information Fusion: Passes the news and user embeddings as node features into a GNN (GraphSAGE or GCN). The GNN aggregates neighbor information across the propagation graph via message passing, then applies a readout function (mean pooling) over all nodes to obtain the user engagement embedding. The user engagement and news textual embeddings are concatenated and fed to an MLP classifier for fake/real prediction.
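The fusion pipeline can be traced end to end with a small, untrained sketch. It is a simplification under stated assumptions: a single round of mean aggregation with no learned weights stands in for the GraphSAGE or GCN layers, and a single logistic unit with hypothetical weights stands in for the MLP. All function names are illustrative.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def message_pass(features: Dict[str, Vector],
                 edges: List[Tuple[str, str]]) -> Dict[str, Vector]:
    """One untrained round: each node becomes the mean of itself and its neighbors."""
    neighbors = {node: [node] for node in features}
    for parent, child in edges:
        neighbors[parent].append(child)
        neighbors[child].append(parent)
    return {node: mean([features[m] for m in nbrs])
            for node, nbrs in neighbors.items()}


def fuse(features: Dict[str, Vector], edges: List[Tuple[str, str]],
         root: str = "news") -> Vector:
    """Mean-pool all node states, then concatenate the news embedding."""
    hidden = message_pass(features, edges)
    engagement = mean(list(hidden.values()))  # readout over all nodes
    return engagement + features[root]        # list concatenation


def predict_fake(fused: Vector, weights: Vector, bias: float = 0.0) -> float:
    """Logistic unit standing in for the MLP classifier (weights are hypothetical)."""
    score = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-score))


# Root carries the news embedding; other nodes carry user preference embeddings.
features = {"news": [1.0, 0.0], "a": [0.0, 1.0], "b": [0.0, 1.0]}
fused = fuse(features, [("news", "a"), ("a", "b")])
prob = predict_fake(fused, weights=[0.0, 0.0, 0.0, 0.0])
```

The fused vector has twice the node feature dimension: the first half summarizes user engagement across the graph, and the second half is the raw news embedding. A trained model would learn the aggregation and classifier weights instead of using fixed values.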

Results

On FakeNewsNet (Politifact and Gossipcop):

  • UPFD (BERT + GraphSAGE): 84.62% accuracy, 84.65% F1 on Politifact; 97.23% accuracy, 97.22% F1 on Gossipcop
  • Outperforms GCNFN baseline ~1% on both datasets (statistically significant)
  • BERT encoder consistently outperforms word2vec; exogenous information (graph) matters more on Politifact, endogenous (user preference) dominates on Gossipcop
  • Ablation: removing either component degrades performance; jointly modeling both is optimal

Connections

Notes

The key insight—confirmation bias drives sharing—is sociologically grounded but the implementation via historical-post embeddings is implicit rather than explicit. The 1% improvement over GCNFN is modest, though the ablation study clearly separates endogenous contributions from exogenous ones. A notable limitation: user tweets are crawled only for accessible accounts; deleted/suspended accounts are replaced with random accessible-user tweets, which may introduce noise. The method assumes that historical tweets encode genuine preference rather than noise or performative signaling. Future work on fine-tuned BERT embeddings and transfer to other platforms (Facebook, Reddit) is suggested.