Skip to content
Learning Hierarchical Discourse-level Structure for Fake News Detection

Learning Hierarchical Discourse-level Structure for Fake News Detection

Authors: Hamid Karimi, Jiliang Tang Venue: arXiv, 2019 — arXiv:1903.07389

TL;DR

Proposes HDSF, a framework that learns hierarchical discourse-level structure (dependency trees of sentences) for fake news detection without annotated corpora. Identifies three structure-related properties that distinguish fake from real news and shows HDSF achieves 82.19% accuracy, outperforming linguistic baselines. Real news documents exhibit significantly higher coherence in their discourse structures.

Contributions

  • First to automatically learn document structure for fake news detection
  • Proposes HDSF framework that constructs discourse-level dependency trees via inter-sentential parent-child probabilities
  • Identifies three structure-related properties: leaf node count, preorder difference, parent-child distance — all showing discriminative power between fake and real news

Method

HDSF comprises three components:

Hierarchical Discourse-level Structure Learning: Treats sentences as discourse units and uses dependency parsing principles to identify parent-child relationships. Uses Bi-directional LSTM (BLSTM) to encode each sentence into a fixed-size representation. An attention matrix captures inter-sentential parent-child probabilities: entry (m, n) represents the probability that sentence \(s_m\) is the parent of \(s_n\). A greedy algorithm constructs the discourse dependency tree by iteratively adding sentences with highest root probability, then searching for maximum parent-child probabilities among remaining nodes.

Structural Document-level Representation: For each sentence, computes a structurally-aware representation by aggregating sentence embeddings weighted by parent and child probabilities from the discourse tree. Document representation is the average of these structural representations.

Fake News Classification: Binary classifier on top of the structural document representation.

Results

Evaluated on five datasets (BuzzFeed, PolitiFact, Kaggle, Meditre) with 3360 fake and 3360 real documents:

  • HDSF: 82.19% accuracy (single model)
  • Baselines: N-grams 69.37%, LIWC 70.26%, RST 71.19%, BiGRNN-CNN 77.06%, LSTM[w+s] 80.54%, LSTM[s] 73.63%

Structural analysis reveals statistically significant differences in all three properties: - Property 1 (Leaf Nodes): Fake news have more isolated sentences (mean 18.1 vs 12.9) — lower inter-linkage indicates less coherent discourse structure - Property 2 (Preorder Difference): Real news maintain closer alignment to original sentence order (mean 152.5 vs 338.3) — fake news exhibit more fragmented structures - Property 3 (Parent-Child Distance): Real news show children closer to parents positionally (mean 152.3 vs 324.2)

Connections

Notes

Strong contribution identifying that fake news exhibit degraded coherence at discourse level. The unsupervised structure learning via dependency parsing is novel for this domain and avoids the corpus annotation bottleneck. Property 3 (parent-child distance) is particularly interpretable: fake stories lack the connective tissue that binds coherent real narratives.

Limitations: evaluation limited to primarily English news datasets; unclear generalization to other languages or domains (e.g., scientific misinformation). Future work mentions investigating word-level structure, but the discourse-level structure as defined here is the paper's main contribution.