Skip to content
Hierarchical Multi-head Attentive Network for Evidence-aware Fake News Detection

Hierarchical Multi-head Attentive Network for Evidence-aware Fake News Detection

Authors: Nguyen Vo, Kyumin Lee Venue: arXiv, 2021 — arxiv:2102.02680

TL;DR

The paper proposes MAC, a hierarchical multi-head attention network that jointly learns word-level and document-level attention for fact-checking textual claims against evidence from external sources. Multi-head attention mechanisms capture different semantic aspects of claims and evidence documents, enabling the model to focus on important words within documents and important documents among retrieved evidence. Evaluated on Snopes and PolitiFact datasets, MAC achieves 9.47% and 6.78% improvements respectively over state-of-the-art baselines, demonstrating the importance of combining word and evidence-level attention.

Contributions

  • Hierarchical multi-head attention for evidence-aware fact-checking. Proposes MAC which jointly combines word-level and document-level attention mechanisms to fact-check textual claims using external evidence.
  • Multi-head word attention mechanism. Uses multiple attention heads to capture different semantic contributions of words within claims and evidence documents.
  • Evidence-aware document attention. Introduces multi-head document-level attention to identify important evidence among multiple retrieved articles for verifying a claim's credibility.
  • Comprehensive evaluation. Demonstrates effectiveness on two public datasets (Snopes and PolitiFact) with significant improvements and includes ablation studies showing the importance of both attention layers.

Method

The Hierarchical Multi-head Attentive Network (MAC) consists of four main components: (1) an embedding layer, (2) a multi-head word attention layer, (3) a multi-head document attention layer, and (4) an output layer.

Embedding and word-level attention. Claims are modeled as sequences of words. Both claims and evidence documents are embedded using BiLSTM (bidirectional LSTM) to capture contextual information. A multi-head word attention mechanism extends the standard attention mechanism to use \(h_1\) different attention heads, each learning to focus on different semantic aspects. The model creates multiple attention distributions over the claim's words, with each head capturing different important words or phrases. Attention weights are computed using the standard attention formula and normalized by the softmax operation.

Document-level attention and evidence representation. To understand which evidence documents are most relevant for fact-checking, the paper proposes a multi-head document attention mechanism. Each retrieved document's importance is computed using \(h_2\) attention heads, allowing the model to capture diverse semantic contributions of documents toward determining claim veracity. Documents are further enriched by concatenating publisher embeddings (as publisher credibility varies) to incorporate source information. The model learns to downgrade less informative documents while emphasizing crucial ones.

Output layer. The final representation of a tuple (claim, supporting documents, publishers) is fed through an MLP to compute the probability that the claim is true or false. The model is trained with binary cross-entropy loss to minimize classification error on the fact-checking task.

Results

Snopes dataset: MAC achieves AUC of 0.88715, representing a 9.47% improvement over the best baseline. When considering true news as the positive class, it attains F1 Macro of 0.78660 and F1 Micro of 0.88706. Improvements are statistically significant (p-value < 0.05 by one-sided paired Wilcoxon test).

PolitiFact dataset: MAC achieves AUC of 0.75756, a 6.78% improvement over baselines, with F1 Macro of 0.68642 and improvements of 3.47% in precision for true news.

Ablation studies show: (1) Using only word attention or only document attention performs worse than the combined approach, demonstrating the importance of both mechanisms. (2) The model is sensitive to the number of attention heads, with optimal performance at \(h_1 = 5, h_2 = 2\) on Snopes and \(h_1 = 3, h_2 = 1\) on PolitiFact. (3) Speaker and publisher information provides additional benefits, with Text+Speakers achieving 2~3% improvements over text-only models.

Connections

Notes

The paper makes a solid contribution by recognizing that both word-level and document-level attention are complementary for evidence-aware fact-checking. The hierarchical approach with multiple heads at each level is well-motivated—different words carry different importance for assessing credibility, and different evidence documents have varying reliability. The ablation studies are thorough and clearly demonstrate the necessity of both components.

One limitation is the relatively modest improvements on PolitiFact compared to Snopes (6.78% vs 9.47%), suggesting the approach may be more effective on certain datasets. The paper could benefit from analysis of failure cases or discussion of when hierarchical attention may not be sufficient. Additionally, the incorporation of publisher information is somewhat ad hoc—a more principled approach to source credibility might further improve results.

The work is among the first to systematically combine multi-head attention mechanisms at both word and document levels for fact-checking, making it a valuable contribution to evidence-aware misinformation detection research.