Hierarchical attention mechanisms

Hierarchical attention networks encode documents (news articles, long texts) at multiple levels — word-level and sentence-level (or comment-level) — learning importance weights at each level. This allows the model to identify both informative words and salient sentences, producing interpretable document representations.

Structure:

  1. Word Encoder: A bidirectional RNN (LSTM or GRU) encodes the word sequence within each sentence. An attention mechanism computes a weight for each word, and the sentence representation is the weighted sum of the word-level hidden states.
  2. Sentence/Document Encoder: A second bidirectional RNN encodes the sequence of sentence representations. Attention weights over sentences identify which sentences are most important, and the document representation is the weighted sum of the sentence-level hidden states.
  3. Co-attention variants: In multi-input settings (e.g., news + comments), sentence-comment co-attention jointly computes importance across both inputs, capturing dependencies between news sentences and user feedback.
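
The first two levels above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the RNN hidden states are already computed, scores each state by a plain dot product with a context vector (published models usually apply a learned projection and tanh first), and skips the sentence-level RNN. The names `attention_pool`, `encode_document`, `word_context`, and `sent_context` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(states, context):
    """Attention-weighted sum of hidden states.

    states:  list of equal-length vectors (one per word or per sentence)
    context: query vector of the same length (learned in a real model)
    Returns (pooled_vector, attention_weights).
    """
    scores = [sum(h_k * c_k for h_k, c_k in zip(h, context)) for h in states]
    weights = softmax(scores)
    dim = len(states[0])
    pooled = [sum(w * h[k] for w, h in zip(weights, states)) for k in range(dim)]
    return pooled, weights

def encode_document(doc, word_context, sent_context):
    """doc: list of sentences, each a list of word hidden-state vectors."""
    sent_vecs, word_weights = [], []
    for sentence in doc:
        vec, weights = attention_pool(sentence, word_context)
        sent_vecs.append(vec)
        word_weights.append(weights)
    # Assumption: a sentence-level RNN would normally run over sent_vecs here.
    doc_vec, sent_weights = attention_pool(sent_vecs, sent_context)
    return doc_vec, word_weights, sent_weights

# Toy input: two sentences with 2-dimensional "hidden states" (made-up values).
doc = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 0.0], [1.0, 1.0]],
]
doc_vec, word_w, sent_w = encode_document(doc, [1.0, 0.0], [1.0, 0.0])
```

Both levels reuse the same pooling function with different context vectors, which is why the architecture extends naturally to a third level such as headline versus body.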

Advantages:

  • Interpretability: Attention weights surface which words and sentences drive the model's decision.
  • Multi-level representation: Captures linguistic structure at different granularities without manual feature engineering.
  • Scalability: RNN encoders handle documents of varying length, and splitting a document into sentences keeps each sequence the RNN must process short.
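
As a small illustration of the interpretability point, attention weights can be read off directly to rank sentences (or words) for a human reviewer. The helper name `top_k`, the example sentences, and the weights below are made up for illustration.

```python
def top_k(items, weights, k=2):
    """Return the k (item, weight) pairs with the largest weights, highest first."""
    ranked = sorted(zip(items, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

sentences = [
    "Officials confirmed the report.",
    "SHOCKING truth they hide!",
    "More details follow.",
]
sentence_weights = [0.2, 0.7, 0.1]  # hypothetical sentence-level attention weights

highlights = top_k(sentences, sentence_weights, k=2)
for text, weight in highlights:
    print(f"{weight:.2f}  {text}")
```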

Applications in fake news detection:

  • Content-only: Hierarchical attention on news articles identifies check-worthy sentences and language patterns.
  • Multi-input (content + comments): Co-attention reveals which sentences are questioned or verified by readers.
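
A rough sketch of sentence-comment co-attention follows. It is a simplified, parameter-free variant chosen for readability: the affinity between a sentence and a comment is a plain dot product, and each side is weighted by its best match on the other side. Published models typically use learned projections instead, so this should be read as an illustration of the idea rather than any specific paper's formulation. The function name `co_attention` and the vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def co_attention(sentences, comments):
    """Return (sentence_weights, comment_weights) from a shared affinity matrix.

    affinity[i][j] measures how strongly sentence i relates to comment j.
    A sentence is weighted by its best-matching comment, and vice versa.
    """
    affinity = [[dot(s, c) for c in comments] for s in sentences]
    sent_scores = [max(row) for row in affinity]
    comm_scores = [
        max(affinity[i][j] for i in range(len(sentences)))
        for j in range(len(comments))
    ]
    return softmax(sent_scores), softmax(comm_scores)

# Made-up 2-dimensional vectors for three sentences and two comments.
sentence_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
comment_vecs = [[0.0, 2.0], [0.0, 1.0]]
sent_w, comm_w = co_attention(sentence_vecs, comment_vecs)
```

In this toy input the second sentence receives the highest weight because it aligns most strongly with the comments, which is the behavior co-attention is meant to expose.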

Key papers

  • Singhania et al. (2023) — 3HAN: A Deep Neural Network for Fake News Detection: Three-level hierarchical attention network with word, sentence, and headline-body levels. Treats headlines as a distinctive feature of fake news; headline-body attention captures the stance of a headline relative to the article body. Uses supervised pre-training on headlines for better initialization. Achieves 96.77% accuracy on a balanced dataset of ~41K articles; attention visualizations highlight key words and sentences for human fact-checkers.
  • Shu et al. (2019) — dEFEND: Hierarchical attention + sentence-comment co-attention; co-attention component is critical to performance (18% F1 drop on GossipCop when removed).
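  • Wang et al. (2018) — EANN: Event Adversarial Neural Networks for multimodal (text + image) fake news detection. Textual and visual features are concatenated and trained against an event discriminator so that the learned representation generalizes to unseen events. EANN itself is adversarial rather than attention-based.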
  • Vo & Lee (2021) — Hierarchical Multi-head Attentive Network for Evidence-aware Fake News Detection: Proposes MAC, which extends hierarchical attention to evidence-aware fact-checking with two levels: (1) multi-head word attention identifies important phrases in claims and evidence articles, and (2) multi-head document attention weights evidence sources by their relevance to assessing claim credibility. Demonstrates that multi-head mechanisms at both levels improve upon single-head or single-level approaches.
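
To make the multi-head idea concrete, here is a generic sketch of multi-head attention pooling: each head has its own context vector, produces its own weight distribution over the same inputs, and the pooled vectors are concatenated. This is not the architecture of MAC or any other paper listed above; the name `multi_head_pool`, the dot-product scoring, and the values are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def multi_head_pool(states, head_contexts):
    """Pool `states` once per head and concatenate the results.

    states:        list of equal-length vectors (words or documents)
    head_contexts: one context vector per head (learned in a real model)
    Returns (concatenated_vector, list_of_weight_lists).
    """
    dim = len(states[0])
    pooled, all_weights = [], []
    for context in head_contexts:
        weights = softmax([dot(h, context) for h in states])
        vec = [sum(w * h[k] for w, h in zip(weights, states)) for k in range(dim)]
        pooled.extend(vec)          # concatenate this head's output
        all_weights.append(weights)
    return pooled, all_weights

# Two inputs, two heads that each prefer a different input (made-up values).
states = [[1.0, 0.0], [0.0, 1.0]]
pooled, head_weights = multi_head_pool(states, [[1.0, 0.0], [0.0, 1.0]])
```

Because each head can focus on a different input, the concatenated vector keeps information that a single attention distribution would have to average away.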

Connections

  • Explainable detection relies heavily on hierarchical attention to surface which text spans explain the fake-news prediction.
  • Neural methods for fake news detection frequently use hierarchical attention as a core building block.
  • Multimodal detection extends hierarchical attention to align attention across text and images.