Attention mechanisms in NLP

Learned mechanisms that compute context-dependent importance weights over input sequences, allowing models to focus on task-relevant elements. In NLP, attention is typically applied to word sequences or sentence sequences to learn differential relevance without manual feature engineering.

Foundational papers

  • Attention Is All You Need (Vaswani et al., 2017) — introduces scaled dot-product attention and multi-head attention; the canonical reference for attention-based sequence modeling.

Key mechanisms

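  • Self-attention / Scaled dot-product attention: Scores each query against every key with a dot product, divides the scores by sqrt(d_k) (d_k is the key dimension) so the softmax does not saturate and gradients stay usable, then uses the softmax weights to average the value vectors. In self-attention, the queries, keys, and values are all derived from the same sequence. Foundation of Transformer models.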
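  • Additive attention (Bahdanau): Uses a small learned feed-forward network (one tanh hidden layer) to score the query against each input element; the scores are normalized with a softmax to give the attention weights.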
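  • Multi-head attention: Runs several attention heads in parallel, each with its own learned projections of the queries, keys, and values, then concatenates and projects their outputs. Different heads can specialize in different kinds of relevance (for example syntactic or semantic relations), although this specialization is learned rather than guaranteed.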
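
The three mechanisms above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it uses lists instead of tensors, the function names and the toy `tokens` input are made up for this note, the weights passed to `additive_score` are placeholders for learned parameters, and the multi-head variant slices each vector into chunks rather than applying the learned per-head and output projections a real Transformer uses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights); each argument is a list of vectors."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


def additive_score(query, key, W_q, W_k, v):
    """Bahdanau-style score v . tanh(W_q q + W_k k).

    W_q, W_k (lists of rows) and v stand in for learned parameters.
    """
    hidden = [math.tanh(dot(rq, query) + dot(rk, key)) for rq, rk in zip(W_q, W_k)]
    return dot(v, hidden)


def multi_head_self_attention(tokens, num_heads):
    """Toy multi-head self-attention (assumption: no learned projections).

    Each vector is sliced into num_heads chunks, attention runs per chunk,
    and the per-head outputs are concatenated back together.
    """
    d_model = len(tokens[0])
    if d_model % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    size = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        chunk = [t[h * size:(h + 1) * size] for t in tokens]
        out, _ = scaled_dot_product_attention(chunk, chunk, chunk)
        head_outputs.append(out)
    return [sum((head[i] for head in head_outputs), []) for i in range(len(tokens))]


# Self-attention: queries, keys, and values all come from the same sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
multi_head = multi_head_self_attention(tokens, num_heads=2)
```

Each row of `weights` sums to 1 and says how much one token draws on every other token. Swapping the dot product for `additive_score` changes only how the raw scores are produced; the softmax and weighted average stay the same.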

Limitations and biases

  • [[2023-liu-lost-in-middle]] shows that language models use long contexts unevenly: accuracy follows a U-shaped curve over the position of the relevant information (best when it is at the beginning or end of the context, worst when it is in the middle). This suggests that a long context window does not by itself mean an attention-based model uses every part of it equally well.
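
One way to check for this kind of positional effect is to move the passage that contains the answer through every slot of the context and score the model at each slot. The sketch below is an assumption-laden illustration of that idea, not the paper's evaluation code: `build_context`, `accuracy_by_position`, and `answer_fn` are hypothetical names, the substring match is a simplified scoring rule, and `first_document_only` is a stand-in used only to exercise the code, not a real model.

```python
def build_context(gold_doc, distractors, gold_position):
    """Place gold_doc at gold_position (0 = first) among the distractors."""
    if not 0 <= gold_position <= len(distractors):
        raise ValueError("gold_position out of range")
    docs = distractors[:gold_position] + [gold_doc] + distractors[gold_position:]
    return "\n\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))


def accuracy_by_position(examples, answer_fn):
    """Score answer_fn with the gold document placed at each position.

    examples: list of (question, gold_doc, distractors, answer) tuples,
    all with the same number of distractors.
    answer_fn(question, context) -> str is a hypothetical model call.
    """
    n_positions = len(examples[0][2]) + 1
    results = {}
    for pos in range(n_positions):
        correct = 0
        for question, gold_doc, distractors, answer in examples:
            context = build_context(gold_doc, distractors, pos)
            prediction = answer_fn(question, context)
            # Simplified scoring: the answer string appears in the prediction.
            correct += int(answer.lower() in prediction.lower())
        results[pos] = correct / len(examples)
    return results


def first_document_only(question, context):
    """Stand-in 'model' that only ever reads the first document."""
    return context.split("\n\n")[0]


examples = [
    (
        "Who wrote the report?",
        "The report was written by Ada.",
        ["Filler text one.", "Filler text two."],
        "Ada",
    )
]
results = accuracy_by_position(examples, first_document_only)
```

Plotting `results` against position for a real model is what would expose a U-shape; the stand-in here simply succeeds only when the gold document comes first.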

Key papers in fake news detection

  • Singhania et al. (2017) — 3HAN: Applies attention at three hierarchical levels (words, sentences, headline) to fake news detection. Word-level attention learns which words in each sentence matter for judging veracity; sentence-level attention identifies the informative sentences in the article body; headline-body attention captures the stance between headline and article. Achieves 96.77% accuracy.

  • Vo & Lee (2021) — MAC (Multi-head Attentive Network): Applies multi-head attention at word and document levels for evidence-aware fact-checking. Word-level attention identifies important phrases in claims and evidence articles; document-level attention weights evidence sources by relevance. Jointly optimized; achieves 88.7% AUC on Snopes.

  • Shu et al. (2019) — dEFEND: Explainable Fake News Detection: Hierarchical attention on news sentences + user comments with sentence-comment co-attention; demonstrates that modeling sentence-comment interactions improves both detection and explainability.

  • Zhou et al. (2020) — SAFE: Similarity-Aware Multi-Modal Fake News Detection: Uses attention only implicitly: the similarity between the text and image representations acts as a cross-modal relevance weight.
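
The hierarchical models in this list share one building block: attention pooling, applied first over the words of each sentence and then over the resulting sentence vectors. The sketch below shows that two-level pattern under simplifying assumptions. The names `attention_pool` and `hierarchical_encode` are made up for this note, the context vectors stand in for learned parameters, relevance is a plain dot product, and the encoders and projection layers that the published models place around each attention step are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_pool(vectors, context):
    """Collapse a list of vectors into one, weighted by similarity to context."""
    weights = softmax([dot(v, context) for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]
    return pooled, weights


def hierarchical_encode(document, word_context, sentence_context):
    """document: list of sentences, each a list of word vectors.

    Returns (document_vector, word_weights_per_sentence, sentence_weights).
    """
    sentence_vectors, word_weights = [], []
    for sentence in document:
        # Level 1: which words matter within this sentence.
        vec, w = attention_pool(sentence, word_context)
        sentence_vectors.append(vec)
        word_weights.append(w)
    # Level 2: which sentences matter within the document.
    doc_vector, sentence_weights = attention_pool(sentence_vectors, sentence_context)
    return doc_vector, word_weights, sentence_weights


# Toy input: two sentences with 2-dimensional word vectors.
document = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5]],
]
doc_vector, word_weights, sentence_weights = hierarchical_encode(
    document, word_context=[2.0, 0.0], sentence_context=[1.0, 1.0]
)
```

The returned weights are what these papers inspect for explanations: `word_weights` gives one distribution per sentence and `sentence_weights` one distribution over the document. A third level, such as headline versus body, would repeat the same pooling step once more.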