Attention mechanisms in NLP
Attention mechanisms are learned components that compute context-dependent importance weights over the elements of an input sequence, letting a model focus on the elements most relevant to the task. In NLP, attention is typically applied over sequences of words or sentences, so the model learns which ones matter without manual feature engineering.
Foundational papers
- Attention Is All You Need (Vaswani et al., 2017): introduces the Transformer, built on scaled dot-product and multi-head attention; the canonical reference for attention-based sequence modeling.
Key mechanisms
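- Self-attention / scaled dot-product attention: Scores each query against every key with a dot product, divides the scores by the square root of the key dimension so that the softmax does not saturate and gradients stay usable, then uses the resulting weights to average the value vectors. In self-attention, the queries, keys, and values are all derived from the same sequence. Foundation of Transformer models.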
- Additive attention (Bahdanau): Scores the query against each input element with a small learned feed-forward network (one tanh hidden layer) instead of a dot product.
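- Multi-head attention: Runs several attention heads in parallel, each on its own learned projection of the input, then concatenates their outputs. Different heads can specialize in different aspects of relevance (for example syntactic, semantic, or pragmatic), although this specialization is learned rather than guaranteed.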
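The three mechanisms above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the weight-matrix arguments (`W_q`, `W_k`, `W_v`, `W_o`, `v`), and the shape conventions are assumptions made for the example, and masking, dropout, and biases are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

def additive_attention(q, K, V, W_q, W_k, v):
    """Bahdanau-style scoring: score_i = v . tanh(q W_q + k_i W_k).

    q: (d_q,), K: (n_k, d_k), V: (n_k, d_v),
    W_q: (d_q, d_h), W_k: (d_k, d_h), v: (d_h,)  (illustrative parameters)
    """
    hidden = np.tanh(q @ W_q + K @ W_k)  # (n_k, d_h)
    weights = softmax(hidden @ v)        # (n_k,)
    return weights @ V, weights

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); all weight matrices: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights
```

Calling `scaled_dot_product_attention(X, X, X)` on a single matrix `X` gives plain self-attention without learned projections; the multi-head version adds the per-head projections and the output projection `W_o`.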
Limitations and biases
- [[2023-liu-lost-in-middle]] shows that Transformer language models exhibit a strong positional bias in long contexts: performance follows a U-shaped curve, best when the relevant information sits at the beginning or end of the context and worst when it sits in the middle. This suggests that current attention-based models do not use extended contexts uniformly.
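One way to check for this kind of positional bias is to move the passage containing the answer through every slot of the context and measure accuracy per slot. The sketch below is a hypothetical harness, not the paper's evaluation code: the prompt template, the substring-match scoring, and the `answer_fn` callable (standing in for whatever model is under test) are all assumptions.

```python
from collections import defaultdict

def build_position_probes(gold_doc, distractors, question):
    """Return one prompt per possible position of the gold document."""
    probes = []
    for pos in range(len(distractors) + 1):
        docs = distractors[:pos] + [gold_doc] + distractors[pos:]
        context = "\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))
        prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
        probes.append({"gold_position": pos, "prompt": prompt})
    return probes

def accuracy_by_position(examples, answer_fn):
    """Accuracy per gold-document position.

    examples: dicts with keys "gold_doc", "distractors", "question", "answer".
    answer_fn: hypothetical callable mapping a prompt string to the model's output string.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        for probe in build_position_probes(ex["gold_doc"], ex["distractors"], ex["question"]):
            pos = probe["gold_position"]
            totals[pos] += 1
            # Crude scoring: the expected answer appears somewhere in the output.
            hits[pos] += int(ex["answer"].lower() in answer_fn(probe["prompt"]).lower())
    return {pos: hits[pos] / totals[pos] for pos in sorted(totals)}
```

A U-shaped result would show up as high accuracy at the first and last positions and lower accuracy at the positions in between.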
Key papers in fake news detection
- Singhania et al. (2017), 3HAN: Applies attention at three hierarchical levels (words, sentences, headline) to fake news detection. Word-level attention learns which words in each sentence are relevant to veracity assessment; sentence-level attention identifies informative sentences in the article body; headline-body attention captures the stance between headline and article. The paper reports 96.77% accuracy.
- Vo & Lee (2021), MAC (Multi-head Attentive Network): Applies multi-head attention at word and document levels for evidence-aware fact-checking. Word-level attention identifies important phrases in claims and evidence articles; document-level attention weights evidence sources by relevance. The two levels are optimized jointly; the model achieves 88.7% AUC on Snopes.
- Shu et al. (2019), dEFEND (Explainable Fake News Detection): Hierarchical attention over news sentences and user comments, with sentence-comment co-attention; demonstrates that modeling sentence-comment interactions improves both detection and explainability.
- Zhou et al. (2020), SAFE (Similarity-Aware Multi-Modal Fake News Detection): Uses attention implicitly via cross-modal similarity weighting between text and image representations.
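The hierarchical models above share one building block: attention pooling, which collapses a sequence of vectors into a single vector using learned weights, applied first over words and then over sentences. The following is a rough NumPy sketch of that two-level pattern, not the code of 3HAN, MAC, or dEFEND. The parameter names (`W`, `b`, `u`) are illustrative, and the recurrent encoders that these models run before each attention layer are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, W, b, u):
    """Pool a sequence of vectors into one vector with learned attention.

    H: (n, d) sequence; W: (d, d_a), b: (d_a,), u: (d_a,) are learned
    parameters (random in this sketch). Returns the pooled vector (d,)
    and the attention weights (n,).
    """
    scores = np.tanh(H @ W + b) @ u   # one relevance score per element
    weights = softmax(scores)
    return weights @ H, weights

def encode_document(sentences, word_params, sent_params):
    """sentences: list of (n_words_i, d) arrays of word vectors."""
    sent_vecs, word_weights = [], []
    for words in sentences:
        vec, w = attention_pool(words, *word_params)   # word-level attention
        sent_vecs.append(vec)
        word_weights.append(w)
    # Sentence-level attention over the pooled sentence vectors.
    doc_vec, sent_weights = attention_pool(np.stack(sent_vecs), *sent_params)
    return doc_vec, sent_weights, word_weights
```

The returned weights are what make these models inspectable: `word_weights` indicates which words dominated each sentence vector, and `sent_weights` indicates which sentences dominated the document vector. A third level, such as combining a headline with the body vector, could reuse `attention_pool` in the same way.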
Related topics
- Hierarchical attention mechanisms — multi-level attention over document structure
- Neural networks — deep learning architectures using attention
- Content-based fake news detection — methods that learn from article content