Attention mechanisms in NLP¶

Learned mechanisms that compute context-dependent importance weights over input sequences, allowing models to focus on task-relevant elements. In NLP, attention is typically applied to word sequences or sentence sequences to learn differential relevance without manual feature engineering.

Foundational papers¶

Vaswani et al. (2017), Attention Is All You Need: introduces the Transformer, built on scaled dot-product and multi-head attention; the canonical reference for attention-based sequence modeling.

Key mechanisms¶

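Self-attention / Scaled dot-product attention: Scores each query against every key with a dot product, divides the scores by the square root of the key dimension so that large dot products do not saturate the softmax and shrink gradients, then uses the softmax weights to average the value vectors. In self-attention, the queries, keys, and values are all derived from the same sequence. Foundation of Transformer models.
Additive attention (Bahdanau): Uses a small learned feed-forward network, rather than a dot product, to compute a relevance score between the query and each input element. Introduced for neural machine translation.
Multi-head attention: Runs several attention heads in parallel, each on its own learned projection of the queries, keys, and values, then concatenates the head outputs. Different heads can specialize in different aspects of relevance (for example syntactic or semantic relations).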
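The three mechanisms above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a reference implementation: vectors are plain lists, all function names are hypothetical, and the multi-head variant slices the feature dimension into heads instead of applying the learned per-head projections a real Transformer uses.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of query, key, and value vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(wi * v[j] for wi, v in zip(w, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

def additive_score(q, k, w_q, w_k, v):
    """Bahdanau-style score v . tanh(W_q q + W_k k); matrices are lists of rows."""
    hidden = [math.tanh(dot(row_q, q) + dot(row_k, k))
              for row_q, row_k in zip(w_q, w_k)]
    return dot(v, hidden)

def multi_head_self_attention(x, num_heads):
    """Toy multi-head self-attention (assumption: heads are feature slices,
    standing in for learned projections)."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        xs = [row[lo:hi] for row in x]
        out, _ = scaled_dot_product_attention(xs, xs, xs)
        head_outputs.append(out)
    # Concatenate the per-head outputs for each position.
    return [sum((head[i] for head in head_outputs), []) for i in range(len(x))]

# Three made-up 4-dimensional token vectors.
tokens = [[1.0, 0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0, 1.0],
          [1.0, 1.0, 0.0, 0.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
multi_head = multi_head_self_attention(tokens, num_heads=2)
```

Each row of `weights` is a probability distribution over the input positions, which is what makes attention weights readable as importance scores.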

Limitations and biases¶

[[2023-liu-lost-in-middle]] shows that language models use long contexts unevenly: accuracy follows a U-shaped curve over the position of the relevant information (best when it appears at the beginning or end of the input, worst in the middle). This suggests that a long context window does not guarantee that attention-based models make equal use of everything in it.

Key papers in fake news detection¶

Singhania et al. (2023) — 3HAN: Applies attention at three hierarchical levels (words, sentences, headlines) to fake news detection. Word-level attention learns which words are relevant to veracity assessment within each sentence; sentence-level attention identifies informative sentences in article bodies; headline-body attention captures stance between headline and article. Achieves 96.77% accuracy.
Vo & Lee (2021) — MAC (Multi-head Attentive Network): Applies multi-head attention at word and document levels for evidence-aware fact-checking. Word-level attention identifies important phrases in claims and evidence articles; document-level attention weights evidence sources by relevance. Jointly optimized; achieves 88.7% AUC on Snopes.
Shu et al. (2019) — dEFEND: Explainable Fake News Detection: Hierarchical attention on news sentences + user comments with sentence-comment co-attention; demonstrates that modeling sentence-comment interactions improves both detection and explainability.
Zhou et al. (2020) — SAFE: Similarity-Aware Multi-Modal Fake News Detection: Uses attention implicitly via cross-modal similarity weighting between text and image representations.
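The hierarchical pattern shared by several of the models above (word-level attention producing sentence vectors, then sentence-level attention producing a document vector) can be sketched as follows. This is a simplified illustration and not the architecture of any cited paper: the function names are hypothetical, the context vectors would be learned in practice but are fixed by hand here, and the encoders and scoring networks that real models place before each attention layer are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, context):
    """Pool vectors into one, weighting each by softmax(dot(vector, context))."""
    scores = [sum(a * b for a, b in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]
    return pooled, weights

def encode_document(document, word_context, sentence_context):
    """document: list of sentences, each a list of word vectors."""
    sentence_vectors, word_weights = [], []
    for sentence in document:
        # Word-level attention: which words matter within this sentence.
        vec, w = attention_pool(sentence, word_context)
        sentence_vectors.append(vec)
        word_weights.append(w)
    # Sentence-level attention: which sentences matter within the document.
    doc_vector, sentence_weights = attention_pool(sentence_vectors, sentence_context)
    return doc_vector, word_weights, sentence_weights

# Made-up 2-dimensional word vectors for a two-sentence document.
document = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 0.5], [1.0, 1.0]],
]
word_context = [1.0, 0.0]       # assumption: stands in for a learned vector
sentence_context = [1.0, 0.0]   # assumption: stands in for a learned vector

doc_vector, word_weights, sentence_weights = encode_document(
    document, word_context, sentence_context
)
```

The returned `word_weights` and `sentence_weights` are the quantities that explainability-oriented models inspect to show which words and sentences drove a prediction.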

Related topics¶

Hierarchical attention mechanisms: multi-level attention over document structure
Neural networks: deep learning architectures using attention
Content-based fake news detection: methods that learn from article content
