Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks¶

Authors: Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela

Venue: NeurIPS 2020 (arXiv:2005.11401)

TL;DR¶

This work proposes RAG (Retrieval-Augmented Generation), a hybrid architecture that augments a pre-trained language model with a non-parametric memory component. The model retrieves relevant Wikipedia passages at generation time and conditions generation on them. After fine-tuning on each task, it achieves state-of-the-art results on open-domain QA benchmarks and strong results on other knowledge-intensive NLP tasks, without task-specific architectures.

Contributions¶

Hybrid architecture combining parametric and non-parametric memory: parametric memory is a pre-trained seq2seq model (BART); non-parametric memory is a dense vector index of Wikipedia
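Two formulations: RAG-Sequence conditions the entire output sequence on the same retrieved document (marginalizing over documents at the sequence level), while RAG-Token can draw on a different retrieved document for each output token (marginalizing at the token level)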
Unified framework for knowledge-intensive tasks: evaluated on open-domain QA, abstractive QA, Jeopardy question generation, and fact verification
Knowledge updatability: the retrieval index can be swapped at test time without retraining, enabling the model to adapt to new facts
End-to-end training: the query encoder and the generator are jointly fine-tuned on task-specific data, while the document encoder and the document index are kept fixed

Method¶

RAG models use Maximum Inner Product Search (MIPS) to efficiently retrieve the top-K documents using a query encoder and document index. The retrieved documents are then treated as latent variables that condition the generation process.
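The retrieval step can be illustrated with a minimal sketch. It assumes brute-force inner-product search over toy 2-d embeddings (the values are hypothetical, not from the paper) and turns the top-K scores into a distribution with a softmax, following the paper's definition of the retrieval probability as proportional to the exponentiated inner product of document and query embeddings. The function names are illustrative only; a real system would use an approximate nearest-neighbor index rather than an exhaustive scan.

```python
import math

def top_k_mips(query_vec, doc_vecs, k):
    """Return (doc_index, inner_product) pairs for the k highest-scoring documents."""
    scores = [
        (i, sum(q * d for q, d in zip(query_vec, doc)))
        for i, doc in enumerate(doc_vecs)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

def retrieval_distribution(top_k):
    """Softmax over the top-k inner products: p(z|x) restricted to the top-k."""
    max_score = max(score for _, score in top_k)  # subtract max for stability
    exps = [(i, math.exp(score - max_score)) for i, score in top_k]
    total = sum(e for _, e in exps)
    return {i: e / total for i, e in exps}

# Toy embeddings (hypothetical values chosen for illustration).
query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]

top = top_k_mips(query, docs, k=2)
p_z_given_x = retrieval_distribution(top)
print(top)           # document indices 0 and 2 score highest
print(p_z_given_x)   # probabilities over those two documents, summing to 1
```

The resulting dictionary plays the role of the document prior that the generator later marginalizes over.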

RAG-Sequence: The generator scores the full output sequence under each of the top-K retrieved documents separately, and the model marginalizes over the documents to compute the sequence likelihood:

\[p(y|x) \approx \sum_{z \in \text{top-}k(p(\cdot|x))} p_\eta(z|x) p_\theta(y|x, z)\]
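Because the generator is autoregressive, the per-document sequence likelihood factorizes over output tokens. Writing \(N\) for the output length (a symbol introduced here for the expansion), the same expression can be expanded as:

\[p(y|x) \approx \sum_{z \in \text{top-}k(p(\cdot|x))} p_\eta(z|x) \prod_{i=1}^{N} p_\theta(y_i|x, z, y_{1:i-1})\]

The sum over documents sits outside the product over tokens, so a single document is assumed to explain the entire sequence.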

The retriever uses Dense Passage Retrieval (DPR) with a BERT-based bi-encoder. The generator is initialized from BART.

RAG-Token: The top-K documents are retrieved once per input, but the marginalization over documents is performed separately for each generated token, so each token can draw on a different document:

\[p(y|x) \approx \prod_i \sum_{z \in \text{top-}k(p(\cdot|x))} p_\eta(z|x) p_\theta(y_i|x, z, y_{1:i-1})\]

This formulation allows the model to leverage different evidence for different output tokens.
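The difference between the two marginalizations can be checked numerically. The sketch below assumes two retrieved documents and two target tokens with made-up probabilities; the function names and the `token_probs[z][i]` layout are illustrative choices, not the paper's code. Real implementations would work in log space for numerical stability.

```python
import math

def rag_sequence_likelihood(doc_priors, token_probs):
    """p(y|x) = sum_z p(z|x) * prod_i p(y_i | x, z, y_{1:i-1})."""
    return sum(
        prior * math.prod(per_token)
        for prior, per_token in zip(doc_priors, token_probs)
    )

def rag_token_likelihood(doc_priors, token_probs):
    """p(y|x) = prod_i sum_z p(z|x) * p(y_i | x, z, y_{1:i-1})."""
    num_tokens = len(token_probs[0])
    return math.prod(
        sum(prior * per_token[i] for prior, per_token in zip(doc_priors, token_probs))
        for i in range(num_tokens)
    )

# Toy numbers (hypothetical): document 1 supports token 1, document 2 supports token 2.
doc_priors = [0.5, 0.5]                    # p(z|x) for z1, z2
token_probs = [[0.5, 0.25], [0.25, 0.5]]   # token_probs[z][i] = p(y_i | x, z, y_{1:i-1})

seq = rag_sequence_likelihood(doc_priors, token_probs)
tok = rag_token_likelihood(doc_priors, token_probs)
print(seq)  # 0.125
print(tok)  # 0.140625
```

In this toy case RAG-Token assigns the target a higher likelihood because it can mix evidence from both documents token by token, whereas RAG-Sequence must explain the whole sequence with one document at a time.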

Results¶

Open-Domain QA: RAG-Token achieves 44.1 Exact Match on Natural Questions, outperforming REALM (40.4) and the extractive DPR pipeline (41.5)
Abstractive QA (MS-MARCO NLG): RAG-Sequence outperforms BART by 2.6 BLEU points
Jeopardy Question Generation: in human evaluation, RAG-Token was judged more factual than BART in 42.7% of cases, while BART was judged more factual in only 7.1% of cases
Fact Verification (FEVER): RAG comes within 4.3% of state-of-the-art on 3-way classification
Knowledge grounding: RAG can generate correct answers even when the answer is not in any retrieved document (11.8% accuracy for NQ), showing it leverages both parametric and non-parametric memory

Generation from RAG models is more specific, diverse, and factually accurate than generation from parametric-only BART baselines. The model's knowledge can also be updated at test time by replacing the retrieval index. For example, on world-leader questions the paper reports 70% accuracy with a December 2016 index for 2016 leaders and 68% with a 2018 index for 2018 leaders.
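The index-swapping idea can be illustrated with a deliberately simplified sketch. Everything here is hypothetical: `ToyRetriever` uses word overlap instead of dense retrieval, `ToyRAG.answer` stands in for the generator by returning the last word of the top passage, and the passages are invented placeholders. The only point is that the model object stays the same while the retriever behind it is replaced.

```python
class ToyRetriever:
    """Hypothetical word-overlap retriever standing in for a dense index."""

    def __init__(self, passages):
        self.passages = list(passages)

    def retrieve(self, query, k=1):
        query_words = set(query.lower().split())
        ranked = sorted(
            self.passages,
            key=lambda p: len(query_words & set(p.lower().split())),
            reverse=True,
        )
        return ranked[:k]

class ToyRAG:
    """Keeps a fixed 'generator' and a retriever that can be swapped."""

    def __init__(self, retriever):
        self.retriever = retriever

    def answer(self, question):
        # Stand-in for the generator: return the last word of the top passage.
        passage = self.retriever.retrieve(question, k=1)[0]
        return passage.rstrip(".").split()[-1]

# Invented placeholder passages for two snapshots of a knowledge source.
index_old = ToyRetriever([
    "the president of examplestan is alice.",
    "the capital of examplestan is exampleville.",
])
index_new = ToyRetriever([
    "the president of examplestan is bob.",
    "the capital of examplestan is exampleville.",
])

model = ToyRAG(index_old)
question = "who is the president of examplestan"
before = model.answer(question)
model.retriever = index_new   # swap the index; nothing else is retrained
after = model.answer(question)
print(before, after)  # alice bob
```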

Connections¶

Related to Claim Verification via the fact verification evaluation on FEVER
Applied to Question Answering Systems tasks across multiple datasets
Builds on Information Retrieval techniques for the retrieval component
Contrasts with purely parametric approaches by incorporating external knowledge sources
Similar spirit to Misinformation Has High Perplexity in using knowledge to ground NLP systems

Notes¶

This is a highly influential paper that introduced RAG and has spawned numerous follow-up works. The key insight—that pre-trained models can be augmented with retrieval at generation time without task-specific architectures—has become foundational in modern NLP. The factuality improvements are particularly relevant for applications like fact-checking and misinformation detection where grounding in evidence is critical. The work demonstrates that both parametric and non-parametric knowledge are useful, and that jointly training both components is effective.