Hallucinations in language models

Hallucination is a failure mode in which a language model generates text that is fluent and internally coherent but factually incorrect, or that invents information not present in its training data or input context. Hallucinations are distinct from random noise: they are plausible, which makes them particularly deceptive.

Types of hallucinations

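Extrinsic hallucinations (unsupported by the input or by world knowledge):
- Model generates content that cannot be verified against the provided context or that conflicts with known ground truth
- Example: Summarizing an article and inventing quoted statements not in the source
- Example: Answering "Who is the current president of France?" with a wrong name

Intrinsic hallucinations (contradicting the input or itself):
- Model contradicts the provided source, or contradicts itself within the same output
- Example: "Alice is taller than Bob. Bob is taller than Alice."
- Example: "John has two children: Mary and Peter. John has three children."

Usage of these two terms varies. In the survey listed under Key papers, intrinsic means the output contradicts the source content and extrinsic means the output cannot be verified from the source content.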

Semantic hallucinations (nonsense):
- Model generates grammatically correct but semantically meaningless output
- Example: Neologisms or word combinations that fail to refer to anything
- Example: Generating fictional scientific evidence or fake citations

Problem scope

Hallucination is endemic to large language models because:

Next-token prediction objective: Models are trained to maximize the likelihood of the next token given context, not to ensure consistency with ground truth or factuality.
Training on human-generated text: Training data includes common misconceptions, false claims, and unverified information.
Exposure bias: During training, models condition on ground-truth prefixes from the training data (teacher forcing), but at inference time they condition on their own previously generated tokens. This distribution shift lets early errors compound.
Confidence decoupling: Models can assign high probability to hallucinated outputs and state them in an assured tone, making it difficult for users to distinguish plausible falsehoods from truth.
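
The first and third points above can be written out explicitly. The following is a sketch in standard notation; the symbols (theta for model parameters, x_t for a training token, x-hat for a generated token) are introduced here for illustration and are not used elsewhere on this page.

```latex
% Training objective: maximize the likelihood of each token
% given the true preceding tokens from the training data.
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)

% Inference: each token is sampled given the model's own earlier outputs.
\hat{x}_t \sim p_\theta\left(\cdot \mid \hat{x}_{<t}\right)
```

No term in the loss refers to ground truth, so the objective rewards text that resembles the training data whether or not it is correct. The generated prefix at inference time may also be one the model never conditioned on during training, which is the distribution shift described under exposure bias.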

Evaluation and measurement

Benchmark datasets: TruthfulQA tests whether models reproduce common human misconceptions, using questions that some humans would answer falsely and for which reference true answers are provided
Human evaluation: Crowdsourced judgment of factuality and consistency
Automated metrics: Check generated claims for consistency with retrieved documents, or cross-reference them against knowledge bases
Entailment-based metrics: Use natural language inference (NLI) to check whether each generated statement is entailed by, or contradicts, the source text
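
An entailment-based check can be sketched as follows. This is a minimal illustration, not a reference metric: `entailment_report`, `split_sentences`, and `stub_nli` are hypothetical names, and the NLI step is passed in as a callable so that a real NLI model can be substituted. The stub used here simply looks up hand-written labels and stands in for such a model.

```python
import re
from typing import Callable, Dict, List

# An NLI function maps (premise, hypothesis) to one of:
# "entailment", "neutral", or "contradiction".
NLIFn = Callable[[str, str], str]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def entailment_report(source: str, generated: str, nli: NLIFn) -> Dict[str, object]:
    """Label each generated sentence against the source and summarize."""
    sentences = split_sentences(generated)
    labels = [nli(source, s) for s in sentences]
    contradicted = [s for s, lab in zip(sentences, labels) if lab == "contradiction"]
    unsupported = [s for s, lab in zip(sentences, labels) if lab == "neutral"]
    flagged = len(contradicted) + len(unsupported)
    # Fraction of sentences that are not entailed by the source.
    rate = flagged / len(sentences) if sentences else 0.0
    return {
        "contradicted": contradicted,
        "unsupported": unsupported,
        "hallucination_rate": rate,
    }


# ASSUMPTION: hand-written labels standing in for a real NLI model.
HAND_LABELS = {
    "The bridge opened in 1932.": "entailment",
    "It is 500 meters long.": "contradiction",
    "The mayor called it a triumph.": "neutral",
}


def stub_nli(premise: str, hypothesis: str) -> str:
    """Hypothetical stub: ignores the premise and returns a fixed label."""
    return HAND_LABELS.get(hypothesis, "neutral")


source = "The bridge opened in 1932 and is 1,149 meters long."
summary = (
    "The bridge opened in 1932. It is 500 meters long. "
    "The mayor called it a triumph."
)
report = entailment_report(source, summary, stub_nli)
```

Separating contradicted sentences from merely unsupported ones keeps the two failure types distinguishable in the report, since a contradiction of the source and an unverifiable addition usually call for different handling.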

Mitigation strategies

Retrieval-augmented generation (RAG): Ground outputs in retrieved documents to reduce hallucinations
Fine-tuning with factuality: Train on high-quality, verified data; use reinforcement learning to reward factuality
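Constitutional AI and RLHF: Use human feedback (RLHF) or AI feedback guided by written principles (Constitutional AI) to penalize hallucinated outputs
Uncertainty quantification: Train models to express calibrated confidence or uncertainty rather than asserting every claim with equal confidence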
Deferral mechanisms: Allow models to output "I don't know" instead of confabulating
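
Two of the strategies above, retrieval grounding and deferral, can be combined in a small sketch. Everything here is illustrative: the word-overlap retriever is a toy stand-in for a real search index or embedding model, `generate` is a hypothetical callable representing any language model, and the names, stopword list, and `min_score` threshold are assumptions chosen for the example.

```python
import re
from typing import Callable, List, Set

# ASSUMPTION: a tiny stopword list, just enough for the toy retriever.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "what", "who", "in"}
DEFER_MESSAGE = "I don't know."


def content_tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def overlap_score(query: str, doc: str) -> float:
    """Fraction of the query's content tokens that appear in the document."""
    q = content_tokens(query)
    if not q:
        return 0.0
    return len(q & content_tokens(doc)) / len(q)


def retrieve(query: str, corpus: List[str], k: int = 2, min_score: float = 0.6) -> List[str]:
    """Return up to k documents scoring at least min_score, best first."""
    scored = [(overlap_score(query, d), d) for d in corpus]
    kept = [(s, d) for s, d in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in kept[:k]]


def build_prompt(query: str, docs: List[str]) -> str:
    """Instruct the model to answer only from the retrieved context."""
    context = "\n".join(f"- {d}" for d in docs)
    return (
        "Answer using only the context below. "
        f'If the context is insufficient, say "{DEFER_MESSAGE}"\n'
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )


def answer(query: str, corpus: List[str], generate: Callable[[str], str]) -> str:
    """Defer when nothing relevant is retrieved; otherwise generate from context."""
    docs = retrieve(query, corpus)
    if not docs:
        return DEFER_MESSAGE  # deferral: no supporting evidence was found
    return generate(build_prompt(query, docs))


corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Bananas are yellow.",
]
```

With this corpus, `answer("Who won the 2050 World Cup?", corpus, generate)` returns the deferral message without calling `generate`, because no document passes the threshold. Deferring before generation is only one possible design; a model can also be prompted or trained to decline on its own, as the prompt text here requests.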

Key papers

Survey of Hallucination in Natural Language Generation — comprehensive survey of hallucination across NLG tasks including abstractive summarization, dialogue generation, QA, data-to-text, machine translation, and vision-language models
A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT — discusses factuality, hallucination, and trustworthiness issues across generative AI models
[[2023-guerreiro-hallucinations-multilingual]] — empirical study of hallucinations in massively multilingual machine translation models (M2M family, ChatGPT) across 100+ language pairs, analyzing both perturbation-induced and natural hallucinations

Related topics

Language model truthfulness: truthfulness evaluation and improvement
Generated text detection: detecting when text is machine-generated (hallucinated text is typically machine-generated)
Neural language models: foundational models that hallucinate