Hallucinations in language models

Hallucination is a failure mode in which a language model generates fluent, plausible text that is factually incorrect or that invents information not present in its training data or input context. Hallucinations are distinct from random noise: they are plausible, which makes them particularly deceptive.

Types of hallucinations

Extrinsic hallucinations (contradicting input/world):

  • Model generates facts inconsistent with the provided context or known ground truth
  • Example: Summarizing an article and inventing quoted statements not in the source
  • Example: Answering "Who is the current president of France?" with a wrong name

Intrinsic hallucinations (self-contradictory):

  • Model contradicts itself within the same output
  • Example: "Alice is taller than Bob. Bob is taller than Alice."
  • Example: "John has two children: Mary and Peter. John has three children."

Semantic hallucinations (nonsense):

  • Model generates grammatically correct but semantically meaningless output
  • Example: Neologisms or word combinations that fail to refer to anything
  • Example: Generating fictional scientific evidence or fake citations

Problem scope

Hallucination is endemic to large language models because:

  1. Next-token prediction objective: Models are trained to maximize the likelihood of the next token given context, not to ensure consistency with ground truth or factuality.

  2. Training on human-generated text: Training data includes common misconceptions, false claims, and unverified information.

  3. Exposure bias: During training, the model conditions on ground-truth prefixes from the training data (teacher forcing). At inference time it conditions on its own previously generated tokens, so the input distribution shifts and early errors can compound.

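  4. Confidence decoupling: The probability a model assigns to its output reflects how likely the text is, not whether it is true. Models can therefore be highly confident in hallucinated outputs, making it difficult for users to distinguish plausible falsehoods from truth.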
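
Points 1 and 2 can be illustrated with a toy model. The sketch below is a minimal illustration, not a description of how any production model is trained: it fits a bigram model by counting (the maximum-likelihood solution for that model class) on a hypothetical four-sentence corpus in which a false claim is repeated more often than the true one. The corpus and function names are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: the false claim outnumbers the true one,
# mimicking a widely repeated misconception in training text.
corpus = [
    "the capital of australia is sydney",
    "the capital of australia is sydney",
    "the capital of australia is sydney",
    "the capital of australia is canberra",
]

def train_bigram(sentences):
    """Count next-word frequencies (maximum-likelihood fit of a bigram model)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Normalize the counts into a conditional distribution P(next | prev)."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

model = train_bigram(corpus)
probs = next_word_probs(model, "is")
best = max(probs, key=probs.get)

print(probs)  # {'sydney': 0.75, 'canberra': 0.25}
print(best)   # sydney
```

The likelihood objective rewards matching the frequency of text in the corpus, so the most probable continuation here is the false one. Nothing in the objective refers to ground truth.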

Evaluation and measurement

  • Benchmark datasets: TruthfulQA tests whether models give truthful answers to questions that many humans answer incorrectly because of common misconceptions
  • Human evaluation: Crowdsourced judgment of factuality and consistency
  • Automated metrics: Check generated claims for consistency with retrieved documents and cross-reference them against knowledge bases
  • Entailment-based metrics: Use natural language inference (NLI) to check whether generated text is entailed by, or contradicts, the source premises
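
The consistency and entailment checks above share one shape: split the output into claims, score each claim against the source, and flag the claims that score low. The following is a minimal sketch of that loop. The `support_score` function is a deliberately crude word-overlap stand-in (an assumption of this example) for the entailment probability a real NLI model would return; the stopword list, threshold, and example texts are also hypothetical.

```python
import re

# Hypothetical stopword list, kept small for illustration.
STOPWORDS = {"the", "a", "an", "on", "of", "for", "in", "to", "and", "is", "was"}

def split_sentences(text):
    """Split on sentence-final punctuation (a crude splitter for illustration)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(text):
    """Lowercased word set with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def support_score(source, claim):
    """Stand-in for an NLI entailment score: the fraction of the claim's
    content words that also appear in the source. A real system would
    call an NLI model here instead."""
    claim_words = content_words(claim)
    if not claim_words:
        return 1.0
    return len(claim_words & content_words(source)) / len(claim_words)

def unsupported_sentences(source, output, threshold=0.6):
    """Return the output sentences whose support score falls below threshold."""
    return [s for s in split_sentences(output)
            if support_score(source, s) < threshold]

source = ("The city council approved the new budget on Monday. "
          "The budget increases funding for public libraries.")
summary = ("The council approved the budget on Monday. "
           "The mayor called the vote a historic victory.")

flagged = unsupported_sentences(source, summary)
print(flagged)  # ['The mayor called the vote a historic victory.']
```

Word overlap cannot detect negation or paraphrase, so a sentence such as "The council did not approve the budget" would pass this check. That limitation is the reason entailment models are used in practice.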

Mitigation strategies

  • Retrieval-augmented generation (RAG): Ground outputs in retrieved documents to reduce hallucinations
  • Fine-tuning with factuality: Train on high-quality, verified data; use reinforcement learning to reward factuality
  • Constitutional AI and RLHF: Use human feedback (RLHF) or AI feedback guided by written principles (Constitutional AI) to reduce hallucinated outputs
  • Uncertainty quantification: Train models to express confidence or uncertainty rather than always making confident assertions
  • Deferral mechanisms: Allow models to output "I don't know" instead of confabulating
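
Retrieval grounding and deferral combine naturally: answer only when a retrieved passage supports the question, and otherwise decline. The sketch below shows that control flow under loud assumptions. Retrieval is plain word overlap rather than a real search index, the "generation" step merely quotes the passage where a real system would prompt a language model with it, and the passages, stopword list, and `min_score` threshold are all hypothetical.

```python
import re

# Hypothetical stopword list and passage store for illustration.
STOPWORDS = {"what", "who", "where", "when", "is", "the", "of", "a", "an", "in"}

passages = [
    "Canberra is the capital of Australia.",
    "The Eiffel Tower is located in Paris.",
]

def tokenize(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, passages):
    """Return (best_passage, score), where score is the fraction of the
    question's content words that appear in the passage."""
    query = tokenize(question) - STOPWORDS
    best, best_score = None, 0.0
    for passage in passages:
        score = len(query & tokenize(passage)) / len(query) if query else 0.0
        if score > best_score:
            best, best_score = passage, score
    return best, best_score

def answer(question, passages, min_score=0.5):
    """Answer from the best passage, or defer when retrieval is too weak."""
    passage, score = retrieve(question, passages)
    if passage is None or score < min_score:
        return "I don't know"
    # Placeholder for generation: a real system would pass the passage
    # to a language model as context instead of quoting it.
    return f"According to the retrieved passage: {passage}"

print(answer("What is the capital of Australia?", passages))
# According to the retrieved passage: Canberra is the capital of Australia.
print(answer("Who won the 1987 chess olympiad?", passages))
# I don't know
```

The threshold trades coverage for reliability: raising `min_score` makes the system defer more often, and lowering it lets weakly related passages through.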

Key papers