Hallucination in language models

Hallucination is a critical failure mode where language models generate text that is fluent and coherent but factually incorrect or inconsistent with the provided context. Unlike random errors, hallucinations are often internally coherent falsehoods that can mislead users into trusting incorrect information.

Definition and characteristics

Hallucination occurs when a model generates output that:
- Contradicts established facts or ground truth
- Contradicts its own context or prior statements (internal inconsistency)
- Invents entities, relationships, or events that don't exist
- Attributes false statements to real people or sources

Key distinction from other failures:
- Unlike other failures such as refusals or vague answers, hallucinations are stated confidently and are wrong
- Their plausibility and fluency make hallucinations particularly dangerous, because users may not notice the errors

Types of hallucinations

Intrinsic hallucinations: Output contradicts source material provided in the context (e.g., a model is given a Wikipedia article and generates a fact contradicting it).

Extrinsic hallucinations: Output cannot be verified against the provided source, and is often inconsistent with world knowledge (e.g., inventing a false scientific discovery or historical event).

Internal hallucinations: The output contradicts itself (e.g., the model says "Paris is the capital of France" and later says "Lyon is the capital of France").

Causes

Knowledge gaps: The training data may omit certain facts, particularly recent information or specialized domain knowledge, so the model has nothing correct to draw on.

Retrieval failures: Even when a fact is encoded in the model's parameters, the model may fail to recall it correctly during generation.

Decoding artifacts: Greedy decoding and nucleus (top-p) sampling can each select tokens that are individually likely but that together form a false statement.
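
A minimal sketch of the two decoding procedures named above, applied to a single step. The next-token distribution, the token strings, and the function names are made up for illustration; a real model produces a distribution over its whole vocabulary at every step.

```python
def greedy_choice(probs):
    """Greedy decoding: always pick the single most likely token."""
    return max(probs, key=probs.get)


def top_p_filter(probs, p):
    """Nucleus (top-p) filtering: keep the smallest set of top-ranked tokens
    whose cumulative probability reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical distribution for the blank in
# "The Eiffel Tower was completed in ____".
next_token_probs = {"1889": 0.5, "1887": 0.3, "1901": 0.15, "banana": 0.05}

greedy = greedy_choice(next_token_probs)
nucleus = top_p_filter(next_token_probs, p=0.9)
```

In this toy case the nucleus drops the implausible token but still contains two plausible, incorrect years, so a sampled continuation can be fluent and wrong. Greedy decoding happens to pick the correct year here only because the made-up distribution ranks it first.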

Prompt sensitivity: Some prompts and tasks elicit hallucinations more readily than others (e.g., prompts that ask for novel facts, as opposed to prompts that ask the model to recall information seen in training).

Impact and applications

Hallucinations are especially problematic in high-stakes domains:
- Medicine: False diagnoses or fabricated drug interactions
- Law: Fabricated case law or statutes
- Finance: False market data or unfounded investment advice
- Journalism: Spreading misinformation
- Academic research: False citations or invented results

Detection and measurement

Fact-checking approaches:
- Comparing output against curated knowledge bases
- Using external retrievers to verify claims
- Human expert annotation
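
The knowledge-base comparison can be sketched as a lookup over already-extracted claims. This is a toy illustration, not a production fact-checker: the `KNOWLEDGE_BASE` dictionary, the `capital_of` relation name, and `check_claim` are hypothetical, and the sketch assumes the claims have already been extracted from the model output as (subject, relation, object) triples, which is usually the hard part.

```python
from typing import Optional

# Hypothetical curated knowledge base: (subject, relation) -> object.
KNOWLEDGE_BASE = {
    ("Paris", "capital_of"): "France",
    ("Berlin", "capital_of"): "Germany",
}


def check_claim(subject: str, relation: str, obj: str) -> Optional[bool]:
    """Return True if the KB supports the claim, False if it contradicts it,
    and None if the KB has no entry (the claim is unverifiable here)."""
    known = KNOWLEDGE_BASE.get((subject, relation))
    if known is None:
        return None
    return known == obj


# Claims assumed to have been extracted from a model's output.
claims = [
    ("Paris", "capital_of", "France"),   # supported
    ("Paris", "capital_of", "Germany"),  # contradicted
    ("Lyon", "capital_of", "France"),    # not covered by this KB
]
verdicts = [check_claim(*claim) for claim in claims]
```

Returning `None` rather than `False` for missing entries keeps "contradicted" separate from "unverifiable", which mirrors the intrinsic/extrinsic distinction.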

Metrics:
- Hallucination rate: fraction of generated statements that are false
- Inconsistency rate: fraction of generated statements that are internally contradictory
- Attribution scores: whether claims are supported by the provided context
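
As a rough sketch, these metrics can be computed from per-statement annotations. The `AnnotatedStatement` fields and the choice to report the attribution score as the fraction of supported statements are assumptions made for illustration; published benchmarks define their own labeling schemes and aggregation rules.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedStatement:
    text: str
    is_false: bool                 # contradicts ground truth
    is_self_contradictory: bool    # contradicts another statement in the same output
    is_supported_by_context: bool  # backed by the provided context


def rate(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(statements):
    return {
        "hallucination_rate": rate(s.is_false for s in statements),
        "inconsistency_rate": rate(s.is_self_contradictory for s in statements),
        "attribution_score": rate(s.is_supported_by_context for s in statements),
    }


# Hand-labeled toy example (labels are illustrative).
statements = [
    AnnotatedStatement("Paris is the capital of France.", False, False, True),
    AnnotatedStatement("The Eiffel Tower was completed in 1789.", True, False, False),
    AnnotatedStatement("The Eiffel Tower is in Paris.", False, False, True),
    AnnotatedStatement("The Eiffel Tower is in Berlin.", True, True, False),
]
metrics = summarize(statements)
```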

Benchmarks:
- HaluEval: specifically designed to test hallucination propensity
- TruthfulQA: questions designed to elicit answers that repeat common human misconceptions
- AFHB: Adversarial Factual Hallucination Benchmark

Mitigation strategies

Retrieval-augmented generation (RAG):
- Retrieve relevant documents and condition generation on them
- Provides grounding and access to up-to-date information
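
The retrieve-then-condition pattern can be sketched without any model call. Everything here is a simplifying assumption: the three-document corpus, the word-overlap scoring (real systems typically use BM25 or dense embeddings), and the prompt wording. The resulting `prompt` string would be passed to whatever language model is in use.

```python
import re

# Hypothetical document store.
DOCUMENTS = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "The Brandenburg Gate is a monument in Berlin.",
    "Mount Everest is the highest mountain above sea level.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query and return the top k."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, passages):
    """Condition generation on the retrieved passages and allow abstention."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the passages below. "
        "If they do not contain the answer, say you do not know.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


query = "When was the Eiffel Tower completed?"
passages = retrieve(query, DOCUMENTS, k=1)
prompt = build_prompt(query, passages)
```

Grounding only helps when retrieval returns a relevant passage, so retrieval quality bounds how much hallucination this approach can remove.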

Fine-tuning approaches:
- Training on data with high factual accuracy
- Reinforcement learning from human feedback (RLHF) with rewards for factual outputs
- Learning to abstain when uncertain

Prompting techniques:
- Chain-of-thought prompting to improve reasoning
- Few-shot examples that demonstrate factually grounded answers
- Explicitly instructing models to cite sources

Decoding constraints:
- Constrained decoding to enforce consistency with context
- Reducing sampling temperature so that probability mass concentrates on the model's most likely tokens
- Beam search with factuality-aware scoring
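
The effect of temperature can be shown on a small set of logits. The logit values below are arbitrary and the function name is illustrative; this is a sketch of standard temperature-scaled softmax, not any particular library's implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # arbitrary example values
p_default = softmax_with_temperature(logits, 1.0)
p_low = softmax_with_temperature(logits, 0.5)
```

With these values the top token's probability rises from about 0.63 at temperature 1.0 to about 0.84 at temperature 0.5. Lower temperature makes output more deterministic, but it does not make the top-ranked token correct: if the model's most likely continuation is false, a low temperature reproduces that falsehood more consistently.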

Key papers