Factuality in large language models

Factuality is the problem of ensuring that large language models generate outputs consistent with established facts, ground truth, and reliable knowledge. Whereas hallucination refers specifically to output that is internally coherent but false, factuality asks the broader question: is the information the model produces actually true?

The factuality problem

Large language models can generate plausible, grammatically correct text that is factually incorrect. These errors arise from several sources:

Knowledge gaps: Models may lack certain facts in their training data, especially recent information or specialized domain knowledge. A model trained on data through April 2023 cannot know about events in May 2023 unless that information is supplied at inference time.

Knowledge retrieval failures: Even when a model has seen relevant facts during training, it may fail to retrieve or apply them correctly at inference time. The model might "know" that Paris is the capital of France but still produce "London" when asked.

Reasoning failures: Models may fail to combine facts correctly or to carry out multi-hop inferences. For instance, from "A is north of B" and "B is north of C" it follows that A is north of C, yet models sometimes get such chains wrong.

Inference-time errors: During generation, models produce tokens left to right, with no global check on the finished statement. A token that is highly probable given the preceding text can still make the overall statement factually wrong.

Distinction from hallucinations

While closely related, factuality and hallucinations are not identical:

- Hallucinations are outputs that are internally coherent but false, or inconsistent with the provided context or with known facts.
- Factuality failures are broader: a model can fall short on factuality without hallucinating (e.g., by declining to answer or by producing output too vague to verify, neither of which asserts anything false). Conversely, some hallucinated content may accidentally align with ground truth.

Causes of factual errors

Model-level errors:
- Insufficient parametric knowledge (the model never learned the fact)
- Inability to retrieve learned knowledge at inference time
- Competing knowledge (the model learned conflicting information from training data)

Retrieval-level errors (in retrieval-augmented systems):
- Failed or incomplete retrieval of relevant documents
- Retrieval of irrelevant or contradictory documents
- Poor ranking of retrieved passages

Inference-level errors:
- Misuse of retrieved knowledge (e.g., ignoring evidence or weighting it incorrectly)
- Exposure bias and distribution shift at generation time
- Incorrect application of reasoning

Impact and applications

Factuality becomes critical in high-stakes domains:
- Medical guidance: LLM-generated medical information must be factually correct to avoid harm.
- Legal advice: Hallucinated case law or regulations can lead to malpractice.
- Financial decisions: False market data or financial forecasts can cause losses.
- Journalistic integrity: AI-assisted writing must not propagate misinformation.

Evaluation approaches

Rule-based metrics:
- Exact match against gold-standard answers
- Common metrics (accuracy, precision, recall, F1)
- Calibration scores measuring how well stated confidence aligns with actual correctness
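As a rough illustration of the rule-based metrics above, the following Python sketch computes exact match, token-level F1, and a simple expected calibration error. The function names, the answer normalization (lowercasing, stripping punctuation and English articles, as is common in question-answering evaluation scripts), and the ten equal-width confidence bins are assumptions made for this example, not a reference implementation of any particular benchmark.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> bool:
    """True when prediction and gold answer are identical after normalization."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall against the gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted mean gap between average confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece
```

With this normalization, `exact_match("The Paris.", "paris")` is true, while `token_f1("Paris, France", "Paris")` gives partial credit because only one of the two predicted tokens appears in the gold answer. A lower calibration error means the model's stated confidence tracks its accuracy more closely.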

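Neural evaluation metrics:
- BERTScore (embedding-based semantic similarity to a reference)
- Entailment-based scoring (is the output entailed or contradicted by the reference facts?)
- ROUGE and BLEU are often reported alongside these, but they measure n-gram overlap with a reference rather than semantic similarity, and they are not neural metrics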

Human evaluation:
- Expert judgment of factuality
- Attribution/support scoring (are claims backed by evidence?)
- FActScore: breaking text into atomic facts and verifying each against a knowledge source
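The atomic-fact idea can be sketched in a few lines of Python. This is a minimal illustration in the spirit of FActScore, not its actual implementation: it assumes the text has already been split into atomic facts, and the verifier is a pluggable callable. The names `support_rate`, `KNOWN_FACTS`, and `lookup` are hypothetical, and the toy verifier is a plain set lookup standing in for a human annotator or a retrieval-backed checker.

```python
from typing import Callable, Iterable


def support_rate(atomic_facts: Iterable[str],
                 is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts that the verifier marks as supported."""
    facts = list(atomic_facts)
    if not facts:
        return 0.0  # assumption: an empty generation earns no credit
    return sum(1 for fact in facts if is_supported(fact)) / len(facts)


# Toy knowledge source (hypothetical); a real verifier would consult
# annotators or retrieved evidence instead of a hard-coded set.
KNOWN_FACTS = {
    "paris is the capital of france",
    "france is in europe",
}


def lookup(fact: str) -> bool:
    """Supported only when the normalized fact appears in the toy set."""
    return fact.strip().lower().rstrip(".") in KNOWN_FACTS


facts = ["Paris is the capital of France.", "France is in Asia."]
score = support_rate(facts, lookup)  # one of the two facts is supported
```

Keeping the verifier as a parameter means the same scoring loop works whether support is judged by people, by an entailment model, or by a lookup table as here.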

LLM-based metrics:
- Using one LLM to evaluate another's factuality
- Scalable but potentially biased by the evaluator

Benchmarks:
- TruthfulQA: questions written to elicit common misconceptions, testing whether models repeat widely believed falsehoods
- MMLU, C-Eval: broad knowledge assessment across domains (C-Eval is a Chinese-language suite)
- BIG-bench: more than 200 tasks assessing reasoning, knowledge, and accuracy
- Domain-specific benchmarks for medicine, law, finance, etc.

Enhancement strategies

Retrieval-augmented generation (RAG):
- Retrieve relevant documents at inference time and condition generation on them
- Reduces reliance on parametric knowledge; provides access to up-to-date information
- Challenges: quality of retrieval, integration of multiple sources
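The retrieve-then-condition pattern can be illustrated with a toy Python sketch. Everything here is an assumption made for the example: the bag-of-words cosine retriever stands in for a real dense or sparse retriever, the function names and the prompt wording are hypothetical, and the call to the language model itself is left out.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = bow(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bow(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Place numbered evidence ahead of the question so generation is conditioned on it."""
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the evidence below. "
        "If the evidence is insufficient, say so.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {query}\nAnswer:"
    )


docs = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
    "Berlin is the capital of Germany.",
]
question = "What is the capital of France?"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, top)  # this string would be sent to the model
```

The sketch also makes the listed challenges concrete: if `retrieve` ranks the wrong passage first, the prompt conditions the model on irrelevant evidence, so answer quality is bounded by retrieval quality.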

Fine-tuning and preference learning:
- Supervised fine-tuning on factually correct data
- Reinforcement learning from human feedback (RLHF) to reward factual outputs
- Domain-specific fine-tuning for specialized knowledge

Prompting and in-context learning:
- Chain-of-thought prompting to improve reasoning
- Few-shot examples that demonstrate factually correct answers
- Retrieval-in-context (retrieving evidence and showing it in the prompt)

Knowledge integration:
- Pretraining on high-quality, factually accurate corpora
- Knowledge graph integration
- Structured knowledge embedding

Decoding-time approaches:
- Constrained decoding to enforce consistency
- Iterative refinement and fact-checking loops
- Uncertainty quantification to flag untrustworthy outputs
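One simple way to approximate uncertainty quantification is to sample several answers to the same question and measure how often they agree. The Python sketch below is one possible proxy under stated assumptions: `sample_answer` is a hypothetical callable that returns one sampled model answer per call, and the sample count and the 0.6 agreement threshold are arbitrary illustrative values. Agreement among samples is only a proxy for confidence, since a model can be consistently wrong.

```python
from collections import Counter
from typing import Callable


def self_consistency(sample_answer: Callable[[], str],
                     n: int = 5,
                     threshold: float = 0.6) -> tuple[str, float, bool]:
    """Sample n answers; return (majority answer, agreement rate, flagged).

    flagged is True when agreement falls below the threshold, which marks
    the output as one to verify before trusting it.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Light normalization so trivially different strings count as the same answer.
    answers = [sample_answer().strip().lower() for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / n
    return top, agreement, agreement < threshold


# Canned samples stand in for repeated stochastic calls to a model.
canned = iter(["Paris", "paris", "Lyon", "Paris ", "Paris"])
answer, agreement, flagged = self_consistency(lambda: next(canned), n=5)
```

In this toy run four of the five samples agree after normalization, so the answer is not flagged. Answers that are flagged could be routed to one of the fact-checking loops listed above.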

Key papers

- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection: proposes the Self-RAG framework, which aims to improve LLM factuality through adaptive retrieval and learned self-critique
- A Survey on Evaluation of Large Language Models: comprehensive survey of evaluation methodologies for LLMs, including a substantial section on factuality and hallucination assessment
- Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity: comprehensive survey of factuality in LLMs covering evaluation metrics, benchmarks, causes, and enhancement strategies
- A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities: foundational survey on fake news and misinformation detection, complementary to LLM factuality

Hallucinations in language models: when LLMs generate plausible but false content
Language Models: the neural architectures underlying factuality challenges
Retrieval-Augmented Generation: using external knowledge to improve factuality
Fact-checking and corrections: manual and automatic verification of factual claims
Information Retrieval: finding relevant knowledge for grounding LLM outputs