Language model truthfulness

The rapid scaling of large language models has created a critical safety concern: these models can generate fluent, confident-sounding outputs that are factually false. Truthfulness—the degree to which a model avoids asserting false claims about the world—is essential for deploying language models in high-stakes domains (medicine, law, news, science).

Problem definition

Language models trained on text from the internet learn to mimic human language patterns, including common misconceptions, conspiracy theories, and outright falsehoods. At inference time, they generate text greedily or via sampling, which favors fluent continuations (high likelihood under the learned distribution) and involves no check against ground truth.
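
To make the decoding point concrete, here is a minimal sketch of greedy decoding versus sampling over a single next-token distribution. The prompt, the token set, and every probability below are invented for illustration and are not measured from any real model; `greedy_decode` and `sample_decode` are hypothetical helper names.

```python
import random

# Hypothetical next-token distribution for the prompt
# "Humans use ___ of their brains." All probabilities are made up
# for illustration; they do not come from a real model.
next_token_probs = {
    "10%": 0.55,   # common misconception, assumed frequent in web text
    "all": 0.30,   # closer to the scientific consensus
    "most": 0.15,
}


def greedy_decode(probs):
    """Return the single most likely token."""
    return max(probs, key=probs.get)


def sample_decode(probs, temperature=1.0, rng=None):
    """Sample one token after temperature-scaling the distribution."""
    rng = rng or random.Random()
    tokens = list(probs)
    # p ** (1 / T) equals softmax(log(p) / T) up to normalization,
    # and random.choices accepts unnormalized weights.
    weights = [probs[t] ** (1.0 / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


greedy_choice = greedy_decode(next_token_probs)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_decode(next_token_probs, rng=rng) for _ in range(1000)]
misconception_rate = samples.count("10%") / len(samples)

print("greedy:", greedy_choice)
print("share of samples repeating the misconception:", misconception_rate)
```

Under these assumed probabilities, greedy decoding always returns the misconception, and sampling returns it in roughly the proportion the distribution assigns to it. Neither procedure consults anything other than likelihood.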

Three key challenges:

  1. Imitating falsehoods: Models learn that certain false statements have high probability in their training data (e.g., "the human brain uses 10% of its capacity"). Larger models learn the training distribution better, sometimes making them more likely to output these learned falsehoods rather than correct information.

  2. Confident hallucinations: Models can generate false statements with high confidence, making it difficult for downstream users to detect errors without external fact-checking.

  3. Inverse scaling: Unlike most NLP tasks where larger models perform better, truthfulness sometimes exhibits inverse scaling—larger models are less truthful, likely because they learn human falsehoods from their training data more thoroughly.

Evaluation approaches

Benchmark-based evaluation:

  • Design questions or statements whose ground truth is well established and understandable to human evaluators
  • Evaluate whether models generate truthful or false answers
  • Examples: TruthfulQA (questions that many people answer falsely because of common misconceptions), medical and legal domain-specific benchmarks
  • Metrics: accuracy, calibration, confidence-correctness alignment
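
The accuracy and calibration metrics listed above can be sketched as follows. This is one possible formulation, assuming each graded answer carries a model-reported confidence in [0, 1] and a boolean correctness label; the record layout, the `records` values, and the choice of equal-width bins for expected calibration error (ECE) are assumptions for illustration, not a prescribed benchmark protocol.

```python
def accuracy(records):
    """Fraction of graded answers marked correct."""
    return sum(r["correct"] for r in records) / len(records)


def expected_calibration_error(records, n_bins=5):
    """Weighted average gap between confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for r in records:
        # Equal-width bins over [0, 1]; confidence 1.0 goes in the last bin.
        idx = min(int(r["confidence"] * n_bins), n_bins - 1)
        bins[idx].append(r)

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(r["confidence"] for r in bucket) / len(bucket)
        bin_acc = sum(r["correct"] for r in bucket) / len(bucket)
        ece += (len(bucket) / len(records)) * abs(bin_acc - avg_conf)
    return ece


# Hypothetical graded answers (illustrative values only).
records = [
    {"confidence": 0.9, "correct": True},
    {"confidence": 0.9, "correct": False},
    {"confidence": 0.3, "correct": False},
    {"confidence": 0.3, "correct": True},
]

print("accuracy:", accuracy(records))
print("ECE:", expected_calibration_error(records))
```

On this toy input the accuracy is 0.5 and the ECE is about 0.3: the model is overconfident in the high-confidence bin (0.9 stated, 0.5 observed) and underconfident in the low one (0.3 stated, 0.5 observed). A well-calibrated model would have an ECE near zero regardless of its accuracy.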

Automated metrics:

  • Fine-tune a classifier (e.g., GPT-judge) on human evaluation data to predict whether generated text is truthful
  • Trade-off: removes the per-evaluation human-in-the-loop cost but requires a large labeled corpus up front
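
Because an automated judge stands in for human raters, it is natural to check it against held-out human labels before relying on it. The sketch below assumes the judge's verdicts and the human verdicts are already available as parallel lists of booleans (True meaning "truthful"); the function name `judge_agreement`, the metric names, and the example labels are hypothetical, and the judge model itself is not implemented here.

```python
def judge_agreement(judge_labels, human_labels):
    """Compare automated judge verdicts with human verdicts."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and equal in length")

    pairs = list(zip(judge_labels, human_labels))
    agree = sum(j == h for j, h in pairs)

    # The costly error: the judge passes an answer humans marked false.
    false_pass = sum(j and not h for j, h in pairs)
    human_false = sum(not h for h in human_labels)

    return {
        "agreement": agree / len(pairs),
        "false_pass_rate": false_pass / human_false if human_false else 0.0,
    }


# Hypothetical held-out labels (illustrative values only).
human = [True, True, False, False, True]
judge = [True, False, True, False, True]

report = judge_agreement(judge, human)
print(report)
```

Reporting the false-pass rate separately from overall agreement is a design choice: a judge that agrees with humans most of the time can still be unsafe for evaluation if its disagreements are concentrated on false answers it marks as truthful.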

Key papers

  • Hallucination vs. truthfulness: Hallucination typically refers to fabricated, internally inconsistent, or nonsensical output; truthfulness refers to consistency with ground truth. The two overlap but are not identical: a hallucinated statement may sound plausible while being factually wrong, and a truthful statement may be uninformative but accurate.
  • Factuality in downstream tasks: Models fine-tuned for question answering, summarization, or dialogue can be less truthful than their base models, trading factual accuracy for task performance.
  • Cross-domain truthfulness: Models may be truthful in some domains (e.g., common knowledge) and hallucinate in others (e.g., rare facts or specialized domains).

Open questions

  • Can scaling up model size be decoupled from the inverse scaling of truthfulness? Are there training objectives (beyond next-token prediction) that prevent learning false statements?
  • How do in-context learning and prompt engineering affect truthfulness? Can prompting strategies teach models to defer ("I don't know") rather than confabulate?
  • What role do reinforcement learning from human feedback (RLHF) and constitutional AI play in improving truthfulness?
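
The question about deferral has a measurement side that can be sketched directly. If truthfulness means avoiding false assertions, then an answer such as "I don't know" counts as truthful while carrying no information, so it helps to report truthfulness and informativeness separately. The sketch below assumes each answer has already been graded by a human for whether it asserts a false claim; the abstention phrase list, the prefix-matching rule, and the example answers are hypothetical.

```python
# Assumed abstention phrases; a real evaluation would need a broader,
# validated list or a learned classifier.
ABSTENTIONS = ("i don't know", "i have no comment", "i'm not sure")


def is_abstention(answer):
    """True if the answer begins with a known deferral phrase."""
    text = answer.strip().lower()
    return any(text.startswith(phrase) for phrase in ABSTENTIONS)


def score(graded_answers):
    """graded_answers: list of (answer_text, asserts_falsehood) pairs."""
    n = len(graded_answers)
    truthful = sum(not false for _, false in graded_answers)
    informative = sum(
        (not false) and not is_abstention(text)
        for text, false in graded_answers
    )
    abstained = sum(is_abstention(text) for text, _ in graded_answers)
    return {
        "truthful": truthful / n,
        "truthful_and_informative": informative / n,
        "abstained": abstained / n,
    }


# Hypothetical graded answers (illustrative only).
graded = [
    ("I don't know.", False),
    ("Humans use only 10% of their brains.", True),
    ("Humans use virtually all of their brains.", False),
    ("I have no comment.", False),
]

result = score(graded)
print(result)
```

Here three of four answers are truthful, but only one is both truthful and informative. A prompting strategy that raises the first number only by increasing abstention would show up as a widening gap between the two.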