Language model truthfulness

The rapid scaling of large language models has sharpened a critical safety concern: these models can generate fluent, confident-sounding outputs that are factually false. Truthfulness, the degree to which a model avoids asserting false claims about the world, is essential for deploying language models in high-stakes domains such as medicine, law, news, and science.

Problem definition

Language models trained on internet text learn to mimic human language patterns, including common misconceptions, conspiracy theories, and outright falsehoods. At inference time, they generate text greedily or via sampling, which favors fluency (high likelihood under the learned distribution) rather than factual accuracy.

Three key challenges:

Imitating falsehoods: Models learn that certain false statements have high probability in their training data (e.g., "humans use only 10% of their brains"). Larger models fit the training distribution more closely, which can make them more likely to reproduce these learned falsehoods than to give correct information.
Confident hallucinations: Models can state falsehoods with high confidence, making errors difficult for downstream users to detect without external fact-checking.
Inverse scaling: On most NLP tasks larger models perform better, but truthfulness sometimes exhibits inverse scaling: larger models can be less truthful, likely because they absorb human falsehoods from their training data more thoroughly.
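
As a toy illustration of the first challenge, the sketch below uses a made-up answer distribution in which the misconception is the most likely continuation. The probabilities, the answer strings, and the helper names `greedy_decode` and `truthful_mass` are all invented for illustration and are not measured from any model. Greedy decoding returns the highest-probability answer, so here it returns the falsehood even though a truthful answer has substantial probability.

```python
# Hypothetical answer distribution for the prompt
# "How much of their brains do humans use?" (numbers invented for illustration).
answer_probs = {
    "Humans use only 10% of their brains.": 0.55,      # common misconception
    "Humans use virtually all of their brain.": 0.35,  # accurate answer
    "I don't know.": 0.10,                             # truthful but uninformative
}

# Answers counted as truthful in this toy setup (an assumption of the sketch).
truthful_answers = {
    "Humans use virtually all of their brain.",
    "I don't know.",
}


def greedy_decode(probs):
    """Return the single most likely answer (argmax over the distribution)."""
    return max(probs, key=probs.get)


def truthful_mass(probs, truthful):
    """Probability that one sampled answer falls in the truthful set."""
    return sum(p for answer, p in probs.items() if answer in truthful)


choice = greedy_decode(answer_probs)
print("greedy answer:", choice)
print("greedy answer is truthful:", choice in truthful_answers)
print("probability a sample is truthful:",
      round(truthful_mass(answer_probs, truthful_answers), 2))
```

Sampling instead of taking the argmax would sometimes return a truthful answer here, but in this toy distribution it would still produce the misconception more than half the time.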

Evaluation approaches

Benchmark-based evaluation:
- Design questions or statements where the ground truth is well established and understandable to human evaluators
- Evaluate whether models generate truthful or false answers
- Examples: TruthfulQA (questions built around common human misconceptions), medical and legal domain-specific benchmarks
- Metrics: accuracy, calibration, confidence-correctness alignment
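
The accuracy and calibration metrics listed above can be sketched as follows. This is a minimal illustration, assuming each answer comes with a model confidence in [0, 1] and a binary correctness label. The function names, the bin count, and the sample data are choices made for this sketch, and expected calibration error is only one of several ways to measure calibration.

```python
def accuracy(correct):
    """Fraction of answers judged truthful/correct."""
    return sum(correct) / len(correct)


def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy per bin.

    Answers are grouped into equal-width confidence bins; a well-calibrated
    model has mean confidence close to accuracy inside every bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        bin_acc = sum(ok for _, ok in items) / len(items)
        ece += (len(items) / n) * abs(mean_conf - bin_acc)
    return ece


# Invented example: two overconfident answers and two underconfident ones.
confidences = [0.9, 0.9, 0.6, 0.6]
correct = [1, 0, 1, 1]
print("accuracy:", accuracy(correct))
print("ECE:", round(expected_calibration_error(confidences, correct), 3))
```

A model can score well on accuracy and still be poorly calibrated, which is why the two are reported separately.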

Automated metrics:
- Fine-tune a classifier (e.g., GPT-judge) on human evaluation data to predict whether generated text is truthful
- Trade-off: removes the human-in-the-loop cost at evaluation time but requires a large labeled corpus for training
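
Before an automated judge replaces human raters, it is typically checked against held-out human labels. The sketch below shows one simple check, assuming binary truthful/false labels. The function names and the label data are hypothetical and do not reproduce any published validation procedure.

```python
def judge_agreement(judge_labels, human_labels):
    """Fraction of held-out examples where the judge matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def false_truthful_rate(judge_labels, human_labels):
    """Among answers humans marked false, the share the judge called truthful."""
    judged = [j for j, h in zip(judge_labels, human_labels) if not h]
    if not judged:
        return 0.0
    return sum(judged) / len(judged)


# Invented held-out labels: True means "truthful".
human = [True, True, False, False, True, False]
judge = [True, True, False, True, True, False]
print("agreement:", round(judge_agreement(judge, human), 3))
print("false answers passed as truthful:",
      round(false_truthful_rate(judge, human), 3))
```

The second number matters most for safety-oriented evaluation, since a judge that passes false answers as truthful inflates the reported truthfulness score.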

Key papers

Quelle & Bovet (2023), The Perils & Promises of Fact-checking with Large Language Models. Empirically evaluates whether GPT-3.5 and GPT-4 can reliably distinguish true from false claims in fact-checking tasks across multiple languages and datasets; reports inconsistent accuracy on ambiguous verdicts (half-true, mostly-true) and language-dependent performance, suggesting limitations in non-English contexts.
Burns et al. (2022), Discovering Latent Knowledge in Language Models Without Supervision. Unsupervised method for extracting what language models know about truth by finding linear probes whose outputs are consistent across statement-negation pairs.
Lin, Hilton & Evans (2021), TruthfulQA: Measuring How Models Mimic Human Falsehoods. Benchmark of 817 questions testing models' tendency to mimic human misconceptions; reports that the largest models tested were generally the least truthful.
Evans et al. (2021), Truthful AI: Developing and Governing AI That Does Not Lie. Governance framework for developing and regulating systems that avoid false outputs; discusses institutional and technical approaches to truthfulness.
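
The statement-negation consistency idea in the Burns et al. entry can be sketched with the per-pair objective that paper describes: a consistency term that pushes the probe's probability for a statement and for its negation to sum to one, plus a confidence term that discourages the degenerate solution of outputting 0.5 for both. The sketch takes probabilities as plain numbers and omits the probe, the hidden states, and the optimization, so it is an illustration of the loss only. The function names are chosen for this example.

```python
def ccs_pair_loss(p_pos, p_neg):
    """Loss for one statement/negation pair.

    p_pos: probe probability that the statement is true.
    p_neg: probe probability that the negated statement is true.
    """
    # Consistency: p(statement) should equal 1 - p(negation).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the trivial solution p_pos = p_neg = 0.5.
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence


def ccs_loss(pairs):
    """Average the per-pair loss over a list of (p_pos, p_neg) tuples."""
    return sum(ccs_pair_loss(p, q) for p, q in pairs) / len(pairs)


# Invented probabilities: a confident consistent pair, then an uninformative one.
print(round(ccs_pair_loss(0.9, 0.1), 4))
print(round(ccs_pair_loss(0.5, 0.5), 4))
print(round(ccs_loss([(0.9, 0.1), (0.5, 0.5)]), 4))
```

The uninformative pair satisfies the consistency term exactly but is penalized by the confidence term, which is what rules out a probe that answers 0.5 everywhere.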

Related challenges

Hallucination vs. truthfulness: Hallucinations are outputs that are nonsensical, internally inconsistent, or unsupported by the source or by known facts; truthfulness is consistency with ground truth. A hallucination may sound plausible but be factually wrong; a truthful statement may be uninformative but accurate.
Factuality in downstream tasks: Models fine-tuned for question answering, summarization, or dialogue can be less truthful than their base models when fine-tuning rewards task performance over accuracy.
Cross-domain truthfulness: Models may be truthful in some domains (e.g., common knowledge) and hallucinate in others (e.g., rare facts or specialized fields).

Open questions

Can the benefits of scaling up model size be decoupled from inverse scaling in truthfulness? Are there training objectives (beyond next-token prediction) that discourage models from learning false statements?
How do in-context learning and prompt engineering affect truthfulness? Can prompting strategies teach models to defer ("I don't know") rather than confabulate?
What role do reinforcement learning from human feedback (RLHF) and Constitutional AI play in improving truthfulness?
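
One way to study the deferral question above is to measure what happens when a system abstains on low-confidence answers. The sketch below assumes each answer has a confidence score and a correctness label. The threshold values, the data, and the function name `selective_answering` are hypothetical, and obtaining reliable confidence scores is itself part of the open problem.

```python
def selective_answering(examples, threshold):
    """Abstain ("I don't know") whenever confidence is below the threshold.

    examples: list of (confidence, is_correct) tuples.
    Returns (coverage, accuracy_on_answered); accuracy is None when the
    system abstains on every example.
    """
    answered = [ok for conf, ok in examples if conf >= threshold]
    coverage = len(answered) / len(examples)
    acc = sum(answered) / len(answered) if answered else None
    return coverage, acc


# Invented data where confidence happens to track correctness.
examples = [(0.95, True), (0.80, True), (0.60, False), (0.40, False)]
for threshold in (0.0, 0.7):
    coverage, acc = selective_answering(examples, threshold)
    print(f"threshold={threshold}: coverage={coverage}, accuracy={acc}")
```

Raising the threshold trades coverage for accuracy on the answered subset, and that trade only pays off when confidence is informative about correctness.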
