Latent Knowledge

Latent knowledge refers to information that a language model possesses in its internal representations but does not express in its generated outputs. This gap between what a model knows and what it says is a fundamental challenge when deploying language models for tasks that depend on truthful output, such as fact-checking and misinformation detection.

Models can fail to express their knowledge when their training objective is misaligned with truthfulness. They may have been trained by imitation learning, which reproduces human-generated text, including human errors. They may have been trained against reward models that optimize for outputs that merely appear true to human raters. Or they may have learned spurious correlations between their training objectives and truthfulness.

Key papers