Latent Knowledge
Latent knowledge refers to information that a language model encodes in its internal representations but does not express in its generated outputs. This gap between what a model knows and what it says is a fundamental challenge when deploying language models for tasks that depend on truthful output, such as fact-checking and misinformation detection.
Models can fail to express their knowledge when the training objective is misaligned with truthfulness. They may have been trained with imitation learning (reproducing human-generated text, including human errors) or against reward models that optimize for outputs that merely appear true to human raters, or they may have picked up spurious correlations between what the training objective rewards and what is true.
Key papers
- [[2022-burns-latent-knowledge|Burns et al. (2022) — Discovering Latent Knowledge in Language Models Without Supervision]] — proposes Contrast-Consistent Search (CCS), an unsupervised method leveraging logical consistency to recover latent knowledge from hidden activations across six models and ten QA datasets.
- Wang et al. (2023) — Survey on Factuality in Large Language Models — comprehensive survey covering knowledge, retrieval, and domain-specific factuality in LLMs.
- Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment — framework for evaluating LLM trustworthiness including reliability and factuality dimensions.
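
As a rough illustration of the "logical consistency" idea behind CCS, the sketch below computes the kind of unsupervised objective the method optimizes: a probe's probabilities for a statement and its negation should sum to one (consistency), and the probe should not sit at the uninformative 0.5 (confidence). The loss form follows the published description of CCS as best understood here and is not spelled out in this note. The function names, the per-side normalization helper, and the random stand-in activations are all assumptions made for illustration, and no real model is queried.

```python
import numpy as np

def normalize(h):
    """Standardize each feature over one side of the contrast pairs (assumed preprocessing)."""
    return (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-8)

def probe(h, theta, b):
    """Linear probe plus sigmoid: maps hidden states to a probability of 'true'."""
    return 1.0 / (1.0 + np.exp(-(h @ theta + b)))

def ccs_loss(p_pos, p_neg):
    """Consistency term plus confidence term, averaged over contrast pairs."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # p(x+) should equal 1 - p(x-)
    confidence = np.minimum(p_pos, p_neg) ** 2   # penalizes the degenerate p = 0.5 solution
    return float(np.mean(consistency + confidence))

# Hypothetical stand-ins for hidden activations of "Yes" and "No" completions.
rng = np.random.default_rng(0)
h_pos = rng.normal(size=(8, 4))
h_neg = rng.normal(size=(8, 4))
theta, b = rng.normal(size=4), 0.0               # untrained probe parameters

loss = ccs_loss(probe(normalize(h_pos), theta, b),
                probe(normalize(h_neg), theta, b))
```

In practice the probe parameters would be fitted by minimizing this loss with a gradient-based optimizer, and the activations would come from a chosen layer of the model under study. Both steps are omitted here.
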
Related topics
- Language Model Truthfulness (related: focus on truthful outputs; eliciting latent knowledge is one approach to achieving them)
- Model Interpretability (related: understanding internal representations that encode truth)
- Language Model Alignment (broader: ensuring model behavior matches intended goals)