Latent Knowledge

Latent knowledge is information that a language model encodes in its internal representations but does not express in its generated outputs. This gap between what a model knows and what it says is a fundamental challenge when deploying language models for tasks that depend on truthful output, such as fact-checking and misinformation detection.

Models can fail to express what they know because their training objectives are misaligned with truthfulness. They may have been trained by imitation learning (reproducing human-generated text, including human errors), trained against reward models that optimize for outputs that merely appear true to human raters, or may have learned spurious correlations between their training objectives and truthfulness.

Key papers

[[2022-burns-latent-knowledge|Burns et al. (2022) — Discovering Latent Knowledge in Language Models Without Supervision]] — proposes Contrast-Consistent Search (CCS), an unsupervised method leveraging logical consistency to recover latent knowledge from hidden activations across six models and ten QA datasets.
Wang et al. (2023) — Survey on Factuality in Large Language Models — comprehensive survey covering knowledge, retrieval, and domain-specific factuality in LLMs.
Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment — framework for evaluating LLM trustworthiness including reliability and factuality dimensions.
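
The logical-consistency idea behind CCS can be made concrete with a small sketch. It assumes the objective as commonly described for the method: a probe maps the hidden state of a statement phrased as true and the hidden state of the same statement phrased as false to two probabilities, and the loss rewards probabilities that sum to one (consistency) while penalizing the degenerate answer of 0.5 for both (confidence). The linear probe, the function names, and the array shapes below are illustrative assumptions, not the authors' code, and the activation normalization used in practice is omitted.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping probe scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def ccs_loss(theta, bias, h_pos, h_neg):
    """CCS-style unsupervised loss for a linear probe (illustrative sketch).

    theta, bias : probe parameters (hypothetical names)
    h_pos[i]    : hidden state for statement i phrased as true
    h_neg[i]    : hidden state for statement i phrased as false
    """
    p_pos = sigmoid(h_pos @ theta + bias)
    p_neg = sigmoid(h_neg @ theta + bias)
    # Consistency: p(x+) should equal 1 - p(x-), as for a statement and its negation.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the trivial solution p(x+) = p(x-) = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))
```

Under these assumptions, a probe that outputs 0.5 for every input is perfectly consistent but still pays a confidence penalty of 0.25, whereas a probe that cleanly separates the two phrasings drives the loss toward zero. No labels appear anywhere in the loss, which is what makes the search unsupervised.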

Related topics

Language model truthfulness (related: focus on truthfulness; eliciting latent knowledge is one approach to achieving it)
Model Interpretability (related: understanding internal representations that encode truth)
Language Model Alignment (broader: ensuring model behavior matches intended goals)
