AI Alignment

The problem of ensuring that large language models and other AI systems behave in ways that are helpful, honest, and harmless and, more broadly, consistent with human values and intentions. Alignment encompasses both training techniques (such as reinforcement learning from human feedback) and evaluation methodologies for measuring whether a model exhibits desired properties such as truthfulness, harmlessness, and capability.

Key papers

  • Askell et al. (2021): Develops an evaluation framework based on helpfulness, honesty, and harmlessness criteria; compares the scaling behavior of alignment techniques