AI Alignment

The problem of ensuring that large language models and other AI systems behave in ways that are helpful, honest, and harmless, and more broadly consistent with human values and intentions. Alignment encompasses both training techniques (such as reinforcement learning from human feedback) and evaluation methodologies for measuring whether a model exhibits desired properties such as truthfulness, harmlessness, and capability.

Key papers

Askell et al. (2021): Develops an evaluation framework based on helpfulness, honesty, and harmlessness criteria; compares the scaling behavior of alignment techniques.

Related topics

Reinforcement Learning from Human Feedback (specific alignment training method)
Model Evaluation (techniques for measuring model properties)
Language Models (the systems being aligned)
