RLHF Training¶

Reinforcement Learning from Human Feedback (RLHF) is a technique for fine-tuning language models to align with human preferences and values. Rather than relying solely on supervised learning from human-written text, RLHF uses a two-stage process: first, a reward model is trained on human comparisons of model outputs (which of two responses is better?); then the reward model's score is used as the signal for updating the language model via reinforcement learning.
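The two stages can be illustrated with a minimal sketch. The text above does not name specific objectives, so both formulas here are assumptions: a Bradley-Terry style pairwise loss for the reward model, and a KL-penalized reward for the reinforcement learning step. These are common choices in RLHF work but not the only ones. The function names and the `beta` parameter are hypothetical, and scalar inputs stand in for real model outputs.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Stage 1 (assumed objective): Bradley-Terry style pairwise loss.

    Computes -log(sigmoid(reward_chosen - reward_rejected)), which is small
    when the reward model scores the human-preferred output higher.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
    # for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(
    reward: float,
    logprob_policy: float,
    logprob_reference: float,
    beta: float = 0.1,  # hypothetical penalty weight, not from the source
) -> float:
    """Stage 2 (assumed objective): reward-model score minus a KL-style penalty.

    The penalty discourages the fine-tuned policy from drifting far from
    the reference (pre-RL) model while it maximizes the learned reward.
    """
    return reward - beta * (logprob_policy - logprob_reference)


if __name__ == "__main__":
    # The reward model agrees with the human label: low loss.
    print(pairwise_preference_loss(2.0, 0.0))
    # The reward model disagrees with the human label: higher loss.
    print(pairwise_preference_loss(0.0, 2.0))
    # The policy assigns more probability than the reference: reward is reduced.
    print(kl_shaped_reward(1.0, logprob_policy=-1.0, logprob_reference=-2.0, beta=0.5))
```

In a real system the scalar rewards would come from a neural reward model and the log-probabilities from the policy and reference language models. The sketch only shows the shape of the two signals.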

The motivation for RLHF is that many important qualities—such as safety, honesty, harmlessness, and compliance with ethical principles—are difficult to specify as formal objectives but relatively easy for humans to judge in practice. By grounding the learning signal in human feedback, RLHF enables models to learn these nuanced preferences.

In recent years RLHF has become the primary technique for training helpful and aligned language models (e.g., GPT-3.5 and later models). Research on RLHF explores scaling laws, sample efficiency, which behaviors emerge at different model scales, and how RLHF interacts with other training techniques.

Key papers¶

Discovering Language Model Behaviors with Model-Written Evaluations — Uses language models to generate evaluations revealing inverse scaling behaviors and unintended RLHF side effects (political bias, instrumental subgoals, resistance to shutdown)

Related topics¶

Model Alignment (the broader goal RLHF serves)
AI Safety (applications in safety and harm reduction)
Bias in Language Models (RLHF's role in bias mitigation)