RLHF Training

Reinforcement Learning from Human Feedback (RLHF) is a technique for fine-tuning language models to align with human preferences and values. Rather than relying solely on supervised learning from human-written text, RLHF uses a two-stage process: first, a reward model is trained on human comparisons of model outputs (annotators indicate which of two responses is better); then the language model is updated via reinforcement learning, using the reward model's scores as the reward signal.
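The two stages can be made concrete with a small sketch. It is illustrative only and assumes details the text above does not specify: a Bradley-Terry style pairwise loss for the reward model, and a KL-style penalty against a frozen reference model during the reinforcement learning stage. Both are common choices in published RLHF setups, not the only ones. The function names, the `beta` coefficient, and the scalar inputs are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Stage 1 (assumed Bradley-Terry form): -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it ranks them the
    wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() is never called on a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_shaped_reward(
    reward: float,
    logprob_policy: float,
    logprob_reference: float,
    beta: float = 0.1,  # hypothetical penalty coefficient
) -> float:
    """Stage 2 (assumed shaping): reward-model score minus a penalty for
    drifting away from the reference model on the sampled response."""
    return reward - beta * (logprob_policy - logprob_reference)


if __name__ == "__main__":
    # The reward model agrees with the human label: low loss.
    print(pairwise_preference_loss(2.0, 0.0))
    # The reward model disagrees with the human label: high loss.
    print(pairwise_preference_loss(0.0, 2.0))
    # Reward after penalising divergence from the reference model.
    print(kl_shaped_reward(1.0, logprob_policy=-1.0, logprob_reference=-2.0))
```

In a real system the rewards and log-probabilities would come from neural networks and the loss would be averaged over a batch of comparisons; scalars are used here only to keep the arithmetic visible.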

The motivation for RLHF is that many important qualities, such as safety, honesty, harmlessness, and compliance with ethical principles, are difficult to specify as formal objectives but relatively easy for humans to judge in practice. By grounding the learning signal in human feedback, RLHF lets models learn these nuanced preferences.

In recent years, RLHF has become a primary technique for training helpful and aligned language models (e.g., GPT-3.5 and later models). Research on RLHF explores scaling laws, sample efficiency, which behaviors emerge at different model scales, and how RLHF interacts with other training techniques.

Key papers