
Reinforcement Learning from Human Feedback

A training paradigm in which language models are optimized using feedback from human evaluators. Rather than relying solely on supervised learning from labeled examples, RLHF trains a reward (or preference) model on human comparisons or ratings of model outputs. The scores from that learned model then serve as the reward signal for fine-tuning the language model, typically with a reinforcement learning algorithm.

RLHF has become a standard technique for alignment, enabling models to learn nuanced preferences (e.g., helpfulness vs. harmlessness trade-offs) and to generalize beyond the specific examples provided by humans.
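
As a rough illustration of how a preference model can learn from human comparisons, the sketch below computes a pairwise loss of the Bradley-Terry form, -log(sigmoid(r_chosen - r_rejected)). This particular loss is an assumption made for illustration: it is a common choice, but this entry does not specify one. The function names and the scalar reward inputs are hypothetical, standing in for the outputs of a real reward model.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred output wins.

    Assumes a Bradley-Terry model (an illustrative choice):
        P(chosen beats rejected) = sigmoid(reward_chosen - reward_rejected)
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs):
    """Average the pairwise loss over (reward_chosen, reward_rejected) pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)


# Hypothetical reward scores for three human comparisons.
comparisons = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
loss = mean_preference_loss(comparisons)
```

The loss shrinks as the model scores the human-preferred output further above the rejected one, and grows when the ordering is reversed. In practice the rewards would come from a neural network and the loss would be minimized by gradient descent.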

Key papers

  • Askell et al. (2021) — Compares preference modeling and other alignment techniques; demonstrates that ranked preference modeling scales better than imitation learning