Reinforcement Learning from Human Feedback
Reinforcement learning from human feedback (RLHF) is a training paradigm in which language models are optimized using feedback from human evaluators. Rather than relying solely on supervised learning from labeled examples, RLHF trains a reward (or preference) model on human comparisons or ratings of model outputs. The reward model's scores then serve as the training signal for fine-tuning the language model, typically with a reinforcement learning algorithm.
RLHF has become a standard technique for alignment, enabling models to learn nuanced preferences (e.g., helpfulness vs. harmlessness trade-offs) and to generalize beyond the specific examples provided by humans.
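The reward-model step described above can be illustrated with a small sketch. It assumes a pairwise (Bradley-Terry style) objective, which is one common choice for learning from human comparisons but is not stated in this entry. The linear reward over hand-made feature vectors, and the names `pairwise_preference_loss`, `train_linear_reward`, and `score`, are hypothetical and chosen only for illustration; practical systems typically score text with a neural network rather than a linear model.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_chosen - r_rejected)), computed stably.

    The loss is small when the human-preferred output gets the higher reward.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def score(weights, features) -> float:
    """Hypothetical linear reward: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


def train_linear_reward(comparisons, dim, lr=0.1, epochs=200):
    """Fit reward weights on (chosen_features, rejected_features) pairs.

    Uses plain stochastic gradient descent on the pairwise loss above.
    """
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in comparisons:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = score(weights, diff)
            # d/d(margin) of -log(sigmoid(margin)) is -(1 - sigmoid(margin)).
            grad_scale = -(1.0 - sigmoid(margin))
            weights = [w - lr * grad_scale * d for w, d in zip(weights, diff)]
    return weights


# Toy data (an assumption for illustration): each pair is
# (features of the output a human preferred, features of the other output).
toy_comparisons = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.2], [0.1, 0.9]),
]
learned_weights = train_linear_reward(toy_comparisons, dim=2)
```

After training, `score(learned_weights, features)` plays the role of the feedback signal: a reinforcement learning algorithm would adjust the language model so that its outputs receive higher scores. That fine-tuning step is not shown here.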
Key papers
- Askell et al. (2021) — Compares preference modeling and other alignment techniques; demonstrates that ranked preference modeling scales better than imitation learning
Related topics
- AI Alignment (broader problem RLHF addresses)
- Model Evaluation (techniques for assessing alignment outcomes)
- Language Models (systems typically trained with RLHF)