Reinforcement Learning from Human Feedback¶

A training paradigm in which language models are optimized using feedback from human evaluators. Rather than relying solely on supervised learning from labeled examples, RLHF trains a reward (preference) model on human comparisons or ratings of model outputs. The reward model's scores then serve as the training signal for fine-tuning the language model, typically via a reinforcement learning algorithm.
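The reward-modeling step can be illustrated with a small sketch. It assumes a Bradley-Terry style pairwise loss, `-log sigmoid(r(chosen) - r(rejected))`, which is one common choice rather than the only one. The linear reward model, the two-dimensional feature vectors, and all function names below are hypothetical and exist only for illustration; real systems score text with a neural network, and the later reinforcement learning step is not shown.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def reward(weights, features):
    # Hypothetical linear reward model: r(x) = w . x
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(weights, pairs):
    # Mean of -log sigmoid(r(chosen) - r(rejected)) over all comparisons.
    total = 0.0
    for chosen, rejected in pairs:
        margin = reward(weights, chosen) - reward(weights, rejected)
        total += softplus(-margin)  # equals -log sigmoid(margin)
    return total / len(pairs)

def train_reward_model(pairs, dim, lr=0.5, steps=200):
    # Plain gradient descent on the pairwise preference loss.
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of -log sigmoid(margin) w.r.t. w is
            # -(1 - sigmoid(margin)) * (chosen - rejected).
            coeff = -(1.0 - sigmoid(margin))
            for i in range(dim):
                grad[i] += coeff * (chosen[i] - rejected[i])
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights

# Toy human comparisons (made-up data): each pair holds the feature
# vector of the preferred output, then that of the rejected output.
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.9, 0.2], [0.1, 0.8]),
    ([0.8, 0.1], [0.3, 0.9]),
]
weights = train_reward_model(pairs, dim=2)
```

After training, `reward(weights, x)` ranks each preferred output above its rejected counterpart. In a full RLHF pipeline, a score of this kind would be the quantity the fine-tuned model is optimized to increase.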

RLHF has become a standard technique for alignment, enabling models to learn nuanced preferences (e.g., helpfulness vs. harmlessness trade-offs) and to generalize beyond the specific examples provided by humans.

Key papers¶

Askell et al. (2021): Compares preference modeling with other alignment techniques and demonstrates that ranked preference modeling scales better than imitation learning.

Related topics¶

AI Alignment (broader problem RLHF addresses)
Model Evaluation (techniques for assessing alignment outcomes)
Language Models (systems typically trained with RLHF)
