Skip to content

Alignment

Alignment refers to the process and challenge of ensuring that language models behave in accordance with human values and intentions. Given that large language models are trained on diverse internet text containing harmful content, alignment techniques attempt to override default behaviors and enforce safety constraints.

Key techniques

  • Reinforcement Learning from Human Feedback (RLHF): Training reward models from human preference judgments, then fine-tuning via RL
  • Instruction tuning: Fine-tuning on large datasets of instruction-following examples
  • Constitutional AI: Defining explicit principles that models should follow
  • Supervised fine-tuning: Using curated datasets to steer model behavior toward desired outputs
  • Red-teaming and adversarial testing: Proactively identifying failures and retraining to address them

Key papers in this wiki