Alignment
Alignment refers to the process and challenge of ensuring that language models behave in accordance with human values and intentions. Because large language models are trained on diverse internet text that includes harmful content, alignment techniques aim to steer models away from the undesirable behaviors they would otherwise reproduce and toward outputs that satisfy safety constraints.
Key techniques
- Reinforcement Learning from Human Feedback (RLHF): Training a reward model on human preference judgments, then fine-tuning the language model with RL to maximize that learned reward
- Instruction tuning: Fine-tuning on large datasets of instruction-following examples
- Constitutional AI: Defining an explicit set of written principles and using them to guide model self-critique, revision, and AI-generated preference feedback
- Supervised fine-tuning: Using curated datasets to steer model behavior toward desired outputs
- Red-teaming and adversarial testing: Proactively identifying failures and retraining to address them
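The reward-modeling step of RLHF listed above can be illustrated with a small sketch. It assumes a pairwise logistic (Bradley-Terry style) loss, a common choice for learning from preference pairs, and replaces the neural reward model with a hypothetical linear scorer over two made-up features (helpfulness, harmfulness). All names and data here are illustrative assumptions, not taken from the papers in this wiki.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def reward(weights, features):
    # Hypothetical linear reward model: a dot product of weights and features.
    return sum(w * f for w, f in zip(weights, features))


def preference_loss(weights, chosen, rejected):
    # Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed as a
    # stable softplus of the negated margin.
    margin = reward(weights, chosen) - reward(weights, rejected)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    # Plain stochastic gradient descent on the pairwise loss.
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. weights is
            # -(1 - sigmoid(margin)) * (chosen - rejected).
            scale = 1.0 - sigmoid(margin)
            for i in range(dim):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights


# Toy preference pairs (assumed data): each entry is (chosen, rejected),
# with features [helpfulness, harmfulness].
PAIRS = [
    ([0.9, 0.1], [0.8, 0.9]),
    ([0.7, 0.0], [0.2, 0.1]),
    ([0.6, 0.2], [0.9, 0.8]),
]

WEIGHTS = train_reward_model(PAIRS, dim=2)
```

In a full RLHF pipeline the learned reward would then score sampled model outputs during the RL fine-tuning stage; that stage is omitted here to keep the sketch short.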
Key papers in this wiki
- Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment — Comprehensive survey and evaluation framework for LLM alignment across seven trustworthiness dimensions; empirically measures effectiveness of alignment across multiple models and identifies domain-specific gaps
- Universal and Transferable Adversarial Attacks on Aligned Language Models — Demonstrates that current alignment techniques are insufficient against automated adversarial attacks; models can be jailbroken despite extensive alignment training
Related topics
- LLM Safety and Adversarial Robustness (broader safety challenges in language models)
- Jailbreaking (circumventing alignment constraints)
- Reinforcement Learning From Human Feedback (a primary technique for alignment)