Alignment¶

Alignment is the problem of ensuring that language models behave in accordance with human values and intentions, together with the methods used to address it. Because large language models are pretrained on broad internet text, some of which is harmful, a pretrained model can reproduce that content by default. Alignment techniques adjust the model after pretraining to suppress such behaviors and to enforce safety constraints.

Key techniques¶

Reinforcement Learning from Human Feedback (RLHF): Training a reward model on human preference judgments, then fine-tuning the language model with reinforcement learning to maximize that reward
Instruction tuning: Fine-tuning on large datasets of instruction-following examples
Constitutional AI: Defining explicit written principles (a "constitution") that the model uses to critique and revise its own outputs
Supervised fine-tuning: Using curated datasets to steer model behavior toward desired outputs
Red-teaming and adversarial testing: Proactively identifying failures and retraining to address them
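
The first step of RLHF listed above, fitting a reward model to human preference judgments, can be sketched as follows. This is a minimal illustration rather than the method of any particular paper in this wiki: it assumes a Bradley-Terry pairwise loss (a common choice that this page does not itself specify) and a toy linear reward over hand-made feature vectors. The two features, the example pairs, and the helper names `reward`, `preference_loss`, and `train_reward_model` are all hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def reward(weights: list[float], features: list[float]) -> float:
    """Toy linear reward model: a dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # Written as softplus(-margin) so large margins do not overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit the linear reward with plain stochastic gradient descent."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. the margin is -(1 - sigmoid(margin)).
            scale = 1.0 - sigmoid(margin)
            for i in range(dim):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights


# Hypothetical features per response: [helpfulness score, harmfulness score].
# Each pair is (response the annotator preferred, response they rejected).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.9, 0.9]),
]
weights = train_reward_model(pairs, dim=2)

for chosen, rejected in pairs:
    r_c, r_r = reward(weights, chosen), reward(weights, rejected)
    print(f"chosen={r_c:.3f} rejected={r_r:.3f} loss={preference_loss(r_c, r_r):.3f}")
```

In a real RLHF pipeline the reward model is typically a neural network that scores full responses, and its output then serves as the reward signal for the RL fine-tuning stage. The linear model here only makes the loss and its gradient easy to follow.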

Key papers in this wiki¶

Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment — Comprehensive survey and evaluation framework for LLM alignment across seven trustworthiness dimensions; empirically measures effectiveness of alignment across multiple models and identifies domain-specific gaps
Universal and Transferable Adversarial Attacks on Aligned Language Models — Demonstrates that current alignment techniques are insufficient against automated adversarial attacks; models can be jailbroken despite extensive alignment training

Related topics¶

LLM Safety and Adversarial Robustness (broader safety challenges in language models)
Jailbreaking (circumventing alignment constraints)
Reinforcement Learning From Human Feedback (a primary technique for alignment)
