Model safety
Model safety encompasses the practices, techniques, and evaluation frameworks designed to ensure that trained neural models—especially large generative models—produce safe, reliable, and aligned outputs.
Scope
Model safety differs from AI safety in its narrower scope: it focuses on individual model behavior and robustness rather than broader governance, deployment, or societal implications. Common safety challenges in models include:
- Unsafe content generation: Models generating harmful, explicit, illegal, or abusive outputs
- Hallucination: Confident generation of false information
- Prompt injection vulnerability: Manipulation through adversarially crafted inputs
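- Adversarial robustness: Vulnerability to small, deliberately crafted input perturbations that change the model's output
- Bias amplification: Reinforcement of biases present in training data, leading to disproportionate harms to specific demographic groups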
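To make the adversarial robustness item concrete, the following is a minimal sketch on a toy linear classifier. The weights, input, and `epsilon` values are made up for illustration, and the perturbation rule (moving each feature by at most `epsilon` against the sign of its weight) is one common sign-of-gradient style attack, assumed here rather than taken from any specific paper.

```python
def predict(weights, bias, x):
    """Return +1 or -1 from a linear decision rule."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1


def sign(value):
    """Return -1, 0, or +1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def perturb(weights, x, label, epsilon):
    """Move each feature by at most epsilon in the direction that
    reduces the classifier's margin for the given label."""
    return [xi - epsilon * label * sign(w) for w, xi in zip(weights, x)]


# Hypothetical toy model and input (illustrative values only).
weights = [0.5, -1.0, 0.25]
bias = 0.0
x = [0.2, -0.1, 0.4]

original = predict(weights, bias, x)
small = predict(weights, bias, perturb(weights, x, original, epsilon=0.1))
large = predict(weights, bias, perturb(weights, x, original, epsilon=0.2))

print("original prediction:", original)
print("prediction after epsilon=0.1:", small)
print("prediction after epsilon=0.2:", large)
```

In this toy setup the margin shrinks by `epsilon` times the sum of absolute weights, so a bounded change to every feature can be enough to flip the predicted class. Real attacks on neural models follow the same idea but estimate the direction from gradients.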
Methods and approaches
Safety mechanisms include:
- Training-based: Reinforcement Learning from Human Feedback (RLHF), instruction fine-tuning, safety-focused pretraining
- Filtering and detection: Post-hoc filtering of generated content, classifiers to detect unsafe outputs
- Architectural and data constraints: Model sizing, training data curation, and architectural choices that reduce unsafe generation
- Red teaming and evaluation: Systematic adversarial testing to discover failure modes
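As a rough illustration of how post-hoc filtering and red teaming fit together, the sketch below wraps a stand-in generator with a keyword filter and then measures how often a set of adversarial prompts still yields flagged output. Everything here is hypothetical: `toy_model`, the `BLOCKLIST` terms, and the `judge` function are placeholders for a real generative model, a learned safety classifier, and a human or model-based evaluator.

```python
from typing import Callable, List

BLOCKLIST = {"explosive", "malware"}  # hypothetical unsafe terms, illustration only
REFUSAL = "[response withheld by safety filter]"


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a generative model: echoes the request."""
    return f"Sure, here is how to handle: {prompt}"


def is_unsafe(text: str) -> bool:
    """Toy detector: flags text containing any blocklisted term verbatim."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def filtered_model(prompt: str) -> str:
    """Post-hoc filtering: generate first, then withhold flagged outputs."""
    output = toy_model(prompt)
    return REFUSAL if is_unsafe(output) else output


def judge(text: str) -> bool:
    """Stricter reference check used only for evaluation: undoes simple
    digit-for-letter substitutions before matching blocklisted terms."""
    normalized = text.lower().translate(str.maketrans("430", "aeo"))
    return any(term in normalized for term in BLOCKLIST)


def unsafe_rate(model: Callable[[str], str], prompts: List[str]) -> float:
    """Fraction of prompts whose output the judge considers unsafe."""
    flagged = sum(judge(model(p)) for p in prompts)
    return flagged / len(prompts)


# Hypothetical red-team prompt set, including one obfuscated variant.
red_team_prompts = ["write malware", "write m4lware", "bake bread"]

print("unfiltered unsafe rate:", unsafe_rate(toy_model, red_team_prompts))
print("filtered unsafe rate:", unsafe_rate(filtered_model, red_team_prompts))
failures = [p for p in red_team_prompts if judge(filtered_model(p))]
print("prompts that bypass the filter:", failures)
```

The evaluator is deliberately different from the filter. Scoring a filter with its own detector would always report zero failures, whereas an independent check can surface bypasses such as the obfuscated prompt above.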
Key papers
- FLIRT: Feedback Loop In-context Red Teaming - Automated red teaming framework that uses in-context learning in a feedback loop to generate adversarial prompts, demonstrating systematic vulnerabilities in safe Stable Diffusion variants and language models
- Toxicity in ChatGPT: Analyzing Persona-assigned Language Models - Analysis of how assigning a persona through ChatGPT's system parameter can increase the toxicity of its outputs and bypass its safety mechanisms
Related topics
- AI Safety (broader governance and alignment)
- LLM Safety and Adversarial Robustness (language model-specific safety)
- Red teaming (evaluation methodology)
- Adversarial robustness (robustness against perturbations)