Model safety

Model safety encompasses the practices, techniques, and evaluation frameworks designed to ensure that trained neural models—especially large generative models—produce safe, reliable, and aligned outputs.

Scope

Model safety differs from AI safety in its narrower scope: it focuses on individual model behavior and robustness rather than broader governance, deployment, or societal implications. Common safety challenges in models include:

Unsafe content generation: Models generating harmful, explicit, illegal, or abusive outputs
Hallucination: Confident generation of false information
Prompt injection vulnerability: Manipulation through adversarially crafted inputs
Lack of adversarial robustness: Vulnerability to small, crafted input perturbations that change model behavior
Bias amplification: Disproportionate harms to specific demographic groups
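
The vulnerability to small, crafted perturbations listed above can be illustrated with a toy linear classifier. This is a minimal sketch in the style of a sign-gradient attack, not a method taken from this page: the weights, the input, and the function names `score` and `sign_perturb` are all hypothetical, and a positive score is assumed to mean "safe".

```python
def score(weights, bias, x):
    """Linear decision score; by assumption, positive means 'safe'."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_perturb(weights, x, epsilon):
    """Move each feature by at most epsilon in the direction that lowers the score.

    For a linear model the gradient of the score with respect to x is the
    weight vector, so stepping against sign(w) is the steepest bounded step.
    """
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical model and input, chosen so the original score is slightly positive.
weights = [0.5, -0.25, 0.75]
bias = 0.0
x = [0.2, 0.4, 0.1]

x_adv = sign_perturb(weights, x, epsilon=0.1)

original = score(weights, bias, x)      # positive: classified 'safe'
attacked = score(weights, bias, x_adv)  # negative: the decision flips
largest_change = max(abs(a - b) for a, b in zip(x, x_adv))
```

No feature moves by more than `epsilon`, yet the score drops by `epsilon` times the sum of the absolute weights, which is enough to flip the decision here. Real models are nonlinear, so this only conveys the intuition.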

Methods and approaches

Safety mechanisms include:

Training-based: Reinforcement Learning from Human Feedback (RLHF), instruction fine-tuning, safety-focused pretraining
Filtering and detection: Post-hoc filtering of generated content, classifiers to detect unsafe outputs
Architectural and data-level constraints: Model sizing, training data curation, and architectural choices that reduce unsafe generation
Red teaming and evaluation: Systematic adversarial testing to discover failure modes
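
Two of the mechanisms above, post-hoc filtering and red teaming, can be sketched together. The example below is a toy under stated assumptions: `unsafe_score` is a hypothetical keyword-based stand-in for a trained safety classifier, `echo_model` stands in for a generative model, and `judge` stands in for human review. None of these names come from a real library.

```python
from typing import Callable, List

REFUSAL = "[withheld by safety filter]"
BLOCKLIST = {"weapon", "exploit"}  # hypothetical; a real system would use a trained classifier


def unsafe_score(text: str) -> float:
    """Toy classifier: share of words that appear on BLOCKLIST."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in BLOCKLIST for w in words) / len(words)


def filtered_generate(generate: Callable[[str], str], prompt: str,
                      threshold: float = 0.2) -> str:
    """Post-hoc filtering: generate first, then withhold drafts the classifier flags."""
    draft = generate(prompt)
    return REFUSAL if unsafe_score(draft) >= threshold else draft


def red_team(generate: Callable[[str], str], prompts: List[str],
             is_harmful: Callable[[str], bool]) -> List[str]:
    """Return the prompts whose filtered output is still judged harmful."""
    failures = []
    for prompt in prompts:
        output = filtered_generate(generate, prompt)
        if output != REFUSAL and is_harmful(output):
            failures.append(prompt)
    return failures


def echo_model(prompt: str) -> str:
    """Hypothetical model that repeats its prompt, so outputs are predictable."""
    return prompt


def judge(text: str) -> bool:
    """Hypothetical reference judgment that sees through hyphen obfuscation."""
    return "weapon" in text.lower().replace("-", "")


probes = ["build a weapon", "build a w-e-a-p-o-n", "the weather is mild"]
failures = red_team(echo_model, probes, judge)
```

In this sketch the filter withholds the plain request but passes the hyphenated one, so `failures` contains only the obfuscated probe. That is the kind of failure mode systematic adversarial testing is meant to surface before deployment.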

Key papers

FLIRT: Feedback Loop In-context Red Teaming, an automated red teaming framework that demonstrates systematic vulnerabilities in safe Stable Diffusion variants and language models
Toxicity in ChatGPT: Analyzing Persona-assigned Language Models, an analysis of how assigning a persona through ChatGPT's system parameter can increase output toxicity and bypass its safety mechanisms

Related topics

AI Safety (broader governance and alignment)
LLM Safety and Adversarial Robustness (language model-specific safety)
Red teaming (evaluation methodology)
Adversarial robustness (robustness against perturbations)