Model safety

Model safety encompasses the practices, techniques, and evaluation frameworks used to ensure that trained neural models, especially large generative models, produce safe, reliable, and aligned outputs.

Scope

Model safety is narrower in scope than AI safety: it focuses on the behavior and robustness of an individual model rather than on broader governance, deployment, or societal implications. Common model safety challenges include:

  • Unsafe content generation: Generation of harmful, explicit, illegal, or abusive outputs
  • Hallucination: Confident generation of false information
  • Prompt injection vulnerability: Manipulation of model behavior through adversarially crafted inputs
  • Lack of adversarial robustness: Vulnerability to small, deliberately crafted input perturbations
  • Bias amplification: Reinforcement of biases present in training data, leading to disproportionate harms to specific demographic groups
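
To make the adversarial robustness item concrete, the following is a minimal sketch of how a small, crafted perturbation can flip a prediction. It uses a hand-written linear classifier as a stand-in for a real model. The names score, predict, perturb, WEIGHTS, and BIAS, and all numeric values, are illustrative assumptions rather than anything taken from a specific system.

```python
# Toy illustration only: a hand-written linear classifier, not a real model.
# All names and numbers here are hypothetical and chosen for the example.

def score(weights, bias, x):
    """Linear decision score; a positive score means class 1."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    """Return the predicted class (1 or 0) for input x."""
    return 1 if score(weights, bias, x) > 0 else 0


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def perturb(weights, x, epsilon, current_label):
    """Shift each feature by at most epsilon in the direction that
    pushes the score toward the opposite class."""
    direction = -1 if current_label == 1 else 1
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


WEIGHTS = [0.5, -0.25, 1.0]
BIAS = -0.25

x = [0.5, 0.5, 0.25]
original_label = predict(WEIGHTS, BIAS, x)

# Each feature moves by only 0.125, yet the predicted class changes.
x_adv = perturb(WEIGHTS, x, epsilon=0.125, current_label=original_label)
adversarial_label = predict(WEIGHTS, BIAS, x_adv)

print(original_label, adversarial_label)
```

For a linear model, the perturbation changes the score by epsilon times the sum of the absolute weights, which is why a small per-feature change can be enough to cross the decision boundary. Attacks on neural networks typically use gradients to find the analogous direction.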

Methods and approaches

Safety mechanisms include:

  • Training-based: Reinforcement Learning from Human Feedback (RLHF), instruction fine-tuning, safety-focused pretraining
  • Filtering and detection: Post-hoc filtering of generated content, classifiers to detect unsafe outputs
  • Architectural and data constraints: Model sizing, training data curation, and architectural choices intended to reduce unsafe generation
  • Red teaming and evaluation: Systematic adversarial testing to discover failure modes
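
Two of the mechanisms listed above, post-hoc filtering and red teaming, can be sketched together in a few lines. This is a minimal illustration under stated assumptions: toy_generate stands in for a real model, toy_unsafe_score stands in for a trained safety classifier, and BLOCKLIST, REFUSAL, make_filtered_model, and red_team are hypothetical names introduced only for this example.

```python
# Sketch of a post-hoc output filter plus a tiny red-team harness.
# Every name below is hypothetical; the "model" and "classifier" are toys.

REFUSAL = "[response withheld by safety filter]"
BLOCKLIST = {"exploit", "weapon"}


def toy_unsafe_score(text):
    """Stand-in for a safety classifier: 1.0 if a blocklisted word appears."""
    words = set(text.lower().split())
    return 1.0 if words & BLOCKLIST else 0.0


def toy_generate(prompt):
    """Stand-in for a generative model: simply echoes the prompt."""
    return "echo: " + prompt


def make_filtered_model(generate, unsafe_score, threshold=0.5):
    """Wrap a generator so that outputs scored as unsafe are replaced."""
    def filtered(prompt):
        output = generate(prompt)
        if unsafe_score(output) >= threshold:
            return REFUSAL
        return output
    return filtered


def red_team(model, prompts, unsafe_score, threshold=0.5):
    """Return the prompts whose responses are still scored as unsafe."""
    return [p for p in prompts if unsafe_score(model(p)) >= threshold]


prompts = [
    "how do plants grow",
    "write an exploit",
    "build a WEAPON please",
]

raw_failures = red_team(toy_generate, prompts, toy_unsafe_score)
safe_model = make_filtered_model(toy_generate, toy_unsafe_score)
filtered_failures = red_team(safe_model, prompts, toy_unsafe_score)

print(len(raw_failures), len(filtered_failures))
```

In this sketch the filter and the evaluation share one scoring function, so the filtered model passes by construction. A realistic evaluation would likely score outputs with an independent classifier or human review, so that weaknesses in the filter itself can surface.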

Key papers