Skip to content

Jailbreaking

Jailbreaking refers to the practice of circumventing safety mechanisms and alignment constraints in language models through carefully crafted prompts or inputs. Unlike adversarial attacks in other domains, jailbreaks exploit the discrete nature of language and the generalization of models trained on diverse internet text.

Key mechanisms

  • Prompt injection: Embedding instructions within seemingly innocent queries that override system prompts
  • Role-playing and hypotheticals: Asking models to respond "as if" they were unrestricted systems
  • Adversarial suffixes: Automatically discovered token sequences that shift model behavior without human engineering
  • Multi-turn conversations: Using dialogue history to establish contexts where safety guards relax
  • Semantic obfuscation: Rephrasing harmful requests in euphemistic or indirect language

Key papers in this wiki