Jailbreaking¶
Jailbreaking refers to the practice of circumventing safety mechanisms and alignment constraints in language models through carefully crafted prompts or inputs. Unlike adversarial attacks in other domains, jailbreaks exploit the discrete nature of language and the generalization of models trained on diverse internet text.
Key mechanisms¶
- Prompt injection: Embedding instructions within seemingly innocent queries that override system prompts
- Role-playing and hypotheticals: Asking models to respond "as if" they were unrestricted systems
- Adversarial suffixes: Automatically discovered token sequences that shift model behavior without human engineering
- Multi-turn conversations: Using dialogue history to establish contexts where safety guards relax
- Semantic obfuscation: Rephrasing harmful requests in euphemistic or indirect language
Key papers in this wiki¶
- Red Teaming Language Models with Language Models — Uses language models to automatically generate adversarial prompts that bypass safety constraints, uncovering offensive outputs, data leakage, and harmful dialogue patterns
- Jailbroken: How Does LLM Safety Training Fail? — Analyzes fundamental failure modes of safety training (competing objectives and mismatched generalization); develops 30 jailbreak methods and shows vulnerabilities persist despite red-teaming
- Universal and Transferable Adversarial Attacks on Aligned Language Models — Automated discovery of universal adversarial suffixes using gradient-based optimization; demonstrates high transferability across models
Related topics¶
- LLM Safety and Adversarial Robustness (broader safety challenges in language models)
- Adversarial Attacks (adversarial examples across modalities and tasks)
- Alignment (how models are trained and constrained to behave safely)