Jailbreaking¶

Jailbreaking refers to the practice of circumventing safety mechanisms and alignment constraints in language models through carefully crafted prompts or inputs. Unlike adversarial attacks in other domains, jailbreaks exploit the discrete nature of language and the generalization of models trained on diverse internet text.

Key mechanisms¶

Prompt injection: Embedding instructions within seemingly innocent queries that override system prompts
Role-playing and hypotheticals: Asking models to respond "as if" they were unrestricted systems
Adversarial suffixes: Automatically discovered token sequences that shift model behavior without human engineering
Multi-turn conversations: Using dialogue history to establish contexts where safety guards relax
Semantic obfuscation: Rephrasing harmful requests in euphemistic or indirect language

Key papers in this wiki¶

Red Teaming Language Models with Language Models — Uses language models to automatically generate adversarial prompts that bypass safety constraints, uncovering offensive outputs, data leakage, and harmful dialogue patterns
Jailbroken: How Does LLM Safety Training Fail? — Analyzes fundamental failure modes of safety training (competing objectives and mismatched generalization); develops 30 jailbreak methods and shows vulnerabilities persist despite red-teaming
Universal and Transferable Adversarial Attacks on Aligned Language Models — Automated discovery of universal adversarial suffixes using gradient-based optimization; demonstrates high transferability across models

LLM Safety and Adversarial Robustness (broader safety challenges in language models)
Adversarial Attacks (adversarial examples across modalities and tasks)
Alignment (how models are trained and constrained to behave safely)

Jailbreaking¶

Key mechanisms¶

Key papers in this wiki¶

Related topics¶