Skip to content

Prompt injection

Prompt injection refers to adversarial input crafting where carefully designed prompts are used to manipulate AI systems (especially language models and text-to-image systems) into producing unintended outputs. These might include generating harmful content, leaking sensitive information, or violating stated safety constraints.

Threat model

Prompt injection assumes the attacker controls the input text (the "prompt") fed to an AI system. By cleverly structuring the prompt—e.g., through role-play scenarios, goal redirection, jailbreaking framing, or providing misleading context—attackers can induce models to violate their training objectives or safety guidelines.

Key papers