Prompt injection¶
Prompt injection refers to adversarial input crafting where carefully designed prompts are used to manipulate AI systems (especially language models and text-to-image systems) into producing unintended outputs. These might include generating harmful content, leaking sensitive information, or violating stated safety constraints.
Threat model¶
Prompt injection assumes the attacker controls the input text (the "prompt") fed to an AI system. By cleverly structuring the prompt—e.g., through role-play scenarios, goal redirection, jailbreaking framing, or providing misleading context—attackers can induce models to violate their training objectives or safety guidelines.
Key papers¶
- FLIRT: Feedback Loop In-context Red Teaming — automated generation of adversarial prompts using in-context learning; demonstrates large-scale prompt injection attacks on Stable Diffusion and language models
Related topics¶
- Red teaming (systematic probing for injection vulnerabilities)
- Adversarial testing (general adversarial evaluation)
- Model safety (defensive perspective)