Adversarial testing
Adversarial testing is a form of security and reliability testing in which evaluators deliberately craft difficult, edge-case, or outright hostile inputs to stress-test a system's robustness. Unlike standard evaluation on representative data, it actively searches for failure modes, on the assumption that adaptive opponents will find and exploit any weakness.
Applications
In generative AI, adversarial testing means crafting prompts designed to trigger unsafe outputs (e.g., explicit imagery, harmful instructions). In NLP, it probes a model's robustness to typos, novel phrasing, and semantic adversarial examples, i.e. inputs reworded so that the meaning is preserved for a human reader while the model's output may change.
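The NLP case can be illustrated with a minimal sketch of a typo-robustness check. Everything here is an assumption made for illustration: `toy_sentiment` is a hypothetical keyword-matching stand-in for a real model, and `swap_typo` and `find_failures` are made-up helper names, not part of any library. The harness perturbs each input with adjacent-character swaps and records the first variant that flips the model's prediction.

```python
import random


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in model: labels text by exact keyword match."""
    positive = {"good", "great", "excellent"}
    words = text.lower().split()
    return "positive" if any(w in positive for w in words) else "negative"


def swap_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent inner characters of one randomly chosen word.

    Only words of four or more characters are eligible, and the first and
    last characters are left in place. Text with no eligible word is
    returned unchanged.
    """
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def find_failures(model, inputs, n_variants=20, seed=0):
    """Return (original, variant) pairs where a typo flips the prediction."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    failures = []
    for text in inputs:
        expected = model(text)
        for _ in range(n_variants):
            variant = swap_typo(text, rng)
            # Swapping two identical letters leaves the text unchanged; skip it.
            if variant != text and model(variant) != expected:
                failures.append((text, variant))
                break  # one counterexample per input is enough
    return failures


if __name__ == "__main__":
    for original, variant in find_failures(toy_sentiment, ["great acting", "terrible film"]):
        print(f"{original!r} -> {variant!r} flips the prediction")
```

A real harness would replace `toy_sentiment` with the system under test and would usually combine several perturbation types (typos, paraphrases, synonym swaps) rather than a single one.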
Key papers
- FLIRT: Feedback Loop In-context Red Teaming — automated adversarial testing of text-to-image models via feedback loops
Related topics
- Red teaming (security-focused variant)
- Adversarial robustness (robustness perspective)