Adversarial testing

Adversarial testing is a form of security and reliability testing in which evaluators deliberately craft difficult, edge-case, or outright malicious inputs to stress-test a system's robustness. Unlike standard evaluation on representative data, it actively searches for failure modes, on the assumption that adaptive opponents will exploit any weakness they find.

Applications

In generative AI, adversarial testing involves crafting prompts designed to trigger unsafe outputs (e.g., explicit imagery or harmful instructions). In NLP, it probes a model's robustness to typos, novel phrasing, and semantic adversarial examples (meaning-preserving rewrites that change the model's output).
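
The typo-robustness case can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen for illustration: `swap_typo` models a typo as a single adjacent-character transposition (only one of many possible perturbations), `toy_classifier` is a stand-in keyword filter rather than a real model, and `find_failures` simply collects perturbed inputs whose label differs from the label of the original.

```python
import random


def swap_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent interior characters of one randomly chosen word."""
    words = text.split()
    # Only words with at least 4 characters have an interior pair to swap.
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)  # first and last characters stay fixed
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def toy_classifier(text: str) -> str:
    """Hypothetical stand-in for the system under test: exact keyword match."""
    blocked = {"attack", "exploit"}
    hit = any(word in blocked for word in text.lower().split())
    return "flagged" if hit else "ok"


def find_failures(model, text: str, n_trials: int = 50, seed: int = 0) -> list[str]:
    """Return perturbed variants whose label differs from the original's."""
    rng = random.Random(seed)
    baseline = model(text)
    failures = set()
    for _ in range(n_trials):
        variant = swap_typo(text, rng)
        if model(variant) != baseline:
            failures.add(variant)
    return sorted(failures)


if __name__ == "__main__":
    for variant in find_failures(toy_classifier, "how to exploit this server"):
        print(variant)
```

Each printed variant is an input a human would still read as the original request but that the keyword filter no longer flags. In a real evaluation the stand-in classifier would be replaced by the model under test, and the perturbation set would be broadened beyond transpositions.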

Key papers