Red teaming¶

Red teaming refers to a security practice where a dedicated team (the "red team") adversarially probes a system to identify vulnerabilities, weaknesses, and failure modes. In AI safety, red teaming involves deliberately crafting inputs—prompts, images, or other data—designed to trigger unsafe, undesired, or unexpected behavior in AI models.

Scope¶

Red teaming differs from standard evaluation in its adversarial stance: rather than measuring average-case performance on clean, representative data, red teams operate under the assumption that determined actors will seek to exploit the system and design targeted attacks to do so. For generative AI models, red teaming often involves finding prompts that cause models to generate harmful content (e.g., illegal guidance, discriminatory outputs, explicit imagery).

Key papers¶

FLIRT: Feedback Loop In-context Red Teaming — automated red teaming via feedback loops and in-context learning; achieves 80%+ attack success on Stable Diffusion
Red Teaming Language Models with Language Models — pioneering work on red teaming language models with human-in-the-loop approaches

AI Safety (broader goal)
Adversarial robustness (defense perspective)
Model safety (system-level safety)
Prompt injection (attack technique)

Red teaming¶

Scope¶

Key papers¶

Related topics¶