Skip to content

Red teaming

Red teaming refers to a security practice where a dedicated team (the "red team") adversarially probes a system to identify vulnerabilities, weaknesses, and failure modes. In AI safety, red teaming involves deliberately crafting inputs—prompts, images, or other data—designed to trigger unsafe, undesired, or unexpected behavior in AI models.

Scope

Red teaming differs from standard evaluation in its adversarial stance: rather than measuring average-case performance on clean, representative data, red teams operate under the assumption that determined actors will seek to exploit the system and design targeted attacks to do so. For generative AI models, red teaming often involves finding prompts that cause models to generate harmful content (e.g., illegal guidance, discriminatory outputs, explicit imagery).

Key papers