Adversarial Machine Learning

Adversarial machine learning is the study of how machine learning systems can be attacked (adversarial attacks) and how they can be defended (adversarial robustness). Attacks occur at two stages: training (data poisoning, model tampering) and inference (evasion attacks, adversarial examples). The field bridges security and ML, examining both how defenders can use adversarial techniques (domain-invariant learning, robust classifiers) and how adversaries can exploit ML systems to enhance their tactics.
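To make the inference-stage case concrete, here is a minimal sketch of an evasion attack: a one-step sign-gradient perturbation (in the style of the fast gradient sign method) against a logistic regression classifier. The weights, input, and `epsilon` are hypothetical toy values chosen for illustration, and the helper names (`predict_proba`, `fgsm_perturb`) are not taken from any of the papers listed in this note.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def fgsm_perturb(w, b, x, y, epsilon):
    """Shift each feature by +/- epsilon in the direction that increases log loss.

    For logistic regression with log loss, the gradient of the loss with
    respect to the input is (p - y) * w, where p is the predicted probability.
    """
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and a correctly classified input (true label 1).
w = [2.0, -1.0, 0.5]
b = 0.0
x = [0.4, 0.1, 0.2]
y = 1

x_adv = fgsm_perturb(w, b, x, y, epsilon=0.3)

print("clean probability of class 1:      ", predict_proba(w, b, x))
print("adversarial probability of class 1:", predict_proba(w, b, x_adv))
```

On this toy input, the bounded perturbation (no feature moves by more than `epsilon`) pushes the predicted probability across the 0.5 decision threshold, which is the defining property of an adversarial example. Training-stage attacks such as data poisoning act on the training set instead of on the input at inference time.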

Key papers

  • Red Teaming Language Models with Language Models — Uses a language model to generate adversarial prompts automatically, systematically surfacing several categories of harm in LLMs (offensive outputs, data leakage, generation of personal information).
  • [[2023-shu-exploitability-instruction-tuning]] — Data poisoning attacks on instruction-tuned LLMs via AutoPoison, demonstrating content injection and over-refusal attacks while maintaining model fluency.
  • Can AI-Generated Text be Reliably Detected? — Attacks on AI-generated text detectors using recursive paraphrasing; demonstrates evasion of watermarking, neural-network-based, and retrieval-based detection systems.
  • The Threat of Offensive AI to Organizations — Comprehensive survey of 33 offensive AI capabilities that adversaries leverage, including attacks on ML models through poisoning, side-channel extraction, and model theft.
  • Wang et al. 2018 — EANN — Uses adversarial training to learn event-invariant features for fake news detection.

Related concepts

  • Offensive AI — Specific focus on how adversaries weaponize AI.
  • Threat modeling — Systematic approach to identifying and ranking attack threats.