Adversarial Machine Learning
The study of how machine learning systems can be attacked (adversarial attacks) and how they can be defended (adversarial robustness). Attacks occur at two stages: training time (data poisoning, model tampering) and inference time (evasion attacks, adversarial examples). Adversarial ML bridges security and ML: it covers both how defenders use adversarial techniques (domain-invariant learning, robust classifiers) and how adversaries exploit ML systems to enhance their tactics.
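To make the inference-time case concrete, here is a minimal sketch of an evasion attack on a toy logistic-regression classifier. It uses a one-step sign-gradient perturbation in the style of FGSM, chosen here only for illustration; it is not drawn from any of the papers listed on this page. The weights `w`, bias `b`, input `x`, and budget `eps` are all hypothetical values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(g):
    """Return -1, 0, or 1 depending on the sign of g."""
    return (g > 0) - (g < 0)

def fgsm_perturb(w, b, x, y, eps):
    """One-step sign-gradient (FGSM-style) perturbation of the input x.

    For the logistic loss, the gradient with respect to the input is
    (p - y) * w, so each feature is moved by eps in the direction that
    increases the loss on the true label y.
    """
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical fixed model and input (assumed values, for illustration only).
w = [2.0, -1.0, 0.5]
b = 0.0
x = [0.4, 0.1, 0.2]
y = 1  # true label

x_adv = fgsm_perturb(w, b, x, y, eps=0.3)
p_clean = predict_proba(w, b, x)
p_adv = predict_proba(w, b, x_adv)

print(f"clean input:       p(class 1) = {p_clean:.3f}")
print(f"adversarial input: p(class 1) = {p_adv:.3f}")
```

In this toy setup, a perturbation of at most 0.3 per feature is enough to push the predicted probability for the true class below 0.5, flipping the decision. The model itself is untouched, which is what separates an evasion attack from a training-time attack such as data poisoning.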
Key papers
- Red Teaming Language Models with Language Models — Automated generation of adversarial prompts using language models to systematically uncover harmful LLM behaviors across multiple categories (offensive outputs, data leakage, generation of personal information).
- [[2023-shu-exploitability-instruction-tuning]] — Data poisoning attacks on instruction-tuned LLMs via AutoPoison, demonstrating content injection and over-refusal attacks while maintaining model fluency.
- Can AI-Generated Text be Reliably Detected? — Attacks on AI-generated text detectors using recursive paraphrasing, demonstrating evasion of watermarking, neural-network-based, and retrieval-based detection systems.
- The Threat of Offensive AI to Organizations — Comprehensive survey of 33 offensive AI capabilities that adversaries leverage, including attacks on ML models through poisoning, side-channel extraction, and model theft.
- Wang et al. 2018 — EANN — Uses adversarial training to learn event-invariant features for fake news detection.
Related topics
- Offensive AI — Specific focus on how adversaries weaponize AI
- Threat modeling — Systematic approach to identifying and ranking attack threats