Data Poisoning
Data poisoning is a training-time attack in which an adversary injects malicious examples into a model's training dataset to compromise its behavior. Unlike evasion attacks, which exploit an already trained model at inference time, poisoning attacks degrade or hijack model behavior by corrupting the training process itself. They are particularly dangerous because the damage is done before deployment and is hard to detect when the poisoned data is crafted to look inconspicuous.
Attack vectors
- Clean-label attacks: Poisoned examples carry labels that look correct and content that is semantically meaningful, which makes them difficult to detect through automated filtering.
- Trigger-based attacks: The attacker embeds a specific pattern (a trigger) in training examples so that the model misbehaves only when that pattern is present, which allows targeted behavior injection.
- Availability attacks: Degrade overall model performance by introducing conflicting or corrupted labels.
- Targeted attacks: Cause specific misclassifications on chosen test examples while maintaining overall accuracy on other data.
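Two of the attack vectors above can be illustrated with a minimal sketch. It assumes a toy dataset of `(features, label)` pairs. The helper names `flip_labels` and `add_trigger`, the poisoning fractions, and the "last two features set to 1.0" trigger are all hypothetical choices made for illustration, not taken from any particular paper or library. The trigger example also relabels the poisoned points, so it is a dirty-label backdoor rather than a clean-label one.

```python
import random


def flip_labels(dataset, fraction, num_classes, rng):
    """Availability-style poisoning: give a fraction of examples a wrong label."""
    poisoned = list(dataset)  # shallow copy; tuples are replaced, not mutated
    n_poison = int(len(poisoned) * fraction)
    for i in rng.sample(range(len(poisoned)), n_poison):
        x, y = poisoned[i]
        wrong = rng.choice([c for c in range(num_classes) if c != y])
        poisoned[i] = (x, wrong)
    return poisoned


def add_trigger(dataset, fraction, target_label, rng, trigger_value=1.0, patch=2):
    """Trigger-based poisoning: stamp a fixed pattern onto the last `patch`
    features of a fraction of examples and relabel them to `target_label`."""
    poisoned = [(list(x), y) for x, y in dataset]  # copy features before editing
    n_poison = int(len(poisoned) * fraction)
    for i in rng.sample(range(len(poisoned)), n_poison):
        x, _ = poisoned[i]
        x[-patch:] = [trigger_value] * patch
        poisoned[i] = (x, target_label)
    return poisoned


rng = random.Random(0)
# Toy data (assumption): 100 examples, 4 features in [0, 0.5), 3 classes.
clean = [([rng.random() * 0.5 for _ in range(4)], rng.randrange(3))
         for _ in range(100)]

flipped = flip_labels(clean, fraction=0.10, num_classes=3, rng=rng)
backdoored = add_trigger(clean, fraction=0.05, target_label=0, rng=rng)

n_flipped = sum(1 for (_, a), (_, b) in zip(clean, flipped) if a != b)
n_triggered = sum(1 for x, _ in backdoored if x[-2:] == [1.0, 1.0])
print(n_flipped, n_triggered)  # prints "10 5"
```

In this sketch the label-flipping set hurts every class indiscriminately, while the triggered examples leave most of the data untouched and only tie the stamped pattern to the target label, which is why trigger-based poisoning can preserve accuracy on clean inputs.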
Key papers
- Red Teaming Language Models with Language Models: Reports training data leakage in large language models, identifying 1709 instances where an LLM reproduces memorized training examples in its responses
Related topics
- Adversarial Machine Learning: Broader field of attacks and defenses for ML systems
- Large Language Models: Instruction-tuned LLMs are particularly vulnerable to poisoning because of their low sample complexity, meaning a small number of poisoned examples can be enough to change model behavior
- Model Security: Defenses against data poisoning and other training-time attacks