Data Poisoning

Data poisoning is a training-time attack in which an adversary injects malicious examples into a model's training dataset to compromise its behavior. Unlike evasion attacks, which exploit an already trained model at inference time, poisoning attacks corrupt the training process itself, either degrading the model's performance or hijacking its behavior. They are particularly dangerous because the compromise happens before deployment and is hard to detect when the poisoned data is crafted to look inconspicuous.

Attack vectors

Clean-label attacks: Poisoned examples carry labels that look correct to a reviewer and remain semantically meaningful, so they are difficult to detect through label inspection or automated filtering.

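Trigger-based attacks: Attackers insert specific patterns (triggers) into training examples so that the model misbehaves only when those patterns are present and behaves normally otherwise. This makes them well suited to injecting targeted behavior; they are often called backdoor attacks.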

Availability attacks: Aim to degrade overall model performance, for example by introducing conflicting or corrupted labels.

Targeted attacks: Aim to cause specific misclassifications on chosen test examples while maintaining overall accuracy on other data.
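To make two of these vectors concrete, the following is a minimal sketch, not a reference implementation of any published attack. It assumes a toy dataset of small 2-D images stored as nested lists with integer class labels. The helper names (`flip_labels`, `add_trigger`, `poison_with_trigger`) and all parameters are hypothetical and chosen only for illustration. The first helper mimics an availability-style attack by corrupting a fraction of labels. The other two mimic a trigger-based attack by stamping a fixed patch onto a fraction of images and relabeling them to an attacker-chosen class. Because the labels are changed, this is a dirty-label variant, not a clean-label attack.

```python
import random


def flip_labels(labels, fraction, num_classes, seed=0):
    """Availability-style poisoning: move a fraction of labels to a wrong class."""
    rng = random.Random(seed)
    poisoned = list(labels)
    n_poison = int(round(fraction * len(labels)))
    chosen = rng.sample(range(len(labels)), n_poison)
    for i in chosen:
        # Pick any class other than the current (correct) one.
        wrong = [c for c in range(num_classes) if c != poisoned[i]]
        poisoned[i] = rng.choice(wrong)
    return poisoned, sorted(chosen)


def add_trigger(image, patch_value=1.0, patch_size=2):
    """Return a copy of a 2-D image with a square patch in the bottom-right corner."""
    stamped = [list(row) for row in image]
    height, width = len(stamped), len(stamped[0])
    for r in range(height - patch_size, height):
        for c in range(width - patch_size, width):
            stamped[r][c] = patch_value
    return stamped


def poison_with_trigger(images, labels, target_label, fraction, seed=0):
    """Trigger-based poisoning: stamp the patch on a fraction of images and
    relabel those images to the attacker's target class (dirty-label variant)."""
    rng = random.Random(seed)
    n_poison = int(round(fraction * len(images)))
    chosen = set(rng.sample(range(len(images)), n_poison))
    new_images = [
        add_trigger(img) if i in chosen else [list(row) for row in img]
        for i, img in enumerate(images)
    ]
    new_labels = [target_label if i in chosen else y for i, y in enumerate(labels)]
    return new_images, new_labels, sorted(chosen)


# Toy usage: ten blank 4x4 "images" with labels drawn from three classes.
toy_images = [[[0.0] * 4 for _ in range(4)] for _ in range(10)]
toy_labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]

flipped_labels, flipped_idx = flip_labels(toy_labels, fraction=0.3, num_classes=3)
bd_images, bd_labels, bd_idx = poison_with_trigger(
    toy_images, toy_labels, target_label=2, fraction=0.2
)
print("label-flipped indices:", flipped_idx)
print("backdoored indices:", bd_idx)
```

A model trained on `bd_images` and `bd_labels` would, if the attack succeeds, learn to associate the corner patch with class 2 while behaving normally on unstamped inputs. Whether it does depends on the model, the poisoning fraction, and the trigger, none of which this sketch evaluates.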

Key papers