
Label noise

Label noise refers to incorrect or contradictory annotations in training datasets. This can arise from crowdsourcing errors, weakly-supervised web scraping, automatic annotation, or genuine ambiguity in the labeling task (e.g., determining whether a tweet is misinformation when the truth is disputed). Deep neural networks with high capacity can memorize noisy labels, harming generalization and cross-domain transfer.

Key approaches

Noise detection: Identify and filter likely mislabeled samples before or during training using outlier detection, training dynamics, or confidence scores.
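As a rough illustration of the confidence-score variant, the sketch below flags a sample when the model assigns low probability to its given label and prefers a different class. The function name `flag_suspect_labels`, the fixed `threshold`, and the toy probabilities are all assumptions made for this example; in practice the probabilities would usually come from held-out or cross-validated predictions so that the model has not already memorized the labels.

```python
import numpy as np

def flag_suspect_labels(probs, labels, threshold=0.5):
    """Return a boolean mask of samples whose given label looks unreliable.

    probs:  (n_samples, n_classes) predicted class probabilities
    labels: (n_samples,) integer labels as given in the dataset
    threshold: hypothetical cut-off on the probability of the given label
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Probability the model assigns to the label the dataset claims.
    self_confidence = probs[np.arange(len(labels)), labels]
    # Class the model actually prefers.
    predicted = probs.argmax(axis=1)
    # Suspect: low confidence in the given label AND a different top class.
    return (self_confidence < threshold) & (predicted != labels)

# Toy data (made up for illustration).
probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.95, 0.05],  # model strongly prefers class 0, but the label says 1
    [0.45, 0.55],
])
labels = np.array([0, 1, 1, 1])

suspect = flag_suspect_labels(probs, labels)
clean_indices = np.flatnonzero(~suspect)  # indices to keep for training
```

Flagged samples can be dropped, sent for relabeling, or simply inspected; which of these is appropriate depends on how costly it is to lose hard but correct examples.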

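Noise-robust loss functions: Design loss functions that limit the influence of high-loss samples, for example by bounding the loss or shrinking its gradient, on the assumption that high loss often signals a mislabeled example. Because hard but correctly labeled examples also incur high loss, such losses can slow learning on those examples as well.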
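One well-known example of such a loss is generalized cross-entropy, L_q = (1 - p_y^q) / q with 0 < q <= 1, where p_y is the predicted probability of the given label. It is chosen here only as an illustration, not as the method this page has in mind. The sketch below compares it with standard cross-entropy on made-up probabilities; the value q = 0.7 is an arbitrary choice for the example.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Standard per-sample cross-entropy; unbounded as p_y approaches 0."""
    p_y = probs[np.arange(len(labels)), labels]
    return -np.log(p_y)

def generalized_cross_entropy(probs, labels, q=0.7):
    """Per-sample generalized cross-entropy, (1 - p_y**q) / q.

    The loss is bounded above by 1/q, so a confidently "wrong" sample
    (possibly a mislabeled one) cannot dominate the batch loss.
    As q approaches 0 it approaches cross-entropy; q = 1 gives 1 - p_y.
    """
    p_y = probs[np.arange(len(labels)), labels]
    return (1.0 - p_y ** q) / q

# Toy data: the second sample gets almost no probability on its given label.
probs = np.array([
    [0.90, 0.10],
    [0.99, 0.01],
])
labels = np.array([0, 1])

ce = cross_entropy(probs, labels)
gce = generalized_cross_entropy(probs, labels, q=0.7)
```

On the second sample the cross-entropy value is large while the generalized value stays below its 1/q ceiling, which is the mechanism that keeps suspect samples from dominating the update.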

Sample reweighting: Assign lower importance weights to samples suspected to be noisy, allowing the model to learn primarily from clean data.
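A minimal sketch of reweighting, assuming per-sample losses are already available: the weighted loss is sum(w_i * loss_i) / sum(w_i). How the weights are obtained varies by method; the `small_loss_weights` helper below is a hypothetical heuristic for this example that gives weight 1 to the lowest-loss fraction of a batch and weight 0 to the rest. The `keep_fraction` value and the toy losses are likewise assumptions.

```python
import numpy as np

def small_loss_weights(losses, keep_fraction=0.75):
    """Hypothetical heuristic: weight 1 for the lowest-loss samples, 0 otherwise."""
    losses = np.asarray(losses, dtype=float)
    n_keep = max(1, int(round(keep_fraction * len(losses))))
    keep_idx = np.argsort(losses)[:n_keep]  # indices of the smallest losses
    weights = np.zeros_like(losses)
    weights[keep_idx] = 1.0
    return weights

def weighted_mean_loss(losses, weights):
    """Weighted average of per-sample losses, normalized by total weight."""
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * losses) / np.sum(weights))

# Toy batch: the last sample has a much higher loss than the others.
losses = np.array([0.1, 0.3, 0.2, 4.0])
weights = small_loss_weights(losses, keep_fraction=0.75)
loss = weighted_mean_loss(losses, weights)
```

`weighted_mean_loss` accepts fractional weights too, so a softer scheme that shrinks suspect samples rather than discarding them fits the same interface.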

Meta-learning: Learn a cleaning or weighting strategy that generalizes to new noisy datasets.

Key papers