Label noise
Label noise refers to incorrect or contradictory annotations in training datasets. This can arise from crowdsourcing errors, weakly-supervised web scraping, automatic annotation, or genuine ambiguity in the labeling task (e.g., determining whether a tweet is misinformation when the truth is disputed). Deep neural networks with high capacity can memorize noisy labels, harming generalization and cross-domain transfer.
Key approaches
Noise detection: Identify and filter likely mislabeled samples before or during training using outlier detection, training dynamics, or confidence scores.
Noise-robust loss functions: Design loss functions that downweight or ignore high-loss samples, on the assumption that a high loss signals a noisy label. Hard-to-learn but correctly labeled examples also incur high loss, so these methods risk discounting them as well.
Sample reweighting: Assign lower importance weights to samples suspected to be noisy, allowing the model to learn primarily from clean data.
Meta-learning: Learn a cleaning or weighting strategy that generalizes to new noisy datasets.
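The first three approaches above can be illustrated with a minimal sketch. It is not taken from the papers listed on this page: the function names, the confidence threshold of 0.1, the low weight of 0.1, and the example probabilities are all illustrative assumptions. The robust loss shown is generalized cross-entropy, one known example from the noisy-label literature, chosen here only because it is bounded (at most 1/q) where standard cross-entropy is not.

```python
import math


def cross_entropy(p_label: float) -> float:
    """Standard cross-entropy for one sample, given the predicted
    probability of its assigned label. Unbounded as p_label -> 0."""
    return -math.log(p_label)


def generalized_cross_entropy(p_label: float, q: float = 0.7) -> float:
    """Generalized cross-entropy: (1 - p^q) / q.
    Bounded above by 1/q, so a confidently contradicted (possibly noisy)
    label cannot dominate the total loss. q = 0.7 is an assumed setting."""
    return (1.0 - p_label ** q) / q


def flag_suspects(p_labels, threshold=0.1):
    """Noise detection by confidence score: return indices of samples whose
    assigned label receives predicted probability below `threshold`."""
    return [i for i, p in enumerate(p_labels) if p < threshold]


def sample_weights(p_labels, suspects, low_weight=0.1):
    """Sample reweighting: suspected-noisy samples get `low_weight`,
    all other samples get weight 1.0."""
    suspect_set = set(suspects)
    return [low_weight if i in suspect_set else 1.0
            for i in range(len(p_labels))]


def weighted_mean_loss(p_labels, weights, loss_fn):
    """Weighted average of per-sample losses."""
    total = sum(w * loss_fn(p) for p, w in zip(p_labels, weights))
    return total / sum(weights)


# Hypothetical predicted probabilities for each sample's assigned label;
# the third sample's label is strongly contradicted by the model.
p_labels = [0.95, 0.90, 0.02, 0.85]

suspects = flag_suspects(p_labels)
weights = sample_weights(p_labels, suspects)

uniform = [1.0] * len(p_labels)
plain_loss = weighted_mean_loss(p_labels, uniform, cross_entropy)
reweighted_loss = weighted_mean_loss(p_labels, weights, cross_entropy)
robust_loss = weighted_mean_loss(p_labels, uniform, generalized_cross_entropy)
```

In this toy setting the suspected sample dominates the plain cross-entropy average, while both reweighting and the bounded loss shrink its influence. In practice the probabilities would come from a trained model, ideally from held-out or cross-validated predictions, since a network that has already memorized a noisy label will assign it high confidence.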
Key papers
- Assessing the Quality of the Datasets by Identifying Mislabeled Samples — Identifies mislabeled samples using a supervised VAE that measures outliers in latent space; no access to clean data required
Related topics
- Data quality — broader issue of training data reliability
- Learning With Noisy Labels — methods to train robust models despite noisy labels