Label noise¶

Label noise refers to incorrect or contradictory annotations in training datasets. It can arise from crowdsourcing errors, weakly supervised web scraping, automatic annotation, or genuine ambiguity in the labeling task (e.g., determining whether a tweet is misinformation when the truth is disputed). High-capacity deep neural networks can memorize noisy labels, which harms generalization and cross-domain transfer.

Key approaches¶

Noise detection: Identify and filter likely mislabeled samples before or during training using outlier detection, training dynamics, or confidence scores.
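As an illustration of the confidence-score variant, here is a minimal sketch that flags a sample when the model assigns low probability to its given label. The function name `flag_suspect_labels`, the threshold of 0.2, and the toy probabilities are all assumptions made for this example. In practice the probabilities would probably need to come from held-out or cross-validated predictions, since a network that has memorized a noisy label will be confident about it.

```python
def flag_suspect_labels(probs, labels, threshold=0.2):
    """Return indices of samples whose given label gets low predicted probability.

    probs:  list of per-class probability lists, one per sample
    labels: list of integer class labels as annotated (possibly noisy)
    threshold: hypothetical cut-off; would need tuning per dataset
    """
    suspects = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        # Low confidence in the annotated class marks the sample as suspect.
        if p[y] < threshold:
            suspects.append(i)
    return suspects


# Toy data: sample 2 is annotated as class 2, but the model prefers class 0.
probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.85, 0.05],
    [0.80, 0.10, 0.10],
    [0.30, 0.30, 0.40],
]
labels = [0, 1, 2, 2]

suspects = flag_suspect_labels(probs, labels)
print(suspects)
```

Flagged samples could then be dropped, relabeled, or passed to a human reviewer; which of these is appropriate depends on the dataset.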

Noise-robust loss functions: Design loss functions that downweight or ignore high-loss samples, on the assumption that high loss indicates either a noisy label or a hard-to-learn example.
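One well-known loss of this kind is the generalized cross-entropy, (1 - p^q) / q, where p is the predicted probability of the annotated class. It is used here only as an example and is not drawn from the papers listed on this page. The sketch below compares it with standard cross-entropy on one confident and one doubtful prediction; the value q = 0.7 and the two probabilities are illustrative assumptions.

```python
import math


def cross_entropy(p_true):
    """Standard cross-entropy for the probability given to the annotated class."""
    return -math.log(p_true)


def generalized_cross_entropy(p_true, q=0.7):
    """(1 - p^q) / q, bounded above by 1/q.

    Approaches cross-entropy as q -> 0 and equals 1 - p at q = 1.
    The default q is an illustrative choice, not a recommendation.
    """
    return (1.0 - p_true ** q) / q


confident, doubtful = 0.9, 0.01  # hypothetical predicted probabilities

# How much more a doubtful (possibly mislabeled) sample costs than a confident one.
ce_ratio = cross_entropy(doubtful) / cross_entropy(confident)
gce_ratio = generalized_cross_entropy(doubtful) / generalized_cross_entropy(confident)

print(f"cross-entropy ratio:             {ce_ratio:.1f}")
print(f"generalized cross-entropy ratio: {gce_ratio:.1f}")
```

Because the loss is bounded, a single badly mislabeled sample cannot dominate the batch the way it can under cross-entropy. The cost is that hard but correctly labeled samples are dampened in the same way.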

Sample reweighting: Assign lower importance weights to samples suspected to be noisy, allowing the model to learn primarily from clean data.
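A minimal sketch of reweighting, assuming each sample already carries a suspicion score in [0, 1] from some upstream detector. The scores, the losses, and the `1 - score` weighting rule are hypothetical choices made for illustration.

```python
def weighted_mean_loss(losses, weights):
    """Average per-sample losses, normalizing by the total weight."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all sample weights are zero")
    return sum(w * l for w, l in zip(weights, losses)) / total


# Hypothetical per-sample losses; the third sample is suspected to be mislabeled.
losses = [0.2, 0.3, 4.0, 0.1]
noise_scores = [0.05, 0.10, 0.95, 0.00]  # assumed suspicion scores in [0, 1]

# Simple rule: the more suspicious a sample, the less it contributes.
weights = [1.0 - s for s in noise_scores]

plain = sum(losses) / len(losses)
weighted = weighted_mean_loss(losses, weights)

print(f"unweighted mean loss: {plain:.3f}")
print(f"weighted mean loss:   {weighted:.3f}")
```

The suspected sample still contributes a small amount, so it is not discarded outright if the detector turns out to be wrong.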

Meta-learning: Learn a cleaning or weighting strategy that generalizes to new noisy datasets.

Key papers¶

Assessing the Quality of the Datasets by Identifying Mislabeled Samples: identifies mislabeled samples with a supervised VAE by detecting outliers in latent space; requires no access to clean data.

Related topics¶

Data quality: the broader issue of training data reliability
Learning With Noisy Labels: methods for training robust models despite noisy labels
