Skip to content
Assessing the Quality of the Datasets by Identifying Mislabeled Samples

Assessing the Quality of the Datasets by Identifying Mislabeled Samples

Authors: Vaibhav Pulastya, Gaurav Nuti, Yash Kumar Atri, Tanmoy Chakraborty Venue: arXiv, 2021 — arxiv:2109.05000

TL;DR

Deep learning models with high capacity often memorize label noise, degrading generalization. This paper proposes AQUAVS, a supervised variational autoencoder that identifies mislabeled samples via a "noise score" computed from outliers in latent space representations. The method requires no access to clean subsamples or prior knowledge of noise type, and achieves >94% accuracy on MNIST under various corruption levels.

Contributions

  • A novel algorithm using latent space outlier detection to assign quality scores to samples and identify mislabeled data points.
  • A method that operates without requiring clean validation data or prior knowledge about the type of noise in the dataset.
  • Extensive experiments on MNIST, FashionMNIST, CIFAR-10, and CIFAR-100 under uniform and systematic label-corruption regimes, showing improvements over baseline methods.

Method

AQUAVS architecture: Extends standard VAEs with an auxiliary discriminative network to jointly learn latent representations and label predictions. For an input \(x\), the model learns \(p(x, z, y) = p(z)p(x|z)p(y|z)\), where \(z\) is the latent variable and \(y\) is the predicted label.

Training objective: Combines ELBO loss (reconstruction + KL regularization) with categorical cross-entropy classification loss: $\(L = L_{\text{ELBO}} + L_{\text{cl}}\)$

Noise score computation: Assuming samples of the same class cluster in latent space, the noise score for each sample is the count of outlying latent features using median absolute deviation (MAD) with hyperparameter \(\alpha=1.5\). Algorithm 1 groups the dataset by label, computes the median and MAD of latent encodings per class, then counts how many dimensions exceed \(m_j + \alpha M_j\) for each sample.

Results

Mislabel identification

On MNIST with uniform noise up to 40%, AQUAVS achieves precision, recall, and accuracy consistently ≥0.94. Systematic noise (e.g., digit 8 mislabeled as 3) is harder to detect; performance drops more significantly on CIFAR-100 with high noise due to dataset complexity. Outperforms LabelFix and entropy-based training dynamics baselines on systematic noise tasks.

Robust training

After filtering identified mislabeled samples, classification models trained on the cleaned subset show significant improvements. On MNIST with 10–20% uniform noise, AQUAVS-filtered training even exceeds oracle (clean-only) performance, suggesting the method identifies both incorrect labels and "hard" but correct samples. CIFAR-100 shows marginal gains, indicating label quality matters less when data quantity is sufficient for complex tasks.

Connections

  • Related to Label noise in noisy-label learning; extends prior work on training dynamics and sample importance weighting.
  • Complements Data quality assessment; critical for practitioners curating high-stakes datasets.
  • Uses VAE architecture with supervised extensions, similar to semi-supervised learning frameworks.

Notes

Strengths: - Clear motivation: label noise is ubiquitous in web-scraped and crowdsourced datasets. - No reliance on clean subsets or noise-type priors—practical for real-world scenarios. - Comprehensive evaluation across datasets and noise models.

Limitations: - Noise score based on MAD is univariate and may miss multivariate outliers. - Performance degrades sharply on high-noise, high-complexity datasets (CIFAR-100 at 40% noise). - No comparison with recent meta-learning or sample-weighting methods (e.g., AUM, which uses training margins). - Assumes label noise is the only source of error; ignores feature noise or class imbalance.

For misinformation research: Dataset quality is foundational. If a fake-news detector is trained on noisily labeled news articles, it will memorize annotation errors. This method offers a practical way to identify and remove mislabeled examples before training, improving robustness and generalization.