Data quality
The reliability and correctness of training data, particularly the labels assigned by human annotators. Data quality encompasses inter-annotator agreement, label noise, systematic biases, and the overall fitness of a dataset for a modeling task. In NLP, data quality strongly affects downstream model performance: a model trained on noisy or biased labels tends to reproduce those errors.
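Inter-annotator agreement is often summarized with a chance-corrected statistic. The sketch below computes Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The function name `cohen_kappa` and the toy labels are illustrative assumptions, not taken from any of the papers listed on this page.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Hypothetical helper for illustration only.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty item set")

    n = len(labels_a)

    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal probabilities of choosing it, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    # If both annotators always use the same single label, kappa is
    # undefined (0/0); returning 1.0 here is a convention chosen for this sketch.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Toy example (made-up labels): the annotators disagree on one of six items.
annotator_1 = ["hate", "none", "none", "hate", "none", "none"]
annotator_2 = ["hate", "none", "hate", "hate", "none", "none"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))  # prints 0.667
```

Raw agreement here is 5/6, but kappa is lower because half of that agreement would be expected by chance given how often each annotator uses each label.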
Key papers
- Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter — Shows systems trained on higher-quality expert annotations substantially outperform those trained on noisier crowdsourced labels for hate speech detection
- Assessing the Quality of the Datasets by Identifying Mislabeled Samples — Proposes AQUAVS, a VAE-based method to identify and filter mislabeled samples in datasets using latent space outlier detection
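The general idea of flagging mislabeled samples as outliers in a representation space can be illustrated with a much simpler stand-in. The sketch below is not the method from the paper above: it uses no VAE and assumes embeddings are already available from some encoder. It flags a sample when its distance to its own class centroid is unusually large compared with other samples of that class. The function name `flag_suspect_labels` and the `z_threshold` parameter are hypothetical.

```python
import math
from collections import defaultdict


def flag_suspect_labels(embeddings, labels, z_threshold=2.0):
    """Return indices of samples that are outliers within their labeled class.

    Simplified stand-in for latent-space outlier detection: distance to the
    class centroid, standardized per class. Assumes `embeddings` is a list
    of equal-length numeric vectors produced by some encoder.
    """
    # Group sample indices by their assigned label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    suspects = []
    for indices in by_class.values():
        dim = len(embeddings[indices[0]])

        # Class centroid: per-dimension mean of the class's embeddings.
        centroid = [
            sum(embeddings[i][d] for i in indices) / len(indices)
            for d in range(dim)
        ]

        # Euclidean distance of each sample to its class centroid.
        dists = {i: math.dist(embeddings[i], centroid) for i in indices}
        mean = sum(dists.values()) / len(dists)
        std = math.sqrt(sum((v - mean) ** 2 for v in dists.values()) / len(dists))

        # A class whose samples are all equidistant has no outliers to flag.
        if std == 0:
            continue

        suspects.extend(i for i in indices if (dists[i] - mean) / std > z_threshold)

    return sorted(suspects)
```

Flagged indices are candidates for manual review rather than confirmed errors: a sample can be far from its class centroid because it is rare or hard, not because its label is wrong. With very small classes the standardized distance cannot exceed the threshold at all, so a check like this needs a reasonable number of samples per class.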
Related topics
- Annotation bias — how annotator backgrounds introduce systematic errors
- Crowdsourcing — tradeoff between cost and quality in data collection
- Evaluation Methodology — assessing whether a dataset is suitable for a task