Data quality¶

Data quality is the reliability and correctness of training data, particularly the labels assigned by human annotators. It encompasses inter-annotator agreement, label noise, systematic biases, and the overall fitness of a dataset for a modeling task. In NLP, data quality strongly affects downstream model performance: as a rule of thumb, a model is unlikely to be more reliable than the data it was trained on.
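
Inter-annotator agreement is often summarized with a chance-corrected statistic such as Cohen's kappa. The sketch below is a minimal illustration for two annotators, not a method taken from the papers listed on this page; the function name `cohen_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six tweets (illustrative only).
annotator_1 = ["hate", "none", "none", "hate", "none", "none"]
annotator_2 = ["hate", "none", "hate", "hate", "none", "none"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.667
```

Here the annotators agree on 5 of 6 items (0.833), but half of that agreement is expected by chance given their label frequencies, so kappa is about 0.67. Raw percent agreement alone would overstate how consistent the labels are.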

Key papers¶

Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter — Shows systems trained on higher-quality expert annotations substantially outperform those trained on noisier crowdsourced labels for hate speech detection
Assessing the Quality of the Datasets by Identifying Mislabeled Samples — Proposes AQUAVS, a VAE-based method to identify and filter mislabeled samples in datasets using latent space outlier detection
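
The general idea of flagging likely mislabeled samples as outliers in a representation space can be sketched as follows. This is not AQUAVS and does not train a VAE: it assumes the latent vectors have already been produced by some encoder, and it swaps in a simple rule (distance to the class centroid, thresholded by z-score). The function name `flag_suspect_labels`, the threshold, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def flag_suspect_labels(vectors, labels, z_threshold=2.0):
    """Return indices of samples unusually far from their own class centroid."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    suspects = []
    for idx in by_class.values():
        dim = len(vectors[idx[0]])
        # Centroid of all samples carrying this label.
        centroid = [sum(vectors[i][d] for i in idx) / len(idx) for d in range(dim)]
        dists = [math.dist(vectors[i], centroid) for i in idx]
        mean = sum(dists) / len(dists)
        std = math.sqrt(sum((x - mean) ** 2 for x in dists) / len(dists))
        if std == 0:
            continue  # all samples equally far from the centroid
        # Flag samples whose distance is far above the class average.
        suspects += [i for i, x in zip(idx, dists) if (x - mean) / std > z_threshold]
    return sorted(suspects)


# Toy 2-D "latent" vectors (assumed precomputed). Index 5 sits in the
# middle of class "b" but carries label "a".
vectors = [
    (0.0, 0.0), (0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (5.0, 5.0),
    (5.0, 5.0), (5.1, 5.0), (4.9, 5.0), (5.0, 5.1), (5.0, 4.9),
]
labels = ["a"] * 6 + ["b"] * 5
suspects = flag_suspect_labels(vectors, labels)
print(suspects)  # [5]
```

A centroid rule like this is only a stand-in: it pulls the centroid toward the very outliers it is trying to find and assumes each class forms a single cluster, which is part of why learned latent spaces are used in practice.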

Related topics¶

Annotation bias — how annotator backgrounds introduce systematic errors
Crowdsourcing — tradeoff between cost and quality in data collection
Evaluation Methodology — assessing whether a dataset is suitable for a task