Dataset curation
Dataset curation encompasses the collection, labeling, cleaning, and maintenance of research datasets. For misinformation and fake news research, curators must decide what counts as misinformation, how to obtain ground-truth labels, how to handle edge cases (satire, opinion, disputed claims), and how to ensure reproducibility. As datasets grow, manual inspection becomes infeasible, so automated methods for identifying and removing mislabeled or low-quality samples become essential.
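One common heuristic for automated screening can be sketched as follows: flag a sample when a classifier's predicted probability for its assigned label is very low and the classifier prefers a different class. This is only an illustrative sketch, not the method of any specific paper listed on this page. The function name `flag_suspect_labels`, the `threshold` value, and the input format (one probability dictionary per sample, assumed to come from a model that did not see that sample during training) are all assumptions.

```python
def flag_suspect_labels(labels, probs, threshold=0.2):
    """Return samples whose assigned label looks doubtful.

    labels: list of assigned class names, one per sample.
    probs:  list of dicts mapping class name -> predicted probability
            (hypothetical input, e.g. out-of-fold model predictions).
    threshold: assumed cutoff; a sample is flagged when the probability
            of its assigned label falls below this value.
    """
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")

    suspects = []
    for index, (label, class_probs) in enumerate(zip(labels, probs)):
        given = class_probs.get(label, 0.0)
        predicted = max(class_probs, key=class_probs.get)
        # Flag only when the model disagrees AND is confident about it.
        if predicted != label and given < threshold:
            suspects.append((index, label, predicted, given))
    return suspects


labels = ["fake", "real", "real"]
probs = [
    {"fake": 0.90, "real": 0.10},
    {"fake": 0.95, "real": 0.05},
    {"fake": 0.45, "real": 0.55},
]
suspects = flag_suspect_labels(labels, probs)
```

Flagged samples are candidates for human review rather than automatic removal, since a confident model can still be wrong on satire or disputed claims.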
Key considerations
- Labeling quality: Multi-annotator agreement, expert vs. crowdsourced labels, inter-rater reliability metrics.
- Scalability: Automated and semi-automated approaches to maintain quality as datasets grow.
- Temporal drift: Labels assigned today may be outdated or incorrect tomorrow as the ground truth or context evolves.
- Reproducibility: Documenting annotation guidelines, disagreement resolution procedures, and data splits enables researchers to replicate and build on prior work.
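As an example of an inter-rater reliability metric, Cohen's kappa compares the agreement observed between two annotators with the agreement expected by chance. The sketch below is a minimal pure-Python version; the function name `cohens_kappa` and the sample labels are illustrative assumptions, not data from any dataset discussed here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")

    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations for four news items.
annotator_1 = ["fake", "fake", "real", "real"]
annotator_2 = ["fake", "real", "real", "real"]
kappa = cohens_kappa(annotator_1, annotator_2)  # observed 0.75, expected 0.5
```

A kappa near 1 indicates strong agreement, while a value near 0 means the annotators agree about as often as chance would predict, which usually signals that the annotation guidelines need revision.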
Key papers
- Assessing the Quality of the Datasets by Identifying Mislabeled Samples — Proposes automated methods to identify mislabeled samples in large datasets, improving curation efficiency
Related topics
- Data quality — quality assurance in training data
- Annotation Methodology — how to assign labels reliably
- Crowdsourcing — using crowd workers to label data at scale