Dataset curation¶

Dataset curation encompasses the collection, labeling, cleaning, and maintenance of research datasets. For misinformation and fake news research, curators must decide what counts as misinformation, how to obtain ground-truth labels, how to handle edge cases (satire, opinion, disputed claims), and how to ensure reproducibility. As datasets grow, manual inspection becomes infeasible, so automated methods for identifying and removing mislabeled or low-quality samples become essential.

Key considerations¶

Labeling quality: Multi-annotator agreement, expert vs. crowdsourced labels, and inter-rater reliability metrics such as Cohen's kappa or Krippendorff's alpha.

Scalability: Automated and semi-automated approaches to maintain quality as datasets grow.

Temporal drift: Labels assigned today may be outdated or incorrect tomorrow as the ground truth or context evolves.

Reproducibility: Documenting annotation guidelines, disagreement resolution procedures, and data splits enables researchers to replicate and build on prior work.
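
The inter-rater reliability metrics listed under labeling quality can be made concrete with a small example. The sketch below computes Cohen's kappa for two annotators, assuming both labeled the same items in the same order. The function name `cohens_kappa` and the sample annotations are hypothetical and used only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations for six claims (illustrative data only).
annotator_1 = ["fake", "real", "fake", "satire", "real", "fake"]
annotator_2 = ["fake", "real", "real", "satire", "real", "fake"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is usually more informative than a plain agreement percentage when label classes are imbalanced. For more than two annotators or for missing annotations, a metric such as Krippendorff's alpha may be a better fit.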

Key papers¶

Assessing the Quality of the Datasets by Identifying Mislabeled Samples: proposes automated methods for identifying mislabeled samples in large datasets, which improves curation efficiency.
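
As a rough illustration of what automated mislabel detection can look like, the sketch below flags samples whose label disagrees with the majority label of their nearest neighbors. This is a generic neighbor-disagreement heuristic chosen for brevity, not the method of the paper above. The function name `flag_suspect_labels`, the parameter `k`, and the toy two-dimensional features are all assumptions; in practice the features would more likely be text embeddings.

```python
import math
from collections import Counter


def flag_suspect_labels(features, labels, k=3):
    """Return indices whose label differs from the majority of their k nearest neighbors."""
    suspects = []
    for i, (x, y) in enumerate(zip(features, labels)):
        # Distance from sample i to every other sample (leave-one-out).
        others = [
            (math.dist(x, features[j]), labels[j])
            for j in range(len(features))
            if j != i
        ]
        others.sort(key=lambda pair: pair[0])
        neighbor_labels = [label for _, label in others[:k]]
        majority, _ = Counter(neighbor_labels).most_common(1)[0]
        if majority != y:
            suspects.append(i)
    return suspects


# Toy data (hypothetical): two well-separated clusters, with sample 3
# carrying a label that does not match its cluster.
features = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (4.9, 5.0),
]
labels = ["real", "real", "real", "fake", "fake", "fake", "fake", "fake"]
suspects = flag_suspect_labels(features, labels, k=3)
print(suspects)
```

Flagged samples are candidates for human review rather than automatic removal, since a label that disagrees with its neighbors may also be a legitimate edge case such as satire or a disputed claim.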

Related topics¶

Data quality: quality assurance in training data
Annotation methodology: how to assign labels reliably
Crowdsourcing: using crowd workers to label data at scale
