Dataset curation

Dataset curation encompasses the collection, labeling, cleaning, and maintenance of research datasets. For misinformation and fake news research, curators must decide what counts as misinformation, how to obtain ground-truth labels, how to handle edge cases (satire, opinion, disputed claims), and how to ensure reproducibility. As datasets grow, manual inspection of every sample becomes infeasible, so automated methods for identifying and removing mislabeled or low-quality samples become essential.
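As a rough illustration of automated label auditing, the sketch below flags samples whose assigned label a classifier finds implausible. It is one simple heuristic, loosely in the spirit of confidence-threshold approaches, and not a method prescribed here. The function name `flag_suspect_labels` and the input format are assumptions. The probabilities are assumed to be out-of-fold predictions from some classifier trained on the same label set.

```python
from collections import defaultdict


def flag_suspect_labels(labels, probs):
    """Return indices of samples whose assigned label looks doubtful.

    labels: list of int class ids (the labels being audited).
    probs:  list of per-sample probability lists, assumed to be
            out-of-fold predictions from some classifier.
    """
    # Per-class threshold: the mean probability the model assigns to
    # class c over the samples that are labeled c.
    by_class = defaultdict(list)
    for y, p in zip(labels, probs):
        by_class[y].append(p[y])
    threshold = {c: sum(v) / len(v) for c, v in by_class.items()}

    suspects = []
    for i, (y, p) in enumerate(zip(labels, probs)):
        # Other classes for which the model is at least as confident
        # as it typically is on samples of that class.
        confident_others = [
            c for c in threshold if c != y and p[c] >= threshold[c]
        ]
        # Flag only when the given label is weakly supported AND a
        # different class is strongly supported.
        if p[y] < threshold[y] and confident_others:
            suspects.append(i)
    return suspects


# Hypothetical toy data: class 0 and class 1, six samples.
labels = [0, 0, 0, 1, 1, 1]
probs = [
    [0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
    [0.2, 0.8], [0.1, 0.9], [0.3, 0.7],
]
print(flag_suspect_labels(labels, probs))  # prints [2]
```

Flagged samples are candidates for human review rather than automatic deletion, since a confident model can still be wrong on satire, opinion, or disputed claims.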

Key considerations

Labeling quality: Multi-annotator agreement, the choice between expert and crowdsourced labels, and inter-rater reliability metrics such as Cohen's kappa.

Scalability: Automated and semi-automated approaches to maintain quality as datasets grow.

Temporal drift: Labels assigned today may be outdated or incorrect tomorrow as the ground truth or context evolves.

Reproducibility: Documenting annotation guidelines, disagreement resolution procedures, and data splits enables researchers to replicate and build on prior work.
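To make the inter-rater reliability point under labeling quality concrete, here is a minimal sketch of Cohen's kappa for two annotators, which corrects raw agreement for the agreement expected by chance. The function name and the "fake"/"real" labels are illustrative assumptions. Returning 1.0 in the degenerate case is a convention chosen for this sketch, because kappa is formally undefined there.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n

    # Chance agreement: probability both pick the same label if each
    # annotator labeled independently with their own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout;
        # kappa is undefined, so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations for four news items.
annotator_1 = ["fake", "fake", "real", "real"]
annotator_2 = ["fake", "real", "real", "real"]
print(cohens_kappa(annotator_1, annotator_2))  # prints 0.5
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5. Reporting such a score alongside the annotation guidelines also supports the reproducibility goal above.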

Key papers