Crowdsourcing¶

Crowdsourcing in NLP refers to collecting human annotations and judgments from geographically distributed, often non-expert workers via platforms such as Amazon Mechanical Turk. It enables large-scale annotation and evaluation but introduces challenges around quality control, inter-annotator agreement, and demographic representation.
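
Inter-annotator agreement is commonly summarized with a chance-corrected statistic. The sketch below computes Cohen's kappa for two annotators who labeled the same items. It is a minimal illustration, not tied to any paper on this page; the function name `cohen_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items (illustrative helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(annotator_1, annotator_2))
```

Kappa covers only the two-annotator case; with many crowdworkers per item, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha is usually a better fit.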

Typical applications¶

Data annotation: Labeling datasets for training (e.g., stance, sentiment, factuality)
Evaluation: Collecting human judgments of system outputs (e.g., translation quality, generated text naturalness)
Quality control: Detecting low-quality crowdworkers and aggregating noisy judgments
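
The quality-control item above can be sketched with the simplest baseline: aggregate labels by majority vote, then score each worker by how often they agree with the consensus. This is an assumed baseline for illustration, not the method of any paper listed on this page; `majority_vote`, `worker_agreement`, and the `(worker_id, item_id, label)` tuple layout are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(judgments):
    """Return {item_id: most frequent label} from (worker_id, item_id, label) tuples."""
    labels_by_item = defaultdict(list)
    for _worker, item, label in judgments:
        labels_by_item[item].append(label)
    # Ties fall to the label seen first for that item.
    return {
        item: Counter(labels).most_common(1)[0][0]
        for item, labels in labels_by_item.items()
    }


def worker_agreement(judgments, consensus):
    """Return {worker_id: fraction of that worker's labels matching the consensus}."""
    hits = Counter()
    totals = Counter()
    for worker, item, label in judgments:
        totals[worker] += 1
        hits[worker] += int(label == consensus[item])
    return {worker: hits[worker] / totals[worker] for worker in totals}


judgments = [
    ("w1", "t1", "pos"), ("w2", "t1", "pos"), ("w3", "t1", "neg"),
    ("w1", "t2", "neg"), ("w2", "t2", "neg"), ("w3", "t2", "pos"),
]
consensus = majority_vote(judgments)
print(consensus)
print(worker_agreement(judgments, consensus))
```

Workers with low agreement scores are candidates for review rather than automatic exclusion, since disagreement with the majority can also reflect genuine ambiguity or a minority perspective.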

Key papers¶

Waseem (2016), Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter: Compares expert and crowdsourced annotations for hate speech detection, revealing significant quality gaps and systematic labeling biases among crowdworkers.
Tacchini et al. (2017), Some Like it Hoax: Adapts the harmonic algorithm for boolean label crowdsourcing (BLC) to hoax detection on Facebook. Demonstrates that harmonic BLC can reliably aggregate implicit "votes" (likes/follows) to achieve >99% accuracy even when training labels are sparse (<1% of posts).
Clark et al. (2021), All That's 'Human' Is Not Gold: A large-scale crowdsourced study (1,170 workers via Amazon Mechanical Turk) evaluating humans' ability to detect machine-generated text across three domains.
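
To give a feel for how implicit votes such as likes can be aggregated from a handful of labeled posts, here is a simplified reputation-propagation sketch in the spirit of harmonic boolean label crowdsourcing. It is not the exact algorithm from Tacchini et al. (2017): the update rule, the prior pseudo-counts of 1.0, the fixed iteration count, and all names (`propagate_reputation`, `likes`, `known_labels`) are assumptions made for illustration.

```python
def propagate_reputation(likes, known_labels, iterations=5):
    """Propagate labels over a user-post like graph (simplified, assumed update rule).

    likes: dict mapping user id -> set of post ids the user liked.
    known_labels: dict mapping post id -> +1.0 (not a hoax) or -1.0 (hoax).
    Returns (post_quality, user_quality), each with values in [-1, 1].
    """
    posts = {post for liked in likes.values() for post in liked}
    # Unlabeled posts start neutral; labeled posts keep their label throughout.
    post_quality = {post: known_labels.get(post, 0.0) for post in posts}
    user_quality = {user: 0.0 for user in likes}
    likers = {post: [u for u, liked in likes.items() if post in liked] for post in posts}

    def quality(scores):
        # Beta-style pseudo-counts with an assumed prior of 1.0 on each side.
        alpha = 1.0 + sum(s for s in scores if s > 0)
        beta = 1.0 - sum(s for s in scores if s < 0)
        return (alpha - beta) / (alpha + beta)

    for _ in range(iterations):
        # A user's quality reflects the posts they liked.
        for user, liked in likes.items():
            user_quality[user] = quality([post_quality[p] for p in liked])
        # An unlabeled post's quality reflects the users who liked it.
        for post in posts:
            if post not in known_labels:
                post_quality[post] = quality([user_quality[u] for u in likers[post]])
    return post_quality, user_quality


likes = {
    "u1": {"h1", "h2"}, "u2": {"h1", "h2"},
    "u3": {"r1", "r2"}, "u4": {"r1", "r2"},
}
known_labels = {"h1": -1.0, "r1": 1.0}
post_quality, user_quality = propagate_reputation(likes, known_labels)
# Posts with negative quality would be flagged as likely hoaxes.
print({post: score < 0 for post, score in post_quality.items()})
```

In this toy graph, the two unlabeled posts inherit the sign of the labeled post that shares their likers, which is the intuition behind learning from very sparse labels.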

Related topics¶

Human evaluation: broader framework for human assessment
NLP evaluation: evaluation methodology in NLP