Crowdsourcing
Crowdsourcing in NLP refers to collecting human annotations and judgments from geographically distributed, often non-expert workers via platforms such as Amazon Mechanical Turk. It enables large-scale annotation and evaluation but introduces challenges around quality control, inter-annotator agreement, and demographic representation.
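Inter-annotator agreement is often summarized with a chance-corrected statistic. The following is a minimal sketch of Cohen's kappa for two annotators who labeled the same items; the function name `cohen_kappa` and the toy labels are illustrative assumptions, not taken from any of the papers listed on this page.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement under independence, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (made-up labels): the annotators agree on 4 of 6 items.
worker_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
worker_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohen_kappa(worker_1, worker_2)
print(f"kappa = {kappa:.3f}")
```

Here raw agreement is 4/6, but half of that would be expected by chance given the balanced label frequencies, so kappa comes out well below the raw rate. For more than two annotators, statistics such as Fleiss' kappa or Krippendorff's alpha are the usual alternatives.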
Typical applications
- Data annotation: Labeling datasets for training (e.g., stance, sentiment, factuality)
- Evaluation: Collecting human judgments of system outputs (e.g., translation quality, generated text naturalness)
- Quality control: Detecting low-quality crowdworkers and aggregating noisy judgments
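As a rough illustration of the quality-control step, the sketch below aggregates noisy judgments by majority vote and then flags workers who rarely agree with the consensus. It is a simple baseline under stated assumptions: the `(worker_id, item_id, label)` tuple format, the function names, the toy data, and the 0.5 threshold are all hypothetical, and more robust aggregation methods exist.

```python
from collections import Counter, defaultdict


def majority_vote(judgments):
    """Return {item_id: most frequent label} from (worker_id, item_id, label) tuples."""
    by_item = defaultdict(list)
    for _worker, item, label in judgments:
        by_item[item].append(label)
    # Ties go to the label seen first for that item; an odd number of workers avoids ties.
    return {item: Counter(labels).most_common(1)[0][0] for item, labels in by_item.items()}


def worker_agreement(judgments, consensus):
    """Return {worker_id: fraction of that worker's labels matching the consensus}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for worker, item, label in judgments:
        totals[worker] += 1
        hits[worker] += int(label == consensus[item])
    return {worker: hits[worker] / totals[worker] for worker in totals}


# Toy data (made up): three workers label four items; w3 mostly disagrees with the others.
judgments = [
    ("w1", "i1", "pos"), ("w1", "i2", "neg"), ("w1", "i3", "pos"), ("w1", "i4", "neg"),
    ("w2", "i1", "pos"), ("w2", "i2", "neg"), ("w2", "i3", "pos"), ("w2", "i4", "neg"),
    ("w3", "i1", "neg"), ("w3", "i2", "pos"), ("w3", "i3", "pos"), ("w3", "i4", "pos"),
]

consensus = majority_vote(judgments)
agreement = worker_agreement(judgments, consensus)

THRESHOLD = 0.5  # arbitrary cutoff chosen for this illustration
flagged = sorted(worker for worker, rate in agreement.items() if rate < THRESHOLD)
print(consensus)
print(agreement)
print("flagged:", flagged)
```

Agreement with the majority is only a proxy for quality: a worker who disagrees may be careless, or may be an expert seeing something the majority misses, so flagged workers are usually checked against gold-labeled items before being excluded.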
Key papers
- Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter — Compares expert vs. crowdsourced annotations for hate speech detection, revealing significant quality gaps and systematic labeling biases in crowdworkers
- Tacchini et al. (2017) — Some Like it Hoax — Adapts the harmonic algorithm for boolean label crowdsourcing to hoax detection on Facebook. Demonstrates that harmonic BLC can reliably aggregate implicit "votes" (likes/follows) to achieve >99% accuracy even when training labels are sparse (<1% of posts).
- Clark et al. (2021) — All That's 'Human' Is Not Gold: Large-scale crowdsourced study (1,170 workers via Amazon Mechanical Turk) evaluating humans' ability to detect machine-generated text across three domains.
Related topics
- Human evaluation — broader framework for human assessment
- NLP evaluation — evaluation methodology in NLP