Human evaluation

Human evaluation—asking people to judge the quality, truthfulness, or characteristics of system outputs—remains a critical validation method in NLP, despite being costly and subject to inter-annotator disagreement, bias, and limited scalability.

Scope

Human evaluation covers multiple dimensions:

  • Correctness / Accuracy: Does the system output match ground truth or expert judgment?
  • Quality / Fluency: Is the output grammatical, coherent, and natural-sounding?
  • Factuality: Are claims in the output supported by evidence?
  • Trustworthiness / Credibility: Would a reader believe or trust the output?
  • Detectability: Can humans distinguish system-generated outputs from human-written alternatives?

Evaluation populations

  • Expert annotators: Domain specialists (e.g., fact-checkers, linguists) with detailed instructions; typically 1–10 annotators per task; high cost, high reliability.
  • Crowdworkers: Large populations (hundreds to thousands of workers) recruited via platforms like Amazon Mechanical Turk; lower cost but noisier judgments, requiring aggregation and quality control.
  • Trained raters: An intermediate approach in which domain-adjacent workers (e.g., NLP students for language quality tasks) receive brief training (10–50 examples) before annotating.
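
The aggregation and quality control mentioned for crowdworkers can be sketched as follows. This is a minimal illustration under stated assumptions, not a procedure taken from any of the papers listed on this page: the function names, the (item_id, worker_id, label) tuple layout, the gold-question filter, and the 0.7 accuracy threshold are all hypothetical choices.

```python
from collections import Counter, defaultdict


def filter_workers(judgments, gold, min_accuracy=0.7):
    """Keep only judgments from workers who pass a gold-question check.

    judgments: list of (item_id, worker_id, label) tuples (assumed layout).
    gold: dict mapping item_id -> known correct label for check items.
    Workers who answered no gold items are dropped, since their
    reliability cannot be estimated.
    """
    correct = Counter()
    seen = Counter()
    for item_id, worker_id, label in judgments:
        if item_id in gold:
            seen[worker_id] += 1
            correct[worker_id] += int(label == gold[item_id])
    trusted = {w for w in seen if correct[w] / seen[w] >= min_accuracy}
    return [j for j in judgments if j[1] in trusted]


def majority_vote(judgments):
    """Aggregate labels per item by majority vote.

    Returns a dict item_id -> label, with None when the top two
    labels are tied (such items may need another annotation round).
    """
    by_item = defaultdict(list)
    for item_id, _worker_id, label in judgments:
        by_item[item_id].append(label)

    aggregated = {}
    for item_id, labels in by_item.items():
        ranked = Counter(labels).most_common()
        tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        aggregated[item_id] = None if tied else ranked[0][0]
    return aggregated
```

Majority voting is the simplest aggregation rule; it treats every retained worker as equally reliable. Returning None for ties, rather than picking a label arbitrarily, keeps ambiguous items visible for later qualitative analysis.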

Challenges and biases

  • Disagreement: Humans disagree on subjective dimensions (quality, credibility); kappa < 0.70 is common for non-trivial tasks.
  • Demographic effects: Background, age, education, and cultural context influence judgment of credibility and trustworthiness.
  • Anchoring and framing: The order of presentation and phrasing of questions biases judgments.
  • Expertise effects: Expert annotators often disagree systematically with crowdworkers on subjective tasks.
  • Fatigue and attention: Large-scale human evaluation studies suffer from attention degradation over time.
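
To make the disagreement figure concrete, the sketch below computes Cohen's kappa for two annotators who labeled the same items. Kappa corrects the raw agreement rate for the agreement expected by chance given each annotator's label frequencies. The function name and the example labels are illustrative assumptions; established statistics libraries provide tested implementations that should be preferred in practice.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the agreement expected by chance
    from each rater's marginal label distribution.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")

    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four text excerpts.
a = ["human", "human", "machine", "machine"]
b = ["human", "machine", "machine", "machine"]
print(cohens_kappa(a, b))  # observed 0.75, expected 0.5 -> kappa 0.5
```

In this example the annotators agree on three of four items, yet kappa is only 0.5, because half of that agreement would be expected by chance alone.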

Key papers

  • Dugan et al. (2022) — Real or Fake Text?: Boundary Detection: Game-based human evaluation with 243 participants and 21,000+ annotations on a boundary detection task (where does human-written text transition to machine-generated text?). Demonstrates substantial skill variance (top 10% score 3× better than bottom quartile); shows that monetary incentives improve learning while unincentivized participants plateau; analyzes genre-specific detection patterns and error types across News, Stories, Recipes, and Speeches.
  • Clark et al. (2021) — All That's 'Human' Is Not Gold: Large-scale evaluation (1,170 crowdworkers, 3 domains) showing that untrained evaluators perform at chance level when distinguishing GPT-2/GPT-3 text from human text; tests three training methods (instructions, examples, comparisons) and finds that example-based training improves accuracy modestly (50% → 55%); reveals that evaluators focus on surface features rather than content.
  • Ippolito et al. (2019) — Detection is Easiest when Humans are Fooled: Compares expert raters (71% accuracy on 192-token excerpts) with crowdworkers (51% accuracy); studies effect of training and excerpt length; shows humans are more robust across decoding strategies than automatic systems but less accurate overall.
  • Gehrmann et al. (2019) — GLTR: Human-subjects study (35 students) demonstrating that GLTR visualization improves detection accuracy from 54% to 72%; shows interactive tools can teach humans to recognize artifacts.
  • Zellers et al. (2019) — Grover: Large-scale human credibility evaluation (600+ annotators) of Grover-generated news vs. human-written fake news and real articles; finds that humans rate Grover-generated articles as more credible than human-written fake news (2.42/3 vs. 2.19/3) but less credible than real news.

Best practices

  • Multiple populations: Compare expert vs. crowdworker judgments to understand how reliability varies.
  • Agreement and confidence intervals: Report inter-annotator agreement (Cohen's kappa, Fleiss' kappa, Krippendorff's alpha) and confidence bounds on accuracy.
  • Qualitative analysis: Sample disagreement cases and error analyses to understand failure modes.
  • Reproducibility: Publish annotation guidelines, examples, and ideally the full annotation dataset.
  • Scale studies: Test whether results from 10–50 annotators generalize to larger, more diverse populations.
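
As an illustration of reporting confidence bounds on accuracy, the sketch below computes a Wilson score interval for a detection accuracy estimated from a finite number of judgments. The choice of the Wilson interval, the function name, and the example counts are assumptions made for illustration; the studies listed above may have used other methods.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    correct: number of correct judgments.
    total: number of judgments (must be positive).
    z: standard normal quantile; 1.96 gives roughly 95% coverage.
    Returns (lower, upper) bounds on the true accuracy.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")

    p = correct / total
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical study: 55 correct judgments out of 100.
low, high = wilson_interval(55, 100)
print(f"accuracy 0.55, 95% CI [{low:.3f}, {high:.3f}]")
```

With 100 judgments, an observed accuracy of 55% yields an interval of roughly 0.45 to 0.64, which still includes chance performance (0.50). This is one reason small differences in detection accuracy need adequate sample sizes before they can be interpreted.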