All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text

Authors: Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, Noah A. Smith

Venue: arXiv, 2021 — arXiv:2107.00061

TL;DR

Human evaluators cannot reliably distinguish machine-generated text from human-written text without training, focusing instead on superficial features like grammar and style. Three training approaches (instructions, examples, and comparisons) improve evaluator accuracy modestly, with example-based training showing the most improvement across multiple text domains.

Contributions

  • Demonstrates that untrained human evaluators identify GPT2-generated text only slightly above chance and GPT3-generated text at chance level
  • Analyzes what aspects of text evaluators focus on when making judgments (grammar, spelling, style vs. content quality)
  • Proposes and evaluates three lightweight training methods to improve evaluator detection accuracy
  • Provides evidence that consistent human evaluation methodology is crucial for reliable NLG model assessment

Method

The authors conduct a large-scale evaluation study with 1,170 approved human evaluators (using Amazon Mechanical Turk) assessing text from three domains: stories, news articles, and recipes. Evaluators rate whether each text passage is "definitely human-written," "possibly human-written," "possibly machine-generated," or "definitely machine-generated" on a 4-point scale.
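
To make the scoring concrete, here is a minimal sketch of how 4-point ratings like these could be turned into an accuracy figure. It assumes that the two "human" options and the two "machine" options are each collapsed into one binary guess; the label strings, `to_binary`, and `accuracy` are hypothetical names chosen for illustration, not the authors' code.

```python
# Hypothetical label strings mirroring the 4-point scale described above.
HUMAN_LABELS = {"definitely human-written", "possibly human-written"}
MACHINE_LABELS = {"possibly machine-generated", "definitely machine-generated"}


def to_binary(rating):
    """Collapse a 4-point rating into a binary guess (an assumed scoring rule)."""
    if rating in HUMAN_LABELS:
        return "human"
    if rating in MACHINE_LABELS:
        return "machine"
    raise ValueError(f"unknown rating: {rating!r}")


def accuracy(ratings, true_sources):
    """Fraction of ratings whose binary guess matches the true source."""
    if not ratings or len(ratings) != len(true_sources):
        raise ValueError("ratings and true_sources must be non-empty and equal in length")
    correct = sum(to_binary(r) == t for r, t in zip(ratings, true_sources))
    return correct / len(ratings)
```

Under this rule, an evaluator who guesses at random would be expected to land near 0.5, which is the chance baseline the results are compared against.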

For the baseline (no training), each evaluator assesses 5 text passages. The machine-generated passages come from two sets, 50 from GPT2 and 50 from GPT3, conditioned on identical priming texts.

Three training methods are tested:

Instruction-based training: Evaluators receive explicit instructions highlighting dimensions of text that reveal machine generation (e.g., repetition, factuality, style consistency, genre adherence).

Example-based training: Evaluators see 3 practice rounds with labeled examples of machine vs. human text, receiving corrective feedback and explanations.

Comparison-based training: Evaluators compare paired texts (one human, one machine) written from the same prompt and identify which is machine-generated.

The authors also annotate 150 evaluator responses to understand what aspects of text evaluators cite when justifying their judgments, categorizing comments into form (grammar, level-of-detail, genre), content (factuality, consistency, common sense, coherence), and machine capabilities.

Results

Overall, evaluators choosing between human and GPT2-generated text correctly identified the author only 57% of the time (vs. 50% random baseline). For GPT3, accuracy dropped to 50%—at chance level.

Breaking down by training condition:

  • No training: 50% overall accuracy
  • Instruction-based: 52% overall accuracy
  • Example-based: 55% overall accuracy (statistically significant improvement)
  • Comparison-based: 53% overall accuracy
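
One way to check whether a gap between two accuracies like these could be due to chance is a two-proportion z-test. The sketch below is an assumption for illustration: the summary does not say which significance test the authors used, and the function name and any counts passed to it are hypothetical.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for a difference between two accuracies.

    Returns (z, p_value). Assumes independent annotations, which is a
    simplification when each evaluator contributes several ratings.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Pooled accuracy under the null hypothesis of no difference.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal tail, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 500 of 1000 correct without training,
# 550 of 1000 correct with example-based training.
z, p = two_proportion_z_test(500, 1000, 550, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Because a few percentage points only become detectable with many annotations, the size of each condition matters as much as the raw accuracy gap.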

Example-based training showed the highest gains, though improvements varied by domain: stories saw 62% accuracy, news 65%, and recipes 55%.

When justifying their decisions, evaluators commented on form nearly twice as often as on content (47% vs. 25% of comments). The most common focus was spelling, grammar, punctuation, and style (45 of 150 comments). Despite training, evaluator agreement remained low (Krippendorff's α ≈ 0.11).
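
For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Krippendorff's α for nominal labels, built from the standard coincidence-matrix definition. Treating the ratings as nominal is an assumption made here for illustration (the summary does not state which distance metric the authors used), and the function name and input layout are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists: one inner list of labels per rated item,
    containing the labels given by the evaluators who rated that item.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating carries no agreement information
        # Each ordered pair of ratings from different evaluators counts 1/(m-1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected
```

A value of 1 means perfect agreement and 0 means agreement no better than expected by chance, so a value near 0.11 indicates that evaluators rarely converged on the same judgment for the same text.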

Notes

Strengths:

  • Large-scale crowdsourced study (1,170 evaluators, 5,850 annotations total) provides robust empirical evidence
  • Systematic annotation of evaluator reasoning reveals misconceptions about model capabilities (e.g., evaluators overestimating what models can/cannot do)
  • Tests multiple training paradigms with clear operational differences
  • Clear practical implications for NLG researchers collecting human judgments

Weaknesses:

  • Improvements are modest and inconsistent across domains; example-based training only reaches 55% accuracy overall
  • The paper does not deeply explore why training effects remain limited: whether evaluators lack sufficient expertise or whether human evaluation of fluency-only tasks is inherently difficult
  • Findings are specific to open-ended generation domains; structured tasks may see different patterns
  • Limited discussion of why evaluator agreement is so low even after training

Open questions:

  • Can more sophisticated or domain-expert training push evaluator accuracy higher?
  • Do the low inter-rater reliabilities observed here generalize to other text-quality assessments?
  • How do findings transfer to detecting generated text in higher-stakes settings (e.g., academic papers, news)?