All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
Authors: Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, Noah A. Smith
Venue: arXiv, 2021 — arXiv:2107.00061
TL;DR
Human evaluators cannot reliably distinguish machine-generated text from human-written text without training, focusing instead on superficial features like grammar and style. Three training approaches (instructions, examples, and comparisons) improve evaluator accuracy modestly, with example-based training showing the most improvement across multiple text domains.
Contributions
- Demonstrates that untrained human evaluators detect GPT2-generated text only slightly better than chance and GPT3-generated text no better than chance
- Analyzes what aspects of text evaluators focus on when making judgments (grammar, spelling, style vs. content quality)
- Proposes and evaluates three lightweight training methods to improve evaluator detection accuracy
- Provides evidence that consistent human evaluation methodology is crucial for reliable NLG model assessment
Method
The authors conduct a large-scale evaluation study with 1,170 approved human evaluators (using Amazon Mechanical Turk) assessing text from three domains: stories, news articles, and recipes. Evaluators rate whether each text passage is "definitely human-written," "possibly human-written," "possibly machine-generated," or "definitely machine-generated" on a 4-point scale.
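To report a single accuracy figure from a 4-point scale, the ratings have to be reduced to a human/machine decision. The sketch below shows one way to do that. The collapsing rule (both "possibly" and "definitely" count toward the same side) is an assumption of this sketch, and the function names are hypothetical, not taken from the paper.

```python
from typing import List

# Label strings mirror the 4-point scale described above.
# Grouping them into two sides is an assumption of this sketch.
MACHINE_RATINGS = {"possibly machine-generated", "definitely machine-generated"}
HUMAN_RATINGS = {"possibly human-written", "definitely human-written"}


def rating_to_binary(rating: str) -> str:
    """Collapse a 4-point rating into 'human' or 'machine'."""
    normalized = rating.strip().lower()
    if normalized in MACHINE_RATINGS:
        return "machine"
    if normalized in HUMAN_RATINGS:
        return "human"
    raise ValueError(f"unknown rating: {rating!r}")


def accuracy(ratings: List[str], gold: List[str]) -> float:
    """Fraction of ratings whose collapsed label matches the true author."""
    if len(ratings) != len(gold) or not ratings:
        raise ValueError("ratings and gold must be non-empty and equal length")
    correct = sum(rating_to_binary(r) == g for r, g in zip(ratings, gold))
    return correct / len(ratings)
```

Under this rule a hesitant "possibly" and a confident "definitely" score the same, so the accuracy figure hides how certain evaluators were.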
For the baseline (no training), each evaluator assesses 5 text passages, drawn from sets that include 50 GPT2 and 50 GPT3 generations conditioned on identical priming texts.
Three training methods are tested:
Instruction-based training: Evaluators receive explicit instructions highlighting dimensions of text that reveal machine generation (e.g., repetition, factuality, style consistency, genre adherence).
Example-based training: Evaluators see 3 practice rounds with labeled examples of machine vs. human text, receiving corrective feedback and explanations.
Comparison-based training: Evaluators compare paired texts (one human, one machine) written from the same prompt and identify which is machine-generated.
The authors also annotate 150 evaluator responses to understand what aspects of text evaluators cite when justifying their judgments, categorizing comments into form (grammar, level-of-detail, genre), content (factuality, consistency, common sense, coherence), and machine capabilities.
Results
Overall, evaluators choosing between human and GPT2-generated text correctly identified the author only 57% of the time (vs. a 50% random baseline). For GPT3, accuracy dropped to 50%, which is chance level.
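Whether an accuracy such as 57% is distinguishable from the 50% chance baseline depends on how many judgments it is based on. The sketch below runs an exact two-sided binomial test using only the standard library. The judgment counts in the usage lines are hypothetical and chosen for illustration; they are not the paper's sample sizes, and this is not necessarily the test the authors used.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value against a null success rate p.

    Sums the probability of every outcome that is no more likely
    than the observed one under the null hypothesis.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    pmf = [
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob for prob in pmf if prob <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts: the same 57% rate at two different sample sizes.
p_small = binomial_two_sided_p(57, 100)
p_large = binomial_two_sided_p(228, 400)
```

With these hypothetical counts the smaller sample does not reach the conventional 0.05 threshold and the larger one does, which is why an accuracy figure should be read together with the number of annotations behind it.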
Breaking down by training condition:
- No training: 50% overall accuracy
- Instruction-based: 52% overall accuracy
- Example-based: 55% overall accuracy (statistically significant improvement)
- Comparison-based: 53% overall accuracy
Example-based training showed the highest gains, though improvements varied by domain: stories saw 62% accuracy, news 65%, and recipes 55%.
When justifying their decisions, evaluators made nearly twice as many comments about form (47%) as about content (25%). The largest group of comments concerned spelling, grammar, punctuation, and style (45 of 150 comments). Despite training, evaluator agreement remained low (Krippendorff's α ≈ 0.11).
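For readers unfamiliar with Krippendorff's α, the sketch below computes the nominal-data variant from a coincidence matrix, where α = 1 − D_o / D_e (observed over expected disagreement). Treating the ratings as nominal categories is an assumption of this sketch; the summary above does not state which variant or label granularity the authors used, and the function name is hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Hashable]]) -> float:
    """Krippendorff's alpha for nominal labels.

    `units` holds one sequence of ratings per rated item; items with
    fewer than two ratings carry no agreement information and are skipped.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # Each ordered pair of ratings from different raters adds 1/(m-1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    )
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - (n - 1) * observed / expected


# Three items, two raters each: two agreements and one disagreement.
example_alpha = krippendorff_alpha_nominal(
    [["human", "human"], ["machine", "machine"], ["human", "machine"]]
)
```

An α of 1 means perfect agreement and 0 means agreement no better than chance, so a value near 0.11 indicates that evaluators rarely converged on the same judgment for the same text.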
Connections
- Related to [[2020-ippolito-automatic-detection]] via shared interest in detecting state-of-the-art machine-generated text
- Complements A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities on the broader challenge of distinguishing fabricated content
- Cites Gehrmann Visually Grounded and Gehrmann Challenges Detection on detecting generated text
- Informed by evaluation methodology literature: Belz Comparing Automatic, Van Der Lee Best Practices, Howcroft Twenty Years
Notes
Strengths:
- Large-scale crowdsourced study (1,170 evaluators, 5,850 annotations total) provides robust empirical evidence
- Systematic annotation of evaluator reasoning reveals misconceptions about model capabilities (e.g., evaluators misjudging what models can and cannot do)
- Tests multiple training paradigms with clear operational differences
- Clear practical implications for NLG researchers collecting human judgments
Weaknesses:
- Improvements are modest and inconsistent across domains; example-based training only reaches 55% accuracy overall
- The paper does not deeply explore why training effects remain limited: whether evaluators lack sufficient expertise or whether human evaluation of fluency-only tasks is inherently difficult
- Findings specific to open-ended generation domains; structured tasks may see different patterns
- Limited discussion of why evaluator agreement is so low even after training
Open questions:
- Can more sophisticated or domain-expert training push evaluator accuracy higher?
- Do the low inter-rater reliabilities observed here generalize to other text-quality assessments?
- How do findings transfer to detecting generated text in higher-stakes settings (e.g., academic papers, news)?