All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text¶

Authors: Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, Noah A. Smith

Venue: arXiv, 2021 — arXiv:2107.00061

TL;DR¶

Human evaluators cannot reliably distinguish machine-generated text from human-written text without training, focusing instead on superficial features like grammar and style. Three training approaches (instructions, examples, and comparisons) improve evaluator accuracy modestly, with example-based training showing the most improvement across multiple text domains.

Contributions¶

Demonstrates that untrained human evaluators struggle to detect GPT2 and GPT3-generated text at better than chance levels
Analyzes what aspects of text evaluators focus on when making judgments (grammar, spelling, style vs. content quality)
Proposes and evaluates three lightweight training methods to improve evaluator detection accuracy
Provides evidence that consistent human evaluation methodology is crucial for reliable NLG model assessment

Method¶

The authors conduct a large-scale evaluation study with 1,170 approved human evaluators (using Amazon Mechanical Turk) assessing text from three domains: stories, news articles, and recipes. Evaluators rate whether each text passage is "definitely human-written," "possibly human-written," "possibly machine-generated," or "definitely machine-generated" on a 4-point scale.

For the baseline (no training), evaluators assess 5 text passages each—50 from GPT2 and 50 from GPT3 datasets conditioned on identical priming texts.

Three training methods are tested:

Instruction-based training: Evaluators receive explicit instructions highlighting dimensions of text that reveal machine generation (e.g., repetition, factuality, style consistency, genre adherence).

Example-based training: Evaluators see 3 practice rounds with labeled examples of machine vs. human text, receiving corrective feedback and explanations.

Comparison-based training: Evaluators compare paired texts (one human, one machine) written from the same prompt and identify which is machine-generated.

The authors also annotate 150 evaluator responses to understand what aspects of text evaluators cite when justifying their judgments, categorizing comments into form (grammar, level-of-detail, genre), content (factuality, consistency, common sense, coherence), and machine capabilities.

Results¶

Overall, evaluators choosing between human and GPT2-generated text correctly identified the author only 57% of the time (vs. 50% random baseline). For GPT3, accuracy dropped to 50%—at chance level.

Breaking down by training condition: - No training: 50% overall accuracy - Instruction-based: 52% overall accuracy - Example-based: 55% overall accuracy (statistically significant improvement) - Comparison-based: 53% overall accuracy

Example-based training showed the highest gains, though improvements varied by domain: stories saw 62% accuracy, news 65%, and recipes 55%.

Evaluators' focus when deciding: nearly twice as many comments about form (47%) vs. content (25%), with most focusing on spelling, grammar, punctuation, and style (45 of 150 comments). Despite training, evaluator agreement remained low (Krippendorff's α ≈ 0.11).

Connections¶

Related to Ippolito et al. 2019 via shared interest in detecting state-of-the-art machine-generated text
Complements Zhou et al.'s survey on the broader challenge of distinguishing fabricated content
Cites work on detecting generated text including Gehrmann et al. on visually-grounded and challenge detection
Informed by evaluation methodology literature in NLG assessment

Notes¶

Strengths: - Large-scale crowdsourced study (1,170 evaluators, 5,850 annotations total) provides robust empirical evidence - Systematic annotation of evaluator reasoning reveals misconceptions about model capabilities (e.g., evaluators overestimating what models can/cannot do) - Tests multiple training paradigms with clear operational differences - Clear practical implications for NLG researchers collecting human judgments

Weaknesses: - Improvements are modest and inconsistent across domains; example-based training only reaches 55% accuracy overall - The paper does not deeply explore why training effects remain limited—whether evaluators lack sufficient expertise or whether human evaluation of fluency-only tasks is inherently difficult - Findings specific to open-ended generation domains; structured tasks may see different patterns - Limited discussion of why evaluator agreement is so low even after training

Open questions: - Can more sophisticated or domain-expert training push evaluator accuracy higher? - Do the low inter-rater reliabilities observed here generalize to other text-quality assessments? - How do findings transfer to detecting generated text in higher-stakes settings (e.g., academic papers, news)?