LLM-generated text detection

The task of identifying text produced by large language models (ChatGPT, GPT-2, GPT-3, Claude, etc.) as opposed to human-written text. This problem is motivated by concerns about misinformation, academic dishonesty, unauthorized content generation, and erosion of trust in digital media.

Problem scope

As large language models produce increasingly human-like text, detection becomes critical for:

  • Education: Identifying student use of ChatGPT on assignments
  • Journalism: Flagging AI-generated fake news and fabricated articles
  • Authentication: Verifying authorship and preventing plagiarism
  • Cybersecurity: Detecting AI-aided phishing, social engineering, and manipulation attacks
  • Trust and credibility: Maintaining reader confidence in news outlets and publications

Approaches

Detection methods broadly divide into two categories:

Black-box Detection: Used by external entities with only API-level access to a language model. Approaches build binary classifiers to distinguish human from machine-generated text using:

  • Statistical features (perplexity, word ranking, Zipfian coefficients)
  • Linguistic patterns (vocabulary diversity, part-of-speech analysis, sentiment, stylometry)
  • Fact verification (detecting hallucinations and factual inconsistencies)
  • Traditional classifiers (SVM, random forests) or neural models (fine-tuned transformers like RoBERTa)
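A minimal sketch of three of the statistical features named above, using only the Python standard library. The function names, the crude regex tokenizer, and the add-one-smoothed unigram model are assumptions made for illustration; practical detectors usually compute perplexity with a neural language model instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; a deliberately crude stand-in for a real one."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Vocabulary diversity: distinct words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def zipf_coefficient(tokens):
    """Negated least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        return 0.0
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var


def unigram_perplexity(tokens, reference_counts, vocab_size):
    """Perplexity of tokens under an add-one-smoothed unigram model.

    reference_counts maps word -> count in some reference corpus
    (hypothetical input; any background corpus would do).
    """
    if not tokens:
        raise ValueError("need at least one token")
    total = sum(reference_counts.values())
    log_prob = sum(
        math.log((reference_counts.get(t, 0) + 1) / (total + vocab_size))
        for t in tokens
    )
    return math.exp(-log_prob / len(tokens))


def extract_features(text, reference_counts, vocab_size):
    """Feature vector that a downstream classifier (SVM, random forest) could consume."""
    tokens = tokenize(text)
    return {
        "type_token_ratio": type_token_ratio(tokens),
        "zipf_coefficient": zipf_coefficient(tokens),
        "unigram_perplexity": unigram_perplexity(tokens, reference_counts, vocab_size),
    }
```

The resulting dictionary is meant as input to one of the classifiers listed above; which features actually separate the two classes depends on the domain and the generating model.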

White-box Detection: Available to language model developers with full model access. Methods embed watermarks for traceability:

  • Post-hoc watermarking: embedding hidden signals after text generation (rule-based via syntactic/semantic modification, or neural-based via encoder-decoder-discriminator networks)
  • Inference-time watermarking: modifying the decoding process (e.g., constraining token sampling to "green lists" seeded by hash functions)
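The green-list idea can be illustrated with a toy sketch. Everything here is an assumption for illustration: the vocabulary is a list of strings, the "decoder" samples uniformly rather than from a language model, the green list is seeded by a SHA-256 hash of the previous token, and the constraint is hard (only green tokens are ever emitted). Detection counts green tokens and applies a one-proportion z-test.

```python
import hashlib
import math
import random


def green_list(prev_token, vocab, gamma=0.5):
    """Pick a fraction gamma of the vocabulary, seeded by a hash of prev_token."""
    digest = hashlib.sha256(prev_token.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    shuffled = sorted(vocab)  # fixed order so the partition is reproducible
    rng.shuffle(shuffled)
    return set(shuffled[: int(gamma * len(shuffled))])


def generate_watermarked(start, length, vocab, gamma=0.5, seed=0):
    """Toy decoder: sample uniformly, but only from the current green list."""
    rng = random.Random(seed)
    tokens = [start]
    for _ in range(length):
        candidates = sorted(green_list(tokens[-1], vocab, gamma))
        tokens.append(rng.choice(candidates))
    return tokens


def watermark_z_score(tokens, vocab, gamma=0.5):
    """z-score of the green-token count against the no-watermark expectation gamma."""
    n = len(tokens) - 1  # the first token has no predecessor to seed a list
    if n < 1:
        raise ValueError("need at least two tokens")
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab, gamma)
        for i in range(1, len(tokens))
    )
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


vocab = [f"w{i}" for i in range(50)]
marked = generate_watermarked("w0", 100, vocab)
print(watermark_z_score(marked, vocab))  # 10.0: all 100 scored tokens are green
```

Unwatermarked text should land near a z-score of zero, so a detector of this kind flags text whose score exceeds a chosen threshold. A detector needs only the hash function and the vocabulary, not the model weights.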

Key challenges

  • Bias in training data: Black-box detectors trained on limited task distributions (question-answering, news) may fail on other domains
  • Adversarial robustness: Paraphrasing attacks have been reported to reduce detection accuracy from 97% to roughly 80%
  • Evaluation metrics: Standard metrics (accuracy, AUC) mask poor performance in low-false-positive regimes critical for high-stakes applications
  • Confidence calibration: Lack of reliable confidence scores limits practical deployment
  • Arms race: As language models improve, black-box detection signals weaken; as watermarking becomes common, adversaries develop watermark removal techniques
  • Open-source LLMs: Watermark-based detection assumes the developer controls decoding; open-source models let end users fine-tune the model or strip the watermark
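To make the evaluation-metrics point concrete, the sketch below compares plain accuracy with the true-positive rate at a capped false-positive rate. The function names and the toy scores are hypothetical; scores are assumed to be detector outputs where higher means "more machine-like", and text is flagged when its score is strictly above the threshold.

```python
def accuracy(human_scores, machine_scores, threshold):
    """Fraction of all texts classified correctly at a fixed threshold."""
    correct = sum(s <= threshold for s in human_scores)
    correct += sum(s > threshold for s in machine_scores)
    return correct / (len(human_scores) + len(machine_scores))


def tpr_at_fpr(human_scores, machine_scores, max_fpr):
    """True-positive rate at the loosest threshold whose FPR stays <= max_fpr."""
    # Number of human texts we are allowed to flag by mistake.
    allowed = int(max_fpr * len(human_scores) + 1e-9)
    ranked = sorted(human_scores, reverse=True)
    # Flagging strictly above the (allowed+1)-th highest human score
    # guarantees at most `allowed` false positives, even with ties.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(s > threshold for s in machine_scores)
    return flagged / len(machine_scores)


human = [0.1, 0.2, 0.3, 0.9]     # one human text looks very machine-like
machine = [0.4, 0.5, 0.6, 0.95]

print(accuracy(human, machine, threshold=0.35))   # 0.875
print(tpr_at_fpr(human, machine, max_fpr=0.0))    # 0.25
```

In this toy case the detector is 87.5% accurate at a threshold of 0.35, yet it catches only one machine text in four once no false accusations are tolerated, which is the regime that matters in settings such as academic misconduct cases.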

Benchmark datasets

  • HC3 (Guo et al. 2023): ChatGPT vs. human answers to 37,175 questions in English and Chinese; detectors trained on it are reported to reach 99.79% F1 at paragraph level
  • Neural Fake News (Zellers et al. 2019): Grover-generated news articles
  • TweepFake (Fagni et al. 2021): GPT-2 generated tweets
  • GPT2-Output (OpenAI): GPT-2 generated text paired with human-written samples from the WebText corpus
  • TURINGBENCH (Uchendu et al. 2021): Diverse LLM outputs across news and other domains

Key papers