Can AI-Generated Text be Reliably Detected?

Authors: Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, Soheil Feizi
Venue: arXiv:2303.11156, 2023

TL;DR

This paper stress-tests AI-text detectors with a recursive paraphrasing attack and shows that watermarking-based, neural network-based, zero-shot, and retrieval-based detectors can all be evaded with only minor degradation in text quality. It also relates the AUROC of the best possible detector to the total variation distance between the human and AI-generated text distributions, which yields a fundamental hardness result: reliable detection becomes harder as language models improve.

Contributions

  • Comprehensive analysis of robustness for four classes of AI-text detectors: watermarking, neural network-based, zero-shot, and retrieval-based methods
  • Introduction of a recursive paraphrasing attack that breaks watermarking and retrieval-based detectors while largely preserving text quality (in a human study, 77% of paraphrased passages were rated high for content and 89% for grammar)
  • Demonstration of spoofing attacks where adversaries can cause detectors to flag human-written text as AI-generated (type-I errors)
  • Theoretical framework proving that the AUROC of any detector is upper-bounded by a function of the total variation distance between the human and AI-text distributions
  • Evidence that improved language models lead to smaller distribution distances, making reliable detection increasingly difficult

Method

The attack uses neural paraphrasers (DIPPER, LLaMA-2-7B-Chat, T5-based) to recursively rewrite watermarked or generated text: the output of each paraphrasing round is fed back in as the input to the next. Experiments use 1000 long news articles (~300 tokens each) from the XSum dataset and evaluate multiple target LLMs, including OPT-13B and GPT-2 Medium. The authors measure attack success with ROC curves and assess text quality with perplexity-based metrics and human evaluation studies.
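The recursive structure of the attack can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `recursive_paraphrase`, `toy_paraphrase`, and the `SYNONYMS` table are hypothetical names, and the toy word-swap function stands in for a neural paraphraser such as DIPPER.

```python
from typing import Callable, List

def recursive_paraphrase(
    text: str,
    paraphrase: Callable[[str], str],
    rounds: int = 5,
) -> List[str]:
    """Return [pp0, pp1, ..., pp_rounds], where pp0 is the input text and
    each pp_i is the paraphrase of pp_(i-1)."""
    outputs = [text]
    for _ in range(rounds):
        # Feed the previous round's output back into the paraphraser.
        outputs.append(paraphrase(outputs[-1]))
    return outputs

# Hypothetical stand-in for a neural paraphraser: swaps a few words.
SYNONYMS = {"quick": "fast", "fast": "rapid", "big": "large", "large": "huge"}

def toy_paraphrase(text: str) -> str:
    return " ".join(SYNONYMS.get(word, word) for word in text.split())

if __name__ == "__main__":
    versions = recursive_paraphrase("the quick big dog", toy_paraphrase, rounds=2)
    for i, version in enumerate(versions):
        print(f"pp{i}: {version}")
```

Keeping every intermediate version makes it possible to score each round separately, both for detector output and for text quality.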

Results

On watermarked OPT-13B, the recursive paraphrasing attack reduces the watermark detector's AUROC from 99.8% to 80.7% after five rounds. Zero-shot detectors drop from 96.5% to 59.8%, and retrieval-based detectors fall below 60% accuracy after five paraphrasing rounds. Spoofing attacks can cause detector AUROC to drop from 99.8% to 1.3% on soft-watermarked text. In human studies, 77% of recursively paraphrased passages maintain high content quality and 89% preserve grammar, which suggests the attack is practical.
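For readers less familiar with the metric, AUROC can be read as the probability that a detector assigns a higher score to a randomly chosen AI-generated passage than to a randomly chosen human-written one, so 50% corresponds to chance. The sketch below computes it directly from that definition; the function name `auroc` and the example scores are illustrative assumptions, not values from the paper.

```python
from typing import Sequence

def auroc(ai_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """AUROC as the fraction of (AI, human) pairs in which the AI passage
    receives the higher detector score; ties count as half a win."""
    if not ai_scores or not human_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))

if __name__ == "__main__":
    # Hypothetical detector scores (higher means "more likely AI").
    print(auroc([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # perfectly separated
    print(auroc([0.9, 0.4], [0.5, 0.1]))            # one pair misordered
```

The pairwise loop is quadratic in the number of passages, which is fine for an illustration; a rank-based computation would be the usual choice at scale.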

The paper also shows that detection becomes harder as language models improve: better models generate text closer to human distributions, reducing the total variation distance and making discrimination mathematically harder.
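To make the hardness argument concrete, the sketch below evaluates the bound as we read it from Theorem 1 of the paper, AUROC ≤ 1/2 + TV − TV²/2, where TV is the total variation distance between the AI and human text distributions. The function names and the toy distributions are assumptions made for illustration only.

```python
from typing import Dict

def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two discrete distributions,
    given as outcome -> probability mappings."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in outcomes)

def auroc_upper_bound(tv: float) -> float:
    """Upper bound on detector AUROC as a function of TV distance:
    1/2 + tv - tv**2 / 2."""
    if not 0.0 <= tv <= 1.0:
        raise ValueError("total variation distance must lie in [0, 1]")
    return 0.5 + tv - tv * tv / 2.0

if __name__ == "__main__":
    # Toy distributions over two outcomes (hypothetical, not from the paper).
    human = {"a": 0.5, "b": 0.5}
    model = {"a": 0.25, "b": 0.75}
    print("TV:", total_variation(human, model))
    for tv in (1.0, 0.5, 0.1, 0.0):
        print(f"TV = {tv:.1f} -> AUROC bound {auroc_upper_bound(tv):.3f}")
```

Under this bound, a TV distance of 0.1 caps AUROC at 0.595, and a TV distance of 0 leaves any detector at chance (0.5).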

Connections

Notes

Theorem 1 is the notable theoretical result: it upper-bounds the AUROC of the best possible detector in terms of the total variation distance between the human and AI text distributions. The implication for detection research is that, as language models become more capable and their outputs approach the human text distribution, reliable detection becomes provably harder. On the empirical side, the findings that watermark detectors can be evaded by paraphrasing, and spoofed without white-box access to the detector, are particularly concerning for systems that rely on watermarks for authentication. Because the attack degrades text quality only slightly (perplexity increases modestly), it is practical for adversaries willing to accept a minor quality loss.