Watermarking

Watermarking is a technique for embedding imperceptible signals into generated content (text, images, audio) that can later be detected to verify its authenticity or origin. In the context of language models, soft watermarking biases sampling toward a pseudorandomly selected "green list" of tokens at each generation step, so that green tokens appear more often than chance. This creates a statistical signature that a watermark detector can test for.
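As a rough illustration of the green-list idea, the sketch below biases sampling toward a keyed green list and then detects the bias with a one-proportion z-test, in the spirit of Kirchenbauer et al. (2023). It is a toy under stated assumptions, not a reference implementation: random logits stand in for a real language model, and `KEY`, `VOCAB_SIZE`, `GAMMA`, `DELTA`, and all function names are hypothetical choices made for this example.

```python
import hashlib
import math
import random

VOCAB_SIZE = 1000   # assumption: toy vocabulary of integer token ids
GAMMA = 0.5         # assumption: fraction of the vocabulary placed on the green list
DELTA = 4.0         # assumption: bias added to the logits of green tokens
KEY = b"demo-key"   # assumption: secret shared by generator and detector


def green_list(prev_token):
    """Derive the green list for this step from the key and the previous token."""
    digest = hashlib.sha256(KEY + prev_token.to_bytes(4, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))


def sample_token(logits, prev_token, rng, watermark):
    """Sample from softmax(logits), adding DELTA to green tokens if watermarking."""
    if watermark:
        green = green_list(prev_token)
        logits = [x + DELTA if i in green else x for i, x in enumerate(logits)]
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]  # unnormalized softmax
    return rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]


def generate(length, seed, watermark):
    """Generate `length` tokens after a start token, using random stand-in logits."""
    rng = random.Random(seed)
    tokens = [0]  # start token
    for _ in range(length):
        logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
        tokens.append(sample_token(logits, tokens[-1], rng, watermark))
    return tokens


def z_score(tokens):
    """z = (green_hits - GAMMA*T) / sqrt(T*GAMMA*(1-GAMMA)) over T scored tokens."""
    scored = len(tokens) - 1
    hits = sum(
        tokens[i] in green_list(tokens[i - 1]) for i in range(1, len(tokens))
    )
    return (hits - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))


if __name__ == "__main__":
    print("watermarked z:", z_score(generate(200, seed=1, watermark=True)))
    print("plain z:      ", z_score(generate(200, seed=1, watermark=False)))
```

Under these assumptions, watermarked text should yield a large positive z-score, while unwatermarked text should stay near zero. A detector would flag text whose score exceeds a chosen threshold. Editing or paraphrasing the text changes which tokens fall on the green list, which is why such attacks can lower the score.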

Watermarking has been proposed as a defense against misuse of generative AI, particularly for preventing plagiarism and false attribution. However, recent work shows that watermarked systems can be vulnerable to adversarial attacks such as paraphrasing, which can remove or degrade the watermark signals while maintaining semantic content.

Key papers

  • Wu et al. (2023) — Comprehensive analysis of watermarking techniques, including post-hoc (rule-based and neural-based) and inference-time watermarking; discusses robustness to adversarial attacks
  • Tang et al. (2023) — Comprehensive survey covering both post-hoc watermarks (rule-based and neural-based) and inference-time watermarking; discusses watermarking requirements (effectiveness, secrecy, robustness) and vulnerabilities to adaptive attacks
  • Sadasivan et al. (2023) — Can AI-Generated Text be Reliably Detected? — Demonstrates that recursive paraphrasing attacks reduce watermark detector AUROC from 99.8% to 80.7%; also shows that spoofing attacks can cause human-written text to be falsely flagged as AI-generated
  • Kirchenbauer et al. (2023) — A Watermark for Large Language Models — Proposes a soft watermarking scheme using green/red token lists