Can LLM-Generated Misinformation Be Detected?

Venue: ICLR 2024 — arxiv:2309.13788

TL;DR

LLM-generated misinformation (from ChatGPT, Llama2, Vicuna) is harder for both humans and automated detectors to identify than human-written misinformation with the same semantics. The paper builds a taxonomy of LLM-generated misinformation across types, domains, sources, intents, and errors, then shows empirically that humans detect human-written misinformation 40.7% of the time but hallucinated news only 9.6% of the time, and that existing detectors consistently underperform on LLM-generated variants.

Contributions

Taxonomy of LLM-generated misinformation across five dimensions: types (fake news, rumors, conspiracy theories, clickbait, misleading claims), domains (healthcare, science, politics, finance, law, education, social media, environment), sources (hallucination, arbitrary generation, controllable generation), intents (unintentional vs. intentional), and errors (unsubstantiated content, total fabrication, outdated information, description ambiguity, incomplete fact, false context).
Categorization and validation of real-world misinformation generation methods with LLMs: Hallucination Generation (HG) where users request news without malicious intent; Arbitrary Misinformation Generation (AMG) with explicit instructions to generate false content; Controllable Misinformation Generation (CMG) that preserves semantic meaning through paraphrase, rewriting, open-ended generation, or information manipulation.
LLMFake dataset with 1,000+ pieces of misinformation generated via multiple LLMs (ChatGPT, Llama2-7b/13b, Vicuna-7b) and methods, paired with real-world human-written data from Politifact, Gossipcop, and CoAID.
Empirical evidence that LLM-generated misinformation is harder to detect: humans detect only 9.6% of hallucinated news vs 40.7% of human-written misinformation (average across 10 evaluators); existing detectors (ChatGPT-3.5, GPT-4, Llama2-7b-chat) consistently underperform on LLM-generated content compared to baselines.

Method

Research Questions:

RQ1: How can LLMs be utilized to generate misinformation? The authors propose three approaches based on real-world scenarios: HG through unintentional hallucinations; AMG through explicit instruction; CMG through semantic-preserving transformations.
RQ2: Can humans detect LLM-generated misinformation? Human evaluation via Amazon Mechanical Turk with 10 evaluators independently assessing whether 100 random items from each category are "factual" or "nonfactual," with success rate as the metric.
RQ3: Can detectors detect LLM-generated misinformation? Zero-shot evaluation using ChatGPT-3.5, GPT-4, and Llama2-7b-chat/13b-chat with Chain-of-Thought prompting as detectors, tested against the LLMFake dataset and human-written baselines (Politifact, Gossipcop, CoAID).
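
The zero-shot Chain-of-Thought setup in RQ3 can be sketched as a prompt builder plus a verdict parser. This is a minimal illustration only: the prompt wording, `PROMPT_TEMPLATE`, `build_prompt`, and `parse_verdict` are hypothetical names and are not taken from the paper, and the call to the actual LLM is deliberately left out.

```python
import re
from typing import Optional

# Hypothetical prompt wording; the paper's exact detector prompt is not given in these notes.
PROMPT_TEMPLATE = (
    "You are given a passage. Think step by step about whether it contains "
    "misinformation, then end with a line of the form 'Answer: YES' or 'Answer: NO'.\n\n"
    "Passage: {passage}\n"
)

# Matches the final verdict line regardless of letter case.
_ANSWER = re.compile(r"answer\s*:\s*(yes|no)\b", re.IGNORECASE)


def build_prompt(passage: str) -> str:
    """Fill the zero-shot CoT template with one passage."""
    return PROMPT_TEMPLATE.format(passage=passage.strip())


def parse_verdict(response: str) -> Optional[bool]:
    """Return True if the model flags misinformation, False if not, None if unparseable."""
    matches = _ANSWER.findall(response)
    if not matches:
        return None
    # Use the last verdict so reasoning that mentions "Answer:" earlier is ignored.
    return matches[-1].lower() == "yes"
```

Returning `None` for unparseable responses keeps malformed outputs separate from genuine "not misinformation" verdicts, so they can be counted or retried rather than silently scored.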

Dataset Construction: The LLMFake dataset includes 100 hallucinated news items (ChatGPT, prompt: "write a piece of news"), 100 totally arbitrary generation items (ChatGPT, "write misinformation"), 100 partially arbitrary items (ChatGPT, domain-specific instructions), and 270 items for each CMG method (paraphrase, rewriting, open-ended, information manipulation) applied to the real human-written datasets. Semantic analysis used OpenAI embeddings and t-SNE visualization to compare the distributions of generated and human-written misinformation.

Detection Evaluation: Both human and machine detectors were tested on a zero-shot basis. Humans evaluated 600 total items (100 human-written nonfactual, 100 hallucinated, 100 arbitrary, 100 partially arbitrary, 50 paraphrased, 50 rewritten, 50 open-ended, 50 information-manipulated). Detectors were evaluated with the Success Rate % metric (the percentage of misinformation items correctly identified as misinformation).
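
The Success Rate metric described above can be written down in a few lines. This is a sketch under stated assumptions: every evaluated item is misinformation, each judgment is the string "factual" or "nonfactual" (the two labels mentioned for the human study), and the function names `success_rate` and `mean_success_rate` are illustrative rather than taken from the paper's code.

```python
from typing import Sequence


def success_rate(judgments: Sequence[str]) -> float:
    """Percentage of misinformation items that were labeled 'nonfactual'."""
    if not judgments:
        raise ValueError("need at least one judgment")
    hits = sum(1 for label in judgments if label == "nonfactual")
    return 100.0 * hits / len(judgments)


def mean_success_rate(per_evaluator: Sequence[Sequence[str]]) -> float:
    """Average the per-evaluator success rates (e.g. across the human evaluators)."""
    if not per_evaluator:
        raise ValueError("need at least one evaluator")
    rates = [success_rate(judgments) for judgments in per_evaluator]
    return sum(rates) / len(rates)
```

Because every item in a category is misinformation, the metric is effectively recall on the misinformation class; it says nothing about false alarms on factual content.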

Results

Human Detection Performance (Table 3):
- Humans successfully detect human-written misinformation 40.7% of the time on average
- Detection success drops significantly for LLM-generated variants:
  - Hallucinated News Generation: 9.6% (hardest to detect)
  - Totally Arbitrary Generation: 21.8%
  - Partially Arbitrary Generation: 33.2%
  - Paraphrase Generation: 38.4%
  - Rewriting Generation: 24.2%
  - Open-ended Generation: 21.4%
  - Information Manipulation: 34.3%

Difference is statistically significant for Rewriting and Open-ended Generation (p < 0.05).
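
These notes do not record which significance test the paper used. As one possible way to check such a claim on paired per-evaluator success rates, the sketch below implements an exact two-sided sign-flip permutation test; the choice of test and the name `paired_sign_flip_p_value` are assumptions, not the authors' procedure.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_p_value(baseline: Sequence[float], variant: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired samples.

    baseline[i] and variant[i] are the success rates of the same evaluator
    on human-written and on LLM-generated misinformation, respectively.
    """
    if len(baseline) != len(variant):
        raise ValueError("paired samples must have the same length")
    diffs = [b - v for b, v in zip(baseline, variant)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely to be
    # positive or negative, so enumerate every sign assignment (2**n of them).
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total
```

With 10 evaluators the enumeration covers only 1,024 sign assignments, so an exact test is cheap and avoids distributional assumptions that a small sample cannot support.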

Key Finding 2: LLM-generated misinformation can be harder for humans to detect than human-written misinformation with the same semantics. Methods preserving semantic information (Paraphrase, Rewriting, Open-ended) have more deceptive styles and can make original misinformation harder to detect.

Detector Performance (Tables 4-5):
- ChatGPT-3.5 performs poorly on LLM-generated misinformation, achieving 0% detection on hallucinated news
- GPT-4 outperforms humans on LLM-generated misinformation but still struggles (e.g., ~38% on Arbitrary Generation vs 40.7% on human-written)
- Llama2-7b-chat and Llama2-13b-chat also show degraded performance on LLM-generated content
- All tested detectors show substantially lower performance on LLM-generated vs. human-written misinformation across most generation approaches

Key Finding 3: LLM-generated misinformation can be harder for misinformation detectors to detect than human-written misinformation with the same semantics. Detection performance on LLM-generated content is mostly lower than on the human-written baseline, with statistical significance for multiple methods.

Connections

Related to Large Language Models as both tools for generation and detection of misinformation
Extends Misinformation and fake news detection research by characterizing detection difficulty gaps
Complements the same authors' survey on LLMs in misinformation contexts
Connects to Fact-checking and corrections and Rumor detection on social media as downstream detection tasks
Overlaps with Natural Language Generation concerns about factuality and deceptive writing styles
Related to LLM Hallucination as a source of unintentional misinformation
Discusses implications for Prompt injection and LLM safety
Relevant to Multimodal Misinformation Detection as semantic similarity analysis applies across modalities

Notes

Strengths:
- First systematic empirical investigation of detection difficulty gap between LLM-generated and human-written misinformation
- Rigorous taxonomy spanning five dimensions (types, domains, sources, intents, errors) provides conceptual clarity
- Comprehensive evaluation covering both human and machine detectors across multiple LLM architectures
- Semantic analysis using embeddings and visualization provides insight into why detection is harder
- Clear methodology with reproducible dataset and publicly available code/data

Limitations:
- Human evaluation limited to 10 evaluators and small sample sizes (100 items per category)
- Dataset construction relies heavily on ChatGPT; coverage of other LLM types (Llama, Vicuna) is more limited
- Zero-shot detector evaluation doesn't compare to specialized fine-tuned models
- Statistical significance testing limited (only reported for some methods)
- Doesn't deeply investigate which linguistic features distinguish LLM-generated from human-written misinformation
- Real-world impact analysis missing; unclear how often LLM-generated misinformation actually propagates

Follow-up Questions:
- How do detection approaches perform with fine-tuned detectors trained on mixed LLM/human data?
- What linguistic markers best distinguish LLM-generated from human-written misinformation?
- How does detection performance vary across different LLM architectures and versions?
- Do humans improve detection accuracy with training or domain-specific instruction?
- How do these findings generalize to non-English languages and low-resource settings?