On the Risk of Misinformation Pollution with Large Language Models
Authors: Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, William Yang Wang
Venue: arXiv, 2023 — arxiv:2305.13661
TL;DR
This paper investigates how modern LLMs can be misused to generate convincing misinformation at scale and how that misinformation affects downstream NLP applications. Using GPT-3.5 as a misinformation generator, the authors show that synthetically generated false information can significantly degrade the performance of open-domain question answering (ODQA) systems (by up to 87% in some settings), and they propose three defense strategies: misinformation detection, vigilant prompting, and reader ensemble voting.
Contributions
- Establishes a comprehensive threat model encompassing both unintentional (hallucination) and intentional (adversarial) misinformation generation scenarios
- Demonstrates that LLMs are effective misinformation generators, producing credible false content that downstream ODQA systems retrieve and rely on when answering
- Proposes three defense mechanisms to mitigate harm: misinformation detection classifiers, vigilant prompting, and multi-reader ensemble voting
- Evaluates vulnerabilities across different retriever-reader architectures and datasets (NQ-1500 and CovidNews)
- Releases the code and the generated misinformation dataset for reproducibility and future research
Method
The authors evaluate four distinct misinformation generation settings using GPT-3.5 (text-davinci-003):
GenRead: Directly prompts the LLM to generate a background document supporting a target (false) answer to a given question. Example: generating a news article claiming Trump won the 2020 election.
CtrlGen: Similar to GenRead, but additionally conditions the generation on a predefined false opinion, simulating real-world scenarios where malicious actors have predetermined fabricated facts they wish to propagate.
Revise: Provides a human-written factual article and instructs the LLM to revise it to inject a predetermined false claim. Mirrors scenarios where attackers modify existing credible content.
Reit: Provides both a target question and predetermined response, asking the LLM to reiterate the false response in multiple ways. Simulates cases where misleading information aims to compromise downstream models without requiring human-like plausibility.
The authors then evaluate the impact on Open-Domain Question Answering (ODQA) systems using two datasets:
- NQ-1500: 1,500 questions from the Natural Questions dataset, with a Wikipedia corpus
- CovidNews: 1,534 COVID-19-related questions from the StreamingQA dataset
The ODQA pipelines tested pair two retrievers (sparse BM25 and dense DPR) with two readers (FiD and GPT-3.5).
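To make the corpus-pollution setup concrete, here is a minimal sketch of sparse retrieval over a corpus that has had one fabricated passage injected into it. It is not the authors' pipeline: the tiny BM25 scorer, the helper `polluted_fraction`, and the toy passages about a fictional city are all assumptions made for illustration, and a real experiment would use a full retrieval library over the Wikipedia or news corpus.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization is enough for this toy example.
    return text.lower().split()


class BM25:
    """Minimal BM25 scorer (illustrative, not the paper's implementation)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.n = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.n
        # Document frequency: number of documents containing each term.
        self.df = Counter(t for d in self.docs for t in set(d))

    def idf(self, term):
        df = self.df.get(term, 0)
        return math.log(1 + (self.n - df + 0.5) / (df + 0.5))

    def score(self, query, idx):
        doc = self.docs[idx]
        tf = Counter(doc)
        length_norm = self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf(term) * f * (self.k1 + 1) / (f + length_norm)
        return total

    def top_k(self, query, k):
        ranked = sorted(range(self.n), key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


def polluted_fraction(clean, injected, query, k):
    """Fraction of the top-k retrieved passages that come from the injected set."""
    corpus = clean + injected
    injected_ids = set(range(len(clean), len(corpus)))
    top = BM25(corpus).top_k(query, k)
    return sum(i in injected_ids for i in top) / k


# Hypothetical toy data about a fictional city.
clean_corpus = [
    "the city of zorlandia was founded by ada in 1900",
    "rivers flow to the sea",
    "bread is baked in ovens",
]
injected_corpus = [
    "who founded the city of zorlandia the city of zorlandia was founded by bob",
]
query = "who founded the city of zorlandia"

top1_share = polluted_fraction(clean_corpus, injected_corpus, query, k=1)
top2_share = polluted_fraction(clean_corpus, injected_corpus, query, k=2)
```

In this toy case the injected passage repeats the question's wording, so it outranks the genuine passage under BM25. That is one plausible reason why passages that restate the question and a target answer can surface so readily in the retrieved context.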
Results
The paper reveals significant vulnerabilities:
Misinformation Generation Effectiveness: Synthetically-generated false passages are highly likely to be retrieved by both sparse and dense retrievers when injected into the corpus, indicating that LLM-generated misinformation is comparable to human-written content in retrieval relevance.
ODQA Degradation: Exact match (EM) drops are substantial across all reader-retriever combinations:
- DPR-based models: 14-54% EM drop
- BM25-based models: 20-87% EM drop
- Impact varies by generation method; Reit is particularly effective at degrading performance
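For readers unfamiliar with the metric, the sketch below shows one common way to compute exact match and to express a drop between a clean and a polluted corpus. The normalization follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and articles), which is an assumption here rather than something this summary confirms about the paper. The summary also does not say whether the reported drops are absolute or relative, so the hypothetical helper `em_drop` returns both.

```python
import re
import string


def normalize(text):
    # Lowercase, drop punctuation, drop English articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    # 1.0 if the prediction matches any gold answer after normalization.
    target = normalize(prediction)
    return float(any(target == normalize(g) for g in gold_answers))


def em_score(predictions, gold_lists):
    # Corpus-level EM as a percentage.
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold_lists))
    return 100.0 * hits / len(predictions)


def em_drop(clean_em, polluted_em):
    # Returns (absolute drop in EM points, relative drop in percent).
    absolute = clean_em - polluted_em
    relative = 100.0 * absolute / clean_em
    return absolute, relative


# Hypothetical numbers, not results from the paper.
absolute, relative = em_drop(clean_em=40.0, polluted_em=10.0)
```

With these made-up inputs, a fall from 40 to 10 EM is a 30-point absolute drop but a 75% relative drop, which is why it matters which convention a reported figure uses.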
Defense Strategy Performance:
- Detection: RoBERTa-based detectors achieve 91-99.7% AUROC on in-domain data but only 50-65% out-of-domain, indicating poor generalization
- Vigilant Prompting: Modest improvements (5-15% performance recovery) with inconsistent results across settings
- Reader Ensemble: More effective than prompting alone, with better robustness when multiple reader predictions are aggregated via majority voting
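The majority-voting idea behind the reader ensemble can be sketched as follows. This is a simplified illustration, not the authors' code: the grouping of passages by `group_size`, the tie-breaking rule, and the stub `toy_reader` are all assumptions. The `reader` argument stands in for any callable that maps a question and a list of passages to an answer string.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    cleaned = [a.strip().lower() for a in answers if a and a.strip()]
    if not cleaned:
        return None
    counts = Counter(cleaned)
    best = max(counts.values())
    for answer in cleaned:
        if counts[answer] == best:
            return answer


def ensemble_answer(reader, question, passages, group_size=1):
    """Run the reader on separate passage groups, then vote over its answers."""
    groups = [passages[i:i + group_size] for i in range(0, len(passages), group_size)]
    return majority_vote([reader(question, group) for group in groups])


def toy_reader(question, passages):
    # Hypothetical stand-in for a real reader: it is fooled whenever
    # any passage in its context mentions the false answer "bob".
    return "bob" if any("bob" in p for p in passages) else "ada"


# Hypothetical retrieved context: four genuine passages and one injected one.
retrieved = [
    "zorlandia was founded by ada",
    "ada founded zorlandia in 1900",
    "zorlandia was founded by bob",
    "the founder of zorlandia was ada",
    "ada established zorlandia",
]
question = "who founded zorlandia"

single_pass = toy_reader(question, retrieved)
voted = ensemble_answer(toy_reader, question, retrieved, group_size=1)
```

Here the single-pass reader sees the injected passage and answers incorrectly, while voting over per-passage answers limits the injected passage to one vote out of five. The cost is one reader call per group, and the protection disappears once polluted passages make up most of the retrieved context.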
Connections
- Related to LLM-Generated Misinformation through shared focus on automated generation of false information
- Extends work on Hallucination In Language Models to the adversarial domain of deliberate misinformation production
- Complements Fact Verification And Checking literature by examining vulnerabilities at the retrieval stage
- Informs defenses discussed in Robust NLP Systems and Adversarial Robustness QA
Notes
Strengths:
- Comprehensive threat model covering both intentional and unintentional misinformation
- Well-designed experimental setup with multiple generation methods and ODQA architectures
- Practical focus on vulnerabilities of deployed systems (ODQA) rather than abstract concerns
- Reproducible with released code and datasets
Weaknesses:
- Detection approach shows poor out-of-domain generalization, limiting practical applicability
- Defense strategies are computationally expensive (reader ensemble requires multiple API calls)
- Analysis limited to English-language ODQA systems; cross-lingual vulnerabilities unexplored
- The misinformation generation methods are relatively simple; more sophisticated attacks could be stronger
Open questions:
- How do other LLMs (Claude, Gemini, open-source models) compare in misinformation generation capability?
- Can detection methods trained on synthetic misinformation transfer to real-world adversarial content?
- What are the computational and cost implications of deploying defense strategies at scale?