On the Risk of Misinformation Pollution with Large Language Models¶

Authors: Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, William Yang Wang

Venue: arXiv, 2023 — arxiv:2305.13661

TL;DR¶

This paper investigates how modern LLMs can be misused to generate convincing misinformation at scale and its impact on downstream NLP applications. Using GPT-3.5 as a misinformation generator, the authors demonstrate that synthetically-generated false information can significantly degrade ODQA system performance (up to 87% in some settings) and propose three defense strategies: detection, vigilant prompting, and reader ensemble voting.

Contributions¶

Establishes a comprehensive threat model encompassing both unintentional (hallucination) and intentional (adversarial) misinformation generation scenarios
Demonstrates that LLMs are effective misinformation generators, creating credible false content that evades detection by downstream ODQA systems
Proposes three defense mechanisms to mitigate harm: misinformation detection classifiers, vigilant prompting, and multi-reader ensemble voting
Evaluates vulnerabilities across different retriever-reader architectures and datasets (NQ-1500 and CovidNews)
Provides code and generated misinformation dataset for reproducibility and future research

Method¶

The authors evaluate four distinct misinformation generation settings using GPT-3.5 (text-davinci-003):

GenRead: Directly prompts the LLM to generate a background document supporting a target (false) answer to a given question. Example: generating a news article claiming Trump won the 2020 election.

CtrlGen: Similar to GenRead, but additionally conditions the generation on a predefined false opinion, simulating real-world scenarios where malicious actors have predetermined fabricated facts they wish to propagate.

Revise: Provides a human-written factual article and instructs the LLM to revise it to inject a predetermined false claim. Mirrors scenarios where attackers modify existing credible content.

Reit: Provides both a target question and predetermined response, asking the LLM to reiterate the false response in multiple ways. Simulates cases where misleading information aims to compromise downstream models without requiring human-like plausibility.

The authors then evaluate the impact on Open-Domain Question Answering (ODQA) systems using two datasets: - NQ-1500: 1,500 questions from Natural Questions dataset with Wikipedia corpus - CovidNews: 1,534 COVID-19-related questions from StreamingQA dataset

ODQA pipelines tested include two retrievers (BM25 sparse, DPR dense) combined with two readers (FiD, GPT-3.5).

Results¶

The paper reveals significant vulnerabilities:

Misinformation Generation Effectiveness: Synthetically-generated false passages are highly likely to be retrieved by both sparse and dense retrievers when injected into the corpus, indicating that LLM-generated misinformation is comparable to human-written content in retrieval relevance.

ODQA Degradation: Performance drops are substantial across all reader-retriever combinations: - DPR-based models: 14-54% EM drop - BM25-based models: 20-87% EM drop - Different misinformation generation methods have varying impact; REIT is particularly effective at degrading performance

Defense Strategy Performance: - Detection: RoBERTa-based detectors achieve 91-99.7% AUROC on in-domain data but only 50-65% out-of-domain, indicating poor generalization - Vigilant Prompting: Modest improvements (5-15% performance recovery) with inconsistent results across settings - Reader Ensemble: More effective than prompting alone, with better robustness when multiple reader predictions are aggregated via majority voting

Connections¶

Related to LLM-Generated Misinformation through shared focus on automated generation of false information
Extends work on Hallucination In Language Models to the adversarial domain of deliberate misinformation production
Complements Fact Verification And Checking literature by examining vulnerabilities at the retrieval stage
Informs defenses discussed in Robust NLP Systems and Adversarial Robustness QA

Notes¶

Strengths: - Comprehensive threat model covering both intentional and unintentional misinformation - Well-designed experimental setup with multiple generation methods and ODQA architectures - Practical focus on vulnerabilities of deployed systems (ODQA) rather than abstract concerns - Reproducible with released code and datasets

Weaknesses: - Detection approach shows poor out-of-domain generalization, limiting practical applicability - Defense strategies are computationally expensive (reader ensemble requires multiple API calls) - Analysis limited to English-language ODQA systems; cross-lingual vulnerabilities unexplored - The misinformation generation methods are relatively simple; more sophisticated attacks could be stronger

Open questions: - How do other LLMs (Claude, Gemini, open-source models) compare in misinformation generation capability? - Can detection methods trained on synthetic misinformation transfer to real-world adversarial content? - What are the computational and cost implications of deploying defense strategies at scale?