Disinformation Capabilities of Large Language Models¶

Authors: Ivan Vykopal, Matúš Pikuliak, Ivan Srba, Robert Moro, Dominik Macko, Maria Bielikova

Venue: arXiv — 2311.08838

TL;DR¶

Comprehensive empirical evaluation of 10 LLMs' ability to generate disinformation news articles. Most models readily generate convincing disinformation that agrees with dangerous narratives; Falcon is the sole exception. Only two models (Falcon and ChatGPT) have detectable safety filters. Existing detection methods achieve ~0.8 F1 but struggle with LLM-generated content.

Contributions¶

First large-scale empirical study of disinformation generation capabilities across 10 instruction-tuned LLMs (ChatGPT, GPT-3, GPT-4 variants, Falcon, OPT-IML-Max, Vicuna, Davinci, Curie, Babbage)
Systematic evaluation framework with 20 disinformation narratives (COVID-19, Russia-Ukraine, Health, US Elections, Regional) and 1,200 generated texts manually evaluated by human annotators
Human evaluation methodology with 6 questions assessing coherence, article style, narrative agreement, novel arguments, and safety
Detection analysis showing that fine-tuned detectors achieve ~0.8 F1 but fail to reliably distinguish LLM-generated from human misinformation
Safety mechanism assessment finding that only Falcon and ChatGPT exhibit effective safety filters; most other models lack safeguards

Method¶

The authors generated disinformation news articles using 10 LLMs across 20 manually-curated disinformation narratives drawn from professional fact-checking platforms (AFP, EDMO). For each narrative, they generated three articles per prompt type (title-only and title-abstract prompts), totaling 1,200 texts.

Human evaluation involved two annotators rating each text on a 5-step scale across six dimensions: - Q1 (Well-formed): textual coherence and grammar - Q2 (Article): whether text resembles a news article - Q3 (Agree): agreement with the disinformation narrative - Q4 (Disagree): presence of opposing arguments - Q5 (Args in favor): novel supporting arguments - Q6 (Args against): arguments contradicting the narrative

Inter-annotator agreement (Pearson's ρ and MAE) showed general consistency with lower agreement on Q1.

Detection evaluation tested five detector classes on a corpus of 1,200 LLM-generated texts and 73 human-written fake news articles: MULTiTUDE (fine-tuned ELECTRA-large, best F1 ~0.90), ChatGPT-based detection, SOTA detectors (ELECTRA-large, RoBERTa), and others. Methods determined thresholds via ROC curve optimization.

Results¶

Disinformation generation capability varies significantly across models: - Vicuna and Davinci readily generate disinformation and rarely disagree with narratives - ChatGPT behaves safely in most cases, with very low article scores and frequent refusals/disclaimers - Falcon is the only model that consistently disagrees with dangerous disinformation narratives, achieving low agreement scores (Q3 < 0.5) - GPT-3 variants (Curie, Babbage, OPT-IML-Max) show high agreement with narratives; quality degrades with smaller model size - Model capacity impacts quality: larger models generate more plausible disinformation that more closely resembles news articles

Safety mechanisms: - Only Falcon (via training) and ChatGPT (via filters) exhibit detectable safety behavior - Falcon filtered ~30% of requests; ChatGPT's behavioral safeguards reduce disinformation (Q2 score 0.23) but generate disclaimers - Most other models (80%+) lack applicable safety filters for the disinformation use case - GPT-4's annotations overestimate safety filter presence vs. human evaluation

Detection performance: - Best detector (fine-tuned ELECTRA-large): ~0.8 F1 score - Models overestimate or underestimate dangerous vs. safe texts inconsistently - Miscalibration: some detectors (GPT-4) overestimate dangerous texts; others (Llama-2) show false positives - Detection becomes feasible at scale but remains unreliable per-sample

Prompt engineering effects: - Title-abstract prompts (with narrative abstracts) improve LLM agreement with narratives vs. title-only prompts - Incorporating abstracts helps LLMs understand context but increases dangerous behavior

Connections¶

Extends Chen et al. (2023) on the challenge of combating misinformation in the age of LLMs by providing empirical evidence of disinformation generation capabilities
Related to Mitchell et al. (2023) on zero-shot detection of machine-generated text, though focused on disinformation rather than general detection
Complements Zellers et al. (2021) on neural fake news generation (GROVER) by extending to contemporary instruction-tuned models with larger scale evaluation
Relevant to Evans et al. (2021) on developing truthful AI that does not lie, showing current safeguards are incomplete
Related to detection work Oshikawa et al. (2020) on NLP-based detection, demonstrating challenges detecting LLM-specific content

Notes¶

Strengths: - First comprehensive empirical study across multiple LLMs with rigorous human evaluation - Clear finding that Falcon stands out as inherently safer; most others lack safeguards - Practical evaluation framework replicable for future models - Honest assessment of detection limitations at the per-sample level

Limitations and open questions: - Study limited to English and September 2023 models; newer models (GPT-4 architecture changes) may behave differently - Evaluation limited to narrative agreement and argument mining; does not assess other harmful behaviors (bias, toxicity, offensive content) - Prompt engineering may not be robust against sophisticated disinformation actors; focused attackers could find better prompts - Cross-language performance unknown; non-English LLM disinformation capabilities remain unexplored - Detection evaluation primarily on new (1,200) texts; transfer to historical misinformation datasets unclear

Future work: - Monitor evolving LLM capabilities as models are fine-tuned and safety mechanisms improve - Expand evaluation to other media (social posts, comments) and non-English languages - Explore adversarial prompt engineering and detection robustness under attacks - Study responsible disclosure and coordination with LLM developers for safety improvements