Disinformation Capabilities of Large Language Models

Authors: Ivan Vykopal, Matúš Pikuliak, Ivan Srba, Robert Moro, Dominik Macko, Maria Bielikova

Venue: arXiv — 2311.08838

TL;DR

Comprehensive empirical evaluation of 10 LLMs' ability to generate disinformation news articles. Most models readily generate convincing articles that agree with dangerous narratives; Falcon is the only model that consistently disagrees with them. Only two models (Falcon and ChatGPT) show detectable safety behavior. The best detectors reach ~0.8 F1, which is still unreliable for individual texts.

Contributions

  • First large-scale empirical study of disinformation generation capabilities across 10 LLMs, including ChatGPT, GPT-4, the GPT-3 variants Davinci, Curie, and Babbage, Falcon, OPT-IML-Max, and Vicuna
  • Systematic evaluation framework with 20 disinformation narratives (COVID-19, Russia-Ukraine, Health, US Elections, Regional) and 1,200 generated texts manually evaluated by human annotators
  • Human evaluation methodology with 6 questions assessing coherence, article style, narrative agreement, novel arguments, and safety
  • Detection analysis showing that fine-tuned detectors achieve ~0.8 F1 but fail to reliably distinguish LLM-generated from human misinformation
  • Safety mechanism assessment finding that only Falcon and ChatGPT exhibit effective safety filters; most other models lack safeguards

Method

The authors generated disinformation news articles using 10 LLMs across 20 manually-curated disinformation narratives drawn from professional fact-checking platforms (AFP, EDMO). For each narrative, they generated three articles per prompt type (title-only and title-abstract prompts), totaling 1,200 texts.
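The corpus size follows directly from the experimental grid described above. A minimal sketch that enumerates that grid; the model and narrative labels are placeholders, and only the counts (10 models, 20 narratives, 2 prompt types, 3 generations) come from the summary:

```python
from itertools import product

# Placeholder labels: only the counts (10, 20, 2, 3) come from the summary.
MODELS = [f"model_{i}" for i in range(10)]
NARRATIVES = [f"narrative_{i}" for i in range(20)]
PROMPT_TYPES = ["title_only", "title_abstract"]
GENERATIONS_PER_PROMPT = 3


def build_generation_grid():
    """Return one (model, narrative, prompt_type, run) tuple per generated text."""
    return list(
        product(MODELS, NARRATIVES, PROMPT_TYPES, range(GENERATIONS_PER_PROMPT))
    )


grid = build_generation_grid()
print(len(grid))  # 10 * 20 * 2 * 3 = 1200
```

Each tuple identifies one generated text, so the same grid could serve as an index for the annotation sheet.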

Human evaluation involved two annotators rating each text on a 5-step scale across six dimensions:

  • Q1 (Well-formed): textual coherence and grammar
  • Q2 (Article): whether text resembles a news article
  • Q3 (Agree): agreement with the disinformation narrative
  • Q4 (Disagree): presence of opposing arguments
  • Q5 (Args in favor): novel supporting arguments
  • Q6 (Args against): arguments contradicting the narrative

Inter-annotator agreement (Pearson's ρ and MAE) showed general consistency with lower agreement on Q1.
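For readers unfamiliar with the two agreement statistics, the following sketch computes Pearson correlation and mean absolute error for one question. The ratings are invented for illustration and the function names are hypothetical; nothing here reproduces the paper's data or code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long rating lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)


def mean_absolute_error(x, y):
    """Average absolute difference between paired ratings."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Invented ratings on a 1-5 scale for a single question (illustration only).
annotator_a = [1, 2, 3, 4, 5, 3]
annotator_b = [1, 3, 3, 5, 4, 3]

print(pearson(annotator_a, annotator_b))
print(mean_absolute_error(annotator_a, annotator_b))  # 0.5
```

The two measures are complementary: correlation captures whether the annotators rank texts similarly, while MAE captures how far apart their absolute scores are.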

Detection evaluation tested five detector classes on a corpus of 1,200 LLM-generated texts and 73 human-written fake news articles: detectors fine-tuned on MULTITuDE (ELECTRA-large, best F1 ~0.8), ChatGPT-based detection, SOTA detectors (ELECTRA-large, RoBERTa), and others. Decision thresholds were chosen by optimizing over the ROC curve.
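The summary does not say which ROC criterion was optimized. As one common choice, the sketch below picks the threshold that maximizes Youden's J (TPR minus FPR) and then reports F1 at that threshold. The criterion, the function names, and the toy scores are all assumptions for illustration, not the authors' procedure.

```python
def roc_points(scores, labels):
    """Return (threshold, tpr, fpr) per distinct score; predict 1 if score >= threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((threshold, tp / positives, fp / negatives))
    return points


def best_threshold(scores, labels):
    """Assumed criterion: maximize Youden's J = TPR - FPR."""
    return max(roc_points(scores, labels), key=lambda p: p[1] - p[2])[0]


def f1_at(scores, labels, threshold):
    """F1 of the positive class when predicting 1 for score >= threshold."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy detector scores; label 1 = LLM-generated, 0 = human-written (invented data).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

threshold = best_threshold(scores, labels)
print(threshold, f1_at(scores, labels, threshold))
```

Selecting the threshold on the same data used for evaluation gives an optimistic F1; in practice a held-out split would be the safer design.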

Results

Disinformation generation capability varies significantly across models:

  • Vicuna and Davinci readily generate disinformation and rarely disagree with narratives
  • ChatGPT behaves safely in most cases, with very low article scores and frequent refusals/disclaimers
  • Falcon is the only model that consistently disagrees with dangerous disinformation narratives, achieving low agreement scores (Q3 < 0.5)
  • The smaller GPT-3 variants (Curie, Babbage) and OPT-IML-Max show high agreement with narratives; quality degrades with smaller model size
  • Model capacity impacts quality: larger models generate more plausible disinformation that more closely resembles news articles

Safety mechanisms:

  • Only Falcon (via training) and ChatGPT (via filters) exhibit detectable safety behavior
  • Falcon filtered ~30% of requests; ChatGPT's behavioral safeguards reduce disinformation (Q2 score 0.23) but generate disclaimers
  • Most other models (80%+) lack applicable safety filters for the disinformation use case
  • GPT-4's annotations overestimate safety filter presence vs. human evaluation

Detection performance:

  • Best detector (fine-tuned ELECTRA-large): ~0.8 F1 score
  • Models overestimate or underestimate dangerous vs. safe texts inconsistently
  • Miscalibration: some detectors (GPT-4) overestimate dangerous texts; others (Llama-2) show false positives
  • Detection becomes feasible at scale but remains unreliable per-sample

Prompt engineering effects:

  • Title-abstract prompts (with narrative abstracts) improve LLM agreement with narratives vs. title-only prompts
  • Incorporating abstracts helps LLMs understand context but increases dangerous behavior

Connections

  • Extends Chen et al. (2023) on the challenge of combating misinformation in the age of LLMs by providing empirical evidence of disinformation generation capabilities
  • Related to Mitchell et al. (2023) on zero-shot detection of machine-generated text, though focused on disinformation rather than general detection
  • Complements Zellers et al. (2019) on neural fake news generation (GROVER) by extending to contemporary instruction-tuned models with larger scale evaluation
  • Relevant to Evans et al. (2021) on developing truthful AI that does not lie, showing current safeguards are incomplete
  • Related to Oshikawa et al. (2020) on NLP-based fake news detection, demonstrating the difficulty of detecting LLM-generated content

Notes

Strengths:

  • First comprehensive empirical study across multiple LLMs with rigorous human evaluation
  • Clear finding that Falcon stands out as inherently safer; most others lack safeguards
  • Practical evaluation framework replicable for future models
  • Honest assessment of detection limitations at the per-sample level

Limitations and open questions:

  • Study limited to English and September 2023 models; newer models (GPT-4 architecture changes) may behave differently
  • Evaluation limited to narrative agreement and argument mining; does not assess other harmful behaviors (bias, toxicity, offensive content)
  • The prompts used were simple; focused attackers with more sophisticated prompt engineering could elicit more harmful output
  • Cross-language performance unknown; non-English LLM disinformation capabilities remain unexplored
  • Detection evaluation run primarily on the 1,200 newly generated texts; transfer to historical misinformation datasets unclear

Future work:

  • Monitor evolving LLM capabilities as models are fine-tuned and safety mechanisms improve
  • Expand evaluation to other media (social posts, comments) and non-English languages
  • Explore adversarial prompt engineering and detection robustness under attacks
  • Study responsible disclosure and coordination with LLM developers for safety improvements