LLM Safety and Adversarial Robustness

Large language models (LLMs) present novel safety challenges, including:

  • Hallucination and factual errors: Models generate plausible-sounding but false information
  • Adversarial attacks: Inputs or prompts crafted to elicit harmful or misleading outputs
  • Jailbreaking: Circumventing safety guidelines through creative prompting
  • Misuse for misinformation: Automated generation of convincing false narratives at scale
  • Downstream application vulnerabilities: LLM outputs degrading the performance of systems that depend on them (retrieval, QA, summarization)

Key papers in this wiki