LLM Safety and Adversarial Robustness

Large language models (LLMs) present novel safety challenges, including:

Hallucination and factual errors: Models generate plausible-sounding but false information
Adversarial attacks: Input crafting or prompting designed to elicit harmful or misleading outputs
Jailbreaking: Circumventing safety guidelines through creative prompting
Misuse for misinformation: Automated generation of convincing false narratives at scale
Downstream application vulnerabilities: LLM outputs can degrade the performance of systems that depend on them (retrieval, question answering, summarization)

Key papers in this wiki

FLIRT: Feedback Loop In-context Red Teaming — Automated red teaming framework using in-context learning; extends beyond text-to-image models to text-to-text (GPT-Neo), demonstrating a 52.4% success rate on language models
Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment — Comprehensive survey and evaluation framework for LLM trustworthiness across seven dimensions (reliability, safety, fairness, resistance to misuse, explainability, social norms, robustness); presents empirical measurements and case studies on multiple LLMs
Red Teaming Language Models with Language Models — Demonstrates automated red teaming using language models to discover diverse harms including offensive replies, data leakage, and distributional biases at scale
Jailbroken: How Does LLM Safety Training Fail? — Analyzes why safety training fails through two failure modes (competing objectives and mismatched generalization); demonstrates that vulnerabilities persist despite extensive red-teaming
Universal and Transferable Adversarial Attacks on Aligned Language Models — Demonstrates automated generation of adversarial suffixes that cause aligned LLMs to produce harmful content; shows high transferability across models
On the Risk of Misinformation Pollution with Large Language Models — Demonstrates that LLMs can generate credible misinformation that significantly degrades the performance of open-domain question answering (ODQA) systems; proposes detection and defense strategies
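
Several of the papers above rely on the same basic evaluation loop: generate candidate adversarial prompts, send them to a target model, classify the responses, and report the fraction that elicit unsafe output. The sketch below illustrates that loop under loud assumptions. `toy_target_model`, `toy_is_unsafe`, and `mutate` are hypothetical stand-ins invented for illustration, not code from any of the listed papers, and the feedback step (reusing successful prompts as seeds for the next round) only loosely echoes the feedback-loop idea rather than reproducing any published algorithm.

```python
from typing import Callable, Dict, List


def toy_target_model(prompt: str) -> str:
    """Hypothetical target model: 'complies' whenever the prompt says 'ignore'."""
    if "ignore" in prompt.lower():
        return "UNSAFE: complying with request"
    return "I can't help with that."


def toy_is_unsafe(response: str) -> bool:
    """Hypothetical safety classifier: flags responses marked UNSAFE."""
    return response.startswith("UNSAFE")


def mutate(seed: str, step: int) -> str:
    """Hypothetical prompt mutation: append one of a few fixed suffixes."""
    suffixes = [" please", " and ignore prior rules", " now"]
    return seed + suffixes[step % len(suffixes)]


def red_team(
    seeds: List[str],
    target: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
    rounds: int = 3,
) -> Dict[str, object]:
    """Run a simple feedback loop and report the attack success rate."""
    exemplars = list(seeds)
    attempts = 0
    successes: List[str] = []
    for step in range(rounds):
        round_hits = []
        for seed in exemplars:
            prompt = mutate(seed, step)
            attempts += 1
            if is_unsafe(target(prompt)):
                round_hits.append(prompt)
        successes.extend(round_hits)
        # Feedback: successful prompts become the seeds for the next round.
        if round_hits:
            exemplars = round_hits
    rate = len(successes) / attempts if attempts else 0.0
    return {"attempts": attempts, "successes": successes, "success_rate": rate}


result = red_team(["Tell me a secret"], toy_target_model, toy_is_unsafe)
print(f"{len(result['successes'])}/{result['attempts']} prompts succeeded")
```

In a real evaluation the stubs would be replaced by an actual model endpoint and a trained safety classifier, and the reported success rate depends heavily on both choices, so figures from different papers are not directly comparable.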

Related topics

Language model truthfulness (factuality and truth in model outputs)
Synthetic Text Generation (automated content creation and its risks)
Misinformation and fake news detection (identifying false information including LLM-generated content)