LLM Safety and Adversarial Robustness
Large language models (LLMs) present novel safety challenges, including:
- Hallucination and factual errors: Models generate plausible-sounding but false information
- Adversarial attacks: Input crafting or prompting designed to elicit harmful or misleading outputs
- Jailbreaking: Circumventing safety guidelines through creative prompting
- Misuse for misinformation: Automated generation of convincing false narratives at scale
- Downstream application vulnerabilities: LLM outputs can degrade the performance of systems that depend on them (retrieval, QA, summarization)
Key papers in this wiki
- FLIRT: Feedback Loop In-context Red Teaming — Automated red teaming framework using in-context learning; extends beyond text-to-image models to text-to-text (GPT-Neo), demonstrating a 52.4% success rate on language models
- Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment — Comprehensive survey and evaluation framework for LLM trustworthiness across seven dimensions (reliability, safety, fairness, resistance to misuse, explainability, social norms, robustness); presents empirical measurements and case studies on multiple LLMs
- Red Teaming Language Models with Language Models — Demonstrates automated red teaming using language models to discover diverse harms including offensive replies, data leakage, and distributional biases at scale
- Jailbroken: How Does LLM Safety Training Fail? — Analyzes why safety training fails through two failure modes (competing objectives and mismatched generalization); demonstrates that vulnerabilities persist despite extensive red-teaming
- Universal and Transferable Adversarial Attacks on Aligned Language Models — Demonstrates automated generation of adversarial suffixes that cause aligned LLMs to produce harmful content; shows high transferability across models
- On the Risk of Misinformation Pollution with Large Language Models — Demonstrates that LLMs can generate credible misinformation that significantly degrades the performance of open-domain question answering (ODQA) systems; proposes detection and defense strategies
Related topics
- Language model truthfulness (factuality and truth in model outputs)
- Synthetic Text Generation (automated content creation and its risks)
- Misinformation and fake news detection (identifying false information including LLM-generated content)