AI Safety
AI safety encompasses the technical and governance approaches to ensuring that artificial intelligence systems behave as intended and do not cause harm. In the context of misinformation and disinformation, AI safety focuses on preventing or limiting LLMs' ability to generate false or misleading content.
Safety mechanisms include training-based approaches (instruction fine-tuning so that the model refuses harmful requests), filtering mechanisms (screening for disallowed content), and model design choices (model size, training data curation) that influence model behavior.
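To make the filtering idea concrete, here is a minimal sketch of a wrapper that screens both the prompt and the model output for disallowed content. Everything in it is an illustrative assumption: the names `DISALLOWED_PATTERNS`, `violates_policy`, `guarded_generate`, and `echo_model` are hypothetical, and the substring blocklist stands in for the trained classifiers that deployed systems more commonly use.

```python
from typing import Callable

# Hypothetical blocklist, for illustration only. Real filters are usually
# trained classifiers rather than substring matches.
DISALLOWED_PATTERNS = ("write a fake news article", "fabricate a quote")

REFUSAL = "I can't help with that request."


def violates_policy(text: str) -> bool:
    """Return True if the text contains any disallowed pattern (case-insensitive)."""
    lowered = text.lower()
    return any(pattern in lowered for pattern in DISALLOWED_PATTERNS)


def guarded_generate(prompt: str, generate: Callable[[str], str]) -> str:
    """Call `generate` only if the prompt passes the filter, then screen the output."""
    if violates_policy(prompt):
        return REFUSAL  # input filter: block before the model is called
    response = generate(prompt)
    if violates_policy(response):
        return REFUSAL  # output filter: block disallowed generations
    return response


def echo_model(prompt: str) -> str:
    """Stand-in for an LLM call (assumption: any str -> str callable works)."""
    return f"Model response to: {prompt}"


if __name__ == "__main__":
    print(guarded_generate("Write a fake news article about vaccines", echo_model))
    print(guarded_generate("Summarize this study", echo_model))
```

A filter like this sits outside the model, so it complements training-based refusals rather than replacing them. It also shows why simple filters are easy to evade: a rephrased request that avoids the listed patterns passes straight through.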
Key papers
- FLIRT: Feedback Loop In-context Red Teaming — Automated red teaming framework using in-context learning to generate adversarial prompts targeting generative models; demonstrates 80%+ attack success on Stable Diffusion, outperforming prior methods
- The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation — Foundational threat analysis of malicious AI uses across digital, physical, and political domains; emphasizes dual-use nature of AI and importance of safety governance
- Discovering Language Model Behaviors with Model-Written Evaluations — Uses LM-generated evaluations to discover unintended safety issues with RLHF: amplified political bias, instrumental subgoals, resistance to shutdown, and inverse scaling phenomena where larger models behave worse on safety-relevant tasks
- Toxicity in ChatGPT: Analyzing Persona-assigned Language Models — Large-scale analysis of persona-induced toxicity in ChatGPT, showing that safety mechanisms can be bypassed by manipulating the system parameter
- Disinformation Capabilities of Large Language Models — Evaluation of safety mechanisms in LLMs, finding that most models lack effective safeguards against disinformation generation
- Evans et al. (2021) — Truthful AI: Developing and Governing AI That Does Not Lie — Policy framework for preventing AI systems from generating false or misleading statements
- Mirsky et al. (2021) — Threat model of the offensive AI capabilities available to adversaries, including social engineering via deepfakes
Related topics
- Large Language Models — the primary AI systems requiring safety mechanisms
- Disinformation Generation — the specific harm that safety mechanisms aim to prevent