AI Safety

AI safety encompasses the technical and governance approaches to ensuring that artificial intelligence systems behave as intended and do not cause harm. In the context of misinformation and disinformation, AI safety focuses on preventing or limiting the ability of large language models (LLMs) to generate false or misleading content.

Safety mechanisms include training-based approaches (instruction fine-tuning so that the model refuses harmful requests), filtering mechanisms (screening prompts and outputs for disallowed content), and design choices (model size, training data curation) that influence model behavior.
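
As a rough illustration of where a filtering mechanism sits relative to the model, the sketch below wraps a generation function with screening on both the prompt and the output. Everything in it is hypothetical: the `BLOCKED_PATTERNS` list, the `screen` and `guarded_generate` names, and the stub `echo_model` stand in for a real moderation classifier and a real LLM call. Deployed filters typically rely on learned classifiers rather than keyword patterns.

```python
import re
from typing import Callable

# Hypothetical disallowed-content patterns (assumption for illustration only);
# a real filter would normally use a trained classifier instead.
BLOCKED_PATTERNS = [
    re.compile(r"\bfake news article\b", re.IGNORECASE),
    re.compile(r"\bvaccines cause autism\b", re.IGNORECASE),
]

REFUSAL = "I can't help with that request."


def screen(text: str) -> bool:
    """Return True if the text matches any disallowed pattern."""
    return any(p.search(text) for p in BLOCKED_PATTERNS)


def guarded_generate(prompt: str, model: Callable[[str], str]) -> str:
    """Screen the prompt, call the model, then screen the model's output."""
    if screen(prompt):
        return REFUSAL
    output = model(prompt)
    if screen(output):
        return REFUSAL
    return output


def echo_model(prompt: str) -> str:
    """Stub standing in for an LLM call (assumption: not a real model)."""
    return f"Response to: {prompt}"


if __name__ == "__main__":
    print(guarded_generate("Summarize the history of vaccination.", echo_model))
    print(guarded_generate("Write a fake news article about the election.", echo_model))
```

Pattern matching of this kind is easy to evade by paraphrasing, which is one reason filtering is usually combined with training-based approaches rather than used alone.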

Key papers

FLIRT: Feedback Loop In-context Red Teaming — Automated red teaming framework using in-context learning to generate adversarial prompts targeting generative models; demonstrates 80%+ attack success on Stable Diffusion, outperforming prior methods
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation — Foundational threat analysis of malicious AI uses across digital, physical, and political domains; emphasizes dual-use nature of AI and importance of safety governance
Discovering Language Model Behaviors with Model-Written Evaluations — Uses LM-generated evaluations to discover unintended safety issues with RLHF: amplified political bias, instrumental subgoals, resistance to shutdown, and inverse scaling phenomena where larger models behave worse on safety-relevant tasks
Toxicity in ChatGPT: Analyzing Persona-assigned Language Models — Large-scale analysis of persona-induced toxicity in ChatGPT, showing that safety mechanisms can be bypassed through system parameter manipulation
Disinformation Capabilities of Large Language Models — Evaluation of safety mechanisms in LLMs, finding that most models lack effective safeguards against disinformation generation
Evans et al. (2021) — Truthful AI: Developing and Governing AI That Does Not Lie — Policy framework for preventing AI systems from generating false or misleading statements
Mirsky et al. (2021) — Threat model of the offensive AI capabilities available to adversaries, including social engineering via deepfakes

Related topics

Large Language Models — the primary AI systems requiring safety mechanisms
Disinformation Generation — the specific harm that safety mechanisms aim to prevent
