AI Safety

AI safety encompasses the technical and governance approaches to ensuring that artificial intelligence systems behave as intended and do not cause harm. In the context of misinformation and disinformation, AI safety focuses on preventing or limiting the ability of large language models (LLMs) to generate false or misleading content.

Safety mechanisms include training-based approaches (instruction fine-tuning to refuse harmful requests), filtering mechanisms (screening inputs or outputs for disallowed content), and design choices (such as model size and training data curation) that influence model behavior.
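
As a rough illustration of the filtering approach, the sketch below screens generated text against a small blocklist and substitutes a refusal message when a pattern matches. Everything in it is an assumption made for illustration: the patterns, the `screen_output` function, and the `FilterResult` type are hypothetical, and deployed systems generally rely on trained classifiers rather than fixed keyword lists.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical blocklist for illustration only; a real filter would
# usually be a trained classifier, not a handful of regexes.
DISALLOWED_PATTERNS = [
    re.compile(r"\bmiracle cure\b", re.IGNORECASE),
    re.compile(r"\bvaccines? cause autism\b", re.IGNORECASE),
]

REFUSAL_MESSAGE = "This response was withheld by the content filter."


@dataclass
class FilterResult:
    allowed: bool        # True if the text passed the filter
    text: str            # Original text, or the refusal message
    matched: List[str]   # Patterns that triggered the filter


def screen_output(text: str) -> FilterResult:
    """Screen model output and replace it if any pattern matches."""
    matched = [p.pattern for p in DISALLOWED_PATTERNS if p.search(text)]
    if matched:
        return FilterResult(False, REFUSAL_MESSAGE, matched)
    return FilterResult(True, text, [])


if __name__ == "__main__":
    for sample in ["The forecast is sunny.", "Vaccines cause autism."]:
        result = screen_output(sample)
        print(result.allowed, "|", result.text)
```

A filter like this sits outside the model, so it can be updated without retraining, but it only catches phrasings that its rules anticipate.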

Key papers