Truthful AI: Developing and Governing AI That Does Not Lie

Authors: Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, William Saunders

Venue: arXiv preprint (October 2021)

TL;DR

As AI systems become more capable and widely deployed, strategic falsehoods ("lies" that an AI system generates to achieve specified objectives) pose increasing risks. This paper proposes a framework of truthfulness standards for AI systems, governance mechanisms to enforce them (including certification bodies and verification procedures), and technical approaches to developing truthful AI. The authors argue that AI truthfulness differs from human honesty because of differences in capability and scale, and they outline both institutional and technical pathways for governing it.

Contributions

Establishes a conceptual framework distinguishing lies, negligent falsehoods, and truthfulness in AI contexts
Proposes standards for AI truthfulness based on avoiding strategic falsehoods and negligent falsehoods
Describes institutional arrangements (industry self-regulation, certification bodies, top-down regulation) for enforcing AI truthfulness
Identifies technical research directions for developing truthful AI systems
Argues for setting standards early, before AI systems become superhuman at strategic deception

Problem

Current AI systems such as GPT-3 produce sophisticated false statements that can be strategically selected to mislead users. Unlike random errors, these "strategic falsehoods" are generated without regard for truth when falsehood serves the system's objectives. As AI is deployed across domains (language generation, conversational AI, content creation), the potential for automated, personalized deception at massive scale grows. Human lying is discouraged by social norms and laws; AI systems are not yet subject to comparable mechanisms.

Framework

Truthfulness vs. Honesty: The paper distinguishes three properties:

Truthfulness: statements are true (match the world)
Honesty: statements match the AI's own beliefs
Undeluded: the AI's beliefs are accurate

The authors focus on truthfulness (not honesty), since determining AI "beliefs" is philosophically complex. A system can be honest (say what it believes) while being systematically untruthful if it was trained on falsehoods. A truthfulness standard avoids this problem because statements are judged against the world rather than against the system's internal state.
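The three properties can be separated with a toy sketch. It assumes, purely for illustration, that what the system says, what it believes, and what is actually the case can each be reduced to a boolean for a single proposition. The names `Assertion`, `is_truthful`, `is_honest`, and `is_undeluded` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One proposition seen from three angles (hypothetical fields)."""
    stated: bool    # what the system says about the proposition
    believed: bool  # what the system "believes" (assumed observable here)
    actual: bool    # what is in fact the case


def is_truthful(a: Assertion) -> bool:
    # Truthfulness: the statement matches the world.
    return a.stated == a.actual


def is_honest(a: Assertion) -> bool:
    # Honesty: the statement matches the system's own belief.
    return a.stated == a.believed


def is_undeluded(a: Assertion) -> bool:
    # Undeluded: the belief matches the world.
    return a.believed == a.actual


# A system trained on a falsehood sincerely reports a wrong belief.
sincere_but_wrong = Assertion(stated=True, believed=True, actual=False)
print(is_honest(sincere_but_wrong), is_truthful(sincere_but_wrong))
```

In this toy model the sincere-but-wrong case is honest yet untruthful, which is the situation the paragraph above describes. It also shows why honesty requires access to `believed`, whereas truthfulness can be checked from the statement and the world alone.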

Lies and Negligent Falsehoods: A "lie" is a false statement that is strategically selected and optimized for the speaker's benefit, with little or no optimization pressure toward truth. As AI gains selection power (the ability to choose among many possible outputs), it can move from producing falsehoods by accident to producing lies that are deliberately optimized for particular outcomes. The paper proposes standards focused on avoiding both outright lies and "negligent falsehoods": statements that are unacceptably likely to be false and that the AI system could feasibly have recognized as such.
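The role of selection power and truthfulness pressure can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Candidate` fields, the scoring rule (benefit minus a weighted penalty for likely falsehood), and the `max_p_false` threshold are hypothetical and not taken from the paper. The check also simplifies the paper's definition by leaving out the condition that the system could feasibly have recognized the problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    p_false: float  # estimated probability the statement is false (assumed given)
    benefit: float  # payoff to the speaker if this statement is emitted (assumed given)


def select(candidates: list[Candidate], truth_pressure: float) -> Candidate:
    """Pick the candidate maximizing benefit minus a penalty for likely falsehood."""
    return max(candidates, key=lambda c: c.benefit - truth_pressure * c.p_false)


def is_negligent(c: Candidate, max_p_false: float = 0.1) -> bool:
    """Flag statements whose estimated chance of being false exceeds the standard."""
    return c.p_false > max_p_false


candidates = [
    Candidate("This product has no known side effects.", p_false=0.9, benefit=1.0),
    Candidate("This product has some reported side effects.", p_false=0.05, benefit=0.4),
]

unpressured = select(candidates, truth_pressure=0.0)  # optimizes benefit only
pressured = select(candidates, truth_pressure=2.0)    # falsehood is penalized

print(unpressured.text, is_negligent(unpressured))
print(pressured.text, is_negligent(pressured))
```

With no truthfulness pressure the selector picks the likely-false statement because it pays best; with the penalty applied it picks the likely-true one. The more candidates a system can choose among, the more room there is for this kind of selection.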

Standards Framework: Truthfulness standards can vary along three dimensions:

Height/Stringency: how demanding the minimum truthfulness requirement is
Breadth: how widely the standard applies within a domain
Sanctions: what happens when systems violate standards (formal law vs. social norms)

Governance

Three institutional arrangements are discussed:

Industry self-regulation: companies like OpenAI set their own truthfulness standards, with certifying bodies (potentially third-party) evaluating adherence. Advantages: faster, voluntary participation. Disadvantages: potential for capture, insufficient oversight.
Regulated self-regulation: hybrid model where governments set standards, private institutions verify compliance, and enforcement is mixed (formal legal penalties plus social pressure).
Top-down regulation: centralized government bodies establish and enforce standards across AI systems. Advantages: equal application. Disadvantages: slow-moving, risk of political capture ossifying standards around particular views of truth.

The paper argues that adjudication bodies evaluating truth-related disputes are critical infrastructure. These could be courts (formal), specialized tribunals (faster, expert), or certification bodies (pre-deployment evaluation). Certification procedures might include:

Pre-deployment testing and red-teaming
Post-deployment monitoring and auditing
Periodic recertification as systems evolve
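The certification steps listed above could be sketched as simple checks. This is a minimal illustration under stated assumptions, not a procedure from the paper: it assumes evaluators have already judged each test answer as truthful or not, and the function names, the `required_rate` of 0.95, and the 365-day recertification interval are all hypothetical.

```python
from datetime import date, timedelta


def truthful_rate(judgments: list[bool]) -> float:
    """Fraction of evaluated answers judged truthful."""
    if not judgments:
        raise ValueError("no judgments to score")
    return sum(judgments) / len(judgments)


def passes_certification(judgments: list[bool], required_rate: float = 0.95) -> bool:
    """Pre-deployment test or post-deployment audit: compare the rate to the standard."""
    return truthful_rate(judgments) >= required_rate


def recertification_due(last_certified: date, today: date,
                        interval_days: int = 365) -> bool:
    """Periodic recertification: due once the interval has elapsed."""
    return today - last_certified >= timedelta(days=interval_days)


pre_deployment = [True] * 97 + [False] * 3   # 97 of 100 answers judged truthful
post_deployment = [True] * 90 + [False] * 10  # audit sample after release

print(passes_certification(pre_deployment))
print(passes_certification(post_deployment))
print(recertification_due(date(2021, 1, 1), date(2022, 6, 1)))
```

The same scoring function serves both the pre-deployment test and the post-deployment audit, so a system that passes at release can still fail later if its behavior drifts.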

Technical Approaches

The paper surveys technical directions toward truthful AI:

Improving training data: curating high-quality, truthful training corpora; filtering for factual accuracy
Reinforcement learning from human feedback: training systems to optimize for human judgments of truthfulness (extending RLHF approaches)
Adversarial training: exposing systems to adversarial examples and counterexamples to improve robustness
Transparency and interpretability: making systems' reasoning more legible so humans can identify and correct untruthful outputs
Bootstrapping: using more truthful AI systems (or humans) to evaluate and improve less truthful ones
Retrieval and grounding: anchoring outputs in verified external sources rather than generative modeling alone
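As one illustration of the last item, retrieval and grounding can be sketched as checking a model's draft answer against verified sources and abstaining when none exists. This is a toy under stated assumptions: an in-memory dictionary stands in for a real retrieval system, exact string matching stands in for real claim verification, and `grounded_answer`, `normalize`, and `ABSTAIN` are hypothetical names.

```python
ABSTAIN = "No verified source found; declining to answer."


def normalize(question: str) -> str:
    """Canonicalize a question so it can be used as a lookup key."""
    return question.strip().lower().rstrip("?")


def grounded_answer(question: str, draft: str, sources: dict[str, str]) -> str:
    """Return the draft only if a verified source supports it."""
    verified = sources.get(normalize(question))
    if verified is None:
        return ABSTAIN  # nothing to ground the draft in
    if draft == verified:
        return draft    # the draft agrees with the source
    return f"{verified} (corrected from source)"


# Hypothetical verified source table.
sources = {
    "what is the boiling point of water at sea level": "100 degrees Celsius",
}

print(grounded_answer("What is the boiling point of water at sea level?",
                      "90 degrees Celsius", sources))
print(grounded_answer("Who will win next year's election?", "Candidate A", sources))
```

The design choice is that the source, not the generative draft, has the final word: an unsupported draft is replaced or withheld rather than emitted.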

Broader Context

The paper emphasizes two tensions:

Misrealization risk: poorly designed truthfulness standards, captured by political interests, could ossify around a narrow view of truth and stifle open inquiry. The authors propose standards emphasizing pluralism, avoiding prejudice, and resisting capture.
Timing: AI-produced speech is growing fast and is hard to detect. Setting standards early, before systems become superhuman at deception, may establish precedents that persist for decades. Waiting until problems are acute risks leaving entrenched norms that are difficult to change.

Strengths

Timely and forward-looking: addresses governance questions before capability reaches critical levels
Conceptually rigorous: carefully distinguishes truthfulness from related concepts (honesty, integrity, transparency)
Multidisciplinary: integrates philosophy, computer science, policy, and governance perspectives
Practical institutional design: proposes concrete mechanisms (certification bodies, adjudication) rather than vague aspirations

Weaknesses

Truthfulness as necessary but not sufficient: the paper acknowledges that truthful AI can still be manipulative, misleading, or harmful if it selectively emphasizes truths; truthfulness alone doesn't ensure beneficial AI
Technical feasibility unclear: specific techniques for achieving robust truthfulness at scale remain under-explored; the paper is primarily a governance proposal rather than a technical roadmap
Assumes convergence on truth: international and cross-cultural disagreement on fundamental facts (health, history, politics) will complicate enforcement of global standards
Certification burden: implementing pre-deployment and post-deployment audits is expensive; may be infeasible for small developers or open-source models

Connections

Related to Mirsky et al. on offensive AI via concern with AI-enabled deception and misuse
Complements Wardle & Derakhshan's information disorder framework by extending terminology to AI-generated content
Discusses similar governance challenges to Tsfati et al. on mainstream media and misinformation, but for computational sources
Connected to broader AI safety research and AI alignment literature on beneficial AI development

Notes

This is a high-level policy and conceptual paper rather than an empirical study or technical methods paper. Its main contribution is a governance and conceptual framework for an important future problem. The authors include researchers at the Future of Humanity Institute and OpenAI, reflecting an institutional focus on AI safety and beneficial AI development. The emphasis on early standards-setting reflects a concern that AI capabilities may outpace societal governance; if that concern is accurate, the paper's recommendations are time-sensitive.