Explainable Claim Verification via Knowledge-Grounded Reasoning with Large Language Models¶

Authors: Haoran Wang, Kai Shu Affiliation: Illinois Institute of Technology, Chicago, IL, USA ArXiv: 2310.05253

TL;DR¶

This paper introduces FOLK (First-Order-Logic-Guided Knowledge-Grounded Reasoning), a method for verifying factual claims and generating explanations without requiring annotated evidence. FOLK translates claims into first-order logic predicates to decompose them into verifiable sub-claims, then uses LLMs to retrieve and reason over knowledge-grounded answers. Evaluated on three datasets (HoVER, FEVEROUS, SciFactOpen), FOLK achieves state-of-the-art performance while providing human-readable explanations that outperform baselines in coverage and readability.

Contributions¶

A novel framework that verifies claims without annotated evidence while generating comprehensive explanations
Demonstration of FOL-guided claim decomposition as a more effective reasoning strategy than chain-of-thought prompting
Evidence that knowledge-grounding (external retrieval) significantly improves claim verification accuracy over relying solely on LLM internal knowledge
High-quality explanation generation with strong inter-annotator agreement (κ > 0.67 across evaluation criteria)

Method¶

FOLK consists of three stages:

1. FOL-Guided Claim Decomposition The method first translates an input claim into First-Order-Logic (FOL) predicates, decomposing it into constituent sub-claims. For example, "Tomás Smid and Fabricio Santoro were both American tennis players" becomes a FOL clause with predicates checking nationality and profession. This symbolic representation allows systematic decomposition.

2. Knowledge-Grounded Answer Retrieval For each decomposed predicate, FOLK generates intermediate question-answer pairs. LLMs are prompted to retrieve knowledge-grounded answers from external sources (Google Search via SerPAPI). This grounds the reasoning in external knowledge rather than relying solely on LLM hallucination-prone internal knowledge.

3. Veracity Prediction and Explanation Generation Given the knowledge-grounded answers for all predicates, FOLK evaluates the FOL clause to make a final veracity prediction (SUPPORT, REFUTE, or NOT ENOUGH INFO). The intermediate questions, answers, and reasoning steps are used to generate natural-language explanations that justify the decision.

Results¶

Macro F1 scores on three benchmark datasets:

HoVER (multi-hop reasoning): - 2-hop: 66.26% (vs. 71.00% ProgramFC baseline) - 3-hop: 54.80% (vs. 51.04% ProgramFC) - 4-hop: 60.35% (vs. 52.92% ProgramFC)

FEVEROUS (structured + unstructured data): - Numerical reasoning: 59.49% (vs. 54.78% ProgramFC) - Multi-hop reasoning: 67.01% (vs. 59.84% ProgramFC) - Text and table reasoning: 63.42% (vs. 51.69% ProgramFC)

SciFactOpen (domain-specific scientific claims): - 67.59% (vs. 49.70% Direct baseline)

FOLK outperforms all baselines on 6 of 7 evaluation tasks. Key findings: (1) FOL-guided decomposition shows 11.30% average improvement over chain-of-thought, particularly on complex claims (12.13% improvement on 3-hop claims); (2) knowledge-grounded answers substantially improve reasoning—using only Wikipedia reduces performance compared to using Google Search; (3) explanations generated by FOLK receive the best ratings for coverage (1.57 MAR), soundness (1.07 MAR), and readability (1.27 MAR) compared to baselines.

Connections¶

Related to Claim Verification as a core method for automated fact-checking
Uses large language model capabilities for reasoning and explanation generation
Implements explainable AI principles by generating justified decisions
Employs knowledge grounding to reduce LLM hallucination
Applies Natural Language Inference for semantic reasoning
Relevant to Multi Hop Reasoning datasets like Hover and Feverous

Notes¶

Strengths: - Novel application of symbolic reasoning (FOL) to guide neural LLM behavior—bridges classical AI and modern language models - Comprehensive evaluation across three diverse datasets spanning multi-hop, numerical, and scientific reasoning - Rigorous manual evaluation of explanation quality by three annotators with reasonable inter-rater agreement - Achieves competitive results on smaller LLMs (llama-13B, llama-30B), suggesting practical applicability beyond massive models - Honest discussion of computational cost ($20 per 100 examples via OpenAI API) and environmental impact

Limitations: - Synthetic claims in experiments can be decomposed with explicit reasoning structure; real-world claims often have implicit semantic structure requiring more sophisticated reasoning - Requires knowledge sources accessible to the retrieval module; performance depends significantly on retrieval quality (shown in ablation: wikipedia-only substantially underperforms) - Higher computational cost than supervised baselines—claims the method is expensive compared to traditional fact-checking pipelines

Open Questions: - How would FOLK perform on adversarial claims designed to evade symbolic decomposition? - Could the approach generalize to other structured-reasoning tasks beyond claim verification? - What is the performance ceiling when knowledge retrieval is imperfect or unavailable for emerging events?