Sara Asghari
- BSc (Amirkabir University of Technology, 2023)
Topic
Enhancing Fact-Checking in Large Language Models: Cost-Effective Claim Verification through First-Order Logic Reformulation
Department of Computer Science
Date & location
- Tuesday, October 29, 2024
- 1:00 P.M.
- Virtual Defence
Examining Committee
Supervisory Committee
- Dr. Alex Thomo, Department of Computer Science, University of Victoria (Supervisor)
- Dr. Venkatesh Srinivasan, Department of Computer Science, UVic (Member)
External Examiner
- Dr. Tiantian Chen, Department of Mathematics and Computer Science, Santa Clara University
Chair of Oral Examination
- Dr. Trevor Lantz, School of Environmental Studies, UVic
Abstract
In the realm of Large Language Models (LLMs), the ability to accurately perform Fact Checking (FC) tasks, which involve verifying complex claims against challenging evidence from multiple sources, remains a crucial yet under-explored area. Our study presents a comprehensive benchmarking of various LLMs, including GPT-4, on this critical task. We utilize HOVER, a modern, challenging dataset designed explicitly for fact-checking, which comprises thousands of evidence-claim pairs covering diverse aspects of life, history, and entertainment. This dataset differs from the common datasets used to evaluate the reading-comprehension capabilities of LLMs, which consist primarily of question-and-answer pairs.
Our findings demonstrate that GPT-4 not only decisively surpasses the current state-of-the-art (SOTA) models in FC tasks, but also that other, open-source LLMs (e.g., Mixtral and Llama-3) exhibit close-to-SOTA performance out of the box. This implies that simply presenting these models with the evidence text and the claim allows them to infer the claim's veracity effectively. We contrast this with existing SOTA methods, which involve complex, multi-step solutions, including the use of multiple LLMs to verify claims, a process that necessitates continuous updates and local execution, making it less accessible to regular users.
Furthermore, we explore the impact of claim formulation on the effectiveness of the FC task. By converting complex claims into first-order logic (FOL) and then back into natural language, we observe improved performance in some LLMs, particularly on the more challenging subsets of the dataset. This method, although it utilizes GPT-4 for the FOL breakdown, serves as a practical guideline for users: more formally structured claims yield more reliable responses.
Keywords: Fact Checking, Claim, Evidence, First Order Logic, Large Language Models, Prompt Engineering
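The two-step pipeline described above (reformulate the claim via FOL, then verify it against the evidence) can be sketched as follows. This is a minimal illustration: the prompt wording, the `call_llm` stub, and the function names are assumptions for exposition, not the thesis's actual prompts or code.

```python
# Sketch of the FOL-reformulation pipeline from the abstract.
# Prompt wording and the call_llm stub are illustrative assumptions.

FOL_PROMPT = (
    "Rewrite the following claim as first-order logic, then translate the "
    "logic back into clear, formally structured natural language.\n\n"
    "Claim: {claim}"
)

VERIFY_PROMPT = (
    "Evidence:\n{evidence}\n\n"
    "Claim: {claim}\n\n"
    "Based only on the evidence, answer SUPPORTED or NOT SUPPORTED."
)

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g., GPT-4 or an open-source LLM)."""
    raise NotImplementedError

def verify_claim(claim: str, evidence: str, reformulate: bool = False) -> str:
    """Optionally restructure the claim via FOL, then ask the model for a verdict."""
    if reformulate:
        # Step 1: FOL breakdown and back-translation to formal natural language.
        claim = call_llm(FOL_PROMPT.format(claim=claim))
    # Step 2: present evidence and (possibly restructured) claim for verification.
    return call_llm(VERIFY_PROMPT.format(evidence=evidence, claim=claim))
```

The key point the sketch captures is that the reformulation step is a pre-processing pass on the claim alone; the verification prompt itself stays unchanged, so the technique works with any model that accepts evidence-claim prompts.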