Sara Asghari
- BSc (Amirkabir University of Technology, 2023)
Topic
Enhancing Fact-Checking in Large Language Models: Cost-Effective Claim Verification through First-Order Logic Reformulation
Department of Computer Science
Date & location
- Tuesday, October 29, 2024
- 1:00 P.M.
- Virtual Defence
Examining Committee
Supervisory Committee
- Dr. Alex Thomo, Department of Computer Science, University of Victoria (Supervisor)
- Dr. Venkatesh Srinivasan, Department of Computer Science, UVic (Member)
External Examiner
- Dr. Tiantian Chen, Department of Mathematics and Computer Science, Santa Clara University
Chair of Oral Examination
- Dr. Trevor Lantz, School of Environmental Studies, UVic
Abstract
In the realm of Large Language Models (LLMs), the ability to accurately perform Fact Checking (FC) tasks, which involve verifying complex claims against challenging evidence from multiple sources, remains a crucial yet under-explored area. Our study presents a comprehensive benchmarking of various LLMs, including GPT-4, on this critical task. We utilize HOVER, a modern, challenging dataset designed explicitly for fact-checking, which comprises thousands of evidence-claim pairs covering diverse aspects of life, history, and entertainment. This dataset differs from the common datasets used to evaluate the reading-comprehension capabilities of LLMs, which consist primarily of question-and-answer pairs.
Our findings demonstrate that GPT-4 not only decisively surpasses the current state-of-the-art (SOTA) models in FC tasks, but also that other, open-source LLMs (e.g., Mixtral and Llama-3) exhibit close-to-SOTA performance out of the box. This implies that simply presenting these models with the evidence text and the claim allows them to infer the claim's veracity effectively. We contrast this with existing SOTA methods, which involve complex, multi-step solutions, including the use of multiple LLMs to verify claims, a process that necessitates continuous updates and local execution, making it less accessible to regular users.
Furthermore, we explore the impact of claim formulation on the effectiveness of the FC task. By converting complex claims into first-order logic (FOL) and then back into natural language, we observe improved performance in some LLMs, particularly on the more challenging subsets of the dataset. This method, although it utilizes GPT-4 for the FOL breakdown, serves as a practical guideline for users: more formally structured claims yield more reliable responses.
Keywords: Fact Checking, Claim, Evidence, First Order Logic, Large Language Models, Prompt Engineering
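The two-step pipeline described above (reformulate the claim via FOL, then verify it against the evidence) can be sketched as follows. This is a minimal illustration: the prompt wording, the `call_llm` stub, and the function names are assumptions for exposition, not the thesis's actual prompts or code.

```python
# Sketch of the FOL-reformulation pipeline from the abstract.
# Prompt wording and the call_llm stub are illustrative assumptions.

FOL_PROMPT = (
    "Rewrite the following claim as first-order logic, then translate the "
    "logic back into clear, formally structured natural language.\n\n"
    "Claim: {claim}"
)

VERIFY_PROMPT = (
    "Evidence:\n{evidence}\n\n"
    "Claim: {claim}\n\n"
    "Based only on the evidence, answer SUPPORTED or NOT SUPPORTED."
)

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g., GPT-4 or an open-source LLM)."""
    raise NotImplementedError

def verify_claim(claim: str, evidence: str, reformulate: bool = False) -> str:
    """Optionally restructure the claim via FOL, then ask the model for a verdict."""
    if reformulate:
        # Step 1: FOL breakdown and back-translation to formal natural language.
        claim = call_llm(FOL_PROMPT.format(claim=claim))
    # Step 2: present evidence and (possibly restructured) claim for verification.
    return call_llm(VERIFY_PROMPT.format(evidence=evidence, claim=claim))
```

The key point the sketch captures is that the reformulation step is a pre-processing pass on the claim alone; the verification prompt itself stays unchanged, so the technique works with any model that accepts evidence-claim prompts.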