HalluBench: A Multi-LLM Benchmark for Hallucination Evaluation and Reliability Analysis

Betül Şenyayla¹, Aytuğ Onan²,*
1 Department of Software Engineering, Faculty of Engineering, Sivas Cumhuriyet University, Sivas, Türkiye
2 Department of Computer Engineering, Faculty of Engineering, İzmir Institute of Technology, İzmir, Türkiye
* Corresponding Author: Aytuğ Onan. Email: email

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.081260

Received 26 February 2026; Accepted 12 May 2026; Published online 12 June 2026

Abstract

Large Language Models (LLMs) have become a cornerstone of modern natural language processing, achieving strong performance across diverse tasks. Despite these advances, their tendency to generate hallucinated or factually unsupported content remains a critical challenge for reliable deployment. Existing evaluation approaches predominantly rely on single-task settings and aggregate performance metrics, implicitly assuming that hallucination behavior is uniform across tasks. This assumption does not hold: hallucination characteristics vary significantly with task formulation, linguistic context, and evaluation criteria. To address these limitations, this paper proposes HalluBench, a task-aware multi-LLM benchmarking framework designed for systematic hallucination analysis and metric-task alignment. The framework evaluates ten language models across four representative task formulations (open-domain question answering, cross-lingual question answering, scientific claim verification, and LLM-as-a-judge assessment) using four benchmark datasets (five evaluation splits) and nine complementary evaluation metrics. Unlike conventional approaches, HalluBench introduces a metric-task alignment strategy that selects evaluation metrics based on their suitability for each task. Experimental results reveal that hallucination behavior is strongly task-dependent, with substantial variations observed across models and evaluation settings. In particular, model reliability is highly sensitive to task formulation; for instance, in adversarial open-domain settings, performance differences of up to 15% in Exact Match (EM) and 20% in F1 scores are observed between top-tier and compact (∼1B parameter) models. By integrating lexical, semantic, and reference-based metrics within a pipeline, HalluBench provides a more robust and diagnostically informative evaluation framework than traditional single-task and single-metric benchmarks.
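For readers unfamiliar with the two lexical metrics named in the abstract, the following is a minimal sketch of Exact Match and token-level F1 as they are commonly defined for extractive question answering (SQuAD-style normalization). It is an illustration under stated assumptions, not the HalluBench implementation: the function names `normalize`, `exact_match`, and `token_f1` are hypothetical, and the normalization rules (lowercasing, removing punctuation and English articles) are an assumption that the paper may or may not share.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the paper may use different rules.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower", "eiffel tower!"))
    print(token_f1("Paris, France", "Paris"))
```

Exact Match rewards only fully correct answers, while token-level F1 gives partial credit for overlapping tokens, which is one reason the two scores can diverge for the same model output.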

Keywords

Hallucination evaluation; large language models; benchmarking framework; factual consistency; LLM reliability
