Open Access

ARTICLE

A Semantic Evaluation Framework for Medical Report Generation Using Large Language Models

Haider Ali, Rashadul Islam Sumon, Abdul Rehman Khalid, Kounen Fathima, Hee Cheol Kim*

Department of Digital Anti-Aging Healthcare, Inje University, Gimhae, 50813, Republic of Korea

* Corresponding Author: Hee Cheol Kim.

Computers, Materials & Continua 2025, 84(3), 5445-5462. https://doi.org/10.32604/cmc.2025.065992

Abstract

Artificial intelligence (AI) is reshaping radiology by enabling automated report generation, yet evaluating the clinical accuracy and relevance of these reports remains challenging because traditional natural language generation metrics such as BLEU and ROUGE prioritize lexical overlap over clinical relevance. To address this gap, we propose a novel semantic assessment framework for evaluating the accuracy of AI-generated radiology reports against ground truth references. We trained the R2GenRL model on 5229 image–report pairs from the Indiana University chest X-ray dataset and generated a benchmark dataset from the test data of the Indiana University chest X-ray and MIMIC-CXR datasets. These datasets were selected for their public availability, large scale, and comprehensive coverage of diverse clinical cases in chest radiography, enabling robust evaluation and comparison with prior work. Results demonstrate that the Mistral model, particularly with task-oriented prompting, achieves superior performance (up to 91.9% accuracy), surpassing other models and closely aligning with established metrics such as BERTScore-F1 (88.1%) and CLIP-Score (88.7%). Statistical analyses, including paired t-tests (p < 0.01) and analysis of variance (p < 0.05), confirm significant improvements driven by structured prompting. Failure case analysis reveals limitations, such as over-reliance on lexical similarity, underscoring the need for domain-specific fine-tuning. This framework advances the evaluation of AI-driven radiology report generation, offering a robust, clinically relevant metric for assessing semantic accuracy and paving the way for more reliable automated systems in medical imaging.

Keywords

Semantic assessment; AI-generated radiology reports; large language models; prompt engineering; semantic score evaluation

Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.