Open Access
ARTICLE
A Semantic Evaluation Framework for Medical Report Generation Using Large Language Models
Department of Digital Anti-Aging Healthcare, Inje University, Gimhae, 50813, Republic of Korea
* Corresponding Author: Hee Cheol Kim. Email:
Computers, Materials & Continua 2025, 84(3), 5445-5462. https://doi.org/10.32604/cmc.2025.065992
Received 27 March 2025; Accepted 11 June 2025; Issue published 30 July 2025
Abstract
Artificial intelligence is reshaping radiology by enabling automated report generation, yet evaluating the clinical accuracy and relevance of these reports remains challenging because traditional natural language generation metrics such as BLEU and ROUGE prioritize lexical overlap over clinical relevance. To address this gap, we propose a novel semantic assessment framework for evaluating the accuracy of artificial intelligence-generated (AI-generated) radiology reports against ground truth references. We trained the R2GenRL model on 5229 image–report pairs from the Indiana University chest X-ray dataset and generated a benchmark dataset from test data drawn from the Indiana University chest X-ray and MIMIC-CXR datasets. These datasets were selected for their public availability, large scale, and comprehensive coverage of diverse clinical cases in chest radiography, enabling robust evaluation and comparison with prior work. Results demonstrate that the Mistral model, particularly with task-oriented prompting, achieves superior performance (up to 91.9% accuracy), surpassing other models and closely aligning with established metrics such as BERTScore-F1 (88.1%) and CLIP-Score (88.7%). Statistical analyses, including paired t-tests (p < 0.01) and analysis of variance (p < 0.05), confirm significant improvements driven by structured prompting. Failure case analysis reveals limitations, such as over-reliance on lexical similarity, underscoring the need for domain-specific fine-tuning. This framework advances the evaluation of AI-driven radiology report generation, offering a robust, clinically relevant metric for assessing semantic accuracy and paving the way for more reliable automated systems in medical imaging.
Copyright © 2025 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

