Optimizing Sentiment Integration in Image Captioning Using Transformer-Based Fusion Strategies
1 School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China
2 School of Computing, National University of Computer and Emerging Sciences, Islamabad, 04403, Pakistan
3 Institute of Information and Computational Technologies, Almaty, 050010, Kazakhstan
4 Institute of Automation and Information Technologies, Satbayev University, Almaty, 050013, Kazakhstan
5 Turan University, Chaikina St 12a, Almaty, 050020, Kazakhstan
* Corresponding Authors: Hongying Zan; Orken Mamyrbayev
Computers, Materials & Continua 2025, 84(2), 3407-3429. https://doi.org/10.32604/cmc.2025.065872
Received 24 March 2025; Accepted 09 May 2025; Issue published 03 July 2025
Abstract
While automatic image captioning systems have made notable progress in recent years, generating captions that fully convey sentiment remains a considerable challenge. Although existing models achieve strong performance in visual recognition and factual description, they often fail to account for the emotional context that is naturally present in human-written captions. To address this gap, we propose the Sentiment-Driven Caption Generator (SDCG), which combines transformer-based visual and textual processing with multi-level fusion. RoBERTa extracts sentiment from the textual input, while visual features are handled by the Vision Transformer (ViT). These features are combined using several fusion approaches: Concatenation, Attention, Visual-Sentiment Co-Attention (VSCA), and Cross-Attention. Our experiments show that SDCG reaches 94.52% sentiment accuracy, significantly outperforming baseline models such as the Generalized Image Transformer (GIT, 82.01%) and Bootstrapping Language-Image Pre-training (BLIP, 83.07%), while also improving BLEU and ROUGE-L scores. More importantly, the generated captions are more natural: they incorporate emotional cues and contextual awareness, making them resemble captions written by a human.
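To make the fusion stage concrete, the sketch below shows one plausible reading of the Cross-Attention variant described in the abstract: ViT patch tokens act as queries over RoBERTa sentiment tokens, with a residual connection and layer normalization. This is a minimal illustrative sketch, not the paper's implementation; the feature dimensions, the residual/normalization layout, and the `CrossAttentionFusion` module name are our assumptions.

```python
# Illustrative sketch of cross-attention fusion between ViT visual tokens
# and RoBERTa sentiment tokens. Dimensions and layout are assumptions;
# the actual SDCG architecture may differ.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse ViT visual tokens with RoBERTa sentiment tokens via cross-attention."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Visual tokens attend to sentiment tokens (queries come from the image side).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, sentiment: torch.Tensor) -> torch.Tensor:
        # visual:    (batch, num_patches, dim) -- e.g., ViT patch embeddings
        # sentiment: (batch, seq_len, dim)     -- e.g., RoBERTa token embeddings
        fused, _ = self.cross_attn(query=visual, key=sentiment, value=sentiment)
        # Residual connection keeps the original visual signal intact.
        return self.norm(visual + fused)

if __name__ == "__main__":
    # Random tensors stand in for ViT-Base (196 patches + [CLS]) and
    # RoBERTa-base outputs; both use a 768-dimensional hidden size.
    vis = torch.randn(2, 197, 768)
    sen = torch.randn(2, 32, 768)
    fused = CrossAttentionFusion()(vis, sen)
    print(fused.shape)  # torch.Size([2, 197, 768])
```

The fused tokens would then condition a caption decoder; the other fusion variants (Concatenation, Attention, VSCA) differ mainly in how the two token streams are combined before decoding.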
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.