Open Access

ARTICLE

Optimizing Sentiment Integration in Image Captioning Using Transformer-Based Fusion Strategies

Komal Rani Narejo1, Hongying Zan1,*, Kheem Parkash Dharmani2, Orken Mamyrbayev3,*, Ainur Akhmediyarova4, Zhibek Alibiyeva4, Janna Alimkulova5

1 School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China
2 School of Computing, National University of Computer and Emerging Sciences, Islamabad, 04403, Pakistan
3 Institute of Information and Computational Technologies, Almaty, 050010, Kazakhstan
4 Institute of Automation and Information Technologies, Satbayev University, Almaty, 050013, Kazakhstan
5 Turan University, Chaikina St 12a, Almaty, 050020, Kazakhstan

* Corresponding Authors: Hongying Zan; Orken Mamyrbayev

Computers, Materials & Continua 2025, 84(2), 3407-3429. https://doi.org/10.32604/cmc.2025.065872

Abstract

While automatic image captioning systems have made notable progress in recent years, generating captions that fully convey sentiment remains a considerable challenge. Although existing models achieve strong performance in visual recognition and factual description, they often fail to account for the emotional context that is naturally present in human-generated captions. To address this gap, we propose the Sentiment-Driven Caption Generator (SDCG), which combines transformer-based visual and textual processing with multi-level fusion. RoBERTa extracts sentiment from the textual input, while the Vision Transformer (ViT) handles visual features. These features are fused using several approaches, including Concatenation, Attention, Visual-Sentiment Co-Attention (VSCA), and Cross-Attention. Our experiments demonstrate that SDCG significantly outperforms baseline models in sentiment accuracy, reaching 94.52% against 82.01% for the Generalized Image Transformer (GIT) and 83.07% for Bootstrapping Language-Image Pre-training (BLIP), while also improving BLEU and ROUGE-L scores. More importantly, the generated captions are more natural: they incorporate emotional cues and contextual awareness, making them resemble captions written by a human.
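To make the fusion idea concrete, the sketch below shows, in PyTorch with Hugging Face Transformers, one plausible way to realize the Cross-Attention variant named in the abstract: ViT patch embeddings act as queries that attend to RoBERTa token embeddings. This is a minimal illustration under assumed settings (base-size checkpoints, 768-dimensional features, 8 attention heads); the class name SentimentCrossFusion is hypothetical and the code is not the authors' implementation.

import torch.nn as nn
from transformers import ViTModel, RobertaModel

class SentimentCrossFusion(nn.Module):
    """Hypothetical cross-attention fusion of ViT visual features
    and RoBERTa sentiment features (illustrative sketch only)."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.roberta = RobertaModel.from_pretrained("roberta-base")
        # Visual tokens attend to sentiment-bearing text tokens.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pixel_values, input_ids, attention_mask):
        vis = self.vit(pixel_values=pixel_values).last_hidden_state  # (B, P, 768)
        txt = self.roberta(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state  # (B, T, 768)
        # Queries come from image patches; keys/values from text tokens.
        # Padding positions in the text are masked out of the attention.
        fused, _ = self.cross_attn(query=vis, key=txt, value=txt,
                                   key_padding_mask=~attention_mask.bool())
        # Residual connection plus normalization of the fused sequence.
        return self.norm(vis + fused)

# Usage (hypothetical shapes): fused = SentimentCrossFusion()(pixel_values,
#                                                             input_ids,
#                                                             attention_mask)

In a full captioning pipeline, the fused token sequence would feed a caption decoder; the other strategies listed in the abstract (Concatenation, Attention, VSCA) would replace the cross-attention block with their respective fusion operators.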

Keywords

Image captioning; sentiment analysis; deep learning; fusion methods

Cite This Article

APA Style
Narejo, K.R., Zan, H., Dharmani, K.P., Mamyrbayev, O., Akhmediyarova, A. et al. (2025). Optimizing Sentiment Integration in Image Captioning Using Transformer-Based Fusion Strategies. Computers, Materials & Continua, 84(2), 3407–3429. https://doi.org/10.32604/cmc.2025.065872
Vancouver Style
Narejo KR, Zan H, Dharmani KP, Mamyrbayev O, Akhmediyarova A, Alibiyeva Z, et al. Optimizing Sentiment Integration in Image Captioning Using Transformer-Based Fusion Strategies. Comput Mater Contin. 2025;84(2):3407–3429. https://doi.org/10.32604/cmc.2025.065872
IEEE Style
K. R. Narejo et al., “Optimizing Sentiment Integration in Image Captioning Using Transformer-Based Fusion Strategies,” Comput. Mater. Contin., vol. 84, no. 2, pp. 3407–3429, 2025. https://doi.org/10.32604/cmc.2025.065872



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.