Optimizing Sentiment Integration in Image Captioning Using Transformer-Based Fusion Strategies
1 School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, 450001, China
2 School of Computing, National University of Computer and Emerging Sciences, Islamabad, 04403, Pakistan
3 Institute of Information and Computational Technologies, Almaty, 050010, Kazakhstan
4 Institute of Automation and Information Technologies, Satbayev University, Almaty, 050013, Kazakhstan
5 Turan University, Chaikina St 12a, Almaty, 050020, Kazakhstan
* Corresponding Authors: Hongying Zan; Orken Mamyrbayev
Computers, Materials & Continua 2025, 84(2), 3407-3429. https://doi.org/10.32604/cmc.2025.065872
Received 24 March 2025; Accepted 09 May 2025; Issue published 03 July 2025
Abstract
While automatic image captioning systems have made notable progress in recent years, generating captions that fully convey sentiment remains a considerable challenge. Although existing models achieve strong performance in visual recognition and factual description, they often fail to account for the emotional context that is naturally present in human-written captions. To address this gap, we propose the Sentiment-Driven Caption Generator (SDCG), which combines transformer-based visual and textual processing with multi-level fusion. RoBERTa extracts sentiment from the textual input, while visual features are handled by the Vision Transformer (ViT). These features are combined using several fusion approaches: Concatenation, Attention, Visual-Sentiment Co-Attention (VSCA), and Cross-Attention. Our experiments show that SDCG reaches 94.52% sentiment accuracy, significantly outperforming baseline models such as the Generalized Image Transformer (GIT, 82.01%) and Bootstrapping Language-Image Pre-training (BLIP, 83.07%), while also improving BLEU and ROUGE-L scores. More importantly, the generated captions are more natural: they incorporate emotional cues and contextual awareness, making them resemble captions written by a human.
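To make the fusion stage concrete, the sketch below shows one plausible reading of the Cross-Attention variant described in the abstract: ViT patch tokens act as queries over RoBERTa sentiment tokens, with a residual connection and layer normalization. This is a minimal illustrative sketch, not the paper's implementation; the feature dimensions, the residual/normalization layout, and the `CrossAttentionFusion` module name are our assumptions.

```python
# Illustrative sketch of cross-attention fusion between ViT visual tokens
# and RoBERTa sentiment tokens. Dimensions and layout are assumptions;
# the actual SDCG architecture may differ.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse ViT visual tokens with RoBERTa sentiment tokens via cross-attention."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Visual tokens attend to sentiment tokens (queries come from the image side).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, sentiment: torch.Tensor) -> torch.Tensor:
        # visual:    (batch, num_patches, dim) -- e.g., ViT patch embeddings
        # sentiment: (batch, seq_len, dim)     -- e.g., RoBERTa token embeddings
        fused, _ = self.cross_attn(query=visual, key=sentiment, value=sentiment)
        # Residual connection keeps the original visual signal intact.
        return self.norm(visual + fused)

if __name__ == "__main__":
    # Random tensors stand in for ViT-Base (196 patches + [CLS]) and
    # RoBERTa-base outputs; both use a 768-dimensional hidden size.
    vis = torch.randn(2, 197, 768)
    sen = torch.randn(2, 32, 768)
    fused = CrossAttentionFusion()(vis, sen)
    print(fused.shape)  # torch.Size([2, 197, 768])
```

The fused tokens would then condition a caption decoder; the other fusion variants (Concatenation, Attention, VSCA) differ mainly in how the two token streams are combined before decoding.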
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.