Multi-Branch Cross-Modal Cross-Attention for Image–Text Multimodal Sentiment Classification

Xinshan Huang; Zirui Pei; Chaohong Tan; Zuqiang Meng

doi:10.32604/cmc.2026.081626

Open Access

ARTICLE

Xinshan Huang¹, Zirui Pei¹, Chaohong Tan², Zuqiang Meng¹,*

1 College of Computer, Electronics and Information, Guangxi University, Nanning, China
2 Guangxi Key Laboratory of Digital Infrastructure, Guangxi Zhuang Autonomous Region Information Center, Nanning, China

* Corresponding Author: Zuqiang Meng.

(This article belongs to the Special Issue: Deep Learning for Emotion Recognition)

Computers, Materials & Continua 2026, 88(2), 90 https://doi.org/10.32604/cmc.2026.081626

Received 05 March 2026; Accepted 13 May 2026; Issue published 15 June 2026

Abstract

Multimodal Sentiment Analysis (MSA) plays an important role in understanding social media content; however, existing methods often struggle with the heterogeneity of images and text and with the complex interactions between them. The resulting challenges include inter-modal information asymmetry, insufficient feature fusion, and noise interference, which together limit robustness and accuracy. To address these issues, we propose a multimodal sentiment classification model termed Multi-Branch Cross-Modal Cross-Attention Gating (MB-CMCAG). The model first uses a Transformer-based image caption generation module to convert raw images into semantically rich auxiliary textual descriptions, which complement the original text and form paired textual inputs with enhanced visual semantics. To capture multi-source features, MB-CMCAG adopts a dual-branch feature extraction architecture: the visual branch encodes images using a Vision Transformer (ViT), while the textual branch encodes text with Bidirectional Encoder Representations from Transformers (BERT); a Contrastive Language-Image Pre-training (CLIP) model is additionally introduced for joint image–text feature extraction. To exploit cross-modal correlations and enable hierarchical fusion, we construct a cross-modal attention module that supports bidirectional information flow, from image to text and from text to image. Building on this module, a cross-modal gating mechanism selectively regulates the transmission and aggregation of features from the different sources, thereby improving noise suppression and sentiment sensitivity. Experimental results on the public MVSA-Single and MVSA-Multiple datasets show that MB-CMCAG achieves accuracies of 76.38% and 73.87%, respectively, outperforming existing baselines in image–text multimodal sentiment classification.
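
The bidirectional cross-attention and gated fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no equations, so the code assumes plain scaled dot-product attention without learned projections, mean pooling, and a sigmoid gate of the form g = σ(W[a; b] + bias). All function names, dimensions, and the random stand-in features (in place of ViT, BERT, or CLIP outputs) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention: queries come from one modality,
    keys and values from the other (assumption: no learned projections)."""
    dim = len(queries[0])
    scores = matmul(queries, transpose(context))
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, context)


def mean_pool(tokens):
    """Average a sequence of vectors into a single vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(a, b, w_gate, b_gate):
    """Assumed gate: g = sigmoid(W [a; b] + bias); output = g * a + (1 - g) * b."""
    concat = a + b
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + bias)
        for row, bias in zip(w_gate, b_gate)
    ]
    return [g * x + (1.0 - g) * y for g, x, y in zip(gate, a, b)]


random.seed(0)
dim = 4
# Random stand-ins for encoder outputs: 5 text tokens and 3 image patches.
text_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
image_patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

# Image-to-text flow: text tokens query the image patches, and vice versa.
text_attended = cross_attention(text_tokens, image_patches)
image_attended = cross_attention(image_patches, text_tokens)

# Hypothetical gate parameters; in a trained model these would be learned.
w_gate = [[random.gauss(0, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
b_gate = [0.0] * dim

pooled_text = mean_pool(text_attended)
pooled_image = mean_pool(image_attended)
fused = gated_fusion(pooled_text, pooled_image, w_gate, b_gate)
print(len(fused))
```

Because the gate lies strictly between 0 and 1, each fused feature is a convex combination of the two attended representations, which is one simple way a gate can down-weight a noisy source per dimension.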

Keywords

Multimodal sentiment analysis; cross-modal cross-attention; cross-modal gated fusion

Cite This Article

APA Style

Huang, X., Pei, Z., Tan, C., Meng, Z. (2026). Multi-Branch Cross-Modal Cross-Attention for Image–Text Multimodal Sentiment Classification. Computers, Materials & Continua, 88(2), 90. https://doi.org/10.32604/cmc.2026.081626

Vancouver Style

Huang X, Pei Z, Tan C, Meng Z. Multi-Branch Cross-Modal Cross-Attention for Image–Text Multimodal Sentiment Classification. Comput Mater Contin. 2026;88(2):90. https://doi.org/10.32604/cmc.2026.081626

IEEE Style

X. Huang, Z. Pei, C. Tan, and Z. Meng, “Multi-Branch Cross-Modal Cross-Attention for Image–Text Multimodal Sentiment Classification,” Comput. Mater. Contin., vol. 88, no. 2, pp. 90, 2026. https://doi.org/10.32604/cmc.2026.081626

Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
