Open Access
ARTICLE
Multi-Branch Cross-Modal Cross-Attention for Image–Text Multimodal Sentiment Classification
1 College of Computer, Electronics and Information, Guangxi University, Nanning, China
2 Guangxi Key Laboratory of Digital Infrastructure, Guangxi Zhuang Autonomous Region Information Center, Nanning, China
* Corresponding Author: Zuqiang Meng. Email:
(This article belongs to the Special Issue: Deep Learning for Emotion Recognition)
Computers, Materials & Continua 2026, 88(2), 90 https://doi.org/10.32604/cmc.2026.081626
Received 05 March 2026; Accepted 13 May 2026; Issue published 15 June 2026
Abstract
Multimodal Sentiment Analysis (MSA) plays an important role in understanding social media content; however, existing methods often struggle with the heterogeneity and complex interactions between images and text. These challenges include inter-modal information asymmetry, insufficient feature fusion, and noise interference, which collectively limit robustness and accuracy. To address these issues, we propose a multimodal sentiment classification model termed Multi-Branch Cross-Modal Cross-Attention Gating (MB-CMCAG). The model first incorporates a Transformer-based image caption generation module to convert raw images into semantically rich auxiliary textual descriptions, which complement the original text and form paired textual inputs with enhanced visual semantics. To capture multi-source features, MB-CMCAG adopts a dual-branch feature extraction architecture: the visual branch encodes images using a Vision Transformer (ViT), while the textual branch encodes text with Bidirectional Encoder Representations from Transformers (BERT); a Contrastive Language-Image Pre-training (CLIP) model is also introduced for joint image–text feature extraction. To exploit cross-modal correlations and enable hierarchical fusion, we construct a cross-modal attention module that supports bidirectional information flow from image to text and from text to image. Building on this, a cross-modal gated mechanism is introduced to selectively regulate the transmission and aggregation of features from different sources, thereby improving noise suppression and sentiment sensitivity. Experimental results on the public MVSA-Single and MVSA-Multiple datasets show that MB-CMCAG achieves accuracies of 76.38% and 73.87%, respectively, outperforming existing baselines by a clear margin in image–text multimodal sentiment classification.Keywords
Cite This Article
Copyright © 2026 The Author(s). Published by Tech Science Press.This work is licensed under a Creative Commons Attribution 4.0 International License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Submit a Paper
Propose a Special lssue
View Full Text
Download PDF
Downloads
Citation Tools