Open Access

ARTICLE

Multi-Branch Cross-Modal Cross-Attention for Image–Text Multimodal Sentiment Classification

Xinshan Huang1, Zirui Pei1, Chaohong Tan2, Zuqiang Meng1,*
1 College of Computer, Electronics and Information, Guangxi University, Nanning, China
2 Guangxi Key Laboratory of Digital Infrastructure, Guangxi Zhuang Autonomous Region Information Center, Nanning, China
* Corresponding Author: Zuqiang Meng
(This article belongs to the Special Issue: Deep Learning for Emotion Recognition)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.081626

Received 05 March 2026; Accepted 13 May 2026; Published online 03 June 2026

Abstract

Multimodal Sentiment Analysis (MSA) plays an important role in understanding social media content; however, existing methods often struggle with the heterogeneity and complex interactions between images and text. These challenges include inter-modal information asymmetry, insufficient feature fusion, and noise interference, which collectively limit robustness and accuracy. To address these issues, we propose a multimodal sentiment classification model termed Multi-Branch Cross-Modal Cross-Attention Gating (MB-CMCAG). The model first incorporates a Transformer-based image caption generation module to convert raw images into semantically rich auxiliary textual descriptions, which complement the original text and form paired textual inputs with enhanced visual semantics. To capture multi-source features, MB-CMCAG adopts a dual-branch feature extraction architecture: the visual branch encodes images using a Vision Transformer (ViT), while the textual branch encodes text with Bidirectional Encoder Representations from Transformers (BERT); a Contrastive Language-Image Pre-training (CLIP) model is also introduced for joint image–text feature extraction. To exploit cross-modal correlations and enable hierarchical fusion, we construct a cross-modal attention module that supports bidirectional information flow from image to text and from text to image. Building on this, a cross-modal gated mechanism is introduced to selectively regulate the transmission and aggregation of features from different sources, thereby improving noise suppression and sentiment sensitivity. Experimental results on the public MVSA-Single and MVSA-Multiple datasets show that MB-CMCAG achieves accuracies of 76.38% and 73.87%, respectively, outperforming existing baselines by a clear margin in image–text multimodal sentiment classification.
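The bidirectional cross-attention and gating steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random arrays stand in for BERT token features and ViT patch features, the learned query/key/value projections are omitted for brevity, and the mean pooling and the sigmoid gate formula are assumptions chosen as one common way to realize gated fusion. The names `cross_attention` and `gated_fusion` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Scaled dot-product attention: each query row attends over context rows.

    Assumption: no learned projections; queries and context share dimension d.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)  # shape (n_queries, n_context)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ context, weights          # attended features, attention map


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(a, b, w_gate, b_gate):
    """Mix two same-shaped vectors with an elementwise gate g in (0, 1).

    Assumption: g = sigmoid([a; b] W + bias), fused = g * a + (1 - g) * b.
    """
    g = sigmoid(np.concatenate([a, b]) @ w_gate + b_gate)
    return g * a + (1.0 - g) * b, g


rng = np.random.default_rng(0)
d = 8
text_tokens = rng.normal(size=(5, d))    # stand-in for BERT token features
image_patches = rng.normal(size=(7, d))  # stand-in for ViT patch features

# Image-to-text flow: text tokens query the image patches.
text_attended, text_weights = cross_attention(text_tokens, image_patches)
# Text-to-image flow: image patches query the text tokens.
image_attended, image_weights = cross_attention(image_patches, text_tokens)

# Assumption: mean pooling reduces each sequence to a single vector.
text_vec = text_attended.mean(axis=0)
image_vec = image_attended.mean(axis=0)

# Untrained gate parameters, for illustration only.
w_gate = rng.normal(size=(2 * d, d)) * 0.1
b_gate = np.zeros(d)
fused, gate = gated_fusion(text_vec, image_vec, w_gate, b_gate)

print(fused.shape, gate.min(), gate.max())
```

Because the gate lies strictly between 0 and 1, each fused coordinate is a convex combination of the two attended vectors, which is the sense in which a gate can down-weight a noisy source. In a trained model the gate parameters would be learned jointly with a classifier head.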

Keywords

Multimodal sentiment analysis; cross-modal cross-attention; cross-modal gated fusion