Multi-Branch Cross-Modal Cross-Attention for Image–Text Multimodal Sentiment Classification

Xinshan Huang¹, Zirui Pei¹, Chaohong Tan², Zuqiang Meng¹,*
1 College of Computer, Electronics and Information, Guangxi University, Nanning, China
2 Guangxi Key Laboratory of Digital Infrastructure, Guangxi Zhuang Autonomous Region Information Center, Nanning, China
* Corresponding Author: Zuqiang Meng. Email: email
(This article belongs to the Special Issue: Deep Learning for Emotion Recognition)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.081626

Received 05 March 2026; Accepted 13 May 2026; Published online 03 June 2026


Abstract

Multimodal Sentiment Analysis (MSA) plays an important role in understanding social media content; however, existing methods often struggle with the heterogeneity and complex interactions between images and text. These challenges include inter-modal information asymmetry, insufficient feature fusion, and noise interference, which collectively limit robustness and accuracy. To address these issues, we propose a multimodal sentiment classification model termed Multi-Branch Cross-Modal Cross-Attention Gating (MB-CMCAG). The model first incorporates a Transformer-based image caption generation module to convert raw images into semantically rich auxiliary textual descriptions, which complement the original text and form paired textual inputs with enhanced visual semantics. To capture multi-source features, MB-CMCAG adopts a dual-branch feature extraction architecture: the visual branch encodes images using a Vision Transformer (ViT), while the textual branch encodes text with Bidirectional Encoder Representations from Transformers (BERT); a Contrastive Language-Image Pre-training (CLIP) model is also introduced for joint image–text feature extraction. To exploit cross-modal correlations and enable hierarchical fusion, we construct a cross-modal attention module that supports bidirectional information flow from image to text and from text to image. Building on this, a cross-modal gated mechanism is introduced to selectively regulate the transmission and aggregation of features from different sources, thereby improving noise suppression and sentiment sensitivity. Experimental results on the public MVSA-Single and MVSA-Multiple datasets show that MB-CMCAG achieves accuracies of 76.38% and 73.87%, respectively, outperforming existing baselines by a clear margin in image–text multimodal sentiment classification.
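The bidirectional cross-attention and gated fusion steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the feature dimension, the random stand-ins for BERT token and ViT patch features, the single-head scaled dot-product attention, the mean pooling, and the sigmoid gate over concatenated features are all assumptions chosen for illustration, and every function and variable name is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_attention(queries, context, w_q, w_k, w_v):
    # Queries come from one modality; keys and values come from the other.
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # one distribution per query row
    return weights @ v, weights

def gated_fusion(a, b, w_g, b_g):
    # Assumed gate form: a sigmoid over the concatenated inputs decides,
    # per dimension, how much of `a` versus `b` passes through.
    gate = sigmoid(np.concatenate([a, b], axis=-1) @ w_g + b_g)
    return gate * a + (1.0 - gate) * b, gate

rng = np.random.default_rng(0)
d = 8  # hypothetical shared feature dimension

def init(shape):
    # Random weights stand in for learned parameters.
    return rng.normal(scale=shape[0] ** -0.5, size=shape)

text_tokens = rng.normal(size=(5, d))    # stand-in for BERT token features
image_patches = rng.normal(size=(7, d))  # stand-in for ViT patch features

# Text-to-image direction: text queries attend over image patches.
t2i, t2i_weights = cross_attention(
    text_tokens, image_patches, init((d, d)), init((d, d)), init((d, d)))
# Image-to-text direction: image queries attend over text tokens.
i2t, i2t_weights = cross_attention(
    image_patches, text_tokens, init((d, d)), init((d, d)), init((d, d)))

# Mean-pool each direction to a single vector, then fuse with the gate.
text_vec = t2i.mean(axis=0)
image_vec = i2t.mean(axis=0)
fused, gate = gated_fusion(text_vec, image_vec, init((2 * d, d)), np.zeros(d))

print(fused.shape, gate.round(2))
```

Because the gate lies strictly between 0 and 1, each fused dimension is a convex combination of the two attended vectors, which is one simple way a gate can down-weight a noisy source. In a trained model the gate weights would be learned, and the fused vector would feed a sentiment classifier.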

Keywords

Multimodal sentiment analysis; cross-modal cross-attention; cross-modal gated fusion


