
Open Access

ARTICLE

Hierarchical Joint Cross-Modal Attention and Gating Mechanism for Multimodal Sentiment Analysis

Shuqiu Tan, Chunsheng Tan, Yahui Liu*
School of Computer Science and Engineering, Chongqing University of Technology, Chongqing, China
* Corresponding Author: Yahui Liu. Email: liuyh@cqut.edu.cn
(This article belongs to the Special Issue: Sentiment Analysis for Social Media Data: Lexicon-Based and Large Language Model Approaches)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.077982

Received 21 December 2025; Accepted 06 March 2026; Published online 21 April 2026

Abstract

Multimodal sentiment analysis aims to identify emotional states accurately by jointly exploiting information from multiple sources such as text, audio, and visual data. However, semantic heterogeneity and temporal misalignment between modalities limit the effectiveness of feature fusion. To address this issue, this paper proposes a hierarchical joint cross-modal attention and gating mechanism (HJCAG) for multimodal sentiment analysis. The method introduces a hierarchical structure that divides modal interactions into bimodal and trimodal layers to progressively model the semantic relevance between modalities. First, deep features are extracted from the text, audio, and visual modalities using pre-trained models to obtain high-dimensional representations of semantics, speech, and facial expressions, which are then aligned to a unified feature space. Second, a joint cross-modal attention module is designed at the bimodal and trimodal levels, computing cross-attention weights from the correlation between the joint feature representation and each individual modal representation; this explicit modeling of multimodal interactions and semantic alignment fully exploits the complementary information of the different modalities. Furthermore, a gating mechanism is introduced to adaptively control the contribution weight of each modal feature, reducing interference from redundant information and improving the discriminability of the fused representation. Finally, the fused global features are fed into an emotion classifier to identify emotional states.
The proposed method achieves accuracies of 75.47 ± 0.22% and 69.25 ± 0.37%, and weighted F1 scores of 76.84 ± 0.45% and 68.97 ± 0.41%, on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Multimodal Emotion Lines Dataset (MELD), respectively. These results outperform mainstream multimodal baseline methods and verify the effectiveness and robustness of the proposed method for multimodal feature fusion and emotion recognition.
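The cross-modal attention and gated fusion steps described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the joint representation (here a simple average of the three modalities), the sigmoid gate parameterization, and all dimensions are assumptions made for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(joint, modality):
    # joint: (T, d) joint feature representation used as the query;
    # modality: (T, d) features of one modality used as key/value.
    d = joint.shape[-1]
    scores = joint @ modality.T / np.sqrt(d)   # (T, T) correlation scores
    weights = softmax(scores, axis=-1)         # rows sum to 1
    return weights @ modality                  # attended modality features

def gated_fusion(attended_list, Wg, bg):
    # Concatenate attended features and compute one sigmoid gate per
    # modality, then form a gate-weighted sum (assumed parameterization).
    concat = np.concatenate(attended_list, axis=-1)        # (T, 3d)
    gates = 1.0 / (1.0 + np.exp(-(concat @ Wg + bg)))      # (T, 3)
    return sum(g[:, None] * a
               for g, a in zip(np.moveaxis(gates, -1, 0), attended_list))

# Toy usage with random features standing in for pre-trained encodings.
rng = np.random.default_rng(0)
T, d = 4, 8
text, audio, visual = (rng.standard_normal((T, d)) for _ in range(3))
joint = (text + audio + visual) / 3            # assumed joint representation
attended = [cross_modal_attention(joint, m) for m in (text, audio, visual)]
Wg = rng.standard_normal((3 * d, 3)) * 0.1
bg = np.zeros(3)
fused = gated_fusion(attended, Wg, bg)         # (T, d) fused global features
```

The fused `(T, d)` output would then be pooled and passed to the emotion classifier; in the paper this interaction is additionally applied hierarchically at bimodal and trimodal levels.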

Keywords

Multimodal sentiment analysis; cross-modal attention; hierarchical structure; joint features; gating mechanism