
Open Access

ARTICLE

SYMPHONIA: Enhanced Multimodal Emotion Recognition with Dual-Branch Dynamic Attention and Hierarchical Adaptive Fusion

Akmalbek Abdusalomov1, Mukhriddin Mukhiddinov2,3, Kamola Abdurashidova2, Alpamis Kutlimuratov4, Avazjon Marakhimov5, Kuanishbay Seytnazarov6, Young-Im Cho1,*
1 Department of Computer Engineering, Gachon University, Sujeong-Gu, Seongnam-si, Gyeonggi-Do, Republic of Korea
2 Department of Computer Systems, Tashkent University of Information Technologies Named after Muhammad Al-Khwarizmi, Tashkent, Uzbekistan
3 Department of Industrial Management and Digital Technologies, Nordic International University, Tashkent, Uzbekistan
4 Department of Applied Informatics, Kimyo International University in Tashkent, Tashkent, Uzbekistan
5 Department of Information Processing and Management Systems, Tashkent State Technical University, Tashkent, Uzbekistan
6 Department of General Education Disciplines and Distance Education, Nukus State Pedagogical Institute Named after Ajiniyaz, Nukus, Uzbekistan
* Corresponding Author: Young-Im Cho. Email: email
(This article belongs to the Special Issue: Deep Learning for Emotion Recognition)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.077057

Received 01 December 2025; Accepted 03 March 2026; Published online 20 April 2026

Abstract

Human emotions are intricate and difficult to decipher across modalities. Existing methods frequently employ rigid fusion strategies that ignore the dynamic, context-sensitive character of emotional expression in visual and textual channels. This paper presents SYMPHONIA (Synchronizing Facial and Textual Modalities for Emotion Understanding), an architecture designed to capture and integrate emotional signals from facial expressions and language while remaining sensitive to context and cross-modal interactions. SYMPHONIA comprises two branches: a Facial Emotion Branch built on Vision Transformers and facial landmarks, and a Textual Emotion Branch built on RoBERTa embeddings and graph-based reasoning. The branches are linked by a Dual-Branch Dynamic Attention Mechanism and a Hierarchical Adaptive Fusion Module. SYMPHONIA outperformed state-of-the-art models on four datasets: IEMOCAP, MELD, CMU-MOSI, and CMU-MOSEI. On IEMOCAP it achieved 80.9% accuracy and an 80.1% F1-score, surpassing DualGATs (74.8%) and EmoCLIP (75.3%); on MELD it reached 74.2% accuracy and a 73.5% F1-score. For sentiment prediction it exceeded competing models with Pearson correlations of 0.86 on MOSI and 0.83 on MOSEI. Cross-dataset experiments demonstrated generalization: trained on IEMOCAP and tested on MELD, SYMPHONIA attained 66.9% accuracy, exceeding all baselines. These results indicate that SYMPHONIA recognizes emotions and analyzes sentiment robustly across diverse settings.
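As a loose illustration of the adaptive-fusion idea described in the abstract — not the authors' implementation, whose details are in the full paper — the sketch below weights a facial embedding and a textual embedding per sample via softmax-normalized gate scores, so the dominant modality can shift with the input. The gate vectors `w_face` and `w_text` are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_fuse(face_emb, text_emb, w_face, w_text):
    """Fuse two modality embeddings with input-dependent weights.

    Each embedding is scored against its own (hypothetical) learned
    gate vector; the softmax of the two scores gives per-sample
    modality weights that sum to 1.
    """
    scores = np.array([face_emb @ w_face, text_emb @ w_text])
    alpha = softmax(scores)                      # modality weights
    fused = alpha[0] * face_emb + alpha[1] * text_emb
    return fused, alpha
```

In a trained model the gate scores would come from attention layers rather than fixed dot products, but the principle is the same: fusion weights are computed from the inputs instead of being hard-coded.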

Keywords

Multimodal emotion recognition; RoBERTa; cross-modal attention; graph neural networks; contrastive learning; adaptive fusion; temporal modeling; affective computing; context-aware representation