Open Access

ARTICLE

SYMPHONIA–Enhanced Multimodal Emotion Recognition with Dual-Branch Dynamic Attention and Hierarchical Adaptive Fusion

Akmalbek Abdusalomov1, Mukhriddin Mukhiddinov2,3, Kamola Abdurashidova2, Alpamis Kutlimuratov4, Avazjon Marakhimov5, Kuanishbay Seytnazarov6, Young-Im Cho1,*

1 Department of Computer Engineering, Gachon University Sujeong-Gu, Seongnam-si, Gyeonggi-Do, Republic of Korea
2 Department of Computer Systems, Tashkent University of Information Technologies Named after Muhammad Al-Khwarizmi, Tashkent, Uzbekistan
3 Department of Industrial Management and Digital Technologies, Nordic International University, Tashkent, Uzbekistan
4 Department of Applied Informatics, Kimyo International University in Tashkent, Tashkent, Uzbekistan
5 Department of Information Processing and Management Systems, Tashkent State Technical University, Tashkent, Uzbekistan
6 Department of General Education Disciplines and Distance Education, Nukus State Pedagogical Institute Named after Ajiniyaz, Nukus, Uzbekistan

* Corresponding Author: Young-Im Cho.

(This article belongs to the Special Issue: Deep Learning for Emotion Recognition)

Computers, Materials & Continua 2026, 88(1), 74 https://doi.org/10.32604/cmc.2026.077057

Abstract

Human emotions are intricate and difficult to decipher across modalities. Current methodologies frequently employ inflexible fusion strategies that do not account for the dynamic, context-sensitive character of emotional expression in visual and textual media. This paper presents SYMPHONIA (Synchronizing Facial and Textual Modalities for Emotion Understanding), an architecture engineered to capture and integrate emotional signals from facial expressions and language while remaining attuned to context and cross-modal interactions. SYMPHONIA comprises two branches: a Facial Emotion Branch built on Vision Transformers and facial landmarks, and a Textual Emotion Branch built on RoBERTa embeddings and graph-based reasoning. These branches are connected through a Dual-Branch Dynamic Attention Mechanism and a Hierarchical Adaptive Fusion Module. SYMPHONIA outperformed state-of-the-art models on four datasets: IEMOCAP, MELD, CMU-MOSI, and CMU-MOSEI. It achieved 80.9% accuracy and an 80.1% F1-score on IEMOCAP, surpassing DualGATs (74.8%) and EmoCLIP (75.3%), and reached 74.2% accuracy with a 73.5% F1-score on MELD. For sentiment prediction, it attained Pearson correlations of 0.86 on MOSI and 0.83 on MOSEI, exceeding its competitors. Cross-dataset experiments demonstrated its generalization ability, with 66.9% accuracy when trained on IEMOCAP and tested on MELD, better than all baselines. These results show that SYMPHONIA recognizes emotions and analyzes sentiment effectively across diverse situations, demonstrating its adaptability and robustness in different settings.

Keywords

Multimodal emotion recognition; RoBERTa; cross-modal attention; graph neural networks; contrastive learning; adaptive fusion; temporal modeling; affective computing; context-aware representation

1  Introduction

Emotion recognition systems represent a fundamental component of human–computer interaction (HCI) [1] and have significant applications in affective computing [2], virtual agents [3], behavioral analysis [4], and mental health diagnostics [5]. Earlier approaches primarily focused on facial expression recognition [6,7], textual analysis [8], and speech analysis [9], which collectively laid the foundation for emotion recognition research [10,11]. However, these unimodal systems often lacked robustness to noise, ambiguity, and cross-modal biases [12]. In contrast, multimodal emotion recognition (MER) seeks to extract and integrate emotional cues from heterogeneous yet complementary modalities, thereby improving the reliability of emotion inference [13]. Despite the advances in MER, many existing models still rely on static fusion strategies or simplistic cross-modal interaction mechanisms [14]. Such approaches fail to capture the inherently dynamic and context-sensitive interplay between modalities. For instance, textual content may describe a smile as a joyful expression while simultaneously masking sarcasm, the interpretation of which may depend on preceding conversational context [15]. Furthermore, numerous multimodal feature integration methods treat features from different modalities as equally important regardless of contextual relevance. This assumption limits the model’s ability to adaptively emphasize the most informative emotional indicators specific to a given scenario [16].

To address these limitations, we develop SYMPHONIA, a new multimodal framework capable of dynamically synthesizing facial-expression and textual emotion signals. The architecture consists of two modality-specific branches: a Facial Emotion Branch that employs Vision Transformers (ViT) with landmark-guided attention and LSTM temporal modeling, and a Textual Emotion Branch that enriches RoBERTa embeddings [16] with a Temporal Semantic Graph (TSG) constructed using a Graph Attention Network (GAT). The two branches are tightly integrated via a Dual-Branch Dynamic Attention Mechanism that allows cross-modal, context-sensitive interactions to control the influence of each modality on the other. A Hierarchical Adaptive Fusion Module then aligns and integrates multimodal features across multiple layers: it fuses features at different semantic levels, using adaptive gating and contrastive self-supervised learning to ensure precise and efficient fusion.

By combining classification and regression objectives, SYMPHONIA consistently surpasses current state-of-the-art models, achieving superior results across multiple datasets, including IEMOCAP, MELD, CMU-MOSI, and CMU-MOSEI. The framework significantly enhances both accuracy and long-term stability through its modular architecture and dynamic attention mechanisms. These improvements highlight the model’s adaptability and its sensitivity to emotional context. SYMPHONIA’s contribution reflects the evolution of affective computing, in which emotion recognition has advanced toward a more complex, interdisciplinary approach supported by flexible and adaptive algorithmic solutions. In doing so, SYMPHONIA provides a strong foundation for further research and for the development of technologies with a deeper understanding of human emotion.

2  Related Works

Integrating information from various sources, such as video recordings that capture both speech patterns and facial expressions, greatly enhances emotion recognition models. This integration allows such systems to gain a deeper understanding of human emotions, contributing to the growing interest in this field [17]. However, traditional approaches that rely on a single modality, such as visual [18] or linguistic features, often struggle with reliability and generalization, particularly in noisy or uncertain environments [19,20]. These limitations have shifted research toward multimodal approaches, which offer a more accurate and contextually nuanced interpretation of emotions [21]. Facial expression recognition has traditionally been carried out using Convolutional Neural Networks (CNNs), which extract spatial features from static images or video sequences [22,23]. More recent studies, however, have shown that Vision Transformers (ViTs) [24] are significantly more effective. ViTs employ self-attention mechanisms to capture long-range relationships across different regions of the face [25]. Focusing attention on key landmarks that represent emotional expressions has also been shown to improve performance [26], allowing models to concentrate on critical areas such as the mouth, eyes, and eyebrows [27] for more precise emotional analysis. The growth of textual emotion recognition systems has paralleled the development of natural language processing, especially with the introduction of pre-trained transformer models like BERT and RoBERTa [28].

These models provide rich contextual embeddings capable of capturing the subtle semantic and syntactic relationships important for emotion classification tasks [29,30]. Nevertheless, they do not relate tokens to one another flexibly or adequately, which is especially important for understanding more complex emotional semantics in language. In response, higher-order token interactions have been modeled by Graph Attention Networks (GATs), which use graph-based techniques to improve both the explainability and the discriminative power of textual emotion models [31]. As with many other tasks, MER faces challenges with multimodal fusion even after the individual modalities have been sufficiently advanced [32]. Strategies such as early fusion (feature-level concatenation) and late fusion (decision-level aggregation) often overlook relational hierarchies, the intricate interdependencies between modalities, and their relative importance across different emotional scenarios [33]. Modality interactions have been addressed by tensor fusion networks (TFN) [34] and memory fusion networks (MFN) [35]; these approaches, however, still suffer from static representation bottlenecks and inefficient computation [36,37].

To address these issues, we introduce the SYMPHONIA framework, which integrates the facial and textual modalities using (i) a dual-branch dynamic attention mechanism with bidirectional context-aware modulation and (ii) hierarchical adaptive fusion based on contrastive self-supervised learning. SYMPHONIA differs from prior approaches by dynamically emphasizing the most contextually relevant emotional interactions and by aligning representations at multiple semantic levels. Together, these developments raise the bar for emotion recognition in complex settings by strengthening the model’s interpretability, flexibility, and contextual understanding.

3  Proposed Model

We introduce the SYMPHONIA framework to address the limitations of current multimodal emotion recognition methods, such as their rigid fusion strategies and lack of context-aware adaptation. The architecture provides a more flexible and responsive approach by dynamically and hierarchically merging textual and facial emotional information, producing emotional embeddings that form a rich, context-aware basis for the adaptive multimodal fusion process that follows.

The SYMPHONIA architecture comprises several vital components that function synergistically to extract, align, and integrate features at the modality level, providing robust, transparent, and contextually intelligent emotion recognition. More specifically, SYMPHONIA consists of: (i) a Facial Emotion Branch that captures expressive visual dynamics through Vision Transformers with landmark-guided attention and LSTM temporal modeling; (ii) a Textual Emotion Branch that enriches RoBERTa embeddings with a temporal semantic graph whose token interactions are modeled by a GAT; (iii) a Dual-Branch Dynamic Attention Mechanism providing bidirectional, context-sensitive modulation between modalities; and (iv) a Hierarchical Adaptive Fusion Module that integrates multimodal features at several semantic levels with adaptive gating and contrastive self-supervised learning. Emotional categories are predicted from the fused representation by a classification layer built on a Transformer backbone. The subsequent subsections systematically describe each component of the SYMPHONIA framework, outlining its design rationale, architectural structure, and functional role within the complete emotion recognition pipeline (Fig. 1).


Figure 1: Overview of the SYMPHONIA framework. The architecture comprises two modality-specific branches: a facial emotion processing stream using ViT, facial landmark-guided attention, and LSTM-based temporal modeling; and a textual emotion branch leveraging RoBERTa embeddings enriched through a GAT. These branches are dynamically integrated via a Dual-Branch Dynamic Attention Mechanism and further aligned through a Hierarchical Adaptive Fusion Module operating at low, mid, and high semantic levels. The final output yields a precise emotion prediction, illustrated here with the classification result “Surprise: 96%”.

3.1 Facial Emotion Branch

The main aim of the facial emotion branch is to capture expressive and discriminative facial features and to model their temporal changes over time in order to interpret emotions accurately from visual data. We utilize a ViT for deep facial feature extraction because of its ability to capture long-range dependencies between regions of a face. The ViT divides the input face image into fixed-size patches, embeds each patch, and relates the patches to one another through self-attention. As illustrated in Fig. 2, the attention weight matrix generated by a Vision Transformer head highlights how different facial patches attend to one another, thereby emphasizing salient regions that contribute most significantly to emotion inference. This attention-driven mechanism enhances the model’s ability to capture subtle facial dynamics and contextual inter-patch relationships.


Figure 2: Attention weight matrix from vision transformer head in the facial emotion branch.

An input face image $I \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ denote height, width, and channels, respectively, is partitioned into a sequence of $N$ non-overlapping patches $\{I_p^i\}_{i=1}^{N}$, where each patch is of size $P \times P \times C$ and $N = \frac{HW}{P^2}$. Each patch $I_p^i$ is flattened and linearly projected into a $D$-dimensional embedding vector $e_i$:

$$e_i = \mathrm{LinearProj}\big(I_p^i\big), \qquad e_i \in \mathbb{R}^{D} \tag{1}$$

The ViT further prepends a learnable class embedding $e_{\mathrm{cls}}$ to the sequence of patch embeddings and adds positional embeddings $e_{\mathrm{pos}}$ to retain positional information:

$$Z_0 = [e_{\mathrm{cls}};\, e_1;\, e_2;\, \ldots;\, e_N] + e_{\mathrm{pos}} \tag{2}$$

The Transformer encoder consists of multiple stacked layers comprising multi-head self-attention (MHSA) and multi-layer perceptron (MLP) blocks, calculated as follows:

$$Z_l' = \mathrm{MHSA}\big(\mathrm{LN}(Z_{l-1})\big) + Z_{l-1} \tag{3}$$

$$Z_l = \mathrm{MLP}\big(\mathrm{LN}(Z_l')\big) + Z_l' \tag{4}$$

Here, $Z_l$ is the output of layer $l$, $\mathrm{LN}$ denotes layer normalization, and $l = 1, \ldots, L$ with $L$ being the total number of Transformer layers.
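As a minimal sketch of Eqs. (1)–(4), the following PyTorch snippet shows how patch embedding, the class token, positional embeddings, and a pre-norm Transformer encoder fit together. This is not the authors' released code; the image size, patch size, embedding width, and depth are illustrative assumptions.

```python
# Sketch of the ViT front end in Eqs. (1)-(4); all sizes are illustrative assumptions.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2           # N = HW / P^2
        # A strided convolution is equivalent to flatten + linear projection (Eq. 1)
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))      # e_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # e_pos

    def forward(self, x):                                          # x: (B, C, H, W)
        e = self.proj(x).flatten(2).transpose(1, 2)                # (B, N, D) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, e], dim=1) + self.pos_embed         # Z_0, Eq. (2)

# Standard pre-norm Transformer encoder layers realize Eqs. (3)-(4)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072,
                               batch_first=True, norm_first=True),
    num_layers=12,
)
z_L = encoder(PatchEmbed()(torch.randn(2, 3, 224, 224)))           # (B, N+1, D)
```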

To highlight pivotal emotional zones, we utilize Facial Landmark Attention (FLA). Utilizing facial landmark coordinates retrieved from a landmark detection model, we create attention masks that highlight the expression-bearing regions, including the eyes, eyebrows, and mouth:

$$A_{\mathrm{landmark}} = \sigma\big(\mathrm{Conv}(L)\big), \qquad A_{\mathrm{landmark}} \in [0,1]^{H \times W} \tag{5}$$

where $L$ represents a binary landmark heatmap, $\mathrm{Conv}$ is a convolutional operation, and $\sigma$ denotes the sigmoid activation function. The landmark attention mask is applied to each embedding vector $e_i$:

$$e_i' = e_i \odot \mathrm{Pooling}\big(A_{\mathrm{landmark}}\big) \tag{6}$$

This mechanism further enhances and reinforces the activation of emotionally expressive facial regions. To model temporal dependencies within facial embeddings over time, we employ an LSTM network. Considering the facial embedding sequence obtained from the ViT for each frame $t$:

$$E_t = [e_{\mathrm{cls}},\, e_1,\, e_2,\, \ldots,\, e_N], \qquad E_t \in \mathbb{R}^{(N+1) \times D} \tag{7}$$

We utilize the class embedding $e_{\mathrm{cls}}$, as it effectively summarizes the global emotional context: $x_t = e_{\mathrm{cls}}$, $x_t \in \mathbb{R}^{D}$. The LSTM processes this sequence temporally. Given an input sequence $X = [x_1, x_2, \ldots, x_T]$, the LSTM computes hidden states recursively as:

$$f_t = \sigma\big(W_f [h_{t-1}, x_t] + b_f\big) \tag{8}$$

$$i_t = \sigma\big(W_i [h_{t-1}, x_t] + b_i\big) \tag{9}$$

$$o_t = \sigma\big(W_o [h_{t-1}, x_t] + b_o\big) \tag{10}$$

$$\tilde{C}_t = \tanh\big(W_c [h_{t-1}, x_t] + b_c\big) \tag{11}$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{12}$$

$$h_t = o_t \odot \tanh(C_t) \tag{13}$$

where $f_t$, $i_t$, $o_t$ are the forget, input, and output gates, respectively, $C_t$ is the cell state at time step $t$, and $h_t$ is the hidden state. $W_f$, $W_i$, $W_o$, $W_c$ and $b_f$, $b_i$, $b_o$, $b_c$ are learnable parameters, and $\sigma$ denotes the sigmoid activation. The facial emotion branch generates temporally aware and expression-focused embeddings through the integration of ViT feature extraction, landmark-based attention, and LSTM temporal modeling. These embeddings are then used in the subsequent fusion and emotion classification stages of the SYMPHONIA model.
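A minimal PyTorch sketch of the temporal model in Eqs. (8)–(13), assuming the per-frame ViT class embeddings are stacked into a sequence. The hidden size, sequence length, and batch size are illustrative, not the paper's configuration.

```python
# LSTM over per-frame class embeddings x_t = e_cls (Eqs. 8-13); sizes are assumptions.
import torch
import torch.nn as nn

D, hidden = 768, 256
temporal_lstm = nn.LSTM(input_size=D, hidden_size=hidden, batch_first=True)

frames = torch.randn(4, 16, D)              # (batch, T frames, D) class embeddings
hidden_states, (h_T, c_T) = temporal_lstm(frames)
facial_embedding = h_T[-1]                  # last hidden state h_T, used downstream as the facial embedding
```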

Fig. 3 displays the changes in emotion activation scores over time and across sequential facial frames as processed by the LSTM module within the Facial Emotion Branch. Every point reflects the internal estimation of an emotion by the model at a given time step. The LSTM’s ability to model even the most minute changes over time is important for distinguishing dynamic from static emotion recognition.


Figure 3: Temporal emotion activation curve from the facial LSTM module.

3.2 Textual Emotion Branch

The textual emotion branch is designed to extract semantically meaningful and context-aware embeddings that capture the subtle emotional nuances present in textual input. By leveraging advanced language modeling and graph-based relational learning, the framework achieves deep semantic understanding and explicitly models emotional dependencies among textual tokens. Textual modality embeddings are generated using RoBERTa, a Transformer-based encoder pretrained on large-scale corpora and widely recognized for its strong contextual representation capabilities. Given an input sentence $S$ consisting of words (tokens) $w_1, w_2, \ldots, w_M$, where $M$ is the total number of tokens, RoBERTa produces a sequence of contextual embeddings as:

$$E^{\mathrm{RoBERTa}} = \mathrm{RoBERTa}\big([w_1, w_2, \ldots, w_M]\big), \qquad E^{\mathrm{RoBERTa}} \in \mathbb{R}^{M \times D_{\mathrm{text}}} \tag{14}$$

Here, $D_{\mathrm{text}}$ represents the dimensionality of the RoBERTa embeddings. Each word embedding $e_m^{\mathrm{RoBERTa}}$ captures the semantic meaning and contextual nuances of the corresponding token $w_m$. To explicitly model relational and semantic interactions between words and enhance the emotion-focused contextual representation, we introduce a Temporal Semantic Graph (TSG) based on the GAT. The TSG captures word-to-word emotional dependencies, enhancing the understanding of emotional nuances in textual sequences. The semantic graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is constructed from the tokens of sentence $S$, where the nodes $\mathcal{V} = \{v_1, v_2, \ldots, v_M\}$ correspond to token embeddings from RoBERTa, and the edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ represent pairwise semantic interactions between tokens.

We define edge features between two nodes (words) as the scaled dot-product similarity between their corresponding RoBERTa embeddings, representing the degree of semantic relationship:

$$\alpha_{ij} = \mathrm{softmax}\!\left(\frac{\big(W_q e_i^{\mathrm{RoBERTa}}\big)^{\top}\big(W_k e_j^{\mathrm{RoBERTa}}\big)}{\sqrt{d_k}}\right) \tag{15}$$

where $W_q$ and $W_k$ are learnable parameter matrices that transform the embeddings into query and key vectors, respectively, $d_k$ is the dimensionality of the transformed embeddings, and $\alpha_{ij}$ denotes the attention weight representing the semantic connectivity from node $i$ to node $j$.
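The edge weights of Eq. (15) can be sketched as follows; the projection dimensions and token count are assumptions for illustration, not values reported in the paper.

```python
# Scaled dot-product edge weights between RoBERTa token embeddings (Eq. 15).
import torch
import torch.nn as nn

D_text, d_k = 768, 64
W_q = nn.Linear(D_text, d_k, bias=False)
W_k = nn.Linear(D_text, d_k, bias=False)

tokens = torch.randn(1, 12, D_text)                 # RoBERTa embeddings for M = 12 tokens
q, k = W_q(tokens), W_k(tokens)
scores = q @ k.transpose(-2, -1) / d_k ** 0.5       # pairwise semantic similarity
alpha = scores.softmax(dim=-1)                      # attention weights alpha_ij over tokens
```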

The heatmap in Fig. 4 visualizes the attention weights calculated by the self-attention mechanism of RoBERTa, or by the TSG using a GAT, over a sample sentence. Each cell contains the attention strength from a query token (row) to a key token (column), thus showing which relationships among words the model focuses on while reasoning about emotions. Cumulatively, attention allocation for emotionally salient words like “happy” is higher, influencing the resultant emotion embedding.


Figure 4: Token-level attention heatmap in the textual emotion branch.

We employ a multi-head Graph Attention Network to aggregate information from neighbor nodes, enhancing each token’s embedding with its contextual semantic information. The GAT aggregation rule for updating node features is formulated as follows:

$$e_i' = \big\Vert_{k=1}^{K}\; \sigma\!\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k}\, W_v^{k}\, e_j^{\mathrm{RoBERTa}}\right) \tag{16}$$

where $e_i'$ denotes the updated embedding for token $i$, concatenated ($\Vert$) across $K$ attention heads; $\mathcal{N}_i$ represents the neighboring nodes (tokens) connected to node $i$; $\alpha_{ij}^{k}$ denotes the attention weights computed by head $k$; $W_v^{k}$ are learnable parameters (value matrices) for attention head $k$; and $\sigma$ denotes a non-linear activation function.

Fig. 5 presents the semantic graph illustrating the emotional states and contextual relationships associated with the tokens in a sample sentence, as processed by the Textual Emotion Branch using a GAT. Each token is a node, and directed edges represent learned semantic relations weighted by attention scores. The model can identify emotionally important words like “happy” by aggregating information from associated tokens with the help of contextual information; the graph and context thus yield an emotion representation that improves on the raw word representations. For stability and efficiency, we average the multi-head results:

$$e_i' = \sigma\!\left(\frac{1}{K}\sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k}\, W_v^{k}\, e_j^{\mathrm{RoBERTa}}\right) \tag{17}$$
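A dense sketch of the multi-head update in Eqs. (16)–(17), assuming a fully connected token graph and tanh as the non-linearity σ; the head count and embedding dimension are illustrative assumptions.

```python
# Multi-head GAT-style aggregation with averaged heads (Eq. 17); dense form, assumptions noted above.
import torch
import torch.nn as nn

K, D_text = 4, 768
W_v = nn.ModuleList([nn.Linear(D_text, D_text, bias=False) for _ in range(K)])  # per-head value matrices

def gat_update(tokens, alpha_per_head):
    # tokens: (M, D_text); alpha_per_head: (K, M, M) attention weights per head
    head_outputs = [alpha_per_head[k] @ W_v[k](tokens) for k in range(K)]
    return torch.tanh(torch.stack(head_outputs, dim=0).mean(dim=0))   # averaged heads, tanh as sigma

tokens = torch.randn(12, D_text)
alpha = torch.softmax(torch.randn(K, 12, 12), dim=-1)                 # placeholder weights from Eq. (15)
updated = gat_update(tokens, alpha)                                   # (M, D_text) context-enriched embeddings
```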


Figure 5: Semantic graph of token interactions from the TSG.

To obtain a sentence-level embedding, we apply weighted average pooling over the updated node embeddings $e_i'$, with the pooling weights expressing emotional salience so that more emotionally significant words receive greater weight:

$$e_{\mathrm{final}}^{\mathrm{text}} = \sum_{i=1}^{M} \beta_i\, e_i', \qquad \text{where } \beta_i = \mathrm{softmax}\big(W_\beta e_i' + b_\beta\big) \tag{18}$$

where $e_{\mathrm{final}}^{\mathrm{text}} \in \mathbb{R}^{D_{\mathrm{final}}}$ represents the final emotional embedding of the text and $\beta_i$ are the learned emotional-significance weights for each token embedding; $W_\beta$ and $b_\beta$ are learnable parameters. This embedding integrates both the semantic and the relational emotional information contained in the textual modality. In this branch, RoBERTa extracts deep, contextual token embeddings, which are then enriched by the TSG, capturing semantic emotional relationships between tokens through graph attention. The final emotional embedding therefore serves as a detailed and robust representation of the textual modality, providing valuable information for the subsequent adaptive multimodal fusion process.
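The attention pooling of Eq. (18) reduces to a small learned scoring layer followed by a softmax over tokens; a minimal sketch follows, with sizes chosen only for illustration.

```python
# Attention-weighted pooling of updated token embeddings into the sentence embedding (Eq. 18).
import torch
import torch.nn as nn

D_text = 768
score_layer = nn.Linear(D_text, 1)                     # W_beta, b_beta

def pool_sentence(updated_tokens):                     # updated_tokens: (M, D_text)
    beta = score_layer(updated_tokens).softmax(dim=0)  # (M, 1) token salience weights
    return (beta * updated_tokens).sum(dim=0)          # e_text_final: (D_text,)

e_text_final = pool_sentence(torch.randn(12, D_text))
```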

3.3 Dual-Branch Dynamic Attention Mechanism

The primary characteristic of SYMPHONIA is its Dual-Branch Dynamic Attention Mechanism, which effectively combines facial and textual emotional signals. Unlike traditional static or one-way attention systems, this method employs a bidirectional model in which the two modalities interact with each other. This enables the model to detect subtle emotional nuances by allowing each modality to dynamically influence the interpretation of the other. Traditional cross-modal attention methods frequently employ fixed attention weights, which fail to account for the changing relevance of modalities in various emotional contexts. In contrast, SYMPHONIA’s dynamic system uses facial features to guide text interpretation, emphasising emotional words that correspond to facial expressions. In a similar vein, textual characteristics impact facial image analysis, emphasising expressive facial cues that align with the text’s emotional tone. A deeper, more nuanced understanding of emotions across various inputs is provided by this dynamic interaction between the two modalities, which also improves the model’s adaptability.

The facial modality embedding $h_T \in \mathbb{R}^{D_{\mathrm{face}}}$ serves as a context vector to modulate the textual embeddings $E^{\mathrm{text}} = [e_1, e_2, \ldots, e_M] \in \mathbb{R}^{M \times D_{\mathrm{text}}}$. The attention scores between the facial embedding $h_T$ and the textual embeddings $e_i$ are computed as:

$$u_i^{ft} = \tanh\big(W_{ft}^{(1)} e_i + W_{ft}^{(2)} h_T + b_{ft}\big) \tag{19}$$

$$\gamma_i^{ft} = \frac{\exp\big(u_{ft}^{\top} u_i^{ft}\big)}{\sum_{j=1}^{M} \exp\big(u_{ft}^{\top} u_j^{ft}\big)} \tag{20}$$

where $W_{ft}^{(1)} \in \mathbb{R}^{D_a \times D_{\mathrm{text}}}$, $W_{ft}^{(2)} \in \mathbb{R}^{D_a \times D_{\mathrm{face}}}$, and $b_{ft} \in \mathbb{R}^{D_a}$ are learnable model parameters, and $u_{ft} \in \mathbb{R}^{D_a}$ is a learnable vector that maps the transformed embeddings to attention scores. The attention weights from the facial to the textual modality are denoted $\gamma_i^{ft}$. The facial-conditioned textual embedding $c^{ft}$ is computed as a weighted sum of the textual embeddings:

$$c^{ft} = \sum_{i=1}^{M} \gamma_i^{ft}\, e_i, \qquad c^{ft} \in \mathbb{R}^{D_{\mathrm{text}}} \tag{21}$$

Similarly, the textual embedding $e_{\mathrm{final}}^{\mathrm{text}} \in \mathbb{R}^{D_{\mathrm{text}}}$ is used to dynamically guide attention across the facial embeddings $H = [h_1, h_2, \ldots, h_T] \in \mathbb{R}^{T \times D_{\mathrm{face}}}$. After obtaining the dynamically attended cross-modal embeddings $c^{ft}$ (facial-conditioned textual embedding) and $c^{tf}$ (textual-conditioned facial embedding), we perform a dynamic integration step. This integration adaptively merges both modality embeddings using a learned gating mechanism, ensuring balanced contributions based on context:

$$g_{\mathrm{fusion}} = \sigma\big(W_g [c^{ft}; c^{tf}] + b_g\big), \qquad g_{\mathrm{fusion}} \in [0,1]^{D_{\mathrm{fusion}}} \tag{22}$$

The final integrated embedding $f_{\mathrm{fusion}}$ is then computed as:

$$f_{\mathrm{fusion}} = g_{\mathrm{fusion}} \odot \big(W_f^{(1)} c^{ft}\big) + \big(1 - g_{\mathrm{fusion}}\big) \odot \big(W_f^{(2)} c^{tf}\big) \tag{23}$$

where $W_g$, $W_f^{(1)}$, $W_f^{(2)}$, and $b_g$ are learnable parameters and $\sigma$ denotes the sigmoid activation, providing dynamic control over each modality’s influence. The resulting $f_{\mathrm{fusion}} \in \mathbb{R}^{D_{\mathrm{fusion}}}$ is the integrated multimodal emotional embedding, capturing the dynamic, bidirectional cross-modal interactions and emotional contexts between the facial and textual modalities. By enabling the textual and facial embeddings to dynamically adapt and mutually inform one another, the Dual-Branch Dynamic Attention Mechanism greatly improves emotion recognition by capturing complex cross-modal emotional relationships.

In the SYMPHONIA framework, the adaptively merged embedding $f_{\mathrm{fusion}}$ offers a strong basis for the subsequent hierarchical fusion and emotion classification.
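A compact sketch of the facial-to-textual attention (Eqs. (19)–(21)) and the gated integration (Eqs. (22)–(23)); the symmetric textual-to-facial direction follows the same pattern, and all dimensions are illustrative assumptions rather than the authors' settings.

```python
# Facial-conditioned textual attention and gated cross-modal integration (Eqs. 19-23).
import torch
import torch.nn as nn

D_text, D_face, D_a, D_fusion = 768, 256, 128, 256

W_ft1 = nn.Linear(D_text, D_a)
W_ft2 = nn.Linear(D_face, D_a)
u_ft = nn.Parameter(torch.randn(D_a))                    # learnable scoring vector u_ft
W_g = nn.Linear(D_text + D_face, D_fusion)
W_f1 = nn.Linear(D_text, D_fusion)
W_f2 = nn.Linear(D_face, D_fusion)

def facial_conditioned_text(E_text, h_T):
    # E_text: (M, D_text) token embeddings; h_T: (D_face,) facial context vector
    u = torch.tanh(W_ft1(E_text) + W_ft2(h_T))           # Eq. (19)
    gamma = (u @ u_ft).softmax(dim=0)                    # Eq. (20)
    return (gamma.unsqueeze(-1) * E_text).sum(dim=0)     # c_ft, Eq. (21)

def dynamic_integration(c_ft, c_tf):
    g = torch.sigmoid(W_g(torch.cat([c_ft, c_tf])))      # gate g_fusion, Eq. (22)
    return g * W_f1(c_ft) + (1 - g) * W_f2(c_tf)         # f_fusion, Eq. (23)

c_ft = facial_conditioned_text(torch.randn(12, D_text), torch.randn(D_face))
f_fusion = dynamic_integration(c_ft, torch.randn(D_face))   # c_tf mocked for illustration
```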

3.4 Hierarchical Adaptive Fusion Module

The Hierarchical Adaptive Fusion Module integrates facial and textual information at multiple semantic levels, progressively building an understanding of emotional states. The fusion strategy combines adaptive gating with self-supervised contrastive learning over aligned multimodal representations. Rather than collapsing the modalities into a single unified space at one fusion stage, fusion proceeds in stages: an early combination of modality-specific features captures primary emotional cues; at the intermediate stage, the model enforces semantic alignment across modalities with a contrastive objective, pulling corresponding emotional representations together while pushing unrelated ones apart; and at the highest level of abstraction, attention-based adaptive gating dynamically highlights the most informative components from each fusion stage. The module thereby produces a richly encoded, hierarchically contextualized emotional embedding with robust alignment and high sensitivity, supporting interpretability and structural coherence. In the first stage, the initial modality embeddings, facial $h_T$ and textual $e_{\mathrm{final}}^{\mathrm{text}}$, are combined. An adaptive gating mechanism integrates the modality-specific features, preserving complementary emotional cues:

$$f_{\mathrm{low}} = g_{\mathrm{low}} \odot \big(W_l^{(1)} h_T\big) + \big(1 - g_{\mathrm{low}}\big) \odot \big(W_l^{(2)} e_{\mathrm{final}}^{\mathrm{text}}\big) \tag{24}$$

where the gating factor $g_{\mathrm{low}}$ is computed as:

$$g_{\mathrm{low}} = \sigma\big(W_g^{\mathrm{low}} [h_T; e_{\mathrm{final}}^{\mathrm{text}}] + b_g^{\mathrm{low}}\big), \qquad g_{\mathrm{low}} \in [0,1]^{D_{\mathrm{low}}} \tag{25}$$

$W_g^{\mathrm{low}}$, $W_l^{(1)}$, $W_l^{(2)}$, and $b_g^{\mathrm{low}}$ are learnable parameters, and $\sigma$ is the sigmoid activation, ensuring adaptive gating. The low-level fused representation $f_{\mathrm{low}}$ captures initial cross-modal interactions at a coarse semantic level.

The adaptive gating mechanism dynamically assigns weights to features derived from low-, mid-, and high-level fusion stages, with the high-level fusion contributing the most significantly, as illustrated in Fig. 6.


Figure 6: Contribution of each hierarchical fusion stage to final emotion prediction.

To enhance cross-modal semantic alignment, we apply a self-supervised contrastive learning-based fusion at the intermediate stage. The contrastive objective ensures that corresponding modality embeddings are drawn closer together in the joint embedding space while mismatched modality pairs remain farther apart. We apply a contrastive loss ($\mathcal{L}_{\mathrm{contrastive}}$) inspired by the InfoNCE formulation to align these representations:

$$\mathcal{L}_{\mathrm{contrastive}} = -\log \frac{\exp\!\big(\mathrm{sim}(c^{ft}, c^{tf}) / \tau\big)}{\sum_{n=1}^{N} \exp\!\big(\mathrm{sim}(c^{ft}, c^{tf(n)}) / \tau\big)} \tag{26}$$

where $\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert\, \lVert v \rVert}$ denotes cosine similarity, $\tau$ is a temperature hyperparameter that scales the similarity, and $N$ is the number of negative samples (unaligned embeddings). By minimizing this loss, the mid-level fusion learns modality embeddings with strong semantic correspondence, ensuring cross-modal representational consistency (see Fig. 8). The mid-level fused embedding $f_{\mathrm{mid}}$ is then obtained by averaging the aligned embeddings, reflecting enhanced semantic coherence:

$$f_{\mathrm{mid}} = \frac{1}{2}\big(c^{ft} + c^{tf}\big) \tag{27}$$
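A minimal sketch of an InfoNCE-style alignment loss in the spirit of Eq. (26), assuming that matched facial/textual pairs within a batch act as positives and all other batch items as negatives; the temperature value is an assumption.

```python
# Batch-wise InfoNCE-style contrastive alignment between c_ft and c_tf (cf. Eq. 26).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(c_ft, c_tf, tau=0.07):
    # c_ft, c_tf: (B, D) facial-conditioned textual and textual-conditioned facial embeddings
    sim = F.normalize(c_ft, dim=-1) @ F.normalize(c_tf, dim=-1).T   # cosine similarity matrix
    targets = torch.arange(c_ft.size(0))                            # matched pairs lie on the diagonal
    return F.cross_entropy(sim / tau, targets)                      # -log softmax over negatives

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```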

Fig. 7 shows the cosine similarity between the token embeddings computed by RoBERTa (left) and those enhanced by the GAT (right). The values indicate how semantically close each pair of tokens is. After applying the GAT, semantically related tokens such as “feel” and “happy” show greater similarity, suggesting that emotionally relevant words are properly aligned and that the textual representation is improved. Finally, we combine the previously obtained low-level embedding $f_{\mathrm{low}}$, mid-level embedding $f_{\mathrm{mid}}$, and dynamically integrated embedding $f_{\mathrm{fusion}}$ into a unified high-level emotional embedding using attention-based adaptive gating. We first stack the embeddings into a single set:

$$F = [f_{\mathrm{low}},\, f_{\mathrm{mid}},\, f_{\mathrm{fusion}}], \qquad F \in \mathbb{R}^{3 \times D_{\mathrm{fusion}}} \tag{28}$$


Figure 7: Cosine similarity matrices of token embeddings before and after GAT.

We then compute attention-based gating weights:

$$\alpha = \mathrm{softmax}\big(W_a \tanh(W_F F + b_F) + b_a\big), \qquad \alpha \in \mathbb{R}^{3} \tag{29}$$

where $W_a$, $W_F$, $b_a$, and $b_F$ are learnable parameters.

After alignment, the embeddings from both modalities become semantically closer, demonstrating the effectiveness of the self-supervised contrastive learning framework in improving cross-modal representation coherence, as illustrated in Fig. 8. The final fused representation $f_{\mathrm{high}}$ is a weighted sum of these embeddings, adaptively emphasizing the most emotionally relevant features from each fusion stage:

$$f_{\mathrm{high}} = \sum_{i=1}^{3} \alpha_i F_i, \qquad f_{\mathrm{high}} \in \mathbb{R}^{D_{\mathrm{fusion}}} \tag{30}$$


Figure 8: Visualization of facial and textual embeddings before and after contrastive alignment using t-SNE.

The described adaptive weighting structure enables each fusion level to dynamically adjust its contribution based on context, so that every level functions optimally during integration. The complete hierarchical adaptive fusion module is trained jointly under a single optimization criterion that combines the cross-modal alignment (contrastive) loss with a standard supervised classification loss, resulting in the following combined loss function:

$$\mathcal{L}_{\mathrm{fusion}} = \mathcal{L}_{\mathrm{cls}}\big(f_{\mathrm{high}}, y\big) + \lambda\, \mathcal{L}_{\mathrm{contrastive}} \tag{31}$$

where $y$ denotes the ground-truth emotion labels, $\mathcal{L}_{\mathrm{cls}}$ is typically implemented using cross-entropy or focal loss, and $\lambda$ is a hyperparameter balancing the supervised classification and self-supervised contrastive objectives. The Hierarchical Adaptive Fusion Module successively merges the modality embeddings at different abstraction levels, namely low, mid, and high, creating a strong and semantically consistent multimodal emotional embedding $f_{\mathrm{high}}$. The combination of adaptive gating and self-supervised contrastive learning guarantees smooth integration and proper alignment of the emotional signals from face and text, increasing the model’s ability to interpret emotions and contributing to its robustness.
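A sketch of the stage-level gating of Eqs. (28)–(30) and the joint objective of Eq. (31); the classification head, dimensions, and the value of λ are illustrative assumptions.

```python
# Attention-based gating over the three fusion stages and the combined loss (Eqs. 28-31).
import torch
import torch.nn as nn
import torch.nn.functional as F

D_fusion, num_classes, lam = 256, 7, 0.5
W_F = nn.Linear(D_fusion, D_fusion)
W_a = nn.Linear(D_fusion, 1)
classifier = nn.Linear(D_fusion, num_classes)          # placeholder classification head

def hierarchical_fuse(f_low, f_mid, f_fusion):
    F_stack = torch.stack([f_low, f_mid, f_fusion], dim=0)      # (3, D_fusion), Eq. (28)
    a = W_a(torch.tanh(W_F(F_stack))).softmax(dim=0)            # (3, 1) stage weights, Eq. (29)
    return (a * F_stack).sum(dim=0)                             # f_high, Eq. (30)

def fusion_loss(f_high, label, contrastive_loss):
    # label: scalar LongTensor class index
    cls_loss = F.cross_entropy(classifier(f_high).unsqueeze(0), label.view(1))
    return cls_loss + lam * contrastive_loss                    # Eq. (31)
```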

3.5 Final Classification Layer

The last step in the SYMPHONIA framework involves the classification of the detailed multimodal embedding created through hierarchical adaptive fusion into specific emotion categories. The Final Classification Layer is designed as a Transformer-based module that can effectively leverage contextual embeddings to interpret complex multimodal emotional representations. Given the final multimodal embedding $f_{\mathrm{high}} \in \mathbb{R}^{D_{\mathrm{fusion}}}$, we apply a Transformer encoder to further model the internal contextual dependencies, capturing subtle emotional contexts that may span across different embedding dimensions. To apply the Transformer-based classification module, we first expand the embedding $f_{\mathrm{high}}$ into a sequence of vectors. We define a linear projection of the embedding into a sequence of $L$ tokens, each with dimension $d_{\mathrm{model}}$:

$$F_{\mathrm{seq}} = [f_1, f_2, \ldots, f_L], \qquad f_i \in \mathbb{R}^{d_{\mathrm{model}}} \tag{32}$$

with each token embedding $f_i$ defined as:

$$f_i = W_i^{\mathrm{proj}} f_{\mathrm{high}} + b_i^{\mathrm{proj}}, \qquad i \in \{1, 2, \ldots, L\} \tag{33}$$

where $W_i^{\mathrm{proj}} \in \mathbb{R}^{d_{\mathrm{model}} \times D_{\mathrm{fusion}}}$ and $b_i^{\mathrm{proj}} \in \mathbb{R}^{d_{\mathrm{model}}}$ are learnable parameters. Positional embeddings $e_{\mathrm{pos}}$ are added to each token to preserve ordering information:

$$F_{\mathrm{seq}} = \big[f_1 + e_{\mathrm{pos}}(1),\; f_2 + e_{\mathrm{pos}}(2),\; \ldots,\; f_L + e_{\mathrm{pos}}(L)\big] \tag{34}$$

We feed this sequence into a Transformer Encoder block comprising Multi-Head Self-Attention (MHSA) and feed-forward neural networks (FFN):

$$Z = \mathrm{TEncoder}\big(F_{\mathrm{seq}}\big), \qquad Z \in \mathbb{R}^{L \times d_{\mathrm{model}}} \tag{35}$$

The multi-head self-attention module captures the contextual relationships across the different components of the multimodal embedding sequence. Given an input $X$, self-attention operates as:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{36}$$

where $Q = X W_Q$, $K = X W_K$, $V = X W_V$, with learnable parameters $W_Q$, $W_K$, $W_V$, and $d_k$ is the dimensionality of the key vectors. Multi-head attention further divides the attention operation into $h$ parallel heads, allowing parallel computation and diverse representation learning: $\mathrm{MHSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W_O$, where each head is computed as $\mathrm{head}_i = \mathrm{Att}(X W_Q^i, X W_K^i, X W_V^i)$. Following the Transformer encoder, we aggregate the output representations $Z$ via mean pooling, summarizing the emotional features comprehensively:

$$\mathcal{Z}_{\mathrm{final}} = \frac{1}{L} \sum_{i=1}^{L} Z_i, \qquad \mathcal{Z}_{\mathrm{final}} \in \mathbb{R}^{d_{\mathrm{model}}} \tag{37}$$

The pooled representation $\mathcal{Z}_{\mathrm{final}}$ undergoes a linear transformation and softmax activation to predict the emotion class probabilities:

$$\hat{y} = \mathrm{softmax}\big(W_c \mathcal{Z}_{\mathrm{final}} + b_c\big), \qquad \hat{y} \in \mathbb{R}^{C} \tag{38}$$

where $W_c \in \mathbb{R}^{C \times d_{\mathrm{model}}}$ and $b_c \in \mathbb{R}^{C}$ denote the learnable parameters of the classifier and $C$ is the total number of emotion categories. The predicted class is retrieved via the argmax operation: $y_{\mathrm{pred}} = \arg\max_{c \in [1, C]} \hat{y}_c$. The classification module is trained using a supervised cross-entropy loss, optimizing the model parameters to minimize the discrepancy between predicted and true emotion labels: $\mathcal{L}_{\mathrm{cls}} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$, where $y_c$ denotes the ground-truth emotion label in one-hot format and $\hat{y}_c$ is the predicted class probability from the softmax.
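A minimal sketch of the classification head in Eqs. (32)–(38). For brevity, a single shared linear layer produces all $L$ token projections (a simplification of the per-token projections $W_i^{\mathrm{proj}}$ in Eq. (33)), and all sizes are assumptions rather than the paper's configuration.

```python
# Transformer-based classification head: project f_high to L tokens, encode, pool, classify.
import torch
import torch.nn as nn

D_fusion, d_model, L, C = 256, 128, 8, 7

proj = nn.Linear(D_fusion, L * d_model)                 # shared token projection (simplified Eqs. 32-33)
pos = nn.Parameter(torch.zeros(1, L, d_model))          # positional embeddings, Eq. (34)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(d_model, C)                            # W_c, b_c

def classify(f_high):                                   # f_high: (B, D_fusion)
    tokens = proj(f_high).view(-1, L, d_model) + pos    # F_seq
    z = encoder(tokens)                                 # Eq. (35)
    z_final = z.mean(dim=1)                             # mean pooling, Eq. (37)
    return head(z_final).softmax(dim=-1)                # class probabilities, Eq. (38)

probs = classify(torch.randn(2, D_fusion))
```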

The heatmap in Fig. 9 visualizes the attention scores computed over the projected token embeddings within the Transformer encoder of the final classification layer. Certain tokens like [CLS], facial embeddings (F1, F2), textual embeddings (T1, T2), and even the [EOS] token are influenced differently. The model can contextually refine the unified representation through the attention mechanism with regard to the most emotionally predictive tokens before the finalized prediction of emotions.


Figure 9: Transformer attention map within the final classification layer.

Combined with the fusion-module loss introduced earlier, the overall training objective of the SYMPHONIA model is defined as $\mathcal{L}_T = \mathcal{L}_{\mathrm{cls}} + \lambda\, \mathcal{L}_{\mathrm{contrastive}}$, with the hyperparameter $\lambda$ balancing the classification and contrastive alignment objectives. The final classification layer thus performs Transformer-based context modeling, drawing on the complex cross-modal relationships captured by hierarchical adaptive fusion. By incorporating self-attention and robust pooling, the classification module interprets multimodal emotional data accurately, ensuring precise and reliable emotion predictions.

4  Experimental Results

To systematically assess the impact of the proposed SYMPHONIA framework, we constructed a comprehensive evaluation protocol covering well-known multimodal emotion recognition datasets, benchmark methods, and evaluation procedures spanning all relevant implementation dimensions.

4.1 Datasets

To validate the accuracy and applicability of our proposed multimodal emotion recognition system, we performed a systematic evaluation on four publicly available benchmark datasets: IEMOCAP, MELD, CMU-MOSI, and CMU-MOSEI. Together, these datasets cover diverse linguistic content, speaker heterogeneity, emotional expression in language, and real-world complexity, making the evaluation reliable and holistic (Table 1).


The IEMOCAP dataset comprises approximately 12 h of richly annotated audio-visual recordings of dyadic, emotionally charged dialogues performed by trained actors. It contains tri-modal video, audio, and text data and is labeled with coarse-grained emotion categories including anger, happiness, sadness, neutral, excitement, and frustration. The MELD dataset, derived from the Friends television series, contains a large collection of naturalistic, spontaneous, multi-party conversations. It consists of more than 13,000 utterances across more than 1400 dialogues, annotated with 7 emotions (anger, disgust, fear, joy, neutral, sadness, and surprise) in the video, audio, and text streams.

The CMU-MOSI dataset centers on opinionated monologue video segments and provides fine-grained sentiment-intensity annotations alongside facial and vocal multimodal data. It contains 93 video clips from which 2199 opinion segments have been extracted. CMU-MOSEI extends MOSI with over 23,000 annotated utterances covering emotions such as anger, disgust, fear, happiness, sadness, and surprise, alongside continuous sentiment scores and discrete values for each emotion. Like the other datasets, CMU-MOSEI preserves synchronized facial video, audio, and text transcripts. Together, the datasets serve as a solid benchmark for testing the model on controlled and spontaneous interaction, layered and unlayered emotion, and a diverse range of human behavior.

4.2 Evaluation Metrics

To thoroughly assess our model performance, we employed standard evaluation metrics commonly used for multimodal emotion recognition:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{39}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{40}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{41}$$

$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{42}$$

Additionally, for sentiment intensity and continuous emotional dimensions (CMU-MOSI/CMU-MOSEI):

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert \hat{y}_i - y_i \rvert \tag{43}$$

$$\mathrm{Pearson\ Correlation} = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}\, \sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}} \tag{44}$$
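For reference, the regression metrics of Eqs. (43)–(44) can be computed directly; a small NumPy sketch with illustrative values follows.

```python
# MAE and Pearson correlation for continuous sentiment predictions (Eqs. 43-44).
import numpy as np

def mae(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))                              # Eq. (43)

def pearson_corr(y_pred, y_true):
    yp, yt = y_pred - y_pred.mean(), y_true - y_true.mean()
    return (yt * yp).sum() / np.sqrt((yt ** 2).sum() * (yp ** 2).sum())  # Eq. (44)

y_true = np.array([1.2, -0.5, 0.0, 2.3])   # illustrative sentiment labels
y_pred = np.array([1.0, -0.7, 0.2, 2.0])   # illustrative predictions
print(mae(y_pred, y_true), pearson_corr(y_pred, y_true))
```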

4.3 Results

This section presents comprehensive experimental findings for the SYMPHONIA model across the evaluated datasets and protocols. We report quantitative comparisons against baseline models, ablation tests, cross-dataset generalization analysis, and model interpretability evaluations. To demonstrate the comparative effectiveness of our model, we selected several state-of-the-art (SOTA) multimodal emotion recognition baseline methods for comparison in Table 2.


Table 3 presents the summary statistics for class-based emotions and classification metrics (Accuracy, Precision, Recall, F1-score) for the IEMOCAP and MELD datasets. The results show that SYMPHONIA performed better than other methods in all the evaluation criteria.


Table 4 shows sentiment intensity prediction results for the CMU-MOSI and CMU-MOSEI datasets using Pearson correlation (Corr) and Mean Absolute Error (MAE).


The outcomes of this study show that SYMPHONIA performs better than other models at capturing nuanced emotional and sentiment representations and improves quantitative metrics across a wide range of emotional tasks. We performed thorough ablation studies to assess how particular features contribute to the model’s overall performance (Table 5).


The results demonstrate the combined and individual contributions of the facial and textual components, with dynamic attention and hierarchical fusion improving performance significantly (Fig. 10).


Figure 10: Ablation study visualization based on IEMOCAP dataset results.

Omitting these components reduces effectiveness, demonstrating their importance. To evaluate generalization ability, we trained SYMPHONIA on IEMOCAP and tested it on MELD in a direct transfer scenario, without any additional tuning (Table 6).


Our results demonstrate the strong generalization performance of SYMPHONIA, especially on cross-dataset and cross-modal transfer tasks, in which it consistently outperforms all baseline models. A further qualitative study shows how dynamic cross-modal attention resolves modality conflicts by attending to the most reliable emotional cues, which yields the observed performance improvement. Consider, for instance, the text “I’m fine, everything’s good”, which expresses a positive emotional tone, combined with subtle facial sadness as shown in Fig. 11.


Figure 11: The SYMPHONIA model produces highly accurate predictions on multimodal input samples, effectively identifying a wide range of emotional expressions. These include Surprise (96%), Fear (87%), Disgust (92%), Sadness (99%), Neutral (91%, 94%), Anger (93%), and Joy (89%). These findings point to the model’s effectiveness in capturing a wide range of emotional conditions for different subjects under various situations. The results further accentuate the capability of SYMPHONIA in distinguishing subtle facial expressions and closely matching them with textual cues via its dynamic attention mechanism and hierarchical fusion approach.

While text-only approaches would mistakenly infer “happiness” from this utterance, SYMPHONIA, through facial-conditioned textual attention, identifies the true emotion as “sadness”. Conversely, another example pairs the phrase “That’s amazing!” with a neutral or inconsistent facial expression. Here, SYMPHONIA adaptively attends to the verbal modality and classifies the emotion as “surprise” or “happiness”, which static mechanisms fail to do because of the facial ambiguity. This demonstrates that SYMPHONIA’s dynamic attention not only improves prediction accuracy but also supports fine-grained, context-sensitive explanations of its decisions in terms of the most informative modality.

5  Conclusions

This paper proposes SYMPHONIA, a new model that effectively recognizes and integrates emotional signals from facial expressions and text. Instead of relying on a static approach, the signals are processed dynamically through a dual-branch dynamic attention mechanism and adaptive hierarchical fusion. Extensive experiments evaluated the model on the IEMOCAP, MELD, CMU-MOSI, and CMU-MOSEI datasets against state-of-the-art methods for multimodal emotion recognition. The experimental results show that the proposed model outperforms current models, yielding higher accuracy and improved F1-score, sentiment-intensity correlation, and MAE. Concretely, SYMPHONIA achieved 80.9% accuracy and an 80.1% F1-score on the IEMOCAP dataset, compared to DualGATs (74.8%) and EmoCLIP (75.3%). On the MELD dataset, SYMPHONIA achieved an accuracy of 74.2% and an F1-score of 73.5%, outperforming the other baseline models. For sentiment prediction, it obtained a Pearson correlation of 0.86 on MOSI and 0.83 on MOSEI, outperforming all other baseline methods. Extensive ablation studies confirm that the model’s success is largely due to its cross-modal dynamic attention and adaptive hierarchical fusion components, which enhance stability and explainability. This supports the hypothesis that context-aware, adaptive multimodal fusion improves model performance.

The generalization of SYMPHONIA across datasets further underlines its robustness to domain shifts and variations in emotional context. When trained on IEMOCAP and tested on MELD, SYMPHONIA achieved 66.9% accuracy and a 65.8% F1-score, outperforming all baselines, including TFN, MFN, and DualGATs, by a margin of more than 8%. Qualitative analyses further show that the model excels at interpreting complex signals by dynamically focusing on the most salient emotional cues, providing further evidence of the strength of its framework.

Building on the promising results of SYMPHONIA, future work will explore the integration of additional modalities, such as audio and physiological signals (e.g., ECG, EEG, and GSR), to capture subtle and complementary emotional cues, thereby enhancing contextual understanding and predictive accuracy. Furthermore, adaptive multi-sensor fusion and space–frequency selective feature modeling strategies, as demonstrated in AMSO-SFS [39], will be investigated to improve robustness under challenging visual conditions. In addition, iterative refinement and scale-alignment mechanisms inspired by DI-MDE [40] may be incorporated to strengthen temporal consistency and dynamic scene representation in multimodal emotion recognition. Lastly, improvements in model reliability under low-resource conditions can be achieved with the help of advanced self-supervised learning, large-scale pretraining, and transfer learning; this will make a system robust, adaptive, and context-aware.

Acknowledgement: None.

Funding Statement: This study was funded by the Korea Agency for Technology and Standards in 2022, project numbers 1415181629 (20022340, Development of International Standard Technologies Based on AI Model Lightweighting Technologies).

Author Contributions: The authors confirm contribution to the paper as follows: study conception and design: Akmalbek Abdusalomov, Alpamis Kutlimuratov, Avazjon Marakhimov, Kuanishbay Seytnazarov and Young-Im Cho; data collection: Alpamis Kutlimuratov, Mukhriddin Mukhiddinov and Kamola Abdurashidova; software: Akmalbek Abdusalomov and Alpamis Kutlimuratov; analysis and interpretation of results: Akmalbek Abdusalomov, Mukhriddin Mukhiddinov, Alpamis Kutlimuratov, Avazjon Marakhimov and Kuanishbay Seytnazarov; draft manuscript preparation: Akmalbek Abdusalomov and Alpamis Kutlimuratov; supervision: Young-Im Cho. All authors reviewed and approved the final version of the manuscript.

Availability of Data and Materials: Data openly available in a public repository. “The data that support the findings of this study are openly available in IEMOCAP at https://sail.usc.edu/iemocap/index.html, MELD at https://affective-meld.github.io/ and CMU-MOSI at http://multicomp.cs.cmu.edu/resources/cmu-mosi-dataset/”.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Khare SK, Blanes-Vidal V, Nadimi ES, Acharya UR. Emotion recognition and artificial intelligence: a systematic review (2014–2023) and research recommendations. Inf Fusion. 2024;102(3):102019. doi:10.1016/j.inffus.2023.102019. [Google Scholar] [CrossRef]

2. Guo R, Guo H, Wang L, Chen M, Yang D, Li B. Development and application of emotion recognition technology—a systematic literature review. BMC Psychol. 2024;12(1):95. doi:10.1186/s40359-024-01581-4. [Google Scholar] [PubMed] [CrossRef]

3. Zhang S, Yang Y, Chen C, Zhang X, Leng Q, Zhao X. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: a systematic review of recent advancements and future prospects. Expert Syst Appl. 2024;237:121692. doi:10.1016/j.eswa.2023.121692. [Google Scholar] [CrossRef]

4. AVG, Mala T, Priyanka D, Uma E. Multimodal emotion recognition with deep learning: advancements, challenges, and future directions. Inf Fusion. 2024;105(2):102218. doi:10.1016/j.inffus.2023.102218. [Google Scholar] [CrossRef]

5. Zhu X, Huang Y, Wang X, Wang R. Emotion recognition based on brain-like multimodal hierarchical perception. Multimed Tools Appl. 2024;83(18):56039–57. doi:10.1007/s11042-023-17347-w. [Google Scholar] [CrossRef]

6. Meng T, Shou Y, Ai W, Yin N, Li K. Deep imbalanced learning for multimodal emotion recognition in conversations. arXiv:2312.06337. 2023. [Google Scholar]

7. Safarov F, Kutlimuratov A, Khojamuratova U, Abdusalomov A, Cho YI. Enhanced AlexNet with Gabor and local binary pattern features for improved facial emotion recognition. Sensors. 2025;25(12):3832. doi:10.3390/s25123832. [Google Scholar] [PubMed] [CrossRef]

8. Richet N, Belharbi S, Aslam H, Schadt ME, González-González M, Cortal G, et al. Textualized and feature-based models for compound multimodal emotion recognition in the wild. arXiv:2407.12927. 2024. [Google Scholar]

9. Abdusalomov A, Kutlimuratov A, Nasimov R, Whangbo TK. Improved speech emotion recognition focusing on high-level data representations and swift feature extraction calculation. Comput Mater Contin. 2023;77(3):2915–33. doi:10.32604/cmc.2023.044466. [Google Scholar] [CrossRef]

10. Rathi T, Tripathy M. Analyzing the influence of different speech data corpora and speech features on speech emotion recognition: a review. Speech Commun. 2024;162:103102. doi:10.1016/j.specom.2024.103102. [Google Scholar] [CrossRef]

11. Makhmudov F, Kutlimuratov A, Cho YI. Hybrid LSTM–attention and CNN model for enhanced speech emotion recognition. Appl Sci. 2024;14(23):11342. doi:10.3390/app142311342. [Google Scholar] [CrossRef]

12. Alhussein G, Ziogas I, Saleem S, Hadjileontiadis LJ. Speech emotion recognition in conversations using artificial intelligence: a systematic review and meta-analysis. Artif Intell Rev. 2025;58(7):198. doi:10.1007/s10462-025-11197-8. [Google Scholar] [CrossRef]

13. Kalateh S, Estrada-Jimenez LA, Nikghadam-Hojjati S, Barata J. A systematic review on multimodal emotion recognition: building blocks, current state, applications, and challenges. IEEE Access. 2024;12(4):103976–4019. doi:10.1109/ACCESS.2024.3430850. [Google Scholar] [CrossRef]

14. Hazmoune S, Bougamouza F. Using transformers for multimodal emotion recognition: taxonomies and state of the art review. Eng Appl Artif Intell. 2024;133(3):108339. doi:10.1016/j.engappai.2024.108339. [Google Scholar] [CrossRef]

15. Yin G, Liu Y, Liu T, Zhang H, Fang F, Tang C, et al. Token-disentangling mutual transformer for multimodal emotion recognition. Eng Appl Artif Intell. 2024;133(10):108348. doi:10.1016/j.engappai.2024.108348. [Google Scholar] [CrossRef]

16. Kim T, Vossen P. EmoBERTa: speaker-aware emotion recognition in conversation with RoBERTa. arXiv:2108.12009. 2021. [Google Scholar]

17. Udahemuka G, Djouani K, Kurien AM. Multimodal emotion recognition using visual, vocal and physiological signals: a review. Appl Sci. 2024;14(17):8071. doi:10.3390/app14178071. [Google Scholar] [CrossRef]

18. Gursesli MC, Lombardi S, Duradoni M, Bocchi L, Guazzini A, Lanata A. Facial emotion recognition (FER) through custom lightweight CNN model: performance evaluation in public datasets. IEEE Access. 2024;12(15):45543–59. doi:10.1109/ACCESS.2024.3380847. [Google Scholar] [CrossRef]

19. Deshmukh S, Gupta P. Application of probabilistic neural network for speech emotion recognition. Int J Speech Technol. 2024;27(1):19–28. doi:10.1007/s10772-023-10037-w. [Google Scholar] [CrossRef]

20. Rakhimovich MA, Kadirbergenovich KK, Rakhmovich OU, Rustem J. A new type of architecture for neural networks with multi-connected weights in classification problems. In: Proceedings of the 12th World Conference “Intelligent System for Industrial Automation” (WCIS-2022); 2022 Nov 25–26; Tashkent, Uzbekistan. p. 105–12. [Google Scholar]

21. Shou Y, Meng T, Ai W, Zhang F, Yin N, Li K. Adversarial alignment and graph fusion via information bottleneck for multimodal emotion recognition in conversations. Inf Fusion. 2024;112(9):102590. doi:10.1016/j.inffus.2024.102590. [Google Scholar] [CrossRef]

22. Wang L, Kang X, Ding F, Nakagawa S, Ren F. A joint local spatial and global temporal CNN-transformer for dynamic facial expression recognition. Appl Soft Comput. 2024;161(2):111680. doi:10.1016/j.asoc.2024.111680. [Google Scholar] [CrossRef]

23. Tagmatova Z, Umirzakova S, Kutlimuratov A, Abdusalomov A, Cho YI. A hyper-attentive multimodal transformer for real-time and robust facial expression recognition. Appl Sci. 2025;15(13):7100. doi:10.3390/app15137100. [Google Scholar] [CrossRef]

24. Zhu A, Li K, Wu T, Zhao P, Hong B. Cross-task multi-branch vision transformer for facial expression and mask wearing classification. arXiv:2404.14606. 2024. [Google Scholar]

25. Zakieldin K, Khattab R, Ibrahim E, Arafat E, Ahmed N, Hemayed E. ViTCN: hybrid vision transformer with temporal convolution for multi-emotion recognition. Int J Comput Intell Syst. 2024;17(1):64. doi:10.1007/s44196-024-00436-5. [Google Scholar] [CrossRef]

26. Tian Y, Zhu J, Yao H, Chen D. Facial expression recognition based on vision transformer with hybrid local attention. Appl Sci. 2024;14(15):6471. doi:10.3390/app14156471. [Google Scholar] [CrossRef]

27. Nawaz U, Saeed Z, Atif K. A novel transformer-based approach for adult’s facial emotion recognition. IEEE Access. 2025;13(3):56485–508. doi:10.1109/ACCESS.2025.3555510. [Google Scholar] [CrossRef]

28. Elyoseph Z, Refoua E, Asraf K, Lvovsky M, Shimoni Y, Hadar-Shoval D. Capacity of generative AI to interpret human emotions from visual and textual data: pilot evaluation study. JMIR Ment Health. 2024;11(2):e54369. doi:10.2196/54369. [Google Scholar] [PubMed] [CrossRef]

29. Shelke N, Chaudhury S, Chakrabarti S, Bangare SL, Yogapriya G, Pandey P. An efficient way of text-based emotion analysis from social media using LRA-DNN. Neurosci Inform. 2022;2(3):100048. doi:10.1016/j.neuri.2022.100048. [Google Scholar] [CrossRef]

30. Madrakhimov S, Makharov K, Khurramov A. On the transparency of decision-making in classification by precedents with fuzzy descriptions. IEEE Access. 2025;13:173656–64. doi:10.1109/ACCESS.2025.3616052. [Google Scholar] [CrossRef]

31. Zhu P, Wang B, Tang K, Zhang H, Cui X, Wang Z. A knowledge-guided graph attention network for emotion-cause pair extraction. Knowl Based Syst. 2024;286(3):111342. doi:10.1016/j.knosys.2023.111342. [Google Scholar] [CrossRef]

32. Zhang D, Chen F, Chen X. DualGATs: dual graph attention networks for emotion recognition in conversations. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2023 Jul 9–14; Toronto, ON, Canada. p. 7395–408. [Google Scholar]

33. Wang D, Guo X, Tian Y, Liu J, He L, Luo X. TETFN: a text enhanced transformer fusion network for multimodal sentiment analysis. Pattern Recognit. 2023;136(2):109259. doi:10.1016/j.patcog.2022.109259. [Google Scholar] [CrossRef]

34. Xiang A, Qi Z, Wang H, Yang Q, Ma D. A multimodal fusion network for student emotion recognition based on transformer and tensor product. In: Proceedings of the 2024 IEEE 2nd International Conference on Sensors, Electronics and Computer Engineering (ICSECE); 2024 Aug 29–31; Jinzhou, China. p. 1–4. [Google Scholar]

35. Chudasama V, Kar P, Gudmalwar A, Shah N, Wasnik P, Onoe N. M2FNet: multi-modal fusion network for emotion recognition in conversation. In: Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); 2022 Jun 19–20; New Orleans, LA, USA. p. 4651–60. [Google Scholar]

36. Khujamatov EH, Abdullaev M, Umirzakova S. Analytical modeling of hybrid CNN-transformer dynamics for emotion classification. Mathematics. 2026;14(1):85. doi:10.3390/math14010085. [Google Scholar] [CrossRef]

37. Foteinopoulou NM, Patras I. EmoCLIP: a vision-language method for zero-shot video facial expression recognition. In: Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG); 2024 May 27–31; Istanbul, Turkiye. p. 1–10. [Google Scholar]

38. Khan M, Gueaieb W, El Saddik A, Kwon S. MSER: multimodal speech emotion recognition using cross-attention with deep fusion. Expert Syst Appl. 2024;245(22):122946. doi:10.1016/j.eswa.2023.122946. [Google Scholar] [CrossRef]

39. Abdusalomov A, Umirzakova S, Bakhtiyor Shukhratovich M, Mukhiddinov M, Kakhorov A, Buriboev A, et al. Drone-based wildfire detection with multi-sensor integration. Remote Sens. 2024;16(24):4651. doi:10.3390/rs16244651. [Google Scholar] [CrossRef]

40. Abdusalomov A, Umirzakova S, Shukhratovich MB, Kakhorov A, Cho YI. Breaking new ground in monocular depth estimation with dynamic iterative refinement and scale consistency. Appl Sci. 2025;15(2):674. doi:10.3390/app15020674. [Google Scholar] [CrossRef]




Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.