Open Access

ARTICLE

MDGET-MER: Multi-Level Dynamic Gating and Emotion Transfer for Multi-Modal Emotion Recognition

Musheng Chen1,2, Qiang Wen1, Xiaohong Qiu1,2, Junhua Wu1,*, Wenqing Fu1

1 School of Software Engineering, Jiangxi University of Science and Technology, Nanchang, 330013, China
2 Nanchang Key Laboratory of Virtual Digital Engineering and Cultural Communication, Nanchang, 330013, China

* Corresponding Author: Junhua Wu.

Computers, Materials & Continua 2026, 86(3), 34 https://doi.org/10.32604/cmc.2025.071207

Abstract

In multi-modal emotion recognition, excessive reliance on historical context often impedes the detection of emotional shifts, while modality heterogeneity and unimodal noise limit recognition performance. Existing methods struggle to dynamically adjust cross-modal complementary strength to optimize fusion quality and lack effective mechanisms for modeling the dynamic evolution of emotions. To address these issues, we propose a multi-level dynamic gating and emotion transfer framework for multi-modal emotion recognition. A dynamic gating mechanism is applied across unimodal encoding, cross-modal alignment, and emotion transfer modeling, substantially improving noise robustness and feature alignment. First, we construct a unimodal encoder based on gated recurrent units and feature-selection gating to suppress intra-modal noise and enhance contextual representation. Second, we design a gated-attention cross-modal encoder that dynamically calibrates the complementary contributions of the visual and audio modalities to the dominant textual features and eliminates redundant information. Finally, we introduce a gated enhanced emotion transfer module that explicitly models the temporal dependence of emotional evolution in dialogues via transfer gating and optimizes continuity modeling with a contrastive learning loss. Experimental results demonstrate that the proposed method outperforms state-of-the-art models on the public MELD and IEMOCAP datasets.

Keywords

Multi-modal emotion recognition; dynamic gating; emotion transfer module; cross-modal dynamic alignment; noise robustness

1  Introduction

Multi-modal emotion recognition aims to emulate human emotion understanding by fusing text, visual, and audio information and is widely used in intelligent interactive systems. Although unimodal research in text analysis [1–3] and speech recognition [4–6] has advanced, human emotional expression often exhibits cross-modal inconsistency. As shown in Fig. 1, taking utterance u3 as an example, its visual and audio cues (an expressionless face and a flat tone) are ambiguous, but combined with the textual content they imply hidden anger. In addition, the emotion behind u3 is also related to the preceding context u1 and u2. In particular, the shift from using nicknames in u1 to using full names in u2 suggests an emotion shift triggered by u2, since the other speaker tries to make a joke with pretended lightness. The relationships in {u1, u2, u3} are therefore complex and multivariate, involving interdependence along both the modality and the context dimensions. This example shows that multi-modal methods can significantly improve the robustness of emotion understanding by fusing cross-modal complementary information.


Figure 1: Examples of multi-modal dataset

At the same time, the inherent heterogeneity of multi-modal data, noise interference, and the dynamic evolution of emotional states leave this field with ample room for further exploration. Current mainstream methods, such as DialogueRNN [7], which relies on static weight fusion, or graph convolutional networks [8] that model cross-modal interactions, have made notable progress. However, they generally suffer from several key limitations. First, they struggle to dynamically adjust cross-modal complementary strength according to modality quality and context, resulting in insufficient noise suppression [9]. Second, their mechanisms for filtering intra-modal and cross-modal redundant information are weak. Finally, most methods ignore the modeling of emotional-state transfer in dialogues [10] and rely too heavily on the context-consistency assumption, which makes it difficult to capture the synergy of multi-modal cues when emotions change.

In short, existing methods struggle to dynamically adjust cross-modal complementary strength, their uni-modal and cross-modal redundancy filtering is weak, emotion transfer modeling is generally ignored [11], and the context-consistency assumption is over-relied upon. To address these limitations, we propose a multi-level dynamic gating and emotion transfer framework for multi-modal emotion recognition. We employ gated unimodal encoders to suppress noise and reduce heterogeneity, and a gated-attention cross-modal encoder to dynamically calibrate the complementary contributions of the audio and visual modalities relative to text. In addition, we introduce a gated enhanced emotion transfer module to explicitly model the temporal dependence of emotion evolution in dialogues, thereby jointly improving recognition performance and robustness.

In summary, the contributions of this paper are as follows:

•   We propose a multi-level dynamic gating and emotion transfer framework for multi-modal emotion recognition. A dynamic gating mechanism is applied throughout unimodal encoding, cross-modal alignment, and emotion transfer modeling, significantly improving noise robustness and feature alignment.

•   We design gated recurrent units and a gated-attention mechanism to optimize model performance from the perspectives of unimodal noise suppression and cross-modal dynamic weight distribution, which effectively improves recognition accuracy in scenes with emotional transitions.

•   We introduce an emotion transfer perception module that, together with contrastive learning constraints on the transfer process, enhances the model's adaptability to dynamic dialogue scenarios. We validate the effectiveness and the synergistic contribution of the proposed modules on public datasets.

2  Related Works

Research on Multi-modal Emotion Recognition (MER) continues to deepen, focusing mainly on cross-modal interaction modeling, context-dependency capture, noise and heterogeneity mitigation, and emotional evolution modeling.

Cross-modal interaction modeling forms the basis of MER. Early fusion strategies, such as simple feature concatenation or static weight fusion [3,12], outperform uni-modal approaches but often overlook deep inter-modal correlations, which may exacerbate the heterogeneity gap and noise interference [9]. With technological progress, graph neural networks (GNNs) have become powerful tools for capturing complex cross-modal dependencies. For instance, Shi et al. [8] proposed a modality calibration strategy to refine utterance and speaker representations at both contextual and modal levels, coupled with emotion-aligned hypergraph and line graph fusion to optimize cross-modal semantic transfer and emotional pathway learning. Chen et al. [13] innovatively fused multi-frequency features via hypergraph networks to enhance high-order relationship modeling. Meng et al. [14] further designed a multi-modal heterogeneous graph to enable dynamic semantic aggregation. Meanwhile, the Transformer architecture is also widely used for feature alignment and interaction. Mao et al. [15] designed a hierarchical Transformer to manage contextual preferences within modalities and employed multi-granularity interactive fusion to distinguish modal contributions. Hu et al. [16] deeply integrated audio/visual features with text and combined cross-modal contrastive learning to optimize difference representation.

Given the inherently sequential nature of dialogue, modeling context and dialogue structure is crucial. Majumder et al. [7] used recurrent networks (such as RNNs) to model dialogue sequences, but did not explicitly address modal heterogeneity. Subsequent studies have significantly strengthened context understanding: Fu et al. [17] embedded an emotion-transmission mechanism to coordinate the temporal information flow while integrating textual and audio-visual information; Ai et al. [18] constructed a speaker-relation graph and an event graph to optimize semantic representation and aggregation; and Meng et al. [14] used a heterogeneous graph structure to integrate richer contextual cues. Furthermore, the two-level decoupling mechanism proposed by Li et al. [19] coordinates modal features and dialogue context, improving the granularity of context modeling.

Dealing with the inherent noise of multi-modal data (especially in the visual and audio modalities) and its heterogeneity is a key challenge for improving model robustness. Research in this direction focuses on feature filtering and dynamic weighting strategies. For example, Zeng et al. [20] introduced a modality-modulation loss to learn the contribution weight of each modality and designed a filtering module to actively remove cross-modal noise. Yu et al. [21] generated anchors from emotion labels to guide contrastive learning and optimized decision boundaries to enhance robustness. Li et al. [22] used an attention mechanism to fuse modalities dynamically and to perceive the influence of emotion shifts on modality importance. However, existing methods remain limited in dynamically calibrating complementary strength, for example in adjusting how much the audio and visual modalities supplement the dominant text modality according to the quality and relevance of the speech or image content, and in fine-grained noise suppression, such as distinguishing informative visual cues from irrelevant background noise. Robustness to the inherent noise of the visual and audio modalities therefore still needs improvement.

The flow and transfer of emotions in conversation (emotional evolution modeling) is central to understanding emotional dynamics. Bansal et al. [23] first proposed an explicit emotion shift component to capture the ups and downs in dialogue. Follow-up work has tried to deepen the understanding of the transfer process: Tu et al. [24] integrated a commonsense knowledge base to provide background support for reasoning about emotional-state transfer, and Hu et al. [16] combined an adversarial training strategy to optimize representation learning during emotion transfer. Nevertheless, most existing methods rely on a static context-consistency assumption and struggle to adapt to abrupt emotional shifts. In such cases, emotions change drastically, requiring the model to quickly integrate and weigh new evidence from different modalities that may conflict with or complement each other. Current methods still lack sufficient capability to address this collaborative challenge.

In summary, although current research has made progress in multi-modal fusion and context modeling, the collaborative optimization of dynamic gating mechanisms and emotion transfer remains insufficient.

3  Method

3.1 Task Definition

Given a dialogue containing $|U|$ utterances $U=\{u_1,\dots,u_{|U|}\}$, each utterance $u_i$ is composed of text (T), visual (V), and audio (A) modalities, that is, $u_i=\{u_i^T, u_i^V, u_i^A\}$. The goal of the emotion recognition task is to predict the emotion label $e_i$ of each utterance through a mapping function $e_i=F(x_i)$, where $x_i$ is the multi-modal feature representation of utterance $u_i$.

3.2 Model Architecture

To address the aforementioned challenges in multi-modal emotion recognition, we propose a model named Multi-level Dynamic Gating and Emotion Transfer for Multi-modal Emotion Recognition (MDGET-MER). The neural network architecture constructed under this framework consists of five main components: a feature extractor for each modality, a Gated Uni-modal Encoder (GUE), a Gated Attention-based Cross-Modal Encoder (Gated ACME), a Gated Enhanced Emotion Transfer Module (Gated EETM), and an emotion classifier, as illustrated in Fig. 2. Each component is described in detail in the following sections.


Figure 2: Architecture of MDGET-MER

3.3 Modal Feature Extraction

•   Text modality: Liu et al. [25] proposed RoBERTa, a pre-trained language model based on a multi-layer Transformer architecture that improves text representation through masked language modeling with a dynamic masking strategy. After fine-tuning the model, we extract the last-layer embedding at the [CLS] position as the semantic feature of each utterance (a minimal extraction sketch is given after this list).

•   Audio modality: Eyben et al. [26] proposed openSMILE, a modular speech signal processing toolkit that supports batch extraction of audio parameters such as fundamental frequency and energy. We build our audio feature extraction pipeline on this toolkit to process speech signals efficiently.

•   Visual modality: Huang et al. [27] proposed DenseNet, a model built from dense blocks that enhances feature reuse; we adopt it to extract visual features.
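To make the text branch concrete, the following is a minimal sketch of [CLS]-style feature extraction with RoBERTa, assuming the HuggingFace transformers package; the checkpoint name "roberta-large" and the absence of task-specific fine-tuning are simplifications of the setup described above.

```python
# Minimal sketch of utterance-level text feature extraction with RoBERTa.
# Assumption: HuggingFace transformers; "roberta-large" and the lack of
# fine-tuning are illustrative simplifications, not the authors' exact setup.
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
encoder = RobertaModel.from_pretrained("roberta-large")

@torch.no_grad()
def extract_text_features(utterances):
    """Return one embedding per utterance, taken at the <s> ([CLS]) position."""
    batch = tokenizer(utterances, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state   # (n, seq_len, d)
    return hidden[:, 0, :]                        # (n, d)

features = extract_text_features(["I'm fine.", "Why would you say that?"])
print(features.shape)  # torch.Size([2, 1024])
```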

3.4 Gated Uni-Modal Encoder

Inspired by the Transformer architecture, we propose a Gated Uni-modal Encoder (GUE) to encode utterances within each modality. In the GUE framework, a Gated Recurrent Unit (GRU) is first employed to encode the original features. A gating mechanism then dynamically adjusts the fusion ratio between the original features and the encoded features, thereby enhancing the model’s robustness to noise and improving its ability to extract contextual emotional cues. The network structure of GUE is shown in Fig. 3. Specifically, the structure of GUE can be formalized as:

$$G_1=\sigma\left(W_{g1}[X;\mathrm{GRU}(X)]+b_{g1}\right)\tag{1}$$

$$X_{rr}=\mathrm{LayerNorm}\left(G_1\odot X+(1-G_1)\odot \mathrm{GRU}(X)\right)\tag{2}$$

$$G_2=\sigma\left(W_{g2}[X;X_{rr};\mathrm{FF}(X_{rr})]+b_{g2}\right)\tag{3}$$

$$X_{fr}=\mathrm{LayerNorm}\left(G_2\odot X+(1-G_2)\odot\left(X_{rr}+\mathrm{FF}(X_{rr})\right)\right)\tag{4}$$

where $X\in\mathbb{R}^{n\times d}$ is the feature matrix representing all utterances ($n$ is the number of utterances and $d$ is the feature dimension); $\mathrm{GRU}(\cdot)$ is a bidirectional GRU network used to capture temporal context; $\mathrm{LayerNorm}(\cdot)$ denotes layer normalization; $[\,;\,]$ denotes feature concatenation; $\sigma(\cdot)$ is the sigmoid function, which generates gating weights in the range $[0,1]$; and $G_1,G_2\in\mathbb{R}^{n\times d}$ are dynamic gating matrices that control the fusion ratio between the original features and the encoded features in the residual connections. The gate matrices $G_1$ and $G_2$ adjust their weights automatically according to the input features; for example, when the input is noisy, $G_1$ can reduce the contribution of $\mathrm{GRU}(X)$ so that more of the original feature $X$ is retained. $W_{g1}\in\mathbb{R}^{d\times 2d}$ and $W_{g2}\in\mathbb{R}^{d\times 3d}$ are gating parameter matrices, $b_{g1},b_{g2}\in\mathbb{R}^{d}$ are learnable parameters, $\odot$ denotes element-wise multiplication, and $\mathrm{FF}(\cdot)$ is a feedforward layer composed of two fully connected networks, which can be expressed as:

$$\mathrm{FF}(X_{rr})=\mathrm{dropout}\left(\mathrm{FC}\left(\mathrm{dropout}\left(\alpha\left(\mathrm{FC}(X_{rr})\right)\right)\right)\right)\tag{5}$$

images

Figure 3: The network structure of GUE

$\mathrm{FC}(\cdot)$ and $\mathrm{dropout}(\cdot)$ denote the fully connected layer and the dropout operation, respectively, and $\alpha(\cdot)$ denotes the activation function.

The GUE is applied to all three modalities simultaneously through a shared-parameter mechanism, that is, $X_{fr}^{m}=\mathrm{GUE}(X^{m})$, where $m\in\{T,V,A\}$ and $\mathrm{GUE}(\cdot)$ denotes the gated uni-modal encoder. The gating mechanism adaptively adjusts the feature distribution of each modality so that the utterance-level distributions of the modalities become as close as possible, alleviating the multi-modal heterogeneity gap (a minimal implementation sketch is given below).
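The following is a minimal PyTorch sketch of the GUE defined by Eqs. (1)–(5); the hidden-size handling, the ReLU activation standing in for α(·), and the dropout rate are illustrative assumptions rather than the exact configuration.

```python
# A minimal PyTorch sketch of the Gated Uni-modal Encoder (GUE), Eqs. (1)-(5).
# Hidden-size handling, ReLU activation, and dropout rate are assumptions.
import torch
import torch.nn as nn

class GUE(nn.Module):
    def __init__(self, d, dropout=0.1):
        super().__init__()
        # bidirectional GRU with half-size hidden states so the output stays d-dimensional
        self.gru = nn.GRU(d, d // 2, batch_first=True, bidirectional=True)
        self.gate1 = nn.Linear(2 * d, d)   # W_g1, b_g1 in Eq. (1)
        self.gate2 = nn.Linear(3 * d, d)   # W_g2, b_g2 in Eq. (3)
        self.ff = nn.Sequential(           # FF(.) in Eq. (5)
            nn.Linear(d, 4 * d), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(4 * d, d), nn.Dropout(dropout),
        )
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x):                  # x: (batch, n, d)
        h, _ = self.gru(x)                                               # GRU(X)
        g1 = torch.sigmoid(self.gate1(torch.cat([x, h], dim=-1)))        # Eq. (1)
        x_rr = self.norm1(g1 * x + (1 - g1) * h)                         # Eq. (2)
        f = self.ff(x_rr)                                                # Eq. (5)
        g2 = torch.sigmoid(self.gate2(torch.cat([x, x_rr, f], dim=-1)))  # Eq. (3)
        return self.norm2(g2 * x + (1 - g2) * (x_rr + f))                # Eq. (4)

# Shared across modalities: X_fr^m = gue(X^m) for m in {T, V, A}
gue = GUE(d=256)
x_fr = gue(torch.randn(2, 10, 256))   # two dialogues, ten utterances each
```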

3.5 Gated Attention-Based Cross-Modal Encoder

We design a gated attention-based cross-modal encoder (Gated ACME) to effectively extract and integrate complementary information from multi-modal emotional data. By incorporating a dynamic gating mechanism, this module adaptively calibrates the contribution weights of the visual and audio modalities during cross-modal interaction, enhancing the dominance of text features while selectively fusing relevant complementary cues from the non-text modalities. This design mitigates redundancy and noise and ensures robust, context-aware multi-modal representation learning. As shown in Fig. 4, and following the Transformer structure, the Gated ACME is composed primarily of an attention layer, a feedforward layer, and dynamic gating. Numerous studies in multi-modal emotion recognition have indicated that the amount of emotional information embedded in the visual and audio modalities is generally lower than in the textual modality, limiting the range of emotional expression in these modalities [15,28]. Based on this hypothesis, we treat visual and audio features as complementary information that supplements the emotional expression of text features, and introduce dynamic gating to adjust the complementary strength of the visual and audio modalities to the text. In turn, the textual features of utterances are used to enhance the visual and audio representations. In addition, the GRU in GUE has difficulty attending to the global context of the utterances, so a self-attention layer is employed to capture global contextual emotional cues prior to cross-modal interaction. The Gated ACME consists of the following three stages (a condensed code sketch follows the three stages):


Figure 4: The network structure of gated ACME

•   Improve the global context awareness of utterances. The feature matrices of the three modalities $X^{m}\in\mathbb{R}^{n\times d}\ (m\in\{T,V,A\})$ serve as input to three multi-head attention (MHA) networks. The output $X_{s}^{m}$ is added directly to the input $X^{m}$ (a residual connection) to obtain the feature matrices $X_{sr}^{m}$. This process can be expressed as:

$$X_{s}^{m}=\mathrm{dropout}\left(\mathrm{MHA}(X^{m},X^{m},X^{m})\right)\tag{6}$$

$$X_{sr}^{m}=\mathrm{LayerNorm}\left(X^{m}+X_{s}^{m}\right)\tag{7}$$

where $\mathrm{MHA}(\cdot)$ denotes a multi-head attention network.

•   Cross-modal interaction modeling with dynamic gating, which dynamically adjusts how strongly the visual and audio modalities supplement the text. The above results are fed into four MHA networks in pairs, and the information of each modality is updated. The update for each modality is described separately below.

• For the information update of the text modality, the gating weights are first generated:

$$G^{V}=\sigma\left(W_{g}^{V}[X_{sr}^{T};X_{sr}^{V}]+b_{g}^{V}\right)\tag{8}$$

$$G^{A}=\sigma\left(W_{g}^{A}[X_{sr}^{T};X_{sr}^{A}]+b_{g}^{A}\right)\tag{9}$$

where $W_{g}^{V},W_{g}^{A}\in\mathbb{R}^{2d\times d}$ and $b_{g}^{V},b_{g}^{A}\in\mathbb{R}^{d}$ are learnable parameters. The gating weights are then applied to the cross-modal attention outputs:

$$X_{c}^{TV}=\mathrm{dropout}\left(G^{V}\odot\mathrm{MHA}(X_{sr}^{T},X_{sr}^{V},X_{sr}^{V})\right)\tag{10}$$

$$X_{c}^{TA}=\mathrm{dropout}\left(G^{A}\odot\mathrm{MHA}(X_{sr}^{T},X_{sr}^{A},X_{sr}^{A})\right)\tag{11}$$

where $G^{V},G^{A}\in[0,1]^{n\times d}$ control the contribution ratio of the visual and audio modalities to the text. In the first MHA network, the text feature matrix $X_{sr}^{T}$ serves as the query $Q$, and the visual feature matrix $X_{sr}^{V}$ serves as the key $K$ and value $V$; the output $X_{c}^{TV}$ is a text feature matrix enriched with visual information. In the second MHA network, the query $Q$ comes from $X_{sr}^{T}$ while the key $K$ and value $V$ come from $X_{sr}^{A}$, yielding $X_{c}^{TA}$, a text feature matrix enriched with audio information. Finally, $X_{c}^{TV}$ and $X_{c}^{TA}$ are concatenated to obtain $X_{c}^{T}$, and $X^{T}$, $X_{sr}^{T}$, and $X_{c}^{T}$ are combined through a residual connection to obtain the new text feature matrix $X_{cr}^{T}$:

$$X_{c}^{T}=\alpha\left(\mathrm{FC}\left(\mathrm{CAT}(X_{c}^{TV},X_{c}^{TA})\right)\right)\tag{12}$$

$$X_{cr}^{T}=\mathrm{LayerNorm}\left(X^{T}+X_{sr}^{T}+X_{c}^{T}\right)\tag{13}$$

where $\mathrm{CAT}(\cdot)$ denotes the concatenation operation.

• For the information update of the visual modality, the gating weight is first generated:

$$G^{V}=\sigma\left(W_{g}^{V}[X_{sr}^{V};X_{sr}^{T}]+b_{g}^{V}\right)\tag{14}$$

Then the gating weight is applied to the cross-modal attention output:

$$X_{c}^{VT}=\mathrm{dropout}\left(G^{V}\odot\mathrm{MHA}(X_{sr}^{V},X_{sr}^{T},X_{sr}^{T})\right)\tag{15}$$

where $G^{V}\in[0,1]^{n\times d}$ controls the contribution ratio of text information to the visual features. In the MHA network, the visual feature matrix $X_{sr}^{V}$ serves as the query $Q$, and the text feature matrix $X_{sr}^{T}$ serves as the key $K$ and value $V$, yielding the text-enhanced visual feature matrix $X_{c}^{VT}$. Similar to the text update, a residual connection adds $X^{V}$, $X_{sr}^{V}$, and $X_{c}^{VT}$ to obtain the new visual feature matrix $X_{cr}^{V}$:

$$X_{cr}^{V}=\mathrm{LayerNorm}\left(X^{V}+X_{sr}^{V}+X_{c}^{VT}\right)\tag{16}$$

• The information update of the audio modality is similar to that of the visual modality:

$$G^{A}=\sigma\left(W_{g}^{A}[X_{sr}^{A};X_{sr}^{T}]+b_{g}^{A}\right)\tag{17}$$

$$X_{c}^{AT}=\mathrm{dropout}\left(G^{A}\odot\mathrm{MHA}(X_{sr}^{A},X_{sr}^{T},X_{sr}^{T})\right)\tag{18}$$

$$X_{cr}^{A}=\mathrm{LayerNorm}\left(X^{A}+X_{sr}^{A}+X_{c}^{AT}\right)\tag{19}$$

•   Improve the expressive ability and stability of the model with an enhanced feedforward network. Before the features are passed to the feedforward network, a feature-selection gate is introduced to filter redundant information:

$$G^{F}=\sigma\left(W_{g}^{F}X_{cr}^{m}+b_{g}^{F}\right)\tag{20}$$

where $G^{F}\in[0,1]^{n\times d}$ is the feature-selection gate, which suppresses irrelevant features, and $W_{g}^{F}\in\mathbb{R}^{d\times d}$, $b_{g}^{F}\in\mathbb{R}^{d}$ are learnable parameters. The gated features $G^{F}\odot X_{cr}^{m}$ are then used as the input of the feedforward layer $\mathrm{FF}(\cdot)$ to obtain $X_{f}^{m}$; at the same time, a residual connection combines $X^{m}$, $X_{cr}^{m}$, and $X_{f}^{m}$ to obtain the feature matrix $X_{fr}^{m}$:

$$X_{f}^{m}=\mathrm{FF}\left(G^{F}\odot X_{cr}^{m}\right)\tag{21}$$

$$X_{fr}^{m}=\mathrm{LayerNorm}\left(X^{m}+X_{cr}^{m}+X_{f}^{m}\right)\tag{22}$$

3.6 Gated Enhanced Emotion Classifier

After multiple layers of GUE and Gated ACME encoding, the final feature matrices $H^{T}$, $H^{V}$, and $H^{A}$ are obtained. A modal fusion gate then dynamically weights the contribution of each modality to obtain the fused feature matrix $H$:

$$G_{m}=\sigma\left(W_{g}^{m}[H^{T};H^{V};H^{A}]+b_{g}^{m}\right)\tag{23}$$

$$H_{gate}=G_{m}^{T}\odot H^{T}+G_{m}^{V}\odot H^{V}+G_{m}^{A}\odot H^{A}\tag{24}$$

$$H=\mathrm{CAT}\left(H^{T},H^{V},H^{A},H_{gate}\right)\tag{25}$$

where $G_{m}=[G_{m}^{T},G_{m}^{V},G_{m}^{A}]\in[0,1]^{n\times 3}$ is the modal gating matrix, which controls the fusion weight of each modality; $W_{g}^{m}\in\mathbb{R}^{3d\times 3}$ and $b_{g}^{m}\in\mathbb{R}^{3}$ are learnable parameters; and $H_{gate}\in\mathbb{R}^{n\times d}$ is the gate-weighted fusion feature, which is concatenated with the original modal features to retain complete information. Finally, the gated enhanced emotion classifier maps the feature dimension of $H$ to $|E|$ (the number of emotion classes) to obtain the predicted emotion $e_i\ (e_i\in E)$. The process can be expressed as follows:

$$G_{f}=\sigma\left(W_{g}^{f}h_{i}+b_{g}^{f}\right)\tag{26}$$

$$h_{i}^{filtered}=G_{f}\odot h_{i}\tag{27}$$

$$l_{i}=\mathrm{dropout}\left(\mathrm{ReLU}\left(W_{l}h_{i}^{filtered}\right)\right)\tag{28}$$

$$y_{i}=\mathrm{Softmax}\left(W_{smax}l_{i}\right)\tag{29}$$

$$e_{i}=\arg\max(y_{i})\tag{30}$$

where $G_{f}\in[0,1]^{d}$ is a feature-selection gate that suppresses feature dimensions unrelated to emotion; $W_{g}^{f}\in\mathbb{R}^{d\times d}$ and $b_{g}^{f}\in\mathbb{R}^{d}$ are learnable parameters; $h_{i}\in H$, and $W_{l}$, $W_{smax}$ are learnable parameters; $\mathrm{Softmax}(\cdot)$ denotes the softmax function and $\arg\max(\cdot)$ the argmax function. The classification loss is defined as follows:

$$\mathcal{L}_{c}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{n(i)}y_{ij}\log \hat{y}_{ij}+\eta\left(\lVert W_{g}^{m}\rVert+\lVert W_{g}^{f}\rVert\right)\tag{31}$$

where $\eta$ is the L2 regularization coefficient; $n(i)$ is the number of utterances in the $i$-th dialogue and $N$ is the number of dialogues in the training set; $\hat{y}_{ij}$ is the predicted probability distribution over emotion labels for the $j$-th utterance of the $i$-th dialogue, and $y_{ij}$ is the ground-truth label.
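The following is a minimal PyTorch sketch of the modal fusion gate and the gated enhanced emotion classifier (Eqs. (23)–(30)). In this sketch the feature-selection gate of Eq. (26) is applied to the concatenated 4d-dimensional feature, and the hidden sizes and dropout placement are assumptions.

```python
# Sketch of the modal fusion gate and gated enhanced emotion classifier, Eqs. (23)-(30).
# Applying the feature gate to the concatenated 4d feature and the hidden sizes
# are illustrative assumptions.
import torch
import torch.nn as nn

class GatedEmotionClassifier(nn.Module):
    def __init__(self, d, num_emotions, dropout=0.1):
        super().__init__()
        self.modal_gate = nn.Linear(3 * d, 3)      # W_g^m, b_g^m in Eq. (23)
        self.feat_gate = nn.Linear(4 * d, 4 * d)   # feature-selection gate, Eq. (26)
        self.fc = nn.Linear(4 * d, d)              # W_l in Eq. (28)
        self.out = nn.Linear(d, num_emotions)      # W_smax in Eq. (29)
        self.drop = nn.Dropout(dropout)

    def forward(self, ht, hv, ha):                 # each: (n, d)
        g = torch.sigmoid(self.modal_gate(torch.cat([ht, hv, ha], dim=-1)))  # Eq. (23)
        h_gate = g[:, 0:1] * ht + g[:, 1:2] * hv + g[:, 2:3] * ha            # Eq. (24)
        h = torch.cat([ht, hv, ha, h_gate], dim=-1)                          # Eq. (25)
        h = torch.sigmoid(self.feat_gate(h)) * h                             # Eqs. (26)-(27)
        logits = self.out(self.drop(torch.relu(self.fc(h))))                 # Eqs. (28)-(29)
        return logits    # softmax + argmax over the last dim give y_i and e_i, Eq. (30)
```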

3.7 Gated Enhanced Emotion Transfer Module

To enhance the model's capacity for capturing the dynamic evolution of emotions within dialogues, we introduce a gated enhanced emotion transfer module (Gated EETM). This module explicitly models emotion transition patterns through a transfer gating mechanism, which suppresses spurious emotional associations between unrelated utterances. By emphasizing contextual emotional shifts and reducing reliance on static context assumptions, Gated EETM improves the recognition of emotional changes and reinforces the model's adaptability in conversational scenarios. As shown in Fig. 5, the Gated EETM consists of three main steps: first constructing the emotion transfer probability tensor, then generating the emotion transfer label matrix, and finally using the emotion transfer loss for training (a compact code sketch is given at the end of this subsection).

images

Figure 5: The network structure of gated EETM. Gated EETM consists of three key steps: (1) Constructing the emotion transfer probability tensor: Two parameter-shared Gated ACME modules process the input multi-modal features to generate two representationally distinct yet emotionally consistent feature matrices. (2) Generating the Emotion Transfer Label Matrix: Based on the ground-truth emotion labels, a binary emotion transfer label matrix is constructed. (3) Emotion Transfer Loss for Training: The emotion transfer probability tensor is passed through a fully connected layer to predict transfer states

•   Emotion transfer probability: Inspired by Gao et al. [29], we feed the GUE outputs $X^{m}\in\mathbb{R}^{n\times d}\ (m\in\{T,V,A\})$ into two parameter-shared Gated ACMEs to generate two feature matrices with different representations but consistent emotional semantics:

$$H'=\mathrm{GatedACME}(X^{T},X^{V},X^{A})\tag{32}$$

$$H''=\mathrm{GatedACME}(X^{T},X^{V},X^{A})\tag{33}$$

where $H',H''\in\mathbb{R}^{|U|\times d}$, $|U|$ is the number of utterances in the dialogue, and $d$ is the feature dimension. For each pair of utterances $u_i$ and $u_j$, the emotion transfer possibility weight $G_{trans}(i,j)$ is calculated as:

$$G_{trans}(i,j)=\sigma\left(W_{g}^{trans}\,\mathrm{CAT}(h_{i},h_{j})+b_{g}^{trans}\right)\tag{34}$$

where $h_{i}\in H'$ and $h_{j}\in H''$ are the feature vectors of the $i$-th and $j$-th utterances, respectively, and $W_{g}^{trans}\in\mathbb{R}^{2d\times 1}$, $b_{g}^{trans}\in\mathbb{R}$ are learnable parameters. Finally, the gate weight $G_{trans}(i,j)$ is applied to the concatenated features to generate a sparse emotion transfer probability tensor $\mathcal{T}$ with entries $\mathcal{T}_{ij}$:

$$\mathcal{T}_{ij}=G_{trans}(i,j)\cdot\mathrm{CAT}(h_{i},h_{j}),\qquad \mathcal{T}\in\mathbb{R}^{|U|\times|U|\times 2d}\tag{35}$$

•   Emotion transfer label: Using the ground-truth emotion labels from the dataset, the emotion transfer state between utterances is marked. Specifically, if the true emotions of two utterances are the same, their emotion transfer state is marked as 0, indicating no emotion transfer; conversely, if their true emotions differ, the emotion transfer state is marked as 1, indicating that an emotion transfer has occurred. This yields a $|U|\times|U|$ emotion transfer label matrix.

•   Emotion transfer loss: After constructing the emotion transfer probabilities and labels, a loss function is defined to train the model. The gated enhanced emotion transfer module is trained as a binary auxiliary task that aims to correctly distinguish the emotion transfer state between utterances, which encourages the model to capture emotional dynamics and reduces over-reliance on contextual information. First, to obtain the predicted emotion transfer state $s_{ij}\ (s_{ij}\in\{0,1\})$, a fully connected layer converts the feature dimension of the probability tensor to 2:

$$l_{ij}=\mathrm{dropout}\left(\mathrm{ReLU}\left(W_{l}t_{ij}\right)\right)\tag{36}$$

$$z_{ij}=\mathrm{Softmax}\left(W_{smax}^{s}l_{ij}\right)\tag{37}$$

$$s_{ij}=\arg\max_{k}\,z_{ij}[k]\tag{38}$$

where $t_{ij}\in\mathcal{T}$ is the emotion transfer probability vector between the $i$-th and $j$-th utterances, $z_{ij}$ is the predicted probability distribution of the emotion transfer label between them, and $W_{l}$ and $W_{smax}^{s}$ are learnable parameters. Low-confidence utterance pairs are filtered by the transfer gate $G_{trans}(j,k)$, and the emotion transfer loss is defined as:

$$\mathcal{L}_{s}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{\substack{j,k=1\\ G_{trans}(j,k)>\tau}}^{n(i)} z_{ijk}\log \hat{z}_{ijk}\tag{39}$$

where the gating threshold $\tau$ can be set manually or optimized on the validation set; $n(i)$ is the number of utterances in the $i$-th dialogue and $N$ is the number of dialogues in the training set; $\hat{z}_{ijk}$ is the predicted probability distribution of the emotion transfer label between the $j$-th and $k$-th utterances in the $i$-th dialogue, and $z_{ijk}$ is the ground-truth label.
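As referenced above, here is a compact PyTorch sketch of Gated EETM (Eqs. (34)–(39)): the pairwise transfer gate, the sparse transfer tensor, the binary transfer labels, and the thresholded cross-entropy loss. The layer sizes, dropout, and the default threshold value are assumptions.

```python
# Compact sketch of the Gated Enhanced Emotion Transfer Module, Eqs. (34)-(39).
# Layer sizes, dropout, and the default threshold tau are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedEETM(nn.Module):
    def __init__(self, d, dropout=0.1, tau=0.5):
        super().__init__()
        self.trans_gate = nn.Linear(2 * d, 1)   # W_g^trans, b_g^trans in Eq. (34)
        self.fc = nn.Linear(2 * d, d)           # W_l in Eq. (36)
        self.out = nn.Linear(d, 2)              # W_smax^s in Eq. (37)
        self.drop = nn.Dropout(dropout)
        self.tau = tau                          # gating threshold in Eq. (39)

    def forward(self, h1, h2, labels):          # h1, h2: (U, d); labels: (U,) emotion ids
        U = h1.size(0)
        # pairwise concatenation of the two representations: (U, U, 2d)
        pair = torch.cat([h1.unsqueeze(1).expand(U, U, -1),
                          h2.unsqueeze(0).expand(U, U, -1)], dim=-1)
        g = torch.sigmoid(self.trans_gate(pair))                 # Eq. (34), (U, U, 1)
        t = g * pair                                             # Eq. (35), transfer tensor
        logits = self.out(self.drop(torch.relu(self.fc(t))))     # Eqs. (36)-(37), (U, U, 2)
        # transfer label matrix: 1 where two utterances carry different emotions
        transfer_labels = (labels.unsqueeze(0) != labels.unsqueeze(1)).long()
        # Eq. (39): cross-entropy only over pairs whose gate exceeds the threshold tau
        mask = g.squeeze(-1) > self.tau
        if mask.any():
            return F.cross_entropy(logits[mask], transfer_labels[mask])
        return logits.sum() * 0.0               # no confident pairs: zero loss
```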

3.8 Loss Function

We combine the classification loss $\mathcal{L}_{c}$ and the emotion transfer loss $\mathcal{L}_{s}$, and explicitly include the gating parameter set $W_{g}$ in the regularization term to obtain the final training objective:

$$\mathcal{L}=\mathcal{L}_{c}+\lambda\mathcal{L}_{s}+\eta\left(\lVert W\rVert+\lVert W_{g}\rVert\right)\tag{40}$$

where $W_{g}$ contains the learnable parameters of all gating mechanisms; $\lambda\in[0,1]$ is a trade-off parameter; $\eta$ is the L2 regularization weight; and $W$ is the set of all learnable parameters excluding the gating mechanisms. $\lambda$ can be set manually or adjusted automatically using the method of Cipolla et al. [30].
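A minimal sketch of assembling the joint objective in Eq. (40) follows; selecting gating parameters by a "gate" substring in their names and using a squared L2 penalty are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of the joint objective in Eq. (40): classification loss + lambda * transfer loss
# + eta * (regularization over non-gating and gating parameters).
# The "gate" name filter and the squared-norm penalty are assumptions.
import torch
import torch.nn as nn

def total_loss(loss_cls, loss_shift, model: nn.Module, lam=0.5, eta=1e-5):
    gate_l2 = torch.zeros(())
    other_l2 = torch.zeros(())
    for name, p in model.named_parameters():
        if "gate" in name:                 # W_g: gating-mechanism parameters
            gate_l2 = gate_l2 + p.pow(2).sum()
        else:                              # W: all other learnable parameters
            other_l2 = other_l2 + p.pow(2).sum()
    return loss_cls + lam * loss_shift + eta * (other_l2 + gate_l2)
```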

4  Experiment and Results Analysis

4.1 Datasets

To validate the effectiveness of the proposed method, we conducted experiments on the publicly available multi-modal emotional dialogue datasets MELD [31] and IEMOCAP [32]. A brief description of each dataset is provided below.

•   MELD dataset: This is a multi-modal emotion recognition dataset containing more than 1400 dialogues and more than 13,000 utterances from the TV series 'Friends'. The dataset covers seven emotion categories: anger, disgust, sadness, joy, surprise, fear, and neutral.

•   IEMOCAP dataset: This dataset is a performance-based multi-modal multi-speaker dataset composed of two-person dialogues, including text, visual and audio modalities. The dataset contains 151 dialogues and 7433 utterances, which are labeled with six emotional categories: happy, sad, neutral, angry, excited and frustrated.

The statistical information of MELD and IEMOCAP is shown in Table 1.


4.2 Model Evaluation and Training Details

We use accuracy (ACC) and weighted F1 score (w-F1) as the main evaluation metrics, and report the F1 score for each individual emotion category.
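For concreteness, a minimal sketch of computing these metrics with scikit-learn follows; the label arrays are placeholders and the use of scikit-learn itself is an assumption, not part of the original description.

```python
# Sketch of the evaluation metrics: accuracy, weighted F1, and per-class F1.
# The label arrays are placeholders; scikit-learn is an assumed dependency.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 2, 1, 2, 0]          # ground-truth emotion ids (placeholder)
y_pred = [0, 2, 2, 2, 0]          # predicted emotion ids (placeholder)

acc = accuracy_score(y_true, y_pred)                    # ACC
w_f1 = f1_score(y_true, y_pred, average="weighted")     # w-F1
per_class_f1 = f1_score(y_true, y_pred, average=None)   # F1 per emotion category
print(acc, w_f1, per_class_f1)
```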

All experiments are conducted on an NVIDIA GeForce RTX 3090 GPU with CUDA 11.8 and PyTorch 2.5.0, using AdamW as the optimizer; the number of heads in all MHA networks is set to 8, and the L2 regularization factor is set to 0.00001. The remaining hyperparameters are listed in Table 2, where Batch_size is the batch size, Learning_rate is the learning rate, GUE_drop and ACME_drop are the dropout rates of GUE and Gated ACME, and GUE_layers and ACME_layers are the numbers of network layers of GUE and Gated ACME, respectively.
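A hedged sketch of the optimizer setup described above follows; the placeholder module stands in for the full MDGET-MER network, the learning rate is dataset-specific (Table 2), and AdamW's decoupled weight decay is used here as a simplified stand-in for the explicit L2 term of Eq. (40).

```python
# Sketch of the training setup: AdamW with the stated L2 factor of 0.00001.
# The placeholder module stands in for the full MDGET-MER network; weight_decay is a
# simplification of the explicit regularization term in Eq. (40).
import torch
import torch.nn as nn

model = nn.Linear(1024, 7)   # placeholder for the full MDGET-MER network
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,                 # Learning_rate: dataset-specific, see Table 2
    weight_decay=1e-5,       # L2 regularization factor 0.00001
)
```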


4.3 Comparative Experimental Analysis

4.3.1 Comparison to Baselines on the MELD Dataset

We conduct a comprehensive evaluation of MDGET-MER on the MELD dataset and compare it with nine baseline models. As shown in Table 3, MDGET-MER outperforms all comparison models with an accuracy of 68.51% and a weighted F1 score of 67.23%, which are 0.66 and 0.53 percentage points higher than the best comparison model, CFN-ESA (67.85% Acc/66.70% w-F1). Multi-modal emotion recognition studies have shown that the disgust and anger categories are noisier; for example, on the MELD dataset, [33] significantly improves the classification accuracy of these two categories by introducing a Sample-Weighted Focal Contrastive (SWFC) loss, indirectly confirming their high-noise nature. The advantage of MDGET-MER is particularly prominent in these high-noise emotion categories. For example, the F1 score of MDGET-MER in the 'disgust' category is 33.08%, far above the 26.92% of CFN-ESA and the 2.60% of MMPGCN, which verifies the effective suppression of visual and audio noise by the dynamic gating mechanism. In the 'fear' category, its F1 score (21.00%) is slightly lower than that of EACL (21.62%) but significantly better than that of CoMPM (2.90%), indicating that the model adapts better to low-SNR modalities. In addition, MDGET-MER achieves an F1 score of 81.03% in the neutral category, 0.98 percentage points higher than CFN-ESA, highlighting the ability of the text-led cross-modal interaction design to strengthen the semantic core.


4.3.2 Comparison to Baselines on the IEMOCAP Dataset

Similarly, to fully verify the effectiveness of the proposed method, we also compare MDGET-MER with nine baseline models on the IEMOCAP dataset. As shown in Table 4, MDGET-MER is superior to all comparison models with an accuracy of 71.78% and a weighted F1 score of 71.85%, which are 1.00 and 0.81 percentage points higher than the best comparison model, CFN-ESA (70.78% Acc/71.04% w-F1). Fine-grained emotion analysis further highlights MDGET-MER's ability to model complex emotional states: for example, the F1 score of the 'excited' category reaches 78.40%, 3.58 percentage points higher than that of CFN-ESA (74.82%), reflecting the benefit of cross-modal dynamic alignment for positive emotions. Notably, MDGET-MER achieves an F1 score of 59.81% for 'happy', clearly higher than EACL (51.29%), indicating that the multi-level gating mechanism balances the strength and diversity of modal contributions.


Fig. 6 shows the confusion matrices of MDGET-MER and GA2MIF on the IEMOCAP dataset. MDGET-MER achieves higher correct recognition rates for the happy (64.58% vs. 47.92%), excited (75.25% vs. 70.9%), and sad (82.45% vs. 81.22%) categories, and suppresses cross-category misclassifications such as happy → excited and neutral → happy more effectively (e.g., happy is misclassified as excited in only 20.83% of cases, below the 25.0% of GA2MIF, and neutral is misjudged as happy in only 4.69% of cases, below the 7.03% of GA2MIF). Even though GA2MIF holds a local accuracy advantage in angry recognition (72.35% vs. 69.41%), the misclassifications of MDGET-MER remain more concentrated (e.g., angry is misjudged as frustrated in only 21.18% of cases, below the 25.88% of GA2MIF). Overall, MDGET-MER distinguishes positive emotions (happy, excited) from negative emotions (sad) at a finer granularity with less cross-category confusion, and is more stable and accurate in emotion recognition.


Figure 6: Comparison of confusion matrices between MDGET-MER and GA2MIF. hap, sad, neu, ang, exc, fru denote happy, sad, neutral, angry, excited, and frustrated, respectively: (a) Confusion matrix of MDGET-MER; (b) Confusion matrix of GA2MIF

4.4 Ablation Experimental Analysis

4.4.1 The Importance of Multi-Modality

To verify the importance of the text, visual, and audio modalities for MDGET-MER, we conduct ablation experiments on the MELD and IEMOCAP datasets, as shown in Table 5. The results show that the text modality plays the dominant role in multi-modal emotion recognition. Under the single-modality setting, the accuracy on MELD and IEMOCAP is 67.75% and 67.56%, respectively, and the weighted F1 score is 66.47% and 67.38%, significantly better than vision (MELD Acc: 48.71%; IEMOCAP Acc: 46.01%) and audio (MELD Acc: 50.01%; IEMOCAP Acc: 54.45%), verifying that text features contain the richest emotional semantic information. For example, on the MELD dataset the weighted F1 score of the pure visual modality is only 32.71%, far below the 66.47% of the text modality, indicating that facial expressions and body movements are susceptible to environmental noise.


In the dual-modal experiments, the text + audio combination performs best: the accuracy and weighted F1 score on IEMOCAP increase to 69.67% and 69.27%, respectively, gains of 2.11 and 1.89 percentage points over the pure text setting, indicating that audio features such as speech tone can effectively supplement the implicit emotional cues of the text. The accuracy of the text + visual combination on MELD (67.79%) is close to that of the pure text model (67.75%), suggesting that the contribution of the visual modality in short dialogue scenes is limited. Notably, the accuracy of the visual + audio combination (MELD: 51.00%; IEMOCAP: 61.36%) is significantly lower than the other dual-modal settings, confirming that the model has difficulty capturing cross-modal emotional associations without the dominant text modality.

When the text + visual + audio tri-modal fusion is used, model performance peaks: the accuracy on MELD and IEMOCAP is 68.51% and 71.78%, respectively, 0.69 and 2.11 percentage points higher than the best bi-modal combination (text + audio), and the weighted F1 score increases by 0.76 and 2.58 percentage points. This result verifies the multi-modal synergy effect: for example, in the 'excited' category of IEMOCAP, the visual and audio modalities adaptively strengthen their supplement to the text through the dynamic gating mechanism, raising the F1 score 4.3% above the dual-modal setting. The experiments show that MDGET-MER maximizes the utilization of complementary information while suppressing noise through its text-driven cross-modal interaction design. Especially in complex scenes where the visual or audio modality provides key clues (such as facial expressions or speech urgency in the 'anger' category), tri-modal fusion is significantly better than single-modal or dual-modal settings, providing a robust and efficient solution for dialogue emotion recognition.

4.4.2 The Importance of Each Module

In order to verify the synergy of each module in MDGET-MER, we analyze the contribution of its core components through four ablation experiments, as shown in Table 6.


Firstly, after removing the Gated Uni-modal Encoder (GUE), the accuracy on MELD and IEMOCAP decreases to 68.17% and 71.43%, respectively (0.34 and 0.35 percentage points lower than the complete model), and the weighted F1 score decreases by 0.33 and 0.79 percentage points. This shows that GUE's suppression of single-modal noise through dual dynamic gating is crucial to model robustness, especially given the inherent noise of the visual and audio modalities, verifying its key role in purifying the original features.

Secondly, after removing the Gated Attention-based Cross-Modal Encoder (Gated ACME), model performance degrades significantly: the accuracy on IEMOCAP drops to 68.97% (2.81 percentage points lower than the complete model), and the weighted F1 score decreases by 2.96 percentage points. These results show that the cross-modal dynamic interaction design is necessary to accurately capture the complementary information of the auxiliary modalities.

Then, when the Gated Enhanced Emotion Transfer Module (Gated EETM) is removed, the accuracy on MELD decreases slightly to 68.28% and the weighted F1 score decreases by 0.30 percentage points; on IEMOCAP, the accuracy decreases by 0.77 percentage points. Without explicit modeling of emotion transfer patterns, the model's ability to mitigate context-dependence bias is weakened, indicating that this module is more critical for modeling abrupt emotion changes in long dialogue scenes.

Finally, when the hierarchical gating mechanism is completely removed, model performance suffers the greatest loss: the weighted F1 scores on MELD and IEMOCAP fall to 65.86% and 69.49%, respectively (1.37 and 2.36 percentage points lower than the complete model). The results show that the dynamic gating mechanisms systematically optimize overall performance through their synergy, verifying the benefit of the gating design for complex emotional expression.

4.4.3 The Importance of Gating

To analyze the specific contribution of each gating stage in the dynamic gating mechanism, we further design ablation experiments that systematically remove the gating units of each key module (as shown in Table 6) and evaluate their impact on overall performance.

Firstly, after removing the gating of GUE, model performance on MELD and IEMOCAP decreases slightly (MELD Acc: 68.25%, w-F1: 67.02%; IEMOCAP Acc: 71.54%, w-F1: 71.17%). This shows that although the dual dynamic gating in GUE is not the largest driver of performance, its role in noise filtering and feature enhancement makes a fundamental contribution to model robustness; after removal, the model becomes slightly more sensitive to single-modal noise.

Secondly, removing the gating of Gated ACME damages model performance significantly, especially on the IEMOCAP dataset (Acc: 69.51%, w-F1: 69.36%, i.e., 2.27 and 2.49 percentage points lower than the complete model). This strongly verifies that the dynamic gating in Gated ACME is a key pillar of model performance. This gating is responsible for dynamically calibrating the complementary strength of the visual and audio modalities to the text features and for feature selection; its removal prevents the model from effectively suppressing cross-modal redundant information and adaptively adjusting modal contributions, seriously weakening cross-modal dynamic alignment and the utilization of complementary information.

Then, after removing the gating of Gated EETM, performance on MELD decreases the least (Acc: 68.34%, w-F1: 67.01%), while on IEMOCAP it decreases significantly (Acc: 71.16%, w-F1: 71.19%). Similarly, after removing the emotion transfer loss, performance on MELD decreases the least (Acc: 68.30%, w-F1: 67.01%), while on IEMOCAP it decreases significantly (Acc: 71.06%, w-F1: 71.08%). This difference indicates that the gating of Gated EETM plays a more significant role in modeling complex emotional evolution dependencies and inhibiting false associations in long dialogues (such as IEMOCAP). Removing this gating weakens the model's ability to focus on key cues in abrupt emotion shift scenarios.

Finally, when all gating mechanisms are removed, model performance suffers the greatest loss (MELD Acc: 67.36%, w-F1: 65.86%; IEMOCAP Acc: 69.38%, w-F1: 69.49%), much lower than when any single-stage gating is removed. This clearly shows that the gating mechanisms in GUE, Gated ACME, and Gated EETM do not operate in isolation; through their synergy they jointly optimize the model's noise robustness, cross-modal dynamic alignment, and emotional evolution modeling, achieving a systematic improvement in overall performance.

4.5 Time and Memory Overhead

In Table 7, we present the running time and memory usage of our proposed model alongside several baseline methods for the MER task on both datasets. Compared to other models, M3Net and CFN-ESA demonstrate significantly lower time consumption on the MELD dataset, while CFN-ESA also achieves competitive time efficiency on IEMOCAP. MM-DFN and SACL-LSTM exhibit considerably higher running times across both datasets. In terms of memory usage, CFN-ESA consumes the least memory on MELD, while our model shows the most efficient memory utilization on IEMOCAP. M3Net and SACL-LSTM require substantially more memory across the datasets. Overall, our proposed MDGET-MER not only achieves strong performance metrics but also maintains a favorable balance between computational time and memory efficiency.


The term "Time" refers to the duration (in seconds) required for training and testing in each epoch, and "Memory" refers to the consumption (MB) of allocated and reserved memory.

4.6 Case Study

This section presents a case of emotion transfer. Fig. 7 shows a dialogue scenario in the IEMOCAP dataset. When the speaker continuously utters several sentences whose true emotion is neutral, existing models (such as M3Net) tend to misjudge the subsequent utterance as neutral. This is due to their over-reliance on historical context, which prevents them from effectively capturing the cross-modal complementary information of the current utterance and thus from modeling the instantaneous dynamic evolution of emotions.


Figure 7: A conversational case in the IEMOCAP dataset

In contrast, MDGET-MER collaboratively optimizes the fusion of contextual dependencies and current multi-modal cues through the gated enhanced emotion transfer module. Gated EETM explicitly captures emotion transfer information, so that in the face of emotional changes the model strengthens the current cues rather than over-relying on the historical context, and thereby accurately identifies the true emotion of the subsequent utterance in this case, surprise.

5  Conclusions

To address the problems of emotional semantic alignment caused by modal heterogeneity, noise interference, and the complexity of emotional evolution in dialogue scenes, we propose a multi-level dynamic gating and emotion transfer framework for multi-modal emotion recognition. The dynamic gating mechanism runs through the whole process of uni-modal encoding, cross-modal alignment, and emotion transfer modeling. Firstly, the Gated Uni-modal Encoder (GUE) suppresses single-modal noise and narrows the heterogeneity gap, enhancing the discriminability of the original features. Secondly, a Gated Attention-based Cross-Modal Encoder (Gated ACME) dynamically calibrates the complementary strength of the visual and audio modalities to the dominant text features and filters redundant information through feature-selection gating. Finally, a Gated Enhanced Emotion Transfer Module (Gated EETM) explicitly models the evolution of emotions with contrastive learning, alleviating the conflict between context dependence and abrupt emotional change. Experiments show that the accuracy and weighted F1 score of MDGET-MER on the MELD and IEMOCAP datasets are better than those of the existing comparison models. In the future, we will explore a deeper integration of multi-modal emotion recognition and dynamic gating, using real-time perception of emotional state to optimize context understanding and emotional response in multi-modal empathetic dialogue generation, so as to further improve the emotional consistency and empathy depth of generated dialogue.

Acknowledgement: This research was supported by the Fanying Special Program of the National Natural Science Foundation of China, the Scientific research project of Jiangxi Provincial Department of Education, the Doctoral startup fund of Jiangxi University of Technology.

Funding Statement: This research was funded by “the Fanying Special Program of the National Natural Science Foundation of China, grant number 62341307”, “the Scientific research project of Jiangxi Provincial Department of Education, grant number GJJ200839” and “the Doctoral startup fund of Jiangxi University of Technology, grant number 205200100402”.

Author Contributions: Conceptualization, Musheng Chen and Xiaohong Qiu; methodology, Musheng Chen and Qiang Wen; software, Qiang Wen and Wenqing Fu; validation, Junhua Wu and Wenqing Fu; formal analysis, Musheng Chen and Qiang Wen; investigation, Junhua Wu; resources, Qiang Wen; data curation, Qiang Wen; writing—original draft preparation, Qiang Wen and Musheng Chen; writing—review and editing, Musheng Chen and Xiaohong Qiu; visualization, Qiang Wen; supervision, Junhua Wu; project administration, Musheng Chen; funding acquisition, Musheng Chen, Junhua Wu and Xiaohong Qiu. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: The data presented in this study are openly available in (MELD) at (1810.02508v6) and (IEMOCAP) at (1804.05788v3) reference numbers [31,32].

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.

Notations

The following notations are used in this manuscript:
Notation Meaning
$U=\{u_1,\dots,u_{|U|}\}$ A dialogue containing $|U|$ utterances
$u_i=\{u_i^T,u_i^V,u_i^A\}$ $u_i$ is composed of text (T), visual (V) and audio (A) modalities
$e_i$ The emotion label
$x_i$ The feature representation of $u_i$
$X\in\mathbb{R}^{n\times d}$ The feature matrix; $n$ is the number of utterances, $d$ is the feature dimension
$[\,;\,]$ The feature concatenation operation
$\sigma(\cdot)$ The sigmoid function
$G_1,G_2\in\mathbb{R}^{n\times d}$ The dynamic gating matrices
$W_{g1}\in\mathbb{R}^{d\times 2d},\ W_{g2}\in\mathbb{R}^{d\times 3d}$ The gating parameter matrices
$b_{g1},b_{g2}\in\mathbb{R}^{d}$ Learnable parameters
$\odot$ Element-wise multiplication
$\alpha(\cdot)$ The activation function
$H^{T},H^{V},H^{A}$ The final feature matrices
$H_{gate}\in\mathbb{R}^{n\times d}$ The gate-weighted fusion feature
$|E|$ The number of emotions
$e_i\ (e_i\in E)$ The predicted emotion
$\eta$ The L2 regularization coefficient
$n(i)$ The number of utterances in the $i$-th dialogue
$N$ The number of dialogues in the training set
$\hat{y}_{ij}$ The predicted emotion-label probability distribution of the $j$-th utterance in the $i$-th dialogue
$y_{ij}$ The ground-truth label
$G_{trans}(i,j)$ The emotion transfer possibility weight
$h_i\in H',\ h_j\in H''$ The feature vectors of the $i$-th and $j$-th utterances
$\tau$ The gating threshold
$\hat{z}$ The predicted probability distribution of the emotion transfer label
$\mathcal{L}_{c}$ The classification loss
$\mathcal{L}_{s}$ The emotion transfer loss
$W_{g}$ The learnable parameters of all gating mechanisms
$\lambda$ A trade-off parameter in the range $[0,1]$
$W$ The set of all learnable parameters except the gating mechanisms

References

1. Jiao W, Lyu M, King I. Real-time emotion recognition via attention gated hierarchical memory network. Proc AAAI Conf Artif Intell. 2020;34(5):8002–9. doi:10.1609/aaai.v34i05.6309. [Google Scholar] [CrossRef]

2. Nie W, Chang R, Ren M, Su Y, Liu A. I-GCN: incremental graph convolution network for conversation emotion detection. IEEE Trans Multimed. 2022;24:4471–81. doi:10.1109/TMM.2021.3118881. [Google Scholar] [CrossRef]

3. Li S, Yan H, Qiu X. Contrast and generation make BART a good dialogue emotion recognizer. Proc AAAI Conf Artif Intell. 2022;36(10):11002–10. doi:10.1609/aaai.v36i10.21348. [Google Scholar] [CrossRef]

4. Shen W, Chen J, Quan X, Xie Z. DialogXL: all-in-one XLNet for multi-party conversation emotion recognition. Proc AAAI Conf Artif Intell. 2021;35(15):13789–97. doi:10.1609/aaai.v35i15.17625. [Google Scholar] [CrossRef]

5. Latif S, Rana R, Khalifa S, Jurdak R, Schuller BW. Multitask learning from augmented auxiliary data for improving speech emotion recognition. IEEE Trans Affect Comput. 2023;14(4):3164–76. doi:10.1109/TAFFC.2022.3221749. [Google Scholar] [CrossRef]

6. Zhou Y, Liang X, Gu Y, Yin Y, Yao L. Multi-classifier interactive learning for ambiguous speech emotion recognition. IEEE/ACM Trans Audio Speech Lang Process. 2022;30:695–705. doi:10.1109/TASLP.2022.3145287. [Google Scholar] [CrossRef]

7. Majumder N, Poria S, Hazarika D, Mihalcea R, Gelbukh A, Cambria E. DialogueRNN: an attentive RNN for emotion detection in conversations. Proc AAAI Conf Artif Intell. 2019;33(1):6818–25. doi:10.1609/aaai.v33i01.33016818. [Google Scholar] [CrossRef]

8. Shi J, Li M, Bai L, Cao F, Lu K, Liang J. MATCH: modality-calibrated hypergraph fusion network for conversational emotion recognition. In: Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence; 2025 Aug 16–22; Montreal, QC, Canada. p. 6164–72. doi:10.24963/ijcai.2025/686. [Google Scholar] [CrossRef]

9. Zhan Y, Yu J, Yu Z, Zhang R, Tao D, Tian Q. Comprehensive distance-preserving autoencoders for cross-modal retrieval. In: Proceedings of the 26th ACM International Conference on Multimedia; 2018 Oct 22–26; Seoul, Republic of Korea. p. 1137–45. doi:10.1145/3240508.3240607. [Google Scholar] [CrossRef]

10. Ghosal D, Majumder N, Gelbukh A, Mihalcea R, Poria S. COSMIC: commonsense knowledge for emotion identification in conversations. In: Findings of the Association for Computational Linguistics: EMNLP 2020; 2020 Nov 16–20; Online. p. 2470–81. doi:10.18653/v1/2020.findings-emnlp.224. [Google Scholar] [CrossRef]

11. Shen W, Wu S, Yang Y, Quan X. Directed acyclic graph network for conversational emotion recognition. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); 2021 Aug 1–6; Online. p. 1551–60. doi:10.18653/v1/2021.acl-long.123. [Google Scholar] [CrossRef]

12. Poria S, Cambria E, Hazarika D, Majumder N, Zadeh A, Morency LP. Context-dependent sentiment analysis in user-generated videos. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2017 Jul 30–Aug 4; Vancouver, BC, Canada. p. 873–83. doi:10.18653/v1/p17-1081. [Google Scholar] [CrossRef]

13. Chen F, Shao J, Zhu S, Shen HT. Multivariate, multi-frequency and multimodal: rethinking graph neural networks for emotion recognition in conversation. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023 Jun 17–24; Vancouver, BC, Canada. p. 10761–70. doi:10.1109/CVPR52729.2023.01036. [Google Scholar] [CrossRef]

14. Meng T, Shou Y, Ai W, Du J, Liu H, Li K. A multi-message passing framework based on heterogeneous graphs in conversational emotion recognition. Neurocomputing. 2024;569(1):127109. doi:10.1016/j.neucom.2023.127109. [Google Scholar] [CrossRef]

15. Mao Y, Liu G, Wang X, Gao W, Li X. DialogueTRM: exploring multi-modal emotional dynamics in a conversation. In: Findings of the Association for Computational Linguistics: EMNLP 2021; 2021 Nov 7–11; Punta Cana, Dominican Republic. p. 2694–04. doi:10.18653/v1/2021.findings-emnlp.229. [Google Scholar] [CrossRef]

16. Hu D, Bao Y, Wei L, Zhou W, Hu S. Supervised adversarial contrastive learning for emotion recognition in conversations. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2023 Jul 9–14; Toronto, ON, Canada. p. 10835–52. doi:10.18653/v1/2023.acl-long.606. [Google Scholar] [CrossRef]

17. Fu Y, Yan X, Chen W, Zhang J. Feature-enhanced multimodal interaction model for emotion recognition in conversation. Knowl Based Syst. 2025;309:112876. doi:10.1016/j.knosys.2024.112876. [Google Scholar] [CrossRef]

18. Ai W, Shou Y, Meng T, Li K. DER-GCN: dialog and event relation-aware graph convolutional neural network for multimodal dialog emotion recognition. IEEE Trans Neural Netw Learn Syst. 2025;36(3):4908–21. doi:10.1109/TNNLS.2024.3367940. [Google Scholar] [PubMed] [CrossRef]

19. Li B, Fei H, Liao L, Zhao Y. Revisiting disentanglement and fusion on modality and context in conversational multimodal emotion recognition. In: Proceedings of the 31st ACM International Conference on Multimedia; 2023 Oct 29–Nov 3; Ottawa, ON, Canada. p. 5923–34. doi:10.1145/3581783.3612053. [Google Scholar] [CrossRef]

20. Zeng Y, Mai S, Hu H. Which is making the contribution: modulating unimodal and cross-modal dynamics for multimodal sentiment analysis. In: Findings of the Association for Computational Linguistics: EMNLP 2021; 2021 Nov 7–11; Punta Cana, Dominican Republic. p. 1262–74. doi:10.18653/v1/2021.findings-emnlp.109. [Google Scholar] [CrossRef]

21. Yu F, Guo J, Wu Z, Dai X. Emotion-anchored contrastive learning framework for emotion recognition in conversation. In: Findings of the Association for Computational Linguistics: NAACL 2024; 2024 Jun 16–21; Mexico City, Mexico. p. 4521–34. doi:10.18653/v1/2024.findings-naacl.282. [Google Scholar] [CrossRef]

22. Li J, Wang X, Liu Y, Zeng Z. CFN-ESA: a cross-modal fusion network with emotion-shift awareness for dialogue emotion recognition. IEEE Trans Affect Comput. 2024;15(4):1919–33. doi:10.1109/TAFFC.2024.3389453. [Google Scholar] [CrossRef]

23. Bansal K, Agarwal H, Joshi A, Modi A. Shapes of emotions: multimodal emotion recognition in conversations via emotion shifts. In: Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models; 2022 Oct 12–17; Virtual. p. 44–56. [Google Scholar]

24. Tu G, Wang J, Li Z, Chen S, Liang B, Zeng X, et al. Multiple knowledge-enhanced interactive graph network for multimodal conversational emotion recognition. In: Findings of the Association for Computational Linguistics: EMNLP 2024; 2024 Nov 12–16; Miami, FL, USA. p. 3861–74. doi:10.18653/v1/2024.findings-emnlp.222. [Google Scholar] [CrossRef]

25. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692. 2019. [Google Scholar]

26. Eyben F, Weninger F, Gross F, Schuller B. Recent developments in opensmile, the munich open-source multimedia feature extractor. In: Proceedings of the 21st ACM International Conference on Multimedia; 2013 Oct 21–25; Barcelona, Spain. p. 835–8. doi:10.1145/2502081.2502224. [Google Scholar] [CrossRef]

27. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21–26; Honolulu, HI, USA. p. 2261–9. doi:10.1109/CVPR.2017.243. [Google Scholar] [CrossRef]

28. Hu D, Hou X, Wei L, Jiang L, Mo Y. MM-DFN: multimodal dynamic fusion network for emotion recognition in conversations. In: ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2022 May 23–27; Singapore. p. 7037–41. doi:10.1109/ICASSP43922.2022.9747397. [Google Scholar] [CrossRef]

29. Gao T, Yao X, Chen D. SimCSE: simple contrastive learning of sentence embeddings. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; 2021 Nov 7–11; Punta Cana, Dominican Republic. p. 6894–910. doi:10.18653/v1/2021.emnlp-main.552. [Google Scholar] [CrossRef]

30. Cipolla R, Gal Y, Kendall A. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018 Jun 18–23; Salt Lake City, UT, USA. p. 7482–91. doi:10.1109/CVPR.2018.00781. [Google Scholar] [CrossRef]

31. Poria S, Hazarika D, Majumder N, Naik G, Cambria E, Mihalcea R. MELD: a multimodal multi-party dataset for emotion recognition in conversations. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; 2019 Jul 28–Aug 2; Florence, Italy. p. 527–36. doi:10.18653/v1/p19-1050. [Google Scholar] [CrossRef]

32. Busso C, Bulut M, Lee CC, Kazemzadeh A, Mower E, Kim S, et al. IEMOCAP: interactive emotional dyadic motion capture database. Lang Resour Eval. 2008;42(4):335–59. doi:10.1007/s10579-008-9076-6. [Google Scholar] [CrossRef]

33. Shi T, Huang SL. MultiEMO: an attention-based correlation-aware multimodal fusion framework for emotion recognition in conversations. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2023 Jul 9–14; Toronto, ON, Canada. p. 14752–66. doi:10.18653/v1/2023.acl-long.824. [Google Scholar] [CrossRef]

34. Li J, Wang X, Lv G, Zeng Z. GA2MIF: graph and attention based two-stage multi-source information fusion for conversational emotion detection. IEEE Trans Affect Comput. 2024;15(1):130–43. doi:10.1109/TAFFC.2023.3261279. [Google Scholar] [CrossRef]

35. Lee J, Lee W. CoMPM: context modeling with speaker’s pre-trained memory tracking for emotion recognition in conversation. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2022 Jul 10–15; Seattle, WA, USA. p. 5669–79. doi:10.18653/v1/2022.naacl-main.416. [Google Scholar] [CrossRef]




Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.