Open Access
ARTICLE
Real-Time Emotion Recognition System Using Adaptive Distillation Technique
1 College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates
2 College of Computer Vision, Mohamed Bin Zayed University of AI, Abu Dhabi, United Arab Emirates
3 Interaction Technology Laboratory, Sejong University, Seoul, Republic of Korea
* Corresponding Author: Soonil Kwon. Email:
Computer Modeling in Engineering & Sciences 2026, 147(1), 34 https://doi.org/10.32604/cmes.2026.079697
Received 26 January 2026; Accepted 01 April 2026; Issue published 27 April 2026
Abstract
Knowledge distillation has shown impressive results in fields such as detection, recognition, and generation. Large models excel at tasks such as speech recognition, but they must be compressed before deployment, which motivates adaptive knowledge distillation (AKD). AKD can improve human-computer interaction and streamline data collection in Speech Emotion Recognition (SER). This study presents an approach that employs a novel adaptive knowledge distillation framework with spatio-temporal transformers to acquire high-level semantic features from the input signal. The method uses an instance-by-instance correlation between the teacher and the student to determine the teacher's importance. In addition, this work proposes a knowledge-transfer strategy that integrates soft targets between teacher and student, aiming to provide deeper insight for the final prediction. The resulting lightweight AKD model is an efficient solution for edge devices and learns synergistic information for the respective tasks, as discussed in the results and analysis section. Our proposed AKD model outperforms state-of-the-art SER systems on the benchmark IEMOCAP, EmoDB, and RAVDESS datasets, with an absolute gain of 4%–6% in overall recognition rate.
Keywords
Emotion recognition involves detecting the intention, feelings, and attitude of speakers and is an indispensable part of human communication. Speech emotion recognition (SER) has recently become a focus of research due to its applications in areas such as human-computer interaction, mental health diagnosis, customer service, intelligent call centers, and online learning [1]. A growing number of deep learning algorithms have been developed to address the SER problem by discovering patterns and relationships in speech data, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) networks [2].
With modern machine learning algorithms and the prevalence of smartphones [3], edge devices such as smartphones and IoT (Internet of Things) devices can use built-in sensors, including cameras, microphones, and heartbeat sensors, to identify user emotions [4]. Algorithms are then trained to recognize and categorize emotions from facial expressions, voice patterns, and physiological responses, making them suitable for improving human-computer interaction, mental health diagnosis, and tailored marketing, among other applications [5]. Besides the difficulty of collecting emotion data, emotions are also difficult to categorize, as they are multi-faceted phenomena subject to cultural and individual differences. Since different cultures and individuals express and perceive emotions differently, a universal standard for categorizing them is hard to establish [6].
By contrast, conventional KD methods improve the model's ability to capture higher-level semantic cues from speech for the target classes [7], but they do not explicitly exploit non-target class information in that phase. For example, the KD loss functions in [8] enable smaller subnetworks to learn from large "teacher" networks while maintaining similar performance. In these approaches, the authors employed KD in teacher-student networks during training and did not use any additional mechanisms at inference. The KD method has also been proposed as a performance enhancer in other domains, such as computer vision [9], natural language processing [10], and recommendation [11].
Furthermore, action localization [12] and emotion and identity recognition [13] in videos have also been addressed using transformer-based architectures. To address the limitations of these methods, this paper proposes the Adaptive Knowledge Distillation (AKD) framework (see Fig. 1), which learns from multi-level knowledge distillation across three levels of teacher knowledge: high-level, intermediate-level, and soft-target knowledge. We also use weighted soft targets and a group-hint approach to transfer information from the teacher's last layers to the intermediate layers of the student. Building on the principles of adaptive knowledge distillation, we propose a simple and effective framework that increases the performance and robustness of SER systems by distilling and adapting knowledge from teacher models.

Figure 1: Overview of the designed model for emotion recognition with adaptive knowledge distillation using speech signal.
Inspired by the success of recursive attention architectures in other tasks [14], we include this strategy in our system to further improve the accuracy and generalization of the model. For improved adaptability, we adopt a spatiotemporal transformer in the student network and a hierarchical context-based transformer in the teacher models. This is accomplished through fused multi-head attention mechanisms, transferring knowledge via soft labels, aligning teacher-student logits, and enhancing feature discriminability. The complete framework and our methodologies promise a significant leap forward in the development of advanced and lightweight SER systems that can be easily deployed on edge devices.
The main contributions of the paper are summarized as follows:
• We propose a lightweight affective model for edge devices built on a novel knowledge distillation approach that incorporates an adaptive learning strategy. In addition, we employ instance-level teacher importance weights to facilitate the transfer of intermediate-level knowledge to the student.
• We introduce a novel method for enhancing emotion recognition through adaptive knowledge distillation, leveraging 'dark knowledge' to mitigate misclassification of emotions and to raise recognition rates beyond current state-of-the-art techniques (see Section 4). To the best of our knowledge, this is the first use of adaptive knowledge distillation in speech emotion recognition for edge devices.
• Our model leverages adaptive knowledge from the teacher network to guide the student through a single-level output built upon a lightweight spatio-temporal transformer architecture. The distilled student model demonstrates strong performance across three benchmark datasets, IEMOCAP, EmoDB, and RAVDESS, achieving a 4%–6% improvement in recognition rate on these datasets.
The rest of this paper unfolds as follows. Related work is covered in Section 2. The proposed AKD for SER methodology is detailed in Section 3. Experimentation and ablation studies are presented in Section 4. Finally, conclusions and avenues for future research are outlined in Section 5.
Foundation Models: Foundation models are expected to provide a good initialization point for downstream tasks. A new branch of research focuses on adapting huge models pre-trained through self-supervised learning on large, unlabelled datasets. These methods have shown strong results on speech tasks [15–17]. Wav2vec 2.0 [15], for instance, applies product quantization together with a contrastive loss learned during pre-training, which helps the Transformer encoder identify the correct quantized representation among distractors.
Another successful model is HuBERT [16], which uses k-means clustering to group representation vectors into pseudo-class labels for the training dataset. During pre-training, the model predicts the class labels of both masked and unmasked tokens. Features from one Transformer layer are then extracted, and a second round of k-means clustering refines the clusters. WavLM builds on HuBERT's pseudo-labeling during pre-training but uses a broader pre-training dataset to improve generalization to new tasks [17]. In addition, WavLM's pre-training tasks include speech denoising, in which the model is trained to remain effective in the presence of noise and overlapping speech. These tasks help to build better representations and more scalable models that transfer beyond automatic speech recognition. As a result, foundation models have emerged as a new paradigm in speech processing, achieving state-of-the-art performance.
Knowledge Distillation: Knowledge distillation (KD) is the task of training a smaller, more compact model, called the student, to mimic the behavior of a larger model, called the teacher. The teacher is typically more powerful but also computationally expensive to evaluate. The teacher's knowledge transfers well when the output is a probability distribution (for instance, the probability of words in a language task); such distributions are usually matched using KL divergence, while L1 or L2 losses are often used to match the teacher's internal representations or feature maps to the student's.
Various methods have been proposed [18] to enable efficient learning while transferring knowledge, for example by considering only the single most likely word sequence rather than the full distribution over sequences. Since only the most likely path is needed, KD can be performed with a simplified loss function that is computationally faster, at the cost of losing some information from the full sequence. However, the output distribution is not the only route for transferring knowledge, and the authors of [19] took a different approach. They use features extracted from a layer of the teacher model's encoder to ease training. These latent features have a fixed size, in contrast to the variable-length input audio, which speeds up computation because they do not need to be recomputed from scratch. To further reduce the amount of information transferred and avoid bottlenecks, multi-codebook vector quantization compresses the teacher's features from 32-bit floating-point values to 8-bit integer indices, which the student model learns to predict during training. This is more efficient but may lead to a small drop in performance compared with plain L1 and L2 losses. Reference [20] similarly distills knowledge from an offline (non-streaming) teacher model into a streaming student model.
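To make the quantization idea from [19] concrete, the sketch below shows one simplified way such a scheme could look in PyTorch; the uniform 8-bit quantizer and the prediction head over 256 levels are illustrative assumptions, not the cited multi-codebook method.

```python
import torch
import torch.nn.functional as F

def quantize_teacher_features(feats: torch.Tensor, n_levels: int = 256) -> torch.Tensor:
    """Uniformly quantize teacher features to 8-bit integer indices (illustrative)."""
    lo, hi = feats.min(), feats.max()
    scale = (hi - lo) / (n_levels - 1)
    idx = torch.clamp(((feats - lo) / scale).round(), 0, n_levels - 1).long()
    return idx  # shape: (batch, time, dim), values in [0, 255]

def quantization_prediction_loss(student_logits: torch.Tensor, teacher_idx: torch.Tensor) -> torch.Tensor:
    """Student predicts the teacher's quantized indices with a cross-entropy loss.

    student_logits: (batch, time, dim, n_levels) scores over the quantization levels.
    teacher_idx:    (batch, time, dim) target indices from the teacher.
    """
    return F.cross_entropy(student_logits.flatten(0, 2), teacher_idx.flatten())
```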
This section describes the training framework used in the proposed Adaptive Knowledge Distillation (AKD) model. The system consists of a pretrained wav2vec 2.0 teacher network and a lightweight student network. During training, the teacher model remains fixed while the student model parameters are optimized; the student learns from both the ground-truth emotion labels and the soft targets produced by the teacher. The overall training objective combines the standard cross-entropy loss with the adaptive distillation loss. For each input segment, the teacher produces soft targets that guide the student network, and the student model is trained with the Adam optimizer until convergence.
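For concreteness, a minimal sketch of this training loop is shown below; the `teacher`, `student`, `loader`, and `akd_loss` objects, as well as the learning rate, are placeholders rather than the exact implementation, and the combined loss itself is defined in the following equations.

```python
import torch

def train_akd(student, teacher, loader, akd_loss, epochs=100, lr=1e-4, device="cuda"):
    """Minimal AKD training loop: the teacher stays frozen, the student is optimized with Adam."""
    teacher.to(device).eval()                 # teacher parameters remain fixed
    for p in teacher.parameters():
        p.requires_grad_(False)

    student.to(device).train()
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)

    for _ in range(epochs):
        for wave, label in loader:            # wave: speech segments, label: emotion IDs
            wave, label = wave.to(device), label.to(device)
            with torch.no_grad():
                teacher_logits = teacher(wave)   # soft targets come from the fixed teacher
            student_logits = student(wave)
            loss = akd_loss(student_logits, teacher_logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```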
Let $x$ denote an input speech segment and $y$ its corresponding emotion label. The teacher network produces logits $z^{t}=f_{t}(x)\in\mathbb{R}^{C}$ and the student network produces logits $z^{s}=f_{s}(x)\in\mathbb{R}^{C}$, where $C$ denotes the number of emotion classes. The student's classification loss is the standard cross-entropy
$$\mathcal{L}_{CE}=-\sum_{c=1}^{C} y_{c}\,\log \sigma_{c}(z^{s}),$$
where $\sigma(\cdot)$ is the softmax function and $y_{c}$ is the one-hot ground-truth label. This loss measures the discrepancy between the student model's predicted emotion distribution and the ground-truth labels, and the knowledge distillation loss is defined as
$$\mathcal{L}_{KD}=\tau^{2}\,\mathrm{KL}\!\left(\sigma(z^{t}/\tau)\,\Vert\,\sigma(z^{s}/\tau)\right),$$
where $\tau$ is the distillation temperature that softens both distributions. The overall training objective combines the two terms,
$$\mathcal{L}=(1-\alpha)\,\mathcal{L}_{CE}+\alpha\,\mathcal{L}_{KD},$$
where $\alpha\in[0,1]$ balances the ground-truth supervision and the knowledge transferred from the teacher.
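A minimal PyTorch sketch of the combined objective above; the temperature and weighting defaults are assumed values, and the standard temperature-scaled KL form is used for the distillation term.

```python
import torch
import torch.nn.functional as F

def akd_classification_loss(student_logits, teacher_logits, labels, tau=4.0, alpha=0.5):
    """Cross-entropy on ground-truth labels combined with a temperature-scaled KL distillation term."""
    ce = F.cross_entropy(student_logits, labels)

    # Softened distributions of teacher and student (temperature tau).
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)

    # KL(teacher || student), scaled by tau^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2

    return (1.0 - alpha) * ce + alpha * kd
```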
The authors of [8] introduced a widely adopted technique for transferring knowledge based on the Kullback-Leibler divergence. This divergence quantifies the difference between two probability distributions over the same variable and is minimized between the teacher and the student. The technique has proven effective on both speech and image recognition tasks. Knowledge distillation enters through the softened classification probability
$$p_{i}=\frac{\exp(z_{i}/\tau)}{\sum_{j=1}^{C}\exp(z_{j}/\tau)},$$
where $z_{i}$ is the logit for class $i$ and $\tau$ is the temperature; a higher temperature produces a softer distribution that exposes the 'dark knowledge' contained in the non-target classes.
Teacher: The method employs a pre-trained wav2vec 2.0 (Large) [15] model whose feature encoder encodes the audio waveform and captures low-level embedding features of the input. The resulting embedding is normalized and passed through a GELU activation before being fed into the context network, which comprises 12 transformer blocks with 12 attention heads each. The soft label and weight for emotion classification are then determined from the feature vector obtained from the last layer of the context network.
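As an illustration, the sketch below shows how last-layer teacher features and softened emotion probabilities could be obtained with the torchaudio wav2vec 2.0 (Large) pipeline; the linear emotion head, the temporal mean pooling, and the four-class setup are assumptions for the example, not components of the released checkpoint.

```python
import torch
import torchaudio

# Pre-trained wav2vec 2.0 (Large) from torchaudio; the emotion head is a hypothetical add-on.
bundle = torchaudio.pipelines.WAV2VEC2_LARGE
teacher_encoder = bundle.get_model().eval()
emotion_head = torch.nn.Linear(1024, 4)  # 1024 = Large hidden size; 4 emotion classes (assumed)

def teacher_soft_labels(waveform: torch.Tensor, tau: float = 4.0) -> torch.Tensor:
    """Return softened emotion probabilities from the teacher's last context-network layer."""
    with torch.no_grad():
        features, _ = teacher_encoder.extract_features(waveform)  # list of per-layer features
        last = features[-1].mean(dim=1)                           # pool over time
        logits = emotion_head(last)
    return torch.softmax(logits / tau, dim=-1)
```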
Student: Our proposed student is based on the hybrid transformer [21], which incorporates skip connections between encoders and decoders. We observed that knowledge adoption and distillation yield improved results on smaller datasets. Following this insight, we constructed a student model (shown in Fig. 1) using an encoder-decoder architecture with a residual learning strategy, followed by convolutional layers with skip connections across layers [22]. In our method, the Transformer encodes the speech with a self-attention mechanism that focuses on the essential parts of the signal while capturing long-range dependencies, improving understanding compared with traditional methods. Finally, the system decodes the encoded features, allowing the model to consider all parts of the speech signal simultaneously, which improves its ability to capture the complex relationships within the data that are crucial for extracting discriminative features. The outputs undergo post-processing using connected components, and spatial and temporal learning is employed to target emotion effectively, capturing a comprehensive learning pattern from a micro perspective.
Limitations: The traditional transformer approach uses arbitrary convolutions for volumetric input data. These convolutions capture only short-range spatial-temporal features, limiting their capacity to model broader global contextual dependencies beyond the designated receptive field. The spatial and temporal channels of the Transformer, by contrast, encode long-range dependencies by comparing feature activations across space and time, transcending the limitations of conventional filters' receptive fields. Combining self-attention with convolutional layers has also proven advantageous for various tasks [23]. However, we are unaware of prior attempts in the literature to design spatio-temporal self-attention as an essential, stand-alone component for SER.
3.4 Proposed Adaptive Knowledge Distillation
To capture the inherent qualities of teacher networks, we developed the Adopter (shown in Fig. 1) as a latent representation of teacher networks. This approach draws inspiration from latent factor models frequently employed in recommendation systems (as discussed in [24]), where a latent factor represents an item's inherent characteristics. Our approach extracts instance representations from the final layer of the student network's output. Given the student's feature map $F^{s}\in\mathbb{R}^{C'\times H\times W}$, where $C'$, $H$, and $W$ denote the number of channels, height, and width, we apply a channel-wise operation that selects the most significant value within each channel, keeping the input representations consistently aligned:
$$m_{c}=\max_{h,w} F^{s}_{c,h,w},\qquad c=1,\dots,C'.$$
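In PyTorch, this channel-wise selection can be expressed in a single call, assuming the student feature map is stored as a (batch, channels, height, width) tensor:

```python
import torch

def channel_max_representation(student_features: torch.Tensor) -> torch.Tensor:
    """Select the most significant activation within each channel of the student feature map.

    student_features: (batch, channels, height, width)
    returns:          (batch, channels) fixed-size instance representation
    """
    return torch.amax(student_features, dim=(-2, -1))
```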
The resulting channel-wise maxima form the set of values used as the instance representation for each input. Eq. (6) then defines a loss function over the model's predictions, which is particularly useful in scenarios with multiple possible outputs at each step; the instance-level weights introduced there determine how strongly each teacher's knowledge contributes to the student's update.
Our Adaptive Knowledge Distillation (AKD) framework prioritizes robust learning, particularly when the student model learns from the teacher's outputs. To prevent the student's regression from being negatively affected by noise in the teacher's predictions, we replace the original regression loss (MSE, or Mean Squared Error) with the Huber loss, which behaves quadratically for small errors and linearly for large ones and is therefore less sensitive to outlying teacher predictions.
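A small sketch of this substitution, assuming the student and teacher representations are already aligned tensors; the `delta` value is an illustrative hyperparameter.

```python
import torch

huber = torch.nn.HuberLoss(delta=1.0)   # quadratic for small errors, linear for large ones
mse = torch.nn.MSELoss()

def feature_regression_loss(student_repr, teacher_repr, robust: bool = True):
    """Regress student features onto teacher features; Huber dampens the effect of noisy teacher outputs."""
    return huber(student_repr, teacher_repr) if robust else mse(student_repr, teacher_repr)
```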
In the core AKD process, the student is trained using "soft targets", i.e., the class probabilities produced by the teacher model. The soft targets are obtained by applying a temperature-scaled softmax to the teacher's logits, so that the relative probabilities of the non-target classes remain visible to the student.
Furthermore, as shown in Eq. (11), our knowledge transfer mechanism helps to close the gap not only at the final output but also at the intermediate layers, where the group-hint strategy aligns the student's intermediate features with those of the teacher's last layers.
A critical challenge in KD is "negative transfer," in which a student model inadvertently learns incorrect patterns from a faulty teacher. To mitigate this, we introduce a mechanism that selectively allows knowledge transfer based on the teacher's reliability for a given instance. This is governed by Eqs. (12) and (13):
Here, the teacher's reliability for each training instance is assessed by comparing its prediction error on that instance against a threshold, and the resulting gating variable controls whether the distillation term is applied:
• If the teacher's error is below the threshold, the gate remains open and the student learns from the teacher's soft targets for that instance.
• If the teacher's error exceeds the threshold, the gate closes, the distillation term is suppressed, and the student relies solely on the ground-truth label for that instance.
Crucially, this gating is applied per instance, so an unreliable teacher prediction on one sample does not prevent knowledge transfer on the rest of the data.
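A minimal sketch of such gating, under the assumption that the teacher's reliability is measured by its per-instance cross-entropy on the ground-truth label and compared against a tunable threshold (both assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def transfer_gate(teacher_logits, labels, threshold: float = 1.0) -> torch.Tensor:
    """Return a per-instance 0/1 gate: 1 where the teacher is reliable, 0 where it is not."""
    teacher_error = F.cross_entropy(teacher_logits, labels, reduction="none")  # per-instance error
    return (teacher_error < threshold).float()

def gated_distillation_loss(kd_per_instance, teacher_logits, labels, threshold: float = 1.0):
    """Suppress the distillation term for instances where the teacher's error exceeds the threshold."""
    gate = transfer_gate(teacher_logits, labels, threshold)
    return (gate * kd_per_instance).mean()
```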
After incorporating this negative transfer mitigation module, the overall joint loss function that guides the student's training (cf. Eq. (10)) combines three terms:
• the cross-entropy loss between the student's predictions and the ground-truth emotion labels;
• the soft-target distillation loss between the teacher's and the student's softened output distributions, scaled by the instance-level gate from Eqs. (12) and (13);
• the intermediate-level loss that aligns the student's features with the teacher's.
When the gate suppresses the distillation term for an instance on which the teacher is unreliable, that instance is trained from the ground-truth label alone; when the gate is open, both supervision signals contribute to the student's update.
In this manner, the student model selectively learns from the teacher: it imitates the teacher's outputs (via the gated soft-target loss) when the teacher is reliable for a given instance, and otherwise falls back on the ground-truth labels.
This article uses the IEMOCAP [26], EmoDB [27], and RAVDESS [28] corpora to evaluate the robustness and efficiency of the proposed method for SER. The IEMOCAP corpus was recorded in American English by 10 professional speakers and covers four emotions. EmoDB is a German-language corpus recorded by ten German speakers covering seven emotions, and RAVDESS is a North American English corpus recorded by twenty-four speakers across twelve sessions, covering eight emotions. These are scripted corpora in which male and female actors deliver pre-designed scripts while portraying various emotions. Further details are available in [26–28].
To assess the predictive capacity of our proposed model, we use two metrics: weighted accuracy (WA) and unweighted accuracy (UA). UA is the mean accuracy across emotional categories, while WA is the accuracy over all samples. These metrics are widely used in contemporary SER research to assess performance.
We adopted a true nested cross-validation protocol for evaluation. In the outer loop, the data were divided into



4.3.1 Leakage-Free Data Partitioning and Segmentation
To avoid data leakage, an utterance-level split was performed prior to segmenting the audio signal. Each audio recording was assigned a unique utterance ID, which served as an explicit grouping key when splitting between training and evaluation sets. These utterance IDs were then split into train, validation, and test sets according to speaker grouping. Each generated segment retained the metadata of its original utterance, including the utterance ID, speaker ID, emotion label, partition label, and the temporal boundaries of the original utterance. Furthermore, we preserved a one-to-many mapping from utterances to the segments derived from them, so that all segments from a single utterance fell in the same partition, and no segment reassignments were performed after the initial segmentation. This preprocessing procedure is summarized in Algorithm 2.
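A simplified sketch of the grouping logic behind Algorithm 2: utterances are assigned to partitions by speaker first, and segments are generated afterwards, each inheriting its utterance's partition. The field names and segmentation routine are illustrative.

```python
from collections import defaultdict

def split_then_segment(utterances, speaker_partition, segment_fn):
    """Assign whole utterances to partitions by speaker before segmentation, so that no
    segments from one utterance can end up in different partitions.

    utterances:        list of dicts with 'utt_id', 'speaker_id', 'emotion', 'audio'
    speaker_partition: dict mapping speaker_id -> 'train' | 'val' | 'test'
    segment_fn:        callable that cuts one utterance's audio into segments
    """
    partitions = defaultdict(list)
    for utt in utterances:
        part = speaker_partition[utt["speaker_id"]]          # utterance-level decision
        for start, end, segment_audio in segment_fn(utt["audio"]):
            partitions[part].append({
                "utt_id": utt["utt_id"],                      # metadata inherited by every segment
                "speaker_id": utt["speaker_id"],
                "emotion": utt["emotion"],
                "partition": part,
                "bounds": (start, end),
                "audio": segment_audio,
            })
    return partitions
```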

4.3.2 Utterance-Level Inference and Test-Time Overlap Analysis
To cover more of the temporal context, each test utterance was processed at inference time with a set of overlapping windows in a sliding-window fashion, but the individual windows were not treated as separate evaluation examples. Instead of scoring window-level predictions directly, the window-level predictions belonging to the same utterance were combined into a single utterance-level prediction. The reported weighted accuracy (WA) and unweighted accuracy (UA) were therefore computed at the utterance level, not the segment level. This procedure does not inflate the evaluation: since the windows of the same utterance are highly overlapping and similar, they serve as multiple local views of the same test sample. We averaged the posterior probabilities of all windows belonging to the same utterance and assigned the emotion class with the maximum averaged posterior as the utterance label. To study the effect of overlap further, we repeated the test-time inference while keeping the training protocol, model parameters, and utterance-level aggregation rule exactly the same and varying only the overlap ratio of the segments; the results are shown in Table 1. These analyses suggest that overlap provides a small gain in temporal stability, and this effect is measured entirely at the utterance level. The same utterance-level mean-posterior aggregation rule was used for all experiments and all three datasets, namely IEMOCAP, EmoDB, and RAVDESS.
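The aggregation rule can be written compactly as below; window posteriors are assumed to be grouped by utterance ID.

```python
import numpy as np

def aggregate_utterance_predictions(window_posteriors):
    """Average the window-level posteriors of each utterance and pick the arg-max class.

    window_posteriors: dict mapping utt_id -> array of shape (num_windows, num_classes)
    returns:           dict mapping utt_id -> predicted class index
    """
    predictions = {}
    for utt_id, posteriors in window_posteriors.items():
        mean_posterior = np.asarray(posteriors).mean(axis=0)   # mean over overlapping windows
        predictions[utt_id] = int(mean_posterior.argmax())     # utterance-level label
    return predictions
```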
All results reported in Table 3 were obtained using the dataset-specific speaker-independent protocol described above and summarized in Table 2. As an optimization algorithm, Adam is used to train the model for 100 epochs with a batch size of 64 and a learning rate of
This section presents additional experiments to evaluate the effectiveness of our proposed model. Various architectures are tested, including deep learning alone, combinations of neural networks, encoder/decoder architectures, knowledge distillation, and adaptive knowledge distillation, as shown in Table 4. According to the outcomes in Table 4, the architecture incorporating adaptive knowledge distillation achieved the highest accuracy.

The results indicate that even seemingly unrelated labels are valuable when using knowledge distillation. The knowledge distillation model serves as our initial benchmark against which outcomes from experiments with diverse architectures are compared, as reported in Table 3. Our analysis highlights the importance of knowledge exchange among non-target classes for successful logit distillation, whose effectiveness is determined mainly by strategic adaptation, an aspect that has been understudied. Our proposed architecture therefore combines the knowledge distillation approach with fine-tuned adaptation coefficients to enhance the effectiveness of distillation.
The proposed model is compared with state-of-the-art techniques using the same datasets and evaluation metrics, as shown in Table 3. The results highlight the robustness of our method and its architecture in the SER domain. Our model outperformed recent SER models, notably [39] and the multilayer attention mechanism with distillation of [38], which had achieved the best recent results, and it surpasses the other recent baseline methods as well. These findings underscore the distinctiveness and broad applicability of the features acquired through our proposed encoder-based adaptive knowledge distillation architecture. Furthermore, our model effectively captures salient information in emotion recognition tasks, as demonstrated by the confusion matrices in Figs. 2–4, which provide intuitive visualizations of its performance across all evaluated datasets.

Figure 2: Our AKD model: confusion among actual and predicted labels of the IEMOCAP dataset.

Figure 3: Our AKD model: confusion among actual and predicted labels of the EmoDB dataset.

Figure 4: Our AKD model: confusion among actual and predicted labels of the RAVDESS dataset.
4.6 Computational Analysis for Edge Devices
The study set out to create an AKD model tailored for resource-constrained devices such as those at the edge. To keep the model lean and efficient without sacrificing accuracy, we carefully applied several techniques: pruning away unnecessary parts, simplifying calculations through quantization, and using distillation to learn from larger models. We used the ONNX standard to package the model's parameters and weights for smooth, real-time deployment, with the results of this optimization detailed in Table 5. Beyond that, we further boosted inference speed by adapting the model to lower-precision arithmetic (such as fp16 floating point) while ensuring performance was not compromised. Ultimately, the aim was a fast model that draws little power and still makes excellent predictions.
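As a rough illustration of this deployment path, the snippet below exports a trained student model to ONNX and runs it with ONNX Runtime; the file name, input naming, and provider choice are assumptions, and a subsequent fp16 conversion (for example with the onnxconverter-common package) can further reduce latency.

```python
import torch
import onnxruntime as ort

def export_and_run(student, sample_wave, path="akd_student.onnx"):
    """Export the student model to ONNX and run one inference with ONNX Runtime."""
    student.eval()
    torch.onnx.export(
        student, sample_wave, path,
        input_names=["waveform"], output_names=["logits"],
        dynamic_axes={"waveform": {0: "batch"}},   # allow variable batch size
    )
    # A further fp16 conversion of the exported graph is possible before deployment.
    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    logits = session.run(None, {"waveform": sample_wave.numpy()})[0]
    return logits
```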

4.7 Limitations of the Proposed AKD Model
Our model can occasionally become perplexed and incorrectly label similar or closely related emotions, such as confusing Frustration with Anger or Happiness with Excitement. Furthermore, when dealing with highly imbalanced data, our model tends to misclassify emotions as the one with the most data samples.
The proposed system employed an adaptive knowledge distillation strategy utilizing spatio-temporal encoders/decoders in the student network, along with a pre-trained wav2vec 2.0 (Large) teacher network, to enhance the model's performance. Our distillation model for emotion recognition leverages knowledge of non-target classes to learn discriminative features. In our experiments on the IEMOCAP, EmoDB, and RAVDESS corpora, the system achieves 84.45%, 97.07%, and 97.06% weighted accuracy and 83.34%, 96.04%, and 95.50% unweighted accuracy, respectively. According to the experimental results, the student model demonstrates a strong ability to recognize emotion from speech under moderately noisy conditions when guided by the teacher model.
Furthermore, we plan to explore knowledge distillation in SER with noise-augmented audio data to make the model more robust in real-world scenarios. Future research could also enhance the proposed system by optimizing it for real-time applications, exploring various fusion techniques, addressing privacy concerns, integrating additional modalities, and evaluating the model's interpretability.
Acknowledgement: The authors express their appreciation and thanks to the SafeStream team for their contribution to the development of Next-Gen Multimodal AI for Improved Detection, Recognition, and Scene Analysis in UAV Applications. The authors would also like to express their gratitude to the AI-based tools used during this research to enhance it.
Funding Statement: This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2025S1A5C3A02009153).
Author Contributions: Conceptualization, Mustaqeem Khan and Ufaq Khan; methodology, Mustaqeem Khan; software, Mustaqeem Khan and Ufaq Khan; validation, Mustaqeem Khan and Guiyoung Son; formal analysis, Mamoun Awad, Nazar Zaki and Soonil Kwon; investigation, Mustaqeem Khan, Nazar Zaki and Soonil Kwon; writing—original draft preparation, Mustaqeem Khan and Ufaq Khan; writing—review and editing, Mamoun Awad, Nazar Zaki, Guiyoung Son and Soonil Kwon; visualization, Mustaqeem Khan, Guiyoung Son, Nazar Zaki and Soonil Kwon; supervision, Nazar Zaki and Soonil Kwon; project administration, Guiyoung Son and Soonil Kwon; funding acquisition, Guiyoung Son and Soonil Kwon. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: The study utilized publicly available datasets that can be accessed through the following links: IEMOCAP (https://sail.usc.edu/iemocap/ or https://www.kaggle.com/datasets/samuelsamsudinng/iemocap-emotion-speech-database), EmoDB (https://www.kaggle.com/datasets/piyushagni5/berlin-database-of-emotional-speech-emodb), and RAVDESS (https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio).
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Bachate MM, Suchitra S. Sentiment analysis and emotion recognition in social media: a comprehensive survey. Appl Soft Comput. 2025;174(3):112958. doi:10.1016/j.asoc.2025.112958. [Google Scholar] [CrossRef]
2. Anand RV, Md AQ, Sakthivel G, Padmavathy T, Mohan S, Damaševičius R. Acoustic feature-based emotion recognition and curing using ensemble learning and CNN. Appl Soft Comput. 2024;166(4):112151. doi:10.1016/j.asoc.2024.112151. [Google Scholar] [CrossRef]
3. Prabhakar GA, Basel B, Dutta A, Rao CVR. Multichannel CNN-BLSTM architecture for speech emotion recognition system by fusion of magnitude and phase spectral features using DCCA for consumer applications. IEEE Trans Consum Electron. 2023;69(2):226–35. doi:10.1109/tce.2023.3236972. [Google Scholar] [CrossRef]
4. Sharma A, Kumar A. DREAM: deep learning-based recognition of emotions from multiple affective modalities using consumer-grade body sensors and video cameras. IEEE Trans Consum Electron. 2024;70(1):1434–42. [Google Scholar]
5. Lak AJ, Boostani R, Alenizi FA, Mohammed AS, Fakhrahmad SM. RoBERTa, ResNeXt and BiLSTM with self-attention: the ultimate trio for customer sentiment analysis. Appl Soft Comput. 2024;164:112018. [Google Scholar]
6. Basak S, Agrawal H, Jena S, Gite S, Bachute M, Pradhan B, et al. Challenges and limitations in speech recognition technology: a critical review of speech signal processing algorithms, tools and systems. Comput Model Eng Sci. 2023;135(2):1053–89. doi:10.32604/cmes.2022.021755. [Google Scholar] [CrossRef]
7. Chauhan GS, Saxena A, Nahta R, Meena YK. Hierarchical attention for aspect extraction using LSTM in fine-grained sentiment analysis and evaluation. Appl Soft Comput. 2024;167:112408. doi:10.1016/j.asoc.2024.112408. [Google Scholar] [CrossRef]
8. Hinton GE, Vinyals O, Dean J. Distilling the knowledge in a neural network. arXiv:1503.02531. 2015. [Google Scholar]
9. You S, Xu C, Xu C, Tao D. Learning with single-teacher multi-student. In: Proceedings of the AAAI Conference on Artificial Intelligence. Menlo Park, CA, USA: AAAI Press; 2018. Vol. 32, p. 4390–7. [Google Scholar]
10. Nakashole N, Flauger R. Knowledge distillation for bilingual dictionary induction. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: ACL; 2020. p. 2497–506. [Google Scholar]
11. Wang LH, Dai Q, Du T, Chen LF. Lightweight intrusion detection model based on CNN and knowledge distillation. Appl Soft Comput. 2024;165(1–2):112118. doi:10.1016/j.asoc.2024.112118. [Google Scholar] [CrossRef]
12. Wang P, Huang H, Zhao L, Zhu B, Huang H, Wu H. ExtRe: extended temporal-spatial network for consumer-electronic WiFi-based human activity recognition. IEEE Trans Consum Electron. 2025;71(1):230–8. doi:10.1109/tce.2024.3435881. [Google Scholar] [CrossRef]
13. Ji X, Dong Z, Han Y, Lai CS, Zhou G, Qi D. EMSN: an energy-efficient memristive sequencer network for human emotion classification in mental health monitoring. IEEE Trans Consum Electron. 2023;69(4):1005–16. [Google Scholar]
14. Andreas A, Mavromoustakis CX, Song H, Batalla JM. Optimisation of CNN through transferable online knowledge for stress and sentiment classification. IEEE Trans Consum Electron. 2024;70(1):3088–97. doi:10.1109/tce.2023.3319111. [Google Scholar] [CrossRef]
15. Baevski A, Zhou Y, Mohamed A, Auli M. wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv Neural Inf Process Syst. 2020;33:12449–60. [Google Scholar]
16. Hsu WN, Bolte B, Tsai YHH, Lakhotia K, Salakhutdinov R, Mohamed A. HuBERT: self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans Audio Speech Lang Process. 2021;29:3451–60. doi:10.1109/taslp.2021.3122291. [Google Scholar] [CrossRef]
17. Chen S, Wang C, Chen Z, Wu Y, Liu S, Chen Z, et al. WavLM: large-scale self-supervised pre-training for full stack speech processing. IEEE J Sel Top Signal Process. 2022;16(6):1505–18. [Google Scholar]
18. Yang X, Li Q, Woodland PC. Knowledge distillation for neural transducers from large self-supervised pre-trained models. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ, USA: IEEE; 2022. p. 8527–31. [Google Scholar]
19. Guo L, Yang X, Wang Q, Kong Y, Yao Z, Cui F, et al. Predicting multi-codebook vector quantization indexes for knowledge distillation. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ, USA: IEEE; 2023. p. 1–5. [Google Scholar]
20. Kurata G, Saon G. Knowledge distillation from offline to streaming RNN transducer for end-to-end speech recognition. In: Interspeech 2020—The 21st Annual Conference of the International Speech Communication Association; 2020 Oct 25–29; Shanghai, China. p. 2117–21. [Google Scholar]
21. Wang Y, Mohamed A, Le D, Liu C, Xiao A, Mahadeokar J, et al. Transformer-based acoustic modeling for hybrid speech recognition. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ, USA: IEEE; 2020. p. 6874–8. [Google Scholar]
22. Zhou S, Zhao Y, Xu S, Xu B, Li H. Multilingual recurrent neural networks with residual learning for low-resource speech recognition. In: INTERSPEECH 2017—The 18th Annual Conference of the International Speech Communication Association; 2017 Aug 20–24; Stockholm, Sweden. p. 704–8. [Google Scholar]
23. Al-Dujaili MJ, Ebrahimi-Moghadam A. Speech emotion recognition: a comprehensive survey. Wirel Pers Commun. 2023;129(4):2525–61. doi:10.1007/s11277-023-10244-3. [Google Scholar] [CrossRef]
24. Koren Y. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM; 2008. p. 426–34. [Google Scholar]
25. Park W, Kim D, Lu Y, Cho M. Relational knowledge distillation. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE; 2020. p. 3967–76. [Google Scholar]
26. Busso C, Bulut M, Lee CC, Kazemzadeh A, Mower E, Kim S, et al. IEMOCAP: interactive emotional dyadic motion capture database. Lang Resour Eval. 2008;42:335–59. [Google Scholar]
27. Burkhardt F, Paeschke A, Rolfes M, Sendlmeier WF, Weiss B, Mertens J. A database of German emotional speech. In: INTERSPEECH 2005—Eurospeech, 9th European Conference on Speech Communication and Technology; 2005 Sep 4–8; Lisbon, Portugal. p. 1517–20. [Google Scholar]
28. Livingstone SR, Russo FA. The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS One. 2018;13(5):e0196391. [Google Scholar] [PubMed]
29. Zhong Y, Hu Y, Huang H, Silamu W. A lightweight model based on separable convolution for speech emotion recognition. In: INTERSPEECH 2020—The 21st Annual Conference of the International Speech Communication Association; 2020 Oct 25–29; Shanghai, China. p. 3331–5. [Google Scholar]
30. Ye J, Wen XC, Wang XZ, Xu Y, Luo Y, Wu CL, et al. GM-TCNet: gated multi-scale temporal convolutional network using emotion causality for speech emotion recognition. Speech Commun. 2022;145:21–35. [Google Scholar]
31. Tuncer T, Dogan S, Acharya UR. Automated, accurate speech emotion recognition system using twine shuffle pattern and iterative neighborhood component analysis techniques. Knowl Based Syst. 2021;211:106547. doi:10.1016/j.knosys.2020.106547. [Google Scholar] [CrossRef]
32. Peng Z, Lu Y, Pan S, Liu Y. Efficient speech emotion recognition using multi-scale CNN and attention. In: ICASSP 2021—International Conference on Acoustics, Speech, and Signal Processing. Piscataway, NJ, USA: IEEE; 2021. p. 3020–4. [Google Scholar]
33. Aftab A, Morsali A, Ghaemmaghami S, Lech M. LIGHT-SERNET: a lightweight fully convolutional neural network for speech emotion recognition. In: ICASSP 2022—International Conference on Acoustics, Speech, and Signal Processing. Piscataway, NJ, USA: IEEE; 2022. p. 6912–6. [Google Scholar]
34. Wen XC, Ye J, Luo Y, Xu Y, Wang XZ, Wu CL, et al. CTL-MTNet: a novel CapsNet and transfer learning-based mixed task net for single-corpus and cross-corpus speech emotion recognition. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22). Piscataway, NJ, USA: IEEE; 2022. p. 2305–11. [Google Scholar]
35. Li R, Wu Z, Jia J, Meng H. Dilated residual network with multi-head self-attention for speech emotion recognition. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019). Piscataway, NJ, USA: IEEE; 2019. p. 6675–9. [Google Scholar]
36. Bhangale K, Kothandaraman M. Speech emotion recognition based on multiple acoustic features and deep convolutional neural network. Electronics. 2023;12(4):839. doi:10.3390/electronics12040839. [Google Scholar] [CrossRef]
37. Kakouros S, Stafylakis T, Mošner L, Burget L. Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing. In: ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ, USA: IEEE; 2023. p. 1–5. [Google Scholar]
38. Bhangale KB, Kothandaraman M. Speech emotion recognition using the novel PEmoNet (Parallel Emotion Network). Appl Acoust. 2023;212(2):109613. doi:10.1016/j.apacoust.2023.109613. [Google Scholar] [CrossRef]
39. Ye J, Wen XC, Wei Y, Xu Y, Liu K, Shan H. Temporal modeling matters: a novel temporal emotional modeling approach for speech emotion recognition. In: ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ, USA: IEEE; 2023. p. 1–5. [Google Scholar]
40. Chen LW, Rudnicky A. Exploring Wav2vec 2.0 fine tuning for improved speech emotion recognition. In: ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway, NJ, USA: IEEE; 2023. p. 1–5. [Google Scholar]
41. Mishra SP, Warule P, Deb S. Speech emotion recognition using MFCC-based entropy feature. Signal Image Video Process. 2024;18(1):153–61. doi:10.1007/s11760-023-02716-7. [Google Scholar] [CrossRef]
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

