
Open Access

ARTICLE

A Hybrid CNN–BiLSTM Framework for Speech Emotion Recognition with TimeGAN-Augmented Data and Contrastive Learning

Rashid Jahangir1,*, Muhammad Asif Nauman2, Oumaima Saidani3, Faisal Ramzan2
1 Department of Computer Science, COMSATS University Islamabad, Vehari Campus, Vehari, Pakistan
2 Riphah School of Computing & Innovation, Riphah International University, Lahore, Pakistan
3 Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
* Corresponding Author: Rashid Jahangir.
(This article belongs to the Special Issue: Deep Learning for Emotion Recognition)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.080025

Received 02 February 2026; Accepted 15 April 2026; Published online 17 June 2026

Abstract

Speech Emotion Recognition (SER) is a critical component of affective computing with broad applications in human–computer interaction, mental health monitoring, and intelligent multimedia systems. However, SER remains challenging due to emotional ambiguity, scarce labeled data, class imbalance, and speaker variability. This study presents an effective SER framework that integrates contrastive representation learning, optimized spectrogram-based data augmentation, and selective synthetic data generation with TimeGAN to improve emotion classification performance. Contrastive learning helps the model discriminate acoustically similar emotions, while Optuna automatically tunes augmentation strategies such as noise injection, time shifting, and time-frequency masking. Unlike existing approaches that apply synthetic generation uniformly across all classes, the proposed method targets only confusing or under-represented emotion classes to preserve inter-class separability. A CNN-BiLSTM architecture extracts spectral and temporal information from the speech signal. The framework is evaluated on two benchmark SER datasets, EMO-DB and RAVDESS, under speaker-independent protocols. Experimental results demonstrate improved accuracy, robustness, and generalization under limited and imbalanced data conditions, supported by confusion matrices, UMAP, and t-SNE visualizations.
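
The three augmentation strategies named in the abstract (noise injection, time shifting, and time-frequency masking) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `params` keys, the circular shift, and the zero-fill masking are all assumptions made for illustration. The `params` dictionary stands in for the kind of values a hyperparameter tuner such as Optuna would search over.

```python
import numpy as np

def add_noise(spec, noise_std, rng):
    """Add zero-mean Gaussian noise to every time-frequency bin."""
    return spec + rng.normal(0.0, noise_std, size=spec.shape)

def time_shift(spec, max_shift, rng):
    """Circularly shift the spectrogram along the time axis (axis 1)."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(spec, shift, axis=1)

def mask_time_freq(spec, max_freq_width, max_time_width, rng):
    """Zero out one random frequency band and one random time band."""
    out = spec.copy()
    n_freq, n_time = out.shape
    f_width = int(rng.integers(0, max_freq_width + 1))
    t_width = int(rng.integers(0, max_time_width + 1))
    f_start = int(rng.integers(0, n_freq - f_width + 1))
    t_start = int(rng.integers(0, n_time - t_width + 1))
    out[f_start:f_start + f_width, :] = 0.0
    out[:, t_start:t_start + t_width] = 0.0
    return out

def augment(spec, params, seed=0):
    """Apply the three augmentations in sequence (hypothetical ordering)."""
    rng = np.random.default_rng(seed)
    out = add_noise(spec, params["noise_std"], rng)
    out = time_shift(out, params["max_shift"], rng)
    return mask_time_freq(
        out, params["max_freq_width"], params["max_time_width"], rng
    )

# Toy spectrogram: 64 frequency bins by 100 time frames (assumed layout).
spec = np.ones((64, 100))
params = {
    "noise_std": 0.05,      # hypothetical values, not taken from the paper
    "max_shift": 10,
    "max_freq_width": 8,
    "max_time_width": 12,
}
augmented = augment(spec, params)
```

Each function returns a new array with the same shape as its input, so the augmented spectrogram can be fed to the same model as the original.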

Keywords

Speech emotion recognition; data augmentation; Optuna; TimeGAN; synthetic data; contrastive learning