A Hybrid CNN–BiLSTM Framework for Speech Emotion Recognition with TimeGAN-Augmented Data and Contrastive Learning

Rashid Jahangir¹,*, Muhammad Asif Nauman², Oumaima Saidani³, Faisal Ramzan²
¹ Department of Computer Science, COMSATS University Islamabad, Vehari Campus, Vehari, Pakistan
² Riphah School of Computing & Innovation, Riphah International University, Lahore, Pakistan
³ Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
* Corresponding Author: Rashid Jahangir. Email: email
(This article belongs to the Special Issue: Deep Learning for Emotion Recognition)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.080025

Received 02 February 2026; Accepted 15 April 2026; Published online 17 June 2026

Abstract

Speech Emotion Recognition (SER) is a critical component of affective computing, with broad applications in human–computer interaction, mental health monitoring, and intelligent multimedia systems. However, SER remains challenging because of emotional ambiguity, scarce labeled data, class imbalance, and speaker variability. This study presents an SER framework that integrates contrastive representation learning, optimized spectrogram-based data augmentation, and selective synthetic data generation with TimeGAN to improve emotion classification performance. Contrastive learning helps the model discriminate acoustically similar emotions, while Optuna automatically tunes augmentation strategies such as noise injection, time shifting, and time-frequency masking. Unlike existing approaches that apply synthetic generation uniformly across all classes, the proposed method targets only confusing or under-represented emotion classes in order to preserve inter-class separability. A CNN-BiLSTM architecture extracts spectral and temporal information from the speech signal. The framework is evaluated on two benchmark SER datasets, EMO-DB and RAVDESS, under speaker-independent protocols. Experimental results demonstrate improved accuracy, robustness, and generalization under limited and imbalanced data conditions, supported by confusion matrices and by UMAP and t-SNE visualizations.

Keywords

Speech emotion recognition; data augmentation; Optuna; TimeGAN; synthetic data; contrastive learning
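
The abstract states that synthetic generation targets only confusing or under-represented emotion classes, but it does not give the selection rule. As a rough illustration only, the sketch below picks such classes from a validation confusion matrix. The function name `select_target_classes`, the two thresholds (`count_ratio`, `recall_floor`), and the toy matrix are all hypothetical and are not taken from the paper.

```python
import numpy as np


def select_target_classes(confusion, count_ratio=0.5, recall_floor=0.8):
    """Return indices of classes that look under-represented or often confused.

    confusion[i, j] is the number of class-i samples predicted as class j.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    confusion = np.asarray(confusion, dtype=float)
    support = confusion.sum(axis=1)  # true samples per class
    # Per-class recall; classes with no samples get recall 0.
    recall = np.divide(
        np.diag(confusion), support,
        out=np.zeros_like(support), where=support > 0,
    )
    # Assumption: "under-represented" means far fewer samples than the largest class.
    under_represented = support < count_ratio * support.max()
    # Assumption: "confusing" means low recall on the validation set.
    confused = recall < recall_floor
    return np.flatnonzero(under_represented | confused).tolist()


# Toy 4-class confusion matrix with made-up counts.
toy = [
    [90, 5, 3, 2],     # class 0: 100 samples, recall 0.90
    [4, 88, 4, 4],     # class 1: 100 samples, recall 0.88
    [20, 10, 60, 10],  # class 2: 100 samples, recall 0.60 (often confused)
    [1, 1, 1, 27],     # class 3: 30 samples (under-represented)
]
targets = select_target_classes(toy)
print(targets)
```

With these toy counts, classes 2 and 3 would be selected, so a generator such as TimeGAN would be trained and sampled only for those two classes while the remaining classes keep their real data unchanged.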
