Improvement of Emotion Detection by Fusing Speech and Image Based on CNN with Temporal Models

Shing-Tai Pan*, Yi-Zhen Huang, Zhi-Qing Chen
Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan
* Corresponding Author: Shing-Tai Pan. Email: email
(This article belongs to the Special Issue: Deep Learning for Emotion Recognition)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.081460

Received 08 March 2026; Accepted 21 May 2026; Published online 01 July 2026

Abstract

This paper proposes a multimodal fusion framework that integrates speech and visual features to improve the accuracy of emotion recognition. The principal contribution is the extension of the visual component from single-image to multi-image emotion recognition. Specifically, the framework uses an InceptionV3-based Convolutional Neural Network (CNN) to extract features from multiple facial images that capture the speaker’s expressions throughout an utterance. These features are concatenated into a single vector and then processed by a Long Short-Term Memory (LSTM) network or a Hidden Markov Model (HMM) for temporal modeling. For the speech modality, Mel-Frequency Cepstral Coefficients (MFCC) or filter bank features are extracted from the processed audio signals and fed into a hybrid CNN–time-series model. The two modalities are then integrated through model-level and decision-level fusion strategies. Because recognition accuracy tends to degrade as the number of utterances and speakers increases, this study adopts the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), which contains a moderate number of sentences and speakers. Experimental results show that the proposed multi-image approach raises recognition accuracy from 91% for the single-image baseline to 96%, and that the multimodal fusion framework consistently outperforms its single-modal counterparts.
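
The abstract does not state the exact rule used for decision-level fusion. As a minimal sketch of one common option, the Python below takes a weighted average of the per-class probabilities produced by a speech model and an image model and then picks the highest-scoring class. The function names `fuse_decisions` and `predict_emotion`, the `speech_weight` parameter, and the ordering of the emotion labels are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence

# The eight RAVDESS emotion categories. The index order is an arbitrary
# choice for this sketch, not the ordering used by the authors.
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]


def fuse_decisions(speech_probs: Sequence[float],
                   image_probs: Sequence[float],
                   speech_weight: float = 0.5) -> List[float]:
    """Weighted average of two per-class probability vectors.

    speech_weight is a hypothetical tuning parameter: 1.0 trusts only the
    speech model, 0.0 trusts only the image model.
    """
    if len(speech_probs) != len(image_probs):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= speech_weight <= 1.0:
        raise ValueError("speech_weight must lie in [0, 1]")
    image_weight = 1.0 - speech_weight
    return [speech_weight * s + image_weight * v
            for s, v in zip(speech_probs, image_probs)]


def predict_emotion(speech_probs: Sequence[float],
                    image_probs: Sequence[float],
                    speech_weight: float = 0.5) -> str:
    """Return the emotion label with the highest fused score."""
    if len(speech_probs) != len(EMOTIONS):
        raise ValueError("expected one probability per emotion class")
    fused = fuse_decisions(speech_probs, image_probs, speech_weight)
    best_index = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best_index]


# Illustrative, made-up scores: the speech model favours "happy",
# the image model favours "sad".
speech_scores = [0.10, 0.10, 0.50, 0.05, 0.05, 0.05, 0.10, 0.05]
image_scores = [0.05, 0.05, 0.20, 0.50, 0.05, 0.05, 0.05, 0.05]
label_equal = predict_emotion(speech_scores, image_scores, 0.5)
label_image_heavy = predict_emotion(speech_scores, image_scores, 0.1)
```

With equal weights the fused vector favours the class on which the two models agree most strongly; shifting the weight toward one modality lets that model's top class win. In practice the weight would be chosen on a validation set, and a learned combiner could replace the fixed average.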

Keywords

Speech emotion recognition; consecutive facial image emotion recognition; convolutional neural network (CNN); long short-term memory (LSTM); hidden Markov model (HMM); support vector machine (SVM)



