Open Access

ARTICLE

Improvement of Emotion Detection by Fusing Speech and Image Based on CNN with Temporal Models

Shing-Tai Pan*, Yi-Zhen Huang, Zhi-Qing Chen
Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan
* Corresponding Author: Shing-Tai Pan. Email: email
(This article belongs to the Special Issue: Deep Learning for Emotion Recognition)

Computers, Materials & Continua https://doi.org/10.32604/cmc.2026.081460

Received 08 March 2026; Accepted 21 May 2026; Published online 01 July 2026

Abstract

This paper proposes a multimodal fusion framework that integrates speech and visual features to improve the accuracy of emotion recognition. The principal contribution is the extension of the visual component from single-image to multi-image emotion recognition. Specifically, the framework uses an InceptionV3-based Convolutional Neural Network (CNN) to extract features from multiple facial images that capture the speaker’s expressions throughout an utterance. These features are concatenated into a single vector and then processed by a Long Short-Term Memory (LSTM) network or a Hidden Markov Model (HMM) for temporal modeling. For the speech modality, Mel-Frequency Cepstral Coefficients (MFCC) or filter bank features are extracted from processed audio signals and fed into a hybrid CNN–time-series model. The two modalities are then integrated through model-level and decision-level fusion strategies. Because recognition accuracy tends to degrade as the number of utterances and speakers increases, this study adopts the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), which contains a moderate number of sentences and speakers. Experimental results show that the proposed multi-image approach raises recognition accuracy from 91% for the single-image baseline to 96%, and that the multimodal fusion framework consistently outperforms its single-modal counterparts.
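
As a rough illustration of the decision-level fusion mentioned above, the following Python sketch combines per-class probabilities from a speech model and an image model by weighted averaging. The abstract does not state which fusion rule the authors use, so the averaging rule, the function name `decision_level_fusion`, the `speech_weight` parameter, and the example scores are all assumptions made for illustration only.

```python
def decision_level_fusion(speech_probs, image_probs, speech_weight=0.5):
    """Fuse two per-class probability lists by weighted averaging.

    Assumed rule for illustration; not taken from the paper.
    Returns (index of the winning class, fused probabilities).
    """
    if len(speech_probs) != len(image_probs):
        raise ValueError("both modalities must score the same set of classes")
    if not 0.0 <= speech_weight <= 1.0:
        raise ValueError("speech_weight must lie in [0, 1]")

    # Weighted sum of the two modality outputs, class by class.
    fused = [
        speech_weight * s + (1.0 - speech_weight) * v
        for s, v in zip(speech_probs, image_probs)
    ]

    # Renormalize so the fused scores sum to 1.
    total = sum(fused)
    fused = [p / total for p in fused]

    # The predicted class is the one with the highest fused score.
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused


# Hypothetical scores over three emotion classes.
speech = [0.6, 0.3, 0.1]
image = [0.2, 0.7, 0.1]
label, fused = decision_level_fusion(speech, image)
```

With equal weights, the image model's stronger preference for the second class outweighs the speech model's preference for the first, so the fused decision selects the second class. Model-level fusion, by contrast, would combine intermediate features before classification rather than final scores.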

Keywords

Speech emotion recognition; consecutive facial image emotion recognition; convolutional neural network (CNN); long short-term memory (LSTM); hidden Markov model (HMM); support vector machine (SVM)