Open Access


CNN-Based Voice Emotion Classification Model for Risk Detection

Hyun Yoo1, Ji-Won Baek2, Kyungyong Chung3,*
1 Contents Convergence Software Research Institute, Kyonggi University, Suwon-si, 16227, Korea
2 Department of Computer Science, Kyonggi University, Suwon-si, 16227, Korea
3 Division of AI Computer Science and Engineering, Kyonggi University, Suwon-si, 16227, Korea
* Corresponding Author: Kyungyong Chung. Email:

Intelligent Automation & Soft Computing 2021, 29(2), 319-334.

Received 25 February 2021; Accepted 06 April 2021; Issue published 16 June 2021


With the convergence and development of the Internet of Things (IoT) and artificial intelligence, closed-circuit television, wearable devices, and artificial neural networks have been combined and applied to crime prevention and follow-up measures against crimes. However, these IoT devices have various limitations imposed by the physical environment and face the fundamental problem of privacy violations. To overcome these limitations, this study collects voice data and classifies emotions using an acoustic sensor that is free of privacy violations and is not sensitive to changes in the external environment. To classify emotions in the voice, the data generated by the acoustic sensor are combined with a convolutional neural network (CNN), a type of artificial neural network. The short-time Fourier transform and the wavelet transform, both frequency spectrum representation methods, are used as preprocessing techniques for analyzing patterns in the acoustic data. The preprocessed spectrum data are represented as a 2D image of the pattern of emotion felt through hearing, which is fed to an image classification learning model based on ResNet. The artificial neural network internally uses various forms of gradient descent to compare the learning of each node and analyzes the pattern through a feature map. The classification model classifies voice data into three emotion types: angry, fearful, and surprised. Thus, a system that can detect situations around sensors and predict danger can be established. Despite the different emotional intensities of the base data and the sentence-based learning data, the established voice classification model demonstrated an accuracy of more than 77.2%. This model is applicable to various areas, including the prediction of crime situations and the management of work environments for emotional labor.
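The short-time Fourier transform step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`hann`, `stft_magnitude`), the frame length, the hop size, and the synthetic 1 kHz test tone are all assumptions chosen for illustration, and the wavelet transform and the ResNet classifier are not shown. The sketch only shows how a 1D acoustic signal becomes a 2D time-frequency array that could then be treated as an image.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (reduces spectral leakage)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitude(signal, frame_len=64, hop=32):
    """Return a 2D list [frame][frequency_bin] of STFT magnitudes.

    A naive DFT is used per frame so the sketch needs only the
    standard library; a real pipeline would use an FFT routine.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    spectrogram = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Cut one overlapping frame and apply the window.
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(n_bins):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            row.append(abs(acc))
        spectrogram.append(row)
    return spectrogram


# Hypothetical input: a pure 1 kHz tone sampled at 8 kHz (not real voice data).
sample_rate = 8000
tone_hz = 1000
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

spec = stft_magnitude(signal)

# Locate the strongest frequency bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
print(len(spec), len(spec[0]), peak_hz)
```

Each row of `spec` is one time frame and each column one frequency bin, so the array can be rendered as a grayscale image. For recorded speech, the rows would vary over time rather than repeat as they do for this constant tone.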


Convolutional neural networks; machine learning; deep learning; voice emotion; crime prediction; crime prevention; IoT

Cite This Article

H. Yoo, J. Baek and K. Chung, "CNN-based voice emotion classification model for risk detection," Intelligent Automation & Soft Computing, vol. 29, no. 2, pp. 319–334, 2021.

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.