For human-machine communication to become as effective as human-to-human communication, research on speech emotion recognition is essential. Among the models and classifiers used to recognize emotions, neural networks appear promising because of their ability to learn and their diversity of configurations. Building on the convolutional neural network, the capsule neural network (CapsNet), whose inputs and outputs are vectors rather than scalars, allows the network to capture the part-whole relationships that are specific to an object. This paper performs speech emotion recognition based on CapsNet. The corpora for speech emotion recognition were augmented by adding white noise and changing voices. The feature parameters at the input of the recognition system are mel spectrum images together with characteristics of the sound source, the vocal tract, and prosody. For the German emotional corpus EMO-DB, the average accuracy score for 4 emotions (neutral, boredom, anger, and happiness) is 99.69%. For the Vietnamese emotional corpus BKEmo, this score is 94.23% for 4 emotions (neutral, sadness, anger, and happiness). The accuracy score is highest when all the above feature parameters are combined, and it increases significantly when mel spectrum images are combined with features directly related to the fundamental frequency.
Today, robots are present in many places and areas where people gather. On mass production lines, robots ensure the uniformity of manufactured products. It can even be said that robots think no less well than humans when playing chess. However, robots' ability to express emotions through body language, facial expressions, and especially the voice remains limited. Emotional expression is a very subtle human behavior in which robots are, at present, far inferior. To reach the goal of human-machine interaction that resembles human-human interaction, considerable research is clearly still needed, and speech emotion recognition is one research direction that must be considered to achieve it.
Neural networks were designed to simulate the activity of human neurons, but they are even more remarkable for their ability to mimic human perception of the surrounding world. To identify objects through the visual system, people quickly and accurately capture information through the part-whole relationships of an object; this is why it is not difficult for humans to identify the same object in different poses. The part-whole relationship exists in different objects and is a specific feature of each of them. The capsule neural network (CapsNet) was proposed to focus on exploiting this feature [
From
The rest of the paper is organized as follows. Section 2 describes related work. The emotional corpora used in this paper and the data augmentation are presented in Section 3. Section 4 details the configuration of the capsule neural network for emotion recognition. The experimental results are provided in Section 5. Finally, Section 6 presents the discussion and conclusion.
Considering only the field of speech processing, there have been several CapsNet-based studies. In [
The literature [
For Vietnamese emotion recognition, SVM was used in [
In this paper, the EMO-DB corpus was also used to perform emotion recognition with CapsNet. Since its release, many emotional speech recognition studies have used this corpus. In the context of this paper, we review most of the research conducted over the last 10 or so years using EMO-DB (
The overall architecture of our proposed method is shown in
It is well known that for classification problems, and in machine learning in general, the more data available, the better the classification performance. Therefore, if data are insufficient, data augmentation is necessary. In addition, data augmentation is one way to avoid overfitting. Ocquaye et al. [
The example illustrates that after adding white noise, the average signal-to-noise ratio decreases by 11.95 dB. The signal-to-noise ratio is calculated using the standard definition

$$\mathrm{SNR} = 10 \log_{10} \frac{P_{\text{signal}}}{P_{\text{noise}}} \;\text{(dB)},$$

where $P_{\text{signal}}$ and $P_{\text{noise}}$ are the average powers of the speech signal and the noise, respectively. That means that, at constant signal power, the added white noise raised the average noise power by a factor of about $10^{11.95/10} \approx 15.7$.
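As an illustration, here is a minimal numpy sketch of white-noise augmentation consistent with this SNR definition. The use of Gaussian noise and the per-file target SNR are assumptions for illustration, since the paper does not specify the exact noise level applied to each file.

```python
import numpy as np

def add_white_noise(signal: np.ndarray, target_snr_db: float) -> np.ndarray:
    """Add Gaussian white noise so the result has roughly the requested SNR (dB)."""
    p_signal = np.mean(signal ** 2)                     # average signal power
    p_noise = p_signal / (10 ** (target_snr_db / 10))   # noise power implied by the SNR
    noise = np.random.normal(0.0, np.sqrt(p_noise), signal.shape)
    return signal + noise

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """SNR = 10 * log10(P_signal / P_noise), matching the formula above."""
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))
```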
The mel spectrum image is extracted from the sound file with the fixed size 260 × 260.
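A hedged sketch of extracting such a fixed-size mel image with librosa follows; the STFT settings, the hop-length heuristic, and the dB scaling are assumptions, since the paper does not state them.

```python
import librosa
import numpy as np

def mel_image(path: str, n_mels: int = 260, n_frames: int = 260) -> np.ndarray:
    """Extract a fixed-size (n_mels x n_frames) log-mel spectrogram image."""
    y, sr = librosa.load(path, sr=None)
    hop = max(1, len(y) // n_frames)   # hop chosen so the file yields ~n_frames frames
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Pad or truncate the time axis to exactly n_frames columns.
    return librosa.util.fix_length(mel_db, size=n_frames, axis=1)
```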
Besides the 260 mel spectral coefficients of the baseline, we added 8 parameters related to the fundamental frequency F0, obtained by derivation and normalization of the F0 contour.
We also added 28 other parameters related to the vocal tract, spectral characteristics, and voice quality, for 296 parameters in total. These 28 parameters are listed in the table below; a sketch of how several of them can be extracted follows the table.
| Parameters | Number of parameters |
|---|---|
| Intensity | 1 |
| Formants and their corresponding bandwidths | 8 |
| Harmonicity | 1 |
| Center of gravity | 1 |
| Central moment | 1 |
| Skewness | 1 |
| Kurtosis | 1 |
| LPC coefficients | 14 |
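Since these parameters follow Praat's definitions, they can be computed with the praat-parselmouth Python bindings. The sketch below covers several of them; the file name and the mid-utterance sampling point for formants are illustrative assumptions.

```python
import parselmouth  # praat-parselmouth: Python bindings for Praat

snd = parselmouth.Sound("utterance.wav")  # hypothetical file name

# Intensity (1 parameter): mean intensity in dB.
intensity = snd.to_intensity().values.mean()

# Formants F1-F4 and their bandwidths (8 parameters), Burg's method,
# sampled here at the middle of the utterance.
formant = snd.to_formant_burg()
t_mid = snd.duration / 2
formants = [formant.get_value_at_time(i, t_mid) for i in range(1, 5)]
bandwidths = [formant.get_bandwidth_at_time(i, t_mid) for i in range(1, 5)]

# Harmonicity (1 parameter): mean harmonics-to-noise ratio over voiced frames.
hnr = snd.to_harmonicity().values
harmonicity = hnr[hnr != -200].mean()  # Praat marks unvoiced frames with -200

# Spectral moments: center of gravity, skewness, kurtosis (3 parameters).
spectrum = snd.to_spectrum()
cog = spectrum.get_center_of_gravity()
skewness = spectrum.get_skewness()
kurtosis = spectrum.get_kurtosis()
```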
For the basic discrete-time model of speech production, the vocal tract's transfer function is the all-pole model

$$V(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}},$$

where $G$ is the gain, $a_k$ are the linear prediction (LPC) coefficients, and $p$ is the prediction order (here $p = 14$, matching the 14 LPC coefficients in the table).
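The 14 coefficients $a_k$ of this model can be estimated, for example, with librosa; this sketch assumes the whole utterance is analyzed at once, as the paper does not state its analysis window.

```python
import librosa

y, sr = librosa.load("utterance.wav", sr=None)  # hypothetical file name
# librosa.lpc returns the prediction polynomial [1, -a_1, ..., -a_p]
# (Burg's method); negate and drop the leading 1 to get the 14 coefficients a_k.
a = -librosa.lpc(y, order=14)[1:]
```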
The other parameters, such as harmonicity, center of gravity, central moment, skewness, and kurtosis, follow Praat's definitions and are presented in [
A capsule is a group of neurons in which the inputs and outputs of the capsule are vectors [
Before entering capsule $j$, the output $\mathbf{u}_i$ of capsule $i$ in the layer below is multiplied by a weight matrix $\mathbf{W}_{ij}$ to produce the prediction vector $\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\mathbf{u}_i$.

Capsule $j$ then receives the total input $\mathbf{s}_j = \sum_i c_{ij}\,\hat{\mathbf{u}}_{j|i}$, where the coupling coefficients $c_{ij} = \exp(b_{ij}) / \sum_k \exp(b_{ik})$ are determined by the dynamic routing process.

The output vector $\mathbf{v}_j$ of capsule $j$ is obtained by applying a nonlinear "squashing" function that preserves the orientation of the input while mapping its length into $[0, 1)$:

$$\mathbf{v}_j = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2}\,\frac{\mathbf{s}_j}{\|\mathbf{s}_j\|},$$

where $\mathbf{s}_j$ is the total input of capsule $j$.
The basis of the dynamic routing algorithm proposed in [
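As an illustration, here is a minimal numpy sketch of the squashing function and routing-by-agreement as described in the original CapsNet paper; the paper under review does not publish its implementation, so the tensor shapes and the 3 routing iterations are the usual defaults (assumptions).

```python
import numpy as np

def squash(s: np.ndarray, axis: int = -1) -> np.ndarray:
    """Squashing nonlinearity: keeps the orientation of s, maps its length into [0, 1)."""
    norm_sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + 1e-9)

def dynamic_routing(u_hat: np.ndarray, n_iter: int = 3) -> np.ndarray:
    """Routing-by-agreement over prediction vectors u_hat[i, j, :]
    (capsule i in the layer below -> capsule j in the layer above)."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                               # routing logits b_ij
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients c_ij
        s = (c[..., None] * u_hat).sum(axis=0)                # total inputs s_j
        v = squash(s)                                         # outputs v_j
        b += np.einsum('ijd,jd->ij', u_hat, v)                # agreement: b_ij += u_hat_j|i . v_j
    return v
```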
CapsNet computes the margin loss for class $k$ as

$$L_k = T_k \max(0, m^+ - \|\mathbf{v}_k\|)^2 + \lambda (1 - T_k) \max(0, \|\mathbf{v}_k\| - m^-)^2,$$

where $T_k = 1$ if an entity of class $k$ is present and $T_k = 0$ otherwise, and $m^+ = 0.9$, $m^- = 0.1$, and $\lambda = 0.5$ are the margins and the down-weighting factor of the original formulation. The total loss is the sum of the losses of all output capsules.
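A direct transcription of this loss as a numpy sketch; the values $m^+ = 0.9$, $m^- = 0.1$, and $\lambda = 0.5$ are the defaults of the original CapsNet paper, which we assume here.

```python
import numpy as np

def margin_loss(v_len: np.ndarray, t: np.ndarray,
                m_pos: float = 0.9, m_neg: float = 0.1, lam: float = 0.5) -> float:
    """Margin loss summed over classes; v_len holds ||v_k||, t holds T_k in {0, 1}."""
    per_class = (t * np.maximum(0.0, m_pos - v_len) ** 2
                 + lam * (1 - t) * np.maximum(0.0, v_len - m_neg) ** 2)
    return float(per_class.sum())
```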
The neural network used to recognize the four emotions in this paper consists of two parts: the first is a stack of 5 CNN layers, and the second is the capsule network. Taking the configuration for the case of 296 parameters as an example (a Keras sketch follows the list):

- Layer 1: Convolution 2D, input (296, 296, 1), output (148, 148, 64), kernel (3 × 3)
- Layer 2: Convolution 2D, input (148, 148, 64), output (74, 74, 16), kernel (2 × 2)
- Layer 3: Convolution 2D, input (74, 74, 16), output (37, 37, 16), kernel (2 × 2)
- Layer 4: Convolution 2D, input (37, 37, 16), output (19, 19, 16), kernel (2 × 2)
- Layer 5: Convolution 2D, input (19, 19, 16), output (10, 10, 16), kernel (2 × 2)
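The listed shapes can be reproduced with stride-2 convolutions and 'same' padding, as in this hedged Keras sketch; the activation and padding choices are assumptions, since the paper lists only shapes and kernel sizes.

```python
import tensorflow as tf
from tensorflow.keras import layers

# CNN front-end for the 296-parameter case; comments give the output shapes.
inputs = tf.keras.Input(shape=(296, 296, 1))
x = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(inputs)  # (148, 148, 64)
x = layers.Conv2D(16, 2, strides=2, padding='same', activation='relu')(x)       # (74, 74, 16)
x = layers.Conv2D(16, 2, strides=2, padding='same', activation='relu')(x)       # (37, 37, 16)
x = layers.Conv2D(16, 2, strides=2, padding='same', activation='relu')(x)       # (19, 19, 16)
x = layers.Conv2D(16, 2, strides=2, padding='same', activation='relu')(x)       # (10, 10, 16)
cnn_front_end = tf.keras.Model(inputs, x)
```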
The output of layer 5 is the input of the primary capsule in
The above configuration does not change for the remaining two cases (260 and 268 feature parameters); of course, the number of network parameters to be computed varies with the input size.
The emotion hypothesis is determined as follows: the recognized emotion is the class $\hat{k} = \arg\max_k \|\mathbf{v}_k\|$ whose output capsule has the greatest length, since the length of $\mathbf{v}_k$ represents the probability that emotion $k$ is present.
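In code, this decision rule is a one-liner; the label order used here is illustrative.

```python
import numpy as np

def predict_emotion(v: np.ndarray,
                    labels=("neutral", "boredom", "anger", "happiness")) -> str:
    """v: (n_classes, dim) output capsule vectors; pick the longest capsule."""
    return labels[int(np.argmax(np.linalg.norm(v, axis=1)))]
```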
EMO-DB is a German emotional corpus [
Using data augmentation, we obtained 2,148 files from 358 files. Of these 2,148 files, a subset of 195 files is used for testing: 42 files for neutral, 42 for boredom, 41 for happiness, and 70 for anger. After setting aside these 195 test files, the remaining files are split into 10 subsets (of slightly different sizes) for 10-fold cross-validation.
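A generic scikit-learn sketch of this split is below; note that the paper fixes the per-emotion test counts, so a stratified selection would be needed to match them exactly, and the file list and random seed are placeholders.

```python
from sklearn.model_selection import KFold, train_test_split

files = [...]  # the 2,148 augmented EMO-DB file paths (placeholder)
trainval, test = train_test_split(files, test_size=195, random_state=0)
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(trainval):
    pass  # train on trainval[train_idx], validate on trainval[val_idx]
```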
From the BKEmo corpus, the authors listened to and selected 5,584 files covering 4 emotions, 22 sentences, and 8 male and 8 female voices; these 5,584 files were used for emotion recognition. More details about the corpus can be found in [
| Dataset | Neutral | Boredom/Sadness | Anger | Happiness | Total |
|---|---|---|---|---|---|
| EMO-DB train | 432 | 444 | 692 | 385 | 1953 |
| EMO-DB validation | 42 | 42 | 70 | 41 | 195 |
| EMO-DB test | 42 | 42 | 70 | 41 | 195 |
| BKEmo train | 4572 | 4568 | 4568 | 4568 | 18277 |
| BKEmo validation | 508 | 508 | 508 | 508 | 2031 |
| BKEmo test | 126 | 127 | 127 | 127 | 507 |
The experiments were performed on a machine with the following configuration:

- CPU: Intel Core i7-7700 @ 3.60 GHz
- RAM: 32 GB
- GPU: GeForce GTX 1080 Ti/PCIe/SSE2 with 11 GB of RAM
- Hard disk: SSD 512 GB
For EMO-DB, the average training time for one fold is approximately 3.3 min, while for BKEmo this time is approximately 30 min. The accuracy scores (%) for each emotion for EMO-DB and BKEmo are given in
| Dataset | Parameter type | Neutral | Boredom/Sadness | Anger | Happiness | Average |
|---|---|---|---|---|---|---|
| EMO-DB | MELSPEC | 97.62 | 100 | 99.57 | 98.29 | 98.87 |
| EMO-DB | MELSPEC_F0 | 98.81 | 100 | 99.71 | 99.02 | 99.39 |
| EMO-DB | MELSPEC_F0_OTHER | 99.76 | 100 | 99.86 | 99.13 | 99.69 |
| BKEmo | MELSPEC | 97.17 | 95.00 | 96.95 | 78.57 | 91.92 |
| BKEmo | MELSPEC_F0 | 97.01 | 97.54 | 94.92 | 86.35 | 93.96 |
| BKEmo | MELSPEC_F0_OTHER | 96.77 | 96.19 | 95.94 | 88.02 | 94.23 |
First, for both corpora, EMO-DB and BKEmo, the average accuracy score increased as the number of parameters grew from 260 to 268 and then to 296. So, besides the mel spectrum, the parameters related to
Comparing only the accuracy scores on the EMO-DB corpus among the studies listed in
| References | Year | Model, classifier | Parameters | Accuracy score (%) |
|---|---|---|---|---|
| Mishra et al. [ | 2009 | GMM | MFCCs and energy | 63.78 |
| Luengo et al. [ | 2010 | k-means clustering | Prosody, voice quality, spectral and segmental features | 78.60 |
| Amarakeerthi et al. [ | 2011 | HMM | TLCSF-CC features | 72.85 |
| Shen et al. [ | 2011 | SVM | Energy, pitch, LPCC, MFCC, linear prediction mel cepstrum coefficients (LPCMCC) | 82.50 |
| Stuhlsatz et al. [ | 2011 | Generalized discriminant analysis (GerDA) based on DNN | Zero crossing rate, signal energy, logarithmic pitch, voice quality, spectral, mel spectrum, cepstral features | 85.10 |
| Pan et al. [ | 2012 | SVM | MFCC and mel-energy spectrum dynamic coefficients (MEDC) + energy | 95.10 |
| Jin et al. [ | 2014 | SVM | Intensity, loudness, MFCC, LSP (line spectral pairs), ZCR (zero crossing rate), | 83.10 |
| Gjoreski et al. [ | 2014 | SVM | 400 features extracted by openSMILE | 87.00 |
| Revathi et al. [ | 2018 | VQ/Fuzzy/MHMM/SVM | Gammatone filters spaced in equivalent rectangular bandwidth (ERB), MEL and BARK scales | 99.80 |
| Ocquaye et al. [ | 2019 | Dual exclusive attentive transfer (DEAT) convolutional neural network | Spectrogram | 67.79 |
| Mao et al. [ | 2019 | SGMM-HMM (SGMM: subspace-based GMM) | 15-dimensional MFCCs with first- and second-order derivatives + pitch + voicing probability | 88.15 |
| Seo et al. [ | 2020 | VACNN (visual attention convolutional neural network) | Log-mel spectrogram | 86.92 |
| Lech et al. [ | 2020 | AlexNet (real-time SER) | Spectrograms converted into RGB | 82.00 |
| Haider et al. [ | 2021 | SVM | eGeMAPS (a total of 88 features) | 76.90 |
| Chauhan et al. [ | 2021 | CNN | Log-mel spectrograms | 72.02 |
| MELSPEC (ours) | 2021 | Capsule network | Mel spectrogram | 97.62 |
| MELSPEC_F0 (ours) | 2021 | Capsule network | Mel spectrogram, F0-related parameters | 98.81 |
| MELSPEC_F0_OTHER (ours) | 2021 | Capsule network | Mel spectrogram, F0-related and other parameters (vocal tract, spectral, voice quality) | 99.76 |
Vietnamese is a tonal language with 6 tones; changing the tone of a syllable changes the meaning of the syllable. The variation rule of the fundamental frequency
In [
In summary, our experiments on emotion recognition using a capsule neural network with parameters related to the mel spectrum,