Emotion Recognition with Capsule Neural Network

For human-machine communication to be as effective as human-to-human communication, research on speech emotion recognition is essential. Among the models and classifiers used to recognize emotions, neural networks appear promising because of their ability to learn and the diversity of their configurations. Following the convolutional neural network, the capsule neural network (CapsNet), whose inputs and outputs are vectors rather than scalar quantities, allows the network to determine the part-whole relationships that are specific to an object. This paper performs speech emotion recognition based on CapsNet. The corpora for speech emotion recognition have been augmented by adding white noise and changing voices. The feature parameters at the input of the recognition system are mel spectrum images along with the characteristics of the sound source, vocal tract and prosody. For the German emotional corpus EMO-DB, the average accuracy score for 4 emotions (neutral, boredom, anger and happiness) is 99.69%. For the Vietnamese emotional corpus BKEmo, this score is 94.23% for 4 emotions (neutral, sadness, anger and happiness). The accuracy score is highest when all the above feature parameters are combined, and it increases significantly when mel spectrum images are combined with the features directly related to the fundamental frequency.


Introduction
Today, robots are present in many places and in many areas of human activity. In mass production lines, robots ensure the uniformity of manufactured products. It can be said that robots think no less well than humans when playing chess. However, the ability of robots to express emotions through body language, facial expressions, and especially voice is still limited. Emotional expression is a very subtle human behavior, and at present robots remain far inferior to humans in this respect. Considerable research is clearly still needed before human-machine interaction resembles human-human interaction. Speech emotion recognition is one research area that must be addressed to achieve that goal.
Neural networks, in addition to being designed to simulate the activity of human neurons, are notable for their ability to mimic human perception of the surrounding world. To identify objects through the visual system, people quickly and accurately capture information through the part-whole relationships of an object. Therefore, humans have little difficulty identifying the same object in different poses. The part-whole relationship exists in different objects and is also a specific feature of those objects. The capsule neural network (CapsNet) was proposed to exploit this feature [1][2][3]. When explaining how the capsule neural network exploits this feature, authors often take the example of a human face. On a human face, the eyes and nose are positioned above the mouth, and the nose and mouth lie on the vertical axis of symmetry of the face. Such spatial relationships help to identify a human face correctly, and these relationships are equivariant. This paper uses a capsule neural network, originally applied to image recognition, to perform speech emotion recognition. The input of the recognition system is the mel spectrum. As an example of a part-whole relationship in the mel spectrum of a speech signal, consider a syllable in which a fricative consonant [s] precedes a voiced sound [ε]. The mel spectrum of the syllable [sε] is given in Fig. 1.
Fig. 1 shows that, for a fricative sound, the spectral energy of the mel spectrum is concentrated mainly in the high-frequency domain (the upper part of the mel spectrum), whereas for voiced sounds, the spectral energy is concentrated more on the low-order formants (the lower part of the mel spectrum). Thus, in the mel spectrum, the regions of energy concentration for the fricative sound and the voiced sound are related spatially as upper left to lower right. This also means that the part-whole relationship that CapsNet exploits exists in the mel spectrum of the speech signal.
The rest of the paper is organized as follows. Section 2 describes related work. The emotional corpora used in this paper, and data augmentation are presented in Section 3. Section 4 details the configuration of the capsule neural network for emotion recognition. The experimental results are provided in Section 5. Finally, Section 6 presents the discussion and conclusion.

Related Work
Within the field of speech processing, there have been some CapsNet-based studies. In [4], the capsule network was applied to capture the spatial relationship and pose information of speech spectrogram features along both the frequency and time axes. The authors showed that an end-to-end speech recognition system with capsule networks, evaluated on a dataset of one-second speech commands, achieves better results on both clean and noise-added tests than baseline convolutional neural network models. A capsule network for low-resource spoken language understanding was proposed for command-and-control applications in [5]. For small quantities of data, the proposed model is shown to significantly outperform the previous state-of-the-art model.
The literature [6][7][8][9] provided an overview of speech emotion recognition, including models, used classifiers and corpus, specific parameters and corresponding recognition accuracy. The different classifiers may be Gaussian mixture models (GMM), support vector machines (SVM), artificial neural networks (ANN), k-nearest neighbor classifier, Bayes classifier, linear discriminant analysis with Gaussian probability distribution, and hidden Markov models (HMM). The feature parameters can be classified into 3 groups. The first group includes parameters directly related to the sound source. The second group is the parameters of the vocal tract, while the third group is related to the prosody. For the sound source, the feature parameters may be LP (linear prediction) residual energy or LP residual, glottal excitation signal. The parameters of the vocal tract include MFCC (mel frequency cepstral coefficients), LPCC (linear predictive cepstral coefficients), and RASTA (Relative Spectra) PLP (perceptual linear predictive) coefficients, formants and their bandwidth and spectral features. The prosodic parameters consist of pitch, energy, duration and voice quality features. In addition to these parameters, statistical features such as mean, StdDev, min, and max have been used. Until recently, GMM, HMM, and SVM were still used to recognize emotional speech [10][11][12][13][14][15][16][17] or GMM and DNN, and GMM and SVM have been combined [18][19][20]. Sequential minimal optimization (SMO), J48, and random forest have been used for testing the adaptive data boosting (ADB) technique [21]. 
For ANN, it can be seen that the models used are ANN with 3 layers [22], DNN [23], progressive neural network [24], recurrent neural network [12,25], backpropagation neural network [26], deep convolutional recurrent network [27], coupled deep convolutional neural network (CDCNN) [28], deep belief networks (DBNs) [29], combination of SVM and belief networks [30], CNN [31], convolutional recurrent neural network (CRNN) [32], deep learning [33,34], a combination of convolutional and recurrent layers for reusing ASR (automatic speech recognition) network [35] and LSTM (long short-term memory) network [36]. A number of issues have also been raised for emotion recognition of speech, such as transfer learning [37][38][39], using cross-corpus [27,40], and adversarial training [41].
For Vietnamese emotion recognition, SVM was used in [42] to classify emotions using the EEG signal. An average accuracy of 70.5% was achieved in real-time for five emotional states. Research in [43] used the GMM model to recognize 6 emotions: happiness, neutrality, sadness, surprise, anger, and fear. In this research, two male voices and two female voices expressed 6 emotions for 6 different sentences. The feature parameters were MFCC, short-term energy, pitch, and formants. The highest recognition score was 96.5% for neutral emotion, and the lowest was 76.5% for sad emotion. In [44], the corpus included 6 voices and 20 sentences and the same number of emotions as in [43]. The recognition score on the Vietnamese language was 96.5% for neutrality and dropped to 84.1% for surprise using SVM with Im-SFLA (improved shuffled frog leaping algorithm). The authors in [45,46] used GMM and CNN to recognize 4 emotions with Vietnamese emotional corpus BKEmo. Details of this study and the corpus BKEmo are briefly presented in the following sections.
In this paper, the EMO-DB corpus was also used to perform emotion recognition using CapsNet. Since its appearance, many emotional speech recognition studies have used this corpus. In the context of this paper, we review most of the research conducted with EMO-DB over the last 10 years or more (Tab. 4). According to this review, most studies with EMO-DB use the models, classifiers and feature parameters mentioned above. In addition, EMO-DB is also used in cross-corpus and transfer learning studies [47].

Proposed Method
The overall architecture of the proposed method is shown in Fig. 2. It is composed of a data augmentation module, a parameter extraction module, and a CapsNet module. These modules are presented as follows.

Data Augmentation
It is well known that for classification problems, and for machine learning in general, the more data available, the better the classification performance. Therefore, if data are insufficient, data augmentation is necessary. In addition, data augmentation is one method to avoid overfitting. Ocquaye et al. [48] implemented data augmentation for EMO-DB by adding background noise as proposed in [49]. In our case, data augmentation by adding white noise and changing voices was applied to BKEmo and, in particular, to EMO-DB, because the existing EMO-DB corpus is not sufficiently large.

Adding White Noise
The example illustrates that after adding white noise, the average signal-to-noise ratio decreases by 11.95 dB. The signal-to-noise ratio is calculated using Eq. (1):

$$\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10} \frac{P_S}{P_N} \tag{1}$$

where $P_S$ is the power of the signal and $P_N$ is the power of the noise. Since the signal power is the same before and after adding noise, we have

$$10 \log_{10} \frac{P_N(\text{after adding noise})}{P_N(\text{before adding noise})} = 11.95\ \mathrm{dB}.$$

That means that, on average, $P_N(\text{after adding noise}) \approx 15.67 \times P_N(\text{before adding noise})$. We used the Praat toolkit [50] for this process.
Fig. 4 illustrates changing the voices for the augmented corpus. For a male voice, the formants are raised towards higher frequencies so that the male voice is closer to a female voice. For a female voice, the formants are lowered towards lower frequencies so that the voice is closer to a male voice. The formant shift is performed with the Praat toolkit [50]. For raising the formants, the coefficient used in Praat is 1.1, while for lowering the formants, the factor used in Praat is 0.909.
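The relation between Eq. (1) and the 11.95 dB figure can be checked numerically. The following is a minimal sketch in Python with NumPy, not the Praat procedure used in this paper: the function names, the 220 Hz test tone, the 16 kHz sampling rate and the 20 dB target are all hypothetical choices made only for illustration.

```python
import numpy as np


def power(x):
    """Mean power of a signal: the average of its squared samples."""
    return float(np.mean(np.square(x)))


def snr_db(p_signal, p_noise):
    """Signal-to-noise ratio in dB, as in Eq. (1)."""
    return 10.0 * np.log10(p_signal / p_noise)


def add_white_noise(x, target_snr_db, rng):
    """Return x plus zero-mean Gaussian white noise at roughly target_snr_db."""
    p_noise = power(x) / (10.0 ** (target_snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=x.shape)
    return x + noise, noise


# An 11.95 dB drop in SNR at constant signal power means that the noise
# power grows by this factor (about 15.67).
noise_ratio = 10.0 ** (11.95 / 10.0)

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)  # hypothetical 220 Hz test tone
noisy, noise = add_white_noise(clean, target_snr_db=20.0, rng=rng)
measured_snr = snr_db(power(clean), power(noise))  # close to 20 dB
```

Because the noise is random, the measured SNR only approximates the target; the approximation improves with the number of samples.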

Parameter Extraction
The mel spectrum image is extracted from each sound file with a fixed size of 260 × 260, corresponding to 260 mel spectral coefficients × 260 frames. The number of frames is set to 260 because this is the average number of frames for the WAV files in both the EMO-DB and BKEmo corpora. This parameter set is named MELSPEC.
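Since utterances differ in length, a fixed 260 × 260 input implies that each mel spectrum must be brought to 260 frames. The text does not state how this is done; the sketch below assumes cropping from the start for long files and zero-padding on the right for short ones. The function name `fix_num_frames` is hypothetical.

```python
import numpy as np

N_MELS = 260    # mel spectral coefficients per frame (from the text)
N_FRAMES = 260  # fixed number of frames (from the text)


def fix_num_frames(mel, n_frames=N_FRAMES):
    """Crop or zero-pad a (n_mels, T) mel spectrum to (n_mels, n_frames).

    Cropping from the start and padding on the right are assumptions;
    the source only specifies the target size.
    """
    n_mels, t = mel.shape
    if t >= n_frames:
        return mel[:, :n_frames]
    padded = np.zeros((n_mels, n_frames), dtype=mel.dtype)
    padded[:, :t] = mel
    return padded


short_fixed = fix_num_frames(np.ones((N_MELS, 200)))  # padded to 260 frames
long_fixed = fix_num_frames(np.ones((N_MELS, 300)))   # cropped to 260 frames
```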
Besides the 260 mel spectral coefficients used as the baseline, we added 8 parameters related to the fundamental frequency $F_0$. This parameter set is named MELSPEC_F0. These 8 parameters are $F_0$ and 7 variants of $F_0$, including the derivative of $F_0$; the normalization of $F_0$ by the average value of $F_0$ for each file; the normalization of $F_0$ by the minimum and maximum of $F_0$ for each file; and the normalization of $F_0$ by the mean and standard deviation of $F_0$ for each file. We also added 28 other parameters related to the vocal tract, spectral characteristics and voice quality, so there are 296 parameters in total. These 28 parameters are listed in Tab. 1. This parameter set is named MELSPEC_F0_OTHER.
For the basic discrete-time model of speech production, the vocal tract's transfer function is $H(z)$. $H(z)$ is a $p$th-order all-pole rational function of the form [51] given by Eq. (2):

$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} \tag{2}$$

$A(z)$ is an inverse filter for the vocal tract system and is often called the prediction error polynomial or the linear predictive coding (LPC) polynomial. $\{a_k;\ k = 1, 2, \ldots, p\}$ are the vocal tract system parameters, and in this paper, they are the LPC coefficients.
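The $F_0$ variants named above can be sketched per file as follows. This is an illustration under stated assumptions, not the Praat-based extraction used in this paper: the derivative is approximated with finite differences, normalization by the average is taken to be division by the mean, and the contour is assumed to contain voiced frames only. The function name `f0_variants` and the example contour are hypothetical.

```python
import numpy as np


def f0_variants(f0):
    """Stack F0 with four of its per-file variants, one row per feature."""
    f0 = np.asarray(f0, dtype=float)
    delta = np.gradient(f0)                             # frame-to-frame derivative
    by_mean = f0 / f0.mean()                            # normalized by the file average
    min_max = (f0 - f0.min()) / (f0.max() - f0.min())   # scaled to [0, 1]
    z_score = (f0 - f0.mean()) / f0.std()               # zero mean, unit std
    return np.stack([f0, delta, by_mean, min_max, z_score])


# Hypothetical rising contour in Hz, one value per frame.
example = f0_variants([100.0, 110.0, 120.0, 130.0])
```

Each row has one value per frame, so the rows can be appended to the mel spectral coefficients of the corresponding frames.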
Other parameters, such as harmonicity, center of gravity, central moment, skewness, and kurtosis, were explained based on Praat and are presented in [45]. ANOVA and the T-test were used in [52] to evaluate the corpus BKEmo. A P-value threshold of 0.05 is used in the majority of cases [53], and this value is also used as the cutoff for significance in our case. If the P-value is less than 0.05, a significant difference exists for a pair of emotions. The T-test results from [52] showed that the emotion pairs of BKEmo are best distinguished for most of the above feature parameters. All feature parameters are calculated using the Praat toolkit [50].

Capsule Neural Network
A capsule is a group of neurons in which the inputs and outputs of the capsule are vectors [1]. To illustrate the basic activity of the capsule neural network used in this paper, we take the example of a capsule neural network consisting of $M$ capsules in the higher level and $N$ capsules in the lower level, as shown in Fig. 5. Capsule 1 in the higher level has an output vector $\mathbf{v}_1$. This vector encodes the existence and pose of object 1, for example. Capsule 1 has $N$ input vectors corresponding to the $N$ outputs of capsules $1, \ldots, i, \ldots, N$ in the lower level. Assume that the output vectors of the $N$ capsules in the lower level are $\mathbf{u}_1, \ldots, \mathbf{u}_i, \ldots, \mathbf{u}_N$. Assume that the output vector $\mathbf{u}_i$ of capsule $i$ in the lower level encodes the existence and pose of part $i$, and that part $i$ belongs to object $j$ described by capsule $j$ in the higher level. Before entering capsule $j$ at the higher level, the output vector $\mathbf{u}_i$ is multiplied by the weight matrix $W_{ij}$, which encodes the spatial relationship between part $i$ and object $j$, and becomes the vector $\hat{\mathbf{u}}_{j|i} = W_{ij}\mathbf{u}_i$. The vector $\hat{\mathbf{u}}_{j|i}$ is multiplied by the scalar weight $c_{ij}$ to become $c_{ij}\hat{\mathbf{u}}_{j|i}$ before actually entering capsule $j$. The scalar weight $c_{ij}$ is determined by a routing algorithm. Similarly, capsule 1 and capsule $N$ at the lower level provide the vectors $c_{1j}\hat{\mathbf{u}}_{j|1}$ and $c_{Nj}\hat{\mathbf{u}}_{j|N}$ entering capsule $j$, respectively, where $\hat{\mathbf{u}}_{j|1} = W_{1j}\mathbf{u}_1$ and $\hat{\mathbf{u}}_{j|N} = W_{Nj}\mathbf{u}_N$. Capsule $j$ in the higher level performs the sum:

$$\mathbf{s}_j = \sum_{i=1}^{N} c_{ij}\,\hat{\mathbf{u}}_{j|i}$$

The vector $\mathbf{s}_j$ of capsule $j$ is passed through the squash function:

$$\mathbf{v}_j = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2}\,\frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}$$

The output vector $\mathbf{v}_j$ of capsule $j$ encodes the existence and pose of object $j$. The coupling coefficients (routing coefficients) $c_{ij}$ between capsule $i$ and all the capsules in the layer above sum to 1. These coefficients are determined by a "routing softmax":

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{k=1}^{M} \exp(b_{ik})}$$

where $b_{ij}$ are the log prior probabilities that capsule $i$ should be coupled to capsule $j$.
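The squash function and the routing softmax described above can be illustrated with a short NumPy sketch. The function names and the small constant `eps`, added to avoid division by zero, are assumptions of this sketch rather than part of the model description.

```python
import numpy as np


def squash(s, eps=1e-9):
    """Keep the direction of s and map its length into the interval [0, 1)."""
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def routing_softmax(b):
    """Coupling coefficients c_ij: softmax of b_ij over the higher-level capsules j."""
    e = np.exp(b - b.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)


s = np.array([3.0, 4.0])       # a vector of length 5
v = squash(s)                  # same direction, length 25/26
c = routing_softmax(np.zeros((3, 2)))  # zero logits give uniform coupling
```

A long input vector is squashed to a length just below 1, and a short one to a length near 0, so the length of $\mathbf{v}_j$ can be read as the degree to which object $j$ is present.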
The basis of the dynamic routing algorithm proposed in [2] is that a capsule in the lower level sends its output to the higher-level capsule that agrees with that output. After a certain number of routing iterations (usually 3), the algorithm yields a set of routing coefficients that best match the outputs of the capsules in the lower level with the outputs of the capsules in the higher level.
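Routing by agreement can be sketched in NumPy as follows. This is an illustrative reading of the algorithm in [2], not the implementation used in our experiments; the function names, the array layout `(N, M, D)` for the prediction vectors and the toy example are assumptions.

```python
import numpy as np


def squash(s, eps=1e-9):
    """Keep the direction of s and map its length into the interval [0, 1)."""
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def dynamic_routing(u_hat, n_iters=3):
    """Route N lower-level capsules to M higher-level capsules.

    u_hat has shape (N, M, D): the prediction of lower capsule i for
    higher capsule j. Returns the outputs v (M, D) and couplings c (N, M).
    """
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))               # log priors b_ij
    for _ in range(n_iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)       # routing softmax
        s = np.einsum("ij,ijd->jd", c, u_hat)      # weighted sum per higher capsule
        v = squash(s)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)  # agreement raises the logit
    return v, c


# Toy case: all 10 parts agree on capsule 0, but their predictions for
# capsule 1 cancel each other out.
u_hat = np.zeros((10, 2, 2))
u_hat[:, 0, 0] = 1.0
u_hat[:, 1, 0] = np.where(np.arange(10) % 2 == 0, 1.0, -1.0)
v, c = dynamic_routing(u_hat)
```

In the toy case the coupling coefficients shift towards capsule 0, whose output vector becomes long, while the output of capsule 1 stays near zero.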
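CapsNet computes the margin loss for class $k$ as follows:

$$L_k = T_k \max(0,\ m^{+} - \|\mathbf{v}_k\|)^2 + \lambda\,(1 - T_k)\,\max(0,\ \|\mathbf{v}_k\| - m^{-})^2$$

where $T_k = 1$ if an entity of class $k$ is present, $m^{+} = 0.9$ and $m^{-} = 0.1$. The weight $\lambda = 0.5$.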
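A small numerical sketch of the margin loss is given here, using the constants stated above. Summing the per-class losses into one value follows [2] and is an assumption with respect to our own training setup; the function name and the example capsule lengths are hypothetical.

```python
import numpy as np

M_PLUS, M_MINUS, LAMBDA = 0.9, 0.1, 0.5  # constants from the text


def margin_loss(v_norms, target):
    """Sum of the per-class margin losses for one utterance.

    v_norms: lengths of the output capsule vectors, one per class.
    target:  index of the class that is present (T_k = 1 for this class).
    """
    v_norms = np.asarray(v_norms, dtype=float)
    t = np.zeros_like(v_norms)
    t[target] = 1.0
    present = t * np.maximum(0.0, M_PLUS - v_norms) ** 2
    absent = LAMBDA * (1.0 - t) * np.maximum(0.0, v_norms - M_MINUS) ** 2
    return float(np.sum(present + absent))


confident = margin_loss([0.95, 0.05, 0.05, 0.05], target=0)  # no penalty
uncertain = margin_loss([0.50, 0.30, 0.10, 0.10], target=0)  # 0.16 + 0.02
```

The loss is zero once the correct capsule is longer than $m^{+}$ and all other capsules are shorter than $m^{-}$.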
The output of layer 5 is the input of the primary capsule layer in Fig. 6. The configuration of the CapsNet is inspired by the CapsNet configuration proposed in [2], but the parameters have been changed to suit our case. Fig. 6 illustrates the primary and secondary capsules (capsule layer). In essence, the primary capsule layer is similar to a convolutional layer. This layer reduces the spatial dimension from 10 × 10 to 5 × 5 by using a 9 × 9 kernel with stride 2 and no padding. The primary capsule layer uses 8 × 32 kernels to generate 32 8-D capsules, i.e., 8 output neurons are grouped together to form a capsule. The output of the primary capsule layer is reshaped to (800 (= 5 × 5 × 32), 8). Next, the capsule layer applies a transformation matrix $W_{ij}$ with shape 8 × 16 to convert each 8-D capsule from the output of the primary capsule layer to a 16-D capsule for each of the four emotions. $W_{ij}$ is a weight matrix between each $\mathbf{u}_i$, $i \in (1, 32 \times 5 \times 5)$, in the primary capsules and $\mathbf{v}_j$, $j \in (1, 4)$. Dynamic routing is performed between the primary capsules and the capsule layer [2].
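The shapes involved in the step from the primary capsules to the emotion capsules can be traced with a short NumPy sketch. Random values stand in for trained weights, and storing all matrices $W_{ij}$ in one array of shape (800, 4, 8, 16), applied to row vectors, is a layout assumption of this sketch.

```python
import numpy as np

N_PRIMARY = 5 * 5 * 32  # 800 primary capsules
D_PRIMARY = 8           # dimension of a primary capsule
N_EMOTIONS = 4          # one output capsule per emotion
D_EMOTION = 16          # dimension of an emotion capsule

rng = np.random.default_rng(0)
u = rng.normal(size=(N_PRIMARY, D_PRIMARY))  # primary capsule outputs u_i
W = 0.01 * rng.normal(size=(N_PRIMARY, N_EMOTIONS, D_PRIMARY, D_EMOTION))

# Prediction vector of every primary capsule i for every emotion capsule j.
u_hat = np.einsum("id,ijdk->ijk", u, W)      # shape (800, 4, 16)

n_weights = W.size                           # 800 * 4 * 8 * 16 = 409,600
```

The array `u_hat` is the input on which dynamic routing operates, and the 409,600 entries of `W` are the trainable weights of the capsule layer under this layout.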
The above configuration does not change for the remaining 2 cases (260 and 268 feature parameters). Of course, the number of corresponding parameters to be calculated varies depending on 260 and 268 feature parameters.
The emotion hypothesis is determined as follows:

$$\hat{e} = \arg\max_{j \in \{1, \ldots, 4\}} \|\mathbf{v}_j\|$$

Results and Discussions

EMO-DB and BKEmo Datasets
EMO-DB is a German emotional corpus [54]. The corpus was built using a simulation method with 10 professional actors (5 male and 5 female) and includes 7 emotions: neutral, anger, fear, happiness, sadness, disgust and boredom. There are 10 sentences with which the actors express the different emotions. Each emotion is expressed 1 to 6 times. Along with EMO-DB, the Vietnamese emotional corpus BKEmo is also used in this paper. The Vietnamese data used for recognition are extracted from the BKEmo corpus developed at Hanoi University of Science and Technology. BKEmo is built according to the simulation method for four emotions: neutral, sadness, anger and happiness. From EMO-DB, the emotions neutral, boredom, anger and happiness are chosen because these are the emotions with the largest numbers of files. The number of emotions in BKEmo is also four, and thus the architecture of the emotion recognition system remains the same for both languages. Using data augmentation, we obtained 2,148 files from the 358 EMO-DB files of these four emotions. Of the 2,148 files, a subset of 195 files is used for testing: 42 files for neutral, 42 for boredom, 41 for happiness, and 70 for anger. After these 195 files are set aside for testing, the remaining files are split into 10 subsets of slightly different sizes for 10-fold cross-validation.
From the BKEmo corpus, the authors listened to and selected 5,584 files for 4 emotions, with 22 sentences and 8 male and 8 female voices; these 5,584 files were used for emotion recognition. More details about the corpus can be found in [45,46]. The 5,584 files are divided into 11 parts: one part (507 files) is used for testing, and the remaining 5,077 files are used for training and validation. Using data augmentation, these 5,077 files were augmented into 20,308 files. In no case does the test set contain files used for training and validation. The augmented corpus is split into 10 subsets of slightly different sizes for 10-fold cross-validation. The data distribution for each emotion is shown in Tab. 2.
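The file counts above can be cross-checked with a few lines of Python. The helper `fold_sizes` is hypothetical and only shows one way in which 10 subsets of slightly different sizes arise; it does not reproduce the actual assignment of files to folds.

```python
def fold_sizes(n_files, k=10):
    """Sizes of k folds that differ by at most one file."""
    base, extra = divmod(n_files, k)
    return [base + 1] * extra + [base] * (k - extra)


# EMO-DB: 2,148 augmented files, of which 195 are held out for testing.
emodb_test = 42 + 42 + 41 + 70        # neutral, boredom, happiness, anger
emodb_train = 2148 - emodb_test       # 1,953 files for cross-validation
emodb_folds = fold_sizes(emodb_train)

# BKEmo: 5,584 files, 507 for testing, the rest augmented fourfold.
bkemo_train_original = 5584 - 507     # 5,077 files
bkemo_train = 4 * bkemo_train_original  # 20,308 files
bkemo_folds = fold_sizes(bkemo_train)
```

Under this scheme, the EMO-DB folds contain 195 or 196 files and the BKEmo folds contain 2,030 or 2,031 files.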

Experiment Results
The experiments were performed on a machine with the following configuration: CPU: Intel Core i7-7700 @ 3.60 GHz

Discussions
First, for both corpora, EMO-DB and BKEmo, the average accuracy score increased when the number of parameters increased from 260 to 268 and then to 296. Thus, besides the mel spectrum, the parameters related to $F_0$ and its variants, the vocal tract, spectral characteristics and voice quality contributed to increasing the accuracy of the speech emotion recognition system. Comparing only the accuracy scores for the EMO-DB corpus among the studies listed in Tab. 4, the average accuracy score in our case is higher than that of the vast majority of available studies (the exception is [55], 99.8% vs. 99.69%). German is not a tonal language. The addition of parameters directly related to $F_0$ increased accuracy because the pattern of $F_0$ variation contributes significantly to emotional expression. Vietnamese is a tonal language with 6 tones. In Vietnamese, changing the tone of a syllable changes its meaning, and the pattern of variation of the fundamental frequency $F_0$ determines which of the 6 tones is realized. Not only for Vietnamese but also for other languages such as German, the $F_0$ contour of a sentence helps determine the intonation of that sentence, and the intonation of a sentence is closely related to emotional expression. The recognition result for the set of 268 parameters, in which $F_0$ and 7 variants of $F_0$ are added, is significantly higher than that of the baseline model with 260 parameters for both corpora. This is consistent with the importance of $F_0$ noted above. For the set of 296 parameters, the accuracy score increases compared to the set of 268 parameters, but not considerably, which also reinforces the role of $F_0$ in emotion recognition for Vietnamese in particular and for other languages (such as German in our case) in general.
In [45,46], the GMM and DCNN models were used to recognize Vietnamese emotions with the same corpus BKEmo containing only the 5,584 original files, which means that there was no data augmentation for the corpus. The maximal number of parameters in [45] was 87. Therefore, the corpora of these two models (GMM and CapsNet) are not exactly the same, and the numbers of parameters of the two models also differ. The average accuracy score for [45] is 93.12% vs. 94.23% for this CapsNet model. With the same set of 296 parameters but without data augmentation, the DCNN achieved an average recognition accuracy for the 4 emotions of 88.01% vs. 94.23% for this CapsNet model. The common point of the two models is that the recognition score increases significantly when parameters related to $F_0$ are added. The results of this paper also indicate that the CapsNet model is suitable for speech emotion recognition in which the input parameters can be viewed as a large-sized image.

Conclusions
In summary, our experiments on emotion recognition using a capsule neural network with parameters related to the mel spectrum, $F_0$ and its variants, the vocal tract, and spectral characteristics showed a clear advantage in recognition scores compared to many other models and classifiers. A general problem for recognition systems is that, in real environments, the recognition score may decrease because the actual data differ from the training data. Transfer learning is one research direction that addresses this issue in emotion recognition. We will next apply our CapsNet-based model to multi-lingual speech emotion recognition, which is our upcoming research direction.
Funding Statement: The authors received no specific funding for this study.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.