Classification is the last, and usually the most time-consuming, step in recognition. Most recently proposed classification algorithms adopt machine learning (ML) as the main approach, regardless of time consumption. This study proposes a statistical feature classification cubic spline interpolation (FC-CSI) algorithm that classifies emotions in speech using a curve-fitting technique. FC-CSI is utilized in a speech emotion recognition system (SERS). The idea is to sketch the cubic spline interpolation (CSI) for each audio file in a dataset and the mean cubic spline interpolations (MCSIs) representing each emotion in the dataset. The CSI is generated by connecting the features extracted from each file in the feature extraction phase, and the MCSI by connecting the mean features of 70% of the files of each emotion in the dataset. Points on the CSI are considered the newly generated features. To classify an audio file by emotion, the Euclidean distance (ED) is found between its CSI and the MCSIs of all emotions in the dataset, and the file is assigned to the emotion of the nearest MCSI. The three datasets used in this work are the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), Berlin (Emo-DB), and Surrey Audio-Visual Expressed Emotion (SAVEE). The proposed work shows fast classification and highly accurate results. The classification accuracy, i.e., the proportion of samples assigned to the correct class, using FC-CSI without feature selection (FS) was 69.08%, 92.52%, and 89.1% on RAVDESS, Emo-DB, and SAVEE, respectively. The results of the proposed method were compared to those of a designed neural network called SER-NN. Comparisons were made with and without FS. Without an FS algorithm, FC-CSI outperformed SER-NN on Emo-DB and SAVEE, and underperformed on RAVDESS. Experiments also showed that FC-CSI operated faster than the same system utilizing SER-NN.
Numeric data are often difficult to analyze, and functions to link the data are hard to find. In 1998, cubic spline interpolation (CSI) was proposed, which connects pairs of data points using unique cubic polynomials, generating a continuous and smooth curve [
Interpolation has many applications, all with the purpose of smoothing. The fundamental concept of CSI is based on the engineer’s tool used to draw smooth curves through several points. The spline consists of weights attached to a flat surface, and a flexible strip is bent across each of these weights, resulting in a pleasingly smooth curve. The mathematical spline is similar in principle. The points, in this case, are numeric data. The weights are the coefficients of the cubic polynomials used to interpolate the data. These coefficients “bend” a line so that it passes through each data point without erratic behavior or breaks in continuity [
We propose an algorithm with third-order CSI curves to classify emotions in three datasets. We are aware of no previous use of CSI in classification, although some research has adopted a similar idea, discovering the wave behavior of an audio signal to classify emotions using a convolutional neural network (CNN) rather than CSI. That related work discovered signal behavior directly from raw data, whereas FC-CSI works with features extracted from raw data. A network based on a time-distributed CNN, a CNN, and an RNN was proposed, without traditional feature extraction, to classify emotions, achieving 88.01% classification accuracy on the seven emotions of Emo-DB [
CSI has been used in many applications other than classification. Third-order polynomial CSI was used to fit the stress-strain curve of a standardized hot-rolled flat specimen of SAE 1020 steel [
We propose a classification method utilizing CSI. Classification methods are used in applications such as speech emotion recognition. Despite the high accuracy achieved by classification algorithms adopting ML methods, their training time is long. We aim to minimize this and therefore propose a classification algorithm that uses cubic spline interpolation, a new direction in classification research.
The limitation of FC-CSI is the conversion of features from 1D to 2D. To draw a curve on a 2D plane, features must be represented in 2D coordinates, i.e., X and Y. In the proposed algorithm, the feature value is taken as the Y-coordinate, and the feature’s position in the sequence as the X-coordinate. The spacing between features on the X-axis is determined by trial and error, which affects accuracy, and searching for the best spacing increases the system’s time consumption.
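A minimal sketch of this 1D-to-2D conversion is shown below; the function name and the spacing parameter `dx` are hypothetical illustrations, not identifiers from the paper.

```python
import numpy as np

def features_to_points(features, dx=1.0):
    """Map a 1D feature vector to 2D points for curve fitting.

    The y-coordinate is the feature value; the x-coordinate is the
    feature's position in the sequence scaled by a spacing `dx`
    (chosen by trial and error in the paper).
    """
    y = np.asarray(features, dtype=float)
    x = np.arange(len(y)) * dx
    return np.column_stack((x, y))

pts = features_to_points([0.2, 1.5, 0.7], dx=2.0)
# pts[:, 0] is [0.0, 2.0, 4.0]; pts[:, 1] holds the feature values
```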
The remainder of this paper is organized as follows. Section 2 explains the proposed algorithm, Section 3 discusses experimental results, and Section 4 relates our conclusions and proposes future work.
We design and implement a classification method.
All experiments were performed using MATLAB R2019a on a computer with an Intel Core i7-8750H CPU @ 2.2 GHz, 32 GB of RAM, and a Windows 10 operating system.
The datasets, including RAVDESS [ , were chosen for the following reasons: they were recorded at three different frequencies; they represent the main emotions according to Paul Ekman’s definition [ ; two languages were used in recording them; and gender balance was required.
The specifications and properties of the datasets are shown in the table below.
|  | RAVDESS | Emo-DB | SAVEE |
|---|---|---|---|
| Actors | 24 | 10 | 4 |
| Language | English | German | English |
| Texts used to record the dataset | 2 | 10 | 15 |
| Number of emotions | 8 | 7 | 7 |
| Emotions represented (number of samples per emotion) | Fear (192), Disgust (192), Happy (192), Neutral (96), Calm (192), Sad (192), Angry (192), Surprised (192) | Fear (69), Disgust (46), Happy (71), Neutral (79), Sad (62), Angry (127), Bored (81) | Fear (60), Disgust (60), Happy (60), Neutral (120), Sad (60), Angry (60), Surprise (60) |
| Number of features | 2182 | 2182 | 2182 |
| Number of samples | 1440 | 535 | 480 |
| Speech signal frequency | 48,000 Hz | 16,000 Hz | 44,100 Hz |
| Sample size | 16 bits | 16 bits | 16 bits |
| Average human classification accuracy | 72% | 84% | 66.5 ± 2.5% |
| Bit rate | 768 kbps | 256 kbps | 705 kbps |
Each audio file in a dataset was converted to a one-dimensional vector, and three preprocessing functions were applied to each vector. First, silent parts at the beginning and end (and not in the middle) of each audio file were removed. Second, all files were grouped according to emotion. Third, the data in each audio vector were normalized between 0 and 100:

x̂_i = 100 × (x_i − Min) / (Max − Min),

where x is the data read from the audio file; i = 1, …, N, where N is the number of values representing the audio file; and Min and Max are the minimum and maximum values, respectively, found in the file. The factor of 100 was found by trial and error. The normalization step was applied after feature extraction.
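A minimal sketch of the silence-trimming and normalization steps follows; the amplitude threshold and the function names are assumptions for illustration, as the paper does not specify them.

```python
import numpy as np

def trim_silence(signal, threshold=1e-3):
    """Remove silent parts only at the beginning and end of the signal
    (silence in the middle is kept), using an amplitude threshold.
    The threshold value here is an assumption, not from the paper."""
    idx = np.flatnonzero(np.abs(signal) > threshold)
    if idx.size == 0:
        return signal[:0]
    return signal[idx[0]:idx[-1] + 1]

def normalize_0_100(x):
    """Min-max normalize values to the range [0, 100]."""
    x = np.asarray(x, dtype=float)
    mn, mx = x.min(), x.max()
    return 100.0 * (x - mn) / (mx - mn)

sig = np.array([0.0, 0.0, 0.5, -0.25, 1.0, 0.0])
trimmed = trim_silence(sig)          # leading/trailing zeros removed
scaled = normalize_0_100(trimmed)    # min maps to 0, max maps to 100
```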
One way to evaluate the performance of the classification method was to compare its results with those of a predefined neural network (NN) with one hidden layer of 10 nodes. The audio files in each of the three datasets were randomly divided into subsets of 70% for training, 15% for validation, and 15% for testing, with no file in more than one subset. The training function was “trainscg,” the divide function was “dividerand,” and the divide mode was “sample” [
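The random 70/15/15 split and the one-hidden-layer shape can be sketched as follows in pure NumPy on synthetic data; the random forward pass only illustrates the network’s architecture, not MATLAB’s “trainscg” training.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 200, 20, 7

# Stand-in data: in the paper these would be the extracted audio features.
X = rng.normal(size=(n_samples, n_features))

# Random 70/15/15 split into disjoint train/validation/test subsets.
idx = rng.permutation(n_samples)
n_train = int(0.70 * n_samples)
n_val = int(0.15 * n_samples)
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

# One hidden layer with 10 nodes (forward pass only; weights are random here).
W1 = rng.normal(size=(n_features, 10))
W2 = rng.normal(size=(10, n_classes))
hidden = np.tanh(X[test_idx] @ W1)
scores = hidden @ W2
predictions = scores.argmax(axis=1)   # one class label per test sample
```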
The number of features extracted from each audio file was 2182. Fifteen feature types were extracted from each audio file: entropy, zero crossing (ZC), deviation of ZC, energy, deviation of energy, harmonic ratio, Fourier function, Haar, MATLAB fitness function, pitch function, loudness function, Gammatone cepstral coefficients (GTCCs) over time and frequency, and MFCCs over time and frequency. Deviation-based features were computed for these 15 feature types at 10 multiples of the standard deviation on either side of the mean (0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, and 2.5). The feature extraction function will be called the feature extraction mean deviation (FE-MD) function.
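Two of the simpler features above, zero crossing and energy, can be sketched as follows; the function names and the frame length are illustrative assumptions, not the paper’s MATLAB implementation.

```python
import numpy as np

def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs where the sign changes."""
    signal = np.asarray(signal, dtype=float)
    return np.mean(np.sign(signal[:-1]) != np.sign(signal[1:]))

def short_time_energy(signal, frame_len=4):
    """Energy (sum of squares) of the signal split into fixed-length frames."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).sum(axis=1)

sig = np.array([0.5, -0.5, 0.5, -0.5, 1.0, 1.0, -1.0, -1.0])
zcr = zero_crossing_rate(sig)       # 5 sign changes over 7 pairs
energy = short_time_energy(sig)     # one energy value per 4-sample frame
```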
This step receives its input from feature extraction and sends its output to cubic spline curve generation, as shown in
Numeric data are commonly difficult to analyze, and a function that successfully links the data is hard to find. In 1998, mathematicians proposed CSI to address this [
Interpolation algorithms find sets of unique polynomials that form the main spline curve, interpolating all the control points and generating new points between them. Changes in control point positions can affect the curvature of the generated polynomials. We chose spline interpolation for three reasons: 1) regardless of polynomial degrees, the interpolation error can be small; 2) it avoids Runge’s phenomenon [
Interpolation has many applications, and all follow the concept of smoothing [
A polynomial in a single variable x can be written in the general form

p(x) = a_n x^n + a_(n−1) x^(n−1) + … + a_1 x + a_0,

where the a_k are the coefficients and n is a nonnegative integer.
Therefore, a polynomial can be written either as zero or as a sum of a finite number of nonzero terms, each the product of a number (the coefficient) and a finite number of variables raised to nonnegative integer powers. This power is called the degree of the term, and the degree of a polynomial is the largest degree among all terms with nonzero coefficients. An example of a cubic polynomial is

p(x) = x^3 + 4x^2 − 2x + 1.

The degree of the second term is 2, and the coefficient of that term is +4. The degree of the polynomial is the degree of the highest term, i.e., three.
A third-degree spline is called a cubic spline. Assume a cubic spline on the interval x_0 ≤ x ≤ x_n is generated by a set of piecewise polynomials s_i(x). The standard form of the cubic spline is

s_i(x) = a_i + b_i (x − x_i) + c_i (x − x_i)^2 + d_i (x − x_i)^3,  x_i ≤ x ≤ x_(i+1),

where i = 1, 2, …, n−1, and n is the number of control points; hence, n−1 is the number of cubic polynomials that form the cubic spline interpolation. The first and second derivatives of these n−1 equations are fundamental in the cubic spline interpolation methodology, and are respectively [

s_i′(x) = b_i + 2c_i (x − x_i) + 3d_i (x − x_i)^2,
s_i″(x) = 2c_i + 6d_i (x − x_i).
From these functions, three matrices can be generated (see
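A sketch of solving for natural cubic spline coefficients follows. This is the standard textbook construction, not the paper’s MATLAB code, and the natural boundary condition (second derivative zero at both ends) is an assumption; the tridiagonal system below corresponds to the matrices derived from the derivative conditions.

```python
import numpy as np

def natural_cubic_spline_coeffs(x, y):
    """Solve the tridiagonal system for a natural cubic spline.
    Returns per-interval coefficients (a, b, c, d) of
    s_i(t) = a_i + b_i*(t-x_i) + c_i*(t-x_i)^2 + d_i*(t-x_i)^3."""
    n = len(x) - 1                    # number of intervals
    h = np.diff(x)
    # Build the (n+1) x (n+1) system A*c = r for the c coefficients.
    A = np.zeros((n + 1, n + 1))
    r = np.zeros(n + 1)
    A[0, 0] = A[n, n] = 1.0           # natural boundary: s'' = 0 at both ends
    for i in range(1, n):
        A[i, i - 1] = h[i - 1]
        A[i, i] = 2.0 * (h[i - 1] + h[i])
        A[i, i + 1] = h[i]
        r[i] = 3.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
    c = np.linalg.solve(A, r)
    a = y[:-1]
    b = (np.diff(y) / h) - h * (2.0 * c[:-1] + c[1:]) / 3.0
    d = (c[1:] - c[:-1]) / (3.0 * h)
    return a, b, c[:-1], d

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 1.0, 3.0])
a, b, c, d = natural_cubic_spline_coeffs(x, y)
# Evaluate the spline at a control point: it must interpolate exactly.
t, i = 1.0, 1
s = a[i] + b[i]*(t - x[i]) + c[i]*(t - x[i])**2 + d[i]*(t - x[i])**3
# s == y[1] == 2.0
```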
This step generates CSIs of equal length: 10 deviation degrees (0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2 (the standard deviation), 2.25, and 2.5) are found for all points in each polynomial to construct the final spline curve. After this step, the CSIs representing all audio files in the dataset are randomly divided into two parts: 70% to find the MCSI curves and 30% for testing.
Each MCSI curve represents a different emotion, so the number of MCSI curves equals the number of emotions represented in the dataset, i.e., 8, 7, and 7 for RAVDESS, Emo-DB, and SAVEE, respectively. The MCSI is calculated as

MCSI_m(j) = (1 / length(y(m))) Σ_(i ∈ y(m)) f_ij,

where m = 1, 2, …, n, and n is the number of emotions in the dataset; j = 1, 2, …, x, where x is the number of points forming each CSI representing an audio file; and i ∈ y(m), where vector y(m) stores the indices of the 70% of audio files randomly selected from each emotion. The feature subset is (f_ij, f_(i+1)j, f_(i+2)j, …, f_nj), where length(y(m)) equals 1152, 428, and 384 for RAVDESS, Emo-DB, and SAVEE, respectively. At the end of this step, n MCSIs are generated, one per emotion, each representing a different emotion through a unique curve. The summation adds the same 70% of samples from each feature subset, as defined in
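A sketch of the MCSI computation on toy data follows; the emotion names, curve counts, and the random split are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points = 50                         # points on each CSI curve
csi = {                               # CSI curves grouped by emotion (toy data)
    "happy": rng.normal(size=(30, n_points)),
    "sad": rng.normal(size=(25, n_points)),
}

mcsi = {}
for emotion, curves in csi.items():
    n_train = int(0.70 * len(curves))            # 70% of each emotion's files
    train_idx = rng.permutation(len(curves))[:n_train]
    # MCSI_m(j): mean of point j over the selected curves of emotion m.
    mcsi[emotion] = curves[train_idx].mean(axis=0)
```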
The audio files in the dataset were divided into two groups, training and testing. The 70% for training was used to find the MCSIs, and the remaining 30% of the audio files in the dataset was used to test the classification accuracy of the proposed work. This 30% amounts to 432, 160, and 144 audio files for RAVDESS, Emo-DB, and SAVEE, respectively.
The standard Euclidean distance function is used in this work [

ED = sqrt((Xin − Xjm)^2 + (Yin − Yjm)^2),

where (Xin, Yin) are the coordinates of the CSI samples determined for testing, and (Xjm, Yjm) are the coordinates of the MCSI curve that represents one of the emotions in the dataset. The matrix that will be generated from
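The nearest-MCSI classification rule can be sketched as follows on toy curves; summing the point-wise Euclidean distances over the curve is one reading of the paper’s distance computation, and all names here are illustrative.

```python
import numpy as np

def classify_by_nearest_mcsi(points, mcsi_curves):
    """Assign the emotion whose MCSI curve has the smallest summed
    point-wise Euclidean distance to the test CSI's (x, y) points."""
    best_emotion, best_dist = None, np.inf
    for emotion, curve in mcsi_curves.items():
        # Sum of point-wise Euclidean distances between the two curves.
        dist = np.sqrt(((points - curve) ** 2).sum(axis=1)).sum()
        if dist < best_dist:
            best_emotion, best_dist = emotion, dist
    return best_emotion

x = np.linspace(0.0, 1.0, 5)
mcsi = {
    "happy": np.column_stack((x, np.full(5, 2.0))),
    "sad": np.column_stack((x, np.full(5, -2.0))),
}
test_curve = np.column_stack((x, np.full(5, 1.5)))   # closer to "happy"
label = classify_by_nearest_mcsi(test_curve, mcsi)
```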
We analyze the performance of SERS, discussing the best classification accuracy achieved with the fewest selected features. Experiment 1.1 (Exp1.1) and Experiment 1.2 (Exp1.2) were performed before and after deploying the pre-designed feature selection t-test fitness (FS-TF) algorithm, respectively.
In Exp1.1, the classification performance of SERS was calculated using two approaches: first with the FC-CSI algorithm, and second with SER-NN. Both approaches were implemented without deploying the FS-TF algorithm.
We observe the following after analyzing the results. Exp1.1 shows that FC-CSI outperformed SER-NN on Emo-DB and SAVEE and underperformed on RAVDESS. This leads to the conclusion that FC-CSI does not perform well on datasets with low human-accuracy performance. The most misclassified samples are those of the sad and fear emotions in RAVDESS, and of the neutral, sad, and disgust emotions in SAVEE, which reveals the drawback of FC-CSI in recognizing low-amplitude emotions.
|  | Proposed classification method | NN |
|---|---|---|
| RAVDESS | 69.07% | 86.1% |
| Emo-DB | 92.52% | 96.3% |
| SAVEE | 89.1% | 91.7% |
We observe the following: FC-CSI showed lower performance for emotions with low amplitude, such as calm and bored; FC-CSI showed average performance for emotions that contain many silent fragments, such as neutral; FC-CSI failed to improve performance for the anger emotion on any of the three datasets; and the per-emotion performance of FC-CSI on RAVDESS did not resemble that on Emo-DB and SAVEE, except for the happy emotion, on which FC-CSI performed well on all three datasets.
In Exp1.2, the classification performance of SERS was calculated using the same two approaches, first with the FC-CSI algorithm and then with SER-NN, this time with the FS-TF algorithm deployed.
The column charts in
Although the NN achieved the best classification accuracy results, the proposed classification method accomplished the results shown in
|  | Proposed classification method | NN |
|---|---|---|
| RAVDESS | 285 | 333 |
| Emo-DB | 104 | 247 |
| SAVEE | 89 | 270 |
We compared the performance of FC-CSI before and after FS-TF to show how SERS was affected by the FS algorithm, and how it affected the emotions in each dataset (see
We now clarify the results of FC-CSI. As discussed, the algorithm finds MCSI curves through the mean of 70% of the audio samples in the dataset, where each MCSI represents a different emotion. We find the Euclidean distance between the CSI of each audio sample in the testing set and each MCSI, and classify each sample to the nearest MCSI. The MCSI curves that represent the emotions in RAVDESS, Emo-DB, and SAVEE are shown in
It is obvious from looking at the curves in
FC-CSI introduces a new research area in classification. No related work was found that deployed curves, specifically CSI, in classification. The most important conclusion is that CSI is a powerful data analysis tool. Splines correlate data efficiently and effectively, no matter how random the data may seem. Once the algorithm for spline generation is produced, interpolating data becomes easy. It was also noticed that statistical classification methods need less time than SER-NN, because NNs need time to learn and validate, whereas statistical methods require no learning. A set of speech signals representing a certain emotion shares a similar CSI behavior, which can be exploited in SER systems. Curve classification does not struggle to distinguish commonly confused emotions such as calm and neutral, or happy and angry, and it is gender-, speaker-, and language-independent. Possible future work based on this study is as follows. First, it would be challenging to develop the proposed classification algorithm to deal with audio samples of unequal length. Second, a more complex distance measure, such as the Fréchet distance, could be deployed. Third, we used CSI, but other types of curves, such as Bézier curves, and higher dimensions, such as three dimensions, are worth trying.
The proposed FC-CSI classification algorithm can also be applied to voice, speaker, and gender recognition, and can be used in classification systems such as image detection and object recognition. It can also be used to detect diseases and infections.
Thanks and appreciation to everyone who contributed to this scientific research, especially the researchers cited in this paper.