A New Hybrid Feature Selection Method Using T-test and Fitness Function

Abstract: Feature selection (FS), also called feature dimensionality reduction or feature optimization, is an essential process in pattern recognition and machine learning because it enhances classification speed and accuracy and reduces system complexity. FS reduces the number of features produced in the feature extraction phase by removing highly correlated features, retaining features with high information gain, and discarding features with no weight in classification. In this work, a filter-type statistical FS method is designed and implemented that utilizes a t-test to decrease the convergence between feature subsets by calculating a quality of performance value (QoPV). The approach also utilizes a purpose-built fitness function to calculate a strength of recognition value (SoRV). The two values are used to rank all features according to a final weight (FW) calculated for each feature subset by a function that prioritizes feature subsets with high SoRV values. An FW is assigned to each feature subset, and subsets with FWs below a predefined threshold are removed from the feature subset domain. Experiments are implemented on three datasets: the Ryerson Audio-Visual Database of Emotional Speech and Song, Berlin, and Surrey Audio-Visual Expressed Emotion. The performance of the F-test and F-score FS methods is compared with that of the proposed method. Tests are also conducted on the system before and after deploying the FS methods. Results demonstrate the comparative efficiency of the proposed method. System complexity is calculated based on the time overhead required before and after FS, and the results show that the proposed method reduces system complexity.


Introduction
Feature selection (FS) is a preprocessing step in machine learning [1] that enhances classification accuracy. It is the process of selecting a feature subset from a pool of correlated features for use in model construction [2]. This work aims to decrease the high correlation between features, which causes numerous drawbacks, including a failure to gain additional information or improve classification accuracy. Many FS methods have been developed, and they can be categorized according to multiple topologies. This work is concerned with statistics; hence, we classify FS methods according to the distance measures used to evaluate subsets. Distance measures distinguish redundant or irrelevant features from the main pool, and four types of FS methods can be identified according to their distance measures [7].
• Wrapper methods assign a scoring value to each feature subset after training and testing the model. This requires considerable time, but it obtains the subset with the highest accuracy. The three wrapper FS methods of optimization selection, sequential backward selection, and sequential forward selection (SFS), based on the ensemble algorithms bagging and AdaBoost, were used in [8]. Subset evaluations were performed using naïve Bayes and decision tree classifiers. Thirteen datasets with different numbers of attributes and dimensions were obtained from the UCI Machine Learning Repository. The search technique using SFS based on the bagging algorithm with decision trees achieved the best average accuracy (89.60%).
• Filter methods measure the relevance of features through univariate statistics. In tests of 32 FS methods on four gene expression datasets, filter methods were found to outperform wrapper and embedded methods [9].
• Embedded methods differ in how learning interacts with the FS phase. Unlike filter methods, wrapper methods utilize learning to measure the quality of several feature subsets without knowledge of the structure of the classification or regression method used; therefore, wrapper methods can work with any learning machine. Embedded methods, by contrast, do not separate the learning and FS phases, and the structure of the class of functions under consideration plays a crucial role. An example is the measurement of the value of a feature using a bound that is valid for the support vector machine (SVM) only and not for the decision tree method [10].
• Hybrid methods utilize two or more FS methods. An efficient hybrid method consisting of principal component analysis and ReliefF was proposed in [11]. Ten benchmark disease datasets were used for testing. The approach eliminated 50% of the irrelevant and redundant features from the datasets and significantly reduced the computation time.
FS methods employ strategies based on the types of feature subsets: redundant and weakly relevant, weakly relevant and non-redundant, noisy and irrelevant, and strongly relevant [12]. The current study aims to remove redundant and strongly correlated features by deploying a t-test, and to find coupled features with high dependency by deploying a fitness function. Although FS places a considerable burden on system performance, it is rarely omitted from pattern recognition systems.

Main Concepts of FS Methods
• FS methods are employed either to reduce system complexity or to increase accuracy. A study in 2006 employed two FS algorithms: the t-test method to filter irrelevant and noisy genes, and kernel partial least squares (KPLS) to extract features with high information content [13]. It was found that neither method achieved high classification results. FS methods do not necessarily increase the classification accuracy of pattern recognition systems; they can remove relevant features even when there is no conflict between the removed features [14,15].
• There is no universally superior FS method. Research has shown that no specific group of FS filter methods constantly outperforms the others, although observations indicate that certain groups of FS filter methods perform best on many datasets [3,16]. Many FS methods have been used in pattern recognition research and in different scientific fields, with widely varying results. Furthermore, each FS filter method performs differently on specific types of datasets; this is called FS algorithm instability [17].
One drawback of statistical FS algorithms is that they do not consider the dependency of features on one another; a statistical FS algorithm can eliminate a feature whose absence negatively affects the performance of another selected feature because of their strong interrelationship [17]. This work avoids this drawback by calculating the dependency of each feature on the other features. State-of-the-art methods decide to remove highly correlated features without a basis in proper measurement. Two highly correlated features can be powerful in classifying two different attributes; thus, removing one of them can severely degrade classification. To avoid this, we calculate the strength of recognition value (SoRV) and assign it a high weight through an exponential function. The proposed method outperforms the state of the art through a fitness function that calculates the SoRV for each feature and feature subset (pair of features). Removing a feature can also affect the performance of another feature. To avoid this, we group features into subsets of pairs to calculate the degree of dependence between each feature and all other features.
In the proposed method, each tested subset contains at most two features. Using a combination of three or more features per subset would exponentially increase the computation time, and reaching the optimal solution would take months. Subsets of two features, by contrast, provide good results in a reasonable amount of time; hence, we fix the number of features per subset at two. We focus on statistical filter FS methods because of their stability, scalability, and minimal time consumption.
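The restriction to pairs can be motivated with a quick count. The sketch below, assuming only the paper's stated figure of 2,186 extracted features, compares the number of candidate two-feature subsets with the number of three-feature subsets; the variable names are illustrative.

```python
from itertools import combinations
from math import comb

k = 2186  # number of features extracted per audio file in this work

pairs = comb(k, 2)    # subsets of two features, as used by the proposed method
triples = comb(k, 3)  # subsets of three features, ruled out as too expensive

print(pairs)    # 2,388,205 candidate pairs
print(triples)  # roughly 1.7 billion candidate triples

# Enumerating the pairs themselves is direct:
first_pair = next(combinations(range(k), 2))
print(first_pair)  # (0, 1)
```

The jump from about 2.4 million pairs to about 1.7 billion triples illustrates why the subset size is fixed at two.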
The remainder of this paper is organized as follows. Section 2 explores some recent FS methods that utilize the t-test and feature ranking approaches. Section 3 explains the proposed methodology. Section 4 shows the experimental setup and the results gained through this work. Section 5 discusses our conclusions and trends for future work.

Related Work
The t-test is deployed in many fields to measure the convergence relevance between samples. A proposed gene selection method utilized two FS methods: the t-test to remove noisy and irrelevant genes, and KPLS to select features with noticeable information content [13]. Three datasets were used in a performance experiment, and the results showed that neither method yielded satisfactory results. A modified hybrid ranking t-test measure was applied to genotype HapMap data [18]. Each single nucleotide polymorphism (SNP) was ranked relative to other feature importance measures, such as F-statistics and the informativeness for assignment. The highest-ranked SNPs, in different groups and different numbers, were selected as the input to a class SVM to find the best classification accuracy achieved by a specific feature subset. A two-class FS algorithm utilizing the Student's t-test was used to extract statistically relevant features, and the -norm SVM and recursive feature elimination were used to determine the patients at risk of cancer spreading to their lymph nodes [19]. A proposed FS method used the Student's t-test to measure the diversity of the term frequency distribution between one category and the entire dataset [20]. An FS approach based on a nested genetic algorithm (GA) utilized both filter and wrapper FS methods [21]. For the filter FS, a t-test was used to rank the features according to convergence and redundancy; a nested neural network and SVM were used as the wrapper FS technique. A t-test was utilized to compare outcome measures pre- and post-ablation through an intraprocedural 18F-fluorodeoxyglucose positron emission tomography (PET) scan assessment before and after PET/contrast-enhanced guided microwave ablation [22]. A fatigue characteristic parameter optimization selection algorithm utilized the classification performance of an SVM as an evaluation criterion and applied the sequential forward floating selection algorithm as a search strategy [23].
The algorithm aimed to reach the optimal feature subset of fatigue motion by reducing the dimensionality of the domain set of fatigue feature parameters. Based on the t-test analysis of variance method, the algorithm was used to analyze the influence of individual athlete differences and fatigue exercises on sports behavior and eye movement characteristics.

Proposed Method
A filtered FS method is proposed to improve the emotion classification accuracy of the datasets deployed in this work. These are the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), Berlin (Emo-DB), and Surrey Audio-Visual Expressed Emotion (SAVEE). The method uses the minimum number of features to achieve the highest accuracy in the least time.
The structure of the features extracted from each dataset is shown in Tab. 1. We explain the structure of the features extracted from the RAVDESS dataset as an example. First, 2,186 features are extracted from each of the 1,440 audio wave file samples. The same number is extracted from Emo-DB and SAVEE. The number of features is k, as is the number of feature subsets. n is the number of samples in each feature subset, represented as (f_{i,j}, f_{i,j+1}, f_{i,j+2}, ..., f_{i,n}), where n = 1,440 for RAVDESS, n = 535 for Emo-DB, and n = 480 for SAVEE. The feature number in a feature subset is denoted by i, and j is the sample number. Sections 3.1-3.3 discuss the procedures of the proposed FS method.

QoPV Calculation
The t-test value is calculated between each subset and all other subsets through Eq. (1), where k is the number of feature subsets, n is the number of samples in each feature subset, i = 1, ..., k − 1, m = 1, ..., n, and j = i + 1, ..., k, to avoid calculating the quality of performance value (QoPV) for the same pair of feature subsets twice. The QoPV is obtained by calculating the t-test value between subset i and all other subsets. The QoPV of a subset decreases each time the t-test value is 0; otherwise, it increases. After the QoPV of each feature subset is calculated with respect to all other subsets, the feature subsets are ranked by QoPV in descending order.
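The counting rule above can be sketched as follows. This is a hedged illustration, assuming (as the text implies) that a subset's QoPV is incremented for every other subset whose t-value against it is nonzero and decremented when that t-value is zero; `two_sample_t` stands in for Eq. (2) using the standard pooled two-sample t statistic, and the toy subsets are illustrative.

```python
from statistics import mean, variance

def two_sample_t(a, b):
    """Standard pooled two-sample t statistic (stand-in for Eq. (2))."""
    n1, n2 = len(a), len(b)
    s2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    if s2 == 0:
        return 0.0
    return (mean(a) - mean(b)) / (s2 * (1 / n1 + 1 / n2)) ** 0.5

def qopv(subsets):
    """QoPV per subset: +1 for each nonzero t against another subset, -1 for zero."""
    k = len(subsets)
    scores = [0] * k
    for i in range(k - 1):
        for j in range(i + 1, k):  # j = i+1..k avoids repeating pairs
            delta = 1 if two_sample_t(subsets[i], subsets[j]) != 0 else -1
            scores[i] += delta
            scores[j] += delta
    return scores

# Subsets 0 and 1 are identical (t = 0), so both are penalized;
# subset 2 is well separated from both, so it scores highest.
subsets = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [10.0, 12.0, 14.0]]
scores = qopv(subsets)
print(scores)  # [0, 0, 2]
ranked = sorted(range(len(subsets)), key=lambda i: scores[i], reverse=True)
print(ranked)  # subset 2 ranked first
```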

t-test
This work uses a two-sample t-test, i.e., the so-called independent t-test, because the two groups of values being tested come from different features. The t-test function of Eq. (2) takes the form t = (x̄1 − x̄2) / √(S²(1/fe1 + 1/fe2)), where x̄1 and x̄2 are the means of the two feature subsets being compared, as in Eq. (1); S² is the pooled variance of the two subsets; and fe1 and fe2 are the (equal) numbers of samples in the two subsets. The t-test indicates significant differences between pairs of feature subsets. A large t-test value indicates that the difference between the means of two groups is large relative to the pooled standard error of the two feature subsets [24]. Thus, the higher the t-test value, the better the results. Feature subsets with low t-test values must be removed because their values are very similar to those of other feature subsets. However, the final decision is not made at this step, because a feature subset with a low QoPV might have a high SoRV. In that case, the subset may have a final weight (FW) higher than those of feature subsets with a high QoPV. This reflects the novel idea of our work.
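The pooled two-sample t statistic described above can be sketched directly from the named terms (subset means, pooled variance S², and equal subset sizes fe1 = fe2); the sample values below are illustrative.

```python
from statistics import mean, variance

def t_test(x1, x2):
    """Independent two-sample t statistic with pooled variance."""
    fe1, fe2 = len(x1), len(x2)
    # Pooled variance of the two feature subsets
    s2 = ((fe1 - 1) * variance(x1) + (fe2 - 1) * variance(x2)) / (fe1 + fe2 - 2)
    return (mean(x1) - mean(x2)) / (s2 * (1 / fe1 + 1 / fe2)) ** 0.5

# Two well-separated subsets give a large |t|; overlapping subsets give a
# small one, flagging the latter pair as redundant candidates.
far = t_test([1.0, 1.1, 0.9, 1.2], [5.0, 5.2, 4.8, 5.1])
near = t_test([1.0, 1.1, 0.9, 1.2], [1.05, 1.0, 1.15, 0.95])
print(abs(far) > abs(near))  # True
```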

SoRV Calculation
The SoRV for each subset i is obtained using a neural network-based fitness function. The SoRV is calculated over pairs of subsets to observe the classification effect of each feature subset i on all other feature subsets through Eq. (3), where k is the number of feature subsets, n is the number of samples in each feature subset, i = 1, ..., k − 1, m = 1, ..., n, and j = i + 1, ..., k. After several experiments, a value of 37% was found to achieve the highest performance for the tested features.

Final Weight (FW) Calculation
Several experiments show that SoRV is more important than QoPV. Specifically, SoRV indicates the power of recognition for each feature subset, whereas QoPV indicates the convergence of the feature subset with respect to other feature subsets. Nevertheless, we need QoPV to determine the degree of convergence of each feature subset. Thus, we use Eq. (4) to assign a higher weight to SoRV than to QoPV.
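Since Eq. (4) is not reproduced here, the sketch below uses a hypothetical stand-in form, an exponential emphasis on SoRV added to QoPV, that merely satisfies the stated requirement that SoRV receive a higher weight than QoPV; the exact function in the paper may differ.

```python
import math

def final_weight(sorv, qopv):
    # Hypothetical stand-in for Eq. (4): the exponential term makes SoRV dominate
    return math.exp(sorv) + qopv

# A subset with low QoPV but high SoRV can still outrank one with the
# opposite profile, which is the ranking behaviour the text describes.
low_qopv_high_sorv = final_weight(sorv=0.9, qopv=0.2)
high_qopv_low_sorv = final_weight(sorv=0.2, qopv=0.9)
print(low_qopv_high_sorv > high_qopv_low_sorv)  # True
```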

Experimental Setup
All audio files were preprocessed prior to feature extraction. Silent parts at the beginning and end of each file were removed, data were normalized to the interval (0, 100), and files were grouped according to the emotions they represented. The number of features extracted from each audio file was 2,186. Audio file samples were selected randomly for evaluation, and 70%, 15%, and 15% of the samples of each dataset were selected for training, validation, and testing, respectively. To evaluate the proposed FS method, we used a one-layer, 10-node neural network classifier. Feature extraction was applied to each of the three datasets before application of the proposed method.
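The preprocessing steps above can be sketched as follows: min-max normalization of each feature to the interval (0, 100) and a random 70/15/15 train/validation/test split. The data values and the fixed seed are illustrative assumptions, not from the paper.

```python
import random

def normalize(values, lo=0.0, hi=100.0):
    """Min-max scale a feature column to [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

def split(samples, seed=0):
    """Random 70%/15%/15% train/validation/test split."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)  # random sample selection
    n_train = int(0.70 * len(samples))
    n_val = int(0.15 * len(samples))
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test

feature = normalize([3.0, 7.0, 5.0, 11.0])
print(min(feature), max(feature))  # 0.0 100.0
train, val, test = split(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```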

Experimental Data
The datasets used in this work were selected through an online search according to the following criteria.
• This work proposes an FS method for use in speech emotion recognition; thus, the most important criterion is the set of emotions represented in a dataset. Selected datasets should represent the six basic emotions of fear, disgust, happiness, sadness, anger, and surprise, according to Paul Ekman's definition [25]. The three selected datasets intersect in representing fear, disgust, happiness, neutrality, sadness, and anger, which cover five of the six basic emotions. The RAVDESS dataset represents eight emotions through 1,440 audio files, and Emo-DB and SAVEE represent seven emotions through 535 and 480 audio files, respectively.
• The selected datasets should be recorded at different frequencies to test the proposed method. The RAVDESS, Emo-DB, and SAVEE datasets were recorded at 48,000, 16,000, and 44,100 Hz, respectively, as shown in Tab. 2.
• Datasets should show gender balance; this criterion was met in this work.
The same feature extraction process was implemented on each of the datasets, and 2,186 features were extracted for each audio file. These were established by a predefined feature extraction method that utilizes 15 features: entropy, zero crossing (ZC), deviation of ZC, energy, deviation of energy, harmonic ratio, Fourier function, Haar, MATLAB fitness function, pitch function, loudness function, gammatone cepstral coefficients according to time and frequency, and Mel-frequency cepstral coefficients (MFCC) according to time and frequency. The standard deviation (SD) of these features was calculated using 14 degrees on either side of the mean (i.e., 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.5, and 4). All experiments were implemented separately on each dataset.

Performance Analyses
We now discuss the experimental results. The performance of the proposed FS method is evaluated through a neural network classifier. Three emotional datasets are used in the evaluation process, as shown in Tab. 2. The accuracy of the classifier is assessed using confusion matrices and receiver operating characteristic (ROC) curves. The confusion matrices represent emotions as numbers. The confusion matrices for the RAVDESS dataset show the following emotions from left to right, denoted as 1 to 8: neutrality, calm, happiness, sadness, anger, fear, disgust, and surprise. The confusion matrices for the Emo-DB dataset show the following emotions from left to right, denoted as 1 to 7: fear, disgust, happiness, boredom, neutrality, sadness, and anger. The confusion matrices for the SAVEE dataset show the following emotions from left to right, denoted as 1 to 7: anger, disgust, fear, happiness, sadness, surprise, and neutrality. The ROC line chart is one of the best techniques for evaluating a classification system. It is a two-dimensional line chart: the x-axis shows the false-positive rate (FPR), and the y-axis shows the true-positive rate (TPR). The ROC shows the relationship between sensitivity and specificity and is generated by plotting the TPR against the FPR. The TPR is the ratio of cases correctly predicted as positive (true positives, TP) to all positive cases (the sum of true positives and false negatives, TP + FN), as shown in Eq. (5).
The FPR is the ratio of cases incorrectly predicted as positive (false positives, FP) to all negative cases (the sum of false positives and true negatives, FP + TN), as shown in Eq. (6).
The ROC curve represents a compromise between the TPR (sensitivity) and 1 − FPR (specificity). The degree to which the curves approach the top-left corner of the ROC line chart indicates how well the classification process makes correct predictions. The closer a curve is to the 45° diagonal of the ROC space, the less accurate the classification is because of incorrect predictions [26]. The greatest advantage of the ROC in evaluating classifiers is that it does not depend on class distribution but only on classifier predictions. The results of our experiments are presented in Tab. 3, which compares the proposed FS method with the widely used F-test and F-score methods. Tab. 3 shows that the proposed FS method achieves the highest classification accuracy among these methods.
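Eqs. (5) and (6) can be computed directly from raw confusion counts; the counts below are illustrative, not taken from the paper's confusion matrices.

```python
def tpr(tp, fn):
    """Eq. (5): sensitivity, correct positives over all positive cases."""
    return tp / (tp + fn)

def fpr(fp, tn):
    """Eq. (6): false alarms over all negative cases."""
    return fp / (fp + tn)

# Illustrative counts for one emotion class
print(tpr(tp=45, fn=5))  # 0.9
print(fpr(fp=3, tn=47))  # 0.06
```

One (FPR, TPR) point per decision threshold traces out the ROC curve; the closer these points sit to (0, 1), the better the classifier.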
Tab. 3 and Figs. 2-4 present the classification accuracy results for the three datasets before deploying the FS methods (utilizing all 2,186 features): 93.05%, 95%, and 97.2% for the RAVDESS, Emo-DB, and SAVEE datasets, respectively. Figs. 5-7 show the classification accuracies after running the three FS methods on the RAVDESS, Emo-DB, and SAVEE datasets, respectively. The highest classification accuracy in this work was obtained by running the proposed FS method on all three datasets. The highest classification accuracies achieved by the proposed, F-test, and F-score FS methods on the RAVDESS dataset are 93.5%, 92.6%, and 92.1%, respectively, as shown in Figs. 5a-5c and Tab. 3. The latter two values are lower than the accuracy obtained without FS because many of the emotions represented in RAVDESS audio samples are similar and thus difficult to distinguish. The same is true of realistic datasets. This similarity between audio samples produces similarity in the extracted features; hence, the proposed, F-test, and F-score FS methods yield relatively poor outcomes on RAVDESS. Tab. 3 and Fig. 6 show the classification accuracies after deploying the three FS methods on the Emo-DB dataset. The proposed FS method achieves the highest classification accuracy; the F-test and F-score FS methods achieve accuracies of 97.5% and 96.3%, respectively. As observed in the confusion matrices, each FS method affects the recognition of a certain emotion: the proposed method affects the recognition of happiness, the F-test FS method affects the recognition of fear and anger, and the F-score FS method affects the recognition of boredom.
Tab. 3 and Fig. 7 show the classification accuracies after deploying the three FS methods on the SAVEE dataset. The proposed FS method attains the highest accuracy among the compared methods: 100% classification accuracy, compared with 98.6% and 97.2% for the F-test and F-score methods, respectively. The F-score FS method achieves no improvement over the baseline classification accuracy.

Figure 7: (a) Test confusion matrix after applying the proposed FS method on SAVEE (b) Test confusion matrix after applying F-test on SAVEE (c) Test confusion matrix after applying F-score on SAVEE
All the results shown in the confusion matrices are described by the legend charts shown in Fig. 8. The results highlight the superiority of the proposed FS method over the F-test and F-score FS methods. As mentioned in Section 4.2, the results are also analyzed using the ROC line chart. Figs. 9-11 show the ROC curves for the classification processes on the RAVDESS, Emo-DB, and SAVEE datasets, respectively, before deploying the FS methods. Through the confusion matrices, we show numerically the superior performance of the proposed FS method over the other two FS methods. Through the ROC line charts, we show visually that the proposed method outperforms the F-test and F-score FS methods.
A visual comparison of the ROC curves in Fig. 9 with those in Fig. 12 shows that all the ROC curves in Figs. 12a-12c are farther from the top-left corner than those in Fig. 9. This indicates that the FS methods fail to improve the RAVDESS results, although the proposed method attains the highest results among them. The ROC curves in Fig. 12a are closer to the top-left corner than those in Figs. 12b and 12c, demonstrating that the best performance is achieved by the proposed FS method. For the RAVDESS dataset, eight epochs are needed to achieve 93.1% classification accuracy without any FS method (Fig. 15). For the Emo-DB dataset, 67 epochs are needed to achieve 95% classification accuracy (Fig. 16). For the SAVEE dataset, six epochs are needed to achieve 97.2% classification accuracy (Fig. 17). Tab. 4 compares the numbers of epochs needed to classify the emotions in the datasets before deploying the FS methods (Figs. 15-17) and after deploying them (Figs. 18-20). When the proposed FS, F-test, and F-score FS methods are applied on the RAVDESS dataset, classification takes 6 and 7 epochs (Fig. 18). Thus, the three FS methods have adequate classification times, but the proposed FS method is faster than the other two. When the proposed FS, F-test, and F-score FS methods are applied on the Emo-DB dataset, the classification process takes 9, 6, and 8 epochs, respectively (Fig. 19). Thus, the three FS methods have adequate classification times, and the F-test FS method is faster than the other two. Although the F-test FS method achieves the fastest time, its classification accuracy is 1.3% less than that of the proposed FS method. For the SAVEE dataset, the number of epochs is the same before and after FS (Figs. 17 and 20); hence, no improvement in classification time is achieved. Nevertheless, the classification accuracies are adequate, as discussed previously.
Before applying the FS methods, 2,186 features are extracted from each audio file in the three datasets, because the same feature extraction process is applied to all of them. The numbers of features selected by the three FS methods differ (Tab. 5). Although the proposed FS method uses the fewest features on the RAVDESS dataset, it records the highest classification accuracy; the same is true for the SAVEE dataset. For the Emo-DB dataset, the proposed method achieves the highest accuracy in recognizing the seven emotions while selecting the largest number of features.

Conclusion and Future Work
The confusion matrices in this study reveal a strong relationship between each FS method and the recognized emotions. Each FS method affects the recognition of one or two emotions, and different methods affect different emotions. According to the results for the Emo-DB dataset, the proposed method negatively affects the accurate classification of happiness, the F-test FS method negatively affects the accurate classification of fear and anger, and the F-score FS method negatively affects the accurate classification of boredom. In summary, each FS method negatively affects the classification accuracy of a different emotion. Therefore, building a hierarchical or ranking FS method from the three FS methods utilized in this work would yield strong classification results, but it would consume more time. Ultimately, no fixed relationship exists among the number of features, speed, and classification accuracy: the highest accuracy can be obtained with the lowest number of features, and the highest speed can be achieved with the largest number of features. The variation depends on the SoRV factor utilized in selecting the features that are most powerful in recognizing different emotions. Thus, measuring the classification power of each feature is the key to the success of the proposed work. Specifically, many features that are highly convergent but have high classification power are excluded from the main feature domain and neglected by most FS methods. By contrast, our work assigns greater importance to the SoRV than to the QoPV because of its contribution to classification.