Enhancing Parkinson’s Disease Prediction Using Machine Learning and Feature Selection Methods

Several million people worldwide suffer from Parkinson’s disease. Parkinson’s affects about 1% of people over 60, and its symptoms worsen with age. The voice may be affected, and patients experience speech abnormalities that listeners might not notice but that can be analyzed using recorded speech signals. With the rapid advance of technology, the volume of medical data has increased dramatically, so there is a need to apply data mining and machine learning methods to extract new knowledge from this data. Several classification methods have been used to analyze medical data sets and diagnostic problems, such as Parkinson’s Disease (PD). In addition, feature selection methods have been used extensively in many fields to improve classification performance. This paper proposes a comprehensive approach to enhance the prediction of PD using several machine learning methods combined with different feature selection methods, both filter-based and wrapper-based. The dataset includes 240 recordings with 46 acoustic features extracted from 3 voice recording replications for each of 80 patients. The experimental results showed improvements when the wrapper-based feature selection method was used with the K-NN classifier, which reached an accuracy of 88.33%. The best results were compared with those of other studies, and this study was found to provide comparable or superior results.

In addition, feature selection plays an important role in the interpretation of medical data. Feature selection is a significant combinatorial optimization problem in machine learning. It reduces the number of features by removing irrelevant or redundant ones without incurring much loss of information, simplifies models to make them easier to interpret, and shortens training times [17]. Therefore, a good feature selection method is required to reduce processing time and improve predictive accuracy. There are three types of feature selection algorithms: filter (features are selected from the data set without any learning), wrapper (a learning technique is used to estimate which features are useful) and hybrid (the feature selection step is combined with the classifier construction) [18,19].
2. These feature selection methods include both filter-based methods, such as Information Gain (IG) and Principal Component Analysis (PCA), and wrapper methods that use different search methods, namely Best First, Greedy Stepwise and PSO.
3. A comparative analysis was conducted to examine the performance of all methods and combinations used, and the best prediction results were reported.
This paper is organized as follows: Section 2: the related works. Section 3: discussion of the methods. Section 4: experimental results and discussion. Section 5: conclusions and future works.

Related Studies
Several works have investigated the diagnosis of PD, applying machine learning methods such as Support Vector Machine, neural network, Naïve Bayes, K-nearest neighbor and Random Forests. In this paper, several databases were used to search for related studies on Parkinson's disease, including Scopus, IEEE Xplore, Science Direct and Google Scholar.
In [20], a supervised ML method was proposed that combined Principal Component Analysis (PCA) for feature extraction with SVM for classification in order to identify PD patients. The main goal of this method was to determine whether patients would be diagnosed with PD or with Progressive Supranuclear Palsy (PSP). The experiments were conducted on data from several patients with clinical and demographic features. The results showed that the proposed method identified PD patients with good accuracy compared to existing related works.
In addition, the authors in [21] proposed an expert system for PD using features extracted from recordings of patients' voices. They developed a Bayesian classification approach that handles the dependence arising from the replication-based experimental design. The experiments were performed on voice recordings from 80 subjects, 50% of whom had PD. The aim was to identify which subjects had PD and which did not. Naranjo et al. addressed the problem of identifying PD patients using acoustic features extracted from repeated voice recordings. The proposed method was based on two steps, namely variable selection and classification. The first step aims to reduce the number of features, while the second step uses a regularization method named LASSO (Least Absolute Shrinkage and Selection Operator) as a classifier. The proposed method was tested on the previously described database and showed a good capacity for PD discrimination.
In addition, the authors in [22] addressed the problem of PD diagnosis by developing an approach that investigated gait and tremor features extracted from the voice recording data. They started by filtering the data to remove noise. Then, to extract gait features from the filtered data, they detected the peaks and measured the pulse duration. The average accuracy obtained by the proposed approach in identifying PD patients was satisfactory.
The authors in [23] proposed a method to detect PD automatically using a convolutional neural network (CNN). They used electroencephalogram (EEG) signals to build a thirteen-layer CNN model. The proposed approach was tested on EEG signals of 20 Parkinson's disease patients (50% men and 50% women). The CNN method obtained promising results in identifying PD patients; however, its performance should be evaluated on a larger population.
Recently, Mostafa et al. [24] tried to enhance the diagnosis of PD by using several methods of feature evaluation and classification. They used a multi-agent system to evaluate multiple features with five classification methods, namely DT, NB, NN, RF, and SVM. To evaluate the proposed method, they conducted several experiments using original and filtered datasets. The results showed that this method enhanced the performance of the ML methods used by finding the best set of features.
In addition, several methods were applied by [25][26][27] in order to predict Parkinson's disease. These studies applied several machine learning and feature selection methods to enhance the prediction of Parkinson's disease, and other studies utilized machine learning and deep learning to improve the prediction of diseases [28][29][30][31][32][33][34][35][36][37][38]. This paper extends these efforts by applying a comprehensive approach to investigate the performance of several machine learning methods combined with feature selection methods.

Methods
There are many feature selection techniques available, and we have considered the following: Filter-based technique, Correlation-based Feature Subset Selection (CfsSubsetEval), Principal Component Analysis (PCA), and Wrapper technique. These techniques use different strategies or search algorithms to generate subsets and advance the search process, including (i) Best First, (ii) Greedy Stepwise, (iii) Particle Swarm Optimization (PSO), and (iv) Ranker (see Fig. 1). The dataset used in this paper is available online at the UCI Machine Learning Repository [14]. The dataset contains acoustic features of 80 patients, 50% of them suffering from Parkinson's disease. The data set has 240 recordings with 46 acoustic features extracted from 3 voice recording replications per patient. The data set is well-balanced by gender and class label (whether the patients have Parkinson's disease or not).
The experimental protocol was designed to evaluate the combination of the above techniques and search algorithms when used with the following classification models: (i) Naïve Bayes, (ii) Support Vector Machine (SVM), (iii) K-Nearest Neighbor (K-NN), (iv) Multi-Layer Perceptron (MLP) and (v) Random Forest (RF). The experiments were carried out with the WEKA tool version 3.8 on a MacBook Pro running OS X Yosemite version 10.10.5. To evaluate the performance of each classifier, we first ran feature selection in order to find the representative features and then applied the classification models. Additionally, 10-fold cross validation was applied, and the results are reported in terms of Accuracy, Recall, Precision and F-score. Finally, we analyzed the results of the experiments. As stated earlier, the main goal of the research is to enhance the prediction of Parkinson's disease. However, this work also provides a useful guide to selecting the best feature selection technique for different classification models.
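The 10-fold protocol described above can be illustrated with a minimal sketch. This is not the WEKA implementation used in the experiments: the function names, the seed, and the `evaluate` callback are hypothetical, and the split here is a plain shuffled partition without stratification.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k disjoint test folds after shuffling."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average a score over k folds.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that fits a
    model on the training indices and returns a score on the test indices.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except fold i is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

# With 240 recordings and k=10, each test fold holds 24 recordings.
folds = k_fold_indices(240, k=10)
```

In a wrapper setting, the `evaluate` callback would train the chosen classifier on the selected features only.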

Feature Selection Techniques
Several feature selection techniques were applied before feeding the data into the classifier. The filter-based techniques consider the relevance of the features. Thus, they have low complexity and acceptable stability and scalability [39]. A disadvantage of this type of technique is that it might ignore some informative features, especially when the data arrives as a stream [40]. The filter-based approaches can be either univariate or multivariate [41]. The univariate methods examine each feature according to a statistical criterion such as Information Gain (IG) [42][43][44]. Multivariate methods compute feature dependency before ranking the features. In addition, Principal Component Analysis (PCA) is a common statistical method used for data analysis. PCA reduces the dimensionality of a data set by transforming the original features into a smaller set of components that represents the whole data set. Since PCA is a transformation technique, the first principal component is the one with the highest variance, and the remaining principal components are ordered by descending variance [45]. In addition, the wrapper-based techniques evaluate the quality of the selected features using the performance of the learning classifier.
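As an illustration of a univariate filter, the sketch below ranks features by information gain. It is a simplified example under stated assumptions: each numeric feature is split once at its mean, which is a choice made here for brevity and not necessarily how the tool used in the experiments discretizes attributes. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels, threshold):
    """Entropy reduction from splitting a numeric feature at `threshold`."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder

def rank_features(columns, labels):
    """Rank feature names by information gain, best first.

    Assumption: each feature is split at its mean value.
    """
    scores = {}
    for name, values in columns.items():
        threshold = sum(values) / len(values)
        scores[name] = information_gain(values, labels, threshold)
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: feature "a" separates the classes, feature "b" does not.
columns = {"a": [0.1, 0.2, 0.8, 0.9], "b": [0.1, 0.9, 0.2, 0.8]}
labels = [0, 0, 1, 1]
ranking = rank_features(columns, labels)
```

A Ranker-style search would then keep the top-ranked features and discard the rest.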
Regarding the search strategies, the search algorithms follow either sequential forward search (SFS) or sequential backward search (SBS). SFS starts with a single feature and then iteratively adds or removes features until some terminating criterion is met, whereas SBS starts with the whole feature set and then continues with adding and deleting operations. Since the SBS method finds solutions that range between suboptimal and near-optimal [41], it is worthwhile to employ optimization techniques to find the subset that maximizes the learner's performance, particularly with the wrapper approach. To this end, the wrapper-based method can take advantage of various optimization methods such as genetic algorithms [46,47] and the ant colony optimization (ACO) algorithm [48].
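The wrapper idea combined with a forward search can be sketched as follows. This is a simplified, forward-only greedy variant that starts from an empty subset and never removes a feature, so it is not identical to the search methods used in the experiments. The `score` callback, which stands for the cross-validated performance of the base classifier on a feature subset, and the toy scoring function are hypothetical.

```python
def sequential_forward_selection(features, score):
    """Greedy forward search over feature subsets.

    `score(subset)` is a hypothetical callback that returns the
    cross-validated performance of a classifier trained on `subset`.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining:
        # Evaluate adding each remaining feature to the current subset.
        candidate_scores = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidate_scores, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no candidate improves the subset, so stop
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Toy score: features "a" and "c" help, every other feature costs 0.1.
useful = {"a", "c"}

def toy_score(subset):
    return sum(1.0 if f in useful else -0.1 for f in subset)

selected, best = sequential_forward_selection(["a", "b", "c", "d"], toy_score)
```

Because each step calls the classifier once per candidate feature, the cost of a wrapper grows quickly with the number of features, which is the usual motivation for heuristic searches such as PSO.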

Machine Learning Classifiers
In machine learning, data classification remains an active area. Many algorithms have been proposed and examined in several domains, such as NB, SVM, K-NN, MLP and RF, which are presented briefly in the next subsections.

Support Vector Machine
The basic idea behind the SVM algorithm is to construct a hyperplane that separates groups of data. The quality of the hyperplane is evaluated by the degree to which it maintains the largest distance from the points in either class [39]. Therefore, as presented in Fig. 2, the better the hyperplane separates the classes, the lower the classification error [49]. The computational complexity of SVM is O(n^2) [50,51].
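The notion of hyperplane quality can be made concrete with a small sketch that computes the geometric margin of a given hyperplane. It only evaluates candidate hyperplanes and does not train an SVM. The function name and the toy points are hypothetical, and the labels are assumed to be encoded as +1 and -1.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from the hyperplane w.x + b = 0 to any point.

    Labels must be +1 or -1. The result is positive only if every point
    lies on the correct side of the hyperplane.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Two classes separated along the first coordinate.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labels = [-1, -1, 1, 1]

centered = geometric_margin((1.0, 0.0), -1.5, points, labels)  # plane at x = 1.5
shifted = geometric_margin((1.0, 0.0), -1.0, points, labels)   # plane at x = 1.0
```

Both hyperplanes separate the toy classes, but the centered one keeps a larger distance from the nearest points, which is the property an SVM seeks to maximize.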

Naïve Bayes
Naïve Bayes (NB) is a probabilistic classifier based on Bayes' theorem. It is called naïve because the classifier relies on a strong assumption of independence among the features. In the literature, there are several variants of NB: simple Naïve Bayes, Gaussian Naïve Bayes, Multinomial Naïve Bayes, Bernoulli Naïve Bayes and Multivariate Poisson Naïve Bayes. The main difference among them is the way the probability of the target class is computed. The time complexity of Naïve Bayes is O(d × c), where d is the dimension of the query vector and c is the total number of classes.
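A minimal sketch of the Gaussian variant is given below to show how the independence assumption turns the class posterior into a sum of per-feature log-likelihoods. It is an illustration only and not necessarily the configuration used in the experiments. The function names, the variance smoothing constant and the toy data are assumptions.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior, feature means and feature variances per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small constant keeps the variance strictly positive (assumption).
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior for sample x."""
    def log_posterior(params):
        prior, means, variances = params
        # Independence assumption: per-feature log-likelihoods are summed.
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=lambda label: log_posterior(model[label]))

# Toy data with two well-separated classes.
X = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.2]]
y = [0, 0, 1, 1]
model = fit_gaussian_nb(X, y)
```

Prediction loops over the classes and the features once, which matches the O(d × c) cost stated above.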

K-Nearest Neighbor
K-NN is a type of lazy learning, in which there is no explicit training phase and all computations are deferred until classification. It classifies data based on the nearest training data points in the feature space. The K-NN classifier uses the Euclidean distance measure, or another measure such as squared Euclidean, Manhattan or Chebyshev distance, to estimate the target class. The performance of the classifier depends on the parameter k, and the best value of k depends on the dataset. In general, the greater the value of k, the less the classification is affected by noise, but the boundaries between the classes become less distinct, as shown in Fig. 3. The time complexity of K-NN is O(n × m), where n is the number of training examples and m is the number of dimensions in the training set [52].
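The classification step described above can be sketched in a few lines. This is a minimal illustration with Euclidean distance and majority voting. The function name, the default k and the toy data are hypothetical and are not the settings used in the experiments.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Lazy learning: all distances are computed at prediction time.
    distances = sorted(
        ((math.dist(row, query), label) for row, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data: two clusters labeled "a" and "b".
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["a", "a", "a", "b", "b", "b"]
```

Each prediction computes one distance per training example over all dimensions, which corresponds to the O(n × m) cost given above.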

Multilayer Perceptron Model
The MLP is a classical feedforward neural network classifier in which the errors of the output are used to train the network [53]. MLP consists of three types of layers: (i) an input layer, (ii) one or more hidden layers, and (iii) an output layer. The input layer is connected to the hidden layers, which are connected to the output layer. The connections between the layers carry weights. Fig. 4 represents an MLP with a single hidden layer. Signals propagate forward through the network in one direction, and back-propagation is used to train the weight values. The time complexity of MLP is O(n^2).
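The forward pass of an MLP with a single hidden layer can be sketched as follows. Only the one-way propagation from input to output is shown, and training by back-propagation is omitted. The sigmoid activation, the function names and the weight layout (one list of weights per node) are assumptions made for this example.

```python
import math

def sigmoid(z):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """One fully connected layer: sigmoid(w.x + b) for each node."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Propagate x through the hidden layer and then the output layer."""
    hidden = layer_forward(x, hidden_w, hidden_b)
    return layer_forward(hidden, out_w, out_b)

# Two inputs, two hidden nodes, one output node, all weights set to zero.
output = mlp_forward(
    [1.0, 2.0],
    hidden_w=[[0.0, 0.0], [0.0, 0.0]], hidden_b=[0.0, 0.0],
    out_w=[[0.0, 0.0]], out_b=[0.0],
)
```

With all weights at zero every node outputs 0.5, so training is what moves the output toward the target class.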

Random Forests
The Random Forests (RF) classifier is an ensemble method that combines the predictions of multiple decision trees. In RF, the trees are generated randomly by selecting attributes at random at each node. The output of the ensemble is the class that receives the most tree votes. The pseudo-code of the Random Forest ensemble is presented in a separate table. The random forest method is robust to errors and outliers, which mitigates the problem of overfitting. The accuracy of the model depends mainly on the strength of the base classifiers and a measure of the dependence between them [55].
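The two sources of randomness and the voting step can be sketched as below. This is not the pseudo-code from the paper: tree induction is omitted, the trees are represented by hypothetical callables, and the square-root rule for the attribute subset size is a common convention assumed here.

```python
import random
from collections import Counter

def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices with replacement for one tree."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, rng):
    """Pick the attributes considered at one node (square-root rule, assumed)."""
    size = max(1, int(n_features ** 0.5))
    return rng.sample(range(n_features), size)

def forest_predict(trees, x):
    """Return the class that receives the most votes from the trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
rows = bootstrap_sample(240, rng)         # indices for one tree's training set
attrs = random_feature_subset(46, rng)    # candidate attributes at one node

# Hypothetical stand-ins for trained trees: each maps a sample to a class.
stub_trees = [lambda x: 1, lambda x: 0, lambda x: 1]
prediction = forest_predict(stub_trees, [0.0] * 46)
```

Because each tree sees a different bootstrap sample and different attributes, the trees are less correlated, which relates to the dependence measure mentioned above.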

Experimental Results
The experiments were conducted such that 10-fold cross validation was applied for each classifier. The performance of each classifier was measured by the accuracy, precision, recall and F-score. Tabs. 2-12 show the experimental results of several machine learning methods both with and without different feature selection methods. Tab. 2 shows the performance of all classifiers before applying feature selection methods. The results showed that Naïve Bayes obtained the best performance on all evaluation measures compared to the other classifiers. It obtained 82.92%, 83.30%, 82.90% and 82.90% for accuracy, precision, recall and F-score respectively.
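The four measures can be computed from the true and predicted labels as in the sketch below. It assumes that precision, recall and F-score are averaged over the classes with weights proportional to class size. That averaging choice, the function name and the toy labels are assumptions and are not stated in the text.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return accuracy and class-weighted precision, recall and F-score."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f_score = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        p_cls = tp / predicted if predicted else 0.0
        r_cls = tp / count
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if p_cls + r_cls else 0.0
        weight = count / n  # weight each class by its share of the samples
        precision += weight * p_cls
        recall += weight * r_cls
        f_score += weight * f_cls
    return accuracy, precision, recall, f_score

# Toy labels: 1 = PD, 0 = healthy.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
accuracy, precision, recall, f_score = weighted_metrics(y_true, y_pred)
```

Under this weighting the recall always equals the accuracy, which is consistent with the near-identical accuracy and recall values reported above.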
The number of features was reduced using the correlation-based feature selection (CfsSubsetEval) method to 23, 17 and 18 for the Best First, Greedy Stepwise and PSO search methods respectively, as shown in Fig. 5. The performance of the CfsSubsetEval combinations for each classifier is shown in Tab. 3. The results showed that most of the combinations obtained no improvement, except for RF with the Greedy Stepwise and PSO methods.
Tab. 4 shows the performance of the classifiers when the feature selection method based on information gain was applied. As shown in Fig. 5, the number of features was reduced to 10. The results showed no improvement in the performance of any classifier after applying this feature selection method. Tab. 7 shows that, when Naïve Bayes was used as the base classifier for the wrapper-based feature selection method, the performance of NB using the PSO search method was enhanced to 0.854, 0.855, 0.854 and 0.854 for accuracy, precision, recall and F-score respectively. The performance of the other classifiers using this method was reduced.
Tab. 8 shows the performance of the classifiers when the wrapper-based feature selection method with c-SVM as the base classifier was applied. The results showed that enhancements were obtained by all classifiers using all search methods. However, the best performance was obtained by SVM using the Best First and Greedy Stepwise search methods.
Tab. 9 shows the performance of the classifiers when the wrapper-based feature selection method with nu-SVM as the base classifier was applied. The results showed that enhancements were obtained by applying c-SVM, K-NN and RF, especially when the PSO search method was used.
In addition, Tab. 10 shows the performance of the classifiers when the wrapper-based feature selection method with MLP as the base classifier was applied. The results showed that enhancements were obtained by applying MLP and RF for the three search methods. The best results were obtained using the MLP classifier.
Moreover, Tab. 11 shows the performance of the classifiers when the wrapper-based feature selection method with K-NN as the base classifier was applied. The results showed that enhancements were obtained by applying K-NN and RF for the Best First and PSO search methods. The best results were obtained using the K-NN classifier, with accuracy, precision, recall and F-score of 0.883, 0.884, 0.883 and 0.883 respectively.
Tab. 12 shows the performance of the classifiers when the wrapper-based feature selection method with RF as the base classifier was applied. The results showed that enhancements were obtained by applying MLP and RF for the three search methods. The best results were obtained using the RF classifier.
Tab. 13 shows a comparison of different wrapper-based feature selection methods (using different base classifiers). The results showed that the best performing classifier was K-NN combined with wrapper-based feature selection using K-NN as the base classifier, obtaining 88.33% accuracy. With the best performance obtained, the number of features was reduced to 20, 5 and 22 using the Best First, Greedy Stepwise and PSO search methods respectively. Finally, Tab. 14 shows a comparison of the different feature selection methods (filter-based and wrapper-based). It shows that the best performance was obtained by the K-NN classifier combined with the wrapper-based feature selection method with K-NN as the base classifier and the Best First and PSO search methods.
The best performing method was also compared with previous studies on predicting Parkinson's disease using the same dataset and other datasets, as shown in Tab. 15. The comparison showed that the best performing method (K-NN classifier combined with the wrapper-based feature selection method with K-NN as the base classifier and the Best First and PSO search methods) obtained comparable or superior results.
Tab. 15 (partial): 57.5% Other dataset [57] 82.5% Other dataset [58] 82.5% Other dataset [59] 87.5% Other dataset

Conclusions and Future Works
This paper examined the performance of several classifiers with filter-based and wrapper-based feature selection methods to enhance the diagnosis of Parkinson's disease. Different evaluation metrics were used, including accuracy, precision, recall and F-score. The experiments compared the performance of machine learning methods on original and filtered datasets. The results showed that the wrapper-based feature selection method with K-NN enhanced the performance of predicting Parkinson's disease, reaching an accuracy of 88.33%. In future work, more machine learning and deep learning methods could be applied with these combinations of feature selection methods. In addition, other feature selection methods could be investigated to improve the performance of predicting Parkinson's disease.