Enhancing Parkinson’s Disease Diagnosis Accuracy Through Speech Signal Algorithm Modeling

Abstract: Parkinson’s disease (PD), one of whose symptoms is dysphonia, is a prevalent neurodegenerative disease. Outdated diagnostic techniques, which yield inaccurate and unreliable results, remain an obstacle to early-stage detection and diagnosis for clinicians. To address this issue, the study proposes using machine learning and deep learning models to analyze processed speech signals from patients’ voice recordings. Datasets of these processed speech signals were obtained and classified using random forest and logistic regression classifiers. The random forest classifier achieved approximately 90% accuracy and the logistic regression classifier 81.5%. A deep neural network was then implemented to investigate whether a different method could improve on these findings; it yielded an accuracy of nearly 92%. These results suggest that it is possible to accurately diagnose early-stage PD by testing patients’ voices alone. This research supports a new diagnostic approach in decision support systems and is a first step towards healthcare software dedicated to aiding clinicians in the early diagnosis of PD.

substantia nigra. This leads to a dopamine deficiency, which produces the motor symptoms of Parkinson's. The full causes of this neural death are still unknown, so there is no definitive medical test that can identify the disease, which makes it difficult to diagnose accurately [1].
Studies indicate that approximately 89% of PD patients experience speech and voice disorders, including a soft, monotone, and hoarse voice, coupled with hesitancy and uncertain articulation. These disorders stem from the disordered motor system associated with PD. Poor muscle activation, which leads to bradykinesia and hypokinesia, can extend to the muscles involved in speech, possibly reducing the movement of the respiratory system and larynx and impairing articulation [2]. An in-depth study specified that when the basal ganglia in the cerebral hemispheres are affected (as is apt to occur with PD), dysarthria may arise because muscle control is involved in the pronunciation mechanism, as displayed in Fig. 1 [3]. As PD progresses, changes occur in the vocal folds (vocal cords). The vocal cord muscles become thinner and less taut, which affects their vibration and induces a gap between the cords. The leakage of air through this gap results in the softness or hoarseness that is noticeable in the speech assessment of PD patients [4]. Furthermore, a closer look at the speech pattern reveals short bursts of words separated by long, inappropriate silences [5]. Together, these findings identify what to look for when relating speech signals to a potential PD diagnosis.

Difficulty of Parkinson's Disease Early Diagnosis and Detection
Studies have shown that initial diagnoses made by general neurologists proved erroneous in 24% to 35% of cases upon postmortem examination [6]. To this day, diagnostic practice relies mainly on clinical assessment because there are no diagnostic biological markers for Parkinson's disease. Hence, in typical clinical settings, identifying early-stage PD presents a diagnostic challenge, particularly because symptoms at this stage are highly subtle. Diagnosis relies on detecting three of the four cardinal motor signs (bradykinesia, tremor, rigidity, and postural instability); this criterion becomes unreliable because many other movement disorders share the same symptoms (vascular parkinsonism, essential tremor, multiple system atrophy) [7]. Rizzo et al. summarize the situation in their conclusion on the accuracy of clinical diagnosis of PD: "The overall validity of clinical diagnosis of PD is not satisfying. The accuracy did not significantly improve in the last 25 years, particularly in the early stages of disease, where response to dopaminergic treatment is less defined and hallmarks of alternative diagnoses such as atypical parkinsonism may not have emerged" [8].
This paper uses machine learning and deep learning models to enhance the accuracy of early-stage PD diagnosis through classification of processed speech signals. Unlike previous research, which focuses mainly on the speech signal processing phase, this paper is dedicated to improving classification accuracy as much as possible. A gap in the research persists: maintaining the same highly specific scope of these algorithms while prioritizing experimentation with both machine learning and deep learning methods on the resulting processed data. This combined voice-and-computation approach reduces human error and provides a distinctive yet robust substitute for the outdated diagnosis techniques practiced in clinical settings worldwide.
Studies show that early intervention in PD could help preserve neuron functionality, reduce symptoms, slow disease progression, and improve patient quality of life (QoL) [9]. Additionally, early diagnosis of PD is crucial because treatments such as levodopa/carbidopa are considerably more effective when administered early [10]. From an economic standpoint, estimated annual costs for PD in the US alone are approximately $11 billion, with $6.2 billion in direct costs [11]. Because the largest portion of these costs is incurred in the later stages of PD, when symptoms are most acute, addressing the disease in its early stages would be far more cost-efficient than treating it later, and the same reasoning applies to the patient's QoL.
The paper is structured as follows. The first section presents the introduction, clarifying the challenge at hand, the paper's proposed solution, and the structure of the paper. The second section is a background and literature review, which focuses on researchers' previous efforts to solve the issue of inaccurate early diagnosis of PD; its scope starts out general and narrows down to the implementation of computational methods using processed speech signals. The third section presents the methodology and consists of a detailed description of the model implementation, including, but not limited to, model selection, feature selection, and the implementation of each specific machine learning algorithm. Graphs and statistical equations are used as supporting evidence. The fourth section shows the experimental results, followed by a comparative analysis and brief discussion to contextualize the value of the paper's contribution. Finally, the fifth section provides the conclusion of the research and recommendations for future work.

Efforts to Improve Early Diagnosis of Parkinson's Disease
Efforts to improve the accuracy of early-stage PD detection have been led by work on biological markers and advancements in neuropathological findings. Using the latter as the gold standard, studies have indeed increased accuracy and called for diagnostic biomarkers [12]. Other researchers have taken intersectional approaches, such as looking for symptoms and biomarkers in cerebral fluid while also performing tissue imaging and biopsies, as many neurodegenerative diseases are a product of misfolded proteins [13]. In a separate study, a variety of premotor symptoms were identified, and indicators such as diminished olfactory function and REM sleep behavior disorder were used in attempts at early diagnosis, accompanied by other means of detection such as sonography, MRI, and highly complex neuroimaging techniques. Once again, biomarkers such as protein panels, auto-antibody testing, and a 5-gene panel proved to be excellent diagnostic markers [11]. However, a computational and statistical approach could spare much of the human effort and time spent on manually and biologically attempting to refine accuracy. Better still, an intersectional approach combining all these methods could perhaps eliminate or significantly reduce the inaccuracy of PD diagnosis.
Perhaps the progress most relevant to this paper in the efforts to increase PD early diagnosis accuracy (outside of speech signal processing) comes from a study by Mohskova et al., in which hand movements were recorded via a motion sensor to detect PD through machine learning methods. The kinematic parameters of subjects with PD and of a control group were obtained from three motor tasks: finger tapping, pronation-supination of the hand, and opening-closing hand movements. Different classifiers were used, and key points were determined using a maximum-and-minimum finder algorithm in order to determine the binary disease status (PD or non-PD) of each subject. The results were highly informative, with accuracies of 95.3% for finger tapping, 90.6% for opening-closing hand movements, and 93.8% for pronation-supination [14].

Speech Signal Processing Algorithms
An intricate series of steps converts the analog sound signals phonated by patients into numbers that a model can analyze. This is the process of feature extraction, or "extracting features characterizing the underlying patterns of the speech signals using signal processing algorithms" [15]. Dysphonia, or malfunction in voice production, is measured in stages: subjects record several volume variations of their voice, and after initial filtering to remove any phonations prone to error (recordings that are too short, interrupted by coughing, etc.), thousands of signals of the sustained vowel "a" are processed. The next step is feature extraction, which is concerned with the oscillatory motion of the voice; this relates to the vocal folds described in the previous section. The vocal fold oscillation pattern (vocal fold opening and closure) is almost periodic in healthy voices, meaning that the time intervals between successive cycles, during which the vocal folds are apart or in collision, are almost equal. Speech scientists refer to this interval as the "pitch period," and to its reciprocal as the "fundamental frequency." While in healthy voices the vocal folds collide and remain together for a certain segment of this cycle, dysphonia is identified by an "incomplete vocal fold closure," resulting in excessive breathiness and turbulent noise in the airflow from the lungs. Therefore, those with voice disorders cannot produce steady phonations, and this is where speech signal processing algorithms come in: they take the aforementioned physiological conditions into account and quantify this instability to produce digital data ready for analysis that ultimately aids clinical decisions. In speech jargon, these algorithms are called "dysphonia measures."
Dysphonia measures are computed on the thousands of speech recordings obtained from the subjects, and a multitude of software packages, such as Praat, exist for this purpose [15]. Another notable feature extraction method is the tunable Q-factor wavelet transform, which in one study performed better than or comparably to the most recent and developed techniques in PD classification [16].
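To make the idea of a dysphonia measure concrete, the following minimal sketch computes two classic perturbation measures, local jitter and local shimmer, from a sequence of pitch periods and per-cycle peak amplitudes. The input arrays are synthetic and purely illustrative (they are assumptions, not values from the study), and dedicated tools such as Praat derive these quantities from the raw waveform with additional safeguards.

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference between consecutive pitch periods,
    divided by the mean period (a relative, unitless quantity)."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """The same idea applied to the peak amplitude of each cycle."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

rng = np.random.default_rng(0)

# Hypothetical pitch periods in seconds: a perfectly periodic voice
# versus one whose cycle length fluctuates from cycle to cycle.
steady_periods = np.full(50, 0.008)
unsteady_periods = 0.008 + rng.normal(0.0, 0.0004, size=50)

jitter_steady = local_jitter(steady_periods)
jitter_unsteady = local_jitter(unsteady_periods)

# Hypothetical per-cycle peak amplitudes (arbitrary units).
shimmer_unsteady = local_shimmer(1.0 + rng.normal(0.0, 0.05, size=50))

# The fundamental frequency is the reciprocal of the pitch period.
f0_steady = 1.0 / steady_periods.mean()

print(jitter_steady, jitter_unsteady, shimmer_unsteady, f0_steady)
```

A perfectly periodic signal has zero jitter, while any cycle-to-cycle instability raises the value, which is the property that makes such measures usable as classifier features.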

The Use of Machine Learning and Statistical Methods on Speech Signal Processing Algorithms
An important step in noninvasive PD diagnostic decision support was taken in perhaps the most similar study to the current one, in which a wide spectrum of speech signal processing algorithms (dysphonia measures) was analyzed using two statistical classifiers: random forests and support vector machines. Patients were asked to vocalize sustained vowels, from which 132 different dysphonia measures were computed. The results surpassed the state of the art, with nearly 99% classification accuracy using ten dysphonia features, showing that the suggested approach can complement existing algorithms in helping classifiers differentiate between control subjects and PD patients [17]. However, a limitation of this study is that classification was conducted on only ten dysphonia features, while in reality the number of characteristics available in speech signal processing is far larger.
A separate study took this idea a step further by applying non-linear analysis of the range of speech signal processing algorithms to the standard clinical score that determines PD symptom severity (the Unified Parkinson's Disease Rating Scale, or UPDRS). Along with the normal set of tasks required of the patient, the study tested accuracy using self-administered speech tests. Selection algorithms were used to filter for the best feature subset, which was fed into non-parametric regression and classification algorithms. The results were reported to be more accurate than clinicians' predictions, with a difference of about 2 points. This suggests that the technology could be scaled up to large-scale clinical trials [16].
Mustaqeem et al. [18] provide a good example of using a neural network to identify patterns in the voice. The researchers formulated a speech emotion recognition system using a stacked network with dilated convolutional neural network features. This research is not related to the diagnosis of a disease with voice symptoms, but it represents another step towards using voice intonations and fluctuations to infer the state of the human subject.

Model Selection
The quality of results and model accuracy ultimately depend on two factors:
-Data quality
-Model selection (followed by fine-tuning that model for optimal performance)
Therefore, the choice of model cannot be taken lightly. A variety of factors go into this choice, and they played a substantive role in determining which models were selected for this research.
Factors affecting model selection:
• Size of training data: A large dataset such as the one used in this research is better suited to low bias/high variance algorithms such as decision trees, random forest, and K-nearest neighbor.
• Accuracy: There is always a tradeoff between accuracy and interpretability of output, as represented in Fig. 2. In this case, because the goal is to achieve the highest accuracy possible, a flexible model is highly preferred.
• Speed/training time: Models with higher accuracy, such as SVM and random forest, usually require longer training times, while models like logistic regression are quicker to train.
• Linearity: Kernel SVM and random forest are preferred for non-linear data, while logistic regression and linear SVM are preferred for linear data.
• Number of features: Because this dataset has an extremely high number of features, dimension reduction is necessary before the data is input to a classification model [19].
Based on the previous factors, and considering that a classification model is required to distinguish between positive detection and negative detection (0's and 1's), a model with the following specifications is required:
-Classification model
-Large training data size (low bias/high variance)
-High accuracy
-Linear or non-linear data (to be tested)
-High number of features

Figure 2: Accuracy-interpretability tradeoff
Thus, the decision was made to use random forest, logistic regression, and deep neural network algorithms. These three were chosen specifically because together they cover all of the aforementioned criteria while differing significantly from one another. Each algorithm sits at a different point on the accuracy-interpretability tradeoff spectrum, and each varies in run time and in the complexity of the data it can handle. With such a unique challenge at hand, it is imperative to diversify the approach in order to identify the most promising direction for future trials. The diversity of these three models facilitated algorithm experimentation, as the results made evident which models were better suited to this task. Because of this, the findings are relevant to future research on the same topic. Fig. 3 shows the framework of this study's methodology.

Data Retrieval and Import
Data was retrieved as a csv file from a dataset on Kaggle™ (the Google online data science community), and this data was collected from UCI Machine Learning Repository [21].
The data was gathered from 188 patients (107 men, 81 women) with ages ranging from 33 to 87 (65.1 ± 10.9), provided by the Department of Neurology, Faculty of Medicine, Istanbul University. The control group is made up of 64 people (23 men, 41 women) with ages ranging from 41 to 82 (61.1 ± 8.9). The data collection process consisted of each subject sustaining phonation of the vowel "a" for three repetitions.
Attribute information: A variety of speech signal processing algorithms have been applied to the dataset, including Time Frequency Features, tunable Q-factor wavelet transform (TQWT), Wavelet Transform-based Features, and Vocal Fold Features in order to derive clinically significant information for PD diagnosis [22].
Data import: Using an Anaconda Jupyter™ notebook in Python, several libraries were used, namely NumPy (for linear algebra and arithmetic operations), Pandas (for data processing and CSV file reading), and scikit-learn™ (a free machine learning library for Python).
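The import step can be sketched as follows. This is a minimal illustration rather than the study's actual notebook: the column names (`id`, `gender`, `feature_a`, `feature_b`, `class`), the label encoding, and the in-memory CSV text are all hypothetical stand-ins, since the real file contains several hundred feature columns.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded CSV (three recordings per subject).
csv_text = """id,gender,feature_a,feature_b,class
0,1,0.85,-0.12,1
0,1,0.83,-0.10,1
0,1,0.86,-0.15,1
1,0,0.41,0.30,0
1,0,0.44,0.28,0
1,0,0.40,0.33,0
"""

# With a real file on disk this would be pd.read_csv("path/to/file.csv").
df = pd.read_csv(io.StringIO(csv_text))

# Separate the feature matrix from the label column.
X = df.drop(columns=["id", "class"])
y = df["class"]          # assumed encoding: 1 = PD, 0 = control

print(df.head())         # first five rows
print(X.shape, y.value_counts().to_dict())
```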

Data Visualization
Data visualization was done to identify general patterns within the data that would be difficult to recognize from the numbers alone. Tools such as Matplotlib, Seaborn, and others facilitate this process. An example is a heatmap, such as the one in Fig. 4, which uses Spearman's rank correlation coefficient to show the correlation between different features. It benefitted the research by indicating which features are highly correlated and potentially significant, thereby narrowing down the data extensively.
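A Spearman correlation matrix of the kind rendered in such a heatmap can be computed directly with Pandas. The sketch below uses three synthetic features (hypothetical names and data) and an illustrative threshold of 0.9; the plotting helper is defined but not called, and assumes Matplotlib and Seaborn are installed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
base = rng.normal(size=n)
df = pd.DataFrame({
    "feat_1": base,
    "feat_2": base ** 3,            # monotonic in feat_1, so the rank correlation is 1
    "feat_3": rng.normal(size=n),   # unrelated noise
})

# Spearman's coefficient is the Pearson correlation of the ranks.
corr = df.corr(method="spearman")

def plot_heatmap(corr_matrix):
    """Render the matrix as a heatmap (requires matplotlib and seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn as sns
    fig, ax = plt.subplots(figsize=(4, 3))
    sns.heatmap(corr_matrix, vmin=-1, vmax=1, cmap="coolwarm", annot=True, ax=ax)
    return fig

# List feature pairs whose absolute rank correlation exceeds the threshold.
threshold = 0.9  # illustrative value, not taken from the study
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```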

Feature Selection
One of the most crucial phases of the data cleaning process, feature selection is the process in which the 754 features are reduced to only 15-20 via dimension reduction. This select group of features is assumed to be the most significant, with the highest correlation, and is therefore expected to yield the most useful data. Numerous dimension reduction methods can be used, but in this research the techniques used were the wrapper method (for the random forest model) and a tree-based classifier (for the logistic regression model). Each technique is elaborated on in later segments.

Data Splitting
A staple of predictive modelling, splitting divides the dataset into a training set that the model learns from and a test set on which its predictive accuracy is evaluated. There are many ways to split the data, but in this research a 70/30 train/test split was used. Fig. 5 gives an example of one of the ways used to split the data.
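A 70/30 split can be sketched with scikit-learn's `train_test_split`. The data below is synthetic; the row count assumes one row per recording (252 subjects with three recordings each), and the use of `stratify` and a fixed `random_state` are assumptions made for illustration rather than settings confirmed by the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 756 rows (188 patients and 64 controls, 3 recordings each).
X = rng.normal(size=(756, 20))
y = np.array([1] * 564 + [0] * 192)

# 70% of rows for training, 30% held out for testing.
# stratify=y keeps the class proportions similar in both parts (an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

With 756 rows, a 30% test fraction yields 227 test rows, which is consistent with the total number of predictions in the random forest confusion matrix reported in the results.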

Random Forest Model
One of the most useful and accurate models, the random forest is an aggregation of decision trees whose final decision is the majority vote of the trees composing it. After the dataset is obtained and imported (as explained above), its first five rows are inspected using the head() function from the Pandas library.
Feature selection is done using the wrapper method, which follows a greedy search approach by evaluating candidate combinations of features against the evaluation criterion. For classification, as in this case, the evaluation criterion is accuracy. The method then selects the combination of features that gives the optimal results for the algorithm [23]. ROC AUC (the area under the receiver operating characteristic curve) is an integral part of this process, as it summarizes the curve information in one number. The ROC curve is the plot of the true positive rate against the false positive rate at all classification thresholds. The area under it is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. This is illustrated in Fig. 6.
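One way to realize a greedy wrapper search is scikit-learn's `SequentialFeatureSelector`, sketched below on synthetic data. This is a swapped-in illustration, not necessarily the tool used in the study: the dataset, the number of selected features, the tree count, and the choice of `scoring="roc_auc"` (which could be replaced by `"accuracy"`) are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic data: 10 features, of which only 3 carry signal.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# Forward selection: start from no features and, at each step, add the
# single feature that most improves the cross-validated score.
selector = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=25, random_state=0),
    n_features_to_select=3,
    direction="forward",
    scoring="roc_auc",
    cv=3,
)
selector.fit(X, y)

selected = np.flatnonzero(selector.get_support())  # indices of kept features
X_reduced = selector.transform(X)                  # data restricted to them

print(selected, X_reduced.shape)
```

Because every candidate subset requires refitting the model, the cost grows quickly with the number of features, which is why wrapper methods are usually reserved for cases where accuracy matters more than run time.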
To test if the model works, train_pred and test_pred are compared. Parameters used to create the model:
-n_estimators (number of trees to build before taking the majority vote or average of predictions)
-random_state (facilitates replication of any solution)
-max_depth (longest path between root node and leaf node)
2962 CMC, 2022, vol.70, no.2
Figure 6: Graph of area under the ROC curve [24]
Finally, the data is fit to the random forest classifier and a confusion matrix is produced to indicate true positives, true negatives, false positives, and false negatives. Lastly, accuracy is computed.
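The fitting and evaluation steps described above might look like the following sketch. The data is synthetic and the parameter values (`n_estimators=100`, `max_depth=5`, `random_state=0`) are illustrative assumptions, not the values used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=600, n_features=15, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X_train, y_train)

# Comparing training and test predictions gives a rough check for overfitting.
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
train_acc = accuracy_score(y_train, train_pred)
test_acc = accuracy_score(y_test, test_pred)

# For binary labels, ravel() yields the four cells in this order.
tn, fp, fn, tp = confusion_matrix(y_test, test_pred).ravel()

# ROC AUC is computed from predicted probabilities of the positive class.
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(train_acc, test_acc, (tn, fp, fn, tp), test_auc)
```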

Logistic Regression Model
One of the simpler models, logistic regression is a good way to test the data and see where it stands in terms of complexity and linearity. It is based on the standard sigmoid logistic function in statistics. Fig. 7 shows the graph of the standard sigmoid logistic function. Steps:
-Import data and libraries (previously explained)
-Display data (previously explained)
-Select features using a tree-based classifier: Because logistic regression is a restrictive model valued for its interpretability and time efficiency, using the wrapper method would be too time-consuming. Similarly, univariate selection would be incompatible because it can only work with positive values, while the dataset used contains both positive and negative values. Another technique, a correlation matrix with heatmap, also requires a lot of time as well as high computational power. Therefore, the tree-based classifier was the most suitable method, as it relies on decision trees. Although it might not yield the highest accuracy, it is much more efficient and works with negative numbers.
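The steps above can be sketched as follows, using the standard sigmoid σ(z) = 1 / (1 + e^(-z)). The choice of `ExtraTreesClassifier` as the tree-based selector, the value of `top_k`, the added `StandardScaler`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def sigmoid(z):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_classification(
    n_samples=600, n_features=40, n_informative=6, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Rank features by the impurity-based importance of a tree ensemble
# and keep the top_k highest-ranked ones.
forest = ExtraTreesClassifier(n_estimators=200, random_state=1)
forest.fit(X_train, y_train)
top_k = 15
top_idx = np.argsort(forest.feature_importances_)[::-1][:top_k]

# Standardize the selected features using training statistics only.
scaler = StandardScaler().fit(X_train[:, top_idx])
X_train_sel = scaler.transform(X_train[:, top_idx])
X_test_sel = scaler.transform(X_test[:, top_idx])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_sel, y_train)
test_acc = clf.score(X_test_sel, y_test)

# The predicted probability is the sigmoid of the linear decision score.
z = clf.decision_function(X_test_sel)
proba = clf.predict_proba(X_test_sel)[:, 1]

print(test_acc, np.allclose(sigmoid(z), proba))
```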

Deep Neural Network
In the context of this research paper, a neural network can best be defined as an intricate web of classifiers linked together in the form of a network. This network contains input, output, and hidden layers that pick up on hard-to-detect patterns that simple classifiers would have difficulty with. Neural networks are extremely beneficial in situations where pattern complexity becomes an obstacle, and this was the driving force behind implementing a deep neural network in this research. Steps:
-Noteworthy libraries used here include TensorFlow, one of the main neural network frameworks, from which keras (API) and layers are imported. Pandas is also used to display the data entries in a table.
-Next, the data is split. The first split is 60% training, 40% validation. At the end of each epoch, the loss and any model metrics are evaluated on this validation data. Then the validation data is split into 80% training and 20% cross validation. Normalization of the data follows; the training, validation, and test data are normalized. This is a transformation that maintains an output mean close to 0 and an output standard deviation close to 1.
-Subsequently, layers are created via a sequential model, which trains a stack of layers. Data is input and output in the form of tensors, or multidimensional arrays. The type of layer used in this research is a regular densely connected neural network layer (dense layer), shown in Fig. 8.
-After the creation of each layer, it undergoes batch normalization to provide more layer stability. In addition, the function Dropout() is used with parameter 0.1 to randomly drop 10% of the nodes during training, which helps prevent overfitting.
-Optimization of the neural network is done using the well-known "adam" optimizer.
-Another detail worth analyzing is the loss function: the goal of training is to increase accuracy by minimizing the loss. The loss function used here is cross entropy, which is commonly used when optimizing classification models. Fig. 9 indicates the relationship between cross entropy and predicted probability.
The callback used in this algorithm was early stopping of the training. Noteworthy parameters are:
-min_delta (the minimum change in the monitored quantity to qualify as an improvement)
-patience (the number of epochs with no improvement after which training is stopped)
-Batch size (how many samples are processed before the model is updated)
-epoch (one complete pass through the training dataset; training runs for a set number of epochs unless stopped early)
-Verbose (controls the status report printed during training).
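The relationship between cross entropy and predicted probability can be illustrated numerically. For a true label y and predicted probability p, binary cross entropy is -[y ln(p) + (1 - y) ln(1 - p)]. The sketch below is a plain NumPy version written for illustration; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross entropy between labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float(np.mean(losses))

# For a positive example (y = 1), the loss is -ln(p):
loss_confident_right = binary_cross_entropy([1], [0.9])  # small loss
loss_unsure = binary_cross_entropy([1], [0.5])           # moderate loss
loss_confident_wrong = binary_cross_entropy([1], [0.1])  # large loss

print(loss_confident_right, loss_unsure, loss_confident_wrong)
```

The loss grows without bound as the predicted probability of the true class approaches zero, so confident mistakes are penalized far more heavily than uncertain ones.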
Lastly, accuracy is determined and visual representation is shown (elaborated on in the results section).
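A network along the lines described in this section could be assembled with Keras as in the sketch below. It is an illustration under stated assumptions rather than the study's architecture: the layer widths, the number of layers, the split sizes, and the `min_delta`, `patience`, epoch, and batch size values are hypothetical, and the data is synthetic.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20)).astype("float32")   # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype("float32")      # synthetic binary labels

# Simple hold-out split, then standardization with training statistics.
X_train, X_val = X[:300], X[300:]
y_train, y_val = y[:300], y[300:]
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / std
X_val = (X_val - mean) / std

# Stack of dense layers, each followed by batch normalization and 10% dropout.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(32, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.1),
    layers.Dense(16, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.1),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop when validation loss has not improved by min_delta for `patience` epochs.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", min_delta=0.001, patience=5, restore_best_weights=True
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=30, batch_size=32, verbose=0,
    callbacks=[early_stop],
)
val_loss, val_acc = model.evaluate(X_val, y_val, verbose=0)
print(len(history.history["val_loss"]), val_acc)
```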

Results and Discussion
After applying different approaches to feature selection and predictive modelling on the same dataset, the resulting accuracies show discrepancies that indicate meaningful differences in quality among the models tested. The neural network achieved the highest accuracy, nearly 92%, by detecting hidden patterns through its multiple layers, as depicted in Fig. 10. The random forest and logistic regression models also yielded high accuracies (90.7% and 81.5% respectively). Looking at the confusion matrix for the random forest, there were 32 true positives and 174 true negatives, where true positives and true negatives are outcomes in which the model's prediction matched the actual class. There were a combined 21 false predictions, giving an accuracy of 90.7%. For the logistic regression model, there were a combined 124 true predictions (12 true positives and 112 true negatives) and 28 false predictions, giving the model an accuracy of 81.5%.
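The reported accuracies follow directly from the confusion matrix counts, since accuracy is the number of correct predictions divided by the total number of predictions. The short check below reproduces the arithmetic from the counts given above; the helper function name is illustrative.

```python
def accuracy(true_pos, true_neg, false_total):
    """Fraction of predictions that were correct."""
    correct = true_pos + true_neg
    return correct / (correct + false_total)

# Random forest: 32 + 174 = 206 correct out of 227 predictions.
rf_accuracy = accuracy(true_pos=32, true_neg=174, false_total=21)

# Logistic regression: 12 + 112 = 124 correct out of 152 predictions.
lr_accuracy = accuracy(true_pos=12, true_neg=112, false_total=28)

print(f"Random forest: {rf_accuracy:.1%}")        # 90.7%
print(f"Logistic regression: {lr_accuracy:.1%}")  # 81.6%
```

The logistic regression ratio is 124/152 ≈ 0.8158, which rounds to 81.6%; the 81.5% figure quoted in the text corresponds to truncating rather than rounding.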
For the sake of a legitimate comparative analysis, this study's results were compared against those of reference 16, "A comparative analysis of speech signal processing algorithms for Parkinson's disease classification and the use of the tunable Q-factor wavelet transform". That paper was chosen not only because it is considered state-of-the-art (SOTA) in its domain, but also because it is the paper most similar to the current research. Its results are included in Fig. 11. It has an identical scope of study, yet its authors focus more on the speech signal processing algorithms and less on how to enhance the classifiers used in later steps of their research. The current study focuses on enhancing accuracy via experimentation on the machine learning algorithms, and additionally uses a deep neural network. As a result, the current study has yielded favorable results in classifier accuracy relative to the SOTA study.
The results in Fig. 11 show that across different trials and classification methods, accuracies in the SOTA study varied from the mid-60% range up to the mid-80% range. The highest accuracy found across all trials is 86%, when all feature subsets were used and classified by SVM (RBF). The highest accuracy obtained by logistic regression in the SOTA study is 85% (using all feature subsets), while the current study achieved a logistic regression accuracy of 81.5%; the SOTA study therefore has the edge in that regard. The highest accuracy obtained in the SOTA study using random forest is 85% (using all feature subsets), while the current study achieved a random forest accuracy of 90.7%, so the current study is superior in this regard. The current study also went a step beyond the SOTA study by implementing a deep neural network in addition to the machine learning methods. The resulting accuracy was nearly 92%, almost 6 percentage points higher than any model in any trial conducted in the SOTA study. [17]
Thus, the current study improves on the classification accuracy reported in its counterpart. This suggests that if this research's methods had been used in the SOTA study, they might have yielded more accurate results, although this was not the purpose of the SOTA study, whose authors aimed only to compare feature sets relative to one another. In light of these figures, the bigger-picture takeaway is that this study makes a significant contribution to the field of PD diagnostics, and does so in a unique and simple way: through patients' voices.

Conclusion
Rapidly advancing machine learning technologies such as this one hold promise for the future of diagnostics. Setting out with the goal of improving the early diagnosis of Parkinson's disease, this research has made progress towards it. Using three different approaches (one deep neural network, one machine learning model that is more flexible and aims for accuracy and non-linearity, and another machine learning model that is more restrictive, time-efficient, and linear), the researchers achieved a high prediction accuracy for Parkinson's disease by modelling the output of speech signal processing algorithms. With an accuracy of nearly 92% using the neural network, 90% using random forest, and 81.5% using logistic regression, this work is another step towards overcoming diagnostic obstacles in the medical field and towards a stable implementation of healthcare software in hospitals to aid clinicians with diagnostic decisions for PD patients. The limitations of the proposed method include: 1) the voice recordings of patients must first be analyzed by speech signal processing algorithms in order to be broken down into features that the computational models can classify; therefore, the data always has to be pre-processed. 2) Only three models were used in this research, which is a limited number considering the diversity of model experimentation in other, more extensive research papers. Looking towards the future, the authors believe that the next step is to integrate speech signal processing with machine learning modelling. The authors also want to try more models to investigate differences in outcome. If these two limitations are addressed, this paper could be a first step on the road to worldwide implementation of healthcare software that diagnoses PD by testing patients' voices for a matter of minutes.