According to worldwide analyses, heart disease is considered a significant threat that substantially increases the mortality rate. Investigators therefore aim to predict the occurrence of heart disease at an earlier stage through the design of a better Clinical Decision Support System (CDSS). Generally, a CDSS is used to predict an individual's heart disease and to periodically update the patient's condition. This research proposes a novel heart disease prediction system in which the CDSS is composed of a clustering model for noise removal that predicts and eliminates outliers. Here, the Synthetic Minority Over-sampling Technique (SMOTE) is integrated with the clustering concept to balance the training data, and the AdaBoost classifier model is used to predict heart disease. Optimization is then achieved using the Adam Optimizer (AO) model with the publicly available dataset known as the Statlog dataset. This flow is used to construct the model, and the evaluation is carried out against various prevailing approaches such as Decision Tree, Random Forest, Logistic Regression, and Naive Bayes. The statistical analysis is done with the Wilcoxon rank-sum method for extracting the statistical significance of the proposed model.
Globally, as reported by the World Health Organization (WHO), about 42 million premature deaths are encountered annually due to non-communicable diseases, i.e., roughly 72% of all deaths. The total count of non-communicable-disease-based deaths is projected to reach 52.1 million annually by 2030 [
Various recent studies have adopted Machine Learning (ML) models as decision-support systems for earlier prediction of disease, driven by the growing awareness of hypertension and diabetes risks [
However, none of the previous research works has integrated the characteristics of SMOTE with an ML classifier and an optimizer model to enhance the prediction accuracy of the model. Thus, this research concentrates on modelling an efficient decision-support system that uses ML classification and optimization to predict heart disease from individual risk-factor data [
The proposed decision support system model includes four major parts: 1) Statlog dataset acquisition; 2) pre-processing with SMOTE; 3) classification with the AdaBoost classifier model; and 4) the Adam Optimizer to attain a global solution. Together, these methods predict the disease at an earlier stage to help the individual. The significant research contributions are given below.
Generally, the SMOTE technique is applied in the ML algorithm because the newly constructed instances are not exact copies of existing ones. It therefore softens the decision boundaries and helps the algorithm approximate the hypothesis more accurately. It also reduces the Mean Absolute Error (MAE) and gives better prediction accuracy.
Here, the AdaBoost classifier model is adopted as a well-suited boosting algorithm. The AdaBoost model's significant benefits are a reduced generalization error, a straightforward implementation, and adaptability to a broad range of base classifiers with no parameter adjustment. However, the classifier must be handled with care, as the algorithm is sensitive to data outliers.
Finally, the Adam Optimizer (AO) is used in this research because it works well with sparse gradients, noisy data, and large numbers of dataset parameters, and it needs less memory. This optimizer is well known for its computational efficiency and straightforward implementation.
The simulation is done in the MATLAB 2018a environment. Various performance metrics such as prediction accuracy, sensitivity, specificity, and execution time are evaluated to show the model's significance and its improved prediction accuracy.
The work is structured as follows: Section 2 summarises various existing approaches used to predict the disease, along with their pros and cons. Section 3 presents the ideology of the anticipated model, which includes dataset acquisition, the SMOTE technique, the AdaBoost classifier, and the Adam optimizer for enhancing the prediction rate. Section 4 discusses the performance evaluation outcomes, analysing various metrics related to balancing the data. Section 5 summarises the research work, its constraints, and ideas for future research improvements.
Generally, medical data is composed of various features that can be categorized by their usefulness; less practical or redundant features contribute little to the progression and formulation of diverse diagnostic measurements. Predicting from the right attributes is essential for representing the proximity of the domain appropriately [
This review section covers notable studies that predict disease risk using the Cleveland heart dataset from the UCI ML repository. Avci [
With Tun et al. [
Singh et al. [
References | Methodology | Accuracy (%) | AUC (%) | Total number of chosen features
[ ] | Enhanced ID3 | 78.78 | – | 13
[ ] | FILM | 88.85 | – | 11
[ ] | Voting method | 88.42 | – | 9
[ ] | Cuckoo search with k-NN | – | 77.89 | 13
[ ] | Discrete Cuckoo search with k-NN | – | 77.58 | 13
[ ] | MLP | 85.08 | – | 13
[ ] | SVM | 83.60 | – | 13
[ ] | J48 | 85.82 | – | 13
[ ] | MRMD | 86.98 | – | 13
[ ] | SS-RF | 79.69 | – | 13
[ ] | PA | 87.58 | – | 13
[ ] | F-SVM | 85.89 | – | 4
[ ] | RF-relief attribute | 87.95 | 84.3 | 13
[ ] | NF-LDA | 86.89 | – | 12
[ ] | NFR + SVM | 87.58 | 91.13 | 13
[ ] | NFR + RF | 88.56 | 90.83 | 8
[ ] | PCA algorithm | 91.11 | – | –
[ ] | Linear SVM model | 92.22 | – | –
[ ] | Statistical | 93.33 | – | –
[ ] | Mutually informed NN | 93.33 | – | –
[ ] | DNN+PCA_GWO | 97.3 | – | –
[ ] | F-DA | 95.59 | – | –
This section discusses the three phases of the proposed flow: data pre-processing, classification, and optimization. Here, the SMOTE technique pre-processes the missing data and handles the imbalanced-data problem encountered in the Statlog dataset. The AdaBoost-based classifier model then performs disease classification, and the model's outcome is evaluated. This work aims to classify the class labels and optimize the model parameters to attain better results.
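Although the simulations in this paper are run in MATLAB, a Python sketch using the imblearn library named below makes the flow concrete. The file name `heart.dat`, the split ratio, and the random seeds are illustrative assumptions; the AdaBoost settings (learning rate 1, 30 estimators) follow the values reported in the conclusion.

```python
# Minimal sketch of the proposed flow: SMOTE pre-processing followed by an
# AdaBoost classifier. Assumes a local copy of the Statlog file "heart.dat".
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("heart.dat", sep=r"\s+", header=None)   # 13 attributes + label
X, y = df.iloc[:, :-1], df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Phase 1: balance only the training split so the test set stays untouched.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Phase 2: AdaBoost with the settings reported later in the paper.
clf = AdaBoostClassifier(n_estimators=30, learning_rate=1.0, random_state=42)
clf.fit(X_bal, y_bal)

print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```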
The Statlog dataset is a heart disease dataset similar to others available in the UCI Machine Learning (ML) repository, but it differs slightly in form.
S. No. | Attributes | Type | Information
1 | Age | N | Patient's age
2 | Sex | C | Gender
3 | Chest pain | C | –
4 | Normal/resting BP | N | 120/80
5 | Serum cholesterol (mg/dl) | N | Cholesterol level
6 | Fasting blood sugar | C | –
7 | Resting ECG | C | –
8 | Max. heart rate | N | 72 normal
9 | Exercise-induced angina | N | –
10 | Old peak (ST depression induced by exercise relative to rest) | N | –
11 | Slope (peak exercise ST segment) | C | –
12 | Number of major vessels coloured by fluoroscopy | N | 0
13 | Thal | C | –
Note: Here, ‘N’ is numeric, and ‘C’ is categorical.
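A loader that applies these attribute names and N/C types is sketched below. The column names are illustrative shorthand of our own, and the file is assumed to be the space-separated `heart.dat` distributed with the Statlog heart data in the UCI repository (270 records, class coded 1 for absence and 2 for presence of disease).

```python
# Hypothetical Statlog loader using the attribute table above.
import pandas as pd

columns = ["age", "sex", "chest_pain", "resting_bp", "cholesterol",
           "fasting_blood_sugar", "resting_ecg", "max_heart_rate",
           "exercise_angina", "old_peak", "slope", "major_vessels",
           "thal", "label"]
df = pd.read_csv("heart.dat", sep=r"\s+", header=None, names=columns)

# Mark the 'C' (categorical) attributes from the table.
categorical = ["sex", "chest_pain", "fasting_blood_sugar",
               "resting_ecg", "slope", "thal"]
df[categorical] = df[categorical].astype("category")

# Remap the original 1/2 class coding to the 0/1 coding used below.
df["label"] = (df["label"] == 2).astype(int)
print(df["label"].value_counts())   # shows the uneven class distribution
```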
The provided Statlog dataset comprises positive and negative classes, where the positive class is labelled 1 and the negative class 0. However, the classes are unevenly distributed over the dataset, and this uneven distribution is a foremost cause of reduced prediction accuracy in a classifier model. The principal reason is that most ML algorithms cannot adequately learn the patterns of both positive and negative classes from an imbalanced dataset. Here, the negative class is the minority class with the smaller number of instances, so the outcomes a classifier generates for that class are often less reliable. Various existing approaches ignore this minority class, which is a major flaw in their classification outcomes. The key contribution of this pre-processing step is to deal with the imbalanced instances of the provided Statlog dataset and handle the issue efficiently using the SMOTE approach. Moreover, the outcomes for the minority and majority classes are documented separately to compute each class's contribution to the overall classification outcomes.
SMOTE is a prominent approach for constructing classifiers over imbalanced datasets, i.e., datasets with an uneven distribution of the underlying output classes. It is a reliable pre-processing technique for handling uneven instances, and since its origin numerous SMOTE variants have been proposed to improve its adaptability and reliability under various situations; it is regarded as one of the most influential pre-processing approaches in data mining and ML. The approach interpolates between minority-class instances to synthesize new ones, thereby increasing their number; this is essential for classifier generalization and is widely adopted to counter the problems raised by imbalanced classification instances in the Statlog dataset. The minority class is oversampled by producing artificial samples within the minority-class feature space, with the number of neighbours selected according to the classification and sampling requirements. New points are placed along the lines connecting a minority-class data point to its neighbours, which equalizes the minority and majority class instances during training. The imblearn library is utilized for the SMOTE implementation to handle the imbalanced dataset.
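A minimal sketch of this step with imblearn, assuming the `X_train`/`y_train` split from the loader above; `k_neighbors=5` is the library default rather than a value reported by this paper:

```python
# SMOTE oversampling: interpolate new minority-class points between each
# minority instance and its k nearest minority-class neighbours.
from collections import Counter
from imblearn.over_sampling import SMOTE

print("before:", Counter(y_train))            # uneven 0/1 counts
smote = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X_train, y_train)
print("after:", Counter(y_res))               # classes now equalized
```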
AdaBoost boosts the performance of weak learners and thereby improves the overall learning performance. It is related to the bagging and bootstrap models but is conceptually distinct from them: bagging is bootstrap aggregation, where bootstrap denotes a statistical approach in which samples are drawn randomly with replacement. The common ground between these approaches is a practical voting scheme, and the boosted classifier is popular for its satisfactory performance; it has been experimentally shown to gain superior performance and is regarded as one of the best classifier models. The classification output of the boosted model combines the outputs of the provided weak classifiers, which can be expressed as

$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$

Here, $h_t(x)$ is the prediction of the $t$-th weak classifier, $\alpha_t$ is the weight assigned to it during boosting, and $T$ is the total number of weak classifiers.
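The weighted vote can be illustrated directly; the decision stumps and weights below are toy values for demonstration, not the trained model:

```python
# Toy illustration of the AdaBoost rule H(x) = sign(sum_t alpha_t * h_t(x)).
import numpy as np

def adaboost_predict(X, weak_classifiers, alphas):
    """Combine weak-classifier votes (+1/-1), each scaled by its weight alpha_t."""
    votes = sum(a * h(X) for h, a in zip(weak_classifiers, alphas))
    return np.sign(votes)

# Two hypothetical decision stumps thresholding single features.
stumps = [lambda X: np.where(X[:, 0] > 0.5, 1, -1),
          lambda X: np.where(X[:, 1] > 0.2, 1, -1)]
alphas = [0.7, 0.4]                      # weights learned during boosting

X_demo = np.array([[0.9, 0.1], [0.3, 0.6]])
print(adaboost_predict(X_demo, stumps, alphas))   # -> [ 1. -1.]
```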
The Adam Optimizer is used to handle the noise problem and to optimize the classifier model. Adam is a replacement optimization algorithm for training deep learning models: it merges the best properties of the RMSprop and AdaGrad algorithms into a method that handles sparse gradients over noisy problems. The main motivation for integrating the Adam Optimizer with AdaBoost is that AdaBoost has known drawbacks, i.e., it is empirically vulnerable to constant noise, and some weak classifiers become weaker because of low margins and over-fitting. The optimizer thus helps boost weak-classifier performance, works effectively over large datasets, and is computationally efficient. Adam is a well-known algorithm in the Deep Learning (DL) field because it acquires good outcomes quickly; it works well in real-time applications and is among the finest stochastic optimization techniques. Adam is an alternative to the commonly used Stochastic Gradient Descent (SGD) for iteratively updating NN weights from training data. Its functionality differs from the conventional optimizer, which applies a single learning rate alpha to all weight updates throughout learning: in Adam, a learning rate is maintained for each network weight and adapted separately as learning proceeds. It integrates the benefits of two extensions of stochastic gradient descent: the per-parameter learning rates of AdaGrad, which improve performance on problems with sparse gradients, and the moving average of recent gradient magnitudes used by RMSprop, which adapts the step size on noisy problems.
The optimized output of the network can be expressed as $y = \operatorname{ReLU}(w \cdot x + b)$, where $x$ denotes the input parameters needed for predicting heart disease (disease or non-disease class), $w$ the weights that are varied over iterations, and $b$ the bias added to the nodes of the NN to improve accuracy; the learning rate is unique for each weight. The AO is therefore utilized to update the bias and weights of the network model. The updated values are multiplied by the given input parameters and passed through a classifier function such as ReLU to perform classification, and the bias and weights are updated iteratively until the model acquires better classification accuracy.
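A numpy sketch of the per-weight Adam update follows; the hyperparameters (alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8) are the commonly cited defaults, not values reported in this paper:

```python
# One Adam step: moving averages of the gradient (m) and squared gradient (v),
# bias-corrected, drive a per-weight update of the parameters.
import numpy as np

def adam_step(param, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Illustrative update of a weight vector with a hypothetical gradient.
w = np.array([0.5, -0.3])
m = v = np.zeros_like(w)
grad = np.array([0.1, -0.2])
w, m, v = adam_step(w, grad, m, v, t=1)
print(w)
```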
The experimentation is done over the provided Statlog dataset, and the associated outcomes are reported. The simulation is done in the MATLAB 2018b environment, with care taken to keep the results free from bias. Classification is achieved efficiently by balancing the instances with the SMOTE technique: pre-processing is performed over the raw data, which is then fed into the ML approaches, and the outcomes are evaluated and compared with various existing approaches. A major pitfall in evaluating ML approaches is that a single metric such as prediction accuracy can be misleading when samples are unevenly distributed; the system's weakness shows up under uneven sample distributions, and with missing samples the predicted values cannot be taken as the true values. Most general approaches concentrate on the UCI ML dataset for accuracy evaluation. Standard evaluation metrics, namely accuracy, specificity, sensitivity, AUROC, F1-score, and execution time, quantify the anticipated model. Accuracy is measured as the ratio of correct predictions to the total number of inputs, and the confusion matrices are built from the True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) counts, respectively.
Similarly, sensitivity and specificity are computed as TP/(TP+FN) and TN/(TN+FP), respectively. AUROC is another metric used to evaluate the classification accuracy of the provided model, and special care is given to the evaluation of small dataset samples, handled through validation of the accuracy measures. The experimentation with the AdaBoost classifier model and the SMOTE technique outperforms the existing approaches, which include Logistic Regression (LR), Random Forest (RF), the Lightweight Gradient Boosting Machine (L-GBM), Support Vector Machine (SVM), Boosted Regression Tree (BRT), and an ensemble model, respectively. The evaluation also covers other datasets, namely Cleveland and the UCI Machine Learning repository (see
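These metrics follow directly from the confusion-matrix counts; a short sketch using scikit-learn with placeholder labels and scores:

```python
# Deriving accuracy, sensitivity, specificity, F1 and AUROC from predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])          # placeholder test labels
y_pred  = np.array([1, 0, 1, 0, 0, 1, 1, 0])          # classifier decisions
y_score = np.array([.9, .2, .8, .4, .1, .6, .7, .3])  # classifier scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)     # true-positive rate
specificity = tn / (tn + fp)     # true-negative rate
print(accuracy, sensitivity, specificity,
      f1_score(y_true, y_pred), roc_auc_score(y_true, y_score))
```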
The performance metrics of the anticipated AB-AO model on the Statlog dataset are evaluated. The accuracy without feature selection is 93.25%, which is 3.43%, 5.45%, 3.43%, 3.86%, 11.5%, and 0.54% higher than the other approaches. The prediction accuracy of the AB-AO model with feature selection is 95.56%, which is 5.43%, 4.44%, 5.17%, 3.76%, 5.43%, and 2.41% higher than the other models. The sensitivity of AB-AO is 91%, which is 18%, 16%, 15%, 17%, 21%, and 1% higher than the other models. The specificity of AB-AO is 92%, which is 24%, 25%, 23%, 25%, 21%, and 1% higher than the other models. The F1-score of AB-AO is 24%, which is 12% higher than the Logistic Regression (LR), Random Forest (RF), Lightweight Gradient Boosting Machine (L-GBM), Support Vector Machine (SVM), and Boosted Regression Tree models, and 4% higher than the ensemble model. Here, the model considers 13 features, where LR selects 11 features, RF 10, L-GBM 9, SVM 12, BRT 8, and the ensemble and AB-AO models select the fewest, 7 features each. The execution times of these models are 0.05 min, 0.08 min, 0.78 min, 6.06 min, 1.62 min, 0.04 min, and 0.04 min, respectively. The AUROC of AB-AO is 93.58%, which is 2.08%, 2.44%, 3.44%, 4.2%, 1.61%, and 1.02% higher than the other models.
Next, the AB-AO model is evaluated across all three datasets. The prediction accuracy without feature selection of AB-AO on the Statlog dataset is 93.25%, which is 3.1% higher than on Cleveland and 5.22% lower than on the UCI dataset. The prediction accuracy with feature selection is 95.56%, which is 0.91% and 3.44% lower than on the other datasets. The sensitivity is 92%, which is 2% higher than on Cleveland and 4% lower than on the UCI dataset. The specificity of AB-AO on the Statlog and Cleveland datasets is 92%, which is 3% lower than on the UCI dataset. The F1-score on Statlog is 24%, which is 4% higher than on Cleveland and 18% lower than on the UCI dataset. The Cleveland and Statlog datasets consider 13 features, while UCI considers 14; the average number of selected features is 8 for Cleveland and 7 for Statlog and UCI. The average execution time for the Cleveland and Statlog dataset validations is 0.04 min, while for the UCI dataset it is 0.03 min. The AUROC value on Statlog is 93.58%, which is 3.2% and 3.67% lower than on the other datasets.
Dataset | ML algorithm | Accuracy without FS (%) | Accuracy with FS (%) | Sensitivity | Specificity | F1-score | Features without FS | Features with FS | Time (min) | AUROC (%)
Cleveland | LR | 86.25 | 92.54 | 0.74 | 0.70 | 0.14 | 13 | 10 | 0.10 | 92.86
Cleveland | RF | 80.98 | 87.87 | 0.77 | 0.68 | 0.13 | 13 | 9 | 0.10 | 92.36
Cleveland | L-GBM | 84.72 | 88.98 | 0.71 | 0.71 | 0.14 | 13 | 11 | 0.78 | 93.86
Cleveland | SVM | 80.98 | 86.76 | 0.74 | 0.72 | 0.14 | 13 | 7 | 6.80 | 91.40
Cleveland | BRT | 79.87 | 87.87 | 0.85 | 0.89 | 0.15 | 13 | 12 | 1.10 | 92.32
Cleveland | Ensemble model | 87.56 | 93.14 | 0.90 | 0.92 | 0.16 | 13 | 8 | 0.06 | 95.89
Cleveland | AB-AO | – | – | – | – | – | 13 | – | – | –
Statlog | LR | 89.82 | 90.13 | 0.73 | 0.68 | 0.12 | 13 | 11 | 0.05 | 91.50
Statlog | RF | 87.80 | 91.12 | 0.75 | 0.67 | 0.12 | 13 | 10 | 0.08 | 91.14
Statlog | L-GBM | 89.82 | 90.39 | 0.76 | 0.69 | 0.12 | 13 | 9 | 0.78 | 90.14
Statlog | SVM | 89.39 | 91.80 | 0.74 | 0.67 | 0.12 | 13 | 12 | 6.06 | 89.38
Statlog | BRT | 81.75 | 90.13 | 0.70 | 0.71 | 0.12 | 13 | 8 | 1.62 | 91.97
Statlog | Ensemble model | 92.71 | 93.15 | 0.91 | 0.92 | 0.20 | 13 | 7 | 0.04 | 92.56
Statlog | AB-AO | 93.25 | 95.56 | 0.91 | 0.92 | 0.24 | 13 | 7 | 0.04 | 93.58
UCI ML repository | LR | 88.56 | 95.54 | 0.70 | 0.68 | 0.35 | 14 | 12 | 1.65 | 88.56
UCI ML repository | RF | 79.59 | 92.03 | 0.69 | 0.96 | 0.35 | 14 | 9 | 0.33 | 79.96
UCI ML repository | L-GBM | 95.45 | 94.30 | 0.66 | 0.71 | 0.35 | 14 | 12 | 4.05 | 95.45
UCI ML repository | SVM | 95.18 | 94.04 | 0.68 | 0.72 | 0.35 | 14 | 12 | 37.98 | 95.18
UCI ML repository | BRT | 85.65 | 93.67 | 0.69 | 0.68 | 0.35 | 14 | 10 | 2.58 | 85.65
UCI ML repository | Ensemble model | 97.65 | 98.56 | 0.95 | 0.94 | 0.40 | 14 | 7 | 0.04 | 96.85
UCI ML repository | AB-AO | 98.47 | 99.00 | 0.96 | 0.95 | 0.42 | 14 | 6 | 0.03 | 97.25
The performance metrics of the anticipated AB-AO model on the UCI ML dataset are evaluated. The accuracy without feature selection is 98.47%, which is 9.91%, 18.88%, 3.02%, 3.29%, 12.82%, and 0.82% higher than the other approaches. The prediction accuracy of the AB-AO model with feature selection is 99%, which is 3.46%, 6.97%, 4.7%, 4.96%, 5.33%, and 0.44% higher than the other models. The sensitivity of AB-AO is 96%, which is 26%, 27%, 30%, 28%, 27%, and 1% higher than the other models. The specificity of AB-AO is 95%, which is 1% lower than RF and 27%, 24%, 23%, 27%, and 1% higher than the other models. The F1-score of AB-AO is 42%, which is 7% higher than the Logistic Regression (LR), Random Forest (RF), Lightweight Gradient Boosting Machine (L-GBM), Support Vector Machine (SVM), and Boosted Regression Tree models, and 2% higher than the ensemble model. Here, the model considers 14 features, where LR selects 12 features, RF 9, L-GBM 12, SVM 12, BRT 10, the ensemble 7, and AB-AO the fewest, 6 features. The execution times of these models are 1.65 min, 0.33 min, 4.05 min, 37.98 min, 2.58 min, 0.04 min, and 0.03 min, respectively. The AUROC of AB-AO is 97.25%, which is 8.69%, 17.29%, 1.8%, 2.07%, 11.6%, and 0.4% higher than the other models. The performance of AB-AO on the Statlog dataset is optimal because the uneven distribution of samples is eradicated using the SMOTE technique, whereas the approaches evaluated on the Cleveland and UCI ML repository datasets do not adopt SMOTE or other methods to handle this issue; their performance metrics can therefore yield misleading results compared with the proposed model. Thus, it is shown that the anticipated model outperforms the existing approaches efficiently by adopting the essential pre-processing steps.
This research work has provided a clinical decision support system for predicting heart disease, evaluated on the provided Statlog UCI ML dataset. The anticipated AdaBoost classifier and Adam optimizer act as a superior prediction approach, classifying the imbalanced dataset efficiently with the well-known SMOTE technique. Moreover, the execution time is considered a remarkable metric, and the low computation time shows the reliability of the anticipated model. The classifier's performance relies entirely on parameter tuning, with a learning rate of 1 and 30 estimators. The prediction accuracy of the AdaBoost classifier is 90.15% without and 95.87% with feature selection; the model's sensitivity is 91%, specificity 94%, F1-score 0.20, execution time 0.04 min, and AUROC 96.78%. The major constraint of this research work is the small number of available samples (Statlog) for the evaluation process. However, the prediction accuracy achieved by this model is 95.87%, making it a superior CDSS in times of emergency. In the future, deep learning approaches can be adopted to enhance the heart disease prediction rate, together with statistical analysis using the Wilcoxon rank-sum method.