A New Multi-Agent Feature Wrapper Machine Learning Approach for Heart Disease Diagnosis

Heart disease (HD) is a serious, widespread, life-threatening disease. The heart of patients with HD fails to pump sufficient amounts of blood to the entire body. Diagnosing the occurrence of HD early and efficiently may prevent the manifestation of the debilitating effects of this disease and aid in its effective treatment. Classical methods for diagnosing HD are sometimes unreliable and insufficient in analyzing the related symptoms. As an alternative, noninvasive medical procedures based on machine learning (ML) methods provide reliable HD diagnosis and efficient prediction of HD conditions. However, the existing models of automated ML-based HD diagnostic methods cannot satisfy clinical evaluation criteria because of their inability to recognize anomalies in extracted symptoms represented as classification features from patients with HD. In this study, we propose an automated heart disease diagnosis (AHDD) system that integrates a binary convolutional neural network (CNN) with a new multi-agent feature wrapper (MAFW) model. The MAFW model consists of four software agents that operate a genetic algorithm (GA), a support vector machine (SVM), and Naïve Bayes (NB). The agents instruct the GA to perform a global search on HD features and adjust the weights of SVM and NB during initial classification. A final tuning of the CNN is then performed to ensure that the best set of features is included in HD identification. The CNN consists of five layers that categorize patients as healthy or with HD according to the analysis of optimized HD features. We evaluate the classification performance of the proposed AHDD system against 12 common ML techniques and conventional CNN models by using a cross-validation technique and by assessing six evaluation criteria. The AHDD system achieves the highest accuracy of 90.1%, whereas the other ML and conventional CNN models attain only 72.3%–83.8% accuracy on average. Therefore, the AHDD system proposed herein has the highest capability to identify patients with HD. This system can be used by medical practitioners to diagnose HD efficiently.

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. CMC, 2021, vol. 67, no. 1.


Introduction
Heart disease (HD) is a life-threatening disease that can cause heart failure. The heart is responsible for pumping the required amount of blood to the entire body. The presence of HD may result in insufficient blood supply [1,2]. In many countries, including the United States of America, HD has the highest rate of incidence [3,4]. According to the European Society of Cardiology, over 3.5 million people are diagnosed with HD annually. The total number of patients with HD worldwide is 2.6 million [5], and half of them have lost their lives after the first or second year of diagnosis [6]. The estimated expenditure for HD prevention and treatment is about 3% of the global budget for healthcare [7]. The major symptoms of HD are difficulties in breathing, feelings of fatigue or tiredness, and peripheral edema. These symptoms arise due to abnormalities in cardiac or noncardiac functions [3]. The current methods for diagnosing HD are incapable of identifying HD in its early stages [4]. The severe lack of medical supplies and resources, including specialists and equipment, in developing countries contributes to inefficient and ineffective HD diagnosis and treatment in these nations [5].
Therefore, appropriate prevention and early diagnostic methods must be developed to minimize the risk of death due to HD [6]. Traditional HD diagnostic methods involve invasive techniques that are time-consuming and tedious. In many cases, these methods are inaccurate. Intelligent noninvasive decision support methods are proposed as an alternative to address the limitations of invasive HD diagnostic methods and reduce the risk of death due to HD. These methods provide an advanced diagnosis by analyzing the medical history of patients with HD, examining the physical condition of the patients, and then generating a comprehensive report on HD cases [8]. To perform data analysis, they utilize data mining and machine learning (ML) techniques, including artificial neural network (ANN), AdaBoost, logistic regression, support vector machine (SVM), Naïve Bayes (NB), fuzzy logic (FL), k-nearest neighbor (KNN), and decision tree (DT) [9][10][11]. Various ML methods are used to classify and diagnose HD with reasonable accuracy [12,13]. Numerous researchers have tested their respective proposed ML classification methods by extensively using HD datasets to investigate and predict heart health conditions [14].
Predicting the occurrence of HD in its early stages on the basis of risk factors that could lead to HD, such as diabetes, hypertension, smoking, age, and sex [8], provides a potential solution. The common ML methods utilized in analyzing HD risk factors are ANN, DT, NB, and SVM. In a previous study that aimed to predict risks of HD, regression SVM achieved the highest accuracy of 92.1%, whereas DT obtained the lowest accuracy of 89.6% [15]. Previous studies that compared the accuracy and flexibility of various ML techniques in predicting HD reported that associative classification approaches, such as NB, ANN, and DT, are superior to other techniques [16,17]. Another empirical study demonstrated that DT-based methods can produce good accuracy despite their simplicity [18]. Another study tested KNN and NB classifiers on the Cleveland HD database and Statlog Heart HD datasets and found that KNN outperforms NB [19]. A prior study employed ANN with a backpropagation training algorithm for HD prediction and achieved reasonable accuracy [20]. A previous work proposed an ensemble method to enhance HD diagnosis; results showed that the proposed method achieves a higher accuracy than methods with individual classifiers [21]. Earlier studies also employed ANN to minimize human errors in predicting medical indicators of HD, blood pressure, and blood sugar [22,23].
In predicting HD, most ML classifiers perform better and achieve higher accuracy when combined with other methods than when used as standalone methods. A previous study presented a hybrid FL-based method that combines a genetic algorithm (GA) with ANN [24]. In this hybrid method, FL is used to extract HD features, GA is utilized to optimize feature selection, and ANN is employed to classify HD cases. Experimental results indicated an increase in the overall classification accuracy. Another work described a combination of GA, FL, and ANN for HD diagnosis [25], which performed well in predicting HD. Another study proposed a hybrid approach combining FL and ANN for HD prediction [20]. This approach yields good results with an accuracy of 87.4%. Two previous studies integrated a combination of GA, ANN, and FL into a coactive neuro-fuzzy inference system (CANFIS) [26,27]. In this system, GA is employed to optimize feature selection and automate the tuning of CANFIS parameters. Results demonstrated that CANFIS predicts HD with high accuracy. Prior works combined an SVM model with particle swarm optimization (PSO) to classify heartbeats [28,29]. In this model, PSO is used to optimize and tune SVM parameters. Results demonstrated that the proposed model produces a higher classification accuracy than SVM alone. A related study combined different ML classifiers by adopting an ensemble technique to improve the performance of several classifiers [30]. The authors applied the combined classifiers based on the ensemble technique in healthcare provision and assessed their usefulness in this field. Another study evaluated different HD datasets via several classification models [31], some of which obtained high accuracy. A previous work combined a classification method based on the K-means clustering algorithm with the maximal frequent itemset algorithm (MAFIA) to address problems in HD diagnosis [23]. In this classification method, the K-means algorithm is employed for data extraction, whereas MAFIA is utilized for mining frequent patterns. The authors tested the proposed method by using different weights and factors and found that it has a higher accuracy in predicting myocardial infarction than similar basic methods.
In a similar vein, a previous study [32] evaluated the common evolutionary methods used for feature selection on the Cleveland HD dataset. The data show that these methods obtain a higher classification accuracy than basic methods. A study integrated a multilayer perceptron (MLP) classifier into SVM methods for HD diagnosis [33]. This classifier achieves a classification accuracy of up to 80.41%. Another study proposed a different classification method on the basis of MLP ANN for HD diagnosis [34]. The classifier is combined with feature selection and backpropagation learning algorithms. A previous work employed DT, NB, and ANN in a medical computer-based tool to help in HD diagnosis [35]. NB produced the best performance with an accuracy of 88.12%, followed by ANN and DT with accuracies of 86.12% and 80.4%, respectively. Two studies suggested a three-phase ANN-based approach for HD classification [36,37]. Another study utilized a logistic regression classifier in a decision support system for HD case classification [38]. However, the classifier produced a low accuracy of only 77%.
The research of [30] is more similar to the present work than the aforementioned studies are. Both studies comprehensively investigate the performance of various ML methods and propose a hybrid ML model for HD feature selection and classification. In HD feature selection, the features are divided into sets that contain six features, and then the features of the sets are switched. In HD classification, classification accuracy is measured to find the best set of features. Nevertheless, the two studies have crucial differences. The work of [30] greatly improves the accuracy of running multiple models. However, the results of this method are applicable to classifying instances with six features only, thereby restricting its generalization. Moreover, its measurement of performance is still limited to a few classification models in which the highest accuracy of 85.48% is achieved by the hybrid model. By contrast, the present study adopts a new feature selection method to determine the best set of features. Furthermore, this study utilizes ML models to further improve the classification accuracy.
In the present study, we propose a new hybrid model for developing an automated heart disease diagnosis (AHDD) system that classifies HD cases on the basis of deep learning and multi-agent paradigms. It includes a multi-agent feature wrapper (MAFW) model for finding the subset of features most relevant to prediction or classification tasks. The MAFW model performs within the framework of the wrapper approach to ensure that the best subset of features is obtained. It is especially useful in performing feature selection for small datasets. The MAFW model consists of four types of agents, namely, a data preparation agent (α1), a feature selection agent (α2), a data classification agent (α3), and a feature evaluation agent (α4). The model includes two popular classifiers in the data classification agent, namely, SVM and NB. The model also includes a GA for performing primary feature selection. In addition, cross-validation, in particular k-fold cross-validation, is employed. Moreover, different evaluation metrics are assessed to evaluate the performance of our proposed method in terms of accuracy, TP, FP, precision, recall, and F1-measure. The HD dataset is also analyzed via data preprocessing methods. Our method is tested on the 2016 Cleveland HD dataset for HD diagnosis. The main contributions of this study are as follows:
• A new hybrid model is proposed for developing an AHDD system. The AHDD system integrates a binary CNN model with an MAFW model. The CNN consists of five layers that categorize subjects into healthy individuals or patients with HD by analyzing several HD features. The binary CNN architecture includes input, convolution, pooling, activation, and output layers.
• The new MAFW model implements SVM and NB to conduct the initial classification used to tune the CNN, and a GA to perform a global search on HD features and adjust the weights of the CNN to include the best set of features.
• The performance of all ML-based classification methods is evaluated in terms of prediction accuracy on the overall feature set.
• The performance of all ML-based classification methods is also evaluated in terms of prediction accuracy on the features selected via the MAFW model combined with k-fold cross-validation.
• On the basis of the performance evaluation results, this study recommends the use of a particular ML-based classification method that works well with a certain feature selection algorithm in designing powerful computer-aided HD prediction systems.
• The performance of various ML-based classification methods as applied to an HD dataset is compared and analyzed.
The rest of the paper is organized as follows. The materials and methods are described in Section 2. In Section 3, the implementation of the proposed method is presented, and its performance in HD prediction is scrutinized. Lastly, the conclusions and directions for future work are highlighted in Section 4.

Materials and Methods
Numerous intelligent decision support systems are utilized to aid in various medical and healthcare needs, including diagnosis, patient follow-up, disease remediation, and prognosis. To handle data complexity and uncertainty, these intelligent systems combine some of the most successful and widely used intelligent computational algorithms, such as ANN, GA, and K-means clustering [39]. In the context of the learning process, HD prediction is considered a clustering classification problem. However, a framework is required to manage different sets of data types. Only one type of class with a restricted HD class set may be classified to address this classification problem. Doing so allows the easy detection of the correct class, resulting in high accuracy.

HD Dataset
The Cleveland HD dataset is available from the UCI Machine Learning Repository. The Cleveland HD dataset is extensively used by data miners and researchers on ML for evaluation and analysis purposes. The Cleveland HD dataset is composed of 270 instances and 13 features/attributes, including 6 numeric attributes and 7 categorical attributes. A description of the dataset is shown in Tab. 1.
The range of age of the patients selected is 29-79 years. A gender value of 1 represents male patients, whereas a gender value of 0 denotes female patients.
Symptoms of HD are associated with four types of chest pain:
1. Heart muscles do not receive the full amount of blood required, resulting in the narrowing of coronary arteries, a condition that causes Angina type 1.
2. Heart muscles do not receive the full amount of blood required, resulting in the narrowing of coronary arteries, a condition that also causes Angina type 2. The main difference is that Angina type 2 is associated with the chest pain felt when experiencing emotional or mental stress.
3. Some chest pains not related to Angina are experienced for various reasons, and this case is not associated with HD.
4. No symptoms reflecting an HD case are noted.
With respect to features, trestbps represents the blood pressure reading in the resting position, chol indicates the level of cholesterol, and fbs denotes the fasting blood sugar level. If the fasting blood sugar level is greater than 120 mg/dl, a value of 1 is assigned; otherwise, this feature is given a value of 0. Furthermore, restecg represents electrocardiographic results in the resting position, thalach is the maximum value of the heart rate, and exang indicates exercise-induced angina. If pain is felt, then exang is assigned a value of 1; otherwise, it is given a value of 0. Moreover, oldpeak represents exercise-induced ST depression, and slope represents the slope of the peak exercise ST segment. In addition, ca is the number of main vessels colored by fluoroscopy, thal provides the duration of test exercise in minutes, and num represents the class attribute. With regard to num, a value of 0 is assigned for normal cases; otherwise, a value of 1 is given (for cases with HD abnormality). The desired attribute is classified into four categories: the first three categories reflect HD cases, whereas the fourth category denotes healthy cases. The holdout technique, in which the dataset is split into two sets, is adopted for training and testing.

Methodology
The proposed approach aims to properly classify individuals as healthy or with HD. The performance of different ML-based methods is evaluated in terms of accurately diagnosing HD on the basis of complete and selected features. A supervised learning-based method is adopted for classification because labeled data are available. A diagnostic system for HD is then proposed. The proposed approach includes different ML classifiers to enhance prediction accuracy. Our proposed methodology involves five stages: (1) preprocessing of the HD dataset, (2) a feature selection stage involving the MAFW model, (3) a cross-validation process, (4) theoretical contexts of 11 ML techniques, and (5) evaluation of ML performance via various techniques. The dataset is split into training and testing sets. The efficiency of the ML classifiers is tested on the dataset as described in Fig. 1.

Figure 1: Machine learning-based identification approach for heart disease diagnosis

Data Preprocessing
Data preprocessing is necessary to obtain a suitable data representation for each ML classifier and ensure effective testing and evaluation. Common preprocessing methods include the standard scaler, missing values removal, and the MinMax scaler. In the standard scaler method, each feature is transformed to have a mean of 0 and a variance of 1, so that all features are brought to a similar scale. The same is true for the MinMax scaler method, in which the data are rescaled to lie between 0 and 1 for all features. In the missing values removal method, rows with missing feature values are removed from the dataset [41]. The aforementioned data preprocessing methods are adopted in the present study.
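The three preprocessing steps described above can be illustrated with a minimal sketch. It uses only the Python standard library; the toy records, the use of `None` as the missing-value marker, and all function names are assumptions made for illustration and are not taken from the original implementation.

```python
from statistics import mean, pstdev

def drop_missing(rows):
    """Missing values removal: discard every row that contains None."""
    return [row for row in rows if all(v is not None for v in row)]

def standardize(column):
    """Standard scaling: shift to zero mean and rescale to unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma if sigma else 0.0 for v in column]

def min_max(column):
    """MinMax scaling: map the values linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]

def scale_columns(rows, scaler):
    """Apply a column-wise scaler to a row-major table."""
    columns = [scaler(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]

# Hypothetical toy records (age, trestbps, chol); None marks a missing value.
records = [(63, 145, 233), (41, 130, None), (56, 120, 236), (29, 110, 180)]
clean = drop_missing(records)
z_scored = scale_columns(clean, standardize)
unit_range = scale_columns(clean, min_max)
```

In practice the scaling parameters would be estimated on the training split only and then reused on the test split, so that no information from the test data leaks into training.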

Multi-Agent Feature Wrapper
Selection of relevant features is critical for identifying the required classes. It has a positive effect on the efficiency of ML classifiers in terms of prediction accuracy and execution time. By contrast, selecting irrelevant features in the learning process can negatively affect the performance of ML classifiers. In our proposed method, the MAFW model is applied in selecting the important features of the targeted classification. HD datasets have thousands of features but only 13 attributes. Hence, classifying HD is a complex process because of the existence of a wide variety of features and inessential or irrelevant attributes. If the full dataset containing a huge number of features is used, then achieving reliable and accurate results becomes laborious and requires a long computational time. Thus, the number of features must be reduced by selecting proper characteristics as the initial step in the learning process. Doing so helps in understanding the outcomes, thus increasing the classification accuracy while enhancing the performance of classifiers.
The MAFW model is proposed to find the subset of features that are most relevant to prediction or classification tasks. The MAFW model performs within the framework of the wrapper approach to ensure that the best subset of features is obtained. This model is especially useful in performing feature selection for small datasets. The MAFW model consists of four types of agents, namely, a data preparation agent (α1), a feature selection agent (α2), a data classification agent (α3), and a feature evaluation agent (α4). The model includes two popular classifiers in the data classification agent, namely, SVM and NB. The model also incorporates a GA for performing primary feature selection. The MAFW model works according to a backward elimination mechanism in which it starts by selecting all features and, in each iteration, removes the least important features while maintaining the most relevant ones. The stopping condition is linked to both the number of removed features and the progress of performance improvement of the classifiers. Fig. 2 shows the main components of the MAFW model. In the MAFW model, the agents interact with each other to perform feature selection tasks to reduce the number of features of a given dataset. The agents' goal is to select features that best improve the prediction performance with a minimum effect on the boundaries of learning generalization of the classifiers. The agents' roles in the MAFW model are presented below and summarized in Algorithm 1 (Fig. 2):
• α1: This agent prepares the feature vector for α2 to perform the feature selection task and prepares the cross-validation data for α3 to perform the classification task.
• α2: This agent integrates a GA that applies a binary feature selection operation to produce subsets of features. The GA presents the feature space as a one-dimensional binary vector that forms a GA chromosome. Each chromosome represents an individual of the population, and the population size equals the actual number of features. Each chromosome contains a number of genes that is equal to the actual number of features (e.g., 15 features are represented by 15 chromosomes, and each chromosome is represented by 15 genes). Each gene holds a binary value (0 or 1): a value of 1 denotes including a feature, whereas a value of 0 signifies excluding the feature. The initial population of chromosomes is randomly generated to represent subsets of features. α2 passes a copy of the generated subset of features to α4 for further evaluation. The GA performs binary crossover to update the genes of two selected chromosomes and binary mutation to shuffle or refine the selection of a particular chromosome. The crossover and mutation decisions are made by α2 on the basis of feature evaluation results provided by α4 (as explained below). α2 also decides on the stopping condition according to user-defined thresholds on the number of features and the number of iterations.

The discussion above describes the main components and provides a complete description of a run cycle of the MAFW model, which constitutes the main contribution of this paper. The MAFW model differs from existing models by wrapping an independent feature analysis classifier (i.e., NB) and a dependent feature analysis classifier (i.e., SVM) in selecting the best subset of features. These classifiers are specifically selected to avoid the overfitting disadvantage of the wrapper feature selection approach. Moreover, the MAFW model applies a multi-agent system that renders the feature selection process more flexible by segregating selection functionalities into four tasks, namely, preparation, selection, classification, and evaluation. These tasks then interact with each other, reason over the input, process, and output of each task, and apply the necessary revisions to the processes responsible for achieving the tasks during runtime. Given that this model relies on the wrapper feature selection approach, its main limitations are high computational and time complexity [42,43].
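To make the wrapper idea concrete, the following is a minimal single-process sketch of GA-based feature selection with a classifier in the loop. It is not the MAFW implementation: the agent roles are collapsed into plain functions, a small Gaussian Naïve Bayes with equal class priors stands in for the SVM and NB pair, the data are synthetic, and the names `make_data`, `nb_accuracy`, `ga_select`, and the 0.01 size penalty are all assumptions chosen for illustration.

```python
import math
import random

def make_data(rng, n=120, d=8):
    """Hypothetical data: only features 0 and 1 carry class signal."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0, 1) for _ in range(d)]
        row[0] += 2.0 * label
        row[1] -= 2.0 * label
        X.append(row)
        y.append(label)
    return X, y

def nb_accuracy(X_tr, y_tr, X_te, y_te, mask):
    """Gaussian Naive Bayes accuracy using only features whose gene is 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    stats = {}
    for c in (0, 1):
        rows = [x for x, t in zip(X_tr, y_tr) if t == c]
        stats[c] = []
        for j in cols:
            vals = [r[j] for r in rows]
            mu = sum(vals) / len(vals)
            var = sum((v - mu) ** 2 for v in vals) / len(vals) + 1e-9
            stats[c].append((mu, var))

    def log_lik(x, c):
        return sum(-0.5 * math.log(2 * math.pi * var) - (x[j] - mu) ** 2 / (2 * var)
                   for j, (mu, var) in zip(cols, stats[c]))

    hits = sum((log_lik(x, 1) > log_lik(x, 0)) == bool(t) for x, t in zip(X_te, y_te))
    return hits / len(y_te)

def ga_select(fitness, d, rng, pop_size=12, generations=15, p_mut=0.1):
    """Binary GA: each chromosome is a 0/1 mask with one gene per feature."""
    pop = [[rng.randint(0, 1) for _ in range(d)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # the fitter half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, d)          # one-point binary crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

rng = random.Random(0)
X, y = make_data(rng)
X_tr, y_tr, X_te, y_te = X[:80], y[:80], X[80:], y[80:]

def fitness(mask):
    # Held-out accuracy minus a small penalty per selected feature (an assumption).
    return nb_accuracy(X_tr, y_tr, X_te, y_te, mask) - 0.01 * sum(mask)

best = ga_select(fitness, d=8, rng=rng)
```

The penalty term pushes the search toward smaller subsets, loosely mirroring the stopping condition that weighs the number of removed features against classifier performance. A fuller version would score each mask with cross-validation rather than a single holdout split.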

Machine Learning and Classification Algorithms
In the context of the learning process, HD prediction is considered a clustering classification problem. However, a framework is necessary to manage different sets of available data. Only one type of class with a restricted HD class set may be classified to address this classification problem. Doing so allows the easy detection of the correct class, resulting in high accuracy. In this section, the theoretical contexts of the 11 ML classification methods adopted herein are explained. These methods are then compared and analyzed.
• NB is a classification technique based on Bayes' theorem. In general, NB assumes that the presence of a specific feature in a class is unrelated to the presence of any other feature [44]. For example, if a fruit is orange, round, and about 10 cm in diameter, then it can be called an orange. Even if these characteristics depend on each other or on the presence of other characteristics, each of them is treated as contributing independently to the probability that the fruit is an orange; hence, this technique is regarded as "Naïve".
• Stochastic gradient descent (SGD) is a method employed to find the minimum of a function. The SGD classifier is a linear classifier (a linear SVM by default in scikit-learn) that is trained with SGD (i.e., it searches for the minimum of the loss by using SGD). This estimator implements regularized linear models with SGD learning: the gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing strength schedule (i.e., learning rate) [45].
• The sequential minimal optimization (SMO) algorithm is derived from taking the concept of a decomposition method to its maximum and optimizing at each iteration a minimum subset of just two points. The strength of this technique lies in the fact that an analytical solution is admitted for the optimization problem for two data points, thus eliminating the need to use as part of the algorithm an iterative quadratic programming optimizer [45].
• The voted perceptron method (VPM) is based on Frank Rosenblatt's perceptron algorithm. This algorithm exploits data with large margins to get the full benefits of linearly separable classes. Compared with Vapnik's SVM, this approach is easier to apply and also more efficient in terms of computational time. This algorithm can also be implemented with kernel functions in very high dimensional spaces [46].
• The KNN or IBK algorithm is a simple supervised learning method from the family of ML algorithms. The main idea behind this approach is to find the training samples nearest in distance to the new point and to estimate the label from those data points [47]. Despite its simplicity, this algorithm has been applied successfully to numerous classification and regression problems.
• AdaBoostM1 is short for adaptive boosting, which is an ML meta-algorithm devised by Yoav Freund and Robert Schapire, who received the 2003 Gödel Prize for this work [48]. Combined with several types of learning algorithms, this meta-algorithm can be utilized to improve performance. The outputs of the other learning algorithms ("weak learners") are combined into a weighted sum that represents the final output of the boosted classifier.
• LogitBoost is a boosting algorithm developed by Jerome Friedman, Robert Tibshirani, and Trevor Hastie on the basis of ML and computational learning theory. Their original work lays a mathematical foundation for the AdaBoost algorithm [49]. If one considers AdaBoost as a generalized additive model, then one can derive the LogitBoost algorithm and then apply the logistic regression cost function.
• MultiClassClassifierUpdateable is a meta classifier for handling multiclass datasets with two-class methods. Moreover, this classifier is capable of using error-correcting output codes for increased accuracy. The base method should be an updateable method [50].
• The Hoeffding Tree is an incremental decision tree learner for big data streams, assuming that the distribution of the data does not change over time. A decision tree grows incrementally according to the theoretical guarantees of the Hoeffding bound (or additive Chernoff bound). A node is expanded as soon as sufficient statistical evidence is obtained that an optimal splitting feature exists, a decision based on the Hoeffding bound, which is independent of the distribution [50].
• J48 is an upgrade to ID3. J48 accounts for extra features such as the pruning of decision trees, continuous attribute value ranges, rule derivation, and missing values. J48 is an open-source Java implementation of the C4.5 algorithm within the WEKA data mining framework. The WEKA tool provides several related choices for tree pruning [50].
• Random Forest (RF) is an ensemble technique typically utilized in classification, whereby different decision trees are employed in data classification [51,52]. Bootstrap samples are drawn from the original data, and an unpruned classification or regression tree is grown on every bootstrap sample.

Hybrid Model for AHDD
A hybrid model for the AHDD system is described in this subsection.
The AHDD system integrates a binary CNN model with the MAFW model. The CNN consists of five layers that categorize patients into "healthy" or "with HD" on the basis of several HD features. The binary CNN architecture includes input, convolution, pooling, activation, and output layers. The MAFW model implements SVM and NB to perform the initial classification used to tune the CNN, and a GA to perform a global search on the HD features and adjust the weights of the CNN to include the best set of features. Fig. 3 shows the basic model of the AHDD system.
The AHDD system process starts with an input layer that receives HD symptoms as inputs in a specific structure, which are then fed to the convolutional layer. Eight convolutional layers (l1-l8) reconstruct the features through a filtering process by using different numbers of kernels (l1:6k, l2:2k, l3:6k, l4:6k, l5:12k, l6:12k, l7:18k, and l8:18k). In a conventional CNN, the weights of the kernels are randomly initialized. In our hybrid model, the weights are initialized and adjusted on the basis of the MAFW model: the CNN assigns weights between 0 and 1 according to the initial classification results of SVM and NB during the data classification phase of the MAFW model. The operation of feature selection is presented in Section 2.2.2. The weights of the features are transformed from a 2D matrix into a 1D matrix to be processed by the CNN. Subsequently, the pooling layer reduces the dimension of the feature map by averaging the outputs of the kernels in the convolutional layers with the average pooling function. The activation layer applies the ReLU function to improve the convergence rate of the learning process. Finally, the output layer classifies the processed cases into "healthy" or "with HD" according to the training process. When the training process converges to the best solution (highest accuracy), the model parameters are fixed and the model becomes ready for the testing phase. In the testing phase, the weights tuned by the CNN are used, with which the minimum classification or diagnosis error rate is achieved. Algorithm 2 represents the MAFW optimization of the CNN.
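The layer sequence described above (weighted inputs, convolution, average pooling, ReLU, and a binary output) can be sketched as a single forward pass. This is an illustrative simplification with one convolutional stage and no training loop; applying the per-feature weights by elementwise multiplication, the kernel values, and the names `forward`, `feature_weights`, and `out_weights` are assumptions, not details confirmed by the paper.

```python
import math

def conv1d(x, kernel):
    """Valid 1D convolution (cross-correlation) of a vector with one kernel."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) for i in range(len(x) - k + 1)]

def avg_pool(x, size=2):
    """Average pooling over non-overlapping windows."""
    return [sum(x[i:i + size]) / size for i in range(0, len(x) - size + 1, size)]

def relu(x):
    """Rectified linear unit applied elementwise."""
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(features, feature_weights, kernels, out_weights, bias=0.0):
    """Weighted input -> convolution -> average pooling -> ReLU -> sigmoid output."""
    x = [f * w for f, w in zip(features, feature_weights)]
    maps = [relu(avg_pool(conv1d(x, k))) for k in kernels]
    flat = [v for m in maps for v in m]
    z = sum(v * w for v, w in zip(flat, out_weights)) + bias
    return sigmoid(z)

# Hypothetical example: 13 scaled features, weights in [0, 1] standing in for
# the values that the feature wrapper would supply, and two kernels of width 3.
features = [0.5] * 13
feature_weights = [1.0, 0.8, 0.0, 0.6, 0.9, 0.0, 0.7, 1.0, 0.4, 0.0, 0.5, 0.9, 0.3]
kernels = [[0.2, 0.5, 0.2], [-0.3, 0.6, -0.3]]
out_weights = [0.1] * 10   # 2 kernels x 5 pooled values each
p_hd = forward(features, feature_weights, kernels, out_weights)
label = "with HD" if p_hd >= 0.5 else "healthy"
```

A weight of 0 removes a feature from the computation, which is how a selection mask and a weighting scheme can share one representation in such a sketch.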

Validation of Classifiers
The k-fold cross-validation (CV) approach and six metrics are used to evaluate the performance of the classifiers. In k-fold cross-validation, the dataset is separated into k parts of equal size; the classifiers are trained on k − 1 parts, and the performance in each step is checked on the remaining part. The validation cycle is repeated k times. The performance of the classifier is calculated on the basis of the k results. Various values of k can be chosen for CV. k = 10 is used in our experiment because its performance is good; in each cycle of 10-fold CV, 90% of the data are used for training, whereas 10% are used for testing. The procedure is repeated 10 times, once for every fold, and before new sets are selected for the next cycle, the training and testing instances are randomly distributed over the entire dataset [53,54]. Finally, the averages of all output metrics are computed at the end of the 10-fold cycle.
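The 10-fold procedure can be sketched with index bookkeeping alone. The sketch below shuffles once and then rotates the held-out fold, which is a common simplification of the reshuffling described above; `k_fold_indices`, `cross_validate`, the balanced toy labels, and the majority-class baseline are hypothetical names and data used only for illustration.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_validate(evaluate, n, k=10):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)

# Hypothetical labels for 270 instances (0 = healthy, 1 = with HD).
labels = [i % 2 for i in range(270)]

def majority_baseline(train, test):
    """Predict the most frequent training label and return the test accuracy."""
    ones = sum(labels[j] for j in train)
    guess = 1 if 2 * ones >= len(train) else 0
    return sum(labels[j] == guess for j in test) / len(test)

mean_accuracy = cross_validate(majority_baseline, n=len(labels), k=10)
```

With 270 instances and k = 10, every cycle trains on 243 instances and tests on 27, matching the 90%/10% split. Any classifier can be plugged in by replacing `majority_baseline` with a function of the same signature.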

Evaluation Metrics of HD Performance
Specific performance evaluation metrics are assessed to evaluate the performance of the classifiers. A confusion matrix, which places each observation of the test set in exactly one cell, is used (Tab. 2). This matrix is a 2 × 2 matrix because there are two response classes. It provides two types of correct prediction and two types of incorrect prediction for a classifier. The following quantities are obtained from the confusion matrix:
• TP (true positive): the observed subject has HD and is correctly classified as having HD.
• TN (true negative): the observed subject is healthy and is correctly classified as healthy.
• FP (false positive): the observed subject is healthy but is wrongly classified as having HD (type 1 error).
• FN (false negative): the observed subject has HD but is wrongly classified as healthy (type 2 error).
A value of 1 denotes the positive class (subject with HD), whereas a value of 0 denotes the negative class (healthy subject).
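As a small illustration of how the four confusion-matrix cells are counted, the following sketch tallies TP, TN, FP, and FN from binary labels (1 = with HD, 0 = healthy). The function name and the example label vectors are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count the four cells of a binary confusion matrix (1 = HD, 0 = healthy)."""
    tp = tn = fp = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == 1 and predicted == 1:
            tp += 1  # patient with HD, classified as HD
        elif truth == 0 and predicted == 0:
            tn += 1  # healthy subject, classified as healthy
        elif truth == 0 and predicted == 1:
            fp += 1  # healthy subject, classified as HD (type 1 error)
        else:
            fn += 1  # patient with HD, classified as healthy (type 2 error)
    return tp, tn, fp, fn


# Hypothetical reference labels and classifier outputs.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```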
The output of each method is evaluated at this phase to determine which method achieves the best result. The following parameters are evaluated: accuracy, precision, recall, and F-measure. These parameters are described and calculated as follows:
• Accuracy refers to the closeness of the predicted values to the real data values, that is, the proportion of subjects that are classified correctly:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
• Precision is the proportion of subjects classified as positive that are truly positive. It measures the classifier's ability to reject irrelevant subjects:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$
• Recall is the proportion of truly positive subjects that are identified. It tests the classifier's ability to find all relevant subjects:
$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$
• F-score can be regarded as a weighted average of recall and precision, wherein the F1 score reaches its worst value at 0 and its best value at 1. Precision and recall contribute equally to the F1 score. The F1 score is calculated as follows:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
where the value of precision is obtained on the basis of Eq. (2) and that of recall on the basis of Eq. (3).
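The four metrics can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions of accuracy, precision, recall, and F1; the function name, the zero-division handling, and the example counts are our own assumptions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    A ratio with an empty denominator is reported as 0.0 (an assumed convention).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=3, tn=3, fp=1, fn=1)
```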

Parameter Settings of Machine Learning Techniques
Each classification technique requires one or more parameters that control (affect) the classifier's predictive outcome. Selecting the best values for those parameters is difficult and requires a trade-off between model complexity and model generalization. In this study, a grid search is used to find the parameter settings. A grid search involves defining a grid of values (2D or 3D depending on the number of model parameters) and increasing each parameter by a fixed interval until the optimal parameter values are found. The advantage of this approach is that it allows the selection of optimal parameters at specified intervals. However, this approach is considered expensive in terms of computation time. The ML techniques used herein are shown in Tab. 3.

Tab. 3 (rows for SGD and AHDD):
SGD: epochs = 500, learning rate = 0.01, loss function = hinge loss, regularization constant = 0.0001
AHDD: eight convolutional layers (l1-l8) reconstruct the features through a filtering process by using different numbers of kernels (l1:6k, l2:2k, l3:6k, l4:6k, l5:12k, l6:12k, l7:18k, and l8:18k)

After the parameters are set, the dataset is split into training and testing sets according to the leave-one-out cross-validation protocol. The labels are used during the learning process for the supervised approaches. Subsequently, during the test phase, the labels predicted by each classifier are matched against the true labels (reference labels) to calculate the classification performance. Unlike supervised models, unsupervised models are trained using only the extracted features, and the reference labels are not used for training; they are utilized only for evaluating the classification. Two settings are considered: (1) all the extracted features are used as classifier input, and (2) only the selected features are used. Every sub-dataset is considered separately for selecting the most important features.
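The grid search used for parameter setting in this section can be sketched as an exhaustive loop over all parameter combinations. The grid values and the scoring function below are hypothetical; in practice, the score would be a cross-validated performance measure of the classifier trained with the given parameters.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical score that peaks at learning_rate=0.01, regularization=0.0001."""
    return (-abs(params["learning_rate"] - 0.01)
            - abs(params["regularization"] - 0.0001))


# A 2D grid: two parameters, three candidate values each (assumed values).
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "regularization": [0.00001, 0.0001, 0.001],
}
best_params, best_score = grid_search(grid, toy_score)
```

The number of evaluations is the product of the grid sizes (nine here), which is why the approach becomes expensive as parameters or candidate values are added.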

Experimental Results and Discussion
The performance of various ML methods, namely, J48, RF, NB, SGD, SMO, VPM, IBk, AdaBoostM1, LogitBoost, MultiClassClassifierUpdateable, and Hoeffding Tree, along with the hybrid model (binary CNN model with the MAFW model), is tested on the Cleveland HD dataset and discussed from different perspectives. The MAFW model and the k-fold cross-validation method are adopted for critical feature selection. Several metrics are assessed to evaluate the performance of these methods and test the efficiency of the classifiers. All features are standardized and normalized before they are applied to the classifiers. The overall results for the original dataset are presented in Tab. 4 on the basis of the following 13 features: AGE, SEX, CPT, RBP, SCH, FBS, RES, MHR, EIA, OPK, PES, VCA, and THA. Furthermore, the following parameters are used in the evaluation process: accuracy, TP rate, FP rate, precision, recall, and F-measure. Based on the eight features selected by the MAFW model, namely, SEX, CPT, MHR, EIA, OPK, PES, VCA, and THA, the hybrid model has the highest diagnostic accuracy of 90.1% (Tab. 5), followed by SMO and LogitBoost with 83.8%. SMO is also higher than X in terms of precision but not in terms of recall and F-measure. By contrast, RF has the lowest accuracy of 72.3%. Except for J48, the other methods have accuracies above 80%. In general, remarkable improvements are noted in the diagnosis results obtained for the 12 classifiers when the filtered dataset is used. The Hoeffding Tree achieves the highest accuracy improvement of 34.4%; thus, its performance is highly influenced by the proposed feature selection model. The improvement in accuracy is 6.3% for SGD and MultiClassClassifierUpdateable, 5.6% for SMO, 4.8% for the hybrid model, and 3.9% for J48. By contrast, IBk has the lowest accuracy improvement of 0.3%. The average improvement in diagnostic accuracy is 6.17%.
Furthermore, the TP rate for all classifiers is substantially increased, whereas the FP rate for SMO and LogitBoost is decreased to the minimum level, indicating that the classification results of HD cases are more reliable. In summary, the proposed feature selection model has a considerable effect on accuracy improvement for 11 out of 12 classifiers. This result confirms that the proposed model can work successfully with different types of classification algorithms. The differences in the diagnostic accuracy outcomes obtained via the ML models for the two datasets are illustrated in Fig. 4.
Classification can be precisely conducted by exploring the following aspects: (i) the best technique for diagnosing or predicting a given disease or issue, (ii) the ideal classifier for the assessment and determination of HD features, and (iii) the best parameters for the ML methods based on the selected HD features. Thus, by applying various classification methods to HD datasets, the most appropriate and efficient ML method for distinguishing healthy individuals from patients with HD can be identified. In previous studies, when the number of HD features is reduced, the dataset's feature vector is improved, the complexity of the ML models is reduced, and the precision of diagnosis is enhanced. For example, MultiClassClassifierUpdateable, which is a set of methods that can work precisely with complex decision boundaries, frequently exhibits sensitivity to feature selection. Thus, such methods would likely suffer from overfitting issues. Therefore, the assessment, selection, and ranking of HD features within classification problems are not guided by fixed factors. This procedure is executed on the basis of the properties of the HD data, the type of ML models, and the complexity of the decision boundaries. The MAFW model applies a multi-agent system that renders the process of feature selection more flexible by dividing the selection functionalities into four tasks, namely, preparation, selection, classification, and evaluation. These tasks interact with each other, reason over the input, process, and output of each task, and apply the necessary revisions to the processes responsible for achieving the tasks during runtime. In this way, this study aims to determine the ideal combination of HD features that can be utilized to obtain adjusted feature selections and to enhance the accuracy of HD identification. The MAFW model selects important HD features to distinguish healthy people from patients with HD.
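To make the wrapper idea concrete, the following sketch runs a small genetic algorithm over binary feature masks and scores each mask with a classifier, loosely following the four tasks named above (preparation, selection, classification, evaluation). It is a simplified illustration under several assumptions, not the MAFW model itself: there are no software agents, a nearest-centroid classifier stands in for SVM and NB, the data are synthetic, and all function names, GA settings, and the per-feature penalty are hypothetical.

```python
import random


def make_data(rng, n=200, n_features=6, informative=(0, 2)):
    """Preparation: synthetic two-class data; only `informative` columns carry signal."""
    rows = []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are always present
        x = [rng.gauss(2.0 * label if j in informative else 0.0, 1.0)
             for j in range(n_features)]
        rows.append((x, label))
    return rows


def centroid_accuracy(train, test, mask):
    """Classification: nearest-centroid accuracy using only the masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    centroids = {}
    for label in (0, 1):
        members = [x for x, y in train if y == label]
        centroids[label] = [sum(x[j] for x in members) / len(members) for j in cols]
    correct = 0
    for x, y in test:
        dist = {label: sum((x[j] - c) ** 2 for j, c in zip(cols, centroids[label]))
                for label in (0, 1)}
        correct += min(dist, key=dist.get) == y
    return correct / len(test)


def fitness(mask, train, test, penalty=0.01):
    """Evaluation: accuracy minus a small cost per selected feature."""
    return centroid_accuracy(train, test, mask) - penalty * sum(mask)


def ga_select(train, test, n_features, rng, pop_size=12, generations=15, p_mut=0.1):
    """Selection: GA search over feature masks with elitism."""
    def score(mask):
        return fitness(mask, train, test)

    # Start from the complete feature set plus random masks.
    population = [[1] * n_features]
    population += [[rng.randint(0, 1) for _ in range(n_features)]
                   for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        next_gen = [population[0][:]]  # elitism: keep the best mask unchanged
        while len(next_gen) < pop_size:
            a = max(rng.sample(population, 2), key=score)  # tournament selection
            b = max(rng.sample(population, 2), key=score)
            cut = rng.randint(1, n_features - 1)           # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # mutation
            next_gen.append(child)
        population = next_gen
    return max(population, key=score)


rng = random.Random(42)
data = make_data(rng)
train, test = data[:100], data[100:]
best_mask = ga_select(train, test, 6, rng)
```

Because the complete feature set is part of the initial population and the best mask is always carried over, the returned mask never scores worse than using all features under this fitness function.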
According to the MAFW model, the most important and reasonable features for HD identification are exercise-induced angina, thallium scan, and type of chest pain. Moreover, the model suggests that fasting blood sugar is not appropriate for distinguishing patients with HD from healthy individuals. In this study, the classification, feature extraction, dataset preprocessing, validation, and evaluation of classification performance are comprehensively discussed. A complete set of features and a selected set of features are used to evaluate the performance of our system. The complete set of features is reduced to generate the selected set of features. This process greatly affects the accuracy of the classification methods and their execution time. The proposed HD diagnosis system can help medical practitioners identify patients with HD efficiently.
Benchmarking is an essential step in research on common medical processes of disease diagnosis. Benchmarking can be used to compare the efficiency and reliability of newly developed approaches and existing ones. Benchmarking is usually conducted either through the use of a standard dataset or through different approaches for the same problem domain or application. Moreover, benchmarking is achieved by utilizing the best and most recent methods for HD classification based on existing ML approaches and feature selection methods. Tab. 5 summarizes the different benchmarking approaches for several processes. The limitations of the present work can be summarized as follows:
• The models evaluated herein are 12 different classifiers. Additional classifiers should be tested to provide a more comprehensive evaluation of the results.
• Given that the proposed model relies on the wrapper feature selection approach, its main limitations are high computational cost and time complexity.
• Runtime is not considered as an evaluation criterion.

Conclusion
In this study, an AHDD system for HD diagnosis is proposed. The AHDD system integrates a new MAFW model with a binary CNN model. The MAFW model performs feature selection and optimization tasks, whereas the CNN model conducts classification tasks. The MAFW model consists of four software agents that operate a GA, an SVM, and NB. The agents instruct the GA to perform a global search on HD features and adjust the weights of SVM and NB during the initial classification phase. The model chooses the important features for enhancing the performance of the classifiers. The AHDD system is trained, tested, and validated using the Cleveland HD database. The benchmarking classification models of NB, SGD, SMO, VPM, IBk, AdaBoostM1, LogitBoost, MultiClassClassifierUpdateable, Hoeffding Tree, J48, RF, and the hybrid model proposed herein are integrated with the proposed MAFW model for testing and evaluation. The k-fold cross-validation technique is used to evaluate the performance of the ML models and the MAFW model in terms of accuracy, TP rate, FP rate, precision, recall, and F-measure. The MAFW model selects important HD features that increase the accuracy of distinguishing patients with HD from healthy individuals for all the tested classifiers. According to the MAFW model, the strongest features are exercise-induced angina, thallium scan, and type of chest pain, whereas fasting blood sugar is found to be a weak feature. Moreover, the hybrid model achieves the highest accuracy of 90.1%, a high precision of 88.9%, and a high recall of 98.4%. The average accuracy of the benchmarking ML models with the aid of the MAFW model is 75.08%, in which SMO and LogitBoost have the highest accuracy (83.8%) and RF has the lowest accuracy (72.3%). Furthermore, the MAFW model increases the overall accuracy of the benchmarking classifiers by 6.2% and that of the hybrid classifiers by 4.8%. In a follow-up study, we will test the proposed hybrid model and the MAFW model on other multivariate datasets. In addition, we will include training and testing runtime as evaluation criteria.
Funding Statement: This research received funding from the Basque Country Government.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.