Distributed Healthcare Framework Using MMSM-SVM and P-SVM Classification

Abstract: With the modernization of machine learning techniques in healthcare, innovations such as the support vector machine (SVM) have played a major role in classifying lung cancer, predicting coronavirus disease 2019 (COVID-19), and diagnosing other diseases. Unlike existing works, our algorithm focuses on integrated datasets. In this study, parallel SVM (P-SVM) and multiclass-based multiple submodels SVM (MMSM-SVM) were used to analyze the optimal classification of lung diseases. The analysis aimed to find the optimal classification of lung diseases together with their id and stages, represented as key-value pairs in MapReduce, by combining MapReduce with P-SVM for binary classification and with MMSM-SVM for multiclass classification. For nonlinear classification, a kernel clustering-based SVM embedded with multiple submodels was developed. Both algorithms were developed in the Apache Spark environment, and data for the analysis were retrieved from a microscope lab, UCI, Kaggle, and the General Thoracic Surgery Database, along with some electronic health records related to various lung diseases, to increase the dataset size to 5 GB. Performance was measured on the 5 GB dataset with five nodes. Finally, the dataset size was increased, and task analysis and CPU utilization were measured.


Introduction
Big data techniques play a vital role in analyzing extremely large datasets efficiently and with reduced complexity. With enhanced big data techniques, large amounts of data can be handled in parallel. Specifically, data classification has been performed using salient solutions. In the real world, exponentially growing data are complex and challenging to classify [1]. Predicting coronavirus disease 2019 (COVID-19) is essential to limit the risk of spread, and determining lung cancer stages in advance is essential to identify the lung cells damaged as the stages progress [2]. In medical science, affected parts can be retrieved and used to diagnose early stages of the disease [3]. Biopsy is the initial step in diagnosis; during this process, cells are sampled. Lung cancer is a leading disease worldwide. To eradicate lung cancer, healthcare practitioners should employ various methods, but processing and extracting results from many datasets is challenging [7]. A previous study [8] extracted information from several datasets by using P-SVM. This technique uses row-based approximate matrix factorization, which loads only essential data onto each machine to perform parallel computation. In addition, some of the computations use big data tools. Another study [9] solved optimization problems over the cloud by using MapReduce techniques along with parallel computation. It also used statistical learning theory to select the hypothesis that minimizes empirical risk and focused on multiclass parallel computations.
In [10], the author used multiple submodel parallel SVM (MSM-SVM) on Spark to accelerate the training of nonlinear SVM. Furthermore, data splitting methods improve the performance of parallel computation and approximate the global solution with several local submodels. The author handled multiclass problems with a "one-against-one" strategy [11]. A convolutional neural network-based multimodal disease risk prediction algorithm has been proposed to handle structured and unstructured data [12,13]. In addition, a latent factor model has been developed to handle incomplete data [14]. The former process also reconstructs missing data. Reference [15] analyzed the persistence of diabetes by using HUE and accurately counted the number of persons suffering from diabetes by using SVM. Reference [16] developed a tele-ECG system with Hadoop and a big data framework, using mining techniques to process and classify datasets related to cardiovascular disease. Although Hadoop was adopted, handling large datasets raised concerns in terms of server handling. The most significant and essential tool in big data is MapReduce, and its efficient use improves performance [17,18]. The author analyzed the impact of MapReduce and penalty parameters on large-scale datasets, divided the datasets into chunks, and processed them under the Hadoop framework. Another efficient submodel in MapReduce is the adjoint method [19]. The MapReduce-based adjoint method helps prevent brain disease by detecting it earlier.
Reference [20] implemented communication-efficient versions of parallel SVM and further developed CA-SVM. The author deployed statistical methods and algorithmic refinements to improve communication efficiency. C-means clustering, with data collected from the UCI machine learning repository, has been proposed for analyzing patient records [21,22]. The author provided a framework for predicting and prescribing drugs for specified diseases. Reference [23] provided predictive pattern matching in Hadoop MapReduce environments to predict diabetes mellitus. The machine learning-based prediction methodology developed there has drawbacks in its early analysis. Therefore, a new, accurate prediction methodology is required to overcome these drawbacks.
References [24,25] discussed the basics of predictive analytics in healthcare and showed its impact in general healthcare applications. In our system, the RBF acts as the nonlinear kernel for SVM. A study [26] deployed a parallel RMC algorithm to classify medical data. This algorithm works well for integrated data, as in our model; hence, we used it for comparison with our proposed model. Cascade SVM from a previous study [27] has been updated and compared with our proposed model. The only difference is that cascade SVM was applied to classify flower seeds, which is a general application.
In this study, an underlying SVM with threshold-based techniques was developed to classify the datasets. The classified support vectors were then fed to MMSM-SVM with some parameter changes and passed to MapReduce to extract the id and stages from the classified vectors. Apart from multiple submodels, kernel clustering-based SVM (KCB-SVM) was incorporated to cluster similar datasets; declustering was reduced, and, to cover all hidden data, most of the dataset falls near the margin of the support vectors. P-SVM and MMSM-SVM with some parameter settings were combined for binary classification. Finally, the id and stages were retrieved from the MapReduce framework with four nodes of parallel computation. This analysis aimed to find the optimal classification of lung diseases together with their id and stages, represented as key-value pairs in MapReduce, by combining MapReduce with P-SVM for binary classification and with MMSM-SVM for multiclass classification. In this analysis, the MMSM-SVM algorithm was developed from MSM-SVM to classify high-dimensional lung disease datasets. Furthermore, the MapReduce technique was utilized to retrieve the different id and stages from the classified support vectors. The obtained results show that the developed MMSM-SVM algorithm achieves an accuracy of 92% in classification with optimal datasets, which is higher than that of other learning techniques. The P-SVM algorithm also achieves an accuracy of 90% in classification with different parameter settings for every dataset. Both algorithms were developed in the Apache Spark environment, and the data for the analysis were retrieved from a microscope lab, UCI, Kaggle, and the General Thoracic Surgery Database, along with some electronic health records (EHR) related to various lung diseases, to enlarge the datasets.

Proposed Approach and Methodology
In big data classification, SVM models and submodels have their own architecture. The proposed classification architecture is shown in Fig. 1. Samples that are similar in nature form one cluster, and the remaining samples are more likely to become support vectors. Samples in different regions are less likely to contribute to training. Meanwhile, the samples are trained using local submodels.

Modelling of Multi Class-Based Multiple Sub Models Support Vector Machine
Multiclass classification is the most significant part of various classification tasks because it determines the stages or classes of the datasets. The submodel approach is well suited for multiclass classification. For every class C ∈ DT/C, a complete multiclass model with function f_i(X) is trained. The class C_t is selected as the preferred class of a sample, where sample ∈ DT/C, if it wins against all other classes under the winner-takes-all strategy. The resultant models and the decision function of the local submodels are derived accordingly. For one-vs.-all, let the training set be T = ((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)), where y ∈ {1, ..., k} and k is the number of classes. For each l = 1, ..., k, class l is considered the positive class and the other k − 1 classes are considered negative classes. With these representations, the decision function determines whether a sample belongs to the specified class, where r represents the assigned class (4).
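The winner-takes-all rule described above can be illustrated with a minimal Python sketch. It assumes that each one-vs.-all submodel exposes a real-valued decision function and that the sample is assigned to the class with the largest value. The helper `linear_submodel`, the stage labels, and all weights are hypothetical and serve only as an illustration; they are not taken from the trained models in this study.

```python
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def linear_submodel(weights: Vector, bias: float) -> Callable[[Vector], float]:
    """Return a decision function f(x) = w . x + b for one class-vs-rest submodel."""
    def decision(x: Vector) -> float:
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return decision


def winner_takes_all(x: Vector, submodels: Dict[str, Callable[[Vector], float]]) -> str:
    """Assign x to the class whose one-vs.-all decision value is largest."""
    scores = {label: f(x) for label, f in submodels.items()}
    return max(scores, key=scores.get)


# Hypothetical submodels for three illustrative stages (weights are made up).
submodels = {
    "stage_1": linear_submodel([1.0, 0.0], 0.0),
    "stage_2": linear_submodel([0.0, 1.0], 0.0),
    "stage_3": linear_submodel([-1.0, -1.0], 0.5),
}

assigned = winner_takes_all([2.0, 0.5], submodels)
print(assigned)
```

In practice, the decision functions would come from the trained kernel submodels rather than from fixed linear weights; only the final arg-max step is what the sketch is meant to show.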

Figure 1: Overall layout of proposed methodology
In multiple submodels, the system must move away from local training by enabling the clustering and splitting models. Some of the clusters may have classes C_t, where C_t ∈ DT, which gives classification insight into the classes with the largest similarity in the feature space. The clustering model forms local subsets and classes with high preference. KCB-SVM is incorporated with an approximate hierarchical clustering method, which scans the whole large dataset and provides boundaries for similar classes. It also estimates the best boundary with respect to limited resources and provides high scalability.
In the clustering stage, the clustering feature (CF) of every cluster includes c and r, where c is the center of the cluster and r is its radius. Let (x_i, y_i) be the input parameters and H_i be the geometric metric of the mapped feature space. The radius is calculated with respect to the cluster center and the distance between two data points.
For the RBF kernel, the distance measures are computed in the mapped feature space. If some clusters are not selected for computing the cluster center, the unselected cluster H_(l−1)k is merged into cluster H_ly, and the CFs of cluster H_ly are then updated to include it (8). Here, H_(l−1)k denotes the unselected cluster, H_ly denotes the cluster with margin Y into which it is merged, and l is the cluster level. The radius can be calculated as the maximum summation of clusters and unselected clusters with the distance of clusters and unselected clusters.
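As an illustration of distance computation under the RBF kernel, the following Python sketch uses the standard kernel-trick identity ||φ(x) − φ(z)||² = K(x, x) − 2K(x, z) + K(z, z), which reduces to 2 − 2K(x, z) for the RBF kernel because K(x, x) = 1. Taking the cluster center as the feature-space centroid and the radius as the largest member-to-centroid distance is an assumption made for this sketch, not necessarily the exact definition used in the clustering feature; the function names are likewise illustrative.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, z: Vector, gamma: float) -> float:
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared)


def feature_space_distance(x: Vector, z: Vector, gamma: float) -> float:
    """Distance between phi(x) and phi(z); K(x, x) = K(z, z) = 1 for the RBF kernel."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * rbf_kernel(x, z, gamma)))


def distance_to_centroid(x: Vector, cluster: List[Vector], gamma: float) -> float:
    """Distance from phi(x) to the feature-space centroid of the cluster."""
    n = len(cluster)
    cross = sum(rbf_kernel(x, z, gamma) for z in cluster) / n
    within = sum(rbf_kernel(a, b, gamma) for a in cluster for b in cluster) / n ** 2
    return math.sqrt(max(0.0, 1.0 - 2.0 * cross + within))


def cluster_radius(cluster: List[Vector], gamma: float) -> float:
    """Assumed radius: the largest member-to-centroid distance in feature space."""
    return max(distance_to_centroid(x, cluster, gamma) for x in cluster)


# Two illustrative points; gamma = 0.09 is one of the settings used in the experiments.
cluster = [(0.0, 0.0), (1.0, 0.0)]
radius = cluster_radius(cluster, gamma=0.09)
print(round(radius, 4))
```

Because the centroid is never formed explicitly, the sketch needs only kernel evaluations, which is what makes clustering in the mapped feature space feasible.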
Declustering is implemented with separate conditions for the positive classes and for the negative classes. Let the parameters be the number of cores, the sample size, the lung cancer datasets LC with classes 1 to C, where C denotes the number of classes, the sputum datasets SP with classes 1 to C, and DT, which denotes the datasets and the clustering model.

Modelling of Parallel Support Vector Machine
Fig. 2 depicts the layout of P-SVM. The support vectors that are already classified are given as input to P-SVM. Sub-support vectors are calculated and optimized using P-SVM. Then, the support vectors calculated by the previous sub-SVM are given as input to the next sub-SVM. Therefore, the outputs of more than two preceding sub-SVMs form the input to the current sub-SVM. The process continues until a single set of support vectors is derived as the result. Furthermore, P-SVM can be implemented in Spark by using the LIBSVM library. The P-SVM procedure includes the following steps:
11. S → s + n; \\ add global support vectors with subsets of training data
12. Train support vector machine with new merged dataset.
13. Find out all the support vectors with each data subset.
14. Merge
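The data flow of the cascade described above (train on each subset, keep only the support vectors, merge them, and retrain until a single set remains) can be sketched in Python as follows. To keep the sketch self-contained and exactly checkable, a hard-margin SVM on separable one-dimensional data stands in for LIBSVM training: in that setting, the support vectors are simply the closest pair of opposite-class points. The stand-in trainer, the `fan_in` parameter, the strided partitioning, and the toy data are all assumptions of this sketch and not the implementation used in this study.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (feature value, label in {-1, +1})


def train_svm_1d(samples: List[Sample]) -> Tuple[float, List[Sample]]:
    """Hard-margin SVM on separable 1-D data with negatives left of positives.

    Returns the decision threshold and the two support vectors.
    """
    neg = max(x for x, y in samples if y == -1)
    pos = min(x for x, y in samples if y == +1)
    if neg >= pos:
        raise ValueError("data are not separable in the assumed orientation")
    return (neg + pos) / 2, [(neg, -1), (pos, +1)]


def cascade_svm(samples: List[Sample], n_subsets: int, fan_in: int = 2):
    """Train per subset, then merge support vectors fan_in sets at a time."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    # Strided partitioning; every subset must contain both classes.
    subsets = [samples[i::n_subsets] for i in range(n_subsets)]
    sv_sets = [train_svm_1d(subset)[1] for subset in subsets]
    while len(sv_sets) > 1:
        sv_sets = [
            train_svm_1d([sv for group in sv_sets[i:i + fan_in] for sv in group])[1]
            for i in range(0, len(sv_sets), fan_in)
        ]
    return train_svm_1d(sv_sets[0])


# Toy data: five partitions mirror the five nodes used in the experiments.
data = [(-5.0, -1), (1.0, 1), (-4.0, -1), (2.0, 1), (-3.0, -1),
        (3.0, 1), (-2.0, -1), (4.0, 1), (-1.0, -1), (5.0, 1)]
threshold, support_vectors = cascade_svm(data, n_subsets=5)
print(threshold, support_vectors)
```

On this toy problem the cascade recovers the same support vectors as training on the full dataset; with kernel SVMs on real data, a single pass of the cascade generally yields only an approximation of the global solution.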

Modelling of MapReduce
MapReduce is a programming model suited to processing huge volumes of data. The developed MapReduce model is shown in Fig. 4. Hadoop can run MapReduce programs written in various languages. A MapReduce program works in two phases, namely, the Map phase and the Reduce phase. The input to each phase is a set of key-value pairs, and the programmer needs to specify two functions: map and reduce.
MS denotes the splitting of input data in the map phase with respect to different tasks. The data derived from MS are the required partial function of the given input data.
Herein, the map phase partitions the input data and returns reduced data, which are then fed as input to the reduce tasks. Therefore, the results of the map tasks are (id, stages) pairs in an unstructured format. The Reduce task is formatted as in (14): the reduce splits (RS) process the intermediate results by formatting them and generate partitions of the reduced data. Then, the reduce task (RT) takes RS as input and partitions the reduced data into the required format (id, stages).
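The map and reduce phases described above can be mimicked in plain Python to show how (id, stages) key-value pairs flow through the pipeline. This is a minimal single-process sketch, not Hadoop or Spark code. The function names follow the MS, RS, and RT notation of the text, whereas the comma-separated record layout, the sample records, and the choice to return the sorted distinct stages per id are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_split(lines: List[str], n_splits: int) -> List[List[str]]:
    """MS: split the input lines into n_splits chunks, one per map task."""
    return [lines[i::n_splits] for i in range(n_splits)]


def map_task(chunk: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Emit one unstructured (id, stage) pair per classified record."""
    for line in chunk:
        record_id, stage = line.split(",")
        yield record_id.strip(), stage.strip()


def reduce_split(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """RS: group the intermediate pairs by id."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record_id, stage in pairs:
        grouped[record_id].append(stage)
    return grouped


def reduce_task(grouped: Dict[str, List[str]]) -> List[Tuple[str, List[str]]]:
    """RT: format each group as (id, sorted distinct stages)."""
    return [(rid, sorted(set(stages))) for rid, stages in sorted(grouped.items())]


# Hypothetical classifier output, one "id, stage" record per line.
lines = ["p1, stage_2", "p2, stage_1", "p1, stage_3", "p3, stage_1", "p2, stage_1"]
pairs = [pair for chunk in map_split(lines, 2) for pair in map_task(chunk)]
result = reduce_task(reduce_split(pairs))
print(result)
```

In a real deployment, the grouping step is performed by the framework's shuffle between the map and reduce tasks, so only the map and reduce functions would be user code.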

Experimental Setup
We computed the classification accuracy in a data center using five nodes with three executors per node. The nodes had a RAM size of 64 GB, the executor memory was set to 19 GB, and the total big data size was 500 GB. Furthermore, we increased the data size in steps, starting from 5 GB, and reduced the running time. We needed 15 tasks/node for the data and therefore ran 75 tasks in parallel on five nodes. Moreover, PySpark, LIBSVM, and MapReduce were used for P-SVM binary classification, and the MMSM-SVM environment was used for the parallel execution of multiclass classification.

a. System specification (CPU details and number of nodes):
System 1: 4 cores/CPU, 128 GB memory, 10 Gbps network; 3 nodes
System 2: 4 cores/CPU, 64 GB memory, 10 Gbps network; 2 nodes

The datasets used in our work are listed in Tab. 1. The training-to-testing ratio was 8:2 for the sputum datasets, 7:3 for the thoracic surgery datasets, and 5:5 for the lung cancer datasets. The sample size for MMSM-SVM was 0.5. The number of iterations of the experiment increases by n times, where n depends on the size of the datasets. We set the number of iterations to 200 for binary classification because stability was achieved at the 200th iteration.
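The resource figures quoted in this setup can be cross-checked with a short arithmetic sketch in Python. The function `cluster_plan` and its field names are hypothetical; the input values (five nodes, three executors per node, 19 GB per executor, 64 GB of RAM per node, and 15 tasks per node) are the ones stated in the text.

```python
from typing import Dict


def cluster_plan(nodes: int, executors_per_node: int, executor_memory_gb: int,
                 node_ram_gb: int, tasks_per_node: int) -> Dict[str, int]:
    """Derive cluster-wide totals from the per-node settings."""
    memory_per_node = executors_per_node * executor_memory_gb
    if memory_per_node > node_ram_gb:
        raise ValueError("executors do not fit into the node memory")
    return {
        "total_executors": nodes * executors_per_node,
        "total_tasks": nodes * tasks_per_node,
        "tasks_per_executor": tasks_per_node // executors_per_node,
        "executor_memory_per_node_gb": memory_per_node,
    }


plan = cluster_plan(nodes=5, executors_per_node=3, executor_memory_gb=19,
                    node_ram_gb=64, tasks_per_node=15)
print(plan)
```

With these settings, the three executors occupy 57 GB of the 64 GB available on the smaller nodes, which leaves headroom for the operating system and cluster daemons, and the 15 tasks per node add up to the 75 parallel tasks reported for the 5 GB dataset.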

Results and Discussion
MMSM-SVM is also a submodel of P-SVM. The difference is that P-SVM performs well in binary classification. To obtain accurate results, we used P-SVM for binary classification and MMSM-SVM for multiclass classification. The experimental results for binary and multiclass classification are shown in Tabs. 1 and 2, respectively, and were compared with those of previous studies [19,27]. The C and γ values were varied, and the time (in seconds) and accuracy were measured. The analysis was carried out with C = 2 and γ = 0.09 and with C = 2 and γ = 2 for both the sputum and thoracic surgery datasets. As shown in Tab. 1, the proposed methodology takes 28 s with 90% accuracy for C = 2 and γ = 0.09 on the sputum datasets and 15 s with 90.4% accuracy for C = 2 and γ = 2. On the thoracic surgery datasets, it takes 31 s with 92.2% accuracy for C = 2 and γ = 0.09 and 19 s with 92% accuracy for C = 2 and γ = 2. This analysis indicates that the proposed methodology requires less computation time and achieves higher accuracy than the methods in [19,27]. These measures were observed at a dataset size of 5 GB.
The results obtained for multiclass classification are listed in Tab. 2. This analysis was carried out with C = 2 and γ = 0.09 and with C = 2 and γ = 2 for both the sputum and lung cancer datasets. As shown in Tab. 2, the proposed methodology takes 10 s with 91% accuracy for C = 2 and γ = 0.09 on the sputum datasets and 80 s with 91.4% accuracy for C = 2 and γ = 2. On the lung cancer datasets, it takes 43 s with 92.2% accuracy for C = 2 and γ = 0.09 and 47 s with 92.4% accuracy for C = 2 and γ = 2. The average time of every model was compared against the accuracy metrics to show that our proposed method performs better. As shown in Fig. 6, at the specified time of 120 s, the accuracy of P-SVM is higher than those of the other existing models. Meanwhile, the accuracy of MMSM-SVM is higher than those of other existing works, as shown in Fig. 6b. This analysis indicates that the proposed methodology requires less computation time and achieves higher accuracy than the methods in [19,27]. In addition, our dataset contains replicas of data to increase the dataset size.
The execution time and accuracy of our models for samples of 100-1000 MB are listed in Tabs. 1 and 2, and the corresponding plots are shown in Fig. 6.
For five nodes and the parameter settings given in Tab. 2, the average computation time and the accuracy at the corresponding time were measured and compared with those of existing models.
Figs. 5a-5c depict the performance analysis of the sputum, lung cancer, and thoracic surgery datasets obtained with the proposed methodology. According to the results in Figs. 4-6, the accuracy improved to 92.2% and stabilized over varying iterations. Then, we increased the number of nodes and analyzed the performance. Fig. 6 shows the accuracy analysis for binary and multiclass classification. As shown in Fig. 6a, the accuracy of P-SVM is 3% higher than that of MapReduce and 8% higher than that of Cascade SVM. As illustrated in Fig. 6b, the accuracy of MMSM-SVM is also higher than those of MapReduce and Cascade SVM. The running time for each node is 120 s on average and increases with increasing dataset size; for five nodes, it becomes 300-380 s. Similarly, task analysis was performed for the dataset of about 5 GB. We increased the dataset size from 2 GB to 5 GB, and the metric outcomes deviate for each dataset size, as discussed so far. The number of tasks analyzed for 5 GB of data shows that the system performs the optimal number of tasks for the corresponding dataset size; that is, it yields only 75 tasks for 5 GB of data.
The tasks are shared among the five nodes (the number of nodes required and allotted is discussed in Section 3), with 4 cores and 15 tasks per node. In accordance with MapReduce and other tasks, the optimized performance includes 20 tasks for 2 GB of data and for increased dataset sizes as in Tab. 3. We achieved this optimization with respect to all jobs, specifically MapReduce jobs. The graph plots are illustrated in Fig. 7. Furthermore, AUC values were computed by measuring the specificity and sensitivity of the various algorithms, as shown in Fig. 9. The corresponding values are listed in Tab. 4. In terms of resource utilization, our algorithms and dataset should achieve better CPU utilization. In our study, we achieved about 70%-75% CPU utilization on average across all algorithms. Fig. 9 illustrates the varying measures of balanced datasets in all our proposed algorithms. Specifically, datasets utilize 74% in existing works compared with our proposed method. Even though all mechanisms work well in all metrics, we show that our datasets work dynamically with respect to every algorithm. CPU utilization plots are illustrated in Fig. 8. The sensitivity and specificity of all the classes in every model are listed in Tab. 5, from which the AUC values were calculated. The accuracy of the proposed method is 2.3% higher than that of MapReduce and 7.2% higher than that of Cascade SVM. The results show that the prediction efficiency of the proposed algorithm is greater than that of the MapReduce-based adjoint method [19] and Cascade SVM [27]. In some plots, the parallel RMC proposed in [26] is also compared, which demonstrates the efficiency of the proposed models for some parameters.
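One way to obtain per-class sensitivity, specificity, and an AUC value from a multiclass confusion matrix is sketched below in Python. Each class is treated one-vs.-rest, and the AUC is taken as the area under the ROC curve through the single operating point of the classifier, which by the trapezoidal rule equals (sensitivity + specificity) / 2. The text does not state how the AUC values in this study were computed, so this formula is an assumption, and the confusion matrix is made-up illustrative data rather than a result from the experiments.

```python
from typing import Dict, List


def per_class_metrics(confusion: List[List[int]]) -> List[Dict[str, float]]:
    """confusion[i][j] counts samples of true class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    metrics = []
    for k in range(len(confusion)):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # class k predicted as another class
        fp = sum(row[k] for row in confusion) - tp  # other classes predicted as class k
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        metrics.append({
            "sensitivity": sensitivity,
            "specificity": specificity,
            "auc": (sensitivity + specificity) / 2,  # single-operating-point ROC area
        })
    return metrics


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
for k, m in enumerate(per_class_metrics(confusion)):
    print(k, round(m["sensitivity"], 3), round(m["specificity"], 3), round(m["auc"], 3))
```

If the classifiers output continuous decision values, a full ROC curve obtained by sweeping the threshold would give a more informative AUC than this single-point estimate.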

Conclusion
P-SVM and MMSM-SVM were proposed to analyze the optimal classification of diseases, such as lung cancer. The proposed models for binary and multiclass classification outperform other methodologies. For binary classification, P-SVM was deployed, and the stages were retrieved by using the MapReduce phase. Meanwhile, for multiclass classification, MMSM-SVM retrieved the results with improved accuracy. With KCB-SVM, the datasets are split so that similar samples fall in one cluster, which makes the training phase easier and works well in nonlinear dimensions. In addition, the proposed solution approximates better accuracy without repeated training and testing, which enables the model to use the classification and storage capacity. For load balancing, the model uses the HDFS balancer. The approach handles multiclass classification with the winner-takes-all strategy. The results show that, with a large set of datasets, the support vectors and training time were examined for binary and multiclass classification under optimized parameter settings. In addition, the proposed method shows an accuracy of 90% in classification when compared with competitive methodologies. Our work could help diagnose the stages at the earliest. Thus, the proposed method can be applied to predict other healthcare-related issues, such as COVID-19, by collecting the symptoms of patients from electronic health records. Our study can help prevent COVID-19 by collecting the health conditions of in-patients who are treated for other diseases and predicting the possibility of COVID-19.
Acknowledgement: This study is supported by the Tamil Nadu State Council of Science and Technology. The authors thank the government for their financial assistance and valuable support.
Funding Statement: This study is supported by the Tamil Nadu State Council of Science and Technology.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.