Ensemble Methods for the Medical Insurance Costs Prediction Task

Abstract: The paper reports three new ensembles of supervised learning predictors for managing medical insurance costs. An open dataset is used to develop the data analysis methods. The use of artificial intelligence in financial risk management will help save time and money and protect patients’ health. Machine learning is associated with many expectations, but its quality is determined by choosing a good algorithm and the proper steps to plan, develop, and implement the model. The paper aims to develop three new ensembles for individual insurance cost prediction that provide high prediction accuracy. The Pearson coefficient and the Boruta algorithm are used for feature selection. Boosting, stacking, and bagging ensembles are built, and a comparison with existing machine learning algorithms is given. Boosting models based on a regression tree and stochastic gradient descent are built. Bagged CART and Random Forest algorithms are proposed. The boosting and stacking ensembles show better accuracy than bagging. Tuning the boosting parameters does not reduce the RMSE either. So, bagging shows its weakness in generalizing the prediction. The stacking ensemble is developed using K Nearest Neighbors (KNN), Support Vector Machine (SVM), Regression Tree, Linear Regression, and Stochastic Gradient Boosting. The random forest (RF) algorithm, built with one hundred trees, is used to combine the predictions. The stacking ensemble brings the Root Mean Square Error (RMSE) down to 3173.213, lower than that of the other predictors. In terms of RMSE, the developed ensemble is 1.47 times better than the best weak predictor (SVR).


Introduction
Digital health is a sector that is growing globally. Worldwide, the number of digital health companies has doubled in the last five years [1]. Governments have pledged hundreds of millions of dollars to support their local digital health industries. Individual health insurance in developed countries faces two critical problems: the rapidly rising cost of health care and the growing number of people who are not insured. These problems generate growing political support for broad-based reforms to address them.
An analysis of this problem will allow us to assess the risks to human health, namely the projected cost of treatment, people's quality of life, and their level of well-being. This applies to the cost of insurance in the lives of individuals. In addition, the results of the analysis will provide an assessment of the payment risks of insurance companies. The forecast results therefore become important: they must be accurate enough to quantify the amount covered by a particular policy and the insurance costs to be paid for it. These indicators are evaluated from different variables, each of which is important. Thus, insurance is a process that reduces or eliminates the cost of losses caused by various risks and factors.
If an indicator is omitted when calculating the amounts, the insurance policy as a whole changes. It is therefore critical that these tasks be performed with high accuracy, because human mistakes can happen. ML can reduce the effort involved in insurance policymaking: once a model has been trained on insurance data, new data can be supplied as its input, and the model can then correctly predict the cost of the insurance policy. This reduces human effort and resources and improves the insurance company's profitability. Thus, accuracy can be improved with the three proposed ensemble models, which will optimize the forecasting process.
Medical insurance is an essential part of the medical domain. However, medical costs are difficult to predict, since most of the money is spent on rare conditions of the patients. Different machine learning algorithms and deep learning techniques are used for prediction, and two parameters, training time and accuracy, are analyzed. The training time of most machine learning algorithms is moderate, but the accuracy of their predictions is not very high. Deep learning models can also find hidden patterns, but their training time does not allow these models to be used in real time [5].
That is why the paper aims to develop new ensembles for individual insurance cost prediction that provide high prediction accuracy. The novelty of the paper is a new stacking ensemble scheme based on weak predictor selection and hyperparameter choice. The contributions are as follows: (1) two feature selection techniques were applied to compare the prediction accuracy of different machine learning algorithms, and the weak components for designing the ensemble models were found; (2) three different ensemble models based on the boosting, bagging, and stacking approaches were designed for the medical insurance cost prediction task; (3) it is experimentally established that the new stacking model, which is based on machine learning algorithms and uses Random Forest as a meta-algorithm, provides higher prediction accuracy for the stated task.
The paper is organized as follows. The literature review and the methods and models for medical insurance cost prediction are given in Section 2. Section 3 presents the dataset description and the exploratory data analysis; weak predictors are selected, and then the novel approach based on three ensembles of the weak predictors is developed. Section 4 presents the results of the developed ensembles and a comparison with other predictors. The conclusion (Section 5) underlines the novelty of the proposed approach and the prospects for further research.

Literature Review
Methods and systems for medical data analysis are given in [5][6][7][8][9]. The use of artificial intelligence in financial risk management will help save time and money and protect the health of patients. Machine learning is associated with many expectations, but its quality is determined by choosing a good algorithm and the proper steps to plan, develop, and implement the model. The main drawback of RBF networks for solving this task is that they provide only a local approximation of the nonlinear response surface.
The unique features of data mining with medical data are described in [10]. Artificial intelligence works effectively in the initial stages of risk assessment, from collecting and analyzing information to developing control algorithms.
Medical data has a multilevel structure with hidden dependencies [11]. It is very important to find patterns and to use various methods of analysis together. That is why different ensembles of machine learning (ML) models are used for medical data analysis. A model for nested data is developed in [11]; however, this limits its use for non-nested datasets.
In paper [12], an ensemble of random forests (RF) and a support vector machine (SVM) is used to predict the modulus of elasticity of recycled aggregate concrete. This classical ensemble increases accuracy, which is why it is worth taking into account.
The ML models and their performance in different application domains are analyzed in [13]. The comparison shows that the quality of the proposed algorithms differs.
Paper [14] addresses the credit risk problem of SMEs by forecasting with a random subspace and MultiBoosting ensemble. This approach combines more than one type of ensemble.
Paper [15] presents a new framework incorporating 7 supervised ML algorithms to exploit multiple variant callers' strengths, using a non-redundant set of biological and sequence features.
An ensemble of K-Nearest Neighbour (KNN) classifiers for recommendation to leverage the heterogeneity of different groups of meta-features is analyzed in [16].
In work [17], an ensemble-based machine learning model comprising RF, ID3, AdaBoost, KNN, and Logistic Regression was tested on a diabetic retinopathy dataset. This approach can be used only for classification tasks.
Paper [18] presents a solution to the problem of forecasting the direction of stock price movement. A tree-based ensemble consisting of Random Forest, XGBoost, Bagging Classifier, AdaBoost, and Extra Trees, combined with a Voting Classifier, is developed. A Bagging Ensemble Classifier is used for diabetic retinopathy in [19].
The analysis of the mentioned papers shows the effectiveness of ensembles in comparison with single ML-based methods. Besides, specific models can be used for regression tasks too. For example, paper [20] presents a non-iterative model using the Wiener polynomial and a linear SGTM neural-like structure. The Wiener polynomial provides a nonlinear input extension, and its approximation properties give highly accurate results. The polynomial coefficients are sought using the SGTM ANN, which offers high speed. In general, this method shows a significant improvement in solving the medical costs prediction task.
However, high polynomial degrees significantly increase the training time of this model. In addition, the method's accuracy is not satisfactory for practical implementation in insurance companies [21].
Ensembles for text analysis are developed in [22][23][24]. Clustering and randomized search are combined into an ensemble for text sentiment classification. The experimental analysis of classification tasks also includes software defect prediction, credit risk modeling, spam filtering, and semantic mapping. However, the mentioned methods are used for solving classification tasks.
That is why it is necessary to develop new, or improve existing, individual insurance cost prediction methods and tools that provide high prediction accuracy with sufficient training speed. The authors propose to develop three different ensemble models based on boosting, bagging, and stacking and to compare their prediction accuracy with that of well-known machine learning algorithms.

The Experimental Setup
The experimental setup is organized as follows:
• Exploratory data analysis (missing data imputation and feature selection);
• Weak predictor selection;
• Hyperparameter selection based on grid search;
• Ensemble development.

Dataset Description and Exploratory Data Analysis
The medical insurance payments dataset [29] was selected. It consists of 7 attributes and 1338 vectors. The task is to predict individual payments for health insurance. The data preprocessing stage is described in [20]. For this dataset, preprocessing consists of the following stages:
• Missing data imputation;
• Data transformation.
In the missing data imputation stage, the MICE algorithm [30] is used. In total, 13 instances had missing data. In the data transformation stage, one-hot encoding is used for the binary (sex, smoker) and categorical (region) variables.
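As a minimal illustration of the data transformation stage, one-hot encoding of a single column can be sketched as follows. The helper name `one_hot` and the sample values are assumptions made for illustration; they are not taken from the authors' software.

```python
def one_hot(values):
    """Return (categories, rows): one indicator column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical sample of the "smoker" column (illustrative values only).
smoker = ["yes", "no", "no", "yes"]
columns, encoded = one_hot(smoker)
print(columns)  # ['no', 'yes']
print(encoded)  # [[0, 1], [1, 0], [1, 0], [0, 1]]
```

A binary variable thus produces two indicator columns, which is consistent with the smoker attribute being represented by two features (X6 and X7).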
Attribute Y (the variable charges) is the target variable. Its statistics are given below (Tab. 1). The next step is feature selection, for which the Pearson coefficient is used (Fig. 1). There is no significant correlation between the features. However, the smoker attributes (X6 and X7) are correlated with the target variable Y. For non-smoker patients (X7), the correlation between bmi (X4) and charges (Y) is not clear.

Figure 1: Correlation matrix
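The Pearson coefficient behind each cell of the correlation matrix can be computed as in the sketch below. The function name and the numeric values are illustrative assumptions, not data from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Made-up values: a smoker indicator and the corresponding charges.
smoker_flag = [1, 0, 0, 1, 0, 1]
charges = [30000.0, 4000.0, 6000.0, 35000.0, 5000.0, 28000.0]
r = pearson(smoker_flag, charges)
print(round(r, 3))
```

A value close to 1 or -1 indicates a strong linear relationship with the target, while values near 0 indicate that no linear relationship is visible.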
Next, the Boruta algorithm was used to select significant variables. Boruta is a heuristic algorithm for selecting significant features based on Random Forest. The essence of the algorithm is that, at each iteration, features are removed whose Z-measure is less than the maximum Z-measure among the added features. To get the Z-measure of a feature [31], the importance of the feature, obtained using the algorithm built into Random Forest, is calculated and divided by the standard deviation of the feature importance. The result of the selection is given in Tab. 2. Thus, age (X1), bmi (X4), smoker (X6, X7), children (X5), and region Northeast (X9) are the most important features.
X6 and X7 are selected by both methods.
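The Z-measure comparison described above can be sketched for a single step as follows. This is a simplified illustration only: the importance values are hypothetical, the feature names `shadow_1` and `shadow_2` stand for the added features, and the full Boruta procedure repeats this comparison over many iterations with statistical testing.

```python
import statistics


def z_measure(importances):
    """Mean importance over repeated runs divided by its standard deviation."""
    return statistics.mean(importances) / statistics.stdev(importances)


# Hypothetical importances from three Random Forest runs.
real = {"smoker": [0.60, 0.62, 0.58], "region": [0.02, 0.01, 0.03]}
added = {"shadow_1": [0.03, 0.05, 0.04], "shadow_2": [0.02, 0.04, 0.03]}

# A real feature is kept only if it reaches the best added feature's Z-measure.
threshold = max(z_measure(values) for values in added.values())
kept = [name for name, values in real.items() if z_measure(values) >= threshold]
print(kept)  # ['smoker']
```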

Weak Predictors Selection
For model development, the dataset is split into a training dataset and a testing dataset. Following a common rule of thumb, a split ratio of 75% is used: 75% of the data for training and 25% for testing.
Two prediction models will be built: one for the whole dataset and one for the selected features. To create the ensemble, the weak predictors must be selected.
First, linear regression is built for the whole dataset. The regression coefficients and model parameters are given in Tab. 3. To sum up, there is no significant difference in R-squared values between the whole dataset and the selected features. That is why the whole dataset will be used for developing the other predictors.
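A 75%/25% split can be sketched as below. The function name, the fixed seed, and the use of row indices instead of real records are assumptions made for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(1338))  # stand-in for the 1338 dataset vectors
train, test = train_test_split(rows)
print(len(train), len(test))  # 1003 335
```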
In the next step, a regression tree is built, and 10-fold cross-validation repeated 3 times is performed. The important attributes are X6, X1, and X4. The x-error is the cross-validation error. The xstd, rel-error, and x-error values were used as factors for tree pruning, and the tree's height was used to describe the tree. A high number of levels in the tree can be a sign of better model accuracy. Xstd is the standard error of the x-error. The complexity parameter (CP) controls the size of the regression tree and can also be used to select the optimal tree size. The stopping criterion for tree building compares the cost of adding another variable to the regression tree from the current node with the value of CP: if the former is higher than the latter, tree building stops. CP thus acts as a penalty that limits how far the tree is grown. Nsplit is the number of splits in a single tree. In the next step, well-known ML algorithms are analyzed in order to choose the "weak" predictors.
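The resampling scheme mentioned above (10 folds, repeated 3 times) can be sketched as an index generator. The function name, the seed, and the sample size of 100 are illustrative assumptions.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=3, seed=0):
    """Yield (train_idx, valid_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repetition
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for k in range(n_folds):
            valid = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, valid


splits = list(repeated_kfold(n_samples=100))
print(len(splits))  # 30 train/validation pairs (10 folds x 3 repeats)
```

Each model is then fitted 30 times, and the reported error is the average over all validation folds.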
KNN, Support Vector Regression (SVR) with a Radial Basis Function kernel, a perceptron with 10 neurons in the hidden layer and the hyperbolic tangent (tanh) activation function, and stochastic gradient descent are used to analyze the dataset. The kernel trick enables the SVR to obtain a fit, after which the data is mapped back to the original space. The hyperparameters are chosen by grid search, with the hyperparameter combinations arranged in a grid, and the cost complexity criterion is used for optimization. In the next stage, the optimal parameters for each regressor were chosen. The comparison of the weak predictors is given in Tab. 6.
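The idea of grid search can be sketched with a small KNN regressor. This is an illustration under stated assumptions: the data is a toy example, the grid contains only the hypothetical parameter `k`, and validation RMSE is used as the selection criterion here rather than the cost complexity criterion mentioned above.

```python
import math
from itertools import product


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy training and validation data (illustrative values only).
train_x = [1, 2, 3, 4, 5, 6, 7, 8]
train_y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
valid_x = [2.5, 4.5, 6.5]
valid_y = [5.0, 9.0, 13.0]

grid = {"k": [1, 2, 4]}  # every combination of the listed values is evaluated
results = {}
for (k,) in product(*grid.values()):
    predictions = [knn_predict(train_x, train_y, x, k) for x in valid_x]
    results[k] = rmse(valid_y, predictions)

best_k = min(results, key=results.get)
print(best_k)  # 2
```

With more than one hyperparameter, `product` enumerates every combination in the grid, which is why the cost of grid search grows quickly with the number of tuned parameters.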

Proposed Ensemble Development
There are three time-tested ways to make ensembles: stacking, bagging, and boosting.
• In short, the peculiarity of stacking is that we train several different algorithms and pass their results to the input of the last one, which makes the final decision. The critical point is that the algorithms are different, because training the same algorithm on the same data brings no benefit. Regression is usually used as the final algorithm.
• For bagging, we train one algorithm many times on random samples from the source data. In the end, the results are averaged. The most famous example of bagging is the Random Forest algorithm. The possibility of parallelization gives bagging an advantage over other ensembles.
• A distinctive feature of the boosting ensemble is that we train our algorithms sequentially, with each subsequent one paying special attention to the cases in which the previous algorithm failed. As in bagging, we take samples from the source data, but now the sampling is not entirely random: each new sample includes part of the data on which the previous algorithm worked incorrectly. In effect, each new algorithm learns from the mistakes of the previous one. This ensemble has very high accuracy, which is its advantage over the other ensembles. However, there is also a downside: it is difficult to parallelize. It still works faster than neural networks, but slower than bagging.
All three types of ensembles are developed in the paper.
At the first stage, boosting models based on a regression tree and stochastic gradient descent are built. Boosting is a compositional machine learning meta-algorithm that is mainly used to reduce bias (estimation error) and variance in supervised learning. It is also defined as a family of machine learning algorithms that transform weak learning algorithms into strong ones.
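The sequential nature of boosting can be illustrated with a small sketch in which each new one-split regression tree (stump) is fitted to the errors left by the previous ones. This is plain gradient boosting for squared error on toy data, chosen for brevity; it is not the authors' implementation, and all names and values are assumptions.

```python
def fit_stump(x, residuals):
    """Find the single threshold on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.1):
    base = sum(y) / len(y)  # start from the mean prediction
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Each round focuses on what the current ensemble still gets wrong.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for pi, xi in zip(predictions, x)]
    return base, stumps, predictions


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


x = [18, 25, 32, 40, 47, 55, 62]  # hypothetical ages
y = [1700.0, 2700.0, 4400.0, 6200.0, 8500.0, 10800.0, 13500.0]  # hypothetical charges
base, stumps, fitted = boost(x, y)
print(mse(y, [base] * len(y)), mse(y, fitted))
```

Because every round depends on the predictions of the previous rounds, the loop cannot be run in parallel, which matches the drawback of boosting noted earlier.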
The number of folds (resampling iterations) is equal to 10, and the number of complete sets of folds to compute is equal to 3. Automatic parameter tuning is used as well. Mean absolute error (MAE), root mean squared error (RMSE), and R-squared are used for model evaluation.
The results are given in Fig. 2. We can see that the boosted stochastic gradient descent produces the more precise model, with an RMSE equal to 44487.912.
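The three evaluation metrics can be computed as in the following sketch. The function names and the sample values are illustrative assumptions.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

MAE and RMSE are expressed in the units of the target (here, charges), so lower values are better, whereas R-squared is unitless and higher values are better.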

Figure 2: Boosting results
In the next step, a new bagging machine learning algorithm is developed. Bagging involves training the same algorithm many times on different subsets sampled from the training dataset. The final output forecast is then averaged across the estimates of all the sub-models.
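The bagging procedure described above can be sketched with one-split regression trees (stumps) trained on bootstrap samples. The base model, the number of models, and the data are assumptions chosen to keep the example short; they are not the Bagged CART or Random Forest configurations used in the paper.

```python
import random


def fit_stump(x, y):
    """Fit a one-split regression tree on a 1-D feature."""
    mean_y = sum(y) / len(y)
    best = (float("inf"), None, mean_y, mean_y)  # fallback: constant prediction
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if err < best[0]:
            best = (err, threshold, lm, rm)
    return best[1:]


def predict_stump(stump, xi):
    threshold, lm, rm = stump
    if threshold is None:
        return lm
    return lm if xi <= threshold else rm


def bagging_fit(x, y, n_models=25, seed=0):
    rng = random.Random(seed)
    n = len(x)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample with replacement
        models.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, xi):
    """Average the predictions of all sub-models."""
    return sum(predict_stump(m, xi) for m in models) / len(models)


x = [18, 25, 32, 40, 47, 55, 62]  # hypothetical ages
y = [1700.0, 2700.0, 4400.0, 6200.0, 8500.0, 10800.0, 13500.0]  # hypothetical charges
models = bagging_fit(x, y)
low, high = bagging_predict(models, 18), bagging_predict(models, 62)
print(low, high)
```

Since the sub-models are independent of each other, they can be trained in parallel, unlike the rounds of a boosting ensemble.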
Bagged CART and Random Forest algorithms are proposed. Both algorithms include parameters that are not tuned (Fig. 3).

Figure 3: Bagging results
We can see that the results are worse than for the boosted stochastic gradient descent. The next step is to combine multiple predictors using stacking. KNN, SVM, a regression tree (rtree), linear regression (lm), and GBM are used for ensemble development.
The final stacking schema is given in Fig. 4. The random forest (RF) algorithm, built with 100 trees, is used to combine the predictions of the predictors. We can see that the stacking model has lowered the RMSE to 3173.213 (Tab. 8).
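The stacking principle can be sketched as follows: base models produce out-of-fold predictions, and a second-level model is trained on those predictions. To stay self-contained, the sketch uses only two base models (a least-squares line and KNN) and a simplified random-forest-like combiner made of 100 bootstrapped stumps, each on a randomly chosen input. This combiner is a stand-in assumption, not a full Random Forest, and the data is hypothetical.

```python
import random


def fit_linear(x, y):
    """Least-squares line; returns a prediction function."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return lambda q: intercept + slope * q


def fit_knn(x, y, k=2):
    def predict(q):
        nearest = sorted(range(len(x)), key=lambda i: abs(x[i] - q))[:k]
        return sum(y[i] for i in nearest) / k
    return predict


def fit_stump(features, y, column):
    """One-split regression tree on one column of the second-level features."""
    values = [row[column] for row in features]
    mean_y = sum(y) / len(y)
    best = (float("inf"), None, mean_y, mean_y)
    for t in sorted(set(values))[:-1]:
        left = [yi for v, yi in zip(values, y) if v <= t]
        right = [yi for v, yi in zip(values, y) if v > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda row: lm if t is None or row[column] <= t else rm


def fit_forest_like(features, y, n_trees=100, seed=0):
    """Simplified stand-in for Random Forest: bootstrapped stumps, averaged."""
    rng = random.Random(seed)
    n, trees = len(y), []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        column = rng.randrange(len(features[0]))
        trees.append(fit_stump([features[i] for i in idx], [y[i] for i in idx], column))
    return lambda row: sum(tree(row) for tree in trees) / len(trees)


def fit_stacking(x, y, base_fitters, n_folds=4):
    n = len(x)
    meta = [[0.0] * len(base_fitters) for _ in range(n)]
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        for j, fitter in enumerate(base_fitters):
            model = fitter([x[i] for i in train], [y[i] for i in train])
            for i in range(fold, n, n_folds):  # out-of-fold predictions only
                meta[i][j] = model(x[i])
    base_models = [fitter(x, y) for fitter in base_fitters]
    combiner = fit_forest_like(meta, y)
    return lambda q: combiner([model(q) for model in base_models])


x = list(range(1, 13))  # hypothetical feature values
y = [1100.0, 1900.0, 3200.0, 3900.0, 5100.0, 6000.0,
     7200.0, 7900.0, 9100.0, 9900.0, 11200.0, 12000.0]
predict = fit_stacking(x, y, [fit_linear, fit_knn])
low, high = predict(1), predict(12)
print(low, high)
```

Using out-of-fold predictions as second-level features keeps the combiner from being trained on predictions that the base models made for their own training points.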

Results
The simulation of the proposed method was carried out using the authors' own software (a console application). The proposed and existing methods are tested on the same hardware: Intel Core 5 Quad E6600 2.4 GHz, 16 GB RAM, HDD WD 2 TB 7200 RPM.
The comparison of the ML models and the proposed ensembles is shown in Fig. 5. The largest errors in solving the stated task were obtained using classical single models (NN, linear regression, SGD). The KNN, rtree, and SVR methods show slightly better results in terms of RMSE. However, the highest accuracy is achieved by the stacking ensemble, developed as a combination of weak predictors.
The differences between the remaining two ensembles and the weak predictors SVR and KNN are not significant. Tuning the boosting parameters does not reduce the RMSE either. So, bagging shows its weakness in generalizing the prediction.

Discussion
Stacking gives the best results, and it is built on the chosen weak predictors. The developed method improves generalization properties.
A model averaging ensemble combines the predictions from multiple trained models. A limitation of this approach is that each model contributes the same amount to the ensemble prediction, regardless of how well the model performed. A modification of this approach, called a weighted average ensemble, weighs the contribution of each ensemble member by the trust in, or expected performance of, the model on a holdout dataset. This allows well-performing models to contribute more and less-well-performing models to contribute less. The weighted average ensemble provides an improvement over the model averaging ensemble.
A further generalization of this approach is to replace the linear weighted sum model (e.g., linear regression) used to combine the predictions of the sub-models with any learning algorithm (here, Random Forest). In the proposed stacking, an algorithm takes the outputs of the sub-models as input and attempts to combine them in the best way to make a better output prediction.
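The difference between plain and weighted averaging can be sketched as below. Taking the weights as the inverse of each model's hold-out RMSE is only one possible choice and is an assumption made here; the predictions and RMSE values are hypothetical.

```python
def weighted_average(predictions, weights):
    """Combine per-model prediction lists using normalized weights."""
    total = sum(weights)
    n_points = len(predictions[0])
    return [sum(w * p[i] for w, p in zip(weights, predictions)) / total
            for i in range(n_points)]


# Hypothetical predictions of two models for two patients.
model_a = [100.0, 200.0]
model_b = [120.0, 180.0]

# Hypothetical hold-out RMSE per model; a lower RMSE gives a higher weight.
holdout_rmse = [10.0, 30.0]
weights = [1 / e for e in holdout_rmse]

plain = weighted_average([model_a, model_b], [1, 1])
weighted = weighted_average([model_a, model_b], weights)
print(plain)     # equal contribution from both models
print(weighted)  # pulled towards model_a, which has the lower hold-out RMSE
```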
The simulation of the developed method for solving the medical insurance costs prediction task showed a significant increase in accuracy compared with existing approaches (regression tree, multilayer perceptron, K Nearest Neighbor, Support Vector Machine, Stochastic Gradient Descent, linear regression, etc.). In terms of RMSE, the developed ensemble is 1.47 times better than the best weak predictor (SVR).
The results are presented for the whole dataset. The usage of the well-known ML methods and that of the proposed ensembles do not differ significantly.
The duration of the training procedure plays an essential role in implementing computational intelligence methods for practical tasks of processing large data arrays. That is why a comparison of the training procedure duration for all considered methods is given too.
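The factor of 1.47 follows from the two RMSE values reported in the paper: 4665.074 for the best weak predictor (SVR) and 3173.213 for the stacking ensemble. The short check below reproduces the arithmetic; the variable names are illustrative.

```python
rmse_svr = 4665.074       # best weak predictor (SVR), as reported in the paper
rmse_stacking = 3173.213  # stacking ensemble, as reported in the paper

ratio = rmse_svr / rmse_stacking          # how many times lower the error is
reduction = 1 - rmse_stacking / rmse_svr  # the same gain as a relative reduction

print(round(ratio, 2))  # 1.47
```

Expressed as a relative reduction, the same improvement corresponds to an RMSE that is roughly one third lower.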

Conclusion
The paper describes three new ensembles of supervised learning predictors for managing medical insurance costs. An open dataset is used to develop the data analysis methods. Several weak predictors are implemented on this dataset.
As shown, adding a new predictor can improve the predictive accuracy, because the base predictors' outputs are features for the final predictor. In this case, these 'second level' features are likely to be correlated, because all base predictors are trying to predict the same thing, but they do it suboptimally. The hope is that they behave in different ways, so that the final predictor can combine the noisy predictions into a better final prediction. Loosely, then, adding new base predictors has the best chance of helping when they do a good job and behave differently from the existing base predictors, but this is not guaranteed. If the new predictors perform at chance, they cannot help and will probably hurt. The final predictor can overfit, and providing it with more base predictors may increase its ability to do so.
Seven weak predictors were analyzed with tuned hyperparameters. The best weak predictor is SVR, with an RMSE equal to 4665.074.
Four ensembles were developed in the paper, two of which are boosted ensembles. The boosting and stacking ensembles show better accuracy than bagging. The worst accuracy is shown by the bagged Random Forest, with an RMSE equal to 4651.663. The stacking ensemble is developed using K Nearest Neighbors (KNN), Support Vector Machine (SVM), Regression Tree, Linear Regression, and Stochastic Gradient Boosting. The random forest (RF) algorithm, built with one hundred trees, is used to combine the predictions. The Root Mean Square Error (RMSE) is brought down to 3173.213 for the training dataset and to 3185.423 for the testing dataset. A comparison with existing machine learning algorithms is given. The highest accuracy is achieved by the stacking ensemble, developed as a combination of weak predictors. In terms of RMSE, the developed ensemble is 1.47 times better than the best weak predictor (SVR).
The limitations of the study are the following:
• The time complexity allows the proposed ensemble to be used in real time only in distributed mode;
• The quality of the ensemble depends on the dataset: for an imbalanced dataset, the prediction accuracy will be lower;
• The modeling of charged cases should be carried out together with clustering analysis. The authors plan to model each separate cluster and compare the prediction accuracy.
We will also conduct future research on designing cascades based on existing machine learning algorithms or ANN. This approach will make it possible to linearize the response surface, which will significantly affect the overall accuracy of the regressor.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.