An Intelligent Forecasting Model for Disease Prediction Using Stack Ensembling Approach

Abstract: This research work proposes a new stack-based generalization ensemble model to forecast the number of incidences of conjunctivitis disease. In addition to forecasting conjunctivitis incidences, the proposed model improves performance through ensembling. The weekly rate of acute conjunctivitis per 1000 for Hong Kong is collected for the period from the first week of January 2010 to the last week of December 2019. Pre-processing techniques such as imputation of missing values and logarithmic transformation are applied to prepare the dataset. A stacked generalization ensemble model based on Auto-ARIMA (Autoregressive Integrated Moving Average), NNAR (Neural Network Autoregression), ETS (Exponential Smoothing), and HW (Holt-Winters) is proposed and applied to the dataset. Predictive analysis is conducted on the collected conjunctivitis dataset and compared across different performance measures. The results show that the RMSE (Root Mean Square Error), MAE (Mean Absolute Error), MAPE (Mean Absolute Percentage Error), and ACF1 (Autocorrelation Function at lag 1) of the proposed ensemble decrease significantly. Considering the RMSE, for instance, error values are reduced by 39.23%, 9.13%, 20.42%, and 17.13% in comparison to the Auto-ARIMA, NNAR, ETS, and HW models respectively. This research concludes that the accuracy of disease forecasting can be significantly increased by applying the proposed stacked generalization ensemble model, as it minimizes the prediction error and hence provides better prediction trends than the Auto-ARIMA, NNAR, ETS, and HW models applied discretely.


Introduction
The research community has been drawn to clinical databases for potential study and accurate forecasting, which allows people to take appropriate precautions to prevent future diseases. Time series forecasting techniques are frequently used to design forecasting systems for disease prediction from collections of clinical datasets. These techniques discover patterns and trends in the time series data and use them, in conjunction with the current year's patterns, to estimate future occurrences [1]. A time series can be defined as a series of measurements over a selected time span; this span may be weekly, monthly, quarterly, annual, etc. [2]. A time series of t real-valued observations is written as Z_1, ..., Z_t, where Z_i (1 ≤ i ≤ t) is the value recorded at time i [3]. In addition to finding meaningful patterns in the data, time series forecasting techniques offer several advantages such as reliability, the ability to find seasonal patterns, and trend estimation. On the other hand, these techniques suffer from the drawback of a high generalization error of prediction. However, combining different forecasting models into an ensemble model can reduce the generalization error and enhance accuracy. Ensemble modeling is a metaheuristic way of combining different machine learning techniques to form a final forecast model that reduces variance and enhances prediction accuracy [4]. An ensemble converts multiple weak learners into a single strong learner [5]. An ensemble model is primarily applied because of its capability to produce accurate results in different applications such as classification or regression problems [6]. The following are the two main ways to perform ensembling over different models:

Sequential Ensemble Method
In this technique, the base learners are combined sequentially: the values obtained from the previous model are used by the next model (e.g., AdaBoost), so each successive model corrects the errors of the previous one. The working of a sequential ensemble model is illustrated in Fig. 1 [7].

Parallel Ensemble Method
The base learners are produced in parallel, i.e., side by side (e.g., Random Forest); the training data are provided to each model in parallel, and all model results are then combined simultaneously. The working of a parallel ensemble model is shown in Fig. 2 [8].
One of the most widely used parallel ensemble models is stacking, where different classification or regression models are combined by a meta model [9]. It is essentially a two-tier ensemble: the base-level (level-0) models are trained on the entire training set, and the meta (level-1) model is trained on the outcomes of the base models [9]. This is depicted in Fig. 3.
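The two-tier data flow can be sketched with toy learners. This is only an illustration of the stacking idea, not the paper's R pipeline; all names here (mean_learner, last_value_learner, weighted_meta) are hypothetical, and the level-1 step is a simple weighted average standing in for a trained meta model.

```python
# Level 0: simple base learners, each producing a one-step-ahead prediction.
def mean_learner(history):
    """Toy base learner: predicts the running mean of the series."""
    return sum(history) / len(history)

def last_value_learner(history):
    """Toy base learner: predicts the last observed value (naive forecast)."""
    return history[-1]

def weighted_meta(preds, weights=(0.5, 0.5)):
    """Toy level-1 meta model: a fixed weighted average of base predictions."""
    return sum(w * p for w, p in zip(weights, preds))

def stack_predict(history, base_learners, meta_combine):
    """Two-tier stacking: level-0 predictions are fed to a level-1 combiner."""
    level0 = [learner(history) for learner in base_learners]
    return meta_combine(level0)

series = [10.0, 12.0, 11.0, 13.0]
forecast = stack_predict(series, [mean_learner, last_value_learner], weighted_meta)
# mean = 11.5, last value = 13.0, so the combined forecast is 12.25
```

In a real stacked ensemble the meta weights would themselves be learned from the base models' out-of-sample predictions rather than fixed.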
The manuscript proposes a stacked generalization ensemble of time series forecasting techniques for predicting the number of incidences of conjunctivitis disease. This research work proposes a meta-learning approach, i.e., stacking, for robustly combining time series forecasting techniques and specializing them across the time series. The proposed model is applied to the conjunctivitis disease dataset, and empirical results demonstrate the competitiveness of our model in contrast with the independent approaches to time series forecasting. Conjunctivitis is inflammation of the conjunctiva, the thin and transparent tissue layer that lines the inside of the eyelid and covers the eye's outer surface (the white part, or sclera) [10]. Each year, approximately 3 million cases occur in the United States. Because of the inflammation, the blood vessels in the conjunctiva become more visible, which causes a reddish or pink appearance in the eye. It is mainly caused by viruses, bacteria (such as Haemophilus influenzae and Streptococcus pneumoniae), allergic or immunological reactions, or medicines. The symptoms of conjunctivitis are itching in the eye, blurred vision, swelling of the conjunctiva, a gritty feeling in the eye, pain, a burning sensation in the eye, tearing, and discharge that forms a crust during sleep, which can make the eyes stick shut in the morning [11]. Conjunctivitis comes in many different forms, such as infective conjunctivitis, allergic conjunctivitis, and irritant conjunctivitis [12].
Conjunctivitis is one of Hong Kong's most common ailments. Hong Kong's government and its Department of Health are carrying out many operations to reduce the possibility of future conjunctivitis disease. Yet many cases of conjunctivitis are still registered in Hong Kong every week, even after the government's course of action. Hence, advance prediction of future conjunctivitis cases can help the government take pre-emptive action to curb the disease, and time series forecasting techniques can be used to predict such future events.
This manuscript aims to provide an ensemble model for evaluating and finding the most suitable method for estimating future instances of conjunctivitis disease. The conjunctivitis case dataset for the past few years is collected for analysis and forecasting; initially, different time series forecasting models are applied to the data to predict future cases of conjunctivitis, and then a novel ensemble model is created with the stacked generalization technique. The research hypothesis is that a robust model based on diverse learners can capture all the details of the time series data and produce accurate results. The base time series forecasting models used to create the ensemble are ETS, NNAR, Auto-ARIMA, and Holt-Winters, which are defined in the methodology section.
In addition, each predictive model delivers different predictive outcomes depending on the dataset used, so the quality of the candidate models is estimated with various error metrics. The error metrics used in this manuscript are RMSE, MAE, MAPE, and ACF1; details are provided in the methodology section [13]. Fig. 4 shows the proposed ensemble model for conjunctivitis disease prediction. The proposed model is a stack ensemble in which three models serve as base models and one model serves as the meta model [14]. The base models are Auto-ARIMA, NNAR, and ETS, and the meta model is the Holt-Winters model. The procedure is as follows:

Step 1: Divide the historical conjunctivitis data into a training set X and a test set Y.

Step 2: Train each level-0 base model (Auto-ARIMA, NNAR, and ETS) on the training set.

Step 3: Compute the fitted values of Auto-ARIMA, NNAR, and ETS:

X̂_1 = f_1(X) + ε_ta,  X̂_2 = f_2(X) + ε_tb,  X̂_3 = f_3(X) + ε_tc

where X is the training data; ε_ta, ε_tb, and ε_tc are the errors generated by each model at time t; and X̂_1, X̂_2, and X̂_3 are the fitted values produced by the model functions f_1(X), f_2(X), and f_3(X) of Auto-ARIMA, NNAR, and ETS respectively.

Step 4: The fitted values from Step 3 are passed to the stack generalizer, which calculates their mean X̄:

X̄ = (X̂_1 + X̂_2 + X̂_3) / 3

Step 5: The mean of the fitted values calculated in Step 4 by the stack generalizer is given to the level-1 meta model (the Holt-Winters model) as its training set.

Step 6: Forecasting is then performed with the trained Holt-Winters model, which can be represented as:

X̂ = hw(X̄)

where X̂ is the value forecast from the training data X̄, with hw() as the Holt-Winters model function.
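The data flow of the six steps can be sketched end to end. The paper fits Auto-ARIMA, NNAR, and ETS in R and uses Holt-Winters as the meta model; in this hedged sketch, three naive forecasters stand in for the base models and a simple exponential smoother stands in for the meta model, so only the split → fit → average → meta-train → forecast pipeline is illustrated, not the actual models.

```python
def fitted_naive(x):
    """Stand-in f1: one-step-ahead naive fit (previous value persists)."""
    return [x[0]] + x[:-1]

def fitted_mean(x):
    """Stand-in f2: expanding-mean fit."""
    return [sum(x[:i + 1]) / (i + 1) for i in range(len(x))]

def fitted_drift(x):
    """Stand-in f3: previous value plus the average step so far."""
    out = [x[0]]
    for i in range(1, len(x)):
        step = (x[i - 1] - x[0]) / max(i - 1, 1)
        out.append(x[i - 1] + step)
    return out

def stack_ensemble(series, split=0.88, alpha=0.3):
    n_train = int(len(series) * split)              # Step 1: train/test split
    X = series[:n_train]
    fits = [fitted_naive(X), fitted_mean(X), fitted_drift(X)]  # Steps 2-3
    X_bar = [sum(col) / 3 for col in zip(*fits)]    # Step 4: mean of fits
    level = X_bar[0]                                # Steps 5-6: meta smoother
    for v in X_bar[1:]:                             # stands in for hw()
        level = alpha * v + (1 - alpha) * level
    return level                                    # one-step-ahead forecast

series = [5.0, 6.0, 5.5, 7.0, 6.5, 7.5, 8.0, 7.0, 8.5, 9.0]
forecast = stack_ensemble(series)
```

The 0.88 split fraction mirrors the 88%/12% division used later in the paper; the forecast stays within the range of the averaged fitted values by construction.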
The different models used in the ensemble are detailed below:

Neural Network Auto Regression (NNAR/NAR)
A model based on the design and structure of the brain is known as an artificial neural network. It is a flexible model with the potential to recognize nonlinear features and time series-based patterns and thus handle varied nonlinear relationships between a dependent variable and its independent variables. The defining equation of the NNAR model gives the target value of a neuron as shown in Eq. (7) [15]:

Z = fun(b + Σ_j w_j x_j)

where fun() is the activation function of the NAR, b is the bias of the neuron, x_j is the input variable, w_j is the weight of the neuron, and Z is the outcome of the model. The equation for the predicted value is given as Eq. (8) [15]:

ŷ(t) = f(y(t − 1), ..., y(t − p)) + ε_t

where ŷ is the predicted value of y, f is a nonlinear function, y(t − 1), ..., y(t − p) are the previous values of the series, and ε_t is the vector of random errors, i.e., the error between the actual value and the predicted value.
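The neuron computation of Eq. (7) can be sketched directly; the sigmoid used as fun() here is an assumed choice for illustration, and the inputs, weights, and bias are toy values.

```python
import math

def neuron_output(x, w, b):
    """Eq. (7): Z = fun(b + sum_j w_j * x_j), with a sigmoid as fun()."""
    s = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-s))   # sigmoid activation (assumed choice)

# Toy example: two inputs, two weights, one bias
z = neuron_output(x=[1.0, 2.0], w=[0.5, -0.25], b=0.1)
# weighted sum = 0.1 + 0.5 - 0.5 = 0.1, so z = sigmoid(0.1) ≈ 0.525
```

In an NNAR model the inputs x_j would be the lagged series values y(t − 1), ..., y(t − p).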

Auto ARIMA (Autoregressive Integrated Moving Average)
This model is the combination of an AR (Auto Regression) model, which predicts from past values, and an MA (Moving Average) model [16], which predicts from past random error terms; the I stands for Integration, i.e., the differencing applied to make the series stationary. It can be written as ARIMA (p, d, q)(P, D, Q), where p is the AR order, d is the differencing order, q is the MA order, and P, D, and Q are the corresponding orders of the seasonal component.
Mathematically it can be written as:

φ_p(B) Φ_P(B^S) ∇^d ∇_S^D Z_t = θ_q(B) Θ_Q(B^S) a_t

where B is the backshift operator, S is the duration of the seasonal interval, φ is the AR parameter of order p, Φ is the seasonal AR parameter of order P, Z_t is the value observed at time t, θ is the MA operator of order q, Θ is the seasonal MA parameter of order Q, ∇^d is the differencing operator, ∇_S^D is the seasonal differencing operator, and a_t is the noise component [16].
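The differencing operators ∇ and ∇_S reduce to simple subtractions, which a minimal sketch on toy data makes concrete: a first difference removes a linear trend, and a seasonal difference at lag S removes a repeating pattern of period S.

```python
def difference(z, lag=1):
    """Apply the differencing operator (1 - B^lag): z_t - z_{t-lag}."""
    return [z[i] - z[i - lag] for i in range(lag, len(z))]

# A pure linear trend becomes constant after one ordinary difference
trend = [2.0 * t for t in range(6)]           # 0, 2, 4, 6, 8, 10
d1 = difference(trend)                        # constant 2.0 everywhere

# A pure period-2 seasonal pattern vanishes after one seasonal difference
seasonal = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]
d_seasonal = difference(seasonal, lag=2)      # all zeros
```

Auto-ARIMA chooses the orders d and D (how many such differences to take) automatically, typically via unit-root tests.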

Exponential Smoothing Model (ETS)
Linear ETS models are special cases of ARIMA models. The latest observations are given exponentially more weight than older observations. ETS provides a larger model class in which each model is labeled by a triplet (E, T, S), where E stands for the error, T for the trend, and S for the seasonality component, defining the type of error, trend, and seasonality; model selection is performed via the AIC (Akaike Information Criterion). All ETS models are nonstationary. Seasonal adjustment can be performed with STLM via STL (Cleveland-style loess decomposition).
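The exponential weighting of recent observations can be sketched with the simplest member of the family, simple exponential smoothing. This is illustrative only: R's ets(), as used in the paper, estimates the smoothing parameter and initial level by maximizing a likelihood rather than fixing them as here.

```python
def simple_exponential_smoothing(y, alpha):
    """Recursive level update: l_t = alpha*y_t + (1 - alpha)*l_{t-1}.
    The weight on an observation decays exponentially with its age."""
    level = y[0]                      # initialise with the first observation
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level                      # one-step-ahead forecast

f = simple_exponential_smoothing([10.0, 12.0, 11.0], alpha=0.5)
# levels: 10.0 -> 11.0 -> 11.0, so the forecast is 11.0
```

Adding trend and seasonal recursions to this scheme yields the full (E, T, S) family.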

Holt Winter Model
The Holt-Winters forecasting system extends exponential smoothing with explicit trend and seasonal components. Its additive form can be written as:

L_t = α(Y_t − s_{t−p}) + (1 − α)(L_{t−1} + b_{t−1})
b_t = β(L_t − L_{t−1}) + (1 − β) b_{t−1}
s_t = γ(Y_t − L_t) + (1 − γ) s_{t−p}
Ŷ_{t+h} = L_t + h·b_t + s_{t+h−p}

where Y_t is the observed series; α, β, and γ are the smoothing parameters (0 ≤ α, β, γ ≤ 1); L_t is the smoothed level at time t; b_t is the change in the trend at time t; s_t is the seasonal smooth at time t; p is the number of seasons per year; and h is the number of periods ahead to forecast [17]. The Holt-Winters model uses heuristic values for the initial state and then calculates the smoothing parameters by optimizing the mean squared error (MSE). In contrast, the ETS model optimizes a likelihood function to estimate both the smoothing parameters and the initial states. As a result, for a particular time series Holt-Winters can give improved results [17], although in normal cases ETS is preferred since it optimizes the initial states.
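The additive Holt-Winters recursions can be sketched as below. The heuristic first-season initialisation and the fixed smoothing parameters are illustrative assumptions: in the paper's setting, R estimates α, β, and γ by minimising the MSE.

```python
def holt_winters_additive(y, p, alpha, beta, gamma, h=1):
    """Additive Holt-Winters sketch with heuristic initial states.
    Requires len(y) >= 2*p so the trend can be initialised."""
    level = sum(y[:p]) / p                               # first-season mean
    trend = (sum(y[p:2 * p]) - sum(y[:p])) / (p * p)     # mean season-to-season step
    season = [y[i] - level for i in range(p)]            # initial seasonal indices
    for t in range(p, len(y)):
        last_level = level
        level = alpha * (y[t] - season[t - p]) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        season.append(gamma * (y[t] - level) + (1 - gamma) * season[t - p])
    # h-step-ahead forecast: level + h*trend + matching seasonal index
    return level + h * trend + season[len(y) - p + (h - 1) % p]

# Two cycles of a toy quarterly series with trend and seasonality (p = 4)
y = [10.0, 14.0, 8.0, 12.0, 11.0, 15.0, 9.0, 13.0]
hw_forecast = holt_winters_additive(y, p=4, alpha=0.3, beta=0.1, gamma=0.2)
```

Weekly disease data as in this paper would use p = 52 and far longer initialisation windows.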

Data Collection
The initial step is the collection of data related to conjunctivitis disease cases in Hong Kong. Weekly data on conjunctivitis cases are collected from the Hong Kong government website https://www.chp.gov.hk [18].

Data Preprocessing
The second step is data preprocessing, which deals with cleaning and imputation of invalid or missing values by zero or by some value such as the median or mean [19]. Because the data are in decimal format, they are multiplied by 10 to obtain whole numbers; the weekly conjunctivitis rate, previously per 1000, thus becomes per 10000. To reduce the number of features, PCA and decision trees are applied.
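A minimal sketch of this preprocessing step, assuming median imputation for the missing weeks (the text also mentions zero or mean imputation) and the ×10 rescaling; the input values are hypothetical weekly rates.

```python
def preprocess(weekly_rates):
    """Impute missing weekly rates (None) with the series median, then
    scale by 10 so per-1000 decimal rates become whole per-10000 counts."""
    observed = [v for v in weekly_rates if v is not None]
    s = sorted(observed)
    mid = len(s) // 2
    median = s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2
    imputed = [v if v is not None else median for v in weekly_rates]
    return [round(v * 10) for v in imputed]

# Toy series: one missing week, rates per 1000
cleaned = preprocess([0.8, None, 1.2, 0.9])
# median of observed values is 0.9, so the result is [8, 9, 12, 9] per 10000
```
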

Time Series Decomposition
The third step of the methodology is to convert the conjunctivitis data into the form of a time series. Time series data holds a few important components, as explained below [19]:

Trend
It is also known as non-stationarity. It is mainly a long-term increasing or decreasing inclination in the data. If the data contain a trend, it should be eliminated before modeling. A trend can be linear or nonlinear: a linear trend moves in a single direction, either increasing or decreasing, whereas a nonlinear trend does not follow a straight line and is a mix of increasing and decreasing waves.

Heteroskedasticity
It mainly shows the randomness or irregularity of the data.

Seasonal Component
If the data show the same behavior over a fixed and known time span, the data are called seasonal.

Stationarity
If the mean and variance of the time series are constant over time, the series is known as stationary.

Time Series Analysis
The next step is to analyze the time series, because a time series contains several types of patterns. To understand and analyze the time series, it is important to decompose it into its essential components. The three vital components of a time series are the trend-cycle, the seasonality, and the random or irregular component [19]. Let y_i be a time series with these three basic components. The additive and multiplicative decompositions are then given in Eq. (14) and Eq. (15) respectively:

y_i = S_i + T_i + E_i    (14)
y_i = S_i × T_i × E_i    (15)

where y_i is the time series value at period i, S_i is the seasonal factor, T_i is the trend-cycle, and E_i is the remainder component at period i.
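A classical additive decomposition can be sketched as follows, assuming an odd seasonal period so that the centred moving average is straightforward; the paper's weekly data (p = 52) would need the even-period variant with endpoint weights.

```python
def decompose_additive(y, p):
    """Sketch of y_i = T_i + S_i + E_i for an odd seasonal period p:
    trend via a centred moving average, seasonality via per-season means."""
    n, half = len(y), p // 2
    # Centred moving average of width p estimates the trend-cycle T_i
    trend = [sum(y[i - half:i + half + 1]) / p for i in range(half, n - half)]
    detrended = [y[i + half] - trend[i] for i in range(len(trend))]
    # Seasonal component S_i: average the detrended values of each season
    seasonal, counts = [0.0] * p, [0] * p
    for i, d in enumerate(detrended):
        s = (i + half) % p
        seasonal[s] += d
        counts[s] += 1
    seasonal = [tot / c if c else 0.0 for tot, c in zip(seasonal, counts)]
    return trend, seasonal

# Toy series: linear trend t plus a period-3 seasonal pattern (-1, 0, 1)
y = [t + [-1, 0, 1][t % 3] for t in range(9)]
trend, seasonal = decompose_additive(y, p=3)
# recovers trend values 1..7 and seasonal indices (-1, 0, 1)
```
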

Stationarity Testing
The next step is stationarity testing, which checks whether the time series is stationary or non-stationary; this is performed with the Ljung-Box and Augmented Dickey-Fuller (ADF) tests. If the time series is found to be stationary, a time series forecasting model can be applied directly; otherwise, the nonstationary series must be converted into a stationary one. A time series Z_t (t = 1, ..., n) is stationary when its mean and variance are constant and its autocovariances do not depend on time t [20]. A non-stationary time series can be converted into a stationary one by different processes such as smoothing, transformation, and differencing. The purpose of transforming the variable is to stabilize the variance or the mean. A logarithmic transformation [20] is used here to reduce the variance of the conjunctivitis time series and to make it stationary; it can be described as Eq. (16):

z_t = log(Z_t)    (16)
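A minimal sketch of the logarithmic transformation and its variance-stabilising effect on a toy series whose spread grows with its level; the data values are hypothetical.

```python
import math

def log_transform(y):
    """Eq. (16)-style transform: z_t = log(y_t), for strictly positive data."""
    return [math.log(v) for v in y]

def variance(x):
    """Population variance, used here only to compare spread before/after."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

# Toy series whose fluctuations grow with its level
y = [10.0, 20.0, 100.0, 200.0, 1000.0, 2000.0]
z = log_transform(y)
# variance(z) is far smaller than variance(y): logging compresses large values
```

Formal stationarity would still be checked on the transformed (and, if needed, differenced) series with the Ljung-Box and ADF tests.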

Model Building
The last step involves the application of the proposed model to the time series data. The stacked generalization ensemble model described in the previous section works in two phases. In the first phase, three models, namely Auto-ARIMA, NNAR, and ETS, are applied; their results are averaged and passed to the meta learner. After that, predictions are made, and finally the results are evaluated using the error metrics explained below:

Root Mean Squared Error (RMSE)
RMSE is evaluated as the square root of the average of the squared differences between the predicted and actual values; the formula is defined in Eq. (17) [21]:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )    (17)

Mean Absolute Error (MAE)
MAE is a measure of error given by the mean of the absolute errors, i.e., the average of the forecasting errors without direction, where the forecasting error is the difference between the actual and predicted values, as in Eq. (18):

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|    (18)

Mean Absolute Percentage Error (MAPE)
It measures the magnitude of the error relative to the magnitude of the actual data, expressed as a percentage, and is used to assess the accuracy of forecasts. It is also known as the Mean Absolute Percentage Deviation (MAPD). MAPE is the average of the absolute percentage errors, depicted in Eq. (19) as:

MAPE = (100/n) Σ_{i=1}^{n} |(y_i − ŷ_i) / y_i|    (19)

Auto Correlation Function (ACF) Error
It is also a means of assessing accuracy: ACF1 depicts the correlation of the forecast errors with themselves at lag 1, so values near zero indicate that the model has captured the structure of the series.
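The four error metrics can be sketched as plain functions; here ACF1 is computed as the lag-1 autocorrelation of the forecast errors, which is the convention commonly used by forecasting toolkits. The actual/predicted values are toy data.

```python
import math

def rmse(actual, pred):
    """Eq. (17): root of the mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def mae(actual, pred):
    """Eq. (18): mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    """Eq. (19): mean absolute percentage error."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def acf1(errors):
    """Lag-1 autocorrelation of the forecast errors."""
    m = sum(errors) / len(errors)
    num = sum((errors[i] - m) * (errors[i - 1] - m) for i in range(1, len(errors)))
    den = sum((e - m) ** 2 for e in errors)
    return num / den

actual = [10.0, 12.0, 8.0, 11.0]
pred = [9.0, 13.0, 8.0, 10.0]
errors = [a - p for a, p in zip(actual, pred)]   # 1, -1, 0, 1
# mae = 0.75, rmse = sqrt(0.75) ≈ 0.866, and acf1 lies in [-1, 1]
```
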

Result and Discussion
In this manuscript, the statistical tool R is employed for conjunctivitis disease forecasting. Conjunctivitis data are taken from the Hong Kong website of "The Centre for Health Protection, Department of Health" (https://www.chp.gov.hk). The collected information gives the weekly rate per 1000 of acute conjunctivitis at GOPC (General Out-patient Clinics) and PMP (Private Medical Practitioner) clinics over 10 years, i.e., from the first week of January 2010 to the last week of December 2019. In this data, the sum of the GOPC and PMP rates per 1000 is taken as a univariate variable. Preprocessing, i.e., cleaning and imputation, is then applied to the Hong Kong conjunctivitis data; because the data are in decimal format, they are multiplied by 10 to obtain whole numbers, so the weekly conjunctivitis rate, previously per 1000, becomes per 10000. Further, the data are divided into two parts, training and testing, in the fractions of 88% and 12% respectively: conjunctivitis data from the first week of 2010 to the last week of 2017 are taken as the training dataset and the remaining data as the testing dataset. The data are then converted into the time series objects ts_conjunctivitis_data, train_ts, and test_ts for the total conjunctivitis data, the training data, and the test data respectively, and the time series is plotted. Fig. 6 shows the time series plot of the conjunctivitis data. As the graph makes evident, the plotted time series has elements of trend and seasonality, so we can conclude that the series is nonstationary, which necessitates converting it into a stationary one. The conversion of the non-stationary series into a stationary one is done by logging the series using the log() function. Fig. 8 shows the time series plot of the training data after the log, i.e., the plot of log(train_ts).
The forecasted and fitted graphs of the autoregressive neural network on the actual training data, shown in Fig. 10, indicate that the forecast does not follow a trend similar to the actual test dataset. This NNAR result is obtained with neural network hyperparameter tuning; the tuned settings are set.seed(1234), with the parameter P of the nnetar() function set to zero. Fig. 11 shows the fitted and predicted graphs of ETS (Exponential Smoothing) with a seasonality factor on the actual training data; the predicted graph tries to follow a trend similar to the test data, but there is still too much divergence in the results.
The Holt-Winters predicted graph on the actual training data is shown in Fig. 12; here the predicted graph looks more promising, and the results show that it is better than ETS and Auto-ARIMA. The fitted and predicted graphs of the proposed stack ensemble model are shown in Figs. 13 and 14. The training data used here is the mean of the fitted values of the three base models, namely NNAR, ETS, and Auto-ARIMA.
From the graphs depicted in Figs. 13 and 14, it can be seen that the proposed stack ensemble model's predicted graph approximately follows the same trend as the test dataset, so the predicted data are much closer to the actual number of conjunctivitis cases for the period January 2018 to December 2019. The different error metrics of the ensemble model also decrease in comparison to the standard models. Tab. 1 depicts the error values obtained after applying the proposed ensemble model.

Conclusion
The main purpose of this research work is to present a novel forecasting model for conjunctivitis disease prediction. In this manuscript, the available forecasting models are first applied to historical conjunctivitis data from 2010 to 2019; a novel stack ensemble model is then designed from a combination of these models, in which three models are used as base models and one model is used as the meta model of the stack ensemble. The fitted values of all three base models are given as training data to the meta model, which then makes the prediction. Finally, the model is selected based on a comparison of the depicted trends and the error values of each model.
On this comparison, it can be safely concluded that the proposed novel stack ensemble has a better prediction trend, and its errors, such as RMSE, MAE, MAPE, and ACF1, decrease significantly. Considering the RMSE, for instance, it is 0.23717 for the ensemble model, which is 39.23%, 9.12%, 20.48%, and 17.23% less than that of the Auto-ARIMA, Neural Network Autoregression, Exponential Smoothing, and Holt-Winters models respectively. Therefore, the proposed stack ensemble model can be adopted as an optimal model for conjunctivitis disease prediction, with more promising results than the other models. In future, the model can be extended by including other contributing factors such as rain, humidity, and wind.