Forecasting Multi-Step Ahead Monthly Reference Evapotranspiration Using Hybrid Extreme Gradient Boosting with Grey Wolf Optimization Algorithm

Abstract: Knowing agricultural water consumption several months in advance is important for regional water resources management; in particular, forecasting reference evapotranspiration (ET0) for the next few months supports irrigation and reservoir management. Studies on forecasting ET0 multiple months ahead using machine learning models have not been reported yet. Moreover, machine learning models such as XGBoost have multiple parameters that need to be tuned, and traditional tuning methods can get stuck in a local optimum and fail to reach the global optimum. This study investigated the performance of the hybrid extreme gradient boosting (XGBoost) model coupled with the Grey Wolf Optimizer (GWO) algorithm for forecasting multi-step ahead ET0 (1-3 months ahead), compared with three conventional machine learning models, i.e., the standalone XGBoost, multi-layer perceptron (MLP) and M5 model tree (M5) models, in the subtropical zone of China. The results showed that the GWO-XGB model generally performed better than the other three machine learning models in forecasting 1-3 months ahead ET0, followed by the XGB, M5 and MLP models, with very small differences among the three. The GWO-XGB model performed best in autumn, while the MLP model performed slightly better than the other three models in summer. It is thus suggested to apply the MLP model for ET0 forecasting in summer and the GWO-XGB model in the other seasons.


Introduction
Reference evapotranspiration (ET0) as well as the soil and plant canopy characteristics are usually the main input parameters for soil water balance models. Accurate estimation of ET0 plays an important role in water resources management and the determination of crop irrigation requirements, supporting irrigation scheduling, regional water management decisions and drought early warning, in place of traditional empirical models. Karimi et al. [29] evaluated GEP and SVM as well as empirical models for predicting ET0 in a humid region of Korea. They found that GEP was superior to SVM in cross-station application of the developed models. Kiafar et al. [30] compared GEP models and different types of empirical models for predicting ET0 in distant humid and arid regions. They found that the new model outperformed the other three models, i.e., the ANN, M5 and ANFIS models. The GEP model has also been applied in other areas, e.g., an island of Iran [31] and Egypt [32], where appropriate results were achieved.
Kernel-based models are another powerful class of soft computing methods, including the support vector machine (SVM), least squares support vector machine (LSSVM), extreme learning machine (ELM) and the kernel-based nonlinear extension of the Arps decline model (KNEA). Abdullah et al. [33] first introduced ELM as a new method for ET0 prediction. The highest R2 yielded by the ELM model in Iraq was 0.991, while the corresponding value for the ANN model was 0.985. For the same purpose, Gocic et al. [34] evaluated the performance of ELM, as well as three empirical models, for predicting ET0 at the Nis and Belgrade stations of Serbia. Their results confirmed that ELM was a stable and reliable model. Feng et al. [35] compared three soft computing approaches (ELM, GANN and WNN) for predicting ET0 in Sichuan, China. The results showed that the ELM model outperformed the other models under different input combinations. Feng et al. [36] also compared ELM with the GRNN model in the same region using only temperature data as input. They found that the ELM model performed better than GRNN in the local application scenario and the opposite in the cross-station scenario, both being superior to the empirical Hargreaves model. Wen et al. [37] evaluated the performance of SVM in extremely arid regions of China and compared it with ANN and three empirical models; SVM was concluded to be superior to the other models. Since optimal values of the ELM and SVM parameters are not always easy to obtain, evolutionary algorithms have been used to optimize them, and the optimized models outperformed the original ones [38,39].
Decision tree-based models are another type of soft computing method supported by mathematical theory. These methods are based on the idea of the binary tree, incorporate bagging or boosting theory, and have been widely used in many fields. Pal et al. [40] first used the M5 model for ET0 estimation at Davis station, USA. Rahimikhoob [41] assessed the accuracy of ANN and M5 models for ET0 estimation with temperature data, air humidity and extraterrestrial radiation as inputs. The results showed that ANN had slightly better accuracy than the M5 model, and both performed well. On this basis, the accuracy of the former two models using temperature data from satellite images as inputs was further evaluated, and M5 proved superior to the ANN model [42]. In general, the accuracy of mass transfer methods was low. To solve this problem, Shiri [43] used a hybrid of wavelet transform and the RF model to improve accuracy. Wang et al. [44] used data from 24 stations in a karst region of Southwest China to develop generalized models based on RF and GEP. They found the RF model was superior to the GEP model under different input combinations. Similar work has been done in Central Florida, where four soft computing models, i.e., M5, Bagging, RF and SVM, were used [45]. Fan et al. [14] evaluated the performances of four tree-based models (M5, GBDT, XGBoost and RF) as well as ELM and SVM in different climatic regions of China. The results showed that the tree-based models required less computing time while achieving suitable accuracy [46]. However, the above studies mainly focused on ET0 estimation based on known past data, and such methods are difficult to use for future water resources planning. Many scholars are therefore interested in how to estimate future ET0, so as to plan and allocate water resources in advance to deal with potential water crises.
In recent years, outputs of numerical weather prediction have been used to estimate ET0 in Australia [47]. Traore et al. [48] compared four kinds of ANN models for forecasting ET0 in large irrigation areas of Texas, and MLP achieved the highest accuracy. Similar work has been conducted in six cities of China, where daily public weather forecasts were empirically converted into inputs of PM models, yielding RMSE values of 0.65-1.08 mm d−1 and MAE values of 0.63-0.84 mm d−1 for the following 7 days. Zhao et al. [49] attempted to use GCM outputs to forecast ET0 for the next 1-3 months in Australia, but the accuracy of this method needed to be improved. A drawback of this approach is that it requires considerable meteorological knowledge, because the output of a numerical weather forecast usually includes dozens of variables, and selecting effective meteorological factors is complicated and arduous work. On the other hand, using time-series data to forecast possible future values is also an interesting way to solve this problem. Karbasi [50] coupled GPR with wavelet transform technology to forecast 1-30 days ahead ET0 in Zanjan, Iran. The results showed that the new approach was better than the single GPR model. Mehdizadeh [51] compared the accuracies of historical weather data-based and lagged ET0 data-based models built with MARS and GEP in Iran. They found that the lagged ET0 data-based models outperformed the former. Mohammadi et al. [52] developed a coupled model based on SVM and the whale optimization algorithm (WOA) for ET0 prediction at three stations in Iran and found that it was better than the other models.
Floods and droughts have brought severe challenges to sustainable economic development in southern China, often causing losses to the safety and property of millions of people. In terms of timing, floods occur mainly in spring and summer, while droughts occur mainly in late summer and autumn. For example, in June 2019, rainfall-triggered floods affected a total of 201.4 million people in the nine districted cities of Jiangxi, with 231,000 people in need of emergency relocation and living assistance. Crops were affected over an area of 137.2 thousand ha, and 15.6 million ha of crops died; meanwhile, thousands of houses collapsed. The droughts that occurred in autumn and winter affected more than 5 million people, of whom 845,000 needed relief because of the lack of drinking water. Crops were affected over 472.5 million ha, of which 85.6 million ha were lost, resulting in a direct economic loss of 0.7 billion dollars.
To the best knowledge of the authors, studies on forecasting ET0 for the next season have not been reported yet. The XGBoost model has multiple parameters that need to be tuned, and traditional tuning methods can get stuck in a local optimum and fail to reach the global optimum. However, reports on optimizing XGBoost model parameters are also very limited, especially those using grey wolf optimization (GWO) [53], a powerful optimization algorithm that has been used to optimize MLP [54], ANFIS [55], SVM [56], ELM [57] and Grey models [58]. Therefore, the objectives of this study are: (1) to evaluate whether it is feasible to use historical time-series ET0 data to forecast ET0 for the next 1-3 months in a humid region of China with four soft computing models, i.e., MLP, M5, XGB and GWO-XGB; (2) to verify whether the GWO algorithm can improve the accuracy of the standalone XGBoost model for forecasting ET0.

Case Study
In this study, monthly meteorological data were collected from nine weather stations in South China operated by the China Meteorological Administration (CMA) (Fig. 1). This region is characterized by a subtropical monsoon climate [8]. Meteorological data including air temperature, relative humidity, sunshine hours and wind speed were used to calculate ET0 with the FAO No. 56 Penman-Monteith formula. Detailed information on the meteorological data at the nine weather stations can be found in Tab. 1. As Tab. 1 shows, the average ET0 at Station 58238 fluctuated greatly among the training, testing and validation periods; in particular, the value during the validation period was significantly lower than in the other two periods, which may affect the forecasting accuracy of the model. In addition, ET0 in South China underwent drastic changes from 1966 to 2015: ET0 around 1990 was significantly lower than in the 1960s and 1970s, and rebounded after 2000 [59], which may also have some impact on ET0 forecasting.

Extreme Gradient Boosting (XGBoost)
XGBoost, proposed by Chen et al. [60], is a scalable soft computing algorithm for CART-type tree boosting. An XGBoost model consists of multiple decision trees, each of which focuses on the residuals of the previous trees and uses a gradient-based algorithm to construct a new tree that reduces the training residuals. As seen in Fig. 2, previous predictors are refined to decrease the residuals during each iteration. The algorithm can independently determine the type of loss function used for model evaluation. To reduce the risk of overfitting, different types of regularization terms, e.g., L1 and L2, can be selected. The sum of the scores of all trees is used as the predictive value for classification or regression. For a model of m CART trees, the prediction can be expressed as:

$$\hat{y}_i = \sum_{k=1}^{m} f_k(x_i), \quad f_k \in W$$

where m is the number of CART trees; f_k is a function in the functional space W, and W is the space of all CART trees.
The objective function of the model at the t-th iteration is written as follows:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{t} \Omega(f_k)$$

where n is the number of samples, and the prediction $\hat{y}_i^{(t)}$ can be given as:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

The regularization term Ω(f_k) for a CART tree is defined by Chen et al. [60] as:

$$\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

where γ is the complexity cost of each leaf; T is the number of leaves in a tree; λ is a parameter to scale the penalty; and w is the vector of scores on the leaves. Then, the first-order and second-order Taylor expansions of the loss function are taken in XGBoost. Assuming the loss function is MSE, the objective function can be written as:

$$Obj^{(t)} \simeq \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)$$

where g_i and h_i are the first and second derivatives of the MSE loss function, respectively. Writing f_t(x_i) = w_{q(x_i)}, where q(·) is a function that assigns each data point to its corresponding leaf, the loss above is a sum over data samples. Since each data sample corresponds to only one leaf node, the objective can equivalently be expressed as a sum over leaf nodes:

$$Obj^{(t)} \simeq \sum_{j=1}^{T}\left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T$$

where I_j represents all data samples in leaf node j. Obviously, minimizing the objective is equivalent to finding the minimum of a quadratic function in each leaf score w_j. In short, the change in model performance caused by a node split in a CART tree can be evaluated from the change in objective value: if the split improves the objective, it is adopted; otherwise, splitting stops.
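The optimal leaf score and the split-evaluation rule above can be condensed into a few lines. The following is an illustrative sketch, not the implementation used in this study; it assumes the squared-error loss, for which g_i = ŷ_i − y_i and h_i = 1:

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Optimal leaf score w* = -G / (H + lambda) for one leaf,
    where G and H are the sums of g_i and h_i in that leaf."""
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam, gamma):
    """Objective reduction from splitting a leaf into left/right;
    the split is adopted only when the gain is positive."""
    G, H = g.sum(), h.sum()
    Gl, Hl = g[left_mask].sum(), h[left_mask].sum()
    Gr, Hr = G - Gl, H - Hl
    return 0.5 * (Gl**2 / (Hl + lam)
                  + Gr**2 / (Hr + lam)
                  - G**2 / (H + lam)) - gamma

# Squared-error loss: g_i = yhat_i - y_i, h_i = 1
y = np.array([1.0, 1.0, 5.0, 5.0])
yhat = np.zeros(4)                      # initial prediction
g, h = yhat - y, np.ones(4)

# Separating the two small targets from the two large ones is beneficial
mask = np.array([True, True, False, False])
print(split_gain(g, h, mask, lam=1.0, gamma=0.0))   # ≈ 2.933
```

With γ > 0, a split is only kept when its gain exceeds the complexity cost of the extra leaf, which is how the regularization term controls tree growth.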

Gray Wolf Optimization
The grey wolf optimization (GWO) algorithm is a meta-heuristic optimization algorithm proposed by Mirjalili et al. [53], which mimics the social hierarchy and group hunting behavior of grey wolf packs in nature. The grey wolf community has a very strict social hierarchy, in which wolves are usually divided into four levels: α, β, δ and ω. The α wolf is the first level, mainly responsible for overall decision-making; β is the second level, assisting the α wolf in making decisions; the δ wolf is the third level and obeys the decisions of α and β; the lowest-ranking ω wolves must obey all higher-ranking wolves. According to the social hierarchy and hunting process of wolves, a mathematical model can be defined and the optimal solution can be found.
The best solution is defined as α, and the second- and third-best solutions are defined as β and δ, respectively. The remaining candidate solutions are defined as ω. The distance D between a wolf in the pack and the prey is defined as:

$$D = \left|C \cdot X_P(t) - X(t)\right|, \quad C = 2r_1$$

where t is the current iteration number, C is the step-length coefficient, X_P(t) is the prey location, X(t) is the location of a grey wolf, and r_1 is a random number in (0, 1).
To shorten the distance to the prey, individuals in the pack constantly update their positions according to:

$$X(t+1) = X_P(t) - A \cdot D, \quad A = 2a \cdot r_2 - a$$

where A is the convergence factor, a decreases linearly from 2 to 0 with the number of iterations, and r_2 is a random number in (0, 1).
Because α, β and δ rank highest in the pack, they carry the most information about the prey location and lead the pack gradually toward the prey. The algorithm therefore keeps only the three best solutions, discards the others, and updates the positions of the whole pack according to the location information of these three solutions, gradually approaching the global optimum. The update process is defined as follows. The distances from a wolf to the three leaders are:

$$D_\alpha = \left|C_1 \cdot X_\alpha - X\right|, \quad D_\beta = \left|C_2 \cdot X_\beta - X\right|, \quad D_\delta = \left|C_3 \cdot X_\delta - X\right|$$

Based on these distances, α, β and δ each yield a corrected candidate position:

$$X_1 = X_\alpha - A_1 \cdot D_\alpha, \quad X_2 = X_\beta - A_2 \cdot D_\beta, \quad X_3 = X_\delta - A_3 \cdot D_\delta$$

Each remaining individual in the pack then moves to the position given by the joint decision of α, β and δ:

$$X(t+1) = \frac{X_1 + X_2 + X_3}{3}$$

To sum up, the Grey Wolf Optimizer continuously updates the candidate locations in the solution space during the optimization process and finally finds the optimal solution (Fig. 3).
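The position-update rules above translate directly into code. The following is a minimal illustrative sketch of the standard GWO equations on a toy objective, not the exact code used in this study:

```python
import numpy as np

def gwo(f, dim, lb, ub, n_wolves=30, iters=200, seed=0):
    """Minimize f over the box [lb, ub]^dim with the Grey Wolf Optimizer."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, (n_wolves, dim))
    fit = np.array([f(x) for x in X])
    for t in range(iters):
        order = np.argsort(fit)
        # alpha, beta, delta: the three best wolves of this iteration
        leaders = [X[order[k]].copy() for k in range(3)]
        a = 2 - 2 * t / iters                  # decreases linearly from 2 to 0
        for i in range(n_wolves):
            candidates = []
            for leader in leaders:
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2 * a * r1 - a             # convergence factor
                C = 2 * r2                     # step-length coefficient
                D = np.abs(C * leader - X[i])  # distance to the leader
                candidates.append(leader - A * D)
            # joint decision of alpha, beta and delta, clipped to the bounds
            X[i] = np.clip(np.mean(candidates, axis=0), lb, ub)
        fit = np.array([f(x) for x in X])
    best = np.argmin(fit)
    return X[best], fit[best]

# Toy usage: minimize the 2-D sphere function
x_best, f_best = gwo(lambda x: float((x**2).sum()), dim=2, lb=-10, ub=10)
```

When |A| > 1 the wolves are pushed away from the leaders (exploration); as a shrinks, |A| < 1 and the pack converges on the prey (exploitation).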

Multi-Layer Perceptron (MLP)
Neural networks are a family of machine learning methods based on the activation and transmission of information by simulated neurons. Most neural networks have an input layer, one or more hidden layers and an output layer. The most popular architecture is the single-hidden-layer network, in which each layer is connected to the next through a summation and an activation function. In this study, we used a multi-layer perceptron (MLP) neural network with the sigmoid activation function [61].
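As a concrete illustration of this summation-plus-activation structure, a single-hidden-layer forward pass can be sketched as follows (the weights here are hypothetical placeholders, not the trained network of this study):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Single-hidden-layer perceptron: weighted summation followed by
    the sigmoid activation, then a linear output for regression."""
    hidden = sigmoid(W1 @ x + b1)   # hidden layer
    return W2 @ hidden + b2         # output layer

# With zero weights each hidden unit outputs sigmoid(0) = 0.5
x = np.array([0.3, -1.2, 0.7])
W1, b1 = np.zeros((2, 3)), np.zeros(2)
W2, b2 = np.array([1.0, 1.0]), 0.0
print(mlp_forward(x, W1, b1, W2, b2))   # 1.0
```

Training adjusts W1, b1, W2 and b2 (e.g., by backpropagation) to minimize the prediction error on the training set.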

M5 Model Tree (M5)
The M5 model tree is a kind of decision tree that adopts a linear regression function at each leaf node. This technique is very successful in predicting continuous values. It can be implemented by adopting a standard method of transforming a classification problem into a function approximation problem. The model tree represents a piecewise linear function. Like a typical regression equation, it predicts the value of a variable (called a class) from a set of independent variables called attributes. Training data in tabular form can be used directly to construct the tree [62]. For a given data set, a typical linear regression algorithm can only give a single regression equation, but the model tree divides the sample space into axis-parallel rectangular regions and determines a corresponding regression model for each partition.
The structure of the model tree is generated recursively, starting with the entire training sample set. At each level of the tree, the most discriminative attribute is selected as the root node of the subtree, and the samples arriving at this node are divided into several subsets according to the value of the node attribute. The model tree algorithm is thus a global model combined from a series of piecewise linear models. It differs from linear regression in that the input space is divided automatically by the algorithm. It has the advantages of high efficiency, good robustness and effective learning, and can handle input attributes of up to several hundred dimensions.

Model Setup and Parameter Optimization
In this study, the data were first divided into three groups: the first (1966-1995) for training the models, the second (1996-2005) for testing, and the third (2006-2015) for validation and forecasting. The input combination consisted of Tmax, Tmin and ET0 during the previous period. Unlike finding the extremum of a curve, the error of ET0 simulation cannot approach 0 in most cases, which can lead to overfitting. Normally, with other parameters unchanged, the training error of the XGBoost model gradually decreases as the number of trees increases, while the testing error first decreases and then increases. Therefore, we chose the minimum MSE during the testing period as the objective function for model establishment and took the corresponding parameter values as the most suitable ones. Then, the data of the third period (2006-2015) were used to evaluate the forecasting ability of the model. The R 3.4.2 platform with the RSNNS, rpart and xgboost packages was used for model implementation. Three parameters were optimized: the number of trees (nrounds), [50, 1500]; the ratio of the sub-dataset to all training data (subsample), [0.5, 1]; and the minimum sum of instance weight needed in a child (min_child_weight), [1, 15]. The GWO algorithm had only two parameters, the population size and the number of iterations, which were set to 50 and 500, respectively.
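The construction of lagged inputs for k-month-ahead forecasting can be sketched as below. `make_supervised` is a hypothetical helper name, and the exact lag structure used in this study may differ; the sketch only illustrates pairing past (Tmax, Tmin, ET0) with the ET0 value k months later:

```python
import numpy as np

def make_supervised(et0, tmax, tmin, lead=1):
    """Pair the inputs at month t (Tmax, Tmin, ET0) with the target
    ET0 at month t + lead, for lead = 1, 2 or 3 months ahead."""
    X = np.column_stack([tmax[:-lead], tmin[:-lead], et0[:-lead]])
    y = et0[lead:]
    return X, y

# Toy monthly series; a chronological train/test/validation split (as in
# the paper's 1966-1995 / 1996-2005 / 2006-2015 design) would follow.
et0 = np.linspace(1.0, 5.0, 12)
tmax = et0 + 20.0
tmin = et0 + 10.0
X, y = make_supervised(et0, tmax, tmin, lead=3)
print(X.shape, y.shape)   # (9, 3) (9,)
```

Each row of X holds the predictors at month t, and the matching entry of y is the ET0 observed lead months later, so fitting any of the four models to (X, y) yields a lead-month-ahead forecaster.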

Performance Criteria
The efficiency of the four soft computing models, including accuracy and agreement, was evaluated using statistical criteria: RMSE, MAE and NSE. RMSE and MAE are common mean-error indicators that show how close data points are to the best-fit line. The main difference is that RMSE gives points with large errors more weight than points with small errors, while MAE reflects the average error magnitude. Following Nash et al. [63], the NSE is defined as one minus the sum of the squared differences between the observed and estimated data normalized by the variance of the observed data. As noted by Krause [64], the NSE value ranges from −∞ to 1; an NSE close to 1 indicates that the model performs well.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(ET_{0s,i} - ET_{0PM,i}\right)^2}$$

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|ET_{0s,i} - ET_{0PM,i}\right|$$

$$NSE = 1 - \frac{\sum_{i=1}^{n}\left(ET_{0PM,i} - ET_{0s,i}\right)^2}{\sum_{i=1}^{n}\left(ET_{0PM,i} - \overline{ET_{0PM}}\right)^2}$$

where ET0s and ET0PM are the model-estimated ET0 and the ET0 calculated by the FAO-56 PM equation, respectively, and $\overline{ET_{0PM}}$ is the average of the ET0 calculated by the FAO-56 PM equation.
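The three criteria follow directly from their definitions; a minimal sketch (using generic `sim`/`obs` arrays in place of ET0s and ET0PM):

```python
import numpy as np

def rmse(sim, obs):
    """Root mean square error: weights large errors more heavily."""
    return float(np.sqrt(np.mean((sim - obs) ** 2)))

def mae(sim, obs):
    """Mean absolute error: the average error magnitude."""
    return float(np.mean(np.abs(sim - obs)))

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 is perfect, can fall to -inf."""
    return float(1.0 - np.sum((obs - sim) ** 2)
                 / np.sum((obs - np.mean(obs)) ** 2))

obs = np.array([1.0, 2.0, 3.0])
sim = np.array([1.0, 2.0, 4.0])
print(rmse(sim, obs), mae(sim, obs), nse(sim, obs))
```

For this toy pair, the single 1 mm error gives RMSE ≈ 0.577, MAE ≈ 0.333 and NSE = 0.5, illustrating how RMSE exceeds MAE whenever errors are unevenly distributed.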

Results and Discussion
The forecasting ability of the hybrid GWO-XGB model for reference evapotranspiration at different time steps (1, 2 and 3 months) was evaluated. The GWO-XGB model was compared with three models, i.e., the M5, MLP and XGB models.

1-Month Ahead Forecasting
Tab. 2 shows the statistical indicators of the GWO-XGB hybrid model as well as the three single soft computing models (M5, MLP and XGB) for forecasting 1-month ahead reference evapotranspiration. The results during the training and testing periods were compared to check whether the models were over- or under-fitted, and the validation period was used to evaluate the forecasting ability. It can be seen that all four models described the fluctuation of ET0 in different seasons well. The errors of each model mainly came from the underestimation of peak values. It should be noted that this underestimation was common to all four models, with errors of up to 1.5 mm d−1 in extreme years, and may lead to increased crop water stress in practical use. Overall, it can be concluded that the GWO-XGB model performed better than the other models, followed by the MLP model.

2-Month Ahead Forecasting
Tab. 3 shows the statistical indicators of the GWO-XGB hybrid model as well as the three single soft computing models (M5, MLP and XGB) for forecasting 2-month ahead reference evapotranspiration.

Model Evaluation in Each Month
Plants have different water requirements in different seasons, especially in summer and autumn. Therefore, taking the RMSE value as an example, we evaluated the performances of the various models for forecasting ET0 in summer and autumn at the nine stations (Fig. 8). As shown in Fig. 8, the results fluctuated across models and across the forecasting time steps of monthly ET0. The MLP model performed best in summer, especially for forecasting 1-month ahead ET0, where its median and lower-quartile RMSE values were significantly lower than those of the other models; its median RMSE values for forecasting 2- and 3-month ahead ET0 were also lower than those of the other models. However, the MLP model performed worst in autumn; in particular, its upper-quartile RMSE exceeded 0.7 mm d−1, followed by the M5 model. The GWO-XGB model attained the lowest RMSE for forecasting 2-month ahead ET0 in autumn; in the other cases, it showed results similar to the XGB model. The implication is that different models could be considered for different seasons, namely the MLP model in summer and the GWO-XGB model in autumn. Huang et al. [65] evaluated the CatBoost model, random forest and support vector machines for prediction of ET0. In their study, meteorological data from 2001 to 2015 were collected from 12 stations in humid regions of China, and eight input combinations were investigated to find a suitable model with limited data. The RMSE was 0.35 mm d−1 with maximum temperature, minimum temperature and global solar radiation as inputs, and 0.52 mm d−1 with maximum temperature, minimum temperature, wind speed and relative humidity as inputs. It is worth mentioning that those models used historical meteorological data as inputs, and the errors would further increase if numerical weather forecasting data were used instead. This indicates that the error of this study is within an appropriate and applicable range.

Discussion
Accurate forecasting of future ET0 is of great significance for hydrological simulation, water resources management and agricultural water management. Higher accuracy can be obtained by using weather forecasts; however, the error of this method usually increases significantly beyond 7 days [47,66,67]. Karbasi [50] developed hybrid Gaussian process regression (GPR) and Wavelet-GPR models to forecast multi-step ahead daily (1-30 days ahead) reference evapotranspiration in Iran; the RMSE values obtained by the models ranged from 0.068 mm/d to 0.816 mm/d for 1 to 30 days ahead. Zhao et al. [49] proposed a post-processing method for 1-3 month ET0 forecasting based on GCM outputs, which has obvious advantages over the original method: the RMSE dropped from 0.83-0.97 mm/d (GCM) to 0.34-0.76 mm/d with the new approach at the Aero station in Townsville, Australia. In this study, we proposed a method to forecast ET0 1-3 months ahead based on historical ET0 in a humid region of China, and the method also achieved good accuracy.
The boosting model is a relatively new decision tree-based method that integrates decision trees through the boosting idea. Compared with the earlier random forest model based on bagging, the XGBoost model has a great advantage in running speed and a slight improvement in accuracy and in controlling overfitting [14,68-71].
Machine learning models need their parameters tuned for the dataset at hand. Much research has shown that heuristic algorithms can improve the accuracy and stability of machine learning models [7,72,73]. Similar results were obtained in this study, where GWO improved the stability of the XGBoost model and slightly improved its accuracy. The disadvantage is that parameter tuning takes more time: if the standalone XGB model uses grid search to determine the most appropriate parameters, only a few hundred model evaluations are needed, whereas the GWO-XGB model required 25,000 evaluations in this study, taking far more time than the standalone XGB model. The next research goal is to use new techniques, such as parallel algorithms, to speed up parameter tuning and save the time it requires.
Meteorological factors such as temperature and solar radiation change dramatically across the four seasons, resulting in huge differences in ET0 among the seasons. For a model like MLP, the algorithm itself builds only one model, which may be more prone to unbalanced performance across seasons; similar results were found in Sichuan, China by Feng et al. [36]. Methods such as XGBoost, which consist of multiple sub-models, can build more specific models for different seasons and combine them so that performance among the seasons is more balanced.

Conclusions
The forecasting of reference evapotranspiration several months ahead is helpful for water resources management and allocation in irrigated areas. This study investigated and compared the performance of the XGBoost model hybridized with the Grey Wolf Optimization algorithm, along with three traditional models, i.e., the single XGBoost, MLP and M5 models, for forecasting 1-3 month ahead ET0. The meteorological data obtained from nine stations in the subtropical zone were used as inputs for training, testing and validating the above models. The results showed that the newly developed GWO-XGBoost was a reliable and stable approach for ET0 forecasting. To forecast future 1-3 month ahead ET0, the GWO-XGBoost model had the