Stock-Price Forecasting Based on XGBoost and LSTM

Using time-series data analysis for stock-price forecasting (SPF) is complex and challenging because many factors can influence stock prices (e.g., inflation, seasonality, economic policy, societal behaviors). Such factors can be analyzed over time for SPF. Machine learning and deep learning have been shown to produce better stock-price forecasts than traditional approaches. This study therefore proposes a method to enhance the performance of an SPF system based on advanced machine learning and deep learning approaches. First, we applied extreme gradient boosting as a feature-selection technique to extract important features from high-dimensional time-series data and remove redundant ones. Then, we fed the selected features into a deep long short-term memory (LSTM) network to forecast stock prices. The deep LSTM network reflects the temporal nature of the input time series and fully exploits future contextual information; its complex structure enables it to capture more of the stochasticity within stock prices. The method generalizes without modification to both stock and Forex data. Experimental results on a Forex dataset covering 2008-2018 showed that our approach outperformed the baseline autoregressive integrated moving average (ARIMA) approach with regard to mean absolute error, mean squared error, and root-mean-square error.


Introduction
Stock-price forecasting (SPF) is an attractive and challenging research area in quantitative investing and time-series data analysis [1,2]. Stock prices are affected by many factors, such as inflation, seasonality, economic policy, company performance, economic shocks, and political shocks. Such factors can decrease the accuracy of any forecasting system. Nevertheless, accurate SPF can bring benefits to companies, shareholders, and investors; it can also be used as a key measurement for assessing economic performance.
To overcome the drawbacks of conventional SPF approaches, machine learning and deep learning have recently been introduced to analyze time-series data. Since deep learning SPF approaches depend only on the dataset and require neither stochastic modeling assumptions nor financial expertise, high-performance SPF systems can be built without expert knowledge. Machine learning and deep learning models that have been proposed to improve SPF system performance include artificial neural networks (ANNs) [7,10], convolutional neural networks (CNNs) [13][14][15], and recurrent neural networks (RNNs), such as long short-term memory (LSTM) [21][22][23][24][25][26][27].
One study [7] analyzed the performance of ARIMA and ANN models on a Korean stock-market dataset; the ARIMA model achieved higher accuracy than the ANN model, while an LSTM approach was found to outperform traditional ARIMA. Another study [8], meanwhile, proposed a Bayesian median autoregressive model, in contrast to mean-based methods, for time-series forecasting. Tsai et al. [9] used multivariate adaptive regression splines, stepwise regression, and kernel ridge regression as feature-selection methods for a time-series forecasting model. Others have combined support vector regression and genetic algorithms to increase forecasting accuracy. One study [10], for example, compared the performance of ensemble methods (random forest, AdaBoost, and Kernel Factory) with that of other classifiers (neural networks, logistic regression, support vector machines, and k-nearest neighbors) in predicting the direction of changes in stock prices; random forest yielded the best accuracy.
A study [13] that compared RNN, LSTM, and CNN-sliding window models to forecast NSEI-listed stocks reported that the CNN model had the best performance. Hoseinzade et al. [14] proposed a CNNPred model to extract feature vectors from stock data for prediction. Another study [15] used a CNN model combined with two fully connected layers to capture the spatial time-series structure to predict stock market trends; compared to traditional methods, the proposed method increased prediction accuracy by 4%-7%.
Others [16][17][18][19] have proposed deep learning approaches based on CNN and RNN for SPF; deep learning approaches were found to outperform traditional machine learning approaches. Another study [20] compared the performance of ARIMA and LSTM models for forecasting time-series data. Meanwhile, one study [21] used LSTM regression models to forecast a stock price dataset from India's NIFTY 50 index; the deep learning-based LSTM model performed better than traditional machine learning approaches. A study [22] that used ARIMA, LSTM, and bidirectional LSTM (BiLSTM) models to forecast financial time-series data found that the BiLSTM model obtained the best results. Combining RNN and AdaBoost models, another study [23] proposed an RNN-Boost model to forecast prices in the Chinese stock market; the proposed model yielded better accuracy than the baseline RNN model. Baek et al. [24] introduced a new framework, ModAugNet, that includes two LSTM modules: overfitting prevention LSTM and prediction LSTM; they found that the ModAugNet model significantly outperformed a baseline model. Other studies [25][26][27] that applied LSTM networks to SPF have found that LSTM models outperformed classification methods such as random forest, logistic regression, multiple kernel learning, and support vector machines.
The present study proposes a method based on machine learning and deep learning to enhance the performance of SPF. We combined an extreme gradient boosting (XGBoost) model for feature selection with a deep learning-based LSTM model. The XGBoost model automatically selects the most important features from a high-dimensional time-series dataset and discards redundant ones. We then exploit the power of LSTM regression by using the features extracted by the XGBoost model to forecast stock prices. We compared the performance of our approach with that of ARIMA using Forex data from 2008 to 2018. Our method maintains generality when applied to both stock and Forex data.

Proposed Method
Here, we introduce two approaches for SPF. An ARIMA model is used as a baseline for comparison with our approach.

ARIMA Model for SPF
ARIMA [3] has been widely used for time-series forecasting. It combines autoregressive (AR) and moving average (MA) processes. Given a stationary variable $Y_t$, we assume $u_t$ is a Gaussian white-noise series with zero mean and variance $\sigma_u^2$ ($\sigma_u^2 > 0$). The ARIMA model of order $(p, d, q)$ is given by

$$Y_t = h + \sum_{i=1}^{p} \phi_i Y_{t-i} + \sum_{j=0}^{q} \theta_j u_{t-j}, \tag{1}$$

where $h$ is a constant; $\phi_i \neq 0$ are the autocorrelation coefficients at lags $i = 1, \ldots, p$ ($p$ denotes the AR order); and $\theta_j \neq 0$, $j = 0, \ldots, q$, are the weighted coefficients applied to the current and prior values of a stochastic term in the time series ($q$ denotes the MA order). The ARIMA model based on the Box-Jenkins method is suitable for dealing with nonstationary time series because of its integrated component. The integrated component involves differencing, which is used to make a nonstationary time series stationary; the order of differencing (parameter $d$) measures the difference in observations at different times.
The parameters d, p, and q need to be effectively selected for a reliable ARIMA model. We determined suitable parameters p and q based on an autocorrelation function (ACF), partial autocorrelation function (PACF), and several criteria, such as log-likelihood, Bayesian information criterion (BIC), and Akaike information criterion (AIC). The parameter d was determined based on the augmented Dickey-Fuller test. In our experiment, the parameters p, d, and q of the ARIMA model were determined based on the experimental dataset. The ARIMA model was estimated based on maximum likelihood estimation.
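As a sketch of this order-selection step, first differencing (for $d$) and the sample ACF (for reading off $q$) can be computed directly. The following is a minimal numpy version; in practice, statsmodels provides `adfuller`, `acf`/`pacf`, and `ARIMA(...).fit()` with AIC, BIC, and log-likelihood.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation function up to `nlags` lags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)  # n * sample variance
    return np.array([1.0] + [np.dot(x[:-k], x[k:]) / denom
                             for k in range(1, nlags + 1)])

# First differencing (the "I" in ARIMA with d = 1) turns a trending
# series into a stationary one before p and q are read off ACF/PACF plots.
rng = np.random.default_rng(0)
trend = np.cumsum(rng.normal(size=500))   # random walk: nonstationary
diffed = np.diff(trend)                   # d = 1 differencing
acf_vals = sample_acf(diffed, nlags=10)
```

For the differenced random walk (white noise), all ACF values beyond lag 0 should be close to zero, which is the pattern that points toward a small MA order.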

SPF Based on XGBoost and LSTM Models
We first applied extreme gradient boosting (XGBoost) as a feature-selection method to select, from high-dimensional time-series data, the features most important for prediction and to discard redundant features. The selected features were then fed into the LSTM model to forecast stock prices. Fig. 1 presents an overall block diagram of the proposed method.
XGBoost [28,29] is a robust machine learning algorithm for structured or tabular data. It can improve speed and performance based on the implementation of gradient-boosted decision trees. XGBoost is widely used for feature selection because of its high scalability, parallelization, efficiency, and speed.
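The feature-selection step can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: scikit-learn's `GradientBoostingRegressor` stands in for XGBoost (xgboost's `XGBRegressor` exposes the same `feature_importances_` attribute), and the synthetic data and the choice of keeping the top 10 features are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
n_samples, n_features = 1000, 50
X = rng.normal(size=(n_samples, n_features))
# The target depends on only a few columns; the rest are redundant.
y = (3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.5 * X[:, 7]
     + rng.normal(scale=0.1, size=n_samples))

gbm = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                learning_rate=0.1, subsample=0.8,
                                random_state=0).fit(X, y)

# Rank features by importance; keep the 10 most important, discard the rest.
top10 = np.argsort(gbm.feature_importances_)[::-1][:10]
X_selected = X[:, top10]
```

The truly informative columns (0 and 3 here) dominate the importance ranking, while the redundant noise columns fall away.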
Let $D = \{(x_i, y_i)\}$ denote the training features and the observed values/targets, respectively. We assume there are $K$ gradient-boosting iterations and that $M$ additive functions are used to predict the output. Let $\hat{y}_i$ denote the predicted value of the $i$th feature vector at the $m$th boost, with each $f_m$ an independent tree structure $q$ with leaf weights $w$ ($w_j$ represents the score on the $j$th leaf of the tree). Given an input feature vector $x_i$, we computed the final predicted output by summing the scores across all leaves as follows:

$$\hat{y}_i = \sum_{m=1}^{M} f_m(x_i), \quad f_m \in \mathcal{F}, \tag{2}$$

where $\mathcal{F} = \{f(x) = w_{q(x)}\}$ denotes the space of regression trees, $q$ denotes the structure of each tree that maps an input to the corresponding leaf index, and $T$ represents the number of leaves in the tree. The idea of gradient boosting is to minimize the objective function (or loss function) as follows:

$$\mathcal{L} = \sum_i l(\hat{y}_i, y_i), \tag{3}$$

where $l$ measures the difference between the prediction $\hat{y}_i$ and the target $y_i$. While calibrating the gradient-boosting model, some hyperparameters related to the tree structures (e.g., subsample, max leaves, max depth) were considered to reduce overfitting. Furthermore, to reduce the model's adaptation rate to the training dataset, a learning rate, or shrinkage factor, was added to the model. Adding a penalty factor, or regularization term, $\Omega(f_m)$, that penalizes the model's complexity to the objective function in Eq. (3), the generalized objective function of XGBoost is described as follows:

$$\mathcal{L} = \sum_i l(\hat{y}_i, y_i) + \sum_m \Omega(f_m). \tag{4}$$

We fed the features selected by XGBoost into the LSTM model for SPF. The LSTM model is an extension of the RNN that reduces the effect of the vanishing-gradient problem. The model effectively captures contextual information within a sequence or series; it can also capture the information of a sequence output based on past and future contexts. Note that the model is executable on sequences of arbitrary length. It learns the long dependencies of the inputs, captures important features from the inputs, and preserves the information over a long period. Fig. 2 illustrates the structure of a basic LSTM unit. A standard LSTM unit comprises a memory cell, an input gate, an output gate, and a forget gate. The past information stored in the memory cell is as important as future information. The input and output gates allow the cell to store and retrieve information over long periods: the input gate decides whether to add new information to the memory, while the output gate decides what part of the LSTM unit's memory contributes to the output. The forget gate is used to clear the memory in the cell. Since this gate decides which information is discarded from memory, it properly captures the long-term dependencies that occur in time series.

Figure 2: A basic LSTM unit
Given a frame $x_t$ in the feature sequence $x = x_1, \ldots, x_T$, each time the LSTM unit receives $x_t$, it updates the hidden state $h_t$ with a nonlinear function that takes both the current input $x_t$ and the previous state $h_{t-1}$. Specifically, given frame $x_t$ at the current step $t$, $h_{t-1}$ is the hidden state at step $t-1$, and $c_{t-1}$ is the cell state at step $t-1$. The LSTM first calculates the forget gate $f_t$, the input gate $i_t$, the output gate $o_t$, and the candidate context $\tilde{c}_t$ as follows:

$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f), \tag{5}$$
$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i), \tag{6}$$
$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o), \tag{7}$$
$$\tilde{c}_t = \sigma_c(W_c x_t + U_c h_{t-1} + b_c), \tag{8}$$

where the $W$, $U$, and $b$ are the weight matrices and bias vector parameters, respectively, that need to be learned during training; $\sigma_g$ is a sigmoid function, while $\sigma_c$ is a hyperbolic tangent function. Then, the cell state $c_t$ and hidden state $h_t$ at the current time $t$ are determined as follows:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \tag{9}$$
$$h_t = o_t \odot \sigma_h(c_t), \tag{10}$$

where $\odot$ denotes the Hadamard product (element-wise product), and $\sigma_h$ is the hyperbolic tangent function. The LSTM model is directional and is used to reflect the temporal nature of the input time series; it helps to fully exploit future contextual information. Given the higher stochasticity of financial time-series data, deep LSTMs capture more stochasticity within stock prices because of their more complex structure. Fig. 3 shows the architecture of the deep LSTM model for SPF.
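The gate equations above can be checked with a minimal numpy implementation of a single LSTM step. The dimension sizes and random weights here are arbitrary, for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM update: the forget, input, and output gates, the candidate
    context, and the new cell and hidden states. W, U, and b hold the input
    weights, recurrent weights, and biases for gates 'f', 'i', 'o', 'c'."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate context
    c_t = f_t * c_prev + i_t * c_tilde   # cell state: Hadamard products
    h_t = o_t * np.tanh(c_t)             # hidden state
    return h_t, c_t

# Tiny example: input dimension 3, hidden dimension 4.
rng = np.random.default_rng(1)
d_in, d_h = 3, 4
W = {g: rng.normal(size=(d_h, d_in)) for g in "fioc"}
U = {g: rng.normal(size=(d_h, d_h)) for g in "fioc"}
b = {g: np.zeros(d_h) for g in "fioc"}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```

Because the hidden state is a sigmoid-gated hyperbolic tangent of the cell state, each component of `h` stays strictly inside (-1, 1).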

Dataset
We evaluated our proposed method using a dataset collected from the Forex market [30]. The dataset contains information covering 01/01/2008 to 03/19/2018 and has 709,314 total observations. Forex is different from the stock market because of its unique global market characteristics. A price may remain unchanged without a single trade for several minutes, or even hours, and then move dramatically as people start to trade more frequently. The Forex dataset contains a bid price of EUR/USD, and each 5 min price has over 200 features, including pricing, volatility, and volume information.
Tab. 1 shows a summary of the statistical values of the Forex market. We used closing price as the prediction target. We chose a subset of 59,094 observations with the 60 min price from the original dataset to evaluate the ARIMA model's performance. The original dataset was used to assess the performance of the LSTM model.
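Assuming the 60 min series is obtained by taking every twelfth 5 min bar (the exact resampling rule is not stated here), the subset construction can be sketched as:

```python
import numpy as np

# Hypothetical 5-min closing-price series with the paper's 709,314 observations.
rng = np.random.default_rng(7)
close_5min = 1.10 + np.cumsum(rng.normal(scale=1e-4, size=709_314))

# One 60-min bar per twelve 5-min bars.
close_60min = close_5min[::12]
```

This yields roughly 59,000 hourly observations, consistent in scale with the 59,094-observation subset used for the ARIMA model (market closures account for the small difference).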

Parameter Analysis
We randomly split the subdataset into two groups, approximately 70% for training and 30% for testing, to analyze the ARIMA model. Specifically, 41,365 observations were used as training data and 17,729 as test data. The training data were used to find the best parameters (p, d, q) for the ARIMA model. We used the augmented Dickey-Fuller test to determine d and found that the observations were stationary at d = 1. We also used the ACF, PACF, and criteria such as log-likelihood, BIC, and AIC to determine p and q. Tab. 2 shows the ACF and PACF values of the closing prices from the training data at various lags. Additionally, Tab. 3 presents the statistical results of different ARIMA parameters for the Forex market.
We chose the best model based on minimum BIC and AIC values and maximum log-likelihood; accordingly, ARIMA (0,1,1) was considered the best model for the Forex market. Fig. 4 presents the most important features selected by XGBoost. We found that feeding the 10 most important XGBoost-selected features into the LSTM model gave the best accuracy. We used the Adam optimizer and 50 epochs to train the LSTM model in Keras.
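Before such training, the selected features must be shaped into the (samples, timesteps, features) tensor that Keras LSTM layers expect. A minimal windowing sketch follows; the lookback length of 30 steps is an assumption for illustration, not a value taken from the experiment.

```python
import numpy as np

def make_windows(X, y, lookback):
    """Turn an (n, n_features) feature series into LSTM input of shape
    (n - lookback, lookback, n_features), pairing each window with the
    target value that immediately follows it."""
    Xw = np.stack([X[i:i + lookback] for i in range(len(X) - lookback)])
    yw = y[lookback:]
    return Xw, yw

# 10 selected features per time step, as in the experiment; series length is toy.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 10))
y = rng.normal(size=500)
Xw, yw = make_windows(X, y, lookback=30)
```

Each training sample is then a 30-step history of the 10 selected features, and the LSTM regresses the closing price at the step after the window.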
Finally, we used the mean absolute error (MAE), mean squared error (MSE), and root-mean-square error (RMSE) as metrics to evaluate the accuracy of the SPF system; the lower the values, the more accurate the system.
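The three metrics can be computed directly from the targets and predictions:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return np.sqrt(mse(y_true, y_pred))

# Toy check with errors of 0, 0, and 2.
y_true, y_pred = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
```

For these toy values, MAE = 2/3, MSE = 4/3, and RMSE = sqrt(4/3) ≈ 1.155.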

Results
Tabs. 4 and 5 show the prediction results of the ARIMA model and our approach for the test dataset. The predicted closing price values obtained using both approaches were very close to the target values. Therefore, both the ARIMA model and our approach yielded high forecasting accuracy.
Tab. 6 presents a comparison of the performance of our approach and the baseline ARIMA approach: the proposed approach outperformed the ARIMA model and achieved the best accuracy. For long-term time-series prediction, the LSTM model has the advantage of retaining important and relevant information, thereby enhancing predictive performance. Our proposed approach can therefore be considered a promising method for improving the accuracy of SPF.

Conclusion
This study proposed an improved SPF system by combining XGBoost and LSTM models. We first introduced the construction of important features from a high-dimensional dataset using XGBoost as the feature-selection method. Then, the features were fed into deep LSTM models to evaluate the performance of the forecasting system. The experimental results verified that the proposed approach significantly improved the accuracy of the SPF system and outperformed the baseline ARIMA approach.
Funding Statement: The authors received no specific funding for this study.

Conflicts of Interest: The authors have no conflicts of interest to declare regarding this study.