Time series forecasting and analysis are widely used in many fields and application scenarios. Historical time series data reflect patterns and trends of change, which can support applications and decisions in each scenario to a certain extent. In this paper, we study the time series prediction problem in the atmospheric environment scenario. For data support, we obtained data on nearly 3500 vehicles in several Chinese cities from Runwoda Research Institute, focusing on the major pollutant emission data of non-road mobile machinery and high-emission vehicles in Beijing and Bozhou, Anhui Province, to build the dataset and conduct time series prediction experiments. This paper proposes the P-gLSTNet model and compares it with the Autoregressive Integrated Moving Average (ARIMA) model, long short-term memory (LSTM), and Prophet in predicting emissions over a future period. The experiments are validated on four public datasets and one self-collected dataset, with mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) as the evaluation metrics. The experimental results show that the proposed P-gLSTNet fusion model yields lower prediction error, outperforms the backbone methods, and is more suitable for predicting time series data in this scenario.
Univariate and multivariate time series forecasting (TSF) and analysis are widely used in many areas of production and daily life, such as energy [
Traditional time series forecasting methods fit the historical trend curve with an appropriate mathematical model and then predict the future trend of the series from that model. Common models include Auto-Regressive Moving Average (ARMA), Vector Auto-Regression (VAR), Prophet, etc. [
To improve prediction accuracy, machine learning and deep learning algorithms have been introduced into time series prediction. These methods select features that may affect the predicted value according to the specific application scenario, feed these features into the model, and apply machine learning classification or regression models to produce the forecast [
Traditional methods: univariate forecasting model | Traditional methods: multivariate forecasting model | Machine learning methods: machine learning algorithm | Machine learning methods: deep learning method |
---|---|---|---|
Auto-regression | Vector auto-regression | SVM | RNN |
Moving average | Vector moving average | Bayesian network | Seq2seq-attention |
Auto-regression moving average | Vector auto-regression moving average | Random forest | DeepAR |
Auto-regression integrated moving average | Prophet | Transfer learning | CNN |
Seasonal auto-regression integrated moving average | | Ant colony optimization-AR | WaveNet |
Among the traditional multivariate time series forecasting models, the Prophet algorithm can handle not only outliers in the time series but also missing values [
As a variant of the recurrent neural network (RNN), LSTM has a better iterative and deep structure and adopts a special gated structure to overcome the shortcomings of RNN. With the improvement of hardware performance and computing power, LSTM and its variants excel in time series forecasting. Therefore, this paper proposes a deep learning framework for multivariate time series prediction that combines Prophet with improved LSTM units, named P-gLSTNet. We expect this model to outperform traditional or general models such as LSTM, ARIMA, and Prophet.
Prophet is a model for time series characteristics and change laws that was open sourced by Facebook in 2017 [
According to previous experimental research, the Prophet algorithm combines the fitting of a decomposed time series (that is, the traditional approach) with machine learning methods. The forecasting tasks it targets often have the following characteristics: observations sampled at a relatively stable, fixed frequency over the period; stable, repeating seasonal or other periodic changes; irregular abrupt changes caused by holidays; a reasonable amount of missing and abnormal data; reasonable changes in trend; and a trend that follows a linear or non-linear growth curve and may reach a natural limit or saturation. Based on these characteristics, the model focuses on two key points, missing values and flexibility, to make up for the limitations of traditional time series models in data processing. Its core is to analyze characteristics of the time series such as periodicity, trend, holiday effects, and seasonality. The trend term is fitted by piecewise linear segments, and change points can be added manually or selected automatically by an algorithm; the periodic term is modeled with a Fourier series. The basic form of the model is expressed as
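$$y(t) = g(t) + s(t) + h(t) + \epsilon_t$$

Among them (written in the notation of the standard Prophet formulation), $g(t)$ is the trend term, $s(t)$ is the periodic term, $h(t)$ is the holiday term, and $\epsilon_t$ is the error term.

Trend term: trend growth in the Prophet model is similar to population growth. Facebook employs a modified logistic growth model, whose basic form is

$$g(t) = \frac{C}{1 + \exp\left(-k\,(t - m)\right)}$$

where $C$ is the carrying capacity, $k$ is the growth rate, and $m$ is an offset parameter.

Periodic term: the Prophet model constructs a periodic model by introducing a Fourier series, and its basic form is expressed as

$$s(t) = \sum_{n=1}^{N}\left(a_n \cos\frac{2\pi n t}{P} + b_n \sin\frac{2\pi n t}{P}\right)$$

where $P$ is the period of the series, $N$ is the number of Fourier terms, and $a_n$ and $b_n$ are the coefficients to be fitted.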
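The trend and periodic terms described above can be illustrated numerically. The following is a minimal sketch that assumes the standard Prophet forms, a saturating logistic growth curve and a truncated Fourier series; the function and parameter names are illustrative and are not taken from the paper or from the Prophet library.

```python
import math


def logistic_trend(t, capacity, growth_rate, offset):
    """Saturating logistic growth: g(t) = C / (1 + exp(-k * (t - m)))."""
    return capacity / (1.0 + math.exp(-growth_rate * (t - offset)))


def fourier_seasonality(t, period, coefficients):
    """Periodic term: sum over n of a_n*cos(2*pi*n*t/P) + b_n*sin(2*pi*n*t/P).

    `coefficients` is a list of (a_n, b_n) pairs, one per harmonic n = 1..N.
    """
    total = 0.0
    for n, (a_n, b_n) in enumerate(coefficients, start=1):
        angle = 2.0 * math.pi * n * t / period
        total += a_n * math.cos(angle) + b_n * math.sin(angle)
    return total


def additive_model(t, capacity, growth_rate, offset, period, coefficients,
                   holiday_effect=0.0):
    """Additive decomposition: trend + periodic term + holiday term (no noise)."""
    return (logistic_trend(t, capacity, growth_rate, offset)
            + fourier_seasonality(t, period, coefficients)
            + holiday_effect)
```

For instance, with a capacity of 10, a growth rate of 1, and an offset of 0, the trend at `t = 0` sits at half the capacity, which is the midpoint of the logistic curve.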
Compared with other time series models, the main advantages of the Prophet model are that it can flexibly adjust its periodicity and trend assumptions during the experimental process; it accommodates diverse data forms, including irregular intervals and missing values; it can be fitted quickly; and it is easy to modify, allowing analysts to tune its internal parameters.
Taking a common time series scenario as an example, the black points represent the original discrete observations, the dark blue line represents the fitted values, and the light blue band represents the confidence interval of the series, that is, its reasonable upper and lower bounds. What Prophet does is:
Step 1: Enter the timestamps and corresponding values of a known time series;
Step 2: Enter the length of the time series to be predicted;
Step 3: Output the future time series trend;
Step 4: Provide the necessary statistical indicators with the output, including the fitted curve, the upper bound, and the lower bound.
It is also possible to add seasonal processing. As shown in
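The four steps above map closely onto the public API of the open-source `prophet` Python package. The sketch below is an illustration under assumptions rather than the configuration used in the paper: the helper name `forecast_with_prophet`, the default settings, and the hourly frequency are all hypothetical choices.

```python
def forecast_with_prophet(history, horizon, freq="h"):
    """Fit Prophet on `history` and forecast `horizon` further periods.

    `history` is expected to be a pandas DataFrame with a 'ds' column
    (timestamps) and a 'y' column (observed values), which is the input
    format Prophet requires. `freq` is a pandas frequency alias (older
    pandas versions spell the hourly alias as "H").
    """
    # Imported lazily so the sketch can be read without the package installed.
    from prophet import Prophet

    model = Prophet()                     # default additive model (assumption)
    model.fit(history)                    # Step 1: known timestamps and values
    future = model.make_future_dataframe(periods=horizon, freq=freq)  # Step 2
    forecast = model.predict(future)      # Step 3: future trend
    # Step 4: fitted curve with its lower and upper bounds
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```

The returned columns `yhat`, `yhat_lower`, and `yhat_upper` correspond to the fitted curve and the bounds of the uncertainty interval mentioned in Step 4.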
Compared with RNN, LSTM alleviates the vanishing and exploding gradient problems, but it is still not an ideal unit structure [
The input gate, output gate, and forget gate of the LSTM protect and control the cell state. The improved LSTM unit receives the input data at the current time, the hidden-layer output from the previous time, and the cell state from the previous time, and then maps the data to the range 0 to 1 with the sigmoid activation function. The expression of the forget gate is as described in formula
The output result of formula
Then the new candidate data to be added is expressed as formula
Since the input ratio and the new candidate data are known, the new unit state can be obtained, expressed as formula
The output gate represents the data flowing out of all input states after transformation, expressed as formula
Since the status of the unit and the data of the output gate have been updated, the output of the hidden layer can be obtained at this time, as described in
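The gate sequence described above (forget gate, input gate, candidate data, cell state update, output gate, hidden output) can be sketched in a few lines. This sketch implements the standard LSTM cell, not the improved unit proposed in the paper, whose exact modifications are not reproduced here; the parameter layout and all names are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Map a real number to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vector):
    """Return W @ vector + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a standard LSTM cell.

    `params` maps a gate name ('f', 'i', 'g', 'o') to a (W, b) pair that
    acts on the concatenation [h_prev, x_t].
    """
    z = h_prev + x_t  # list concatenation of previous hidden state and input
    f_t = [sigmoid(v) for v in affine(*params["f"], z)]    # forget gate
    i_t = [sigmoid(v) for v in affine(*params["i"], z)]    # input gate
    g_t = [math.tanh(v) for v in affine(*params["g"], z)]  # candidate data
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    o_t = [sigmoid(v) for v in affine(*params["o"], z)]    # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]     # hidden output
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate to 0, so the new cell state is simply half of the previous one, which is a convenient sanity check.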
To address the two tasks in this application scenario and to perform accurate and efficient predictive analysis of the high-dimensional time series data in the existing datasets, this section constructs and proposes the P-gLSTNet fusion model. The fusion model mainly consists of three parts:
A data input module for forming a high-dimensional data flux N * T * F;
To clarify the structure of the gLSTNet constructed in this paper, the data input must also be clarified. The constructed high-dimensional data flux is a data cube formed from multiple samples and multiple features at different moments after the time axis is added, as shown in
gLSTNet and Prophet modules trained in parallel;
The gLSTNet prediction model, built from the improved LSTM unit, is divided into an input layer, a hidden layer, and an output layer. The model structure is shown in
A module for weighted combination output based on the Particle Swarm Optimization (PSO) algorithm;
Time series data contains a lot of uncertain information, and the forecasting effect of applying a single model is often not very satisfactory [
The PSO algorithm is different from other methods of solving regression parameters. It regards the combination coefficients
Among them,
Its overall framework is shown in
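The N * T * F data flux produced by the data input module can be pictured as a stack of fixed-length windows. The sketch below assumes that the N samples are sliding windows of T consecutive time steps over records with F features each; the paper may form its samples differently (for example, per vehicle), and the function names are hypothetical.

```python
def build_data_cube(records, window):
    """Stack sliding windows of `window` time steps into an N x T x F cube.

    `records` is a list of per-time-step feature lists, each of length F.
    """
    if window <= 0 or window > len(records):
        raise ValueError("window must be between 1 and len(records)")
    return [[list(row) for row in records[start:start + window]]
            for start in range(len(records) - window + 1)]


def cube_shape(cube):
    """Return (N, T, F) for a cube built by build_data_cube."""
    return len(cube), len(cube[0]), len(cube[0][0])
```

Five time steps with two features each and a window of three steps give three overlapping samples, so the shape is (3, 3, 2).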
The monitoring data of the self-collected dataset come from high-emission vehicles operating under actual working conditions. A hardware monitoring terminal with integrated sensors is installed on each vehicle. In this study, remote monitoring is realized through a data monitoring and analysis platform, which is designed and developed on public cloud software and hardware resources to monitor and manage the massive, multi-source, multi-dimensional, heterogeneous emission data that non-road mobile sources generate in a fixed period. After the monitoring data collected from multiple terminals are preprocessed, they are stored in a structured database that forms a data resource library. The purpose is to allow users or analysts to intuitively perceive real-time dynamic information through extensive data visualization and to grasp hidden temporal clues from this time series information [
The monitoring data come from the integrated sensor monitoring device installed on each high-emission vehicle operating under actual working conditions. The sensor body packages a sensor control unit, a concentration acquisition module, a Beidou positioning module, a wireless communication module, and a power management module, which makes it convenient to collect exhaust emission factors such as particulate matter and nitrogen oxides in real time during actual engineering operation. An On-Board Diagnostics (OBD) reserved interface on the upper surface of the sensor body allows the device to be connected to the vehicle's OBD interface as the vehicle circuit requires. The integrated sensor sends data back to the data platform every 30 s. We select the data recorded from 00:00:00 on January 1, 2021 to 23:59:59 on October 31, 2021. For univariate time series prediction, 64 data tables were prepared through screening. Each table records the working data of one vehicle, and the 64 vehicles are located in the urban or suburban areas of Beijing, so their emission data can serve as a sample of Beijing's pollution. This dataset, with a total of 1,289,807 records, is denoted dataset1. For multivariate time series prediction, 10 data tables were prepared, each again recording the working data of one vehicle. These vehicles are distributed in Bozhou, Anhui, and the dataset, with a total of 247,244 records, is denoted dataset2.
We choose 9 monitoring features as input data, namely speed, Diesel Particulate Filter (DPF) differential pressure, DPF post pressure, DPF pre-temperature, DPF post-temperature, urea level, front NOx, rear NOx, PM. The following
Based on the research field of this paper, we also select four public datasets for the experiments. They come from the University of California, Irvine (UCI) Machine Learning Repository. The details of the datasets are shown in the following table. The four public datasets are:
The Air Quality Data Set [
Beijing PM2.5 Data Set [
PM2.5 Data of Five Chinese Cities Data Set [
Beijing Multi-Site Air-Quality Data Set [
The self-collected non-road mobile machinery exhaust emission time series forecast dataset (NrMM-TSF) is collected and organized according to the 9 types of monitoring attributes of the subject.
Dataset name | Type of data | Type of task | Property type | Quantity | Number of properties | Year |
---|---|---|---|---|---|---|
Air quality | Multivariate time series | Regression | Real | 9358 | 15 | 2016 |
Beijing PM2.5 | Multivariate time series | Regression | Integer | 43824 | 13 | 2017 |
PM2.5 data of five Chinese cities | Multivariate time series | Regression | Integer | 52854 | 86 | 2017 |
Beijing multi-site air-quality | Multivariate time series | Regression | Integer | 420768 | 18 | 2019 |
NrMM-TSF | Multivariate time series | Regression | Integer | 600000+ | 9 | 2022 |
The key work in this section is to conduct time series prediction experiments based on four models, namely ARIMA [
Step 1: Fill in the small amount of missing data with the average method, combining the data trend of the past month with the data of the same period in each quarter;
Step 2: Normalize the data processed in Step 1 so that the range of the training data is as small as possible. Then use formulas
Step 3: Apply a log transform to the data processed in Step 1 so that it conforms to the normal distribution as closely as possible. Then use
Step 4: By constructing the fitness function
Step 5: Combine the prediction results of the two models to obtain the combined model prediction value
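Steps 4 and 5 above, searching for the combination coefficients with PSO and then fusing the two predictions, can be sketched as follows. This is a minimal illustration under stated assumptions: the fitness function is taken to be the mean squared error of the weighted combination (the paper's own fitness function is not reproduced here), `pred_a` and `pred_b` stand for the outputs of the two single models, and the swarm size, inertia, and acceleration constants are hypothetical defaults.

```python
import random


def combined(alpha, beta, pred_a, pred_b):
    """Weighted combination of two prediction sequences."""
    return [alpha * a + beta * b for a, b in zip(pred_a, pred_b)]


def fitness(weights, pred_a, pred_b, actual):
    """Mean squared error of the weighted combination (assumed fitness)."""
    alpha, beta = weights
    fused = combined(alpha, beta, pred_a, pred_b)
    return sum((f - y) ** 2 for f, y in zip(fused, actual)) / len(actual)


def pso_weights(pred_a, pred_b, actual, particles=30, iterations=200,
                inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Search (alpha, beta) with a basic global-best particle swarm."""
    rng = random.Random(seed)
    positions = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(particles)]
    velocities = [[0.0, 0.0] for _ in range(particles)]
    personal_best = [p[:] for p in positions]
    personal_score = [fitness(p, pred_a, pred_b, actual) for p in positions]
    best_index = min(range(particles), key=personal_score.__getitem__)
    global_best = personal_best[best_index][:]
    global_score = personal_score[best_index]

    for _ in range(iterations):
        for k in range(particles):
            for d in range(2):
                r1, r2 = rng.random(), rng.random()
                # Pull each particle toward its own best and the swarm's best.
                velocities[k][d] = (inertia * velocities[k][d]
                                    + c1 * r1 * (personal_best[k][d] - positions[k][d])
                                    + c2 * r2 * (global_best[d] - positions[k][d]))
                positions[k][d] += velocities[k][d]
            score = fitness(positions[k], pred_a, pred_b, actual)
            if score < personal_score[k]:
                personal_best[k], personal_score[k] = positions[k][:], score
                if score < global_score:
                    global_best, global_score = positions[k][:], score
    return global_best, global_score
```

Calling `pso_weights(pred_a, pred_b, actual)` returns the best coefficient pair found and its fitness; the fused forecast is then `combined(alpha, beta, pred_a, pred_b)`. The coefficients are not constrained to sum to one in this sketch.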
The combined training parameters are set as follows. In the development environment, the time series data are converted into a supervised learning problem, and the improved LSTM model is applied to the training set (60%), test set (30%), and validation set (10%) to obtain the specific parameters of the single model, with the input vector dimension of the time series data set according to the forecast demand. Prophet is applied to the log-processed dataset to obtain the specific parameters of the single model in the same way, with the input vector dimension again set according to the forecast demand. The trend term is an improved logistic growth model, the periodic term is a Fourier series, and the model is fitted with the fit function.
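The conversion to a supervised learning problem and the 60/30/10 split can be sketched as below. The sketch assumes min-max normalization for the scaling step, a sliding look-back window for the supervised framing, and a chronological (unshuffled) split; none of these details are confirmed by the text, and the function names are hypothetical.

```python
def min_max_scale(values):
    """Rescale values to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def to_supervised(series, lookback, horizon=1):
    """Turn a series into (inputs, targets) pairs.

    Each sample uses `lookback` past values as inputs and the following
    `horizon` values as targets.
    """
    samples = []
    for start in range(len(series) - lookback - horizon + 1):
        inputs = series[start:start + lookback]
        targets = series[start + lookback:start + lookback + horizon]
        samples.append((inputs, targets))
    return samples


def chronological_split(samples, train=0.6, test=0.3):
    """Split samples in time order into training, test, and validation sets."""
    n = len(samples)
    train_end = round(n * train)
    test_end = train_end + round(n * test)
    return samples[:train_end], samples[train_end:test_end], samples[test_end:]
```

Keeping the split chronological avoids leaking future observations into the training set, which matters for time series even when the proportions are fixed in advance.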
There are various evaluation systems and indicators for time series prediction. The simplest evaluation method is visual curve fitting: the test result and the input data curve are drawn on the same graph, and the degree of fit between them is judged by observation. This method is simple but coarse, and non-numerical results cannot clearly reflect the objectivity of the outcome. Therefore, this paper evaluates the performance of P-gLSTNet and the comparison methods with three prediction metrics: mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) [
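Mean Absolute Error (MAE) is the average of all absolute errors.

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

Root Mean Square Error (RMSE) is the square root of the mean of the squared deviations between the predicted values and the true values.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

The formula for Mean Absolute Percentage Error (MAPE) is as follows.

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

Among them, $n$ is the number of observations, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value.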
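The three metrics can be computed directly from paired true and predicted values. The following is a minimal sketch using the standard definitions; the function names are illustrative, and MAPE is reported in percent and is undefined when a true value is zero.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if any(y == 0 for y in actual):
        raise ValueError("MAPE is undefined when a true value is zero")
    return 100.0 * sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / len(actual)
```

Because squaring weights large deviations more heavily, RMSE is never smaller than MAE on the same predictions, which is a quick consistency check when reading result tables.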
Through the analysis of the experimental data set, we can conclude that the length of the time series and forecast hours are determined by the sample target ratio. In this article, a 4:1 ratio is used [
The learning rate is a hyperparameter that controls the degree to which the model is changed according to the estimated error each time the model weight is updated [
Several parameters in the model remain to be determined. The number of units in the module and in the fully connected module represents the dimensionality of the output data. Through experiments, we set the number of neurons in the input layer to 8, use 1 hidden layer with 128 neurons, and use the ReLU function as the hidden-layer activation function. The activation function is the basis on which an artificial neural network extracts and learns complex features [
To demonstrate the superiority of the model, four algorithms, ARIMA, Prophet, LSTM, and the fusion model P-gLSTNet, are used to conduct four experiments with different time series lengths on four public datasets and one self-collected dataset. The prediction performance of the four methods on each dataset is shown in
NrMM-TSF

Model category | RMSE | MAE | MAPE |
---|---|---|---|
ARIMA | 4.73 | 22.36 | 18.6 |
LSTM | 3.74 | 14.02 | 22.6 |
Prophet | 4.05 | 16.43 | 13.2 |

Air quality data set

Model category | RMSE | MAE | MAPE |
---|---|---|---|
ARIMA | 5.15 | 26.54 | 26.8 |
LSTM | 3.98 | 15.85 | 15.55 |
Prophet | 4.06 | 16.45 | 16.01 |

Beijing PM2.5 data set

Model category | RMSE | MAE | MAPE |
---|---|---|---|
ARIMA | 4.73 | 22.36 | 19.1 |
LSTM | 3.74 | 14.02 | 15.6 |
Prophet | 4.29 | 18.43 | 17.24 |

PM2.5 data of five Chinese cities data set

Model category | RMSE | MAE | MAPE |
---|---|---|---|
ARIMA | 4.73 | 22.36 | 20.1 |
LSTM | 4.36 | 19.02 | 17.6 |
Prophet | 4.27 | 18.2 | 16.2 |

Beijing multi-site air-quality data set

Model category | RMSE | MAE | MAPE |
---|---|---|---|
ARIMA | 5.45 | 29.66 | 26.64 |
LSTM | 4.06 | 16.5 | 19.6 |
Prophet | 3.89 | 15.13 | 15.24 |
As analyzed previously, ARIMA performed the worst of the four time series prediction models when faced with multivariate data and long-term dependencies. On the self-collected dataset, when the time series is aggregated to a length of 1000 h, the RMSE, MAE, and MAPE of its predictions reach 4.73, 22.36, and 18.6, respectively, which are much higher than the 2.92, 8.57, and 4.3 obtained by the fusion model P-gLSTNet. Consistent with previous studies, the results show that the traditional ARIMA-based prediction model is not suitable for time series prediction in the atmospheric and energy environment fields.
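To make "much higher" concrete, the gap between the two models can be expressed as a relative error reduction. The short sketch below only performs arithmetic on the values reported above for the self-collected dataset; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Values reported in the text for the self-collected dataset at 1000 h.
arima = {"RMSE": 4.73, "MAE": 22.36, "MAPE": 18.6}
fusion = {"RMSE": 2.92, "MAE": 8.57, "MAPE": 4.3}

reductions = {name: round(relative_reduction(arima[name], fusion[name]), 1)
              for name in arima}
```

On these reported figures, the fusion model's errors are roughly 38% (RMSE), 62% (MAE), and 77% (MAPE) lower than those of ARIMA.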
The fusion model P-gLSTNet outperforms the other experimental models on every metric we use, followed by LSTM and Prophet. When the time series length and prediction hours are set to 1000 h, the RMSE, MAE, and MAPE values of the two backbone models, LSTM and Prophet, reach 3.74, 14.02, 22.6 and 4.05, 16.43, 13.2, respectively, which are their best results across all experiments. As the ratio of time series length to forecast hours increases, the advantage of P-gLSTNet gradually diminishes, but it still outperforms the ARIMA, LSTM, and Prophet models.
The same ranking of performance is also observed on the other 4 public datasets. To display the prediction results visually, we randomly selected 1000 h of predicted values to compare with the actual monitoring values.
As shown in
P-gLSTNet combination factor

Dataset | α | β |
---|---|---|
Air quality data set | 0.6544 | 0.4103 |
Beijing PM2.5 data set | 0.9236 | 0.3503 |
PM2.5 data of five Chinese cities data set | 0.7515 | 0.2890 |
Beijing multi-site air-quality data set | 0.8256 | 0.2678 |
Because its time series characteristics are good, the pressure difference attribute is selected from the multi-dimensional attributes for the prediction process. The pressure difference is the difference between the front pressure and the back pressure of the DPF; it characterizes the working intensity of the hardware terminal equipment and facilitates the supervision of operation and maintenance. To verify the prediction accuracy of the models, the degree of fit is compared, as shown in
The fitting of the ARIMA predictions shows that the accuracy of this method is poor. Over the 1000 h of densely aggregated data, it cannot fit the multiple short-term fluctuations well, nor can it cope with the single long-term downward fluctuation. The reason is that the method requires the time series to be stationary, or stationary after differencing. In addition, the method by nature captures only linear relationships, not nonlinear ones, whereas most of the time series in this scenario are nonlinear and do not exist in isolation. The semantic relationship between sensors also needs to be explored in the next step.
The fitting of the LSTM predictions shows that the accuracy of this method is still limited. Over the 1000 h of densely aggregated data, the multiple short-term fluctuations can be fitted properly and the single long-term falling fluctuation can be handled, but peaks in short time intervals are lost. This result shows that using LSTM and its variants requires considering an appropriate time interval span as well as the considerable depth of the network, which is also a focus of future work.
The fitting of the Prophet predictions shows that the accuracy of this method is better. Over the 1000 h of densely aggregated data, the multiple short-term fluctuations can be fitted properly and the single long-term falling fluctuation can be handled well, but a certain number of peaks in short time intervals are lost. This confirms the widely held assessment of Prophet as efficient but imprecise, and we will use this method for model checking in subsequent experiments.
The fitting of the P-gLSTNet predictions shows that the accuracy of this method is the best among the four methods. Over the 1000 h of densely aggregated data, the multiple short-term fluctuations can be fitted properly and the single long-term downtrend can be handled, with peaks missed in only a small number of short time intervals. The backbone of the P-gLSTNet model comes from the LSTM method and the Prophet method, which enables it to make full use of the advantages of both Prophet and long short-term memory networks, so its prediction accuracy is higher.
To address the supervision and predictive analysis of high exhaust emissions from non-road mobile sources, this paper studies algorithms for time series prediction and analysis. Theoretical research and experimental verification have been carried out around two innovations:
The self-collected dataset NrMM-TSF is constructed, which is the first real dataset in China collected under the actual working conditions of non-road mobile sources;
The fusion network model P-gLSTNet is proposed. By training the improved LSTM unit and Prophet separately and producing a weighted combination output, the model can attend to the long-term, short-interval, trend, and periodic data in the scenario, and it adapts well to the high-dimensional data throughput of the self-collected dataset NrMM-TSF in the experiments.
This paper explicitly focuses on time series visualization and time series forecasting. The experiments are supported by a fusion model that combines an improved artificial neural network method with a traditional time series forecasting method. The experimental results serve the application scenario of predicting and supervising exhaust pollution from high-emission vehicles. By analyzing and mining the periodicity, trend, anomalies, and abrupt changes of the time series dataset, it is possible to predict, for a period of time, how air pollution control of high-emission vehicles in Beijing, Bozhou, and other places will improve or develop. Such trend prediction can serve the supervision and decision-making of the government and relevant environmental protection departments well.
First of all, in terms of datasets, this research has accumulated and explored important material, which is also an important innovation. Relying on a practical engineering project and taking as research objects non-road mobile machinery and high-emission vehicles, which are closely related to production and life in important domestic cities, we dynamically collect, in real time, time series datasets that can be continuously maintained and enriched for applications in the fields of atmosphere, energy, and environment. The downside is that the exported data tables do not reach the expected 10 attribute dimensions. This is mainly because many sensors under actual working conditions have not yet been connected to the central control unit and started working, so some data dimensions are missing. Viewed another way, this leaves room for expansion in the future.
Secondly, in terms of fusion algorithms, this research reviews time series forecasting methods across application scenarios, combines Prophet, the LSTM model, and the PSO algorithm, and proposes the P-gLSTNet time series forecasting model, which makes full use of the advantages of Prophet and LSTM: prediction accuracy improves significantly while the model still responds well to missing data, abrupt changes, abnormal factors, seasonality, and trend. The shortcoming of the experimental verification is that training and predictive analysis under multi-dimensional attributes are not sufficient, and the comparison with traditional and other general or non-general models is insufficient.
In the next stage of work, we will continue to improve the comparative experiments on LSTM variants by incorporating various general models and advanced algorithms, and we will try to introduce the attention mechanism and the Transformer to improve the model. We will also continue to collect and improve the self-built time series datasets, extend the research, and transfer the data preprocessing method to other fields, so that deep learning methods can be better applied to the fields of atmosphere, energy, and environment.