Air pollution is a major environmental problem in developing countries, and particulate matter 2.5 (PM2.5) is an air pollutant of particular concern. When its concentration in the air is high, as in developing countries like Vietnam, it harms public health. Accurate prediction of PM2.5 concentrations can therefore support sound decisions to protect citizens' health. This study develops a hybrid deep learning approach, named the PM25-CBL model, for PM2.5 concentration prediction in Ho Chi Minh City, Vietnam. First, the study analyzes the effects of variables on PM2.5 concentrations in the Air Quality HCMC dataset; only variables that affect the results are selected for PM2.5 concentration prediction. Second, an efficient PM25-CBL model that integrates a convolutional neural network (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM) is developed. The model consists of three modules: a CNN module, a Bi-LSTM module, and a fully connected module. Finally, experiments compare the performance of the proposed approach with several state-of-the-art time series prediction models, such as LSTM, Bi-LSTM, the combination of CNN and LSTM (CNN-LSTM), and ARIMA. The empirical results confirm that the PM25-CBL model outperforms the other methods on the Air Quality HCMC dataset in terms of several metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE).
Machine learning and deep learning have been developing very rapidly and have been applied in many fields such as economics [
In recent years, there have been many studies on air pollution using machine learning and deep learning [
Among the particulate matter involved in air pollution, PM2.5 is of particular concern, and there is consequently a large body of research on PM2.5 concentration prediction. Feng et al. [
The major contributions of this study are summarized as follows.
First, this study analyzes the effects of variables on PM2.5 concentrations in the Air Quality HCMC dataset. Second, the PM25-CBL model, which integrates CNN and Bidirectional Long Short-Term Memory (Bi-LSTM), is developed. Third, experiments evaluate the predictive performance of the proposed approach against the LSTM, Bi-LSTM, CNN, CNN-LSTM, CNN-Bi-LSTM, and ARIMA models. The results indicate that the PM25-CBL model outperforms the other experimental methods on the Air Quality HCMC dataset in terms of the MSE, RMSE, MAE, and MAPE metrics.
The remainder of this article is organized as follows. Section 2 presents the details of the Air Quality HCMC dataset and the PM25-CBL model, which integrates CNN and Bi-LSTM for PM2.5 concentration prediction. Section 3 reports the experiments. Finally, Section 4 concludes the study and introduces several directions for future work.
The Vietnam Air Quality Data Series [
#No | Variable | Description | Min | Max | Median |
---|---|---|---|---|---|
1 | Date | Date time | 2020-01-03 | 2021-01-20 | 2020-07-20 |
2 | Temperature | Median of temperature | 23 | 31 | 27.5 |
3 | Humidity | Median of humidity | 47 | 100 | 78 |
4 | Wind speed | Median of wind speed | 0.5 | 5.4 | 2.3 |
5 | PM25 | Median of PM2.5 concentrations | 5 | 171 | 65 |
6 | Dew | Median of dew | 14.5 | 26.5 | 24 |
7 | Pressure | Median of pressure | 1003 | 1014 | 1009 |
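Before further analysis, the data are preprocessed. To avoid the effects of differing scales and to reduce processing time, six variables (temperature, humidity, wind speed, PM2.5 concentrations, dew, and pressure) are normalized by the Min-Max Scaler according to the following equation, where $x$ is an original value, $x_{\min}$ and $x_{\max}$ are the minimum and maximum of the variable, and $x'$ is the normalized value:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$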
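The normalization step can be illustrated with a minimal sketch in plain Python. The function name `min_max_scale` and the three sample readings are illustrative assumptions (the readings reuse the minimum, median, and maximum PM2.5 values from the dataset summary table); this is not the authors' preprocessing code.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range to scale by; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample: the min, median, and max PM2.5 values from the summary table.
pm25 = [5, 65, 171]
scaled = min_max_scale(pm25)
```

In a real pipeline the minimum and maximum would normally be computed on the training split only and then reused for the testing split, so that no information from the test data leaks into the scaling.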
Then, the correlations between temperature, humidity, wind speed, dew, and pressure with PM2.5 concentrations in Air Quality HCMC dataset are shown in
This section presents the overall architecture of a hybrid deep learning approach named the PM25-CBL model for PM2.5 concentration prediction in the Air Quality HCMC dataset. This model shown in
For detail, the left of
The forget gate in a memory cell provides which cell state information will be discarded. In
In
The input gate identifies how much of the current moment input
The output gate determines how much of the current cell state will be output. The output information,
Then the cell state (
An LSTM network analyzes a sequence in only one direction, which can reduce its effectiveness. Meanwhile, both forward and backward directional information on the sequence may contain interesting patterns. Therefore, Bi-LSTM which considers both forward and backward directions in the sequence [
#No | Layer type | Output shape | Parameters |
---|---|---|---|
1 | 1D Convolution | (None, None, 5, 64) | 192 |
2 | 1D Max pooling | (None, None, 5, 64) | 0 |
3 | 1D Convolution | (None, None, 4, 64) | 8,256 |
4 | 1D Max pooling | (None, None, 4, 64) | 0 |
5 | Flatten | (None, None, 256) | 0 |
6 | Bi-LSTM | (None, None, 128) | 164,352 |
7 | Dropout | (None, 128) | 0 |
8 | Bi-LSTM | (None, 64) | 41,216 |
9 | Fully connected layer | (None, 1) | 65 |
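The parameter counts in the table above can be cross-checked against the standard formulas for convolutional, LSTM, and fully connected layers. The sketch below is an illustration only: the kernel size of 2, the single input channel, and the 64 and 32 LSTM units per direction are assumptions inferred from the reported shapes and counts, not hyperparameters stated in the text, and the helper names are hypothetical.

```python
def conv1d_params(kernel_size, in_channels, filters):
    # One weight per (kernel position, input channel, filter) plus one bias per filter.
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    # Four weight blocks (input gate, forget gate, output gate, cell candidate),
    # each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def bilstm_params(input_dim, units_per_direction):
    # A bidirectional layer holds two independent LSTMs (forward and backward).
    return 2 * lstm_params(input_dim, units_per_direction)


def dense_params(input_dim, outputs):
    return input_dim * outputs + outputs


# Assumed hyperparameters (inferred, not stated in the source).
counts = {
    "conv1": conv1d_params(kernel_size=2, in_channels=1, filters=64),
    "conv2": conv1d_params(kernel_size=2, in_channels=64, filters=64),
    "bilstm1": bilstm_params(input_dim=256, units_per_direction=64),
    "bilstm2": bilstm_params(input_dim=128, units_per_direction=32),
    "dense": dense_params(input_dim=64, outputs=1),
}
```

Under these assumptions the computed values agree with the table (192, 8,256, 164,352, 41,216, and 65), which suggests that each Bi-LSTM output width (128 and 64) is the concatenation of two directions of 64 and 32 units respectively.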
This section evaluates the proposed approach against several state-of-the-art time series prediction models, namely LSTM, Bi-LSTM, CNN, CNN-Bi-LSTM, CNN-LSTM, and ARIMA, on the Air Quality HCMC dataset. The above methods are implemented in the Keras framework and run on an Ubuntu computer with an Intel Core i7-4790K (4.0 GHz × 8 cores), 32 GB of RAM, and a GeForce GTX 1080 Ti. To demonstrate the effectiveness of the PM25-CBL model, the Air Quality HCMC dataset is divided into 5 folds, with 20% of the data used as the testing set and the remainder as the training set. The following four common performance metrics are used to compare the experimental methods.
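The first metric, MSE, is the average squared difference between the values predicted by the model and the actual values. The second metric, RMSE, is the square root of the MSE. They are determined by the following equations, where $n$ is the number of samples, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$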
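Next, MAE gives the average magnitude of the prediction errors, ignoring their direction by taking absolute values. MAPE expresses the prediction error of a forecasting method as a percentage of the actual values. They are obtained by the following equations, with $n$, $y_i$, and $\hat{y}_i$ denoting the number of samples, the actual values, and the predicted values:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$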
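The four metrics can be computed with a few lines of plain Python. This is a minimal sketch using the standard definitions; the function names and the sample values are hypothetical and are not taken from the experiments.

```python
import math


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, in the same unit as the target variable."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Mean of the absolute differences; the sign of each error is ignored."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute error relative to the actual value, in percent.

    Assumes no actual value is zero, otherwise the ratio is undefined.
    """
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values for illustration only.
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]
scores = {
    "MSE": mse(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "MAE": mae(actual, predicted),
    "MAPE": mape(actual, predicted),
}
```

Because MAPE divides by the actual value, it is scale-independent, whereas MSE, RMSE, and MAE depend on the scale of the target, so their magnitudes differ depending on whether they are computed on normalized or original PM2.5 values.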
Firstly, the loss values during training and testing phases were tracked, which are shown in
Secondly, this section reports the performances of the experimental methods, including LSTM, Bi-LSTM, CNN, CNN-LSTM, CNN-Bi-LSTM, ARIMA and the proposed approach for the Air Quality HCMC dataset in terms of MSE, RMSE, MAE, and MAPE. The experimental results in
#No | Model | MSE | RMSE | MAE | MAPE |
---|---|---|---|---|---|
1 | LSTM | 1.48 ± 0.34 | 1.21 ± 0.13 | 1.02 ± 0.09 | 3.63 ± 0.27 |
2 | Bi-LSTM | 1.31 ± 0.64 | 1.13 ± 0.18 | 0.94 ± 0.22 | 3.36 ± 0.4 |
3 | CNN | 1.34 ± 0.36 | 1.15 ± 0.16 | 0.94 ± 0.12 | 3.37 ± 0.4 |
4 | CNN-LSTM | 1.41 ± 0.4 | 1.17 ± 0.22 | 0.97 ± 0.13 | 3.46 ± 0.4 |
5 | CNN-Bi-LSTM | 1.51 ± 0.5 | 1.2 ± 0.25 | 1.0 ± 0.21 | 3.57 ± 0.7 |
6 | PM25-CBL | ||||
7 | ARIMA | 1.97 ± 0.2 | 1.40 ± 0.07 | 0.99 ± 0.05 | 16.19 ± 0.002 |
Finally, this study evaluates the processing time of the experimental methods for the Air Quality HCMC dataset.
#No | Model | Training phase (s) | Predicting phase (s) |
---|---|---|---|
1 | LSTM | 3.302 | 0.124 |
2 | Bi-LSTM | 5.679 | 0.234 |
3 | CNN | ||
4 | CNN-LSTM | 3.255 | 0.107 |
5 | CNN-Bi-LSTM | 4.9 | 0.17 |
6 | PM25-CBL | 3.404 | 0.17 |
7 | ARIMA | 4.39 | 0.112 |
The experimental results on processing time indicate that the proposed method does not achieve the best training or prediction time. Its training time is twice that of the fastest model (CNN). However, training the proposed method takes only 3.4 s, which is not significant given how far hardware has advanced. In addition, the proposed method achieves the best accuracy in terms of the MSE, RMSE, MAE, and MAPE metrics. Given these improvements in performance, the processing time of the proposed method is acceptable compared with the CNN, LSTM, Bi-LSTM, CNN-LSTM, and ARIMA models. In summary, the advantage of the proposed model is its predictive performance, while its limitation is the training time. Because the time difference between the methods is not significant, the proposed method is recommended for integration into smart environmental monitoring, to automatically provide forecasts to citizens when PM2.5 concentrations reach dangerous thresholds in a smart city.
This study developed a deep learning model for PM2.5 concentration prediction in Ho Chi Minh City, Vietnam. Further work is needed to integrate the model into a system so that it can be used in practice. In addition, the set of collected features is still very limited, which constrains accuracy. Therefore, more information about the environment needs to be collected in the future.
This study developed a hybrid deep learning approach named PM25
For future work, we will focus on improving the performance of the PM2.5 concentration prediction model by applying advanced techniques, such as evolutionary algorithms, to the proposed approach. In addition, Air Quality datasets from several cities in Vietnam will be collected to verify the proposed model.
The authors received no specific funding for this study.
The authors declare that they have no conflicts of interest to report regarding the present study.