Time series forecasting has become an important aspect of data analysis and has many real-world applications. However, undesirable missing values are often encountered, which may adversely affect many forecasting tasks. In this study, we evaluate and compare the effects of imputation methods for estimating missing values in a time series. Our approach does not include a simulation to generate pseudo-missing data, but instead performs imputation on actual missing data and measures the performance of the forecasting model created therefrom. In the experiment, several time series forecasting models are trained using different training datasets, each prepared with one imputation method. Subsequently, the performance of the imputation methods is evaluated by comparing the accuracy of the forecasting models. The results obtained from a total of four experimental cases show that the

The recent emergence of cutting-edge computing technologies, such as the internet of things (IoT) and big data, has ushered in an era in which large-scale data can be generated, collected, and exploited. Combining unstructured data from various data-generating sources with the well-structured data primarily used for data analysis not only increases the data volume but also makes information and knowledge that were previously difficult to obtain easier to acquire.

Among the wide array of data classes, time series, which is a sequence of data arranged in chronological order (

To address the missing data problem, which is inevitable in real-life data analysis, various imputation techniques for reconstructing missing values have been primarily investigated in the field of statistics [

The experiment performed in this study was designed to avoid the shortcomings mentioned above. First, incomplete data were filled with estimated values by imputing the actual missing values instead of introducing artificial missing parts. Next, individual forecasting models were trained on the different datasets recovered by each imputation method. Subsequently, the performance of each forecasting model was measured on a common test set, and the effectiveness of each imputation method was evaluated and compared with that of the other methods. We used six imputation methods available in a Python library that is easily accessible and can be used by non-specialists. In addition, four multivariate time series datasets containing actual missing data were used in the experiments.
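The evaluation procedure outlined above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the two imputers, the least-squares AR(1) forecaster (a lightweight stand-in for the forecasting models), and all function names are hypothetical, and the toy series is assumed to start with an observed value.

```python
def mean_fill(series):
    """Replace None with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]


def locf_fill(series):
    """Replace None with the last observed value (first value must be observed)."""
    out, last = [], None
    for v in series:
        last = v if v is not None else last
        out.append(last)
    return out


def fit_ar1(series):
    """Least-squares fit of y[t] = a * y[t-1] + b (stand-in forecaster)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var = sum((u - mx) ** 2 for u in x)
    a = sum((u - mx) * (v - my) for u, v in zip(x, y)) / var if var else 0.0
    return a, my - a * mx


def evaluate(imputers, train, test):
    """Train one model per imputed training set; score all on the same test set."""
    scores = {}
    for name, impute in imputers.items():
        a, b = fit_ar1(impute(train))
        preds = [a * prev + b for prev in test[:-1]]
        errors = [abs(p - actual) for p, actual in zip(preds, test[1:])]
        scores[name] = sum(errors) / len(errors)  # mean absolute error
    return scores


# Toy data: the training series has real gaps (None); the test series is complete.
train = [1.0, 2.0, None, 4.0, 5.0, None, 7.0, 8.0]
test = [9.0, 10.0, 11.0, 12.0]
scores = evaluate({"mean": mean_fill, "locf": locf_fill}, train, test)
```

Because every model is scored on the same complete test set, differences in the scores can be attributed to the imputation step rather than to the evaluation data.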

The remainder of this paper is organized as follows. In Section 2, the basic types of missingness are defined, and related studies involving the imputation of missing values within a time series are summarized. Section 3 provides definitions of missing value imputation and the main related techniques. Section 4 presents the experimental results for the evaluation and comparison of the imputation methods. Section 5 provides the conclusions and future research directions.

Missing values are often encountered in many real-world applications. For example, when obtaining data from a questionnaire, many respondents are likely to intentionally omit a response to a question that is difficult to answer. As another example, when collecting data measured by machines or computer systems, various types of missing values can occur owing to mechanical defects or system malfunctions. Because missing values have undesirable effects on data availability and quality, handling such missing values should be considered in data analysis. To devise an optimal strategy for deciding how to handle missing values, the underlying reasons contributing to the occurrence of missing values must be understood. The primary types of missing values identified in previous studies related to the field of statistics are as follows:

Missing completely at random (MCAR): This indicates that the missingness of data is independent of both observed and unobserved variables. The MCAR assumption is ideal in that unbiased estimates can be obtained regardless of missing values; however it is impractical in many cases of real-world data [

Missing at random (MAR): Missingness is related to observed but not unobserved variables. A dataset that satisfies the MAR assumption may or may not result in a biased estimate.

Missing not at random (MNAR): Missingness is related to unobserved variables,

Whereas analysis of a dataset with MCAR yields unbiased results, datasets with MAR or MNAR, which constitute the majority of real-world data, require appropriate treatment to alleviate estimation bias. Several methods can address this problem; in this study, we focus on imputation methods that replace missing values in an automated manner.
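To make the distinction between the three mechanisms concrete, the following sketch builds one missingness mask per mechanism for a variable `y` with a fully observed covariate `x`. It is purely illustrative, since this study works with actual rather than simulated missing data; the function name, the drop rate, and the deterministic "drop the largest values" rule used for MAR and MNAR are simplifying assumptions.

```python
import random


def make_masks(x, y, rate=0.3, seed=0):
    """Return (mcar, mar, mnar) masks for y; True marks a value to be dropped.

    x is a fully observed covariate of the same length as y.
    """
    rng = random.Random(seed)
    n = len(y)
    k = round(rate * n)  # number of values dropped under MAR and MNAR

    # MCAR: each value is dropped with the same probability, independent of x and y.
    mcar = [rng.random() < rate for _ in range(n)]

    # MAR: y is dropped where the observed covariate x is largest.
    top_x = set(sorted(range(n), key=lambda i: x[i], reverse=True)[:k])
    mar = [i in top_x for i in range(n)]

    # MNAR: y is dropped where y itself is largest, so the cause is unobserved.
    top_y = set(sorted(range(n), key=lambda i: y[i], reverse=True)[:k])
    mnar = [i in top_y for i in range(n)]
    return mcar, mar, mnar


x = [float(i) for i in range(10)]
y = [9.0 - v for v in x]
mcar, mar, mnar = make_masks(x, y, rate=0.3)
```

Under the MNAR mask the largest values of `y` are systematically removed, so the mean of the remaining values underestimates the true mean, which is the kind of bias that imputation is meant to alleviate.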

In terms of time series data, missing values might be the primary cause of distortion in the statistical properties of the data. In particular, for time series that are highly correlated with themselves in the past, the improper handling of missing values may result in inaccurate results in analysis tasks (

Various univariate imputation methods [

As correlations between variables generally exist in real-world data, multivariate imputation is likely to be more effective than univariate imputation. In this regard, the k-nearest neighbors (k-NN [

With the advent of big data, imputation techniques based on deep neural networks [

In this study, we analyzed the effect of the imputation process on the time series forecasting performance using several imputation techniques that are suitable for time series data and easily accessible. In the benchmark studies [

This section introduces the concept of missing data imputation in the time series and the imputation methods used in the experiments. We assume a multivariate time series

The imputation of a missing feature is defined as

For an incomplete time series, the simplest method to handle missing features is to exclude observations that contain a missing feature (

If the data contain missing parts of non-trivial size, then the missing values must be estimated using an elaborate procedure rather than the simple approaches mentioned above. In general, two types of imputation scenarios exist for replacing missing values with plausible values: univariate and multivariate imputation. The most intuitive technique for univariate imputation is the LOCF method, which carries forward the last observation before the missing data. Recalling the example above, the missing feature of
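The univariate techniques compared in this study (mean substitution, LOCF, and NOCB, which mirrors LOCF by carrying the next observation backward) can be sketched for a single variable as follows. This is a minimal illustration in plain Python, assuming missing entries are encoded as `None`; it is not the library implementation used in the experiments.

```python
def mean_substitution(series):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]


def locf(series):
    """Last observation carried forward; a leading gap stays missing."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out


def nocb(series):
    """Next observation carried backward; a trailing gap stays missing."""
    return locf(series[::-1])[::-1]


series = [1.0, None, None, 4.0, None]
print(mean_substitution(series))  # [1.0, 2.5, 2.5, 4.0, 2.5]
print(locf(series))               # [1.0, 1.0, 1.0, 4.0, 4.0]
print(nocb(series))               # [1.0, 4.0, 4.0, 4.0, None]
```

The last line shows a practical caveat: LOCF cannot fill a gap at the start of a series and NOCB cannot fill one at the end, so the two are sometimes applied in succession.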

For a multivariate time series, the correlations between variables must be modeled. The EM algorithm [
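As one simple way of exploiting relationships between variables, the k-NN idea can be sketched as below: a missing feature is replaced by the average of that feature over the k most similar observations. This is a simplified sketch under stated assumptions (missing entries encoded as `None`, similarity measured by root-mean-square difference over the features both rows have observed, unweighted averaging); the function name is hypothetical and the code is not the library routine used in the experiments.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest donor rows."""

    def distance(a, b):
        # Compare only the features that are observed in both rows.
        shared = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((u - v) ** 2 for u, v in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which feature j is observed.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


rows = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [10.0, 100.0]]
filled = knn_impute(rows, k=2)
print(filled[2])  # [3.0, 15.0]
```

In the example, the two rows closest to `[3.0, None]` on the first feature are `[2.0, 20.0]` and `[1.0, 10.0]`, so the gap is filled with their average, 15.0. When features are on different scales, they would normally be standardized before computing distances.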

In this study, the effect of missing value imputation in the time series forecasting problem was evaluated experimentally according to the procedure shown in

We used four datasets of multivariate time series as experimental data to train and validate the time series forecasting models. The datasets were obtained from [

Dataset | Time period | Unit of sampling | # of instances | # of features | Target variable | Missingness in target variable | Missing rate (%) |
---|---|---|---|---|---|---|---|
Air quality | 03/10/2004–04/04/2005 | Hourly | 9,357 | 12 | Carbon monoxide | Yes | 13.35 |
GECCO2015-A | 11/19/2013–05/21/2014 | Minutely | 264,900 | 3 | Return temperature | No | 5.46 |
GECCO2015-B | 05/22/2014–11/21/2014 | Minutely | 264,900 | 3 | Return temperature | Yes | 15.69 |
CNNpred | 12/31/2009–11/15/2017 | Daily | 1,984 | 81 | Closed price | No | 1.86 |

The dataset (Air Quality) was originally provided to predict benzene concentrations to monitor urban pollution [

The dataset was originally provided by the Genetic and Evolutionary Computation Conference (GECCO) Industrial Challenge 2015 [

The CNNpred dataset was first published in a study for predicting stock prices using convolutional neural networks [

Depending on whether the target variable contains missing values, we applied different imputation processing to each dataset. For the datasets shown in

Long Short-Term Memory (LSTM) is a deep learning model for learning sequence data that can be applied widely, such as in natural language processing [

An input gate decides the new information to be stored in

Subsequently, the older cell state

An output gate produces the final output of an LSTM cell. First, through a sigmoid activation function,

Based on LSTM neural networks, we constructed time series forecasting models that use different training sets generated by each imputation method.

Dataset | # of hidden layers | # of hidden nodes | Batch size | Sequence length | Learning rate | Epochs |
---|---|---|---|---|---|---|
Air quality | 3 | 30 | 100 | 24 | 0.001 | 200 |
GECCO2015-A | 3 | 30 | 1,000 | 10 | 0.001 | 200 |
GECCO2015-B | 3 | 30 | 1,000 | 10 | 0.001 | 200 |
CNNpred | 3 | 30 | 30 | 7 | 0.001 | 200 |
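The sequence length in the table determines how an imputed series is cut into training samples. The sketch below shows one common way to do this, assuming one-step-ahead forecasting (the forecasting horizon is an assumption made here for illustration); the function name `make_windows` is hypothetical.

```python
def make_windows(rows, target_index, seq_len):
    """Slice a multivariate series into (input window, next target value) pairs.

    rows is a list of feature vectors in chronological order, with no gaps
    (that is, after imputation).
    """
    inputs, targets = [], []
    for start in range(len(rows) - seq_len):
        inputs.append(rows[start:start + seq_len])              # seq_len consecutive steps
        targets.append(rows[start + seq_len][target_index])     # value at the next step
    return inputs, targets


# Toy series of 30 hourly observations with 2 features; the target is feature 1.
rows = [[float(t), 10.0 * t] for t in range(30)]
inputs, targets = make_windows(rows, target_index=1, seq_len=24)
print(len(inputs), len(inputs[0]), targets[0])  # 6 24 240.0
```

With a sequence length of 24, as used for the hourly Air quality data, each sample covers one day of observations. Because every window must be complete, a single missing value would otherwise invalidate up to `seq_len` samples, which is one reason imputation matters for this kind of model.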

To evaluate the performance of the forecasting models, two loss functions,

where

Method | Air quality MAE | Air quality WMAPE | GECCO2015-A MAE | GECCO2015-A WMAPE | GECCO2015-B MAE | GECCO2015-B WMAPE | CNNpred MAE | CNNpred WMAPE |
---|---|---|---|---|---|---|---|---|
Mean substitution | | | | | | | | |
LOCF | | | | | | | | |
NOCB | | | | | | | | |
EM [ | | | | | | | | |
k-NN [ | | | | | | | | |
MICE [ | | | | | | | | |
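For reference, the two metrics reported in the table can be computed as sketched below. The definitions are assumed to be the conventional ones: MAE is the mean of the absolute errors, and WMAPE is the sum of absolute errors divided by the sum of absolute actual values, expressed here as a percentage. Whether the study reports WMAPE as a percentage or a fraction is not recoverable from the text, so the factor of 100 is an assumption.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def wmape(actual, predicted):
    """Weighted mean absolute percentage error, in percent (assumed convention)."""
    total_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    return 100.0 * total_error / sum(abs(a) for a in actual)


actual = [2.0, 4.0, 4.0]
predicted = [1.0, 5.0, 4.0]
print(round(mae(actual, predicted), 4))  # 0.6667
print(wmape(actual, predicted))          # 20.0
```

Unlike MAE, WMAPE is scale-free, which makes it easier to compare results across datasets whose target variables have different magnitudes.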

In summary, we confirmed that k-NN outperformed the other imputation methods on three of the four datasets. Only on GECCO2015-B, whose target variable contained many missing parts, did it rank second. If missing values are few and the target variable is complete in a given dataset, k-NN is an attractive imputation technique for achieving stable time series forecasting performance. In addition, we conclude from the results that multivariate imputation methods are generally superior to univariate imputation methods, because most real-life time series are multivariate and include relationships between variables.

In terms of threats to validity, we investigated only a small set of conventional imputation methods, not including many state-of-the-art imputation techniques (

Missing values are a significant obstacle in data analysis. In time series forecasting, in particular, handling missing values in massive time series data is challenging. In this study, we evaluated the effects of imputation methods for replacing missing values with estimated values. We attempted to indirectly validate the imputation methods based on the performances of time series forecasting models, instead of using an approach that generates virtual missing data by simulation. The experimental results show that k-NN yielded the best model performance among the selected imputation methods.

Owing to the limitations of the results, we plan to conduct a more sophisticated benchmark study that can be extended to more imputation approaches, including machine learning techniques, while considering several missing-data scenarios. In addition, by conducting an experiment to investigate the efficacy of the imputation process in reconstructing missing values introduced by simulation, we hope to investigate the imputation effect more comprehensively.

The authors gratefully acknowledge the support of the Contents Convergence Software Research Institute and the National Research Foundation of Korea.