Machine Learning and Classical Forecasting Methods Based Decision Support Systems for COVID-19

From late 2019 to the present day, the coronavirus outbreak has tragically affected the whole world and killed tens of thousands of people. Many countries have taken very stringent measures to alleviate the effects of the coronavirus disease 2019 (COVID-19), and these measures are still being implemented. In this study, various machine learning techniques are implemented to predict future confirmed cases and mortality numbers. With these models, we have tried to shed light on the future in terms of possible new measures or updates to the current ones. Support Vector Machines (SVM), Holt-Winters, Prophet, and Long Short-Term Memory (LSTM) forecasting models are applied to the novel COVID-19 dataset. According to the results, the Prophet model gives the lowest Root Mean Squared Error (RMSE) score compared to the other three models. Based on this model, a projection of future COVID-19 numbers for Turkey has been drawn, with the aim of shaping the current measures against the coronavirus.


Introduction
After the appearance of COVID-19 in December 2019 in Wuhan, China, it quickly spread to almost all countries and ceased to be a problem of China alone. While the world is still trying to understand the COVID-19 virus, an unknown enemy, it also has to fight it, and it has become clear that managing the process under such uncertainty is as difficult as fighting the virus itself. An all-out struggle with a virus whose epidemiological characteristics are not yet fully known has led to the emergence of economic problems as well as health problems. The rapid and easy transmission of the virus and the lack of a proven treatment have brought the healthcare systems of countries to a standstill. In addition to coping with shortages of intensive care capacity, equipment, and staff, many countries try to maintain economic activity while keeping the infection under control with restrictions and prohibitions. The timing and degree of the restrictions imposed by states on societies can be cited as a vital factor in controlling the transmission of the virus.
Providing quality service to patients while optimizing time and limited resources such as medical personnel, medical supplies, and protective equipment is one of the key points of the struggle. Therefore, it is vital to make successful predictions within the scope of the fight against the virus and the disease. This study aims to support the efficient use of limited resources by estimating the number of future infected patients and planning resources in accordance with that number. From these estimates, the number of patients expected on a given date and the amount of resources that will be needed can be calculated. With this information, resources can be reorganized within a country or transferred to another country that lacks them. As a result, our work can be described as a fundamental study that forms the infrastructure of a global resource allocation activity.
The number of studies in the literature, from the spread of the COVID-19 virus to the present day, is limited. In particular, studies applying techniques such as machine learning and deep learning have received even less attention. Here, we briefly review studies related to COVID-19 and emphasize how this study can contribute to the literature. Jenny et al. [Jenny, Jenny, Gorji et al. (2020)] studied six different scenarios. In their results, increasing test numbers and maintaining social distance decrease the number of infected people and deaths compared to the scenario in which no mitigation activity is envisaged. The developed model suggests that testing strategies are as effective as social distancing but have lower economic costs. Liu et al. [Liu, Magal, Seydi et al. (2020)] showed the effects of implementing major government public policy measures in a model they developed using a constant propagation rate in the early exponential growth of the COVID-19 epidemic. Rossa et al. [Rossa, Lee, Luo et al. (2020)] used the generalized logistic growth model (GLM), the Richards model, and sub-epidemic wave methods, which are used for short-term estimation of infectious diseases such as SARS, Ebola, pandemic influenza, and dengue, to estimate COVID-19 case numbers in Hubei and other provinces of China 5, 10, and 15 days ahead of February 9, 2020. Funk et al. [Funk, Camacho, Kucharski et al. (2018)] used data from the 2013-2016 West African Ebola epidemic and combined flexibility with mechanistic models to incorporate uncertainty about epidemic dynamics, presenting a model for combating future epidemics. Pirouz et al. [Pirouz, Haghshenas and Piro (2020)] used the Group Method of Data Handling (GMDH) algorithm and regression analysis to predict confirmed cases and achieved successful results in their case study with data from Hubei, China. Li et al. [Li, Qin, Xu et al. (2020)] proposed a deep learning model for the detection of COVID-19, using 4356 volumetric chest CT exams as the dataset. Wu et al. [Wu, Leung and Leung (2020)] used data from Dec 31, 2019 to Jan 28, 2020 on the number of cases exported from Wuhan internationally and proposed a Markov Chain Monte Carlo based forecasting model for the potential domestic and international spread of COVID-19. Allam et al. [Allam and Jones (2020)] pointed to universal data sharing within smart city networks and to the benefits of artificial intelligence for pandemic disasters. Dep et al. [Dep and Majumdar (2020)] used the auto-regressive integrated moving average (ARIMA) method with time-dependent parameters to estimate the reproduction number of COVID-19. Randhawa et al. [Randhawa, Soltysiak, El Roz et al. (2020)] combined supervised machine learning and digital signal processing for genome analysis, augmented by a decision tree approach to the machine learning component and a Spearman's rank correlation coefficient analysis to validate the proposed methodology. They achieved 100% accurate classification of the COVID-19 virus sequences. Barstugan et al. [Barstugan, Ozkaya and Ozturk (2020)] used 150 CT images to identify COVID-19 patients. They applied different feature selection methods and achieved 99.68% accuracy with 10-fold cross validation and the Grey-Level Size Zone Matrix. Jiang et al. [Jiang, Coffee, Bari et al. (2020)] proposed an artificial intelligence (AI) framework to provide rapid clinical decision-making support. They used AI with predictive analytics to identify severe cases, with an accuracy rate between 70% and 80%.
In this study, we aim to create a prediction model that correctly estimates the future course of COVID-19 in the top seven countries in terms of confirmed cases and deaths. In this way, countries can take extra measures against the virus or plan better for when the measures taken can be loosened. The remainder of this study is structured as follows. In Section 2, we briefly explain the forecasting models used. In Section 3, we give the results of each method for the chosen countries. Finally, in Section 4, we conclude the study with the conclusion and discussion part.

Methodology
In this study, we have used four different regression models: Support Vector Machines (SVM), Holt-Winters' forecasting, Prophet forecasting, and LSTM. In this section, the methodological foundation of each method is briefly explained.

Support vector machines (SVM)
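Support Vector Machines are commonly used for linearly and non-linearly separable classification problems [Chauhan, Dahiya and Sharma (2019)]. They can also be extended to regression problems, in which case the method is called Support Vector Regression. To understand the methodology behind it, assume we are given a training dataset $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where each $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. The decision function is given by Eq. (1):
$$f(x) = w^{T}\varphi(x) + b \tag{1}$$
with respect to the weight vector $w$ and the bias $b \in \mathbb{R}$, where $\varphi$ denotes a non-linear mapping from $\mathbb{R}^d$ to a higher dimensional space. To ensure that $f(x)$ is as flat as possible, it must be found with the minimum magnitude of weights, as shown in Eq. (2):
$$\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2} \tag{2}$$
subject to all residuals having an absolute value less than $\varepsilon$; or, in equation form (see Eq. (3)):
$$\lvert y_i - (w^{T}\varphi(x_i) + b) \rvert \le \varepsilon, \quad i = 1, \ldots, n \tag{3}$$
In general, it is not possible to meet this condition for all points. Thus, we can add slack variables $\xi_i^{+}$ and $\xi_i^{-}$ to provide some flexibility and reformulate the problem as shown in Eq. (4):
$$\min_{w,\,b,\,\xi^{+},\,\xi^{-}} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left(\xi_i^{+} + \xi_i^{-}\right) \quad \text{s.t.} \quad y_i - f(x_i) \le \varepsilon + \xi_i^{+}, \;\; f(x_i) - y_i \le \varepsilon + \xi_i^{-}, \;\; \xi_i^{+}, \xi_i^{-} \ge 0 \tag{4}$$
where $C$ is a fixed constant that controls the penalty imposed on observations that lie outside the margin and helps to avoid overfitting. Ultimately, one can define a loss function that ignores the error if the distance between the predicted and observed values is less than or equal to $\varepsilon$. It can be formulated as shown in Eq. (5):
$$L_{\varepsilon}(y, f(x)) = \max\left(0, \; \lvert y - f(x) \rvert - \varepsilon\right) \tag{5}$$
For mathematical convenience, the optimization problem described above can be solved in dual form.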
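The loss that ignores errors no larger than the tolerance can be illustrated with a few lines of Python. This is only a minimal sketch: the function names and the sample values are hypothetical and are not taken from the study.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Return the epsilon-insensitive loss for a single prediction."""
    residual = abs(y_true - y_pred)
    # Errors inside the epsilon tube cost nothing; outside it they grow linearly.
    return max(0.0, residual - epsilon)


def mean_epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Average the loss over paired sequences of observations and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    losses = [epsilon_insensitive_loss(t, p, epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


# Hypothetical values: the first prediction falls inside the tube, the second outside.
inside = epsilon_insensitive_loss(10.0, 10.5, 1.0)
outside = epsilon_insensitive_loss(10.0, 13.0, 1.0)
```

A larger tolerance makes the fitted function flatter because more residuals are treated as zero.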

Holt-Winters' forecasting
The Holt-Winters' forecasting model was developed to capture the effect of seasonality on time series [Hansun, Charles and Indrati (2019)]. It contains a prediction equation and three smoothing equations, one for the level $\ell_t$, one for the trend $b_t$, and one for the seasonal component $s_t$, with the corresponding smoothing parameters $\alpha$, $\beta^{*}$, and $\gamma$. In addition, $m$ denotes the number of seasons in a year. For example, for weekly data, $m$ is equal to 52. There exist two variations of this method, additive and multiplicative. The additive model is used when the seasonal variations are almost constant throughout the series, and the multiplicative model is used when the seasonal variations change proportionally to the level of the series. In our study, we have used the additive model, which is formulated as in Eqs. (6)-(9):
$$\hat{y}_{t+h|t} = \ell_t + h b_t + s_{t+h-m(k+1)} \tag{6}$$
$$\ell_t = \alpha (y_t - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \tag{7}$$
$$b_t = \beta^{*} (\ell_t - \ell_{t-1}) + (1 - \beta^{*}) b_{t-1} \tag{8}$$
$$s_t = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1 - \gamma) s_{t-m} \tag{9}$$
where $k$ is the integer part of $(h - 1)/m$, which ensures that the estimates of the seasonal indices come from the last season of the data sample. The level equation is a weighted average between the seasonally adjusted observation and the non-seasonal forecast, which are $(y_t - s_{t-m})$ and $(\ell_{t-1} + b_{t-1})$, respectively, for time $t$. The seasonal equation is a weighted average between the current seasonal index, $(y_t - \ell_{t-1} - b_{t-1})$, and the seasonal index of the same season one period earlier. The seasonal component can also be written as in Eq. (10):
$$s_t = \gamma^{*} (y_t - \ell_t) + (1 - \gamma^{*}) s_{t-m} \tag{10}$$
If $\ell_t$ is substituted from the smoothing equation for the level, Eq. (7), we obtain the seasonal formulation shown in Eq. (11):
$$s_t = \gamma^{*} (1 - \alpha)(y_t - \ell_{t-1} - b_{t-1}) + \left[1 - \gamma^{*}(1 - \alpha)\right] s_{t-m} \tag{11}$$
Eq. (11) is identical to the seasonal component equation shown in Eq. (9) with $\gamma = \gamma^{*}(1 - \alpha)$. Since $0 \le \gamma^{*} \le 1$, it follows that $\gamma$ is greater than or equal to 0 and less than or equal to $1 - \alpha$.
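The additive recursions for level, trend, and season can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in the study: the initialisation from the first two seasons, the fixed smoothing parameters, and the sample series are all assumptions.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, horizon):
    """Run additive Holt-Winters over y and forecast `horizon` steps ahead.

    y: observations (at least 2*m values); m: season length;
    alpha, beta, gamma: smoothing parameters for level, trend, and season.
    """
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons to initialise")
    # Simple initialisation (an assumption): means of the first two seasons.
    first = sum(y[:m]) / m
    second = sum(y[m:2 * m]) / m
    level = first
    trend = (second - first) / m
    season = [y[i] - first for i in range(m)]  # season[i] is the index at time i

    for t in range(m, len(y)):
        s_prev = season[t - m]
        prev_level, prev_trend = level, trend
        # Level: seasonally adjusted observation vs. non-seasonal forecast.
        level = alpha * (y[t] - s_prev) + (1 - alpha) * (prev_level + prev_trend)
        # Trend: change in level vs. previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        # Season: current seasonal index vs. index of the same season one period ago.
        season.append(gamma * (y[t] - prev_level - prev_trend) + (1 - gamma) * s_prev)

    n = len(y)
    forecasts = []
    for h in range(1, horizon + 1):
        k = (h - 1) // m  # integer part of (h - 1) / m
        forecasts.append(level + h * trend + season[n - 1 + h - m * (k + 1)])
    return forecasts


# Hypothetical series with a period of three and no trend.
demo_forecast = holt_winters_additive([10, 20, 30] * 3, m=3, alpha=0.5,
                                      beta=0.5, gamma=0.5, horizon=4)
```

In practice the smoothing parameters are usually chosen by minimising the in-sample error rather than fixed by hand.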

Prophet forecasting
The Prophet forecasting model is based on the idea of fitting a generalized additive model. Prophet was published by Facebook's Core Data Science team, and the main study can be found in Taylor et al. [Taylor and Letham (2018)]. Its software is available in Python and R for forecasting time series data. Prophet is based on a model in which non-linear trends, weekly and annual seasonality, and holiday effects are taken into account. Some of the strengths of the Prophet model are its robustness to missing data, large outliers, and shifts in the trend. Besides, it can produce reasonably good estimates from messy data without manual effort. The Prophet software has its own data structure for handling time series. To create estimates, it needs two main columns called "ds" and "y": "ds" holds the time stamps of the series and "y" holds the corresponding values. It outputs three main quantities: i) $\hat{y}$, the estimates of the model, ii) the lower limit of $\hat{y}$, and iii) the upper limit of $\hat{y}$.
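The "ds"/"y" layout and the three outputs can be sketched as follows. The helper names and the sample counts are hypothetical. The second function assumes the `pandas` and `prophet` packages are installed and uses default model settings, which may differ from the configuration used in this study.

```python
from datetime import date, timedelta


def to_prophet_rows(start_day, values):
    """Pair consecutive daily values with ISO dates under the keys 'ds' and 'y'."""
    return [
        {"ds": (start_day + timedelta(days=i)).isoformat(), "y": float(v)}
        for i, v in enumerate(values)
    ]


def forecast_with_prophet(rows, horizon_days):
    """Fit Prophet on the rows and return the estimate with its lower and upper limits.

    pandas and prophet are imported here so the helper above works without them.
    """
    import pandas as pd
    from prophet import Prophet

    frame = pd.DataFrame(rows)
    frame["ds"] = pd.to_datetime(frame["ds"])
    model = Prophet()  # default settings; an assumption, not the study's setup
    model.fit(frame)
    future = model.make_future_dataframe(periods=horizon_days)
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]


# Hypothetical cumulative counts, used only to show the expected layout.
rows = to_prophet_rows(date(2020, 4, 1), [100, 130, 170])
```

Calling `forecast_with_prophet(rows, 150)` on a real series would return one row per day, including 150 future days.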

Long short-term memory (LSTM)
Long Short-Term Memory (LSTM) networks are an extension of recurrent neural networks (RNN). Unlike a traditional neural network, an LSTM is designed to take into account important information learned from previous time steps. The more detailed mathematical foundation of the LSTM model can be found in Lipton et al. [Lipton, Berkowitz and Elkan (2015)]. However, the main formulation of the model is given in Eqs. (12)-(17).
Here, $\hat{c}^{\langle t \rangle}$ refers to the value of the memory cell. In other words, it is the "important" information from the previous time step. $W_c$ denotes the weight parameters for the memory cell. $\Gamma_u$, $\Gamma_f$, and $\Gamma_o$ refer to the update, forget, and output gates, and $W_u$, $W_f$, and $W_o$ are their respective weight parameters. If $\Gamma_u$ takes the value of 1, $\hat{c}^{\langle t \rangle}$ is going to be updated. In other words, the new "important" information will be stored. Based on the weight parameters, $\Gamma_u$, $\Gamma_f$, and $\Gamma_o$ are recalculated, and the output of a neuron can be calculated based on Eqs. (13)-(15).
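For orientation, one widely used way of writing an LSTM cell with update, forget, and output gates is given below. It is a standard textbook formulation and only an assumption about the form of Eqs. (12)-(17), which are not reproduced here. The symbols $a^{\langle t \rangle}$ (hidden state), $x^{\langle t \rangle}$ (input), $\sigma$ (logistic function), $\odot$ (element-wise product), and the bias terms $b_c$, $b_u$, $b_f$, $b_o$ are introduced for illustration.

```latex
\begin{align}
\hat{c}^{\langle t \rangle} &= \tanh\left(W_c\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c\right) \\
\Gamma_u &= \sigma\left(W_u\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u\right) \\
\Gamma_f &= \sigma\left(W_f\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f\right) \\
\Gamma_o &= \sigma\left(W_o\,[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o\right) \\
c^{\langle t \rangle} &= \Gamma_u \odot \hat{c}^{\langle t \rangle} + \Gamma_f \odot c^{\langle t-1 \rangle} \\
a^{\langle t \rangle} &= \Gamma_o \odot \tanh\left(c^{\langle t \rangle}\right)
\end{align}
```

In this form, an update gate close to 1 writes the candidate value into the cell state, while a forget gate close to 1 keeps the previous cell state.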

Computational results
In this section, we have used classical regression models and machine learning models to create a prediction model for the daily COVID-19 cases of the seven countries. To compare the methods, we have evaluated model performance with the Root Mean Squared Error (RMSE) metric. The data used in the study were obtained from the COVID-19 website of Johns Hopkins University (https://coronavirus.jhu.edu/map.html).
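The RMSE used for comparison is the square root of the mean squared difference between observed and predicted values. A minimal sketch with hypothetical numbers:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: every prediction is off by exactly 2.
example_rmse = rmse([1, 1, 1, 1], [3, 3, 3, 3])
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.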

COVID cases in the top seven countries
Throughout the study, the top seven countries in terms of the number of cases are considered. These countries are the USA, Spain, Italy, France, Germany, the UK, and Turkey. As of April 28, 2020, there is a total of 2,919,40 confirmed cases worldwide, and 1,974,244 of them come from these seven countries. Tab. 1 shows the basic numbers for those seven countries, and Fig. 1 illustrates the confirmed cases in all seven countries from the day of the first appearance of the virus until April 28. As shown, the total number of cases increases every day, but towards the end of April the rate of increase slows. In parallel with the total number of confirmed cases, the number of recovered patients also keeps increasing. The historical flow of this situation is shown in Fig. 2, from the day of the first appearance of the virus until April 28.

Figure 2: Recovered Cases over date for the top seven countries
Fig. 3 shows the weekly progress of confirmed, recovered, and death cases for the top seven countries. In this figure, deaths show the smallest rate of increase, followed by recovered cases and total confirmed cases, respectively. This may be because the virus has a more lethal effect on those over 65 years of age (e.g., according to Verity et al. [Verity, Okell, Dorigatti et al. (2020)], about 81% of the patients who died are over 60 years old). In other words, a small part of the total population is affected much more severely. Therefore, deaths are expected to increase more slowly than total and recovered cases.

Figure 3: Weekly progress of confirmed, recovered, and death cases
Finally, the mortality rate and recovery rate of COVID-19 in those seven countries are shown in Fig. 4. The mortality rate is calculated by dividing the total number of deaths by the total number of confirmed cases, and the recovery rate is calculated by dividing the total number of recovered cases by the total number of confirmed cases. As can be seen from Fig. 4, the recovery rate is rising much faster than the mortality rate and has been much higher than average since the beginning of April. This is the main source of optimistic scenarios for the future.
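The two ratios can be computed directly from cumulative counts. The sketch below uses hypothetical numbers rather than figures from the dataset.

```python
def case_rates(confirmed, recovered, deaths):
    """Return mortality and recovery rates as fractions of confirmed cases."""
    if confirmed <= 0:
        raise ValueError("confirmed must be positive")
    return {
        "mortality_rate": deaths / confirmed,
        "recovery_rate": recovered / confirmed,
    }


# Hypothetical cumulative counts for one country on one day.
rates = case_rates(confirmed=1000, recovered=250, deaths=20)
```

Both rates share the same denominator, so they can be compared directly on one chart.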

Clustering countries
To cluster the seven countries, we first tried to find the optimum number of clusters. To do this, we used the elbow method and the silhouette score, with confirmed, recovered, and death cases as the features of the data. As shown in Fig. 5, both methods choose 3 as the optimum number of clusters.

Figure 5: The results of Elbow and Silhouette methods
By setting the number of clusters to 3, we have applied the k-means clustering algorithm, and the results are shown in Tab. 2. Based on Tab. 2, Spain, Italy, and France are in one cluster, Germany, the UK, and Turkey are in a second cluster, and the US is in a cluster of its own.
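The procedure of scanning candidate cluster counts and scoring each k-means result can be sketched in plain Python. Everything below is illustrative: the seven feature vectors are hypothetical, already-scaled stand-ins for (confirmed, recovered, deaths), and seeding the centroids with the first k points is a simplification rather than the initialisation used in the study.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, max_iterations=100):
    """Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: distance(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances, the quantity plotted for the elbow method."""
    return sum(distance(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def silhouette_score(points, labels):
    """Mean silhouette over all points; requires at least two clusters."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        own = [distance(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        a = sum(own) / len(own)
        b = min(
            sum(distance(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, already-scaled (confirmed, recovered, deaths) features for seven units.
points = [
    (1.0, 1.0, 1.0), (5.0, 5.0, 5.0), (9.0, 9.0, 9.0), (1.2, 0.9, 1.1),
    (5.1, 4.8, 5.2), (9.2, 9.1, 8.9), (4.9, 5.2, 5.0),
]
results = {}
for k in (2, 3, 4):
    labels, centroids = k_means(points, k)
    results[k] = (inertia(points, labels, centroids), silhouette_score(points, labels))
best_k = max(results, key=lambda k: results[k][1])
```

The elbow method looks for the k after which inertia stops dropping sharply, while the silhouette score is simply maximised, so the two criteria can be read from the same `results` table.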

Prediction of confirmed cases and deaths
In this section, we have created multiple prediction models for confirmed cases and deaths for the top seven countries. Tab. 3 gives the RMSE scores of the SVM, Holt-Winters, Facebook's Prophet, and LSTM methods for those seven countries. We have used 95% of the dataset as the training set and the remaining 5% for testing. Based on Tab. 3, the best method is the Prophet forecasting model, which has the lowest RMSE score for all countries. In addition to this, we have used the same methods for future predictions. By using the trained models, we have forecasted possible deaths until May 2, 2020. As of the preparation of this paper, the actual number of deaths is known only until April 28. Thus, the columns of actual values are filled only up to April 28.
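A 95%/5% split of a time series is normally taken in chronological order, so that the test set contains the most recent observations. The sketch below assumes such an ordered split. The source does not state how the split was drawn, and the function name and sample series are hypothetical.

```python
def chronological_split(series, train_fraction=0.95):
    """Split an ordered series into an early training part and a late test part."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    cut = round(len(series) * train_fraction)
    cut = max(1, min(cut, len(series) - 1))  # keep at least one value on each side
    return series[:cut], series[cut:]


# Hypothetical series of 100 daily observations.
train, test = chronological_split(list(range(100)))
```

Shuffling before splitting would leak future observations into training, which is why the order is preserved here.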

Deep analysis of Turkey
Turkey is one of the seven countries most affected by the virus epidemic. Compared to other countries, it managed to slow the spread of the virus much faster by taking nationwide measures earlier. In this part, Turkey's virus data are examined in more detail and a projection for the future is presented. As of April 26, 2020, there are 110130 confirmed cases, 29140 recovered patients, and 2805 deaths nationwide. Fig. 6 shows the growth rate for different types of cases in Turkey. One of the most remarkable points is that the partial effect of the curfew implemented in Turkey as of April 22 can be seen in the change of confirmed cases during the last 4 days. This suggests that social isolation is one of the most important factors in preventing the spread of the virus.

Figure 6: Growth rate for different types of cases in Turkey
On the other hand, Fig. 7 shows the mortality and recovery rates from the date of the first death in Turkey. It suggests that by the end of April, the recovery rate increased much faster than the mortality rate. As shown in Tab. 2, Turkey is the country with the lowest mortality rate, 2%, among the seven countries. It also ranks third in terms of recovery rate, with 23.73%. Besides, when we compare the total number of cases with other countries, as shown in Fig. 8, the date of the first case in Turkey is much later than in the other six countries. It took 61, 66, 61, 77, 71, and 74 days in Italy, the USA, Spain, the UK, Germany, and France, respectively, to reach a number of confirmed cases equivalent to Turkey's. One of the points to be considered here is that the number of cases increased rapidly in a very short time after the first case in Turkey. One of the reasons may be that Turkey's daily tests reached about forty thousand in a very short time (https://covid19.saglik.gov.tr/).
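A count of the days a country needed to reach a given case level can be derived from its cumulative series. The sketch below assumes the clock starts on the first day with a non-zero count. The exact counting rule behind the figures above is not stated, and the sample series is hypothetical.

```python
def days_to_reach(cumulative, threshold):
    """Days from the first non-zero count until the series reaches the threshold.

    Returns None if the series never starts or never reaches the threshold.
    """
    first_day = next((i for i, v in enumerate(cumulative) if v > 0), None)
    if first_day is None:
        return None
    for i in range(first_day, len(cumulative)):
        if cumulative[i] >= threshold:
            return i - first_day
    return None


# Hypothetical cumulative confirmed cases for one country.
elapsed = days_to_reach([0, 0, 1, 5, 20, 80, 300], threshold=80)
```

Aligning each country on its own first case removes the effect of different outbreak start dates from the comparison.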

Figure 8: Comparison of Turkey with the other six countries in terms of confirmed cases
Another comparison between Turkey and the six other countries concerns the number of deaths. As shown in Fig. 9, Turkey appears to be one step ahead of other countries in this regard. The total number of deaths in Turkey is 2805 from the date of the first death until April 26, and these deaths took place over a total of 41 days. It took 26, 30, 21, 28, 34, and 44 days in Italy, the USA, Spain, the UK, Germany, and France, respectively, to reach a number of deaths equivalent to Turkey's. As with confirmed cases, one point to be considered here is that the number of cases increased rapidly in a very short time after the first case in Turkey, which may be a result of daily test numbers in Turkey reaching about forty thousand in a very short time.
Measures against the coronavirus in Turkey have been firmly implemented by the government since mid-March and are still underway. Recently, however, a slight loosening of these prohibitions has begun to be discussed in the national press. By creating a prediction model for the future with the Prophet forecasting model, which has the lowest margin of error among the methods applied, we would like to see how applicable and how risky those prospective decisions are, and we aim to predict in what period the mortality rate would approach zero. Fig. 10 shows the 150-day forecasting model for Turkey. Based on Fig. 10, when considering the lower limit of the model, it is projected that the number of deaths in Turkey will approach zero between mid-September and early October. Fig. 10 also shows the prediction of the number of cases for the next 150 days. Based on Fig. 10, when considering the lower limit of the model, it is projected that the total number of confirmed cases in Turkey will approach zero from the beginning of August.
Beyond a certain point, long-horizon predictions are built on earlier predicted values rather than on observations, so their error may grow with the horizon. Thus, short-term estimates, such as weekly ones, might give better guidance to policymakers. Based on those estimates, new measures can be implemented, or existing measures can be revised, to minimize the effect of COVID-19.

Conclusions and discussions
In this study, we have analyzed the COVID-19 data of the top seven countries in terms of confirmed cases. We first analyzed all seven countries in terms of basic descriptive statistics. Then we clustered those seven countries. Based on the elbow method and silhouette scores, the US differs from the other six countries, while Spain, Italy, and France form one cluster and the UK, Germany, and Turkey form another. Then, we used different prediction models for those seven countries. These models are Support Vector Machines (SVM), Holt-Winters, Facebook's Prophet, and Long Short-Term Memory (LSTM). Among those models, Facebook's Prophet forecasting model gives the lowest RMSE score for all countries. Besides, by using the Prophet method, we estimated Turkey's deaths and confirmed cases for the next 150 days. Based on these predictions, it is projected that the number of deaths in Turkey will approach zero between mid-September and early October and that the total number of confirmed cases in Turkey will approach zero from the beginning of August. In light of these predictions, loosening the measures taken to minimize the effects of the coronavirus epidemic in Turkey too soon might cause a second wave of the epidemic.
In future studies, these forecasting models can be used for worldwide planning to decide how resources should be distributed across countries. For example, health workers or medical equipment such as respirators in a country where the epidemic is predicted to end early can be transferred to a country where the epidemic is expected to end later. By doing this, the effects of the virus can be alleviated or ended all over the world.

Funding Statement:
The author(s) received no specific funding for this study.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.