Coronavirus disease 2019 (COVID-19) has been declared a pandemic, infecting and killing people on a nearly unprecedented scale. As more people are infected each day, it continues to pose a serious threat worldwide. As a result, healthcare systems around the world are facing shortages of medical space such as wards and sickbeds. In most cases, otherwise healthy people experience tolerable symptoms when infected; in other cases, however, patients suffer severe symptoms and require treatment in an intensive care unit. Hospitals therefore need to identify patients at high risk of death and treat them first. To address this problem, a number of mortality prediction models have been developed, but they lack interpretability and generalizability. To address these issues, we propose a COVID-19 mortality prediction model that can provide new insights. We identified blood factors that affect the prediction of COVID-19 mortality, focusing in particular on dependency reduction using partial correlation and mutual information. Next, we used the Class-Attribute Interdependency Maximization (CAIM) algorithm to bin continuous values. We then used Jensen–Shannon divergence (JSD) and Bayesian posterior probability to create less redundant and more accurate rules, producing a ruleset in which each rule carries its own posterior probability. The extracted rules are in the form of “
First appearing in 2019, COVID-19 has resulted in a total of 96,877,399 infections and 2,098,879 deaths worldwide as of this writing [
The proposed model consists of three steps: selecting features, generating items by binning continuous data, and generating rules. In the feature selection stage, we used partial correlation to obtain more accurate dependency values. In the rule generation stage, we used confidence-closed itemset mining, Jensen–Shannon divergence (JSD), and Bayesian posterior probability to extract an accurate, non-redundant ruleset. Confidence-closed itemset mining removed many useless itemsets, and because JSD prunes rules quickly and precisely using distribution distance, the calculation time was reduced. We used Bayesian posterior probability both to identify rules that are more effective for classification and to make the rules interpretable. The results showed that this model creates a small but accurate ruleset. This model provided rules of “
As a result, 14 rules were ultimately selected. The AUC score obtained using these 14 rules was 96.77% on average for the validation data. The F1-score was 92.8% for the test data. The results confirmed that our model had better performance than a previously published model [
The rules presented in this study are statistical rules. All predictions are completely extracted using data. The final goal of this study is to help medical staff make the best decision, rather than perform an absolute judgment. The contributions of this study are as follows:
- It creates a new rule extraction method that produces fast, accurate, and interpretable results.
- It identifies important blood factors and their split points.
- It represents combinations of important elements and their influence on the outcome in the form of a ruleset.
- It proposes a mortality prediction model with better performance and generality than previous studies.
The rest of this paper is organized as follows. In Section 2, we discuss related studies. Section 3 presents the dataset used in this study. Section 4 is focused on the mortality detection model, and mainly on introducing the proposed rule extraction model. In Section 5, we present the experiment and evaluation. Finally, we conclude our study in Section 6.
Most prior studies examining COVID-19 mortality using blood samples have focused on the risk of mortality using elements of blood data, such as D-dimer and lymphocytes, rather than making a prediction model [
Rule extraction using Bayesian posterior probability is a probability-based rule extraction model that can extract items with high posterior probability to generate a list of rules. It stems from the previously proposed Bayesian Rule List (BRL) [
The Bayesian approach has been proposed for use as a rule mining model [
The dataset used in this study consisted of blood samples from patients infected with COVID-19 and was provided by Yan
As shown in
Feature selection involved two tasks. First, we extracted important features by reducing the dependency between features and analyzing the interdependency between the target variable and the features. To reduce dependent features, we used partial correlation, a form of correlation that controls for a subset of variables with additional effects and analyzes the correlation between specific features. Ordinary correlation indicates the relationship between features without controlling for other features, and therefore cannot differentiate between direct and indirect effects [
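Partial correlation for all feature pairs can be computed from the inverse of the correlation matrix. The sketch below is an illustration under assumed synthetic data, not the authors' implementation:

```python
import numpy as np

def partial_correlation(X):
    """Partial correlation matrix: the correlation between each pair of
    columns after controlling for all remaining columns, obtained by
    inverting the correlation matrix (precision matrix) and normalizing."""
    corr = np.corrcoef(X, rowvar=False)
    prec = np.linalg.pinv(corr)          # precision matrix
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)       # rho_ij = -P_ij / sqrt(P_ii * P_jj)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

# Synthetic example: x2 depends directly on x0, x1 is independent
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + 0.1 * rng.normal(size=500)
P = partial_correlation(np.column_stack([x0, x1, x2]))
# |P[0, 2]| stays high (direct effect); |P[0, 1]| stays near zero
```

Unlike the plain correlation matrix, the off-diagonal entries here quantify only the direct association left after conditioning on every other feature.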
In the feature selection stage, we also used the mutual information between the target variable and the features. Mutual information quantifies the interdependency of two variables from how often events X and Y occur together relative to how often each occurs alone; if X and Y frequently occur together, they have high interdependency. Features were removed if their mutual information value fell below the mutual information threshold. For discrete features, it was calculated with
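For discrete variables, mutual information can be estimated directly from co-occurrence counts. A minimal sketch with hypothetical toy sequences (not the paper's data):

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ), in nats."""
    n = len(x)
    pxy = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (xv, yv), c in pxy.items():
        # c/n * log( (c/n) / ((px/n)*(py/n)) ) simplified to avoid re-dividing
        mi += (c / n) * np.log(c * n / (px[xv] * py[yv]))
    return mi

# Perfectly dependent pair vs. an independent pair
a = [0, 0, 1, 1] * 100
b = [0, 0, 1, 1] * 100   # identical to a -> MI = H(a) = ln 2
c = [0, 1, 0, 1] * 100   # alternates independently of a -> MI = 0
```

A feature like `c`, carrying no information about the target, would fall below the mutual information threshold and be removed.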
In the item generation step, the selected features were binned. If the value was smaller than or equal to the split point, it was set to
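Binning at a split point reduces to a simple threshold test. The 0/1 encoding below is an assumption for illustration, since the source truncates before stating the assigned values:

```python
def binarize(value, split_point):
    """Bin a continuous value at its split point: 0 if value <= split point,
    else 1 (the 0/1 encoding is assumed here for illustration)."""
    return 0 if value <= split_point else 1

# e.g., lymphocyte(%) with split point 13.0
low = binarize(10.5, 13.0)   # at or below the split point
high = binarize(20.0, 13.0)  # above the split point
```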
In this stage, the itemset with a high posterior probability was extracted, and rules were generated. The rule extraction consisted of three stages: mining the itemset, filtering with JSD, and filtering with Bayesian posterior probability.
First, we mined itemsets, filtering by support and confidence, using FP-growth [
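The paper uses FP-growth; the sketch below instead uses brute-force enumeration, which is equivalent for small item vocabularies and makes the support/confidence filtering explicit. The transactions and item names are hypothetical:

```python
from itertools import combinations

def mine_itemsets(transactions, labels, min_support, min_confidence,
                  max_len=3, target=1):
    """Enumerate candidate itemsets (brute force, for illustration) and keep
    those whose support, and confidence toward the target class, meet the
    thresholds. Returns {itemset_tuple: (support, confidence)}."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    kept = {}
    for k in range(1, max_len + 1):
        for cand in combinations(items, k):
            cover = [j for j, t in enumerate(transactions) if set(cand) <= t]
            support = len(cover) / n
            if support < min_support:
                continue
            confidence = sum(labels[j] == target for j in cover) / len(cover)
            if confidence >= min_confidence:
                kept[cand] = (support, confidence)
    return kept

# Hypothetical binned transactions and outcomes (1 = death, 0 = discharged)
tx = [{'LDH>339', 'CRP>42.3'}, {'LDH>339'}, {'CRP>42.3'}, {'LDH>339', 'CRP>42.3'}]
y = [1, 1, 0, 1]
rules = mine_itemsets(tx, y, min_support=0.25, min_confidence=0.9)
```

Here `{'LDH>339'}` survives (confidence 1.0 toward class 1) while `{'CRP>42.3'}` alone is dropped (confidence 2/3).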
A confidence-closed itemset is a closed itemset defined with respect to confidence rather than support. A closed itemset is typically defined by support: if an itemset and a superset of that itemset have the same support, the itemset is removed [
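The confidence-closed filter can be sketched as follows; as a simplification, an itemset is dropped whenever a proper superset attains the same or higher confidence (the exact tie-breaking rule is not spelled out in the source, so this is an assumption):

```python
def confidence_closed(itemsets):
    """Keep an itemset only if no proper superset reaches at least the same
    confidence (confidence analogue of closed itemsets; simplified sketch).
    `itemsets` maps frozenset -> confidence."""
    kept = {}
    for s, conf in itemsets.items():
        dominated = any(s < t and ct >= conf for t, ct in itemsets.items())
        if not dominated:
            kept[s] = conf
    return kept

# Hypothetical candidates: {'A'} is subsumed by the stronger {'A','B'}
cands = {
    frozenset({'A'}): 0.90,
    frozenset({'A', 'B'}): 0.95,
    frozenset({'A', 'C'}): 0.80,
}
closed = confidence_closed(cands)
```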
In this stage, we calculated the distance between the mined itemsets and the training dataset, using JSD as the distance measure. JSD is a variation of the Kullback–Leibler (KL) divergence, which represents the distance between two distributions. In general, the KL divergence is the expected information loss between two distributions and is used to measure their similarity. However, the KL divergence is not symmetrical between its two arguments and therefore does not qualify as a distance. JSD symmetrizes it, and is equal to
We calculated the distance between the distribution of each itemset and the distribution of the dataset. For each itemset, we constructed a multivariate normal distribution using the mean and covariance of its features. We also constructed multivariate normal distributions for the death (class = 1) and discharged (class = 0) groups using the same features. Here, we call
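The JSD itself can be sketched as below for discrete distributions over shared bins; the paper applies it to multivariate normal distributions, for which the divergence is typically estimated numerically, so the probability vectors here are illustrative assumptions:

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) over a shared discrete support (zero-probability bins of P
    contribute nothing)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric in p and q, bounded by ln 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Illustrative discretized distributions over the same bins
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.4, 0.4])
d_pq = jsd(p, q)
# Unlike KL, jsd(p, q) == jsd(q, p), so it can be used as a distance
```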
We calculated the Bayesian posterior probability of each itemset and removed those that did not exceed the threshold. This stage serves to obtain an accurate probability: without sampling, the probability would be estimated from our dataset alone, which can lead to bias. Thus, we created an approximated posterior distribution using a large number of samples and used this distribution for explanation and classification.
The posterior probability was calculated using Bayes' theorem. Since this is a binary classification problem, the binomial distribution was used as the likelihood distribution. The parameter
The denominator is expressed as follows.
Using the denominator
The posterior probability was updated by multiplying the likelihood and the prior probability. Markov chain Monte Carlo (MCMC) sampling was performed to generate an approximated Bayesian posterior probability distribution for each item [

1. Pick a distribution g(x) for sampling.
2. Choose X0, the starting point of the Markov chain.
3. When the chain is at X = xt, sample y from g(x) as the candidate for Xt+1.
4. Calculate the acceptance ratio using
5. Sample u from the uniform distribution U(x) and obtain Xt+1 using the following process.
This process (steps 3–5) is repeated until the stationary distribution appears.
In process 5,
p-beta ←
p-binom ←
posterior-distribution ←
probability ←
return probability
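The sampling procedure described above can be sketched as a Metropolis–Hastings chain targeting the posterior of a binomial success probability under a Beta prior. This is a minimal illustration, not the authors' implementation; the counts, prior parameters, step size, and burn-in are assumptions:

```python
import numpy as np

def mh_posterior(k, n, a=1.0, b=1.0, n_samples=20000, step=0.1, seed=0):
    """Metropolis-Hastings sampler for the posterior of a binomial success
    probability theta with a Beta(a, b) prior. Target density, up to a
    constant: theta^(k+a-1) * (1-theta)^(n-k+b-1)."""
    rng = np.random.default_rng(seed)

    def log_target(t):
        if t <= 0.0 or t >= 1.0:
            return -np.inf
        return (k + a - 1) * np.log(t) + (n - k + b - 1) * np.log(1 - t)

    theta = 0.5                                   # start of the chain (step 2)
    samples = []
    for _ in range(n_samples):
        cand = theta + rng.normal(0.0, step)      # candidate proposal (step 3)
        ratio = np.exp(log_target(cand) - log_target(theta))  # ratio (step 4)
        if rng.uniform() < ratio:                 # accept/reject (step 5)
            theta = cand
        samples.append(theta)
    return np.array(samples[n_samples // 2:])     # drop burn-in half

# Hypothetical counts: 18 deaths among 20 patients matching an itemset
post = mh_posterior(k=18, n=20)
```

With a Beta(1, 1) prior the exact posterior is Beta(19, 3), so the sampled mean should approach 19/22; itemsets whose posterior mass sits below the threshold are discarded.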
The itemsets were filtered based on the calculated posterior probability. Through this step, the itemsets with high posterior probability were extracted. Since our model was based on a ruleset and we were certain that the itemset was accurate with posterior probability, we filtered the superset of the itemset as described in Section 4.3.1.
The experiment was conducted using a COVID-19 patient blood dataset. The results and threshold values of the proposed model are described here. The threshold values used in this study were determined through several experiments.
First, we deleted features with a dependency value above 0.7, which was our dependency threshold. Next, we deleted features with a value lower than 0.3, which was the mutual information threshold. We then selected 10 features
In the item generation step, the continuous features were binned. First, we used the CAIM algorithm [
Feature | Split point
---|---
Lactate dehydrogenase | 339.0
neutrophils(%) | 79.4
Hypersensitive C-reactive protein | 42.3
(%)lymphocyte | 13.0
monocytes(%) | 4.2
procalcitonin | 0.095
D-dimer | 2.04
International standard ratio | 1.13
Amino-terminal brain natriuretic peptide precursor (NT-proBNP) | 305.0
albumin | 31.9
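The CAIM criterion chooses interval boundaries that maximize the interdependency between class and attribute. The sketch below restricts it to a single split (two intervals) for clarity; the full algorithm iteratively adds boundaries. The toy data are hypothetical:

```python
import numpy as np

def best_split_caim(x, y):
    """Pick the single split point maximizing the CAIM criterion
    CAIM = (1/n_intervals) * sum_r (max class count in r)^2 / (total in r).
    A simplified one-split sketch of the CAIM algorithm."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    classes = np.unique(y)
    best_caim, best_split = -np.inf, None
    candidates = (x[:-1] + x[1:]) / 2.0        # midpoints between sorted values
    for s in np.unique(candidates):
        left, right = y[x <= s], y[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        caim = 0.0
        for part in (left, right):
            counts = np.array([(part == c).sum() for c in classes])
            caim += counts.max() ** 2 / len(part)
        caim /= 2.0                             # two intervals
        if caim > best_caim:
            best_caim, best_split = caim, float(s)
    return best_split

# Toy data: values below ~10 are class 0, above are class 1
x = [1, 2, 3, 4, 5, 15, 16, 17, 18, 19]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
split = best_split_caim(x, y)   # the midpoint 10.0 separates the classes
```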
After creating the items using each split point, we filtered each item using support and confidence thresholds. The support threshold was 0.1 and the confidence threshold was 0.3. After filtering, we extracted a total of 10 items
In the mining process, we filtered out itemsets whose support was below the support threshold (0.15) or whose confidence was below the confidence threshold (0.9). Next, we calculated the distance between the distribution of each itemset and that of class 1 (death) or class 0 (discharged), and deleted itemsets with a distance higher than 0.035. Using JSD, we selected 65 of the 171 itemsets.
After filtering based on distance, we calculated the Bayesian posterior probability for each filtered itemset. The threshold of each itemset's posterior probability was 0.96.
Through the Bayesian posterior probability filtering, we selected 20 rules. After the superset filtering, we deleted 6 redundant rules. Therefore, in total, 14 rules were extracted.
42.3<Hypersensitive C-reactive protein and 79.4<neutrophils(%) and 339.0<Lactate dehydrogenase => High_Risk 0.9701269092135602
(%)lymphocyte<=13.0 and 339.0<Lactate dehydrogenase => High_Risk 0.9698961739048645
(%)lymphocyte<=13.0 and 42.3<Hypersensitive C-reactive protein => High_Risk 0.9693271238901066
albumin<=31.9 and 0.095<procalcitonin and 339.0<Lactate dehydrogenase => High_Risk 0.9781133691560419
albumin<=31.9 and 42.3<Hypersensitive C-reactive protein => High_Risk 0.9670296292557119
1.13<International standard ratio and 42.3<Hypersensitive C-reactive protein and 305.0<Amino-terminal brain natriuretic peptide precursor(NT-proBNP) => High_Risk 0.974622922099393
1.13<International standard ratio and albumin<=31.9 and 339.0<Lactate dehydrogenase => High_Risk 0.9865627254211912
2.04<D-dimer and 305.0<Amino-terminal brain natriuretic peptide precursor(NT-proBNP) and 0.095<procalcitonin => High_Risk 0.9665619146313702
2.04<D-dimer and 42.3<Hypersensitive C-reactive protein and 339.0<Lactate dehydrogenase => High_Risk 0.9660343990274075
2.04<D-dimer and 1.13<International standard ratio and 339.0<Lactate dehydrogenase => High_Risk 0.9768106539506374
2.04<D-dimer and 1.13<International standard ratio and 42.3<Hypersensitive C-reactive protein => High_Risk 0.9865986602150896
monocytes(%)<=4.2 and 339.0<Lactate dehydrogenase => High_Risk 0.9781133691560419
monocytes(%)<=4.2 and albumin<=31.9 and 0.095<procalcitonin => High_Risk 0.9727626191036757
monocytes(%)<=4.2 and 2.04<D-dimer => High_Risk 0.9745469068256394
We evaluated performance using a validation set split off at a 20% ratio; the test set was provided separately. Since early prediction is important in real situations, test performance was evaluated using data from 7 days before the outcome (survival or death) was known. The evaluation on the validation dataset was conducted through 100 rounds of five-fold cross-validation: the model was executed on each of the five folds, and the whole procedure was repeated over 100 rounds. The ROC curve, precision, recall, F1-score, AUC score, and accuracy were used as performance metrics.
Metric | XGBoost [ | Fuzzy Model [ | Proposed Model
---|---|---|---
Accuracy | 0.95283 | 1.0 | 0.96693 ± 0.01773
Precision | 0.96491 | 1.0 | 0.95218 ± 0.03035
Recall | 0.94827 | 1.0 | 0.97484 ± 0.02349
F1-score | 0.95652 | 1.0 | 0.963001 ± 0.01959
AUC score | 0.9506 ± 0.0221 | 1.0 | 0.96778 ± 0.01738
Metric | XGBoost [ | Fuzzy Model [ | Proposed Model
---|---|---|---
Accuracy | 0.97 | 0.949 | 0.98181
Precision | 0.81 | 0.9 | 0.86666
Recall | 1.0 | 0.75 | 1.0
F1-score | 0.9 | 0.81818 | 0.92857
AUC score | - | 0.868 | 0.98969
To separately evaluate the performance of the rule extraction algorithm, we compared our model with that of the BRL paper [
The results revealed that our model achieved improved performance, especially on the test dataset, demonstrating that it generalizes well.
Metric | BRL [ | Proposed Model
---|---|---
Accuracy | 0.95755 ± 0.01939 | 0.96693 ± 0.01773
Precision | 0.98113 ± 0.01677 | 0.95218 ± 0.03035
Recall | 0.94352 ± 0.03078 | 0.97484 ± 0.02349
F1-score | 0.96163 ± 0.01799 | 0.963001 ± 0.01959
AUC score | 0.95971 ± 0.01838 | 0.96778 ± 0.01738
The results show that the proposed model has higher accuracy and improved interpretability: it provides rules together with their posterior probabilities, explains why it makes the predictions it does, and can thereby increase user trust. Further, it offers various insights; for one, providing probabilities agrees with the argument that probability modeling is important in clinical practice [
I thank my mother, my advising professor and lab colleagues, my friends, and my cats for their endless support.