Computers, Materials & Continua
Mutation Prediction for Coronaviruses Using Genome Sequence and Recurrent Neural Networks
1Bharati Vidyapeeth’s College of Engineering, New Delhi, 110063, India
2Department of Natural and Exact Sciences, Samarkand State University, Samarkand, Uzbekistan
3Graduate School, Duy Tan University, Da Nang, 550000, Viet Nam
4Faculty of Information Technology, Duy Tan University, Da Nang, 550000, Viet Nam
5Department of Information Technology, College of Computers and Information Technology, Taif University, Taif, 21944, Saudi Arabia
*Corresponding Author: Anand Nayyar. Email: firstname.lastname@example.org
Received: 18 December 2021; Accepted: 08 April 2022
Abstract: The study of viruses and their genetics has been both an opportunity and a challenge for the scientific community. The ongoing SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2) pandemic exposed how unprepared the world was for such situations. Not only must the effects of the virus be countered, but the mutations taking place in its genome must also be monitored frequently. One major way to learn more about such pathogens is to extract their genetic data. Although viruses have been cultured and isolated and their genetic data stored in the form of genome sequences, there are still few methods for anticipating what new viruses may arise in the future through mutation. This research proposes deep learning models that predict the genome sequence of the SARS-CoV-2 virus using only earlier viruses of the Coronaviridae family, with the help of RNN-LSTM (Recurrent Neural Network-Long Short-Term Memory) and RNN-GRU (Gated Recurrent Unit), so that countermeasures can be prepared in the future by predicting possible changes in the genome from existing mutations in the virus. In testing, the F1 score of the models exceeded 0.95. The mutation detection accuracy of both models was about 98.5%, which indicates the capability of recurrent neural networks to predict future changes in the genome of a virus.
Keywords: COVID-19; genome sequence; coronaviridae; RNN-LSTM; RNN-GRU
COVID-19 is a contagious and highly infectious disease caused by the SARS-CoV-2 virus. The virus originated in Wuhan, in the Hubei province of China, where the first case was reported on 17 November 2019. Since then, the virus has spread at an alarming rate, which led the World Health Organization (WHO) to declare a Public Health Emergency of International Concern on 30 January 2020 and to characterize the outbreak as a pandemic on 11 March 2020. As of 23 July 2020, more than 15.2 million cases had been identified across 188 countries and territories, of which 623,000 resulted in death. Apart from this, the pandemic has caused severe economic and social disruption, including the largest global recession and global famines. Common symptoms include cough, fever, shortness of breath, and loss of smell and taste [8–10]. Although most cases end with mild symptoms, some gradually develop into acute respiratory distress syndrome (ARDS), which can be precipitated by a cytokine storm [10,11], blood clots, and multi-organ failure. The virus spreads through tiny droplets released during coughing, sneezing, and talking [13,14]. Therefore, the virus is transmitted more frequently among individuals in close contact.
The COVID-19 virus, medically termed SARS-CoV-2, is a positive-sense single-stranded Ribonucleic Acid (RNA) coronavirus that belongs to the Betacoronavirus genus [15,16]. The first genomic sequence was released in China on 10 January 2020 and deposited in the National Center for Biotechnology Information (NCBI) GenBank. Although the virus is an RNA virus, the sequence is stored in Deoxyribonucleic Acid (DNA) notation, that is, Uracil (U) is replaced by Thymine (T), for ease of reading. The virus mutates during replication because of errors made while copying the RNA in a new cell [17,18]. Therefore, it becomes necessary to study the genome sequence of the virus, since mutation can take place with every replication. Artificial intelligence subfields such as machine learning (ML) and deep learning (DL) have recently come to play a very important role in the health care sector [19–22]. Various ML and DL techniques are being used to analyze COVID-19 data to help in preparing vaccines, evaluating drug responses, sequencing genomes, and predicting the likelihood of the disease in patients. In this paper, LSTM-RNN and GRU-RNN based models are proposed and implemented to predict the genome sequences of future viruses that arise through mutation of the coronavirus genome, so that the necessary preparation and treatment can be arranged in advance. Such studies can help not only in studying changes in the genetic nature of these viruses but also in understanding what effect a vaccine or a medicine has on their genome, and hence they can be used to run different simulations before live testing.
The research paper addresses the following research objectives:
Objective 1: To study in depth the research work done in the fields of genome sequencing and neural networks for predicting sequences.
Objective 2: To design and develop RNN-LSTM and RNN-GRU models according to the shape of the pre-processed, time-sequenced genomic data collected and combined from the NCBI repository.
Objective 3: To train the respective models on the collected data and compare their performance in terms of accuracy and F1 score, and to use the models to predict an existing genome sequence and compare the prediction with it to find the mutation accuracy of each model.
Objective 4: To analyze the mutation accuracy results and compare them with contemporary works in terms of performance and the algorithm used to predict the genome sequence.
Organization of the Paper
Section 2 covers related works and studies done in genome sequencing of coronavirus. Section 3 gives an introduction on genome sequencing and discusses all the methodologies used in both the models. Section 4 describes the flow and implementation process of the proposed model including data description and data preprocessing. Section 5 analyses the results of both the models and their comparison. Section 6 concludes the paper with future scope.
2 Related Works
Several researchers have worked on the genome sequence of the SARS-CoV-2 virus to achieve fruitful results. This section discusses such existing works in detail.
Lopez-Rincon et al. proposed an assisted detection approach for SARS-CoV-2 that integrates molecular analysis with machine learning and Artificial Intelligence (AI). The method uses a deep convolutional neural network that extracts features from the genome sequence.
Experiments on data from the 2019 Novel Coronavirus Resource (2019nCoVR) have shown that the proposed system is capable of correctly classifying SARS-CoV-2 and differentiating it from other coronavirus strains, such as Middle East Respiratory Syndrome Coronavirus (MERS-CoV), Human Coronavirus 229E (HCoV-229E), HCoV-NL63, HCoV-OC43, HCoV-HKU1, and the virus's predecessor SARS-CoV, regardless of incomplete annotation, sequencing noise, or errors.
Biswas (2020) presented a phylogenetic analysis of the SARS-CoV-2 virus in which the complete genomic sequence of the virus was reported. The authors established the endemicity component of the virus and then worked on discovering the next SARS-CoV-2 source. They disclosed that all sequences of this virus formed a single group with no branching, but they did not support the results with a comprehensive mathematical analysis.
Mooney (2014) and Roach (2010) addressed the assembly, stoichiometry, and composition of RNA-synthesizing complexes. According to the authors, one practical outcome of reverse genetics is the development of stable coronavirus-based replicon systems for vaccines and other biomedical uses.
The development of an in vitro method of replication of viruses such as the one used for poliovirus in which complete replication of viruses in cell lysates can be achieved is still ongoing. This approach will allow for a far more in-depth analysis of the requirements for gene replication, beginning with the transcript of an infectious gene.
Lauber (2012) used the conservation of genome organization to partition each genome into five non-overlapping regions: the 5′ untranslated region (UTR), open reading frame (ORF) 1a, ORF1b, the 3′ ORFs (encompassing the 3′-proximal ORFs), and the 3′ UTR. Each region was examined for its contribution to changes in genome size under different models. The non-linear model statistically outperformed the linear model and captured about 92% of the data variance.
Das and Ghate (2020) examined the SARS-CoV-2 gene signature. They measured the ancestry proportions of the European genome using the qpAdm statistical tool. Then, using GraphPad Prism v8.4.0 (GraphPad Software), they computed the Pearson correlation coefficient between different ancestry proportions of the European genome and the death to recovery ratio, and carried out a statistical study of the result.
Yadav (2020) studied the SARS-CoV-2 genome sequences of three cases that had a positive record of travel from Wuhan, China. Almost complete genomes of case 1 and case 3 (29,851 nucleotides), as well as a partial genome of case 2, were obtained. It was noted that the Indian SARS-CoV-2 sequences shared almost 99 percent identity with the pneumonia virus associated with the Wuhan seafood market. The authors proposed that genome sequencing of cases from India be continued to establish whether the virus is evolving.
Ye (2013) proposed the inclusion of AI in solving the problems posed by the coronavirus and carried out an extensive literature review of models such as extreme learning machines and generative adversarial networks for monitoring patients, diagnosing the disease, and predicting the spread of the virus at different stages.
Based on the above literature survey, it is evident that extensive research has been done on the genome structure of coronaviruses and related viruses and on genome sequencing. With the predictive power of recurrent neural networks, changes in the genome of a virus and their nature can be detected and analyzed. It can be concluded that a model able to predict the new mutated genome sequence of SARS-CoV-2 would help the world avoid situations such as the ongoing pandemic in the near future, because potential treatments such as vaccines and medicinal drugs could be developed after medically examining the genome sequence of the predicted virus.
3 Deep Learning Algorithms
This section gives a general overview of genome sequencing and of the deep learning algorithms used in this work.
3.1 Genome Sequencing
A genome sequence is the full series of nucleotides that composes all the chromosomes of an organism. Within a population, the vast majority of nucleotides are identical among individuals, so the analysis of several individuals is needed to explain genetic diversity.
Genome sequencing is the process of determining the complete DNA sequence of an organism's genome at a single time, which includes all of the chromosomal DNA as well as the DNA found in the mitochondria and, for plants, in the chloroplast. The genome is an organism's genetic material, which consists of DNA, or of RNA in the case of many viruses. In practice, genome sequences that are nearly complete are often also referred to as whole genome sequences. Genome sequencing has primarily been used as a research tool, but it was introduced to clinics in 2014. In the coming era of precision medicine, whole genome sequence data can be an essential resource to guide clinical intervention. It can therefore lay the groundwork for predicting disease severity and susceptibility as well as drug response.
Coronavirus genomes, including that of SARS-CoV-2, contain between roughly 26,000 and 31,500 bases, so a single genome is a long sequence in which every position is filled by one nucleotide. As per the analysis performed by Woo et al., the G and C content of these viruses ranges from 32 percent to around 43 percent. Fig. 1 shows that variable numbers of ORFs are found among genes such as ORF1ab, envelope, membrane, spike, and nucleocapsid. In this paper, the genome sequences of future viruses are predicted by applying the RNN models to the existing genome sequences of coronaviruses.
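The G and C content mentioned above is a simple per-sequence statistic. The following is a minimal sketch of how it could be computed; the function name `gc_content` and the short fragment are illustrative assumptions and are not taken from the paper's code or dataset.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases among the A/C/G/T bases of a sequence."""
    counted = [base for base in sequence.upper() if base in "ACGT"]  # skip N and other ambiguity codes
    if not counted:
        raise ValueError("sequence contains no A/C/G/T bases")
    gc = sum(1 for base in counted if base in "GC")
    return gc / len(counted)


# Illustrative fragment only; a real genome would be read from a FASTA/GenBank record.
fragment = "ATTAAAGGTTTATACCTTCC"
print(f"GC content: {gc_content(fragment):.2%}")
```

Ambiguous symbols are ignored here so that they do not distort the ratio; other conventions are possible.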
3.2 Recurrent Neural Network (RNN)
A recurrent neural network (RNN) is a class of artificial neural network in which the connections between nodes form a directed graph along a temporal sequence, which allows it to exhibit temporal dynamic behavior. RNNs are derived from feedforward neural networks, and they can use their internal state (memory) to process input sequences of variable length.
The term "recurrent neural network" is used to refer to two broad classes of networks with a similar general structure, one with a finite impulse response and the other with an infinite impulse response. Both classes exhibit temporal dynamic behavior. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced by a strictly feedforward neural network, while an infinite impulse recurrent network is a directed cyclic graph that cannot be unrolled. Both classes can have additional stored states whose storage is directly controlled by the network. Such controlled states form the basis of LSTMs and GRUs.
After tokenization, each data element is treated as a token. During forward propagation, shown in Fig. 2, the output of the hidden layer (h) at time step t is calculated by applying an activation function to the input (x) multiplied by the input weights (U). Unlike other neural networks, the function includes one more term: the hidden output of the previous time step multiplied by a separate set of weights (W). As a result, the output at each step depends on the output of the previous step, which preserves the order in which the input is fed to the network. The final output is calculated from the hidden state using the output weights (V) and a different activation function (usually softmax), and it is then used to compute the loss. Similarly, during backward propagation, the weights are updated from the gradients passed back through the previous steps, so that gradient descent moves toward a minimum and reduces the loss in each iteration.
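The forward pass described above can be sketched in plain Python as follows. This is a minimal illustration, not the paper's implementation: it assumes the common convention in which U maps the input, W maps the previous hidden state, and V maps the hidden state to the output, and all weight values and dimensions are made up for the example.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_forward(inputs, U, W, V, h0):
    """Run a simple RNN over a sequence.

    Returns the per-step output distributions and the final hidden state.
    """
    h = h0
    outputs = []
    for x in inputs:
        # h_t = tanh(U x_t + W h_{t-1}): the recurrent term carries the sequence order
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, h))]
        h = [math.tanh(p) for p in pre]
        # y_t = softmax(V h_t)
        outputs.append(softmax(matvec(V, h)))
    return outputs, h


# Toy dimensions (assumed): 4-dimensional one-hot inputs, 3 hidden units, 4 output classes.
U = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.2, -0.1, 0.4], [-0.3, 0.1, 0.2, 0.1]]
W = [[0.5, 0.0, -0.1], [0.1, 0.4, 0.0], [0.0, -0.2, 0.3]]
V = [[0.2, -0.1, 0.0], [0.0, 0.3, 0.1], [-0.2, 0.1, 0.4], [0.1, 0.0, -0.3]]
sequence = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]  # three one-hot encoded tokens
probs, h_final = rnn_forward(sequence, U, W, V, h0=[0.0, 0.0, 0.0])
```

Each entry of `probs` is a probability distribution over the output classes for one time step, which is what a cross-entropy loss would consume during training.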
3.3 Long-Short Term Memory (LSTM)
RNNs are not particularly effective when the input sequences are long, because it is difficult to preserve information from early steps and carry it forward, so information from the start of the sequence may be lost. Moreover, RNNs suffer from the vanishing gradient problem during backpropagation: as the gradient value becomes smaller and smaller, it contributes less and less to the training of the model. LSTM was designed to address these shortcomings of the typical recurrent neural network.
Long short-term memory (LSTM) is a recurrent neural network (RNN) based architecture [40–42]. Unlike standard feedforward neural networks, LSTM has regulated feedback connections. It can process not only single data points such as images but also entire data sequences such as audio or video streams. For instance, it is useful for applications like handwriting recognition, speech recognition, time series prediction, sign language translation [44,45], and many more. Applications in the healthcare sector include predicting the subcellular localization of proteins and various predictions in medical care pathways [47–50].
The LSTM based RNN model is trained in a supervised manner on a set of training sequences using gradient descent, an optimization technique, combined with backpropagation through time. The gradients needed for optimization are computed so that each weight of the LSTM model is revised in proportion to the derivative of the error with respect to that weight. LSTM resolves the vanishing gradient issue of RNNs because, when error values are propagated back from the output layer, they persist in the cell of the LSTM unit. The error is continuously fed back to each of the gates until they learn to cut off the value. In this paper, an LSTM based RNN model is implemented.
3.4 Gated Recurrent Unit (GRU)
The gated recurrent unit (GRU) is a gating mechanism for RNNs that avoids the use of a cell state and uses the hidden state to transmit information. It has only two gates, namely a reset gate and an update gate. It is similar to an LSTM with a forget gate but has fewer parameters, since it lacks an output gate. The performance of GRU on some natural language processing and polyphonic music modeling tasks was found to be close to that of LSTM [53,54]. Moreover, it has shown stronger performance on relatively small and less frequent datasets.
Like LSTM, GRU addresses the vanishing gradient issue of the standard RNN, in this case through the update gate and the reset gate. These are two vectors that determine which information is carried on to the output. The remarkable aspect of GRU is that it can be trained to retain information over a long period of time without washing it out, and to discard information that is unrelated to the prediction.
However, according to Weiss, the LSTM is strictly stronger than the GRU because it can perform unbounded counting whereas the GRU cannot, due to which the GRU fails to learn simple languages that are learned easily by the LSTM.
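To make the role of the reset and update gates concrete, the following is a minimal plain-Python sketch of a single GRU step. It is an illustration under assumptions, not the paper's code: the parameter names (`Wz`, `Uz`, `bz`, and so on) are hypothetical, and the state update uses the convention h_t = z * h_{t-1} + (1 - z) * candidate, which is one of two equivalent conventions found in the literature.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-rows matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]


def gru_step(x, h_prev, p):
    """One GRU time step; p is a dict of (hypothetically named) parameters."""
    # Update gate: how much of the previous state is kept.
    z = [sigmoid(v) for v in affine(p["Wz"], x, p["Uz"], h_prev, p["bz"])]
    # Reset gate: how much of the previous state feeds the candidate.
    r = [sigmoid(v) for v in affine(p["Wr"], x, p["Ur"], h_prev, p["br"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    candidate = [math.tanh(v) for v in affine(p["Wh"], x, p["Uh"], gated, p["bh"])]
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h_prev, candidate)]


def zero_params(n_in, n_hidden):
    """Build all-zero parameters, useful for checking the gate arithmetic by hand."""
    params = {}
    for gate in "zrh":
        params["W" + gate] = [[0.0] * n_in for _ in range(n_hidden)]
        params["U" + gate] = [[0.0] * n_hidden for _ in range(n_hidden)]
        params["b" + gate] = [0.0] * n_hidden
    return params


params = zero_params(n_in=4, n_hidden=2)
h_next = gru_step([1, 0, 0, 0], [0.4, -0.2], params)
```

With all-zero parameters both gates evaluate to 0.5 and the candidate to 0, so the new state is half the previous one. A strongly positive update-gate bias pushes z toward 1, in which case the previous state is carried forward almost unchanged, which is the retention behavior described above.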
Fig. 3 shows the complete flow of our research study, and each block is explained separately below. We start by collecting the data and applying suitable data preparation techniques to convert the raw data into a form that can be passed through the deep learning models. The steps include tokenization, feature extraction, and feature selection. The data is then passed through the models, which are optimized and evaluated with suitable metrics such as accuracy and F1 score. Once the best performance of each model has been obtained, the results are compared and analyzed to gain further insights into our study.
4.1 Model Architecture
In this paper, LSTM-RNN and GRU-RNN models are used for training and learning. Figs. 4 and 5 display the architectures of the LSTM and GRU models respectively, with the input and output dimensions after each layer. The first layer is the input layer, in which all shaped data sequences are fed to the model. The second layer is the LSTM recurrent layer in the case of the LSTM model and the GRU recurrent layer in the case of the GRU model, with the tanh activation function. The next layer is a Dropout layer, used for regularization and initialized with a dropout rate of 0.2, followed by a Dense layer, which is the standard fully connected neural network layer, with the sigmoid activation function. The next two layers are also Dropout and Dense layers. However, the activation function in the last Dense layer, which is the output layer, is softmax.
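The layer stack described above could be expressed with the Keras API roughly as follows. This is a sketch under stated assumptions, not the authors' code: the number of recurrent units, the width of the intermediate Dense layer, the window length `timesteps`, the choice of loss, and the reshaping of each window to `(timesteps, 1)` are all hypothetical, since the text does not state them; only the layer order, the activations, the dropout rate of 0.2, and the 8-way output follow the description.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_model(cell="lstm", timesteps=72, num_classes=8, units=64):
    """Input -> LSTM/GRU (tanh) -> Dropout -> Dense (sigmoid) -> Dropout -> Dense (softmax)."""
    recurrent = layers.LSTM if cell == "lstm" else layers.GRU
    model = tf.keras.Sequential([
        layers.Input(shape=(timesteps, 1)),   # assumed: one integer token per time step
        recurrent(units, activation="tanh"),
        layers.Dropout(0.2),
        layers.Dense(units, activation="sigmoid"),
        layers.Dropout(0.2),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Assumed training configuration; Adam is the optimizer named in Section 4.4.
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model


lstm_model = build_model("lstm")
gru_model = build_model("gru")
```

Building both variants from one function keeps every hyperparameter other than the recurrent cell identical, which is what a like-for-like comparison of the two models requires.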
4.2 Data Collection and Processing
The data used for training the models was collected from six different genome sequences of the coronavirus family. The first is the complete genome sequence of coronavirus HKU1 (CoV-HKU1), cultured from a 71-year-old male patient with pneumonia who had recently returned from Shenzhen, China [57–60]. The second is the genome sequence of the novel HCoV-229E, isolated from a man diagnosed with acute pneumonia along with renal failure in 2012. The third is the genome sequence of HCoV-NL63, a fourth human coronavirus, isolated from a 7-month-old child suffering from both conjunctivitis and bronchiolitis [61–64]. This viral genome sequence had unique characteristics, including a distinctive N-terminal fragment. The fourth is the genome sequence of SARS Coronavirus (SARS-CoV). The fifth is the complete genome sequence of MERS-CoV. The last one is the genome sequence of SARS-CoV-2, the virus that causes COVID-19 and is responsible for the ongoing pandemic [64–66]. It was isolated from a patient who worked at the Huanan seafood market in Wuhan and was admitted to the Wuhan Central Hospital on 26 December 2019. All six genome sequences were collected and compiled into one dataset for the training of the models. Tab. 1 gives the complete dataset description.
After the data was collected and compiled, the next step was to preprocess the raw text data so that it could be used as input for training the models. The data consisted of the genome sequences as a stream of letters. Using the TensorFlow and Keras libraries, the letters present in the genome sequences were tokenized, so that each sequence became a series of integers in which each integer is the index of the corresponding token. Finally, the data was cleaned and checked for any null or duplicate values in the final dataset.
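The tokenization and cleaning steps can be illustrated with a small plain-Python stand-in. The paper used TensorFlow and Keras utilities for this; the fixed vocabulary, the function names, and the decision to reject unknown symbols below are assumptions made only for the sketch.

```python
# Assumed mapping from nucleotide letters to integer indices; index 0 is left free for padding.
VOCAB = {"a": 1, "c": 2, "g": 3, "t": 4}


def tokenize(sequence):
    """Convert a genome string into a list of integer token indices."""
    tokens = []
    for ch in sequence.lower():
        if ch.isspace():
            continue  # line breaks inside a sequence file carry no information here
        if ch not in VOCAB:
            raise ValueError(f"unexpected symbol {ch!r}")
        tokens.append(VOCAB[ch])
    return tokens


def drop_null_and_duplicates(records):
    """Remove empty records and exact duplicates while preserving order."""
    seen = set()
    cleaned = []
    for record in records:
        if not record or record in seen:
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned


records = drop_null_and_duplicates(["ACGT", "", "ACGT", "TTGA"])
token_lists = [tokenize(r) for r in records]
```

A tokenizer fitted on the data, as in Keras, would instead assign indices according to the symbols it actually encounters, so the concrete integers may differ from this fixed mapping.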
4.3 Feature Extraction and Labelling
First, the genome sequences were tokenized. The tokens obtained from the characters of a genome sequence are integers ranging from 1 to 5. The integer 5 always occurs at the end of the token array and hence has no significance in the feature extraction process.
Features were extracted from the preprocessed dataset by iterating over the token list of the sequences. A fixed-length window of T tokens was taken at a time and appended to a list, followed by the next window of T tokens, until the end of the token list was reached. The list was later converted into an array. At the end of the process, the feature array has shape (y, T), where y is given by Eq. (1).
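One way to realize this windowing is sketched below. It assumes a stride of one token and takes the token that follows each window as its prediction target; neither choice is stated explicitly in the text, so the number of windows produced here should not be read as the paper's Eq. (1).

```python
def make_windows(tokens, T):
    """Slide a window of length T over the tokens.

    Returns (features, targets), where each target is the token that
    immediately follows its window. Stride 1 is an assumption.
    """
    if T <= 0:
        raise ValueError("T must be positive")
    features, targets = [], []
    for start in range(len(tokens) - T):
        features.append(tokens[start:start + T])
        targets.append(tokens[start + T])
    return features, targets


tokens = [1, 2, 3, 4, 1, 2]
features, targets = make_windows(tokens, T=3)
```

With this scheme a token list of length N yields N - T windows, each of length T, which matches the (y, T) layout of the feature array under the stated assumptions.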
During the feature extraction process, an index was given to each feature array, and a label array was created whose length is the dataset size multiplied by the number of unique characters, which came out to be 8 in this case. This array was filled by iterating over its indices and assigning each element a value of 1 where the feature matches the label index and 0 where it does not, giving a label array of shape (y*8, 8).
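The 0/1 labelling described above amounts to a one-hot encoding over the 8 classes. A minimal sketch follows; the function name and the example labels are illustrative assumptions.

```python
def one_hot(labels, num_classes):
    """Encode integer class labels as rows of 0s with a single 1."""
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} outside 0..{num_classes - 1}")
        row = [0] * num_classes
        row[label] = 1
        rows.append(row)
    return rows


encoded = one_hot([1, 4], num_classes=8)
```

Each row has exactly one non-zero entry, which is the format a softmax output layer trained with categorical cross-entropy expects.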
4.4 Model Optimization
The RNN-LSTM and RNN-GRU models were fitted to the dataset using the Adam optimizer, an adaptive variant of the stochastic gradient descent algorithm. The learning rate and decay rate were set and adjusted manually, because an extremely small learning rate makes training slow and prone to stalling, while a very high learning rate can prevent the loss from converging. The gradients consist of the partial derivatives of the loss with respect to each weight, which is helpful because the dataset comprises multiple tokens that are propagated through the RNN based models.
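For reference, a single Adam update can be written in a few lines. The sketch below uses the commonly cited default hyperparameters (learning rate 0.001, beta1 0.9, beta2 0.999), which are assumptions here since the paper tuned its rates manually and does not report them, and it minimizes a toy quadratic rather than the paper's loss.

```python
import math


def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # running mean of the gradient
        v_i = beta2 * v_i + (1 - beta2) * g * g    # running mean of the squared gradient
        m_hat = m_i / (1 - beta1 ** t)             # bias correction for early steps
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def minimize_quadratic(steps, lr):
    """Minimize f(w) = (w - 3)^2 from w = 0 as a toy check."""
    w, m, v = [0.0], [0.0], [0.0]
    for t in range(1, steps + 1):
        grad = [2.0 * (w[0] - 3.0)]  # partial derivative of f with respect to w
        w, m, v = adam_step(w, grad, m, v, t, lr=lr)
    return w[0]


w_final = minimize_quadratic(steps=3000, lr=0.01)
```

Because the update divides by the root of the squared-gradient average, the first step moves each weight by roughly the learning rate regardless of the gradient's scale, which is why the learning rate is the main quantity to tune.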
5 Results and Discussion
5.1 Experimental Setup
The models were built in a TensorFlow v2.0 environment with Python version 3.7, on a system with an NVIDIA GeForce MX110 GPU and 16 GB of RAM. Because the stored genome sequences do not all share the same dimensions, they are trimmed to a uniform size of 72 × 395 before pre-processing, which makes it easier to specify the dimensions of the neural network layers.
5.2 Experimental Parameters
The metrics chosen are accuracy and F1 score, which require true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) to be defined.
TP: When actual sequence is “ABCD” (in sequence) and predicted sequence is also “ABCD”.
TN: When actual sequence is “CBAD” (not in sequence) and predicted sequence is also “CBAD”.
FP: When actual sequence is “ABCD” (in sequence) and predicted sequence is “CBAD” (not in sequence).
FN: When actual sequence is “CBAD” (not in sequence) and predicted sequence is “ABCD” (in sequence).
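Given counts of these four outcomes, accuracy and F1 score follow from their standard definitions. The sketch below is illustrative; the counts in the example are made up and are not results from the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of the counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts for illustration only.
acc = accuracy(tp=90, tn=5, fp=3, fn=2)
f1 = f1_score(tp=90, fp=3, fn=2)
```

Note that F1 ignores true negatives, so two models with similar accuracy can still differ in F1 when their false positives and false negatives are distributed differently.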
5.3 Experimental Results
After preprocessing, the dataset containing the genome sequences is split into training and validation sets to detect overfitting and to evaluate the performance of the LSTM and GRU models. The evaluation metrics used in this paper are accuracy and F1 score. The F1 score is the harmonic mean of precision and recall. The greater the number of true positives relative to false positives and false negatives, the greater the F1 score.
Tab. 2 highlights the accuracy and F1 score of both models after 5, 10, and 15 epochs. The accuracy of the two models is approximately equal, but in terms of F1 score LSTM-RNN performs better than the GRU-RNN model, implying that the former gives a greater number of true positive results than the latter. Further insights can be drawn by plotting the training and validation graphs for both models.
For further analysis, the training and validation accuracy as well as the loss per epoch are plotted. Figs. 6 and 7 display the training and validation accuracy of both models. Although the validation accuracy is almost the same in both graphs, the training accuracy differs: the LSTM-RNN model achieves high accuracy after five epochs, much earlier than the GRU-RNN model, which achieves high accuracy after 10–12 epochs. Similarly, Figs. 8 and 9 display the training and validation loss per epoch. From both graphs, it is visible that the loss of the LSTM-RNN model is lower than that of the GRU-RNN model.
Next, we apply our models to an existing piece of genome sequence and try to predict its mutations, i.e., the changes in its genome. We calculate the mutation rate and compare the prediction with the mutated genome to calculate the mutation accuracy. Fig. 10 shows the genome sequences of the virus.
It was then passed through the same data preprocessing procedure that was used in training the two models.
The output we get is in adjusted vector form, which we convert back to a genome sequence by following the data preprocessing steps in reverse. Fig. 12 highlights the output conversion from tokens.
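Reversing the preprocessing means turning each output vector back into a letter. A minimal sketch is given below; it assumes that each output row is a probability vector over token indices and that the inverse mapping shown is the one used during tokenization, both of which are illustrative assumptions.

```python
# Assumed inverse of the tokenizer mapping; index 0 is reserved for padding.
INDEX_TO_BASE = {1: "A", 2: "C", 3: "G", 4: "T"}


def decode(prob_rows):
    """Pick the most probable index in each row and map it back to a nucleotide letter."""
    bases = []
    for row in prob_rows:
        index = max(range(len(row)), key=row.__getitem__)  # argmax over the row
        bases.append(INDEX_TO_BASE.get(index, "N"))        # N marks an unmapped index
    return "".join(bases)


predicted_sequence = decode([
    [0.0, 0.7, 0.1, 0.1, 0.1],
    [0.0, 0.1, 0.1, 0.1, 0.7],
])
```

Indices without a nucleotide mapping are rendered as N here so that the decoded string keeps the same length as the model output.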
When both models were applied to the test set shown above, the following results for the mutation rate and the mutation detection accuracy were observed.
Tab. 3 gives the mutation percentage as well as the mutation accuracy for both models. The mutation percentage of the GRU-RNN model is greater than that of the LSTM-RNN model, which could be due to a larger number of true negative results. From the analysis and comparison of the results of both models, it can be concluded that LSTM-RNN works better than the GRU-RNN model for this research.
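The text does not give formulas for the two quantities in Tab. 3, so the following sketch shows only one plausible reading, and both definitions are assumptions: the mutation percentage is taken as the share of positions where the predicted sequence differs from the reference sequence, and the mutation accuracy as the share of positions where the predicted sequence agrees with the actual mutated sequence.

```python
def mutation_stats(reference, actual, predicted):
    """Compare three equal-length sequences position by position.

    The two definitions used here are assumed for illustration and may
    differ from the ones behind the paper's reported numbers.
    """
    if not (len(reference) == len(actual) == len(predicted)):
        raise ValueError("sequences must have equal length")
    n = len(reference)
    if n == 0:
        raise ValueError("sequences must not be empty")
    predicted_changes = sum(r != p for r, p in zip(reference, predicted))
    correct_positions = sum(a == p for a, p in zip(actual, predicted))
    return {
        "mutation_percentage": 100.0 * predicted_changes / n,
        "mutation_accuracy": 100.0 * correct_positions / n,
    }


# Toy sequences for illustration only.
stats = mutation_stats(reference="ACGTACGT", actual="ACGTTCGT", predicted="ACGTTCGA")
```

In the toy example the prediction changes two of eight positions relative to the reference and agrees with the actual mutated sequence at seven of eight positions.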
For the comparison analysis, two contemporary works have been taken into consideration.
Pipek uses a filtering algorithm based on the IsoMut tool and identifies changes in the genome structure by varying the filtering parameters applied to the sequences. Its advantage over conventional statistical approaches is its visualization power: views of quantities such as the False Positive Rate and True Positive Rate thresholds are highly informative. In a comparison with other tools based on single-core performance, it completed the task in 7 min, while the next best tool took 1 h 20 min.
Thireou used a bi-directional LSTM to predict subcellular localization in eukaryotic proteins and obtained an accuracy of 93% on plant proteins and 88% on non-plant proteins. Their work showed the ability of deep learning networks to make predictions and perform analysis in bioinformatics. Our work is an approach to learning patterns in the sequences of the coronavirus family using LSTM and GRU. Because the sequences were identified at different points in time, they could be treated as time series data, which gave us the opportunity to predict possible changes in their genome sequences.
6 Conclusion and Future Scope
Viral genome structures provide an opportunity as well as a challenge to dig deeper into the biology of viruses. This research was a simple attempt at creating a model that learns the pattern changes of genome sequences in these viruses and predicts the mutations that took place in the most recent virus of concern, SARS-CoV-2. A neural network, especially an RNN based network, is well suited to these types of studies, as it keeps learning and improving its results as the number of training cycles increases. In this paper, two models were implemented: RNN-LSTM and RNN-GRU. Although RNN-GRU provided slightly higher accuracy in the initial cycles, RNN-LSTM provided a considerably higher F1 score, making it the more suitable model for the prediction.
Although the two models performed very well on the sequenced dataset, very large virus datasets can contain significant outliers, and the models may end up overfitting them. Therefore, in the future, well-designed architectures such as modular neural networks, in which multiple neural networks work simultaneously, can help avoid overfitting in these cases. Furthermore, ribosome data can be included alongside nucleotide data, not only to predict mutations but also to predict the complete genome structure of viruses that may be encountered in the near future.
Also, these models can be used to run simulations of changes in the genetic data of viruses in response to any change in surrounding conditions, or to any chemical compound from a vaccine or medicine under development, to gain additional insight into changes in the biology of the virus.
Acknowledgement: The authors would like to thank Taif University Researchers Supporting Project number (TURSP-2020/211), Taif University, Taif, Saudi Arabia, for its support.
Funding Statement: This work was supported by Taif University Researchers Supporting Project number (TURSP-2020/211), Taif University, Taif, Saudi Arabia.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.