A Vicenary Analysis of SARS-CoV-2 Genomes

Coronaviruses are responsible for various diseases ranging from the common cold to severe infections such as Middle East respiratory syndrome and severe acute respiratory syndrome. A new coronavirus, SARS-CoV-2, the cause of COVID-19, has produced a pandemic and an ongoing global public health crisis. Therefore, there is a need to understand the genomic transformations that occur within this family of viruses in order to limit disease spread and develop new therapeutic targets. The nucleotide sequences of SARS-CoV-2 consist of four kinds of bases. These bases can be classified into purines and pyrimidines according to their chemical composition. Purines include adenine (A) and guanine (G), while pyrimidines include cytosine (C) and thymine (T). There is a need to understand the spatial distribution of these bases on the nucleotide sequence to facilitate the development of antivirals (including neutralizing antibodies) and the epitopes necessary for vaccine development. This study aimed to evaluate all the purine and pyrimidine associations within the SARS-CoV-2 genome sequence by measuring mathematical parameters including the Shannon entropy, the Hurst exponent, and the guanine-cytosine (GC) content. The Shannon entropy is used to identify closely associated sequences, whereas the Hurst exponent is used to identify the autocorrelation of purine-pyrimidine bases even if their organization differs. Different frequency patterns can be used to determine the distribution of all four bases and the density of each base. The GC-content is used to understand the stability of the sequence. The relevant genome sequences were extracted from the National Center for Biotechnology Information (NCBI) virus database. Furthermore, the phylogenetic properties of SARS-CoV-2 were characterized to compare its closeness with other coronaviruses by evaluating the purine and pyrimidine distribution.

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
CMC, 2021, vol.69, no.3


Introduction
The coronavirus disease 2019 (COVID-19) pandemic is an ongoing global public health crisis caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1,2]. The disease originally started in Wuhan, China, and quickly spread to the rest of the world, infecting millions of people [3][4][5]. The rapid spread of the virus has overwhelmed even the most advanced healthcare systems and has so far resulted in the death of over 2.5 million people worldwide. In January 2020, the World Health Organization (WHO) declared COVID-19 a public health emergency of international concern [6][7][8][9]. To control the disease and give healthcare systems time to cope with the sudden demand, lockdowns were ordered in many countries, leading to major economic and social disruption. In view of this, there is an urgent need to understand the genomics of the virus so as to limit its spread, reduce mortality from the disease, and develop new effective treatments.
SARS-CoV-2 was found to be related to two bat-derived severe acute respiratory syndrome-like coronaviruses, bat-SL-CoVZC45 and bat-SL-CoVZXC21 [10]. On 11 February 2020, the WHO formally named the disease COVID-19. From that day onwards, the coronavirus study group of the International Committee on Taxonomy of Viruses has called the virus SARS-CoV-2 [11]. The National Center for Biotechnology Information (NCBI) hosts complete genomic sequences of the CoVs, which support the study of their evolutionary basis and molecular uniqueness [12]. Ceraolo et al. [13] identified a high sequence similarity (above 99%) among all sequenced 2019 CoV genomes, with the closest bat CoV sequence sharing 96.2% sequence identity. This supported the zoonotic origin of the virus.
Coronaviruses are enveloped RNA viruses that circulate amongst humans, other mammals, and birds, causing respiratory, enteric, hepatic, and neurologic diseases [14,15]. A total of 89 nucleotide sequences of SARS-CoV-2 were accessible from the NCBI virus database [16,17]. Each of these sequences consists of about 29000 bases. These bases can be classified into purines and pyrimidines according to their chemical composition. Purines include adenine (A) and guanine (G), while pyrimidines include cytosine (C) and thymine (T). There is a need to understand the spatial distribution of these bases on the nucleotide sequence to facilitate the development of antivirals (including neutralizing antibodies) and the epitopes necessary for vaccine development. Various quantitative metrics can be used to understand the spatial distribution of purines and pyrimidines, including the Hurst exponent (HE), the Shannon entropy (SE), and the guanine-cytosine content (GC-content). The HE is used to identify the autocorrelation of purine-pyrimidine bases even if their organization differs. The SE is used to identify closely associated sequences. Different frequency patterns can be used to determine the distribution of all four bases and the density of each base. The GC-content is used to understand the stability of the sequence. Therefore, this study aimed to evaluate all the purine and pyrimidine associations within the SARS-CoV-2 genome sequence by measuring mathematical parameters including the SE, the HE, and the GC-content.
The rest of this paper is organized as follows. Section 2 describes the database. Section 3 defines the fundamental parameters and explains how they apply to the database. Section 4 presents the experimental results and illustrations. Section 5 concludes the article, emphasizing the critical findings of the analysis.

Specifications of the Used Database
All the CoV nucleotide sequences were acquired from the NCBI Virus Database (http://www.ncbi.nlm.nih.gov/labs/virus/vssi/). This dataset contained 89 complete SARS-CoV-2 nucleotide sequences as of 15 March 2020. For the purpose of the study, each nucleotide sequence was converted into a binary sequence of 1's and 0's as per Eq. (1).

$$ s_i = \begin{cases} 1, & \text{if the } i\text{th base is a purine (A or G)} \\ 0, & \text{if the } i\text{th base is a pyrimidine (C or T)} \end{cases} \qquad (1) $$

In Eq. (1), purine and pyrimidine bases are encoded as 1 and 0, respectively, in the resulting binary sequence. All the 89 complete SARS-CoV-2 nucleotide sequences were labeled according to their accession ID, as listed in Tab. 1.

Table 1: Sequence labels and accession IDs
Seq    Accession ID
S1     NC_045512
S19    MT159712

The lengths of the 89 complete sequences ranged from 29783 to 29981 nucleotides, a span of 198 nucleotides. The shortest complete SARS-CoV-2 sequence was S2, with a length of 29783, and the longest was S47, with a length of 29981. Two sequences had a length of 29867, thirty-nine had a length of 29882, and eleven had a length of 29903.
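The purine/pyrimidine encoding of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name `to_binary` is hypothetical, and treating U like T and rejecting ambiguous symbols such as N are assumptions made here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}  # assumption: U (RNA notation) is handled like T


def to_binary(sequence):
    """Encode a nucleotide string as a list of bits: purine -> 1, pyrimidine -> 0."""
    bits = []
    for base in sequence.upper():
        if base in PURINES:
            bits.append(1)
        elif base in PYRIMIDINES:
            bits.append(0)
        else:
            # Assumption: ambiguous symbols (e.g. N) are rejected rather than guessed.
            raise ValueError(f"unexpected symbol: {base!r}")
    return bits


print(to_binary("ATTAAAGGTT"))  # [1, 0, 0, 1, 1, 1, 1, 1, 0, 0]
```

Every later parameter (FD, HE, SE, and Hamming distance) operates on bit lists of this kind, so the handling of ambiguous symbols would need to be fixed once, at this step.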

Generation of Gene Clusters
Different quantitative parameters, including the SE, the fractal dimension (FD), the HE, and the distribution of purine-pyrimidine contents, were used to describe the spatial distribution of the bases of the SARS-CoV-2 sequences.

FD of the Indicator Matrices
The FD is a key quantity for characterizing fractal patterns or sets. Let D = {0, 1} be the set of two symbols representing the purine and pyrimidine bases of a nucleotide sequence, and let S(l) be the binary sequence of length l, built from the symbols of D, that corresponds to a nucleotide sequence. All the binary sequences in our study were converted into indicator matrices [18][19][20][21]. The FD indicates the degree to which a self-similar (fractal) object fills the Euclidean space in which it is embedded. Several methods have been described in the literature to determine the self-organizing configuration of DNA sequences through an indicator matrix. The indicator function for each sequence was defined as shown in Eq. (2) [22], and the resulting indicator matrix ϑ_hk takes only the values 0 and 1. A binary image is generated from this matrix to visualize the correlation between sequences; in the same way, the autocorrelation between the purines and pyrimidines of a single sequence can be depicted. The image is generated by drawing a black dot for each 1 and a white dot for each 0. In these images, the purines and pyrimidines are distributed like a fractal. The FD of the indicator matrix was calculated from σ(p), the average number of 1's in randomly chosen p × p minors of the P × P indicator matrix. Using σ(p), the FD is defined in Eq. (3).
The self-organization of purine and pyrimidine bases for all the SARS-CoV-2 sequences can be obtained through the FD of the indicator matrix. The box-counting method is the most commonly used method to determine the FD.
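As a rough illustration of the indicator matrix and of the quantity σ(p), the following Python sketch may help. It rests on two assumptions that the text does not spell out: the indicator entry for positions h and k is taken to be 1 when the two positions carry the same symbol, and the p × p minors are taken to be contiguous blocks at random offsets. The function names are hypothetical, and the FD formula of Eq. (3) is deliberately not implemented.

```python
import random


def indicator_matrix(bits):
    """Assumed indicator: entry (h, k) is 1 when positions h and k hold the same symbol."""
    return [[1 if a == b else 0 for b in bits] for a in bits]


def sigma(matrix, p, samples=200, seed=0):
    """Average number of 1's over randomly placed contiguous p x p blocks (an assumption)."""
    size = len(matrix)
    if not 1 <= p <= size:
        raise ValueError("p must lie between 1 and the matrix size")
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    total = 0
    for _ in range(samples):
        r = rng.randint(0, size - p)
        c = rng.randint(0, size - p)
        total += sum(matrix[i][j] for i in range(r, r + p) for j in range(c, c + p))
    return total / samples


bits = [1, 0, 0, 1, 1, 1, 0, 1]
matrix = indicator_matrix(bits)
for row in matrix:
    # Black dot (#) for 1, white dot (.) for 0, mirroring the binary image in the text.
    print("".join("#" if v else "." for v in row))
print(sigma(matrix, 3))
```

For a genome of about 29000 bases the full matrix has roughly 8.4 × 10^8 entries, so a practical implementation would probably sample blocks directly from the bit list instead of materializing the matrix.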

HE
The autocorrelation of purine-pyrimidine bases for all the SARS-CoV-2 sequences was obtained through the HE. The HE is used in time series analysis to infer autocorrelation [23,24]. HE values range from 0 to 1. An HE of 0.5 indicates that the series is completely random, a value below 0.5 indicates negative autocorrelation, and a value above 0.5 indicates positive autocorrelation. The HE was computed for the binary sequence s_n of each SARS-CoV-2 genome.
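One common way to estimate the HE of a binary sequence is the classical rescaled-range (R/S) approach. The sketch below assumes the single-window form (n/2)^HE = R(n)/S(n), where R(n) is the range of the cumulative deviations from the mean and S(n) is the standard deviation. Both this particular form and the function name `hurst_rs` are assumptions for illustration; the study may have used a different estimator.

```python
import math


def hurst_rs(bits):
    """Single-window rescaled-range estimate: HE = log(R/S) / log(n/2)."""
    n = len(bits)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(bits) / n
    std = math.sqrt(sum((b - mean) ** 2 for b in bits) / n)
    if std == 0:
        raise ValueError("constant sequence: the rescaled range is undefined")
    # Cumulative deviations from the mean.
    running, cumulative = 0.0, []
    for b in bits:
        running += b - mean
        cumulative.append(running)
    value_range = max(cumulative) - min(cumulative)
    return math.log(value_range / std) / math.log(n / 2)


print(hurst_rs([1, 0] * 8))        # strictly alternating bits
print(hurst_rs([1] * 8 + [0] * 8)) # one long run of each symbol
```

The two toy inputs sit at the extremes: strictly alternating bits behave as a negatively correlated series, while two long runs behave as a positively correlated one.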

SE
The SE was used to measure the uncertainty of the binary sequence. Primary protein sequences are formed from different combinations of amino acids, with lengths ranging from 30 to 3000. Some sequences contain repetitive substrings such as AAAAAAAG and AAAAAAAAATTTTTTTT, which code for one or a small assortment of amino acids. Such sequences are less likely to encode functional proteins. Therefore, the amount of information, or the sequence uncertainty, per base was measured using the SE. The SE was computed as the entropy of a Bernoulli process with two outcomes (0/1):

$$ SE = -\sum_{i=1}^{2} p_i \log_2 p_i $$

where $p_1 = k/l$ and $p_2 = (l-k)/l$; here l is the length of the binary sequence, and k is the number of 1's in the binary sequence of length l [25,26].
If p = 0, the event never occurs, and if p = 1, the outcome is certain; in both cases the entropy is 0. When p = 0.5, the uncertainty is at its maximum, and consequently the SE is 1.
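The two-outcome entropy described above is short enough to write out directly. The sketch below follows the definitions in the text (k ones in a binary sequence of length l, logarithm to base 2); the function name `shannon_entropy` is hypothetical, and terms with zero probability are taken to contribute 0, which is the usual convention.

```python
import math


def shannon_entropy(bits):
    """Entropy (base 2) of a binary sequence with k ones out of l symbols."""
    l = len(bits)
    if l == 0:
        raise ValueError("empty sequence")
    k = sum(bits)
    entropy = 0.0
    for p in (k / l, (l - k) / l):
        if p > 0:  # convention: 0 * log2(0) is treated as 0
            entropy -= p * math.log2(p)
    return entropy


print(shannon_entropy([1, 0, 1, 0]))  # 1.0 (purines and pyrimidines equally likely)
print(shannon_entropy([1, 1, 1, 1]))  # 0.0 (no uncertainty)
```

Because this SE depends only on the proportion of 1's and not on their order, two sequences with very different arrangements can share the same SE, which is why it is paired with an order-sensitive measure such as the HE.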

GC Content and Nucleotides Density
In molecular biology, the GC-content is usually calculated as a percentage and is sometimes called the G + C ratio or GC-ratio [25][26][27][28]. The GC-content and GC-ratio of DNA sequences can be measured by several means. One of the simplest procedures is to measure the melting point of the DNA using spectrophotometry. A higher GC-content indicates a more stable DNA structure.
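The GC-content percentage was calculated by the formula [29,30]

$$ \text{GC-content} = \frac{G + C}{A + T + G + C} \times 100\% $$

where A, T, G, and C denote the number of occurrences of each base in the sequence. In addition to the GC-content, the densities of the nucleotides A, T, C, and G were computed separately in the present study [31,32].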
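The GC-content and the per-base densities amount to simple counting, as the following sketch shows. The function names are hypothetical, and excluding symbols other than A, T, C, and G from the denominator is an assumption made here.

```python
from collections import Counter


def base_densities(sequence):
    """Percentage of A, T, C and G among the unambiguous bases of a sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[b] for b in "ATCG")  # assumption: other symbols are ignored
    if total == 0:
        raise ValueError("no A, T, C or G in sequence")
    return {b: 100.0 * counts[b] / total for b in "ATCG"}


def gc_content(sequence):
    """GC-content as a percentage: (G + C) / (A + T + G + C) * 100."""
    densities = base_densities(sequence)
    return densities["G"] + densities["C"]


print(base_densities("GGCCAATT"))
print(gc_content("GGCCAATT"))  # 50.0
```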

Results and Illustrations
The nucleotide frequencies in the SARS-CoV-2 sequences are not random. In this study, we therefore evaluated the spatial organization of purines and pyrimidines among the SARS-CoV-2 sequences through the parameters defined in the previous section. In addition to the purine-pyrimidine distribution, we also explored the density of each nucleotide and the GC-content, which has a significant role in the stability of the sequence.

Classification based on the FD of the Indicator Matrices
Three distinct FDs (0.3, 0.4755, and 0.6) were identified, indicating that the sequences form only three clusters. Tab. 2 lists the sequences and their corresponding FD. The histograms of all the SARS-CoV-2 sequences plotted according to the FD are illustrated in Fig. 1.

Classification Based on SE
For all 89 binary purine-pyrimidine sequences of SARS-CoV-2, the SE was first determined, and ten different clusters were then formed based on the SE values, as shown in Tab. 4. The SE values and the histograms of all the SARS-CoV-2 sequences are illustrated in Fig. 3.
The SE ranged from 0.9999 to 1. This interval is so narrow that the SE is practically the same for all the sequences. The SE was 0.9999 for all sequences except S30, for which it was approximately 1. This indicates the maximum level of uncertainty for the S30 sequence, with a probability of purine or pyrimidine occurrence of 0.5. Thus, although the nucleotide bases of this sequence are not randomly arranged (they are positively autocorrelated, with an HE of 0.6553), its purine and pyrimidine bases occur with equal probability.

After evaluating the SE of the binary purine-pyrimidine representations of all the SARS-CoV-2 sequences, only three clusters were formed using the k-means clustering technique. Cluster-1 contained 21 sequences (S68, S78, S88, S71, S13, S69, S15, S42, S74, S67, S39, S40, S41, S60, S1, S14, S89, S12, S57, S3, and S4), with the SE centered at 0.999940381147619. The other 67 sequences belonged to cluster-2, centered at 0.999930184068656. Because these two centers differ by only about 0.00001, the two clusters can be considered the same. Cluster-3 contained only one sequence (S30), with an SE of 0.9999585474 (approximately 1), as already mentioned.
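The k-means grouping of scalar values such as the SE can be illustrated with a short one-dimensional sketch. This is plain Lloyd's algorithm with a deterministic, quantile-based initialization. The function name `kmeans_1d`, the initialization, and the toy values are assumptions for illustration and are not taken from the study.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on scalars; returns (centers, clusters)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must lie between 1 and the number of values")
    ordered = sorted(values)
    # Assumption: start from evenly spaced quantiles instead of random seeds.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        updated = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, clusters


# Toy values only; these are not the SE values of the study.
centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], k=2)
print(centers, clusters)
```

When the values span an interval as narrow as the one reported for the SE, the resulting centers can differ only in the fifth decimal place, so whether two clusters are meaningfully distinct has to be judged against that scale.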
The distribution of the SE across the purine-pyrimidine representations of the SARS-CoV-2 sequences was mostly linear. This distinguishes SARS-CoV-2 from the sequences examined in previous studies [35][36][37]. The uncertainty is close to its maximum, which means that purine and pyrimidine bases are equally probable across all the SARS-CoV-2 sequences.

GC, A, T, C, and G Density in the SARS-CoV-2
The sequences were classified according to the GC, A, T, C, and G densities, as follows. Tab. 5 shows the percentage density of the GC-content among all the SARS-CoV-2 sequences, and the histograms of all the sequences plotted according to the GC-content are shown in Fig. 4. The GC-content was around 38%, meaning that the SARS-CoV-2 sequences are A and T rich: A (about 30%) was the most common purine base, and T (about 32%) was the most common pyrimidine base. At the same time, the occurrence of purine and pyrimidine bases was equally probable according to their SE. This is a distinctive feature of the SARS-CoV-2 sequences.

Based on the GC-content of the SARS-CoV-2 sequences, ten different clusters were formed using the k-means clustering technique, as shown in Tab. 5. These ten clusters (C) had their centers at 37.9460, 38.0143, 37.9826, 37.9952, 38.0002, 37.9714, 37.9888, 38.0230, 37.9561, and 37.9128. The GC-content of all the sequences lay in the interval [37.91284, 38.02505]. Cluster-9 and cluster-10 each contained only one sequence. The GC-contents of S30, S13, and S60 were 37.91284%, 37.94602%, and 37.95605%, respectively. As explained previously, S30 also had an SE of approximately 1.

The A, T, C, and G intervals and their corresponding densities are summarized in Tab. 6. The histograms of all the SARS-CoV-2 sequences plotted according to the density of A, T, G, and C are illustrated in Figs. 5-8, respectively. The densities of A, T, C, and G over the SARS-CoV-2 sequences were approximately 30%, 32%, 18%, and 19%, respectively. These findings further confirm that SARS-CoV-2 is significantly AT rich and that the densities of the purine and pyrimidine bases are similar, as shown by the SE. For each of the four bases, the sequences were grouped into clusters whose centers differed by 0.01.
Sequence S79 has the lowest percentage (29.85%) of the nucleotide base A, whereas sequence S30 has the lowest T, C, and G densities. It is also observed that S79, S47, and S2 have the highest T density (32.13%), followed by G (19.65%) and C (18.40%).

Hamming Distance of the SARS-CoV-2
Similarity among the SARS-CoV-2 sequences was measured by calculating the distance between the binary strings obtained from the purine-pyrimidine encoding described earlier. Several methods exist for measuring the distance between multidimensional vectors, such as the Hamming distance (HD), Euclidean distance, elastic-matching distance, Jeffrey and Matusita distance, Manhattan distance, and Minkowski norm. The choice among these methods reportedly has little effect on the resulting vector similarity [38]. The HD between two binary strings is defined as the number of bit positions in which they differ [39,40]. However, here we had to take into consideration that the lengths of different SARS-CoV-2 genomes usually vary by some bases.
Suppose there are two SARS-CoV-2 binary sequences $S^1_x$ and $S^2_y$ with lengths x and y, respectively (x > y). Then $HD(S^1_x, S^2_y) = hd(S^1_y, S^2_y)$, where $S^1_y$ denotes the first y bits of $S^1_x$ and hd is the ordinary HD between strings of equal length. For example, the binary sequences 101011 and 0010 have a minimum length of 4; comparing them from left to right over these 4 bits gives HD(101011, 0010) = 1. Two binary sequences are identical over the compared length if the HD = 0, which indicates a similar distribution of purines and pyrimidines over the SARS-CoV-2 sequences. Conversely, the distributions of purines and pyrimidines over the SARS-CoV-2 sequences are completely different when the HD = min(x, y). To measure the distance between the SARS-CoV-2 sequences based on their purine-pyrimidine distribution, the minimum HD was used. The larger the HD between two sequences, the lower the probability that they are related to each other.

The SARS-CoV-2 sequences MT044258(S59), MN994468(S84), NC_045512(S1), and MN039888(S60) were grouped together as a single cluster, as the distance between them was almost negligible, indicating that they are closely related. Furthermore, the sequences MT152824(S38), MN996531(S76), MT012098(S36), and MT975262(S86) were closely related to each other and were therefore treated as a single cluster. Similarly, the sequences MT163719(S13), MT007544(S74), MT03988(S62), MT188341(S2), MT188339(S3), MT188340(S4), MN123290(S45), MT039873(S61), MT159721(S23), and MN072688(S54) had similar HDs and were therefore grouped together. The HDs between these sequences show that they are very closely related. This closeness among the SARS-CoV-2 genomes makes it possible to cluster future genomes, or other BLAST results, quantitatively rather than relying only on sequence similarity.
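The length-adjusted HD described above, in which the longer binary string is compared with the shorter one from left to right over the shorter length, can be sketched as follows. The function name `hamming_truncated` is hypothetical, and the sketch covers only the truncation rule stated in the text, not any alignment step.

```python
def hamming_truncated(a, b):
    """Number of differing positions over the first min(len(a), len(b)) symbols."""
    n = min(len(a), len(b))
    return sum(1 for x, y in zip(a[:n], b[:n]) if x != y)


print(hamming_truncated("101011", "0010"))  # 1, the worked example from the text
print(hamming_truncated("1010", "1010"))    # 0, identical over the compared length
print(hamming_truncated("1111", "0000"))    # 4, equal to min(x, y): completely different
```

Because the comparison is positional, a single insertion or deletion near the start of one genome shifts every later position, so this measure is most meaningful for sequences that are already nearly collinear, as complete SARS-CoV-2 genomes of similar length tend to be.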

Conclusions and Summary
The novel coronavirus has led to a worldwide public health emergency. One of the major reasons for such a global threat is the lack of quantitative and qualitative knowledge about this novel virus, including at the genomic and proteomic levels. In this article, we evaluated the quantitative nature of the complete SARS-CoV-2 sequences. The present study revealed the closeness amongst the 89 complete sequences at the purine-pyrimidine level through phylogenetic analysis. Based on this quantitative investigation, several notable observations were made. The purine and pyrimidine bases were found to be equally probable and evenly distributed throughout all 89 SARS-CoV-2 sequences. The GC-content was also low. These quantitative data help us to better understand the genomic sequences of SARS-CoV-2 and could potentially be used to reduce disease spread and to identify new therapeutic targets. However, these observations could be further strengthened by evaluating the SARS-CoV-2 proteins.