Tissue-specific prediction of N6-methyladenine sites based on an ensemble of multi-input hybrid neural networks

N6-methyladenine (m6A) is a dynamic and reversible post-transcriptional modification that plays an essential role in various biological processes. Because experimental identification of m6A-containing mRNAs remains difficult, computational approaches have been developed to identify m6A sites in RNA sequences. Aiming to improve prediction performance, we introduce a novel ensemble computational approach based on three hybrid deep neural networks, comprising a convolutional neural network, a capsule network, and a bidirectional gated recurrent unit (BiGRU) with a self-attention mechanism, to identify m6A sites in multiple tissues of three species. Across a total of 11 datasets, we selected dataset-specific feature subsets, optimized from 4933-dimensional features, as input to the deep hybrid neural networks. In addition, to reduce the bias caused by the relatively small number of experimentally verified samples, we constructed an ensemble model by integrating five sub-classifiers trained on different training datasets. In both 5-fold cross-validation and independent tests, our model outperformed the previous methods im6A-TS-CNN and iRNA-m6A.


Introduction
There are more than 160 identified types of RNA post-transcriptional modifications. Among them, the 5' cap and 3' poly(A) modifications play important roles in transcriptional regulation, while internal modifications help maintain the stability of mRNA in eukaryotes (Cao et al., 2016; Yan et al., 2021). One of the most common internal modifications is N6-methyladenine (m6A). Since its discovery in the 1970s, it has been observed in a wide range of eukaryotes, including yeast, Arabidopsis thaliana, Drosophila, and mammals, as well as in the RNA of viruses (Cao et al., 2016; Yang et al., 2020). m6A is a dynamic, reversible post-transcriptional modification and is essential in post-transcriptional regulation: it regulates gene expression and splicing, participates in RNA editing, and maintains genomic stability (Cao et al., 2016). For a long time, however, m6A modifications were considered static and unalterable, owing both to ignorance of m6A demethylating enzymes and to the short lifetime of most RNA species (median mammalian RNA half-lives are approximately 5 h) (Cao et al., 2016; Yan et al., 2021). The inability to identify m6A-containing mRNAs has also hindered investigation into their biological roles.
There are several representative classification models based on shallow learning. Feng et al. (2019) integrated nucleotide physicochemical properties into PseKNC (pseudo k-tuple nucleotide composition) and an SVM (support vector machine) to build a prediction tool called iDNA6mA-PseKNC. Another prediction model, SDM6A, was developed by Basith et al. (2019) to identify DNA N6-methyladenine (6mA) sites in the rice genome. They used numerical representations of nucleotides, mono-nucleotide binary encoding, di-nucleotide binary encoding, local position-specific di-nucleotide frequency, ring-function hydrogen chemical properties and K-nearest neighbor features, and selected among them by F-score. They then tried four traditional machine learning (ML) classifiers, namely SVM, ERT (extremely randomized trees), RF (random forest) and XGB (extreme gradient boosting), to predict DNA N6-methyladenine, and finally integrated two classifiers to construct the model. Hasan et al. (2020) implemented five encoding schemes (mono-nucleotide binary, di-nucleotide binary, k-space spectral nucleotide, k-mer, and electron-ion interaction pseudopotential compositions) to build five single-encoding RF models for identifying DNA 6mA sites in Rosaceae genomes. They then combined the prediction probability scores of these five RF models with a linear regression model to construct the i6mA-Fuse classifier.
Several predictors based on deep learning algorithms have also been developed. Tahir et al. (2019) built a deep learning model, iDNA6mA, which predicts 6mA sites by integrating one-hot encoding with a convolutional neural network (CNN). Nazari et al. (2019) used not only a convolutional neural network but also the natural-language-processing model Word2Vec to extract features from sequences automatically, and constructed the iN6-Methyl model, which identifies m6A sites in multiple species.
Moreover, with the deepening understanding of the spatial specificity of gene expression, two studies have offered insight into distinguishing m6A modification sites in various tissues of human, mouse and rat. Dao et al. (2020) extracted three kinds of features, namely physicochemical properties, mono-nucleotide binary encoding and nucleotide chemical properties, and combined them with an SVM to construct a predictor called iRNA-m6A. In another study, Liu et al. (2020) proposed a predictor called im6A-TS-CNN, which employed one-hot encoding and a CNN. However, neither model gave satisfactory performance, because of limitations in feature extraction and classifier architecture design. These two studies considered neither location and context information nor the removal of redundant information. In addition, deeper network architectures with stronger feature learning ability remained to be explored.
To address these limitations, we propose a novel computational model that uses three kinds of feature descriptors to characterize the nucleic acid sequence: one-hot encoding, sequence features derived from iLearn, and the K-tuple nucleotide frequency pattern (KNFP). To reduce the noise caused by excessive unrelated features, we ranked features by F-score and reduced the feature dimension through a series of triple 5-fold cross-validation tests. We then used a hierarchical deep learning network composed of a multi-channel convolutional neural network, a capsule network, and a bidirectional gated recurrent unit (BiGRU) with a self-attention mechanism to learn local and contextual information. Moreover, we randomly divided the positive and negative training datasets into five mutually exclusive parts of similar size, and each time selected four parts combined as a new training dataset, with the remaining part adopted as the validation set to optimize the model. Finally, we built an ensemble model that assigns predicted labels by a majority-voting strategy. To evaluate the effectiveness of the ensemble model, we compared its performance with im6A-TS-CNN and iRNA-m6A through 5-fold cross-validation and an independent test. On all 11 datasets, our model gave the best performance in terms of accuracy and Matthews correlation coefficient. In addition, we visualized the learned representations for brain tissue in human, mouse and rat using t-distributed stochastic neighbor embedding (t-SNE). Fig. 1 illustrates the design and optimization process of our model.

Datasets
In this study, we trained and evaluated our model on benchmark datasets comprising a total of 11 training and 11 testing datasets from human, mouse and rat, which were also used for the iRNA-m6A and im6A-TS-CNN models (Dao et al., 2020; Liu et al., 2020). Each dataset contains the same number of positive and negative samples, and each sample is a 41-nt RNA sequence with adenine at the center. Detailed information about these datasets can be found in the work of Dao et al. (2020).

Feature extraction and feature selection
Extracting efficacious features is vital when developing a new computational model based on machine learning or deep learning algorithms (Zhang and Liu, 2019). In this study, we extracted three categories of features from each sequence: one-hot encoding, sequence features, and order features.
One-hot encoding
Given a DNA sequence $D$, its intuitive expression is
$$D = R_1 R_2 R_3 \cdots R_{41},$$
where $R_i$ represents the nucleic acid residue at position $i$ in the sequence. Each 41-nt sequence is represented as a $41 \times 4$ matrix, in which (1, 0, 0, 0) stands for G, (0, 1, 0, 0) stands for C, (0, 0, 1, 0) stands for U, and (0, 0, 0, 1) stands for A.
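As an illustration, a minimal Python sketch of this encoding, assuming input sequences over the alphabet G/C/U/A of exactly 41 nt:

```python
import numpy as np

# One-hot scheme described above: each 41-nt sequence becomes a 41 x 4
# matrix, with G, C, U and A mapped to the four unit vectors in the
# order given in the text.
ONE_HOT = {
    "G": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "U": (0, 0, 1, 0),
    "A": (0, 0, 0, 1),
}

def one_hot_encode(seq):
    """Encode a 41-nt RNA sequence as a 41 x 4 one-hot matrix."""
    return np.array([ONE_HOT[base] for base in seq], dtype=np.float32)

seq = "GC" * 10 + "A" + "UG" * 10   # toy sequence, adenine at the center
print(one_hot_encode(seq).shape)    # (41, 4)
```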

Sequence features
Transforming a DNA sequence into a vector based on its sequence composition is a simple but universal strategy that can capture significant biological information (Zhen et al., 2020; Zou et al., 2019). iLearn is a comprehensive and versatile Python-based toolkit that includes a variety of descriptors for DNA, RNA and proteins (Zhen et al., 2020). We used iLearn to calculate four types of features: nucleic acid composition, binary and electron-ion interaction pseudopotential encodings, autocorrelation and cross-covariance, and pseudo nucleic acid composition, yielding a total of 1325 features. The names and dimensions of the features used in this section are listed in Table 1. For the specific definitions of these features, please refer to Zhen et al. (2020) and Zou et al. (2019).
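As a concrete example, the sketch below computes one of these descriptor families, the electron-ion interaction pseudopotential (EIIP) profile, directly in Python rather than through the iLearn toolkit; the EIIP values are the commonly cited ones, with U assigned the thymine value (an assumption for RNA input):

```python
# Electron-ion interaction pseudopotential (EIIP) profile of a sequence,
# using the standard nucleotide values; U takes the thymine value.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}

def eiip_encode(seq):
    """Map a nucleotide sequence to its EIIP numeric profile."""
    return [EIIP[base] for base in seq]

print(eiip_encode("GCAUG"))  # [0.0806, 0.134, 0.126, 0.1335, 0.0806]
```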
K-tuple nucleotide frequency pattern
KNFP integrates information from both the k-mer and one-hot encodings, and can effectively compensate for insufficient short-range or local sequence-order information (Yang et al., 2020). It has been used to identify protein-RNA binding sites and protein-circRNA interaction sites (Yang et al., 2020). K-mer encoding maps any DNA sequence to a vector with $4^k$ dimensions:
$$R = \left[\phi_1, \phi_2, \cdots, \phi_{4^k}\right]^{T},$$
where $\phi_u$ $(u = 1, 2, \cdots, 4^k)$ is the frequency of the $u$-th k-mer along the sequence ($k = 1, 2, 3$ in this work).
On one hand, $R$ can be transformed into a diagonal matrix $R_D$ by multiplying it with the identity matrix. On the other hand, for a DNA sequence $D$ of length $L$, the number of k-mers is $L - k + 1$, and each k-mer can be encoded as a one-hot vector of dimension $4^k$, giving an $(L - k + 1) \times 4^k$ matrix $M$. The KNFP is the product $M R_D$.
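A small sketch of this construction, following the definitions above (frequency vector $R$, diagonal matrix $R_D$, one-hot k-mer matrix $M$):

```python
import numpy as np
from itertools import product

# KNFP = M @ R_D: R holds k-mer frequencies, R_D = diag(R), and M is the
# (L-k+1) x 4**k one-hot matrix of the k-mers along the sequence.
def knfp(seq, k, alphabet="ACGU"):
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    n = len(seq) - k + 1
    R = np.zeros(4 ** k)
    M = np.zeros((n, 4 ** k))
    for i in range(n):
        j = index[seq[i:i + k]]
        R[j] += 1.0 / n       # frequency of this k-mer
        M[i, j] = 1.0         # one-hot row for the i-th k-mer
    return M @ np.diag(R)

print(knfp("GCAUGCAUGCA", k=2).shape)  # (10, 16)
```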

Feature optimization
Experiments have shown that excessive feature information can interfere with classifier performance. Therefore, feature selection should be applied to find the most informative features for training classifiers and thereby reduce the dimensionality of the feature vector. The F-score has been extensively applied in bioinformatics because of its effectiveness in balancing accuracy and stability (Bui et al., 2016; Li et al., 2018). The F-score of the $j$-th feature is defined as
$$F(j) = \frac{\left(\bar{x}_j^{+} - \bar{x}_j\right)^2 + \left(\bar{x}_j^{-} - \bar{x}_j\right)^2}{\dfrac{1}{m^{+}-1}\sum_{k=1}^{m^{+}}\left(x_{k,j}^{+} - \bar{x}_j^{+}\right)^2 + \dfrac{1}{m^{-}-1}\sum_{k=1}^{m^{-}}\left(x_{k,j}^{-} - \bar{x}_j^{-}\right)^2},$$
where $\bar{x}_j$, $\bar{x}_j^{+}$ and $\bar{x}_j^{-}$ denote the average values of the $j$-th feature over the combined positive and negative training datasets, the positive dataset, and the negative dataset, respectively. $m^{+}$ is the total number of positive samples; $m^{-}$ is the total number of negative samples; $x_{k,j}^{+}$ represents the $j$-th feature of the $k$-th positive sample, and $x_{k,j}^{-}$ represents the $j$-th feature of the $k$-th negative sample. The higher the F-score, the more useful the corresponding feature is for classification.
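A direct implementation of this definition, assuming the positive and negative samples are provided as two feature matrices:

```python
import numpy as np

# F-score of every feature column, computed exactly as defined above.
def f_scores(X_pos, X_neg):
    X_all = np.vstack([X_pos, X_neg])
    mean_all = X_all.mean(axis=0)
    mean_pos, mean_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    num = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    den = X_pos.var(axis=0, ddof=1) + X_neg.var(axis=0, ddof=1)
    return num / (den + 1e-12)   # guard against constant features

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(100, 50))   # positive samples
X_neg = rng.normal(0.0, 1.0, size=(120, 50))   # negative samples
ranked = np.argsort(f_scores(X_pos, X_neg))[::-1]   # best features first
print(ranked[:10])
```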

Multi-input hybrid neural network
The model architecture consists of a multi-channel CNN, a capsule network and a BiGRU network. These three networks have been shown to be effective in object detection (Li et al., 2018), protein post-translational modification site prediction (Wang et al., 2019) and social bot detection (Wu et al., 2020), respectively. The structure of the multi-input hybrid neural network is shown in Fig. 2.

Multi-channel CNN
The inputs of the multi-input hybrid neural network are the one-hot encoding, the sequence features and the KNFP, respectively. For each type of feature, we applied 32 convolution filters and performed batch normalization to readjust the data distribution. The input of a batch in the neural network is $B = \{x_1, x_2, \cdots, x_n\}$, where $x_i$ represents a sample and $n$ is the batch size.
The mean value of the elements in each batch is obtained by
$$\mu_B = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Then, the variance of the batch is calculated by
$$\sigma_B^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu_B\right)^2.$$
This allows us to normalize each element:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},$$
where $\epsilon$ is a small positive number used to prevent the denominator from being 0. The final output of the layer is given by
$$y_i = \gamma \hat{x}_i + \beta,$$
where $\gamma$ and $\beta$ are learnable scale and shift parameters. Finally, we merge the three outputs using the Swish activation function $\sigma$, which is defined as
$$\sigma(x) = x \cdot \mathrm{sigmoid}(x).$$
Because of the limited size of our datasets, we add a 1 × 1 multi-channel CNN to enhance the representational capability of the model. The 1 × 1 convolution maintains the size of the feature map and integrates information by linearly weighting the input feature map of each channel. With additional layers of such 1 × 1 convolutions, the final extracted features become more concise.
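A hedged sketch of one branch of this stage in Keras (the framework itself is our assumption); only the one-hot input shape follows directly from the text, so the channel counts of the other two branches are placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers

# One input branch: Conv1D (32 filters), batch normalization, Swish,
# then a 1 x 1 convolution that mixes channels without changing the
# feature-map size.
def conv_branch(input_shape):
    inp = layers.Input(shape=input_shape)
    x = layers.Conv1D(32, kernel_size=3, padding="same")(inp)
    x = layers.BatchNormalization()(x)
    x = layers.Activation(tf.nn.swish)(x)     # Swish: x * sigmoid(x)
    x = layers.Conv1D(32, kernel_size=1)(x)   # 1 x 1 channel mixing
    return inp, x

# Three branches: one-hot (41 x 4), sequence features and KNFP
# (the latter two channel counts are placeholders).
(i1, b1), (i2, b2), (i3, b3) = (conv_branch(s)
                                for s in [(41, 4), (41, 20), (41, 16)])
merged = layers.Concatenate(axis=-1)([b1, b2, b3])   # (41, 96) feature map
```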

Capsule network
The capsule network (CapsNet) was proposed by Sabour et al. (2017) and has been applied in stance detection (Zhao and Yang, 2020), image recognition (Qian et al., 2020) and automated classification (Mobiny et al., 2019). Because the capsule network preserves location information, it can learn a good representation from a small amount of data. We use the capsule network to focus on the hierarchical relationships among local features. The output of the multi-channel CNN is adopted as the input of the capsule network. We first apply an affine transformation to the input vectors:
$$\hat{u}_{j|i} = W_{ij} u_i,$$
where $W_{ij}$ is a weight matrix to be trained and $u_i$ is the $i$-th input vector of the capsule network. Next, a weighted sum is taken over all the prediction vectors:
$$s_j = \sum_i c_{ij} \hat{u}_{j|i},$$
where $c_{ij}$ is the coupling coefficient determined by the dynamic routing process. Finally, we obtain the output vectors through a non-linear "squash" activation function:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|}.$$
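These operations can be sketched directly in NumPy; the capsule count and dimension below follow the Parameter setting section, while the number of input capsules is an assumption:

```python
import numpy as np

# "Squash" non-linearity: shrinks short vectors toward 0, long ones toward 1.
def squash(s, eps=1e-8):
    norm2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

# Dynamic routing over prediction vectors u_hat of shape
# (n_input_capsules, n_output_capsules, dim_capsule).
def dynamic_routing(u_hat, n_iters=3):
    b = np.zeros(u_hat.shape[:2])                  # routing logits b_ij
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # c_ij
        s = np.einsum("ij,ijd->jd", c, u_hat)      # weighted sum s_j
        v = squash(s)                              # output vectors v_j
        b += np.einsum("ijd,jd->ij", u_hat, v)     # agreement update
    return v

u_hat = np.random.randn(8, 14, 41)   # 14 capsules of dim 41, as configured
print(dynamic_routing(u_hat).shape)  # (14, 41)
```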

BiGRU network
The third segment of the model is the BiGRU network, which helps to extract deep-level contextual features of sequences. The forward and backward hidden states are computed and combined as
$$\overrightarrow{h}_t = \mathrm{GRU}\left(x_t, \overrightarrow{h}_{t-1}\right), \qquad \overleftarrow{h}_t = \mathrm{GRU}\left(x_t, \overleftarrow{h}_{t+1}\right), \qquad h_t = w_t \overrightarrow{h}_t + v_t \overleftarrow{h}_t + b_t,$$
where $\mathrm{GRU}(\cdot)$ represents a nonlinear transformation of the input vector, $w_t$ and $v_t$ are weight matrices, and $b_t$ is the bias term.
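A hedged Keras sketch of this stage combined with a simple additive self-attention pooling; the paper does not fully specify its attention formulation, so this is one common realization rather than the authors' exact design:

```python
import tensorflow as tf
from tensorflow.keras import layers

# BiGRU over the merged feature map, followed by additive self-attention
# pooling of the hidden states into a single context vector.
def bigru_attention(input_shape, units=32):
    inp = layers.Input(shape=input_shape)      # (timesteps, features)
    h = layers.Bidirectional(layers.GRU(units, return_sequences=True))(inp)
    scores = layers.Dense(1, activation="tanh")(h)   # per-step scores
    weights = layers.Softmax(axis=1)(scores)         # attention over steps
    context = layers.Lambda(
        lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([weights, h])
    return tf.keras.Model(inp, context)

model = bigru_attention((41, 96))   # e.g. the merged multi-channel output
```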

Parameter setting
Considering the number of datasets and the precision of the model, three feature maps were obtained via batch normalization and 1D convolution with 32 filters (kernel size = 3). The multi-channel CNN contained three 1 × 1 convolution layers and used Swish as the activation function. Considering the time cost, we employed only one capsule layer with num_capsule = 14 (dim_capsule = 41, routings = 3). The BiGRU had 32 hidden units followed by a fully connected layer with the ReLU activation function. We also used dropout with a keep probability of 0.3 to prevent the model from overfitting. For stochastic gradient descent, we selected the Adam optimization algorithm. The entire program was written in Python 3.6.
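Putting these settings together, a sketch of the classification head and training configuration; the loss function (binary cross-entropy) is an assumption, and the stated 0.3 is interpreted here as the drop rate (if it is truly a keep probability, the rate would be 0.7):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Classification head and training setup with the stated settings.
def build_head(features):
    x = layers.Dense(32, activation="relu")(features)  # fully connected, ReLU
    x = layers.Dropout(0.3)(x)   # 0.3 read as drop rate (see lead-in)
    return layers.Dense(1, activation="sigmoid")(x)    # site / non-site

inp = layers.Input(shape=(64,))  # placeholder for the hybrid feature vector
model = tf.keras.Model(inp, build_head(inp))
model.compile(optimizer=tf.keras.optimizers.Adam(),   # Adam, as stated
              loss="binary_crossentropy",             # assumed loss
              metrics=["accuracy"])
```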

Performance assessment
To evaluate the performance of our prediction model, we used four measurements, accuracy (Acc), sensitivity (Sn), specificity (Sp) and Matthews correlation coefficient (MCC), in 5-fold cross-validation and independent dataset tests:
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP},$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively.
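These measures translate directly into code:

```python
import math

# Direct implementation of the four measures defined above.
def evaluate(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, sn, sp, mcc

print(evaluate(tp=80, tn=75, fp=25, fn=20))
```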

Results
We observed deviations in the 5-fold cross-validation tests (Table S1), probably due to the limited number of experimentally verified samples. A general strategy for dealing with insufficient samples is to construct an ensemble prediction model. For this purpose, we randomly divided the positive and negative training datasets into five mutually exclusive parts of similar size. Each time, four parts were combined into a new training dataset, while the remaining part was adopted as the validation dataset used to train and optimize the three-layer hybrid neural network. We thus obtained five sub-models, which were then integrated into a novel ensemble prediction model based on a majority-voting strategy. To verify the effectiveness of the ensemble model, we compared it with two previous prediction methods, im6A-TS-CNN and iRNA-m6A. The same training and independent datasets were used for all three models; therefore, both 5-fold cross-validation and independent tests could be used to evaluate these methods objectively. There were a total of 132 results from the 11 datasets across four indicators (Sn, Sp, Acc, MCC), among which our model achieved superior performance; detailed results are given in Tables S2 and S3.
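A sketch of this ensemble scheme (build_model stands in for the hybrid network defined above; the epoch and batch settings are assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Five sub-models, each trained on four of five stratified splits and
# validated on the fifth; final labels decided by majority vote.
def ensemble_predict(X, y, X_test, build_model):
    votes = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, val_idx in skf.split(X, y):
        model = build_model()                       # fresh hybrid network
        model.fit(X[train_idx], y[train_idx],
                  validation_data=(X[val_idx], y[val_idx]),
                  epochs=30, batch_size=64, verbose=0)   # assumed settings
        votes.append((model.predict(X_test) > 0.5).astype(int).ravel())
    return (np.sum(votes, axis=0) >= 3).astype(int)  # majority of five
```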

Discussion
Effectiveness of feature selection
If too many features are extracted, the generalizability of the model is weakened, so it is important to determine an appropriate step size for feature selection. As the dimension of the input matrix was 41 × N, we chose a step size of 41 and successively evaluated the performance of our model with feature matrices of different dimensions (41 × N, $N \in \{5, 6, 7, \cdots, 25\}$) in 5-fold cross-validation tests. To reduce deviations caused by the fluctuation of the neural network, we ran each test three times for the 11 datasets from the brain, liver, kidney, heart and testis of Homo sapiens, Mus musculus and Rattus norvegicus, and finalized the optimal feature subset according to the average accuracy, as sketched below. The detailed feature selection results are shown in Fig. 5. Notably, the dimensions of the optimal feature subsets varied among datasets; the corresponding dimensions are listed in Table 2. These results indicate that it is necessary to establish a specialized model for each tissue type in each species.
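In outline, the scan looks like this (cv_accuracy is a stub standing in for one 5-fold cross-validation run of the hybrid network, and the data are synthetic):

```python
import numpy as np

# Stub for one evaluation of the hybrid network (in practice: the mean
# accuracy of a 5-fold cross-validation run).
def cv_accuracy(X_sub, y):
    return 0.70 + 0.05 * np.random.rand()   # placeholder score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 41 * 25))   # feature columns pre-ranked by F-score
y = rng.integers(0, 2, size=500)

# For each candidate dimension 41*N, average three repeated runs and
# keep the dimension with the best mean accuracy.
scores = {n: np.mean([cv_accuracy(X[:, :41 * n], y) for _ in range(3)])
          for n in range(5, 26)}
best_N = max(scores, key=scores.get)
print(41 * best_N)
```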

Parameter selection
Generally, there are two ways to select parameters: empirical choice and Bayesian optimization. Taking Homo sapiens as an example, we tried both methods to find the most suitable parameters. The initial parameters were set based on previous work to obtain rough prediction results for comparison. Then, if the prediction model was underfitting, we added more convolution kernels and neurons; if it was overfitting, we reduced their numbers. In addition, batch normalization, dropout, and regularization were introduced to avoid overfitting during the optimization process. Alternatively, Bayesian optimization (Snoek et al., 2012) was used to tune key parameters including the batch size, dropout rate, filter1, filter2 and pool_size. Finally, the optimal parameters were selected according to the AUC value. The Bayesian optimization of parameters in the models for Homo sapiens and the corresponding AUC values are provided in Table S4 and Fig. 6A. We compared the best results obtained by the two methods and concluded that the empirically adjusted parameters were more advantageous (Fig. 6B). Thus, for Mus musculus and Rattus norvegicus, we used the empirical method to determine the parameters of the neural network.
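As an illustration of the Bayesian-optimization route, a sketch using scikit-optimize; the library choice, parameter ranges, and the train_and_evaluate helper are all assumptions, since the paper names only the tuned parameters:

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Search space mirroring the parameters named in the text (ranges assumed).
space = [
    Integer(16, 256, name="batch_size"),
    Real(0.1, 0.7, name="dropout"),
    Integer(16, 128, name="filter1"),
    Integer(16, 128, name="filter2"),
    Integer(2, 5, name="pool_size"),
]

def train_and_evaluate(batch_size, dropout, filter1, filter2, pool_size):
    # Hypothetical helper: train the hybrid network and return the
    # validation AUC. A toy surrogate keeps this sketch runnable.
    return 0.8 - abs(dropout - 0.3)

def objective(params):
    return -train_and_evaluate(*params)   # gp_minimize minimizes

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print(result.x)   # parameter set with the best (surrogate) AUC
```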

Comparison of different classifiers
To verify the effectiveness of the hybrid neural network, we compared its prediction performance on the Homo sapiens datasets with several traditional classification algorithms, including logistic regression, decision tree, support vector machine (SVM), random forest (RF), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost) and light gradient boosting machine (LightGBM). The average accuracies of the 5-fold cross-validation tests obtained by the seven algorithms are listed in Table 3. On the three Homo sapiens datasets, our model achieves a mean accuracy of 74.29%, 1.28% higher than the second-best algorithm, LightGBM, which demonstrates that the hybrid neural network is capable of improving the recognition of m6A sites across tissues and species.
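A sketch of such a baseline comparison with scikit-learn (synthetic data stands in for the optimized feature matrices; XGBoost and LightGBM would be added analogously through their scikit-learn-compatible wrappers):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# 5-fold cross-validated accuracy for the traditional baselines.
X, y = make_classification(n_samples=1000, n_features=41 * 10, random_state=0)
baselines = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(),
    "SVM": SVC(),
    "RF": RandomForestClassifier(),
    "GBDT": GradientBoostingClassifier(),
}
for name, clf in baselines.items():
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.4f}")
```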

Visualization of feature representations
To observe the effectiveness of the extracted features intuitively, we applied t-distributed stochastic neighbor embedding (t-SNE) to visualize the feature representations. Taking brain tissue as an example, we visualized the features after they were mapped through the concatenate layer and the attention layer. Each dot in Fig. 6C represents a sample, with purple dots representing m6A sites and yellow dots representing non-m6A sites. The overlap of the two sample types on the left side of Fig. 6C indicates that it is difficult to distinguish m6A sites from non-m6A sites in the input feature space. However, as shown on the right side of Fig. 6C, the features became relatively well separated after being selected and processed by the deep hierarchical network, that is, the multi-channel CNN, the capsule network and the BiGRU network with the self-attention mechanism. The t-SNE plots indicate that the hybrid deep hierarchical networks can learn sequence motif information from the selected features. Nevertheless, some overlap remains between the two classes, indicating that our model is not completely specific for all m6A sites; this observation is consistent with the need to establish specialized models for each tissue type. In future work, we will extract more types of information and further optimize these models.

Availability of Data and Materials: The datasets and source code used in this study are freely available at https://github.com/Dong7777/im6A.
Author Contribution: QZ and XW conceived the project, developed the prediction method, designed and implemented the experiments, analyzed the results, and wrote the paper. DJ and CZJ implemented the experiments, analyzed the results, and wrote the paper. All authors read and approved the final manuscript.
Funding Statement: This work was supported by the National Natural Science Foundation of China (Nos. 62071079 and 61803065).

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.