A deep learning framework for modeling structural features of RNA-binding protein targets

RNA-binding proteins (RBPs) play important roles in the post-transcriptional control of RNAs. Identifying RBP binding sites and characterizing RBP binding preferences are key steps toward understanding the basic mechanisms of post-transcriptional gene regulation. Though numerous computational methods have been developed for modeling RBP binding preferences, discovering a complete structural representation of RBP targets by integrating their available structural features in all three dimensions remains a challenging task. In this paper, we develop a general and flexible deep learning framework for modeling structural binding preferences and predicting binding sites of RBPs, which takes (predicted) RNA tertiary structural information into account for the first time. Our framework constructs a unified representation that characterizes the structural specificities of RBP targets in all three dimensions, which can be further used to predict novel candidate binding sites and discover potential binding motifs. Through tests on real CLIP-seq datasets, we have demonstrated that our deep learning framework can automatically extract effective hidden structural features from the encoded raw sequence and structural profiles, and predict accurate RBP binding sites. In addition, we have conducted the first study to show that integrating additional RNA tertiary structural features can improve model performance in predicting RBP binding sites, especially for the polypyrimidine tract-binding protein (PTB), which also provides new evidence to support the view that RBPs may have specific tertiary structural binding preferences. In particular, the tests on internal ribosome entry site (IRES) segments yield satisfactory results with experimental support from the literature and further demonstrate the necessity of incorporating RNA tertiary structural information into the prediction model.
The source code of our approach can be found at https://github.com/thucombio/deepnet-rbp.


Figure S1: Examples of the RBM and replicated softmax models. (a) An RBM model, consisting of a hidden layer and a visible layer. (b) A replicated softmax model, in which "words" represent the input RNA base sequence or secondary structural profiles, while "latent topics" are hidden variables that represent unknown factors controlling the RNA sequence or secondary structural features.

S1 Supplementary details of the replicated softmax model
To exemplify our notation defined in the main paper, let us consider an RNA sequence/document D = AAUCG and the corresponding RNA dictionary K = (A, U, C, G). The energy function and the corresponding joint probability density function are defined in the main paper (see Eqs. 3 and 4, respectively). Based on these definitions, the conditional distributions of the visible and hidden variables are given by

P(v_k^(i) = 1 | h) = exp(b_k + Σ_j h_j W_jk) / Σ_{k'=1}^{K} exp(b_{k'} + Σ_j h_j W_{jk'}), (S1)

P(h_j = 1 | V) = sig(M a_j + Σ_{k=1}^{K} v̂_k W_jk), (S2)

where M denotes the length of document D (here M = 5), v̂_k denotes the number of occurrences of the k-th RNA word in D, and sig(x) = 1/(1 + e^{-x}) stands for the sigmoid function. Note that Eq. S1, which computes the probability of each RNA word appearing in the RNA document given a particular combination of latent topics (denoted as the hidden variables), is the so-called softmax function. Since we omit the occurrence order of the RNA words and only consider their frequencies, a bag of words (but not a sequence of words) can be generated by (replicated) sampling from the model. These two points explain the model's name, replicated softmax.
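These conditionals can be illustrated numerically. The following minimal NumPy sketch (not the authors' deepnet code; all parameter values below are illustrative) evaluates the hidden-topic probabilities and the softmax word distribution for the toy document D = AAUCG:

```python
import numpy as np

# A minimal sketch of the replicated softmax conditionals for the toy
# document D = AAUCG over the dictionary K = (A, U, C, G).
# W, a, b are illustrative random/zero parameters, not trained values.
rng = np.random.default_rng(0)
K, J, M = 4, 3, 5                    # dictionary size, latent topics, |D|
W = rng.normal(0.0, 0.01, (J, K))    # topic-word weights
a = np.zeros(J)                      # hidden (topic) biases
b = np.zeros(K)                      # visible (word) biases
v_counts = np.array([2.0, 1.0, 1.0, 1.0])  # counts of A, U, C, G in D

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hidden topics given the bag of words (the sigmoid conditional).
p_hidden = sig(M * a + W @ v_counts)

# Word distribution given a topic configuration (here, the mean-field
# values): one softmax shared ("replicated") across all M word slots of D.
scores = b + p_hidden @ W
p_word = np.exp(scores) / np.exp(scores).sum()
```

Because the softmax over words is shared across all slots of the document, only the word counts v̂_k matter, which is exactly the bag-of-words property noted above.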
S2 Featured examples of the RNA tertiary structural motifs

Figure S2 shows four examples of RNA tertiary structural motifs derived from R3DMA [3].

S3 Descriptive statistics on the CLIP-seq datasets of 24 RBPs

Table S1 provides descriptive statistics on the CLIP-seq datasets of the 24 RBPs used in our tests, including the dimensions of individual sequence and structural profiles, the numbers of positive and negative samples, and the size of the validation set.

S4 Supplementary details of cross-validation, model training and implementation
In this section, we provide the details on the 10-fold cross-validation, model training and implementation of our deep learning framework.
To evaluate the performance of our model, we performed a 10-fold cross-validation procedure for all 24 RBP datasets. In particular, when performing cross-validation, we further randomly split the nine non-testing subsamples into two parts, with eight subsamples for pure model training and one subsample for validation. We also verified that there were few nearly identical sequences in any two validation sets, since the percentage of pairwise homologous RNA sequences in each of the 24 RBP datasets was small (< 1%, computed by BLAST [6] using an E-value cutoff of 0.001; see also Table S1 for details).
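The resulting 8/1/1 split used in each fold can be sketched as follows; the dataset size, seed, and the deterministic choice of the validation fold are illustrative assumptions (the paper selects the validation subsample randomly from the nine non-testing subsamples):

```python
import numpy as np

# Sketch of the per-fold split: one subsample for testing, one for
# validation, eight for pure model training. Sizes are illustrative.
rng = np.random.default_rng(42)
n_samples = 1000
folds = np.array_split(rng.permutation(n_samples), 10)

for test_fold in range(10):
    test_idx = folds[test_fold]
    val_fold = (test_fold + 1) % 10      # one non-testing subsample
    val_idx = folds[val_fold]
    train_idx = np.concatenate(
        [folds[k] for k in range(10) if k not in (test_fold, val_fold)]
    )
    # train on train_idx, pick the best model on val_idx (AUROC),
    # then report AUROC on test_idx for this fold
```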
To train our deep learning framework, we first performed the layer-wise unsupervised learning strategy to pretrain the network (DBN) [7], during which the modality-specific information was eliminated, the sequence and structural profiles in different dimensions were integrated, and the unified implicit representation that characterizes the RBP binding preference was constructed. To predict RBP binding sites, we further transformed the multimodal DBN into a standard neural network with the same architecture, in which the parameters were initialized via the aforementioned pretraining process, and then the error back-propagation strategy was conducted to fine-tune the model parameters. Note that, to keep the bottom replicated softmax models interpretable, i.e., to make the hidden layers attached to the input base sequence and secondary structural profiles always represent the corresponding RNA topics, we did not change their parameters in the back-propagation process.

Table S1: Descriptive statistics of the CLIP-seq datasets. "# of +" denotes the number of positive samples, while "# of -" denotes the number of negative samples. The column "# of val" reports the average size of the validation set used in the 10-fold cross-validation procedure. The percentages of pairwise homologous RNA sequences in these 24 RBP datasets were all less than 1% (computed by BLAST [6] using an E-value cutoff of 0.001).

Figure S3: The error curves of the training and test data of PTB. The cross entropy was used to define the error score [8].
We implemented our deep learning framework based on deepnet, a GPU-based Python library that can be used to implement various deep learning models, e.g., deep belief networks (DBNs), deep Boltzmann machines (DBMs) and stacked autoencoders. The hyperparameters of the proposed deep learning model, including the learning rate, batch size and momentum, were set to the default values in deepnet (which are also commonly used in the deep learning community) and were fixed for all RBPs. To train the RBM and replicated softmax models, we used the mean-field version of the contrastive divergence (CD) algorithm [9], with a mini-batch size of 10 and a learning rate of 0.01. All model parameters were initialized from a zero-mean Gaussian distribution with standard deviation 0.01. We used the momentum method to accelerate the learning process [10], with the initial and final momentum values set to 0.5 and 0.9, respectively. For better regularization, the weight decay technique was applied in the pretraining procedure by adding an extra penalty term to the log-likelihood function [10]; in particular, an L2 penalty term with coefficient 10^-4 was used. After the layer-wise pretraining, we stacked a logistic layer on top of the DBN for RBP binding site prediction. This top layer was also initialized from a zero-mean Gaussian distribution with standard deviation 0.01. In the error back-propagation training process, we fine-tuned the whole network except the bottom two replicated softmax layers with a learning rate of 0.05. In particular, in each fold of the cross-validation process, we first ran the error back-propagation algorithm for 200,000 epochs and then selected the model with the best prediction performance (AUROC) on the validation set. After that, we calculated the AUROC score on the separate test dataset, which was regarded as the final prediction performance in this cross-validation fold.
We also used the dropout technique [11], with a dropout probability of 0.2 for the input layer and 0.5 for the other layers.
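The pretraining settings above can be condensed into a single mean-field CD-1 weight update. This is a sketch, not deepnet's actual code: the layer sizes and data batch are illustrative, and bias updates, the momentum ramp (0.5 to 0.9) and dropout are omitted for brevity.

```python
import numpy as np

# One mean-field CD-1 update with the stated hyperparameters:
# learning rate 0.01, mini-batch size 10, momentum, L2 weight decay 1e-4,
# zero-mean Gaussian initialization with standard deviation 0.01.
rng = np.random.default_rng(0)
n_vis, n_hid, batch = 20, 10, 10
W = rng.normal(0.0, 0.01, (n_vis, n_hid))
vel = np.zeros_like(W)                   # momentum buffer
lr, momentum, weight_decay = 0.01, 0.5, 1e-4

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = (rng.random((batch, n_vis)) > 0.5).astype(float)  # toy mini-batch

# Mean-field CD-1: propagate probabilities instead of binary samples.
h0 = sig(v0 @ W)        # positive phase
v1 = sig(h0 @ W.T)      # mean-field reconstruction
h1 = sig(v1 @ W)        # negative phase

grad = (v0.T @ h0 - v1.T @ h1) / batch - weight_decay * W
vel = momentum * vel + lr * grad
W += vel
```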
To avoid confusion, here we further emphasize the subtle difference between the pretraining and fine-tuning (i.e., error back-propagation) processes in the standard deep learning framework. In the pretraining process, the whole network is regarded as a DBN, i.e., a hybrid (with both directed and undirected edges) generative model whose elementary modules are restricted Boltzmann machines (RBMs). In both the RBM and DBN models, all vertices in the visible and hidden layers represent binary random variables, which differs from a neural network. Equation 2 in the main paper defines the energy function of an RBM, in which the vertices only take binary values {0, 1}. In fact, the neural network perspective is only taken in the fine-tuning process (i.e., error back-propagation), with the network parameters initialized from the pretraining results. Here, the network is a (deterministic) feedforward neural network with the same structure as the aforementioned (probabilistic) generative DBN. However, as in conventional neural networks, all vertices are now treated as computational units that compute sigmoid functions and output real values in [0, 1].
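The fine-tuning view described above, in which every vertex becomes a deterministic sigmoid unit, amounts to a plain feedforward pass. The sketch below uses illustrative layer sizes and random weights, not the trained model:

```python
import numpy as np

# After pretraining, the DBN's weights are reused in a deterministic
# feedforward network: each unit outputs a real value in [0, 1] rather
# than a binary sample. Sizes and weights here are illustrative.
rng = np.random.default_rng(1)
sizes = [30, 20, 10, 1]   # input -> two hidden layers -> logistic output
weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x):
    for W, b in zip(weights, biases):
        x = sig(x @ W + b)    # deterministic sigmoid unit, not sampling
    return x

p = forward(rng.random(30))   # predicted binding probability in [0, 1]
```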
To show that our prediction model did not suffer from potential overfitting, we first note that the layer-wise unsupervised training strategy can alleviate the difficulty of learning a large number of parameters in the current deep learning framework, since the parameters of individual layers in the layer-wise training process depend on each other and a certain amount of training data may be sufficient to learn accurate parameters in each layer. In addition, it has been well accepted that several techniques and empirical rules developed in the deep learning literature [12] can overcome the overfitting problem and yield excellent prediction results in numerous applications [12,13,14]. The default hyperparameter settings in deepnet that we used in our deep learning framework, such as the coefficient of the L2 penalty term and the dropout probabilities, generally follow these commonly used rules. It is true that using different hyperparameter settings for distinct RBPs might produce better prediction performance, but, in practice, it is usually time-consuming and almost infeasible to determine these hyperparameters based on a separate validation set. In fact, the fixed hyperparameter set can also be viewed as part of the prediction model; in our parameter learning process, we only aimed to determine the weight and bias terms in each layer. The above arguments rationalize our hyperparameter settings from a theoretical point of view. To show that our training procedure was also feasible in practice, we further checked the error curves of both the training and test data, as shown in Fig. S3. If overfitting had occurred during training, we would have expected to observe an increasing error on the test data as the training error steadily decreased. However, such a phenomenon did not happen in our training process: as shown in Fig. S3, both the training and test errors decreased steadily, which implies that overfitting was not an issue in our framework.
S5 Supplementary details of RBP binding motif generation

Here, we also provide several implementation details of generating RBP binding motifs that are omitted in the main paper. During the top-layer RBM sampling, we first clamped the label vertex to one and conducted the mean-field based Gibbs sampling for 10,000 iterations. The same procedure was conducted when clamping the label to zero for computing ∆P. To sample the RNA words, we set the sampling threshold δ according to the average RNA word probability, i.e., δ = 2.5 × 10^-4 for the base sequence profiles and δ = 5 × 10^-4 for the secondary structural profiles. We sampled 10,000 RNA words and used WebLogo [15] to visualize both the RNA primary sequence and secondary structural motifs of PTB binding.

Figure S4: Comparison of the derived secondary structural motifs for PTB binding before and after inactivating the secondary and tertiary structural profiles for 20% of the training samples.

S6 The consistency between different types of derived binding motifs for PTB

Table S3 shows the consistency results for the top five 3D binding motifs of PTB derived by our method. The consistency between these RNA 3D motifs and the corresponding base sequence and secondary structural features can be confirmed by the following observations: (i) the fraction of C and U bases in the top five motifs was high (≥ 60%, see Table S3), which was in line with the fact that PTB has a CU-rich sequence binding motif [16] (see also Table 2 in the main paper); and (ii) these 3D motifs mainly occurred in hairpin loops, which was consistent with the PTB 2D binding motif (i.e., hairpin-loop-rich) derived by our model (see also Table 2 in the main paper).

S7 Results of RBP binding motif generation after partially inactivating structural profiles
To investigate how the encoded RNA structural profiles affect the extraction of structural features in our deep learning framework, we introduced a fraction of unstructured RNA sequences (i.e., bound and unbound sites devoid of structural features) into the training data, and then checked whether our approach could still capture the structural binding motifs. In particular, we randomly selected 20% of the positive (bound) and negative (unbound) samples from the PTB training dataset and inactivated their secondary and tertiary structural features by constructing their structural profiles randomly. The remaining 80% of the training data were kept unchanged. We then performed the same motif generation process as described in the main paper.
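The inactivation step can be sketched as follows; the array shapes and the use of uniform random values as "randomly constructed profiles" are illustrative assumptions, not the paper's exact profile encoding:

```python
import numpy as np

# Inactivate the structural features of a random 20% of training samples
# by overwriting their structural profiles with random profiles, while
# leaving the sequence profiles untouched. Shapes are illustrative.
rng = np.random.default_rng(0)
n_samples, n_struct_feats = 500, 40
struct_profiles = rng.random((n_samples, n_struct_feats))  # stand-in data

n_inactive = int(0.2 * n_samples)
inactive_idx = rng.choice(n_samples, size=n_inactive, replace=False)
struct_profiles[inactive_idx] = rng.random((n_inactive, n_struct_feats))
# the remaining 80% of samples keep their original structural profiles
```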
The test result showed that the secondary structural motif derived after inactivating the structural features for 20% of the training samples (see Fig. S4(b)) was less notable than the original one derived before inactivation (see Fig. S4(a)). However, the loops (e.g., multi-, internal and hairpin loops) still retained a certain degree of conservation in this motif, which was in line with known experimental findings [17]. On the other hand, following the same routine for identifying the tertiary binding motifs as described in the main paper, we found that the distribution difference between P(3D|bound) and P(3D|unbound) became almost negligible (< 0.05), which implies that our framework discovered few dominant 3D structural preferences for PTB binding in this case. In summary, after inactivating the input structural profiles for 20% of the training samples, our approach was no longer able to fully discover the structural preferences of PTB binding, but it still captured certain secondary structural binding features.

S8 Docking results of the predicted RNA 3D motifs to the RRM domains of PTB

Table S4 presents the docking efficiency of the top five predicted RNA 3D motifs to the RRM domains of PTB except RRM4. Table S5 shows the ratio of C and U bases among the nucleotides involved in the interactions with PTB.

S9 Supplementary details and results on the PTB binding site prediction in the IRES regions
To predict possible PTB binding sites in IRES, we scanned 40 available IRES regions with a window size of 26 (the average target size of PTB) and a step size of one, which yielded 15,664 scanned subsequences/windows. Each scanned subsequence was extended by 150 nucleotides in both directions (i.e., upstream and downstream) for RNA secondary structure prediction. These IRES subsequences were then fed into both GraphProt and our deep learning framework to predict new candidate PTB binding sites. In GraphProt, subsequences with SVM margins above zero (the default parameter setting in GraphProt) were treated as binding sites, while in our method, those with output probabilities above 0.6 (the best threshold determined by the 10-fold cross-validation procedure) were regarded as PTB targets.
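The scanning step can be sketched as follows; the placeholder sequence and the helper name `scan_windows` are illustrative, not part of the published pipeline:

```python
# Scan an IRES region with windows of length 26 (the average PTB target
# size) and step size one, extending each window by up to 150 nt on both
# sides for secondary structure prediction (clipped at the boundaries).
def scan_windows(seq, window=26, step=1, flank=150):
    """Yield (core window, extended context) pairs over seq."""
    for start in range(0, len(seq) - window + 1, step):
        core = seq[start:start + window]
        ext = seq[max(0, start - flank):min(len(seq), start + window + flank)]
        yield core, ext

ires = "AUGC" * 30                 # a 120-nt placeholder sequence
windows = list(scan_windows(ires))
# number of windows = len(ires) - 26 + 1 = 95
```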
In addition to the overall performance evaluation over all 40 IRES regions, we also checked the prediction results in the 13 IRES regions in which PTB binding has been validated experimentally in the literature, including APAF1 [19,20], BAG1 [21,22], BIP [23], the MYC family (C-MYC, L-MYC and N-MYC) [24,25], MNT [26], MTG8A [26], MYB [26], P27 [27], P53 [28,29], UNR [30,31] and P58 [32]. As shown in Fig. S5(a), compared to GraphProt, the DBN model with RNA 3D structural profiles predicted more binding sites for almost all 13 experimentally verified IRES regions, except BAG1 and L-MYC, for which none of the three approaches detected any PTB target. In particular, unlike GraphProt, which failed to identify any PTB binding site in APAF1, P53, MNT and P58, our framework succeeded in detecting PTB-IRES interactions that agreed with the known experimental studies for these IRES regions. In summary, for these 13 IRES regions, our method captured 84.6% of the experimentally verified PTB-IRES interactions, while GraphProt only detected 46.2% of them. Moreover, our method identified a number of novel candidate PTB binding targets in the other 27 IRES regions in which PTB binding has not been reported in the literature (to the best of our knowledge), as shown in Fig. S5(b). These newly predicted binding sites may shed light on the exploration of new interactions between PTB and IRES, and thus provide useful hints for studying the corresponding biological functions.
In the negative control test, we first randomly chose another set of 40 mRNA subsequences (one for each IRES) that did not overlap any of the 40 IRES segments. These 40 mRNA subsequences were also regarded as false or unbound examples to constitute the negative control dataset. In addition, we let the negative control dataset have the same length distribution as that of the IRES set by requiring the length of each unbound sequence to be equal to that of its IRES counterpart. Note that a similar strategy has been used in both GraphProt and our DBN model to generate the unbound examples in the cross-validation process.
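Drawing one length-matched, non-overlapping negative control segment per IRES can be sketched as rejection sampling; the coordinates and the function name `sample_negative` are illustrative:

```python
import random

# Draw a random mRNA subsequence of the same length as the IRES segment
# that does not overlap the IRES interval [ires_start, ires_end).
def sample_negative(mrna_len, ires_start, ires_end, rng):
    length = ires_end - ires_start
    while True:
        start = rng.randrange(0, mrna_len - length + 1)
        # reject candidates that overlap the IRES segment
        if start + length <= ires_start or start >= ires_end:
            return start, start + length

rng = random.Random(0)
neg = sample_negative(mrna_len=5000, ires_start=1000, ires_end=1400, rng=rng)
```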
In addition to APAF1, we conducted an additional case study on another IRES, UNR (upstream of N-ras), which is an internal initiation trans-acting factor that regulates mRNA stability and cap-independent translation [31]. UNR has been found to contain two PTB binding regions [31], denoted as the B1 and B2 sites, respectively (see Fig. S7(a)). As with APAF1, we divided the whole UNR segment into different regions, i.e., the B1, B2 and unbound regions. As shown in Fig. S7(b), although all three models detected a number of PTB-UNR interactions in both the B1 and B2 regions, our prediction model integrating RNA 3D structural information yielded much more consistent results in the experimentally verified PTB binding regions. The false positive ratios of the three models, i.e., GraphProt, DBN-3D and DBN+3D, were all quite small (0.020, 0.004 and 0.071, respectively), which implies that the predictions in UNR using these prediction models (including our deep learning model) did not suffer from a false positive issue.

Figure S6: The complete secondary structure of IRES APAF1, in which the PTB binding sites, including the P1 and P2 sites (shown in red), have been verified experimentally [20]. The figure was generated using RNAstructure [33].

Figure S7: (a) The sequence of UNR. Since there is no experimentally derived structure of UNR in the literature (to our knowledge), we only present its sequence. The number at the beginning of each row denotes the position of the first site with respect to the mRNA containing UNR. The B1 and B2 sites are the experimentally verified PTB binding sites [31], both marked in red and enclosed in round brackets and braces, respectively. (b) Prediction results of UNR. The first two x-axis labels represent the prediction results in the B1 and B2 regions, which contained the RNA subsequences in the scanning windows overlapping the B1 and B2 sites, respectively. The last x-axis label represents the prediction results in the unbound region, which contained the remaining sites excluded from the above regions. The fraction of hits was defined as the ratio of the number of predicted bound windows to the total number of scanning windows in the corresponding region. DBN+3D and DBN-3D represent the DBN models with and without RNA 3D structural profiles, respectively.