Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins

Motivation: RNA binding proteins (RBPs) play important roles in post-transcriptional control of gene expression, including splicing, transport, polyadenylation and RNA stability. To model protein–RNA interactions by considering all available sources of information, it is necessary to integrate the rapidly growing RBP experimental data with the latest genome annotation, gene function, RNA sequence and structure. Such integration is possible by matrix factorization, where current approaches have an undesired tendency to identify only a small number of the strongest patterns with overlapping features. Because protein–RNA interactions are orchestrated by multiple factors, methods that identify discriminative patterns of varying strengths are needed. Results: We have developed an integrative orthogonality-regularized nonnegative matrix factorization (iONMF) to integrate multiple data sources and discover non-overlapping, class-specific RNA binding patterns of varying strengths. The orthogonality constraint halves the effective size of the factor model and outperforms other NMF models in predicting RBP interaction sites on RNA. We have integrated the largest data compendium to date, which includes 31 CLIP experiments on 19 RBPs involved in splicing (such as hnRNPs, U2AF2, ELAVL1, TDP-43 and FUS) and processing of 3’UTR (Ago, IGF2BP). We show that the integration of multiple data sources improves the predictive accuracy of retrieval of RNA binding sites. In our study the key predictive factors of protein–RNA interactions were the position of RNA structure and sequence motifs, RBP co-binding and gene region type. We report on a number of protein-specific patterns, many of which are consistent with experimentally determined properties of RBPs. Availability and implementation: The iONMF implementation and example datasets are available at https://github.com/mstrazar/ionmf. 
Contact: tomaz.curk@fri.uni-lj.si Supplementary information: Supplementary data are available at Bioinformatics online.


Detailed information on analyzed RBP experiments
References and details about all experiments used in the study are listed. Experiments on the same proteins are sorted into groups A-Q. Factor models for each experiment do not consider biological or technical replicates in the same group, to eliminate bias. Depending on the experimental protocol used (PARCLIP, CLIPSEQ, iCLIP, HITSCLIP), we report the number of cross-linking clusters and the number of individual sites (measured as the sum of cluster lengths) for each experiment. We used information on clusters (column Clusters) if provided by the original study. In the case of experiments 18-20 and 22, we used information on individual crosslink sites (column CL sites). From the individual positions (clusters or individual crosslinks), we select up to 100,000 samples with the highest cDNA counts.

[Table: for each experiment — protein, cell type, protocol, replicate group, number of clusters, number of CL sites, reference and source file. Legible entries include Ago1-4 (PARCLIP, HEK293; Hafner et al., 2010; PARCLIP_AGO1234_hg19.bedGraph.gz), Ago2-MNase (Kishore et al., 2011; PARCLIP_AGO2MNASE_hg19.bedGraph.gz) and Ago2 (1)/(2) (HITSCLIP, HEK293; Boudreau et al., 2014; HITSCLIP_Ago2_binding_clusters.bedGraph.gz, HITSCLIP_Ago2_binding_clusters_2.bedGraph.gz).]

Details on the iONMF algorithm

Derivation of update rules
The general matrix factorization problem can be solved with different optimization approaches:
• Alternating least squares (Lee et al., 2001) has proven convergence properties but may be computationally inefficient.
• (Projected) gradient descent can be computationally efficient, but requires manually selecting the learning rate and explicit control to keep the solutions within the feasible region (Lin, 2007).
• (Quasi-)Newton methods provide a solution to the general problem of selecting the learning rate (Zdunek and Cichocki, 2006).
• Multiplicative update rules are an instance of gradient descent with a variable learning rate and are computationally efficient (subject to optimal ordering of the matrix multiplications). Moreover, the non-negativity constraint is satisfied by construction if the initial matrices are non-negative.
• There is a large body of work on probabilistic matrix factorization (PMF), see Salakhutdinov and Mnih (2008), which provides improved regression results, but the non-negativity and orthogonality constraints are difficult to include in the normal distribution on which most PMF methods are based.
Due to the implicit adherence to the non-negativity and orthogonality constraints, as well as computational efficiency, we propose an algorithm based on multiplicative update rules, presented in more detail below.
Non-convexity of the problem follows from observing that there exist equivalent solutions X ≈ (W U^T)(U H) = W H, where U is any unitary matrix of appropriate size. Multiple solutions thus exist that yield the same objective value; the problem is therefore non-convex and a unique minimum does not exist.
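The non-uniqueness argument can be made concrete in a few lines of NumPy: rotating the factors by any unitary (orthogonal) matrix U leaves the reconstruction, and hence the objective value, unchanged. Note that the rotated factors need not remain non-negative, so non-negativity removes some, but not all, of these equivalent solutions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 3))
H = rng.random((3, 8))

# Any unitary U (here a 3x3 rotation) gives an equivalent factor pair
# (W U^T, U H): the product, and hence the objective, is unchanged.
theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
W2, H2 = W @ U.T, U @ H

assert np.allclose(W @ H, W2 @ H2)  # identical reconstruction
```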
To learn a factor model with iONMF, we propose the corresponding optimization problem with respect to W and H_i for i = 1, ..., N. The regularization parameter α determines the trade-off between the approximation error and the orthogonality of the vectors in H_i. Because the optimization problem is non-convex in all W, H_i, a local minimum can be found by fixing all but one variable and applying multiplicative update rules. The cost function in Problem 2.1 can be rewritten and, following the standard theory of constrained multivariate optimization, we construct the Lagrangian. Fixing all H_i, we calculate the derivative of the Lagrangian with respect to W. Requiring the Karush-Kuhn-Tucker optimality conditions to hold at a stationary point yields a fixed-point equation, which can be solved by iteratively applying the corresponding update. Since we assume X_i, H_i, W are non-negative for all i = 1, ..., N, this is exactly the update rule given in Equation 2 (see main text).
Following a similar argument, the update rules for the coefficient matrices H_i can be derived. Fixing W and all H_j for j ≠ i, we compute the derivative of the Lagrangian with respect to H_i; requiring the Karush-Kuhn-Tucker optimality conditions to hold at a stationary point leads to the update rule for H_i given in Equation 3 (see main text).
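As an executable illustration of the alternating scheme, here is a minimal NumPy sketch. It assumes the hypothetical orthogonality penalty (α/2)·Σ_i ||H_i H_iᵀ − I||²_F, whose gradient α(H_i H_iᵀ H_i − H_i) splits into the numerator term α H_i and denominator term α H_i H_iᵀ H_i of the multiplicative update; the exact published regularizer and rules are given in Equations 2-3 of the main text and may differ from this sketch.

```python
import numpy as np

def ionmf_updates(Xs, rank, alpha=0.1, n_iter=200, seed=0):
    """Sketch of alternating multiplicative updates for a shared basis W
    and per-source coefficient matrices H_i (all non-negative).

    Assumption (hypothetical): orthogonality penalty
    (alpha/2) * sum_i ||H_i H_i^T - I||_F^2."""
    rng = np.random.default_rng(seed)
    m = Xs[0].shape[0]
    W = rng.random((m, rank)) + 1e-6
    Hs = [rng.random((rank, X.shape[1])) + 1e-6 for X in Xs]
    eps = 1e-12
    for _ in range(n_iter):
        # W update: aggregate numerator/denominator over all data sources
        num = sum(X @ H.T for X, H in zip(Xs, Hs))
        den = W @ sum(H @ H.T for H in Hs) + eps
        W *= num / den
        WtW = W.T @ W
        for i, X in enumerate(Xs):
            H = Hs[i]
            # orthogonality penalty contributes alpha*(H H^T H - H)
            num = W.T @ X + alpha * H
            den = WtW @ H + alpha * ((H @ H.T) @ H) + eps
            Hs[i] = H * (num / den)
    return W, Hs
```

With α = 0 this reduces to the standard multiplicative NMF updates, whose monotone convergence properties are discussed below.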

Equivalence to gradient descent
As noted by Lee et al. (2001), the multiplicative update rules are a special case of gradient descent with a particular choice of variable learning rate, which yields the update rules in Equation 2 and Equation 3, respectively (see main text). Hence, the same convergence properties of gradient descent apply.
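For the plain reconstruction term (omitting the orthogonality penalty), the step-size choice behind this equivalence can be written explicitly, following Lee and Seung (2001):

```latex
% Additive gradient step on H for F = \|X - WH\|_F^2:
H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu}\left[(W^T X)_{a\mu} - (W^T W H)_{a\mu}\right]
% Choosing the variable, element-wise learning rate
\eta_{a\mu} = \frac{H_{a\mu}}{(W^T W H)_{a\mu}}
% yields the multiplicative update
H_{a\mu} \leftarrow H_{a\mu}\,\frac{(W^T X)_{a\mu}}{(W^T W H)_{a\mu}}
```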

Comparison of iONMF, RNAContext and GraphProt
We have compared iONMF with two state-of-the-art approaches to predict putative CLIP interaction sites: GraphProt by Maticzka et al. (2014) and RNAContext by Kazan et al. (2010). We used the same sequences and evaluation as described in Maticzka et al. (2014), which we downloaded from the authors' website.
For reasons of computational complexity, we used the published results reported by Maticzka et al. (2014). Running RNAContext and GraphProt with the suggested parameters yielded very similar results (data not shown) to those originally reported by Maticzka et al. (2014).
Here, we report the exact values from Maticzka et al. (2014), Fig. 3, for GraphProt and RNAcontext. We evaluated iONMF using the same setup. The provided parameter fitting set contained 500 positive and 500 negative positions and was used to select the optimal subset of data sources. These same positions were originally used to fit the parameters of GraphProt and RNAContext. The optimal subset of data sources and hyperparameters for iONMF were selected via 3-fold cross-validation on the parameter fitting set. The final evaluation was performed on the provided validation set using 10-fold cross-validation.
The iONMF hyperparameter α was set to 0.1 for all experiments and the factorization rank was selected from the range [10, 50]. The half-window size used to generate the matrices X CLIP, X RG, X RNA and X KMER was set to 100 nucleotides.
Two separate comparisons were performed: 1. using data sources on sequence (X KMER ) and structure (X RNA ), which are the sources used by GraphProt and RNAContext. See Supplementary Table 2 for AUC scores and Supplementary Table 4 for precision-recall scores.
2. using all five data sources (X CLIP, X RG, X RNA, X KMER, X GO). See Supplementary Table 3 for AUC scores and Supplementary Table 5 for precision-recall scores.
When using sequence (X KMER) and structure (X RNA), iONMF performs best on 10 out of 24 data sets. Empirically, we have found that the largest effect on the test performance of the iONMF model is due to the selection of the subset of data sources. Inclusion of other data sources increases the number of times iONMF is ranked first to 13 (out of 24 data sets).
The results obtained with the AUC and precision-recall measures are very similar with respect to the critical distance diagrams. It is interesting to compare the differences in predictive performance in Supplementary Table 2 with the importance of sequence (X KMER) and structural information (X RNA) estimated in Supplementary Figure 5. Cases where RNAContext outperforms iONMF appear to be highly enriched in sequence-specific RBPs (ELAVL1, EWSR1, FUS, QKI, TAF15, TDP-43).
Cases where iONMF improves the most are less sequence dependent according to Supplementary Figure 5. The difference in predictive performance appears to stem from the way structural information is processed, as these cases show a higher dependence on X RNA (Ago2, iONMF AUC=0.876, next best GraphProt AUC=0.765; ALKBH5, iONMF AUC=0.805, GraphProt AUC=0.680). This underlines the flexibility of weighting the contribution of particular data sources associated with the target response, which happens implicitly within the iONMF model.
Inclusion of all data sources, reported in Supplementary Table 3, improves the performance on IGF2BP1-3, MOV10 and ZC3H7B. Not surprisingly, MOV10 and IGF2BP1-3 show less dependence on sequence and more on region types, as shown in Supplementary Table 5. In this setting, iONMF yields the best predictive performance in the majority of cases, 13 out of 24 data sets (average AUC=0.907, next best GraphProt avg. AUC=0.887); ranks across the methods are compared in the critical distance diagram in the Supplementary Figures.

The effect of parameter α on the average sparseness and the average angle between all pairs of basis vectors in H i is shown in Suppl. Figure 2. Note that, because all vectors are non-negative, the maximum angle is 90°, which represents an orthogonal model. As α increases (from left to right), both measures increase substantially, while a similar level of predictive performance (AUC) is retained.

Correlation among feature vectors (predictors)
To illustrate the advantage of orthogonal decomposition, we examine in more detail the differences in the feature vectors discovered by each matrix factorization method. Orthogonality is related to the phenomenon of multicollinearity, where multiple feature vectors in a model are highly correlated (Chatterjee and Hadi, 2015). Multicollinearity may lead to suboptimal prediction performance and can affect the magnitude of individual regression coefficients. Our framework can be seen as learning multiple regression models, where each row x j ∈ X is predicted by a (non-negative) linear combination of the feature vectors in H, given by the coefficients in row w j. We examine the models used to obtain the results in Table 1 (see main text) in more detail. For each of the methods iONMF, NMF and NMF-QNO, we select the model resulting from the parameters with the best performance on the test set. Recall that each model produces coefficient matrices H i, each with rank r = 10.
For each RBP experiment and model, we examine the feature vectors in each H i. Let ρ j be the maximal Pearson correlation of feature vector h j ∈ H with all other feature vectors in H. Supplementary Fig. 3 shows the average maximal correlation ρ̄, computed over the 10 feature vectors in a particular model.
We can observe that iONMF found the models with the lowest ρ̄ in 30 out of 31 experiments. The feature vectors are also on average the least correlated in models found by iONMF when averaged over all 31 experiments (shown by horizontal lines in Supplementary Fig. 3). These results support the claim that multicollinearity is alleviated and help explain the improved predictive performance measured by AUC (Table 1, see main text).
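The statistic ρ̄ is straightforward to compute; a minimal sketch, assuming the feature vectors are the rows of a coefficient matrix H:

```python
import numpy as np

def avg_max_correlation(H):
    """rho-bar: for each feature vector (row) h_j of H, take its maximal
    Pearson correlation with any other row, then average over rows."""
    C = np.corrcoef(H)             # pairwise Pearson correlations of rows
    np.fill_diagonal(C, -np.inf)   # exclude self-correlation
    return float(C.max(axis=1).mean())
```

Lower values indicate less redundant (closer to orthogonal) feature vectors.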

Model parameters
In this section, we present details on the parameter settings (factorization rank) for factor models learned on different subsets of data sources. The model learned on the complete set of data sources (CRTKG), with m = 50000 training samples, expected rank r c = 10 and n Y target columns, requires N c = (m + n Y + n) × r c = (50000 + 1 + 3030 + 101 + 505 + 25865 + 39560) × r c = 1190570 free parameters. This is the total number of entries in the matrices W, H Y and all H i of a particular subset, where the total number of features is n = Σ i n i. Note that the X CLIP matrix has up to 3030 columns; the exact number depends on the replicate group corresponding to the selected protein. To ensure a fair comparison among different combinations of data sources, we set the rank to r = ⌊N c / (m + n Y + n)⌋ for each combination of data sources separately. This ensures an approximately equal number of free parameters of the factor model (column N in Suppl. Table 6) that need to be fit.
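The rank-matching rule can be sketched in a few lines, using the values quoted above (m = 50000, n Y = 1, r c = 10; the full feature count n_full = 3030 + 101 + 505 + 25865 + 39560 = 69061). The function name and its exact N c (computed from r c rather than copied from the text) are illustrative assumptions:

```python
def matched_rank(n_sub, n_full=69061, m=50000, n_y=1, r_c=10):
    """Rank for a subset of data sources with n_sub feature columns,
    chosen so its factor model has approximately the same number of
    free parameters as the full model (n_full columns, rank r_c)."""
    n_c = (m + n_y + n_full) * r_c       # free parameters of the full model
    return n_c // (m + n_y + n_sub)      # r = floor(N_c / (m + n_Y + n))
```

By construction, the full set of sources recovers the reference rank, while smaller subsets are assigned proportionally larger ranks.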
Supplementary Table 6: Details on parameter settings for factor models on different subsets of data sources.

Prediction accuracy for data source subsets on individual RBPs
Each combination of RBP experiment and subset of data sources yields a factor model. In Suppl. Tables 7, 8 and 9 we compare all such models by the area under the ROC curve (AUC), obtained via prediction on the independent hold-out test set (see main text). For each protein, we highlight the best-scoring subset of data sources.

Importance of individual data sources
Assessing the importance of individual data sources in Suppl. Tables 7-9 is non-trivial, since the scores obtained by particular subsets are not independent. Nevertheless, we transformed each column of Suppl. Tables 7-9 into a 31 × 5 binary matrix B, whose rows correspond to the 31 subsets of the 5 data sources. For each data source, we calculated the Spearman correlation coefficient between its corresponding column in B and the column of AUC scores. We thus obtain a 5 × 1 vector of correlation coefficients for each protein, which we use to perform the hierarchical clustering in Suppl. Figure 5.
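A sketch of this importance score follows, with Spearman's ρ implemented directly (Pearson correlation of average ranks, which handles the many ties in a binary inclusion column); the function names are illustrative:

```python
import numpy as np

def _avg_ranks(x):
    """Average ranks with tie handling, as used by Spearman's rho."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    sx = x[order]
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sx[j + 1] == sx[i]:
            j += 1                      # extend the block of tied values
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx = _avg_ranks(np.asarray(x, float))
    ry = _avg_ranks(np.asarray(y, float))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def source_importance(B, auc):
    """Correlate each source's inclusion indicator (column of the binary
    subset matrix B) with the AUC score of each subset."""
    return np.array([spearman(B[:, k], auc) for k in range(B.shape[1])])
```

A source whose inclusion consistently raises the AUC receives a correlation near 1; an uninformative source receives a value near 0.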
We observe a clear influence of X KMER on sequence-dependent proteins (TIA1/TIAL1, PUM2, hnRNPs, U2AF2, FUS, etc.), while the influence is less pronounced for more region-type-dependent proteins, such as those displaying a bias towards introns and 3'UTRs (ELAVL, Uren et al. (2011)), or towards both exons and introns (SRSF1, Änkö et al. (2012)). On the other hand, RNA structure is the most informative data source for eIF4AIII, which binds unstructured RNA (Saulière et al. (2012)). Indeed, this can be observed in the plots of individual feature vectors (Suppl. Section 8.3), where the first component is associated with >70% of positive nucleotides and shows no distinctive structure.
Interestingly, all experiments performed using iCLIP show a strong RNA k-mer preference, which is attributed to the individual-nucleotide resolution of the protocol. Conversely, protocols with lower resolution, such as CLIPSEQ, are not correlated with RNA k-mers, but are rather modelled by coarser data sources such as region type and RNAfold-predicted structure.

7.6 Discovery of RNA motifs

RBP binding depends on both the positioning and the sequence content of motifs; these are encoded in X KMER, where each column represents the presence of a specific RNA k-mer at a specific offset from the cross-linked site. Because we scan for k-mers within a window of W nucleotides, this gives W · 4^k columns (in our case, k = 4 and W = 101). To alleviate the exponential increase in the number of columns with increasing k, we use a heuristic approach based on the learned factor models to extract complex motifs of arbitrary length from k-mer frequency and positional information, similar to Hutchins et al. (2008). Our approach differs in that it uses all other data sources as additional circumstantial evidence, besides k-mer frequency, to aid the identification of the sequence motifs and the positional distribution associated with protein binding. Upon discovery of the modules most associated with the positions of a specific selected RBP (Section 2.5 in main text), we take the positions belonging to the module as the positive set P. The vector in H KMER corresponding to the module is termed h. To improve specificity, we retain only the 5% highest elements of h by setting the other elements to zero. The vector h is then interpreted as an estimate of the probability distribution of k-mers within the sequences associated with the set P. The background probability distribution for nucleotide n is given by P exp. At each iteration, the motif M is scored by the log-probability that the observed distribution is generated by M.
We define the following function to score a motif M, represented as a positional count matrix, against the positions in the positive set P.

Supplementary Function 1 Estimation of complex motifs
Input: estimated motif count matrix M ∈ R 4×L, set P, expected nucleotide probability distribution P exp. Output: motif log odds relative to the expected distribution.

5: return log(p o ) − log(p e )

Consider Suppl. Function 1. The positions within the motif at index l are first normalized to obtain an estimate of the probabilities. The motif M is then scored against the expected distribution P exp to obtain p e. The observed probability distribution is defined as the maximum probability within the nucleotide regions at positions w = 1, ..., W − L + 1, averaged over all sequences s ∈ {A, C, T, G}^W corresponding to the positions in the positive set P. The returned score is the log odds ratio log(p o / p e). Until convergence, k-mers are sampled from the distribution given by h and M is updated at columns j:(j + k − 1) to give M'. At each iteration, M is updated at positions (j + 1) mod (L − k + 1) to ensure an equal update frequency for all positions within the motif. The update is accepted if the log-probability of the updated M' increases compared to the previous iteration. The procedure is summarized in Suppl. Algorithm 2.

4: Sample a k-mer k according to the distribution given by h.

5: Let M' equal M updated with k at positions (j + 1) mod (L − k + 1).

In all our experiments, the pseudocount C was set to 20 and the estimation converged within 1000 iterations. Using the described algorithm, we estimate longer, complex motifs. For each protein, the four modules yielding the motifs with the highest probability were chosen. K-means clustering was performed and the 50 motifs corresponding to the cluster centroids were retained. The heatmap displays the difference between the observed frequency (at cross-link sites) and the expected probability (at random positions); its columns and rows were reordered using Ward's hierarchical clustering. Each motif is displayed with the WebLogo software. (Right) Complex motif derived from the row vector using the procedure in Suppl. Section 7.6. The motif is similar to the known motif UGUANAUA (Hafner et al., 2010). The observed frequency of the derived motif in proximity to cross-link sites (blue) is greater than the expected probability at random positions within protein-coding genes (green).
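A hedged sketch of the scoring step in Suppl. Function 1 follows. It is one possible reading of the prose above (the function name, the per-sequence maximum and the column-wise normalization are assumptions, not the published implementation): columns of M are normalized with the pseudocount, p e is the motif probability under the background distribution, and p o is the best match probability per sequence, averaged over the positive set.

```python
import numpy as np

NTS = "ACGT"  # nucleotide order of the motif matrix rows

def motif_log_odds(M, seqs, p_exp, pseudocount=20.0):
    """Score a 4 x L motif count matrix M against sequences drawn
    around the positions in the positive set P."""
    M = np.asarray(M, float) + pseudocount          # pseudocount C
    probs = M / M.sum(axis=0, keepdims=True)        # normalize each column
    L = probs.shape[1]
    # expected motif probability under the background distribution
    p_e = float(np.prod([p_exp @ probs[:, l] for l in range(L)]))
    idx = {n: i for i, n in enumerate(NTS)}
    best = []
    for s in seqs:
        # maximum motif match probability over all offsets w
        scores = [np.prod([probs[idx[s[w + l]], l] for l in range(L)])
                  for w in range(len(s) - L + 1)]
        best.append(max(scores))
    p_o = float(np.mean(best))
    return np.log(p_o) - np.log(p_e)
```

A motif that matches the positive sequences better than the background yields a positive log odds; a motif indistinguishable from the background scores zero.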

[Table: predicted motif for each protein.]

Sequences of Alu elements bound and regulated by hnRNPC
We investigated the known binding sites of [16] hnRNPC and [17] hnRNP C (König et al., 2010; Zarnack et al., 2013). As supported by Fig. 4a (see main text) and Supplementary Fig. 13, the motifs GGCTGG, GCCCAG, CCTGCC and GCCGGG are associated with hnRNPC. These motifs commonly occur in antisense Alu elements next to the U-tract that directly interacts with hnRNPC (Fig. 16), which demonstrates that our algorithm can detect common neighbouring motifs, even if these are not part of the primary binding site. Consensus Alu element sequences in human were scanned for these motifs, and at least three of the motifs were identified in each consensus sequence (Supplementary Fig. 17).