TOSCCA: a framework for interpretation and testing of sparse canonical correlations

Abstract

Summary: In clinical and biomedical research, multiple high-dimensional datasets are nowadays routinely collected from omics and imaging devices. Multivariate methods, such as Canonical Correlation Analysis (CCA), integrate two (or more) datasets to discover and understand underlying biological mechanisms. For an explorative method like CCA, interpretation is key. We present a sparse CCA method based on soft-thresholding that produces near-orthogonal components, allows for browsing over various sparsity levels, and supports permutation-based hypothesis testing. Our soft-thresholding approach avoids tuning of a penalty parameter; such tuning is computationally burdensome and may render unintelligible results. In addition, unlike alternative approaches, our method is less dependent on the initialisation. We examined the performance of our approach with simulations and illustrated its use on real cancer genomics data from drug sensitivity screens. Moreover, we compared its performance to Penalised Matrix Analysis (PMA), a popular sparse CCA alternative with a focus on yielding interpretable results. Compared to PMA, our method offers improved interpretability of the results, while not compromising, or even improving, signal discovery.

Availability and implementation: The software and simulation framework are available at https://github.com/nuria-sv/toscca.


Introduction
New technologies in clinical and biomedical research have facilitated the collection of high-throughput omics data using DNA sequencing, RNA microarrays, or mass spectrometry. These methods typically result in hundreds or thousands of variables per patient, yet sample sizes remain low. As a result, the number of variables largely exceeds the number of observations. Genomics studies concerned with finding common structures between multiple pheno- or genotypical measures call for statistical models capable of dealing with high dimensions.
Integrative approaches such as Canonical Correlation Analysis (CCA) [5] can contribute to improvements in diagnostics and understanding of biological mechanisms by exploring connecting attributes between datasets. This includes genomics as well as neurological data, which has recently been at the centre of many CCA applications linking brain connectivity data to genetic, demographic, behavioural or thought patterns [1,13]. CCA is a multivariate method in high-dimensional analysis for exploring underlying signals relating two (or more) datasets through pairs of weight vectors. It is an immediate extension of PCA to more than one dataset, and a scale-invariant adaptation of PLS. CCA searches for linear combinations of the data, called latent variables, that are maximally correlated with each other. The original variables are summarised into these lower-dimensional variables, and this process may be repeated to render multiple latent variables. Methods combining dimension reduction and correlation maximisation have competitive accuracy in predicting complex traits compared to other conventional machine learning methods [10], which makes CCA an interesting tool for high-dimensional analysis.
For such methods, however, computation times are prone to be high and results may be difficult to extrapolate, generalise or test. Furthermore, interpreting the estimated weights is far from trivial in high-dimensional problems. This problem persists in CCA applications, as there may be many possible linear combinations of variables maximising correlations. Hence, the probability of selecting highly correlated noise increases, and so does the in-sample canonical correlation estimate. In these scenarios, the presence of redundant variables is dealt with through sparsity constraints involving regularisation, variable selection or a combination of both.
Recent methods for CCA search for complex associations, i.e. nonlinear or supervised ones. However, these are generally concerned with prediction and lack interpretability, or assume some type of structure or classification. As we are concerned with finding interpretable results from an exploratory analysis, we focus on penalised alternatives which render sparse weights for both datasets.
Sparse extensions to CCA include lasso or elastic net penalisation methods for parameter shrinkage and variable selection [12,8,15]. The sparsity level is chosen through cross-validation of the penalty parameters. Such techniques are known to be unstable, both in the estimation of the penalty parameters [7] and in that of the coefficients [17]. Not only does this affect the interpretability of the results, it also means that the permutation testing framework is ill-behaved.
We address these concerns by imposing a threshold on the support of the canonical vectors using soft-thresholding, rather than using a penalty parameter. Hence, we introduce sparsity into the canonical vectors by stating the number of nonzero weights, keeping the number of selected variables equal across permutations and promoting interpretable results via direct control over the number of selected variables. In addition, to achieve Type I error control, we show that using the out-of-sample correlation, instead of the in-sample correlation, accounts for spurious associations. We propose a fast estimation scheme based on the NIPALS [16] algorithm, essential for efficient testing in high-dimensional CCA.
With simulations, we evaluated signal recovery and Type I error control and compared the performance of our approach to the popular Penalised Matrix Analysis (PMA) method [15]. Moreover, we applied our algorithm to real data on gene expression and drug sensitivity measures to study the performance of our sparse CCA. We found that, compared to PMA, fixing the number of nonzeros improved the stability of the shape and size of the canonical weights for increasingly large matrices.

Methods
Suppose we have two data matrices X_1 ∈ R^{n×p} and X_2 ∈ R^{n×q} containing respectively p and q variables measured on the same n samples, from which we want to extract a sequence of K pairs of canonical vectors {(α_1, β_1), ..., (α_K, β_K)}, where K ≤ min(p, q). The k-th pair of canonical variables is given by γ_k = X_1 α_k and ζ_k = X_2 β_k. The correlation between this pair of canonical variables, referred to as the canonical correlation, is given by

ρ_k = cor(γ_k, ζ_k) = (α_k^T X_1^T X_2 β_k) / (‖X_1 α_k‖ ‖X_2 β_k‖).    (1)

The goal of CCA is to choose the weights A = (α_1, ..., α_k, ..., α_K) and B = (β_1, ..., β_k, ..., β_K) such that the correlation between all pairs of canonical variables is maximised, under the restriction that the columns in the sets (γ_1, ..., γ_K) and (ζ_1, ..., ζ_K) are orthogonal. Generally, the weights are estimated such that the first pair of canonical vectors has the highest canonical correlation, and with each succeeding pair the canonical correlation decreases.
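As a concrete illustration of the quantities above, the following NumPy sketch computes the canonical correlation for a given pair of weight vectors (the data and weights here are arbitrary toy values, not from the paper):

```python
import numpy as np

def canonical_correlation(X1, X2, alpha, beta):
    """Correlation between the latent variables gamma = X1 @ alpha and zeta = X2 @ beta."""
    gamma = X1 @ alpha
    zeta = X2 @ beta
    return np.corrcoef(gamma, zeta)[0, 1]

# Toy data: n = 50 samples, p = 4 and q = 3 variables, correlated by construction.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((50, 4))
X2 = X1[:, :3] + 0.1 * rng.standard_normal((50, 3))

# Weights picking out the first variable of each dataset.
rho = canonical_correlation(X1, X2, np.array([1.0, 0, 0, 0]), np.array([1.0, 0, 0]))
```

Maximising this quantity over `alpha` and `beta` is exactly the CCA objective in equation (1).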

Nonlinear Iterative Partial Least Squares and CCA
In this paper, we consider the CCA problem in a regression framework in which pairs of canonical vectors are sequentially estimated with an alternating regression procedure. This technique, known as Nonlinear Iterative Partial Least Squares (NIPALS) [16], starts by initialising one of the canonical vectors, α^(0), and computing β given α^(0). The estimation of the weights β is equivalent to a simple least squares problem. To obtain the first pair of canonical vectors, we use the equivalence between maximising equation (1) and the optimisation problem

min_{α,β} ‖X_1 α − X_2 β‖²  subject to  ‖X_1 α‖ = ‖X_2 β‖ = 1,    (2)

where the canonical variables are required to have unit norm so as to match the constraints in equation (1). In the alternating regression procedure, we initialise and fix the vector α^(0) and scale γ to have unit norm after step 2 in algorithm 1, i.e. γ^(0) = X_1 α^(0) / ‖X_1 α^(0)‖.
An estimate of β is then obtained as

β^(1) = argmin_β ‖γ^(0) − X_2 β‖².    (3)

Vice versa, we obtain ζ^(1) and α^(1) by fixing β^(1). The estimated vectors describe the strength of the linear association between the matrices. This process is repeated until convergence of some tolerance measure. Generalised to k > 1, we initialise α_k as α_k^(0) and then repeatedly fix and re-estimate new weights to obtain a sequence {α_k^(0), α_k^(1), ...} that is monotonically convergent [4] for each component. The first canonical vector of X_1, α_k^(0), can be initialised randomly, with uniform weights, or with some type of matrix decomposition.
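The alternating procedure can be sketched as follows: a minimal, dense (unpenalised) NumPy implementation of the NIPALS-style loop described above, not the toscca code itself:

```python
import numpy as np

def nipals_cca(X1, X2, n_iter=100, tol=1e-8, seed=0):
    """One pair of canonical vectors by alternating least squares.

    Dense, unpenalised sketch: each half-step regresses the current
    unit-norm latent variable on the other dataset."""
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(size=X1.shape[1])      # random uniform initialisation
    gamma = X1 @ alpha
    gamma /= np.linalg.norm(gamma)             # unit-norm latent variable
    for _ in range(n_iter):
        beta = np.linalg.lstsq(X2, gamma, rcond=None)[0]   # regress gamma on X2
        zeta = X2 @ beta
        zeta /= np.linalg.norm(zeta)
        alpha = np.linalg.lstsq(X1, zeta, rcond=None)[0]   # regress zeta on X1
        gamma_new = X1 @ alpha
        gamma_new /= np.linalg.norm(gamma_new)
        if np.linalg.norm(gamma_new - gamma) < tol:        # convergence check
            gamma = gamma_new
            break
        gamma = gamma_new
    return alpha, beta
```

Each half-step is an ordinary least squares fit, which is what makes the soft-thresholding modification described below easy to slot in.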
Many penalised alternatives based on the lasso [8,15] or elastic net [12] have been proposed to deal with high-dimensional CCA. Both approaches impose sparsity in the weights α_k and β_k by penalising the model in equation (3). To obtain a certain sparsity level, it is therefore necessary to search for the corresponding penalty parameter. As an alternative, we propose to introduce a soft-thresholding penalty in the regression formula (3). This penalisation gives us direct control over the number of nonzero weights in both α_k and β_k. Not only does this improve the interpretation of the results, it also speeds up the NIPALS algorithm, since we do not have to search for the penalty that corresponds to a particular number of nonzero weights. Additionally, this allows us to use permutations for hypothesis testing.

Soft-thresholding the canonical vectors
To add the soft-threshold penalty, we ignore the collinearity of our data by assuming that X_1^T X_1 = I_p, and similarly for X_2, simplifying the regression in equation (3) into steps 3 and 6 [2]. We calculate the optimal weight coefficients in algorithm 1 through a modified NIPALS with soft-thresholding (steps 4 and 7), based on threshold parameters p_α ∈ {1, 2, ..., p} and q_β ∈ {1, 2, ..., q}. We can show that the relationship between the in-sample canonical correlation and progressively larger p_α, given q_β, is nonconcave and increases for non-sparse solutions. Therefore, we use the out-of-sample canonical correlation, which does show a convex trajectory: the canonical correlation decreases as more irrelevant variables are included.
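A minimal sketch of the thresholding step, assuming the threshold is set at the magnitude of the (k+1)-th largest weight so that exactly k weights stay nonzero (tie-breaking and the exact rule used in toscca may differ):

```python
import numpy as np

def soft_threshold_topk(w, k):
    """Soft-threshold w at the (k+1)-th largest absolute value, so that the
    k largest-magnitude entries survive (shrunk towards zero) and the rest
    are set to zero. Illustrative sketch; ties would need extra care."""
    if k >= w.size:
        return w.copy()
    lam = np.sort(np.abs(w))[::-1][k]                      # threshold value
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)   # soft-thresholding

w = np.array([0.9, -0.2, 0.5, 0.05, -0.7])
w_sparse = soft_threshold_topk(w, 3)   # keep the 3 largest in magnitude
```

Stating the number of nonzeros directly, rather than a penalty λ, is what keeps the sparsity level identical across permutations later on.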
Through this algorithm, keeping one threshold fixed, smaller choices of (p_α, q_β) yield canonical vectors which are subsets of denser alternatives when there is a signal. That is, for threshold choices p_{α,1} ≤ ... ≤ p_{α,m} ≤ p and some fixed q_β, both in the simulation study and in the real application, supp(α(p_{α,i})) ⊆ supp(α(p_{α,j})) if i ≤ j. This property is useful for the interpretation of the results, as it shows selection stability of the larger contributors.

Estimating multiple canonical variates
In high-dimensional settings, finding the true canonical weights linking both datasets is particularly difficult as, in practice, there are many competitive combinations rendering comparable canonical correlations. Furthermore, we wish to balance the percentage of explained variance against the number of selected variables, ruling out tuning of a penalty parameter. Instead, we choose soft-thresholding, which allows the user to search over and compare results from multiple specific sparsity levels at the same time.
From a computational perspective, the NIPALS algorithm allows the simultaneous and efficient estimation of the canonical vectors for different combinations of thresholds by defining vector pairs (p_α, q_β). The canonical vectors for each component then become matrices (A_k, B_k), in which each column represents a (p_{α,i}, q_{β,i}) pairing. In a single run of the NIPALS algorithm, it is possible to estimate the canonical vectors for several sparsity levels. That is, steps 3 and 6 in algorithm 1 become B = X_2^T Γ and A = X_1^T Z, where Γ and Z are matrices of latent variables from canonical weights with different sparsity levels.
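The following sketch illustrates how one matrix product can serve several sparsity levels at once; the helper and the `levels` values are illustrative, not the toscca implementation:

```python
import numpy as np

def soft_threshold_topk(w, k):
    """Keep the k largest-magnitude entries of w, soft-thresholded at the
    (k+1)-th largest absolute value (illustrative helper)."""
    if k >= w.size:
        return w.copy()
    lam = np.sort(np.abs(w))[::-1][k]
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

rng = np.random.default_rng(4)
X1 = rng.standard_normal((30, 8))
X2 = rng.standard_normal((30, 6))
levels = [2, 4, 6]                    # hypothetical q_beta values to compare

# One latent variable, replicated as columns of Gamma (one per sparsity level),
# so a single product B = X2^T Gamma updates every candidate at once.
alpha = rng.uniform(size=8)
gamma = X1 @ alpha / np.linalg.norm(X1 @ alpha)
Gamma = np.tile(gamma[:, None], (1, len(levels)))
B = X2.T @ Gamma
B = np.column_stack([soft_threshold_topk(B[:, i], k) for i, k in enumerate(levels)])
```

Because every column is thresholded from the same dense vector, the supports at smaller levels nest inside those at larger levels, matching the subset property described in the previous section.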
We calculate the canonical weights and correlations for later components (k ≥ 2) by deflating the data to account for the variance explained by previously estimated latent variables. We deflate the matrices as

X_1^(k+1) = X_1^(k) − γ_k (γ_k^T γ_k)^{-1} γ_k^T X_1^(k),    (4)

where X_1^(1) = X_1. Matrix X_2 is deflated following the same scheme. In the original unpenalised version of NIPALS, this deflation would make the latent variables of different components orthogonal. However, it is well known that introducing sparsity compromises this property [6], as is the case with other sparse CCA methods. Consequently, the standard measure of cumulative percentage of explained variance (CPEV)¹, which is used to determine the number of components [11], may be inaccurate due to the presence of repeated information. That is, as orthogonality of the canonical vectors can no longer be guaranteed, new components do not necessarily contain new information, and hence latent variables may be correlated. We therefore propose an alternative measure, the adjusted CPEV of equation (5), which discounts the correlation between latent variables of different components, for k = 2, ..., K.
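The deflation step amounts to regressing every column of the data matrix on the estimated latent variable and keeping the residuals; a NumPy sketch of this projection (the package may differ in detail):

```python
import numpy as np

def deflate(X, gamma):
    """Remove from X the variance explained by the latent variable gamma:
    project each column of X onto gamma and subtract the projection."""
    gamma = gamma.reshape(-1, 1)
    proj = gamma @ (gamma.T @ X) / float(gamma.T @ gamma)
    return X - proj

rng = np.random.default_rng(2)
X1 = rng.standard_normal((20, 6))
gamma = X1 @ rng.uniform(size=6)       # a latent variable built from X1
X1_defl = deflate(X1, gamma)
# After deflation, gamma is orthogonal to every column of the deflated matrix.
```

For the unpenalised algorithm this makes successive latent variables orthogonal; with soft-thresholding that guarantee is lost, which is why the adjusted CPEV is needed.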

Permutation testing
To assess the estimated correlations, we test the null hypothesis of no correlation between the datasets, and between their deflated counterparts for subsequent components, via permutation testing. We permute one of the datasets and re-estimate the canonical correlation to approximate the distribution of the correlations under the null. Since the canonical correlation estimate is affected by the number of nonzero weights, using the same number of nonzeros as in the original analysis makes the permuted correlations comparable among themselves and with the original estimate.
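The permutation scheme can be sketched as follows. Here `fit_cca` is a hypothetical stand-in for the sparse CCA fit with a fixed number of nonzeros, and the toy statistic below (largest absolute column cross-correlation) merely illustrates the mechanics:

```python
import numpy as np

def permutation_pvalue(X1, X2, fit_cca, n_perm=200, seed=0):
    """Permutation test for the first canonical correlation: permute the rows
    of X2 to break the link between datasets, re-estimate with the same
    sparsity (inside fit_cca), and compare against the observed estimate."""
    rng = np.random.default_rng(seed)
    observed = fit_cca(X1, X2)
    null = np.array([fit_cca(X1, X2[rng.permutation(X2.shape[0])])
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

def toy_fit(X1, X2):
    """Toy statistic standing in for a sparse CCA fit: the largest absolute
    correlation between a column of X1 and a column of X2."""
    X1c = (X1 - X1.mean(0)) / X1.std(0)
    X2c = (X2 - X2.mean(0)) / X2.std(0)
    return np.max(np.abs(X1c.T @ X2c)) / X1.shape[0]

rng = np.random.default_rng(3)
X1 = rng.standard_normal((60, 5))
X2 = np.column_stack([X1[:, 0] + 0.2 * rng.standard_normal(60),
                      rng.standard_normal((60, 3))])   # one truly linked column
p_value = permutation_pvalue(X1, X2, toy_fit, n_perm=99)
```

Because the statistic is computed with the same sparsity level on every permutation, the null draws are directly comparable to the observed estimate.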
Standard penalties, such as the lasso or the elastic net, optimise the combination of weights and variable selection to match the corresponding dataset. This yields null distributions which are contingent on sparsity levels and may thus lead to an incorrect assessment of the estimated canonical correlations, as the same penalty across permutations may not return the same sparsity level².
The canonical correlation estimate is non-decreasing as a function of the number of variables selected. Controlling the number of selected variables avoids catering the dimension reduction to the idiosyncrasies of the data. We argue that setting a more direct penalty on variable selection, together with an appropriate residualisation scheme, has several advantages: mainly, improvements in subsequent signal detection, interpretability, and assessment of the relationships found. This scheme returns appropriate Type I error rates.
Multi-component canonical correlation analysis requires a test for each component. Owing to the high dimensionality of the data, we expect the gaps between quantiles of the null distributions to be small; under the null, the estimated correlations will have similar values. We have empirical evidence supporting this statement: the permutation distributions of the different correlation estimates look very similar. Hence a simple Bonferroni correction suffices to manage multiple testing concerns, and we address them by using the statistic for the largest correlation as the threshold for the rest.

Simulations
We simulated three true components of different sizes for data with n = 100, p = 2500 and q = 500. We analysed the simulated data using the approach from section 2, from now on referred to as TOSCCA (Thresholded Ordered Sparse CCA), and compared its performance to an existing method, PMA, a popular sparse CCA method used in the study of high-dimensional data that aims to improve interpretability. We examined each model's accuracy (Figure 1), selection stability, convergence and adjusted CPEV from equation (5).

¹ CPEV is defined as tr(γ_{1:k}^T γ_{1:k}) / tr(X_1^T X_1), and equally for ζ and X_2.
We fixed the sparsity level for each method at p_α = q_β = 100 variables for all components, and found the best penalty for PMA using its built-in function. Figure 1 displays the true signals (top) and the canonical vectors estimated by TOSCCA (centre) and PMA (bottom). For the sake of comparability, both methods had initial values drawn from a random uniform distribution. This was a minor modification to the PMA algorithm, as it was originally designed to be initialised with values from an eigendecomposition, rendering canonical vectors which do not differ much from their initialisation. That is, the canonical vectors are effectively predetermined from the start. Since convergence irrespective of the initialisation is a desirable property for iterative algorithms, we compared the two approaches using the same random initialisation. We observed that TOSCCA consistently selects the corresponding variables for each component; when the signal involved fewer than {p_α, q_β} variables, the remaining weights were set close to zero. Moreover, there was no overlap between signals and the canonical weights were appropriately paired. PMA, on the other hand, selected the larger signal for both components.
We checked TOSCCA's selection stability by running the algorithm on 8 different subsamples, one for each choice of p_α, while keeping q_β fixed. The signal was distinctly identified regardless of the number of nonzero variables, and we observed the selection stability described in the previous section. The adjusted CPEV increased for the first three components, where the signal was located, and plateaued at the fourth component. The auto-correlation between canonical vectors was effectively zero, reducing equation (5) to the original formula. We assessed the validity of the estimated canonical correlations for each component through permutation testing. We found the first three canonical correlations to be statistically different from those found in the permuted data. The fourth estimated correlation was correctly found to be not significant.

GDSC data
We applied TOSCCA to data on the Genomics of Drug Sensitivity in Cancer [3] from the GDSC project [18], which aims at identifying molecular markers of drug response. The data comprise drug sensitivity measures for cell lines and their corresponding genomic profiles (gene expression, methylation profiles, mutations and copy numbers) from the Catalogue of Somatic Mutations in Cancer database. We were interested in quantifying the associations between gene expression and drug sensitivity (IC50) to explore how combinations of genes may affect drug effectiveness. The data comprise 737 samples with 49,386 gene expression measurements and 320 IC50 values. We fixed the sparsity of the estimated canonical vectors at p_α = 100 gene expression variables and q_β = 20 IC50 values. These numbers were chosen to simplify the analysis, limiting the number of variables to be interpreted to what we believe is feasible, and to illustrate how fixing sparsity may improve results in exploratory analysis. We ran the same analysis using PMA, with its proposed cross-validation method to find the optimal penalty parameters. These penalties rendered 800 gene expression variables and 7 drugs. Finally, we repeated the analysis with TOSCCA matching PMA's optimal sparsity level (p_α = 800, q_β = 7).
We observed that PMA achieved a greater correlation across components, with the exception of the first component (k = 1) (Figure 2a). However, as previously argued, correlation alone is a weak indicator of links between high-dimensional datasets. The CPEV values for the first four components (Figure 2b) show that TOSCCA with the default configuration (TOSCCA 100) generally outperformed all other alternatives. Upon inspection, PMA's correlation and CPEV values for subsequent components were attributed to PMA selecting virtually the same variables across components, thus replicating the first, usually the highest, correlation estimate. This is in line with what we observed from PMA in section 3, as it is prone to compute very similar components (Figure 1). Figure 2c shows the correlation values between the subsequent components and the first one. These suggest that the variance added by subsequent components was unlikely to come from new information.
We used the adjusted CPEV proposed in equation (5) to control for near-identical information in subsequent components. This adjusted CPEV is displayed in Figure 2d, where the values from Figure 2b are adjusted for high correlation between components as an indicator of repeated information. After this adjustment, TOSCCA consistently outperformed PMA across the board.
We then followed with permutation testing for the estimated correlations. We observed all four components to be far from the null distribution, and hence deemed them significant. Datasets of these dimensions and characteristics will most likely keep returning significant correlations for many components, as biological interactions go beyond the handful of components considered in this example. Nevertheless, we previously stated the advantages of keeping these links manageable in favour of interpretation.
Last, as CCA and PCA are closely related, we used the equivalent of a score plot to observe any potential similarities represented by the estimated latent variables. We grouped observations by the region assigned to each cell line, as displayed in Figure 3. We chose to focus on the blood, lung and skin regions, as these had the most observations. Studies show idiosyncrasies in drug resistance for cancers in organ systems, which create a micro-environment that impacts drug delivery outcomes [9], compared to blood cancers.
TOSCCA found two or three distinct groups that were strongly associated with the cell lines' region. PMA (right) showed the same linearity discussed above; consequently, both k = 2 and k = 3 look similar when plotted against k = 1. These plots appear to pick up the distinction between cancers of organ tissues and blood cancers across methods, as blood cell lines tend to remain further apart from skin and lung cell lines. Further analysis of the dynamics between gene expression and drug sensitivity measures is beyond the scope of this paper.

Discussion
We introduced TOSCCA, a method for exploratory analysis of high-dimensional data. We used the NIPALS algorithm, together with soft-thresholding to induce sparsity, for its efficiency with high-dimensional data. Our method introduces computational and interpretational attributes that ease the search for, and analysis of, associations when integrating such data. In our method, we fix the number of nonzero canonical weights, thereby promoting interpretable results and limiting the computational burden of estimation and, consequently, of permutation testing. This framework allows multiple sparsity levels to be computed simultaneously, which further facilitates the choice of sparsity. Moreover, TOSCCA shows selection stability across different choices of sparsity.
Understanding the contribution of a variable or set of variables in penalised high-dimensional analysis is complicated, as different results can easily yield very similar outcomes. This is particularly true of genomic data, where the correlation structure interferes with deriving inferences from the results. We argue that simplifying the search is more aligned with exploratory efforts on integrated high-dimensional data, and fixing sparsity levels achieves this goal. This approach focuses on variable selection and shows selection stability in both the simulations and the real data application.
Altogether, the above scheme supports reliable assessment of the estimated canonical correlations through permutation testing, as this dimension reduction strategy makes permutations comparable and hence draws an appropriate null distribution. TOSCCA shows improvements in signal discovery, especially for subsequent components, and in assessment, when compared to the PMA method which, as established in section 1, continues to be a popular method for exploring associations in genomics datasets.
The R package toscca is available on GitHub (https://github.com/nuria-sv/toscca), where we include the scripts to reproduce the analyses of the simulations and the real data.

² See Figure A in the supplementary material.

Figure 2: CCA result comparison between TOSCCA and PMA for initialisations from eigendecomposition, random uniform and random normal.