Individualized multi-omic pathway deviation scores using multiple factor analysis

Summary Malignant progression of normal tissue is typically driven by complex networks of somatic changes, including genetic mutations, copy number aberrations, epigenetic changes, and transcriptional reprogramming. To delineate aberrant multi-omic tumor features that correlate with clinical outcomes, we present a novel pathway-centric tool based on the multiple factor analysis framework called padma. Using a multi-omic consensus representation, padma quantifies and characterizes individualized pathway-specific multi-omic deviations and their underlying drivers, with respect to the sampled population. We demonstrate the utility of padma to correlate patient outcomes with complex genetic, epigenetic, and transcriptomic perturbations in clinically actionable pathways in breast and lung cancer.


Introduction
This presentation of the Multiple Factorial Analysis (MFA) is mainly borrowed from Abdi et al. (2013) and Abdi and Williams (2010). MFA applies to the case when several (K) sets of variables describe a same set of I individuals (observations). MFA is part of the family of methods based on principal components analysis (PCA) and makes use of the basic notions of inertia and distance.

Inertia and distance
To start, consider a single data table X = {xij}I,J. The inertia of xj, the j th column of X, is defined as the sum of its squared elements: When the column is centered, 2 corresponds to the variance of xj. The sum of all 2 across columns is denoted I and is called the inertia of the data table, or the total inertia.
The center of gravity of the rows, denoted g, is the vector of the means of each column of X. When X is column-centered (which is commonly the case), g is equal to the J-dimensional null vector. The Euclidean distance of the i th observation to g is equal to The sum of all 2 is equal to the total inertia I of the data table.

Principal components analysis
PCA computes new variables called principal components, which are obtained as linear combinations of the original variables, and is based on the concept of inertia. By design, the first principal component is required to have the largest possible inertia. The second component is then constructed to have the largest possible inertia, under the constraint of orthogonality with respect to the first component; the other components are computed in a similar manner. These new variables are called factor scores, which can be interpreted geometrically as the projections of the observations onto the principal components. The coefficients of the linear combinations used to compute the factor scores are called the loadings.
Specifically, X is factorized via a Singular Value Decomposition (SVD) as X = P  Q T , where P is an I × L matrix of normalized left singular vectors (with L being the rank of X ), Q the J × L matrix of normalized right singular vectors, and the L × L diagonal matrix of the L ordered singular values ([1], …,[l],..,[L]), such that [1]≥ …≥[l]≥…≥ [L]. For convenience, we will denote the largest singular value of X, [1], as . Matrices P and Q are orthonormal matrices (i.e., Q T Q = P T P = I). The SVD is closely related to the canonical decomposition of XX T , where P is the matrix of normalized eigenvectors of XX T , Q is the matrix of normalized eigenvectors of X T X, and the singular values are the square root of the eigenvalues of XX T .

Loadings and factor scores
As Q is the matrix of loadings, the matrix of factor scores F is equal to:

Deviation scores
We define the deviation score of an observation as the distance of its factor scores to the center of the cloud of points; that is, if we note fi,l the factor score of the i th individual for the l th component, the deviation score may be calculated as follows: Note that this is equivalent to the Euclidean distance defined above, assuming X has been column-centered.

Multiple factor analysis
We now consider a set of K tables describing the same set of I individuals (observations). We denote the merged dataset X = (X[1] ,…,X[k] ,…X[K]), where X[k] is the data matrix corresponding to the k th set of variables, possibly preprocessed (generally centered and normalized). MFA comprises two main steps: ].
The MFA is then computed as the simple PCA of X*, where the SVD of X* is given by X* = P**Q* T , with P* T P* = Q* T Q* = I.

Loadings
Because the matrix X* concatenates K tables, with each of them respectively comprising J [k] variables, the matrix Q* of the left singular vectors can be partitioned in the same way as X*. Specifically, Q* can be expressed as a column block matrix: * =

Factor scores
The factor scores of the observations represent a compromise (i.e. a common representation) for the set of the K matrices. They are obtained by: F = P**XA -1 Q.
Interestingly, this last equation can be rewritten as follows:

Deviation scores
As in a simple PCA, we define the deviation score of an observation as the distance of its factor scores to the center of the cloud of points; that is, if we denote fi,l the MFA factor score of the i th individual for the l th component:

Contribution of each table to the deviation scores
It may also be of interest to quantify the individual contributions of each data table to the overall deviation score; in particular, this can help identify which data tables have the greatest influence on an individual's deviation score. As above, we denote F the matrix of MFA factor scores and [ ] the matrix of partial factor scores corresponding to the k th table.
We first note that the overall deviation score 2 can also be written as follows: the contribution of the k th table to the deviation score for individual i is equal to . By construction, the contributions of the K tables to the overall deviation score sum to 0. In addition, per-table contributions can take on both negative and positive values according to the extent to which the table influences the deviation of the overall score from the origin (i.e., the global center of gravity across individuals); large positive values correspond to tables with a large influence on the overall deviation of an individual, while large negative values correspond to tables which tend to be most similar to the global average.

Contribution of arbitrary groupings of variables to the deviation scores
It is possible to directly quantify the contribution of individual omics (or in fact any arbitrary grouping of variables or any single variable) to the deviation score; in fact, a strategy similar to that for gene-levels contributions to the overall score can be used. Specifically, let the concatenated set of weighted standardized gene tables X* be partitioned into R groups, such The matrix Q can then be partitioned the same way, leading to the following: For notational simplicity above, the R groups are noted as being made up of contiguous variables in X*, but any arbitrary partition of variables (e.g., all measurements for each omic) can be used in a similar way by appropriately permuting the columns and rows of * and , respectively, to group the variables as needed. The matrix of MFA factor scores F can then be written as where [ ] denotes the matrix of partial factor scores of partition r such As the overall pathway deviation score for individual i can be rewritten as the individual contribution of the r th variable partition to the overall pathway deviation score for individual i is equal to .

Supplementary observations and variables
The results of the MFA can also be used to compute factor scores, loadings, and distances for new observations (rows) and variables (columns) that were not included in the original analysis. For simplicity, we focus on the use of the former in the following discussion, but analogous calculations can be performed for the latter. In particular, assuming that the supplementary rows are scaled in a manner comparable to the original rows of X, we can compute the factor scores fsup for a supplementary row, which is represented by the Jdimensional vector . The supplementary factor scores are then computed as = − , while the partial factor scores are obtained by :

Deviation scores
Similar to the case of observations contained in the original data table, the deviation score of a supplementary observation is the distance of its factor scores to the center of the cloud of points. That is, if fsupi,l represents the supplementary factor score of the i th individual for the l th component: In addition, the contribution of each data table to this overall deviation score for a supplementary individual can be calculated in a manner analogous to that described above.

TCGA data acquisition and pre-processing
Briefly, using TCGA2STAT (Wan et al., 2016), we downloaded processed TCGA Level 3 data from the Broad Institute Genome Data Analysis Center (GDAC) Firehose on March 18, 2017 for individuals of self-reported European ancestry for whom gene expression, methylation, copy number alterations (CNA), microRNA (miRNA) abundance, and somatic mutation data were all available; this ancestry filter was applied to minimize population-specific variance and focus on the group with the largest available sample size. In addition, two individuals from the BRCA dataset (TCGA-E9-A245, TCGA-BH-A1ES) were identified as outliers with consistently extreme deviation scores across multiple pathways and were removed from the remainder of the analyses; the final sample sizes were thus n=504 and n=144 individuals for the BRCA and LUAD datasets, respectively.
Per-gene normalized expression estimates were calculated using RSEM (Li et al., 2011). Methylation was quantified using the maximally variant probe from the Illumina Infinium Human Methylation450 BeadChip located within ± 1500bp of the transcription start site, and representative probe beta measures were transformed to the logit scale. Somatic CNAs were called by comparing Affymetrix 6.0 probe intensities from normal (i.e., non-cancer tissue) and cancer tissue, and genome segments were aggregated to gene-level measures by TCGA2STAT and CNTools. Individuals were classified as carriers or noncarriers of a nonsynonymous somatic mutation for each gene using TCGA2STAT. Normalized miRNA abundance was quantified as Reads per million microRNA mapped (RPMMM) values. RNAseq and miRNA-seq quantifications were TMM-normalized (Robinson and Oshlack, 2010), converted to counts per million (CPM), and log2-transformed. Only genes with available RNAseq expression measures were retained for the remainder of the analysis, corresponding to 20,501 and 19,971 genes for BRCA and LUAD, respectively. Finally, batch effects have been shown to have a strong impact on the analysis of high-throughput data in general (Leek et al., 2010) and for the TCGA data specifically (Akulenko et al., 2016). As specific sample plates have been shown to represent significant batch effects in previous analyses (https://bioinformatics.mdanderson.org/BatchEffectsViewer/), each processed omic (with the exception of somatic mutation data) was individually batch adjusted for each cancer to correct for plate-specific effects using removeBatchEffects in limma (Ritchie et al., 2015). Plots of the first two components from a transcriptome-wide and genome-wide single-omics PCA and multi-omics MFA for the batch-corrected data are included in Supplementary Figures 3 and 4.

Choice of curated pathway collection
We consider the pathways included in the MSigDB canonical pathways curated gene set catalog (Liberzon et al., 2011) which includes genes whose products are involved in metabolic and signaling pathways reported in curated public databases. We specifically use the "C2 curated gene sets" catalog from MSigDB v5.2 available at http://bioinf.wehi.edu.au/software/MSigDB as described in the limma Bioconductor package (Ritchie et al., 2015). We focus in particular on a collection of 1322 gene sets from public databases, including Biocarta, Pathway Interaction Database (Schaefer et al., 2009), Reactome (Fabregat et al., 2018), Sigma Aldrich, Signaling Gateway, Signal Transduction Knowledge Environment, and the Matrisome Project (Naba et al., 2012), the smallest and largest of which were respectively made up of 6 and 478 genes (median size 29 genes). For the subsequent padma analysis, we excluded gene sets for which fewer than 3 genes mapped to quantified features in the TCGA gene expression data, corresponding to a total of 1136 gene sets.

Details of simulation study
It is of particular interest to investigate the conditions under which padma can correctly identify outlier individuals and the corresponding genes driving their aberrant multi-omic profiles, as well as how it compares to other alternatives. To evaluate the performance of padma in a variety of scenarios, we conducted a simulation study as follows.
Assuming a pathway of fixed size with K = 29 genes (the median pathway size of the MSigDB collection) and multi-omic data made up of three assays, we simulated 29 three-dimensional tables for n individuals, X = (X[1] ,…, X[k] , …X[29]), using a trivariate Gaussian distribution: For each gene table, the pairwise correlation among omics was drawn from a Uniform(0.2, 0.8) distribution. To create a given percentage paberrant of aberrant individuals in the population, we selected a number of driver genes |dgenes| as well as a number of driver omics per driver gene |domics|. Let i represent one of the n × paberrant individuals selected to have an aberrant multi-omic profile, dgenes,i the set of randomly selected genes for i, and for each of these driver genes g, domics,i,g the set of randomly selected omics for i. To create aberrant multiomic profiles for a selected individual i and gene g, the simulated value for [ ] [ , ] (i.e., the i th row of [ ] ) was replaced by a draw from a trivariate Gaussian distribution with covariance matrix as before and mean , = ( 1 , 2 , 3 ), where ~Uniform((−6, −4) ∪ (4,6)) if d ∈ domics,i,g and 0 otherwise for d = 1, 2, 3.
We compared the performance of the multi-omic padma pathway deviation scores to singleomic padma pathway deviation scores (using the first assay as an arbitrary choice) as well analogous pathway deviation scores calculated using a standard PCA (i.e., where MFA pertable weights were not applied). As a measure of the ability to successfully identify the known aberrant individuals, we calculated the area under the receiver operating characteristic curve (AUC) using multi-omic padma, single-omic padma, and PCA-based pathway deviation scores. In addition, for multi-omic padma we also provide AUC values for detecting the true gene drivers using the per-gene contributions to the individualized pathway deviation scores, averaged over the n × paberrant aberrant individuals; this helps to highlight to what extent the true gene drivers for each aberrant individual are successfully recovered by padma.
Results are shown in Supplementary Figures 12-14 and Supplementary Table 6. Overall, we note that when the number of driver genes and/or driver omics increases, it becomes increasingly easy to detect aberrant individuals, regardless of the method. In addition, increasing to paberrant = 5% yields slightly lower AUCs than 1% outlier individuals, although AUC values for padma and PCA remain high. Within a given combination of paberrant, |dgenes| and |domics|, increasing sample sizes lead to slight increases in AUC, and less variable performance across simulations.
Across simulation settings, padma and PCA perform very similarly (Supplementary Figures  12-13) for identifying aberrant individuals, particularly for large values of n, |dgenes|, and |domics|. However, we do observe a slight advantage for padma over the PCA-based approach is observed in settings where |dgenes| = 2 for smaller sample sizes (i.e., n ≤ 50). In the most challenging setting, where |dgenes| = 2 and |domics| = 1, padma has slightly higher median and mean AUC values across the 200 simulated datasets for nearly every combination of n and paberrant (Supplementary Table 6). In nearly all settings, padma single-omics performs the worst, which is unsurprising as only a subset of the data are used and pertinent assays for driver genes may have been omitted. Finally, we remark that multi-omic padma has the added advantage of providing a straightforward quantification of the importance of each gene in driving aberrant profiles. The average AUC (across the n × paberrant individuals) for detecting gene drivers shows that driver genes are generally successfully recovered across simulation settings, particularly when multiple omics are perturbed (Supplementary Figure 14).
Taken together, this simulation study illustrates the satisfactory performance of padma across a wide range of settings, as well as a slight advantage in identifying aberrant individuals for padma compared to a PCA-based alternative, particularly in cases with smaller sample sizes (i.e., n ≤ 50) and fewer driver genes and omics. We also confirmed that the per-gene contributions to the individualized padma pathway deviation scores successfully recover the true gene drivers in all scenarios considered here.

Supplementary Figures
Supplementary Figure 1 identifying aberrant individuals in populations of varying sampling size n using individualized pathway deviation scores calculated using multi-omic padma (green), single-omics padma (purple), and a PCA of concatenated data (orange). Each panel represents a different combination of percentage of outliers in the population (paberrant = 1% or 5%) and number of driver genes per aberrant individual (|dgenes| = 2, 3 or 4). Results are shown for |domics| = 1 (i.e. a single driver omic per driver gene for each aberrant individual) across 200 simulated datasets.
identifying aberrant individuals in populations of varying sampling size n using individualized pathway deviation scores calculated using multi-omic padma (green), single-omics padma (purple), and a PCA of concatenated data (orange). Each panel represents a different combination of percentage of outliers in the population (paberrant = 1% or 5%) and number of driver genes per aberrant individual (|dgenes| = 2, 3 or 4). Results are shown for |domics| = 2 (i.e. two driver omics per driver gene for each aberrant individual) across 200 simulated datasets.
identifying the driver genes (averaged across the paberrant × n aberrant individuals in each simulated dataset) using the per-gene contribution to the multi-omic padma individual pathway deviation scores in populations of varying sampling size n). Each panel represents a different combination of percentage of outliers in the population (paberrant = 1% or 5%) and number of driver genes per aberrant individual (|dgenes| = 2, 3 or 4). Results for |domics| = 1 and |domics| = 2 (i.e., one or two driver omics per driver gene for each aberrant individual) are shown in green and purple, respectively, across 200 simulated datasets.