Statistical significance of variables driving systematic variation in high-dimensional data

Motivation: There are a number of well-established methods such as principal component analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (PCs) (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting. Results: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify genes that are cell-cycle regulated with an accurate measure of statistical significance. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses. Availability and implementation: An R software package, called jackstraw, is available in CRAN. Contact: jstorey@princeton.edu


Supporting Information
Supplementary Figures

Introduction
Latent variable models play an important role in understanding variation in genomic data [1,2].They are particularly useful for characterizing systematic variation in genomic data whose variable representation is unobserved or imprecisely known (Fig. 1).Principal component analysis (PCA) has proven to be an especially informative method for capturing quantitative signatures of latent variables in genomic data, and it is in widespread use across a range of applications.For example, PCA has been successfully applied to uncover the systematic variation in gene expression [3][4][5], estimate structure in population genetics [1,6], and account for dependence in multiple hypothesis testing [2,7].Generally, principal components (PCs) can be thought of as estimates of unobserved manifestation of latent variables; they are constructed by aggregating variation across thousands or more genomic variables [8].What is missing from this highly successful system is a method to precisely identify which genomic variables are the statistically significant drivers of the PCs in genomic data, which in turn identifies the genomic variables associated with the unobserved latent variables.
In a typical application of PCA, all genomic variables (e.g.microarray probes, genetic loci, etc.) will have nonzero "loadings", meaning that they all make some contribution to the construction of PCs.In some cases, when many (or most) of these contributions are forcibly set to zero, similar PCs nevertheless emerge.Methods have been proposed to induce sparsity in the loadings 1 , for example, with a lasso penalized PCA or a Bayesian prior [10][11][12][13].Also, various formulations of statistical significance have been considered previously in the context of PCA.These have usually been focused on scenarios where the number of observations is substantially larger than the number of variables, significance is measured in terms of a completely unstructured data matrix where all variables are mutually independent, or the goal is to only determine the number of significant principal components [14][15][16][17][18][19][20][21].The problem we consider here differs from those scenarios.
Our goal is not a minimal representation of a PCA; we would like instead to develop a strategy that accurately identifies which genomic variables are truly associated with systematic variation of interest.This can be phrased in statistical terminology as developing a significance test for associations between genomic variables and a given set, subset, or linear combination of PCs estimated from genomic data.We introduce a new resampling approach, which we call the jackstraw, to rigorously identify the genomic variables associated with PCs of interest, as well as subsets and rotations of PCs of interest.Our approach is capable of obtaining the empirical null distribution of association statistics (e.g., F-statistics) and applying these to the observed association statistics between genomic features and PCs in order to obtain valid statistical significance measures.Succinctly, new PCs are computed from a dataset with a few independently permuted variables, which become tractable "synthetic" null variables.The association statistics between newly computed PCs and synthetic null variables serve as empirical null statistics, accounting for the measurement error and over-fitting of PCA.
As an application, we consider the problem of identifying genes whose expression is cell-cycle regulated.In this case, there are infinitely many theoretical curves that would represent "cell-cycle regulation" to the point where a standard statistical analysis involves an unwieldy "composite null hypothesis" [22].We identify the few realized patterns of cell-cycle regulated gene expression through PCA and we are able to directly test whether each gene is associated with these using the proposed approach.As another application, we analyzed observational gene expression profiles of blunt-force trauma patients [23], whose post-trauma inflammatory responses are difficult to be quantified using conventional means.When the clinical phenotype of interest can not be precisely measured and modeled, we may estimate it directly from genomic data itself.We identify genes driving systematic variation in gene expression of post-trauma patients and demonstrate that our analysis is biologically richer than the original analysis [23].
PCA has direct connections to Independent Component Analysis [9] and K-means clustering [24,25].Therefore, the methods we propose are likely applicable to those models as well.Furthermore, this approach has potential generalizations to a much broader class of clustering and latent variable methods that all seek to capture systematic variation.

Statistical Model and Approach
Consider a m×n row-centered expression data matrix Y with m observed variables measured over n observations (m n).Y may contain systematic variation across the variables from an arbitrarily complex function of latent variables z.We may calculate the expected influence of the latent variables on Y by E[Y|z], and then write Y = E[Y|z] + E, where E is defined as Y − E [Y|z].There exists a r × n matrix, called L(z), that is a row basis for E [Y|z], where r ≤ n [2,7].This low dimensional matrix L(z) can be thought of as the manifestation of the latent variables in the genomic data.As illustrated in Fig. 1, this conditional factor model is common for biomedical and genomic data [26].Since z is never directly observed or utilized in the model, we will abbreviate L(z) as L. This yields the model where B is a m × r matrix of unknown parameters of interest.The i th row of B, which we write as b i , quantifies the relationship between the latent variable basis L and genomic variable y i .This model ( 1) is schematized in Fig. S7.The principal components (PCs) of Y may be calculated by taking the singular value decomposition (SVD) of Y.This yields Y = UDV T where U is a m×n orthonormal matrix, D is a n × n diagonal matrix, and V is a n × n orthonormal matrix.PCs are then the rows of DV T , where the i th PC is found in the i th row of DV T .The columns of U are considered to be the loadings of their respective PCs.
Suppose that the row-space of L has dimension r.The top r PCs may then be utilized to estimate the row basis for L [8].Specifically, under a mild set of assumptions, it has been shown that as m → ∞, the top r PCs of Y converge with probability 1 to a matrix whose row space is equivalent to that of L [26].For our estimation purposes, we only need to consider the V T matrix since this captures the row-space.We would therefore estimate L by simply obtaining the top r right singular vectors, which we denote by V T r .Let's now consider a concrete example of z, L, V T r , and the ultimate inference goal.[27] carried out a gene expression study to identify cell cycle-regulated genes of Saccharomyces cerevisiae (Fig. 2).In this experiment, m = 5981 genes' expression values were measured over n = 14 time points in a culture of yeast cells whose cell cycles had been synchronized.Here, z are the latent variables that represent the dynamic gene expression regulatory program over the yeast cell cycle.L is the manifested influence of z on the observed scale of gene expression measurements (Fig. 1).The ordered time points themselves do not capture the underlying cell cycle regulation, and it is therefore not clear how to a priori accurately model L. If L were directly observed, then we could identify which genes are cell-cycle regulated by performing a significance test of H 0 : b i = 0 vs. H 1 : b i = 0 for each gene i.
However, since L is not observed, we can instead perform the analogous association test using V T r .Fig. 2(a) shows the first two PCs of Y, where it can be seen that these capture systematic variation that resembles cell-cycle regulation.(It should be noted that the remaining PCs, three and higher, do not capture any systematic variation of interest.)Since the row-spaces of L and V T r (r = 2) are theoretically close [26], we can instead utilize the model where Γ is a m × r matrix of unknown coefficients.We would then perform a significance test of H 0 : γ i = 0 vs. H 1 : γ i = 0 for each gene i.
Note that if V T r → L in row-space as m → ∞, then these two hypothesis tests would be equivalent.However, for fixed m, they are not equivalent.There are two main issues: (i) V T r is a noisy estimate of L; (ii) V T r is itself a function of Y, so hypothesis testing on Y = ΓV T r + E results in over fitting as shown in the left panel of Fig. 3. Our proposed method deals with problem (ii) by accounting for the over-fitting that is intrinsic to performing hypothesis testing on model (2).The numerical results in this paper are carried out so that we generate the data from model (1) and evaluate the accuracy of the significance based on the truth from model (1).Therefore, we provide direct evidence that our proposed method accounts for both issues (i) and (ii).

Proposed Algorithms
We have developed a resampling method (Fig. 4) to obtain accurate statistical significance measures of the associations between observed variables and their PCs, accounting for the over-fitting characteristics due to computation of PCs from the same set of observed variables.The proposed algorithm replaces a small number s (s m) of observed variables with independently permuted "synthetic" null variables, while preserving the overall systematic variation in the data.We denote the new matrix with the s synthetic null variables replacing their original values as Y * m×n .This is simply the original matrix Y with the s rows of Y replaced by independently permuted versions.On each permutation data set Y * , we calculate association statistics for each synthetic null variable, exactly as was done on the original data.We carry this out B times, effectively creating B sets of permutation statistics.The association statistics calculated on Y are then compared to the association statistics calculated on only the s synthetic null rows of Y * to obtain statistical significance measures.
Algorithm to Calculate Significance of Variables Associated with PCs 1. Obtain r PCs of interest, V T r by applying SVD to the row-centered matrix 3. Randomly select and permute s rows of Y m×n , resulting in Y * m×n .

Obtain
5. Calculate null F-statistics F 0b 1 , . . ., F 0b s from the s synthetic null rows of Y * as in Step 2, where V T r is replaced with V * T r .
We call this approach the jackstraw for the following reason.By permuting a relatively small amount of observed variables in the original matrix, the underlying systematic variation due to latent variables is preserved as a whole.This makes the PCs of Y * almost identical to the PCs of the original data, Y, up to variation due to over-fitting of the noise.Replacing s variables with null versions is analogous to the game of jackstraws where the goal is to remove one stick at a time from a structured set of sticks without disrupting the overall structure of the sticks.Since the overall structure of Y is preserved in Y * , we know that the level of associations between these synthetic null variables and the top r PCs is purely due to the over-fitting nature of PCA.From these synthetic null statistics, we can therefore capture and adjust for the over-fitting among the original statistics.
A balance between the number of resampling iterations B and the number of synthetic null variables s is relevant to the speed of the algorithm and the accuracy of the resulting p-values.One extreme is to set s = 1, where the accuracy of the p-values is maximized while the algorithm is the least efficient.In each resampling iteration, s determines the number of estimated null statistics, so to get the same resolution of a particular empirical null distribution (s×B total null statistics), B must increase proportionally with a decreasing s.For example, s = 1 and B = 10000 yields the same number of null statistics as s = 100 and B = 100.The number of true null variables in Y * is always greater than or equal to the number of true null variables in the original matrix Y. Therefore an increase of s in the proposed algorithm may lead to a greater over-fitting into the noise of Y * relative to the overfitting in Y, resulting in conservative estimates of significance.Due to this favorable trade-off between s and B, the proposed algorithm is always guarded against anti-conservative bias.
The hypothesis test H 0 : γ i = 0 vs. H 1 : γ i = 0 applied to model (2) may be generalized to performing the test on subspaces spanned by the PCs, shown in the Supporting Information.This generalization allows one to perform the association tests on a subset of PCs, while adjusting for other PCs.It also allows for one to consider rotations of V T r and projections of V T r onto relevant subspaces.For example, it is possible to rotate the PCs to obtain "independent components" from independent component analysis (ICA) [9] and then perform our algorithm on any desired subset of the independent components.

Results
We evaluated the proposed method on simulated data so that we could directly assess its accuracy, and we also applied the method to two genomic datasets to demonstrate its utility in practice.

Simulation Studies
Through a set of simulation studies, we demonstrated that the proposed method is able to accurately estimate the statistical significance of associations between the latent variable basis L and observed variables y i (where i = 1 . . .m).The data in our simulation studies were generated from model (1) Y = BL + E, where variables y i corresponding to b i = 0 are, by definition, the "null variables" not associated with L (Fig. S7).The accuracy of our approach is evaluated by performing m hypothesis tests using the proposed algorithm (where only Y is observed) and assessing whether the joint distribution of p-values corresponding to the null variables is correctly behaved.
The joint null criterion.
We utilized the "joint null criterion" of [28] to assess whether the set of p-values corresponding to the null variables follow the desired joint distribution (Fig. S8).When the null hypothesis is true, a valid statistical hypothesis test must generate "null" p-values that are distributed uniformly between 0 and 1.We evaluated the uniformity of null p-values using the Kolmogorov-Smirnov test (KS test) which is designed to detect any deviation from the Uniform(0,1) distribution.The set of null p-values produced by a method satisfies the joint null criterion in a given simulation scenario when their joint distribution is equivalent to a set of i.i.d.observations from the Uniform(0,1) [28].
There are two ways in which we measured deviations from the Uniform(0,1) joint null criterion.The first is via a two-sided KS test, which detects any deviation; the second is a onesided KS test, which detects anti-conservative deviations where the null p-values are skewed towards zero.Anti-conservative deviations will occur when a method does not properly take into account the fact that the association statistics are formed between the variables and PCs (which have been built from the variables themselves), leading to over-fitting and anticonservative p-values.Evaluation of the joint null criterion works by simulating many data sets (corresponding to independently repeated studies) from a given data generating process (Fig. S8).The joint behavior of the null p-values is then evaluated among these.
We considered 16 simulation scenarios, described below.For a given scenario, we simulated 500 independent studies and calculated 500 KS test p-values, each of which is based on the set of null p-values from its respective study.In other words, for 500 simulation data sets per scenario, 500 KS test p-values are calculated to measure deviations from the Uniform(0,1); a second application of the KS test (so-called "double" KS test) is then performed on these 500 KS p-values to assess whether any anti-conservative deviation from the Uniform(0,1) among these studies has occurred (Fig. S8).If the statistical method being evaluated provides accurate measures of statistical significance, the collection of double KS test p-values must be distributed Uniform(0,1).This guards against any single simulated data set leading one to an incorrect conclusion by chance.
Overall, we demonstrate that our proposed method provides accurate measures of statistical significance of the associations between variables and the latent variables, when the latent variables themselves are directly estimated from the data via PCA.At the same time, we show that the conventional method does not provide accurate statistical significance measures.

Simulation scenarios and results.
We constructed 16 simulation scenarios representing a wide range of configurations of signal and noise (Fig. 5), with 500 independent studies simulated from each.Let's first consider one of the simpler scenarios in detail.Model ( 1) is used to generate the data.In this particular scenario, we have m = 1000, n = 20, r = 1 and a dichotomous mean shift resembling differential expression between the first 10 observations and the second 10 observations.(The factor 1/ √ n is to give L unit variance.)For 95% of the variables, we set b i = 0, implying they are null variables; we parameterize this proportion by π 0 = 0.95.The other 50 non-null variables were simulated such that b i i.i.d ∼ Uniform(0,1).The noise terms are simulated as e ij i.i.d ∼ Normal(0,1).The data for variable i are thus simulated according to For a given simulated data set, we tested for the associations between the observed variables and the latent variables by forming association statistics between the observed y 1 , y 2 , . . ., y m and their collective PC, V T r (r = 1).We calculated p-values using both the conventional F-test and the proposed method with s = 50 synthetic null variables.Over 500 simulated data sets, the conventional F-test resulted in 500 one-sided KS p-values that exhibit a strong anti-conservative bias with a double KS p-value of = 9.71 × 10 −196 (Fig. S9, black points).On the other hand, the proposed method correctly calculates null p-values, by accounting for the over-fitted measured error in PCA, with a double KS p-value of 0.502 (Fig. S9, orange points).Note that the classification of null p-values is based on the true association status from the population-level data generating distribution from model (1), not based on model (2) or on the observed loadings from the PCA.
We carried out analogous analyses on 15 more simulation scenarios, detailed in Fig. 5.We utilized all possible combinations of the following: (1) either dichotomous or sinusoidal functions for L; (2) the parameters B were simulated from either a Bernoulli or Uniform distribution; (3) m = 1000 or m = 5000 variables; and (4) the proportion of true null variables set to either π 0 = 0.75 or π 0 = 0.95.The proposed method was applied with s = 0.05m, 0.10m, and 0.25m to study the impact of the choice of the number of synthetic null variables.For each scenario, we applied the joint null criterion double KS evaluation (Fig. S8), using 500 simulated data sets.The conventional F-test method consistently produced anti-conservative null p-values, while the proposed method yielded accurately distributed null p-values (Fig. 6).
In these simulations, we found that the proposed method tended to produce more conservative null p-values as s increased (Fig. 6).The explanation for this is that inclusion of a larger number of synthetic null variables leads to a greater over-fitting of PCA to the noise, which in turn yields a conservative empirical null distribution formed by the synthetic null statistics.We therefore identified a trade-off between computational speed and how conservative the calculated p-values are in the choice of s.We note, however, that the null p-values were never observed to be prohibitively conservative in that the power became unreasonably diminished.In practice, the user has the option to lower the value of s to minimize this, at the cost of greater computation.
We note that we also investigated a delete-s version of the jackstraw, which draws on ideas from our proposed method, which one could call the permute-s jackstraw.However, this implementation did not produce valid null p-values (Supporting Information).
Testing for associations on subsets of principal components.
We have generalized the proposed method to be able to test for associations on any subset of the top r PCs, while adjusting for the remaining PCs among the top r.Here, we demonstrate that the proposed method can identify variables driving a chosen subset of PCs of interest, V T r 1 , while adjusting for the remaining of the top r PCs which are not of interest, V T r 0 , where r 0 + r 1 = r.Based on model (1), we simulated data with m = 1000, n = 20, r = 2 and L 1 and L 2 are truly associated with 100 variables and 60 variables, respectively.Among these, 40 variables that are truly associated with both L 1 and L 2 .We generated the noise term as e ij i.i.d ∼ Normal(0,1).We set r = 2 and tested for associations with the first PC while adjusting for the second PC.Note that the first PC effectively captured the signal from the first latent variable.In this case, the null variables were defined to be 900 variables associated with either only the second latent variable or no latent variable.The conventional F-test resulted in an anti-conservative bias among the null p-values, with a double KS test p-value of 8.73 × 10 −20 , while the proposed method produced a correct joint null p-value distribution with a double KS test p-value of 0.352 (Fig. S10).
Furthermore, we generalized the proposed method to be able to test for associations on rotations of the top r PCs (and on subsets of these rotations).This allows for a general exploration of the row-space spanned by the top r PCs.The details of these generalizations are given in the Supporting Information.

Application to Gene Expression Studies
Typically, genomic variables are tested for the associations with external variables, which are measured independently of genomic profiling technology, such as disease status, treatment labels, or time points.However, external variables may be imprecise or inaccurate due to poor understanding of the biology or technological limitations; sometimes the external variables of interest may not be capable of being measured at all.For example, in a cancer gene expression study, the cancer types may be based on histological classification of the tumor cells.Then, association tests, such as F-tests, are conducted between the histological classification and transcriptional levels to discover genes of interest.However, the histological classification of cancer tumors may not distinguish important cancer subtypes [29,30].This lack of information may lead to a spurious signal or reduced power in statistical inference.
When the external variables are unmeasured or imprecise, we are interested in using the latent variable basis, L, to discover genes of interest (Fig. 1).Because L is never directly measured, we must estimate it from the genomic data, using PCA and related methods.We apply our proposed method to two genomic datasets to demonstrate its utility in practice.

Cell cycle regulated gene expression in Saccharomyces cerevisiae.
It is known that in Saccharomyces cerevisiae there is an abundance of genes whose transcription is regulated with respect to the cell cycle [27,31].Nonetheless, comprehensive identification of the yeast genes whose expression is regulated by the cell cycle is still an active area of research, since it's unclear how the yeast cell cycle regulation should be quantified and modeled [32][33][34][35].The experimental time points after cell population synchronization are readily measured, but this external variable does not directly represent periodic transcriptional regulation with respect to the cell cycle.
Suppose that we want to carry out a hypothesis test on each gene of whether it shows regulation associated with a periodic pattern over the cell cycle.The null hypothesis is then that population mean is not periodic over the cell cycle.This null hypothesis contains an infinite number of mean time-course trajectories that are non-periodic, making the null hypothesis composite.A composite null hypothesis such as this one is largely intractable because it contains an unwieldy class of potential probability distributions describing gene expression.Indeed, a survey of the literature reveals that this composite null hypothesis is the major challenge when a traditional hypothesis testing approach is taken.However, by using our approach, we can reduce the complexity of this problem by directly estimating the manifested systematic periodic expression variation and applying the proposed method to identify genes associated with this systematic variation due to the latent variables, L.
[27] measured transcriptional levels of m = 5921 yeast genes, every 30 min for 390 min after synchronizing the cell cycle among a population of cells by elutriation2 .The top two PCs capture the manifestation of cell cycle regulation on gene expression [5], explaining 48% of total variance (Fig. 2a,b).By testing for associations between time-course gene expression and the top two PCs, we essentially transform this challenging problem into a tractable association significance testing problem with a simple null hypothesis H 0 : γ i = 0 vs. H 1 : γ i = 0 (as opposed to a composite null).The hypothesis test is now simply whether gene i is associated with r = 2 latent variables estimated by the top two PCs.
We applied the proposed method to test this hypothesis and identified a large number of genes associated with yeast cell cycle regulation.We discovered that approximately 84.0% of the 5981 measured genes are associated with the top two PCs ( π 0 = 1 − 0.84 = 0.16).At FDR ≤ 1%, 2988 genes were found to be statistically significant.Hierarchical clustering of those 2988 genes shows strong, yet diverse, periodic patterns (Fig. 2c).The generalized proposed method allows us to compute statistical significance measures of associations with a subset of PCs.When testing for associations with the first PC while adjusting for the second PC, 1643 genes were called statistically significant at FDR ≤ 1%, with the estimated proportion of null variables π 0 = 64.6%.On the other hand, at the same FDR threshold, we found 966 genes were significantly associated with the second PC with π 0 = 59.8%.

Inflammation associated gene expression in post-trauma patients.
Large-scale clinical genomic studies often lead to unique analytical challenges, including dealing with a large number of clinical variables, unclear clinical endpoints or disease labels, and expression heterogeneity [2].The "Inflammation and the Host Response to Injury" (IHRI) consortium carried out a longitudinal clinical genomics study on blunt force trauma patients.They collected 393 clinical variables (some longitudinal) and time-course gene expression (total of 797 microarrays) on 168 post-trauma patients [23].One of the main goals in this study was to elucidate how inflammatory responses after blunt force trauma are manifested on gene expression.To aggregate relevant clinical variables into a manageable daily score, the IHRI consortium used a modified version of the Marshall score to rate the severity of multiple organ dysfunction syndrome [36].
Based on the modified Marshall score trajectories, [23] clustered post-trauma patients into five groups, called "ordered categorical Multiple Organ Failure" (ocMOF) labels.The time-course gene expression profiles of each patient were summarized by "within patient expression changes" (WPEC) [23].Then, they tested for correlations between the WPEC genomic variables and the ocMOF score to discover genes associated with inflammatory responses of post-trauma patients.However, the use of the potentially noisy ocMOF clinical variable may impose limitations, as patients with similar Marshall scores may exhibit a wide range of clinical outcomes [37].Furthermore, five discrete values for the ocMOF scores potentially limits the resolution of the clinical variable.
To investigate this further, we utilized our proposed approach where the gene expression itself was used to construct clinical phenotypes on the patients.We directly used the WPEC data to characterize the molecular signature of inflammatory responses to blunt force trauma.We estimated the manifestation of post-trauma inflammatory responses on gene expression, L, with the top 9 PCs (Fig. S13).Then, we applied the proposed method to identify the genomic variables in WPEC associated with the top 9 PCs.The original analysis in [23] estimated 24% of the 54, 675 genomic variables (probe sets) to be associated with the ocMOF score.In contrast, our analysis revealed a much larger proportions of the genomic variables to be significantly associated with the major sources of variation, ranging from 62% for 1 st PC to 39% for 9 th PC.
The genes identified in the original analysis [23] were largely also identified in our analysis, although our analysis provided many more significant genes.To compare the biological relevance of our re-analysis versus the original analysis, we tested for enrichment of 17 inflammation-related gene sets [38], using one-sided Mann-Whitney-Wilcoxon tests with permutation-based significance.At the FDR ≤ 1%, none of the inflammatory-related gene sets is enriched for the original analysis using the ocMOF scores [23].In contrast, a large number of inflammation-related gene sets are significantly enriched when the genomic variables are tested for the associations with the top 9 PCs individually (Table 1).MAPK Signaling is enriched for every PC, except 5 th PC, whereas Innate Pathogen Detection is enriched for 1 st , 4 th , 6 th , and 9 th PCs, at the FDR ≤ 1%.Those two biological pathways were emphasized in the original analysis [23] as indicating down-regulation of innate pathogen detection and up-regulation of MAPK signaling pathway, and they were seen as strong predictors of long-term complications from brute force trauma.Based on enrichment tests, the proposed method appears to provide a biologically richer source of information than the analysis based on the ocMOF scores.

Discussion
We have developed a method to accurately carry out statistical significance tests of associations between high-dimensional variables and estimated latent variables, which have been estimated through systematic variation present in the observed high-dimensional variables themselves.Our approach is to maintain the overall systematic variation in the highdimensional data set, while replacing a small number of observed variables with independently permuted synthetic null variables.These synthetic null variables allow us to estimate the null distribution of the association statistics calculated on the original data, that takes into account the inherent over-fitting that occurs when estimating latent variables through methods such as PCA.We call this approach the jackstraw because it draws on the idea of the game of jackstraws, where a player must remove a stick (i.e., a variable) from a pile of tangled sticks without disturbing the overall structure.Through extensive simulations, we demonstrated that the proposed method is capable of accounting for over-fitting and producing accurate statistical significance measures.We also demonstrated that applying conventional association testing methods to this problem artificially inflates the statistical significance of associations.
An input required for the proposed method is the number of PCs, r, that capture systematic variation from latent variables.Determining the number of "statistically significant" PCs is an active area of research, and defining a number of significant PCs depends on the data structure and the context [15][16][17][18]26].Setting r too small results in systematic residual variation in model (2), which hinders the accuracy of the approach.We suggest erring on the side of setting r larger, and utilizing PCs not of interest as adjustment variables in the more general form of our algorithm.For example, if one would like to identify variables associated with the top three PCs, but is unsure whether the given data has three or four significant PCs, we have found it more robust to input r = 4 which will adjust for potential systematic residual variation captured by the fourth PC.
We demonstrated our approach using PCA.It is well known that individual PCs may not be directly interpretable or may contain multiple signals of interest that the user wishes to distinguish.Our method is capable of considering any subspace spanned by the top r PCs, while adjusting for the remaining subspace spanned by these PCs (to account for systematic variation to of interest).Our method is also easily adapted to methods such as Independent Component Analysis.Details are given in the Supporting Information.This allows the user to intelligently investigate the space spanned by systematic variation and carefully construct estimates of the latent variables of interest.We do not advocate blindly applying our method to the top r PCs without considering these issues.
The proposed method represents a novel resampling approach operating on variables, whereas established resampling approaches, such as the jackknife and the bootstrap, tend to operate on observations [39][40][41].When applying these methods, systematic variation due to latent variables is intentionally perturbed, since their purpose is typically to assess the sampling variation of a single variable.In high-dimensional data, we may need to preserve systematic variation due to latent variables, which is the problem that the jackstraw addresses.
By accurately testing for associations between observed high-dimensional variables and the systematic manifestation of latent variables in the observed variables, our proposed method allows for the automatic discovery of complex sources of variation and the genomic variables that drive them.The proposed method extends PCA and related methods beyond their popular applications in exploring, visualizing, and characterizing the systematic variation to genomic variable level (e.g., gene-level) significance analyses.Given the increasingly important role that non-parametric estimation of systematic variation plays in the analysis of genomic data [1,2,5], the proposed method may be useful in many areas of quantitative biology utilizing high-throughput technologies as well as other areas of high-dimensional data analysis.
Tables Table 1: Q-values from gene enrichment analysis using inflammation-related gene sets

Figures
Figure 1: Illustration of systematic variation genomic data due to latent variables.Complex biological variables, such as cancer subtypes and cell cycle regulation, may be difficult to define, measure, or model.Instead, we can characterize the manifestation of latent variables, L(z), directly from high-dimensional genomic data using PCA and related methods.The proposed method calculates the statistical significance of associations between variables in Y and estimates of L, while accounting for over-fitting due to the fact that L must be estimated from Y.  .By independently permuting a small number (s) of variables and recalculating the PCs, we generate tractable "synthetic" null variables while preserving the overall systematic variation.Association statistics between the s synthetic null variables in Y * and V * T r form the empirical null distribution, automatically taking account over-fitting intrinsic to testing for associations between a set of observed variables and their PCs.To assess the statistical accuracy of the conventional F-test and the proposed method, we simulated 500 independent studies for each scenario, and assessed statistical accuracy according to the "joint null criterion" [28].For a given simulation study, a valid statistical testing procedure must yield a set of null p-values that are jointly distributed Uniform(0,1).We use a KS test to identify deviations from the Uniform(0,1) distribution.Fig. S8 provides a detailed overview of the evaluation pipeline.
One−sided KS−test Two−sided KS−test  For each of 500 independent studies per scenario, we tested for deviation of null p-values from Uniform(0,1), resulting in 500 KS test p-values for each scenario.An individual point in the QQ-plot represents a double KS test p-value for one scenario, comparing its 500 KS test p-values to Uniform(0,1).On the left panel, the systematic downward displacement of 16 black points indicates an anti-conservative bias of the conventional F-test.In contrast, the proposed method produces null p-values that are not anti-conservative.On the right panel, a set of 16 points are below the diagonal red line if the joint null distribution deviates from the Uniform(0,1) distribution.The proposed method adjusts for over-fitting of PCA and produces accurate estimates of association significance.
Consider a r × r rotation matrix, R.Then, the generalization of model ( 2) is To perform a significance test of H 0 : γ i ∈ Ω 0 vs. H 1 : γ i ∈ Ω 1 for each gene i, the algorithm needs to adapt the model ( 3) with RV T r , in place of model ( 2) with V T r .The rotation matrix R must be applied on V * T r in every iteration of the estimation step of null association statistics.This generalization allows one to perform hypothesis tests using various rotations and projections of PCs.
The rotation matrix R could reflect biological or clinical measurements, independent of genomic data.By regressing the top r PCs on a noisy phenotype, we may improve the molecular signature of interest (e.g., separating technical artifacts from biological variation).For example, let's assume a high-dimensional genomic data set contains three statistically significant PCs (r = 3) and we are interested in identifying variables associated with a particular linear combination of the top three PCs: Then, we may construct a 3 × 3 rotation matrix R, which must be orthonormal whose determinant equals 1, as required by any proper rotation matrix.In this case, we arrive at After obtaining this particular rotation of the top three PCs, W r = RV T r , we can continue onto performing a significance test using model (3).The two modifications to the algorithm are: 1. To apply R on both V T r and V * T r to obtain W r = RV T r and W * r = RV * T r in the computation of observed association statistics and null association statistics, respectively.2. To construct a hypothesis test of association between variable i and w 1 , while accounting for w 2 and w 3 : Another interesting example of this generalized jackstraw algorithm arises from transformations of V T r based on a statistical criterion.Starting with V T r , Independent Component Analysis (ICA) seeks r components that are mutually statistically independent (Comon, 1994).In some cases with statistically independent and non-Gaussian sources of variation, independent components may be more sensible, providing an interpretable low-dimensional representation.We may obtain independent components by rotating V T r to maximize mutual statistical independence (Hastie et al., 2011).Let R be a rotation matrix that maps V T r into the r independent components.To carry out the association test on the r independent components, we use the generalized model ( 3) with RV T r , instead of model ( 2) with V T r .As above, the analogous substitution occurs at every iteration to ensure the same rotation is applied to synthetic null variables.
Remark 1.If one wants to test linear hypotheses on the unrotated PCs, then Step 2 is the above algorithm is carried out such that R is the identity matrix.

Remark 2.
Step 4 of the generalized algorithm above permutes s rows of Y, thereby breaking all systematic variation spanned by V T r among these s variables.If one is testing for associations on only a subspace of V T r , then it may be desirable to preserve the systematic variation spanned by the subspace of V T r acting as adjustment variables.(E.g., If one is testing for associations with the first PC where r = 2, then it may be desirable to preserve the systematic variation spanned by the second PC among these s variables.)This can be accomplished, for example, by modifying the permutations to act only on the residuals after regressing out the adjustment variables, or by carrying out a bootstrap null procedure whereby residuals resulting from this regression are bootstrapped and added back to the adjustment variable only model fit (see, e.g., [43]).
An alternative delete-s approach.
We also investigated a delete-s version of the jackstraw, which draws on ideas from our proposed method, the permute-s jackstraw.In the delete-s jackstraw, the set of m variables is broken up into m/s disjoint sets of s variables each, where s m.The significance of associations between a given set of s variables and unobserved latent variables is then calculated by testing for the associations between each of the s variables and the top r PCs calculated among the remaining m − s variables.Since the set of s variables are not used to estimate the latent variables, the association p-values for those s variables should be marginally valid.However, we found that the delete-s jackstraw did not satisfy the joint null criterion.We applied the delete-s jackstraw to the above 16 simulation scenarios and found that the delete-s jackstraw results in an anti-conservative bias (Fig. S11).As s increases, the null p-values from the delete-s jackstraw exhibited a greater anti-conservative bias.We hypothesis that this delete-s approach did not produce valid null p-values because of a complex dependence among the m/s sets of s p-values (any two sets of s p-values have an overlap of m − 2s variables used to construct the PCs used in the association tests), similar to dependence issues encountered in cross-validation.

Supplementary Figures
Figure S7: Diagram of the latent variable model (1).The latent variable basis L is not observable, but may be estimated from Y using the top r principal components (V T r ).The noise term E is independent random variation.We are interested in performing statistical hypothesis tests on b i (i = 1, . . ., m), which quantifies the relationship between L and y i (i = 1, . . ., m).Since L must be estimated from Y, a conventional association test results in anti-conservative p-values.We have developed the jackstraw method to account for overfitting due to using estimates of L to compute the statistical significance of associations between L and y i .
Figure S8: Evaluation pipeline for 16 simulation scenarios.To assess statistical accuracy in testing the associations between variables and principal components, we generated 16 scenarios from a wide range of configurations.For a given scenario, we simulated 500 independent studies, which resulted in 500 KS test p-values (refer to a green box).For a valid statistical method, this set of 500 KS test p-values should be distributed Uniform(0,1); we evaluate an anti-conservative bias among 500 KS test p-values with another application of the KS test (a "double KS test") .This results in the 16 simulation scenarios being summarized by 16 double KS test p-values (Fig. 6), giving us a comprehensive view of the method's statistical accuracy.
One−sided KS−test Two−sided KS−test  In this scenario, we generated a dichotomous mean shift between two groups of observations with the true proportion of null variables π 0 = 0.95.We assessed the Uniform(0,1) property of null p-values using the Kolmogorov-Smirnov (KS) test.In the left QQ-plot comparing one-sided KS test p-values and the Uniform(0,1) distribution, the downward displacement of the black points below the diagonal red dashed line indicates anti-conservative p-values resulting from the conventional F-test.In contrast, the upward displacement of the colored points demonstrates how the proposed method guards against anti-conservative bias.On the right panel, a two-sided KS test detects any deviation -both conservative and anti-conservative -from the theoretically correct Uniform(0,1) distribution.
One−sided KS−test Two−sided KS−test  We simulated m = 1000 variables of n = 20 observations, with r = 2 significant PCs.We applied the conventional F-test and the proposed method to test for associations between m variables and the 1 st PC.The proposed method is capable of correctly estimating significance measures of the associations between variables and their 1 st PC, while adjusting for the 2 nd PC.The conventional F-test produces anti-conservative p-values due to over-fitting, where KS test p-values are skewed towards 0.
One−sided KS−test Two−sided KS−test   The top two PCs of the original 14 microarrays corroborate an aberrant gene expression profile from 300 min, which is removed from our new analysis.q q q q q q q q q q q qq q q q qq qq q qqq qq qq qqqq Percent Variance Explained by All 126 PCs Index Percent Variance Explained q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0 5 10 15 20 25 30

Figure 2 :Figure 3 :
Figure 2: Identification of yeast genes associated with the cell cycle regulation.(a) The top two PCs of gene expression profiles, measured over time in a population of yeast whose cell cycles have been synchronized by elutriation, capture major transcriptional regulatory patterns [27].(b) The percent variance explained by PCs shows that the top two PCs capture 48% of the total variance in the data.(c) Hierarchical clustering of transcriptional levels significantly associated with the top two PCs at FDR ≤ 1% reveals a diverse set of systematic time-course gene expression trajectories.

Figure 4 :
Figure 4: A schematic of the general steps of the proposed algorithm to calculate the statistical significance of associations between variables (rows in Y) and their top r principal components (V T r ).By independently permuting a small number (s) of variables and recalculating the PCs, we generate tractable "synthetic" null variables while preserving the overall systematic variation.Association statistics between the s synthetic null variables in Y * and V * T r form the empirical null distribution, automatically taking account over-fitting intrinsic to testing for associations between a set of observed variables and their PCs.

Figure 5 :
Figure5: Sixteen simulation scenarios generated by combining four design factors.To assess the statistical accuracy of the conventional F-test and the proposed method, we simulated 500 independent studies for each scenario, and assessed statistical accuracy according to the "joint null criterion"[28].For a given simulation study, a valid statistical testing procedure must yield a set of null p-values that are jointly distributed Uniform(0,1).We use a KS test to identify deviations from the Uniform(0,1) distribution.Fig.S8provides a detailed overview of the evaluation pipeline.

Figure 6 :
Figure6: QQ-plots of double KS test p-values from 16 simulation scenarios versus the Uniform(0,1) distribution.For each of 500 independent studies per scenario, we tested for deviation of null p-values from Uniform(0,1), resulting in 500 KS test p-values for each scenario.An individual point in the QQ-plot represents a double KS test p-value for one scenario, comparing its 500 KS test p-values to Uniform(0,1).On the left panel, the systematic downward displacement of 16 black points indicates an anti-conservative bias of the conventional F-test.In contrast, the proposed method produces null p-values that are not anti-conservative.On the right panel, a set of 16 points are below the diagonal red line if the joint null distribution deviates from the Uniform(0,1) distribution.The proposed method adjusts for over-fitting of PCA and produces accurate estimates of association significance.

Figure S9 :
FigureS9: QQ-plots of KS test p-values from the highlighted simulation scenario in the main text, using the conventional F-test and the proposed method.In this scenario, we generated a dichotomous mean shift between two groups of observations with the true proportion of null variables π 0 = 0.95.We assessed the Uniform(0,1) property of null p-values using the Kolmogorov-Smirnov (KS) test.In the left QQ-plot comparing one-sided KS test p-values and the Uniform(0,1) distribution, the downward displacement of the black points below the diagonal red dashed line indicates anti-conservative p-values resulting from the conventional F-test.In contrast, the upward displacement of the colored points demonstrates how the proposed method guards against anti-conservative bias.On the right panel, a two-sided KS test detects any deviation -both conservative and anti-conservative -from the theoretically correct Uniform(0,1) distribution.

Figure S10 :
FigureS10: Evaluation of statistical tests for the associations on subsets of principal components (PCs).We simulated m = 1000 variables of n = 20 observations, with r = 2 significant PCs.We applied the conventional F-test and the proposed method to test for associations between m variables and the 1 st PC.The proposed method is capable of correctly estimating significance measures of the associations between variables and their 1 st PC, while adjusting for the 2 nd PC.The conventional F-test produces anti-conservative p-values due to over-fitting, where KS test p-values are skewed towards 0.

Figure S11 :
FigureS11: QQ-plots of double KS-test p-values from applying the delete-s jackstraw on 16 simulation scenarios.In a delete-s version of the jackstraw, the association p-values between the latent variable basis and a given set of s variables are estimated by testing for the associations between the PCs from m − s variables and the separate set of s variables, which were left out in computation of the PCs.A systematic downward displacement of the points below the red diagonal line indicates anti-conservative p-values, present in both the conventional F-test and the delete-s jackstraw method.A larger value of s in the delete-s jackstraw leads to a greater anti-conservative bias.

Figure S12 :
Figure S12: Original gene expression profiles of the Spellman et al. (1998) yeast cell cycle experiment.(a) A heat map of the covariance matrix of 14 arrays reveals an outlier from the time point at 300 min (low values in blue and high values in yellow).(b) The top two PCs of the original 14 microarrays corroborate an aberrant gene expression profile from 300 min, which is removed from our new analysis.

Figure S13 :
Figure S13: Percent variance explained by principal components (PCs) of the "Within Patient Expression Change (WPEC)" matrix from Desai et al. (2011).The considerable drop (the "elbow") in this scree plot helped us to identify the top 9 PCs as significant, likely capturing the molecular signatures of patient responses to blunt force trauma.
Generalized Algorithm to Calculate Significance of Variables Associated with PCs 1. Obtain r PCs of interest, V T r by applying SVD to the row-centered matrix Y m×n = UDV T .2. Rotate V T r using a r × r rotation matrix R to model Y = ΓRV T r + E .3. Calculate m observed F-statistics F 1 , . . ., F m , testing H 0 : γ i ∈ Ω 0 vs. H 1 : γ i ∈ Ω 1 from the Step 2 model.4. Randomly select and permute s rows of Y m×n , resulting in Y * m×n .