Summary: We provide a new statistical algorithm and software package called ‘eigen-R2’ for dissecting the variation of a high-dimensional biological dataset with respect to other measured variables of interest. We apply eigen-R2 to two real-life examples and compare it with simply averaging R2 over many features.
Availability: An R-package
Supplementary information:Supplementary data are available at Bioinformatics online.
A common goal in the analysis of high-dimensional biological studies is to decompose the variation of thousands of measured features in terms of some other variables. In statistical parlance, the measured features are considered a set of related ‘response variables’ and the other variables used to explain their variation are called ‘independent variables’.
A common application of this goal is in dissecting the variation of transcriptional levels of thousands of genes in terms of relevant biological variables. For example, Jin et al. (2001) estimated the contributions of sex, genotype and age to transcriptional variation in Drosophila melanogaster. They found that expression variation is mostly explained by sex, genotype and their interactions, and less explained by age. Ross et al. (2000) characterized the transcriptional variation of thousands of genes across human cancer cell lines. Brem et al. (2002) dissected the variation of expression in yeast according to genotype, based on recombinant lines derived from two distinct isogenic strains. Similarly, Morley et al. (2004) analyzed genome-wide variation in human gene expression in terms of underlying genetic variation. The genome-wide transcription variation explained by population structure has been estimated in the teleost fish (Oleksiak et al., 2002) and humans (Storey et al., 2007).
In all of these studies, one can think of each feature as being a response variable, where a key summary statistic is the proportion of variation among the thousands of response variables explained by the independent variables of interest. For a single response variable, the proportion of variation explained by independent variables is usually accomplished by calculating R2. This is computed as the ratio of the variance of the fitted model to the variance of the response variable. For example, heritability is the R2 value according to a particular model of the quantitative trait based on genotypes.
With thousands of response variables, one can calculate R2 values for each one, resulting in thousands of these values. Even though it is reasonable to simply plot the distribution of these R2 values, sometimes it is also desirable to calculate an average R2, so that statements can be made about the overall proportion of variation in the response variables explained by variables of interest. One obvious choice is to simply take the mean of the R2 values, which we refer to as mean-R2. However, this measure may be vulnerable to technical artifacts in the data. For example, in a gene expression study, many genes will not be expressed. For such genes, the expression measurements are low, resulting in the ratio of the mean expression to the SD being unstable. In this case, the R2 values of these genes would fluctuate wildly. This has been recognized as an issue in testing genes for differential expression, and usually a small constant is added to the estimated variance of the genes to down weight the genes that are likely not expressed (Tusher et al., 2001). Another solution that has been implemented is to eliminate genes on the arrays before any analysis is done by a presence/absence call.
Here, we introduce a new statistical algorithm and software package for calculating a quantity called eigen-R2. The eigen-R2 quantity is computationally efficient to calculate, and it is robust to the issues described above. We show, for example, that the eigen-R2 values in an expression study are robust to different levels of gene-filtering based on a presence/absence call, whereas mean-R2 is strongly affected by gene-filtering.
Suppose that the high-dimensional biological dataset is organized as an m × n matrix, Y, where the rows of Y represent different response variables and the columns represent different observations. For example, in a gene expression study, the rows of Y are the m genes and the columns of Y are the n arrays. Suppose also that an additional variable z = (z1, z2,…,zn)T has been measured. For example, z could be a clinical variable, genotype or treatment. If yi = (yi1, yi2,…,yin)T is the data for response variable i, then we may fit a model of yi on z to obtain fitted values . The proportion of variation in yi that is explained by z is then:is the mean of and is the mean of yi. It then follows that mean-R2 is the average of these across all response variables, .
Instead of taking the average of the , we have developed an approach to employ principal components analysis to measure an overall R2. In principal components analysis, a singular value decomposition is applied to the data matrix, decomposing Y into the following: y = U D VT, where the matrices U and V are column orthogonal so that UTU = VTV = I and D is a diagonal matrix. The columns of V are the right eigenvectors and the columns of U are the left eigenvectors. We are particularly interested in the right eigenvectors, because these represent aggregated trends in the response variables. Specifically, the first column of V is the linear combination of all response variables that explains the most variation in the data, called the first right eigenvector. The second column of V is the linear combination of all response variables that explains the most variation in the data once the first eigenvector has been removed, and so on. The proportion of total variation captured by the i-th eigenvector is πi = di2/∑l=1Ldl2, where di is the eigenvalue of the i-th eigenvector, which is obtained from the i-th diagonal entry of D.
When Y is a matrix of expression data, the right eigenvectors have been called ‘eigen-genes’. Alter et al. (2000) first proposed this terminology and showed that principal components analysis is a useful tool for identifying major trends in expression data. The top few eigen-genes tend to capture biologically relevant trends present in a number of genes. In a more general setting, one could call the columns of V ‘eigen-response-variables’.
Let vi be the i−th column of V and let be the fitted values when modeling vi in terms of z. For each of these, we can calculated an R2 value, denoted by . Since πi of the total variation in the data is explained by vi, should be weighted by πi. Additionally, since each pair of eigen-response-variables is uncorrelated, the variation explained by z in vi is orthogonal to the variation explained by z in vj where j≠i. Therefore, as an overall measure of R2, we proposed to take the average of the , weighted by their respective πi:
Step 1. Let Y be an m × n matrix, where the rows of Y represent different response variables and have been mean centered, and the columns represent different observations. Use singular value decomposition to decompose the matrix of response variables as y = UD VT.
Step 2. For each column of V, denoted by vi, fit the user−specified model of vi on the independent variable(s) z to obtain fitted values , i=1, 2,…,n. Calculate the R2 value of this model fit as described above to obtain .
Step 3. Calculate the proportion of variation explained by vi with πi = di2 / ∑l = 1Ldl2.
Step 4. Calculate the overall R2 value as eigen-.
Straightforward linear algebra shows that using models that can be fitted by applying a linear operator (e.g. ordinary least squares regression, least squares regression with spline functions), eigen-R2 can equivalently be written as follows:values, where the weights correspond to the response variable's contribution to the overall distribution of baseline variances. It also follows from this equivalence that if the rows of Y are not only mean-centered, but also scaled to have unit SD, then eigen-R2 is equal to mean-R2. In the Supplementary Material, we prove the above equivalence and compare the eigen-R2 measure to two other methods for determining the weights.
We have developed an R library called
The software offers an option to adjust the R2 calculations to accommodate small sample sizes (small n) by replacing R2 with 1 − (1 − R2) × (n − d0)/(n − d), where d0 and d are the degrees of freedom spent in fitting the null model and the target model, respectively. By default, the null model is a model with only intercept and d0 = 1. The adjusted estimator has been shown to be an unbiased estimator of the true value (Schroeder et al., 1986). The software also provides the option to utilize only those eigenvectors that explain more variation than would be expected by chance. When the significance threshold is specified, the function estimates the P-value for each eigen-gene (Leek and Storey, 2007), and only selects the ones pass the threshold to estimate eigen-R2. This acts as an additional mechanism for de-noising the overall eigen-R2 estimate.
4.1 Expression variation within and among populations
Storey et al. (2007) recently analyzed patterns of natural gene expression variation in B lymphoblastoid cells derived from 16 HapMap individuals of European and African ancestry, eight from each respective population. They characterized the gene expression variation to within and among these populations for each gene by utilizing R2 calculations, and also tested for genes differentially expressed within and among populations. We employed the eigen-R2 method and software on this dataset to produce a genome-wide summary of the within-population and inter-population contributions to genome-wide expression variation.
In Table 1, we present a comparison of the eigen-R2 and mean-R2 values over varying numbers of probe sets obtained from different filtering thresholds available in the
|No. of probe sets||8746||5969||5194||4646|
|No. of probe sets||8746||5969||5194||4646|
Each column represents a different threshold for filtering probe sets during preprocessing. Technical variation was not removed before calculating R2, so this variable explains the remaining variation in the data.
4.2 Genetic dissection of transcriptional variation
Brem et al. (2002) explored the genetic architecture of budding yeast by analyzing data from a experimental cross between two strains of Saccharomycis cerevisiae. The parental strains are haploid derivatives of a standard laboratory strain (BY) and a wild isolate from a California vineyard (RM). For the 112 segregants, expression levels of ∼6000 genes were measured and genetic markers were identified with oligonucleotide microarrays. We performed linkage analysis of the expression levels similarly to Brem et al. (2002). At a P-value cutoff of 5 × 10−7, about 9000 out of the ∼19 million of gene-marker combinations show significant linkage. Figure 1a plots the significantly linked gene expression trait positions against the linked marker positions on the yeast genome. We estimated the eigen-R2 and mean-R2 values at each locus, shown in Figure 1b. It can be seen from these plots that both quantities capture the linkage hotspots well. However, eigen-R2 tends to capture more signal, particularly on Chromosomes II, III, XII, XIII and XIV. The MAT mating locus on Chromosome III, which has been shown to play an important role in gene expression (Brem et al., 2002), has the highest eigen-R2 of 8.1%, while the mean-R2 at that locus is 1.9% (Fig. 1c).
Conflict of Interest: none declared.