Summary: Heterogeneity and latent variables are now widely recognized as major sources of bias and variability in high-throughput experiments. The best-known source of latent variation in genomic experiments is batch effects—when samples are processed on different days, in different groups or by different people. However, a large number of other variables may also have a major impact on high-throughput measurements. Here we describe the sva package for identifying, estimating and adjusting for unwanted sources of variation in high-throughput experiments.
Supplementary information: Supplementary data are available at Bioinformatics online.
High-throughput data are now commonly used in molecular biology to (i) identify genomic features associated with outcomes and (ii) build signatures for prediction. These goals are complicated by the presence of latent variables or unwanted heterogeneity in the high-throughput data. Batch effects are the most widely recognized potential latent variable in genomic experiments. The impact of batch effects can be severe, potentially completely compromising biological results (Leek et al., 2010). Furthermore, batch effects are not the only potential source of latent variation that may compromise the statistical or biological validity of a study (Leek and Storey, 2007).
Here we introduce the sva package, which implements surrogate variable analysis, the ComBat batch-adjustment method and frozen surrogate variable analysis for removing these sources of unwanted variation.
2 Using the SVA package
2.1 Data format
The data are formatted as a matrix, with features (transcripts, genes, proteins) in rows and samples in columns. Two model matrices must be created with the model.matrix function: a full model matrix, including both the adjustment variables and the variables of interest, and a null model matrix, containing only the adjustment variables.
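As a minimal sketch of this setup (the phenotype data frame, the `cancer` outcome and the `age` adjustment variable are hypothetical, chosen only for illustration):

```r
# Hypothetical phenotype data for four samples
pheno <- data.frame(
  cancer = factor(c("tumor", "tumor", "normal", "normal")),
  age = c(61, 54, 58, 63)
)

# Full model matrix: adjustment variables plus the variable of interest
mod <- model.matrix(~ age + cancer, data = pheno)

# Null model matrix: adjustment variables only
mod0 <- model.matrix(~ age, data = pheno)
```

The expression matrix itself would have one column per row of `pheno`, in the same sample order.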
2.2 The sva function for estimating and removing surrogate variables
2.3 The ComBat function for removing batch effects
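When the batch variable is known, ComBat can adjust for it directly. A sketch, assuming `edata` is the expression matrix, `batch` is a hypothetical vector giving the known batch of each sample, and `mod` is a model matrix for the variables of interest so that biological signal is preserved during adjustment:

```r
library(sva)

# Empirical Bayes batch adjustment; returns an expression matrix of the
# same dimensions as edata, with batch effects removed
combat_edata <- ComBat(dat = edata, batch = batch, mod = mod)
```

The adjusted matrix can then be analyzed with standard methods, since the known batch effect has been removed from the measurements themselves.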
2.4 fsva for prediction
For genomic prediction, datasets are generally composed of a training set and a test set. For each sample in the training set, the outcome/class is known, but latent sources of variability are unknown. For the samples in the test set, neither the outcome/class nor the latent sources of variability are known. When applying genomic predictors, individual samples must be corrected one at a time, yet most functions for batch correction and surrogate variable estimation have been developed in the context of population studies. ‘Frozen’ surrogate variable analysis (fsva) can be used to remove latent variation in the training and test sets, as well as in individual samples obtained in future studies, similar to recently developed normalization procedures (McCall et al., 2010).
The arguments that must be passed to fsva are the training data, the model matrix for the training data, the sva output obtained on the training data and the new test data to be adjusted.
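A sketch of this two-step procedure, with hypothetical matrices `trainData` and `testData` and training model matrices `trainMod` and `trainMod0`:

```r
library(sva)

# Step 1: estimate surrogate variables on the training set only
trainSv <- sva(trainData, trainMod, trainMod0)

# Step 2: "freeze" the training adjustment and apply it to both sets
fsvaobj <- fsva(dbdata = trainData, mod = trainMod,
                sv = trainSv, newdat = testData)

# fsvaobj$db:  adjusted training data
# fsvaobj$new: adjusted test data
```

A predictor trained on `fsvaobj$db` can then be applied to `fsvaobj$new` without the test outcomes or test latent variables ever being used in the adjustment.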
We have introduced the sva package for removing batch effects and other unwanted sources of variation in high-throughput experiments.
3.1 Surrogate variables versus direct adjustment
The goal of the sva approach is to remove all unwanted sources of latent variation while protecting the contrasts due to the primary variables of interest.
In some cases, latent variables may be important sources of biological variability. If the goal of the analysis is to identify heterogeneity in one or more subgroups, the sva approach may not be appropriate, since it may remove this biological variation along with the unwanted variation.
In contrast, direct adjustment removes only the effect of known batch variables. Batch effects are the best-known source of latent variation in genomic experiments (Leek et al., 2010). However, many other variables may have a substantial impact on genomic measurements, from environmental variables (Gibson, 2008) to genetic variation (Brem et al., 2002; Schadt et al., 2003). These variables may be the focus of the study being performed. But many studies focus instead on identifying the association between genomic measurements and specific outcomes or phenotypes. In these studies, genetic and environmental variables are often unmeasured or unmodeled. If ignored, these biological variables may act in the same way as batch effects, obscuring signal, reducing power and biasing biological conclusions (Leek and Storey, 2007).
As a rule of thumb, when there are a large number of known or unknown potential confounders, surrogate variable adjustment may be more appropriate. Alternatively, when one or more biological groups are known to be heterogeneous and there are known batch variables, direct adjustment may be more appropriate.
We would like to thank Rafa Irizarry and the Feinberg Lab for helpful comments and feedback on the sva package.
Funding: National Institutes of Health (grants RR021967 and R01 HG002913).
Conflict of Interest: none declared.