Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses

Abstract Removal of, or adjustment for, batch effects or center differences is generally required when such effects are present in data. In particular, when preparing microarray gene expression data from multiple cohorts, array platforms, or batches for later analyses, batch effects can have confounding effects, inducing spurious differences between study groups. Many methods and tools exist for removing batch effects from data. However, when study groups are not evenly distributed across batches, actual group differences may induce apparent batch differences, in which case batch adjustments may bias, usually deflate, group differences. Some tools therefore have the option of preserving the difference between study groups, e.g. using a two-way ANOVA model to simultaneously estimate both group and batch effects. Unfortunately, this approach may systematically induce incorrect group differences in downstream analyses when groups are distributed between the batches in an unbalanced manner. The scientific community seems to be largely unaware of how this approach may lead to false discoveries.

1. Distribution of error terms after two-way ANOVA batch adjustment

Statistical model
We assume a general linear model with intercept or more general covariates (α), study groups or study parameters of interest (β), and batch effects or other covariates to adjust for (γ): . . . , Y n ] are the observable random variables, parameters α, β, and γ are p, q, and r vectors with design matrices A, B, and C, and ϵ = [ϵ 1 , . . . , ϵ n ] ∼ N(0, σ 2 I n ) where I n is the n × n identity matrix. We also assume that the full design matrix, combining A, B, and C, has full rank p + q + r: i.e., that the covariates are linearly independent.
In general, we wish to assess the explanatory power of B when including C as batch effects and A as additional covariates. Often, p = 1 with α the intercept term, but this formulation allows general covariates to be included in the analyses. The covariates of B which are subject to testing are more commonly specified as contrasts on the full design matrix combining A, B, and C, but in our case this splitting is more convenient.
Least squares estimates for the parameters can be obtained through (1.2)

Elimination of covariates A
We can eliminate A from the model, simultaneously removing p degrees of freedom. One way of doing this is to find an n × n rotation matrix R, i.e. with RR t = I, so that AR = [0|A ′ ] with A ′ a p × p invertible matrix. Applying this rotation to all elements, we split the n dimensions into . We may then eliminate the degrees of freedom used to estimate α from the model yielding and Y 0 and ϵ 0 are now n − p vectors. If α is the intercept term, this corresponds to centering all variables and covariates, but the rotation also eliminates the corresponding degree of freedom. Parameter estimates are now given by Eliminating the covariates in this manner is strictly not required. We could have done all computations on the full model, but this elimination makes the computations a little simpler to deal with and present.

F statistic for model without batch effects
If batch effects are not included in the model, i.e. r = 0, we get the common linear model in which we get a decomposition (1.5) If we believe our data are "batch effect free", and therefore do not include batch in the model, this is the distribution of the F -statistic we would assume. Hence, this will be the assumed distribution when batch adjusted data are analysed using a linear model without batch effects included.

Distribution of F statistic after batch adjustment
We define batch adjusted dataỸ = Y −γC using the estimated batch effectγ from the full model using (1.2). After elimination of covariates A, the batch adjusted data becomesỸ q+r , and so the variance of the error term is the same in the batch adjusted caseỸ 0 =β 0 B 0 +ε 0 . However, the sum of squares |(β 0 − β 0 )B 0 | 2 explained by B is less easily expressed.
For convenience, we define the projection ρ W to the n − k-hyperplane perpendicular to the k-hyperplane spanned by the rows of W for any non-degenerate k × n matrix with k n, , (1.6) through which solutions will be more easily expressed.
Let λ 1 , . . . , λ q 1 be the non-zero eigenvalues of L: this inequality holds since L is a non-degenerate symmetric rank q matrix, and has the same eigenvalues as has eigenvalues in (0, 1) (not 0 since B 0 and C 0 are linearly independent). Then, by decomposing the variation along the eigenvectors, Using Satterthwaite's approximation, which matches the first two momenta, we get (1.11) From λ i 1 follows thatσ 2 σ 2 ,qσ 2 qσ 2 , andq q. Equality happens when all λ i = 1, which occurs when B 0 and C 0 are orthogonal, i.e. B 0 C t 0 = 0. The accuracy of Satterthwaite's approximation depends on the variability of the eigenvalues λ i . If these are all identical, the approximation is exact. However, if the eigenvalues vary greatly, the extreme upper tail of the distribution will be more heavily influenced by the larger eigenvalues, and thus have a longer upper tail, than the approximation takes into account.

Reformulation in terms of the original design matrices
We would like to expressσ 2 andq terms of the original design matrices, i.e. before elimination of covariates. The rotated design matrices were defined as B 0 = BR 0 , etc., using the rotation R = [R 0 |R ′ ] so that AR 0 = 0 while AR ′ is non-singular. Hence, it follows that R 0 R t 0 = ρ A . This allows us to express (1.12) We can then computeq andσ 2 from have the same non-zero eigenvalues. Notice that if B is orthogonal to A, which can be achieved by replacing B with B − V A for some matrix V , then Bρ A = B. If C is also made orthogonal to A, then ρ A,C can be replaced by ρ C .

Comparison to running full ANOVA or linear model
If the full model, including batch effects, is analysed using a traditional ANOVA or linear model approach, the result will be an F -statistic with distribution F ∼ F q,n−p−q−r : the q degrees of freedom for measuring the effect contribute evenly to the statistic.
If batch-adjusted data are analysed without including covariates, the q degrees of freedom end up scaled up by factors λ 1 , . . . , λ q . The corrected F -distribution can account for this. However, if the λ i eigenvalues vary, some degrees of freedom will contribute more to the F -statistic than others, which makes for inefficient use of the data, and results in a reduction in the effective degrees of freedom toq: the F -statistic becomes more variable than that from the common linear modelling approach which exploited the full q degrees of freedom.
Thus, although an appropriate distribution of the F -statistic can be found for the two-step approach, there will still be some loss of power.

Implementation in R
In order to demonstrate the computations, we provide the R script theory/F-distribution.r at https://github.com/ous-uio-bioinfo-core/batch-adjust-warning-reports.git.

Methods similar to ComBat
Though our work was originally focused on ComBat, we realise that other methods are also available which remove batch effects in much the same way and thus have the same problems. Below is a list of the methods encountered along with a brief description.

2.1.1
Partek The commercial software Partek Genomics Suite is commonly used software for analysing genomic data. Included in it is a Batch Remover tool which, based on the user guide (obtained from Partek support), seems to estimate the batch and group effects using an ANOVA model and then remove the estimated batch effects. It is clear from the user guide that the main purpose of this tool is for visualization, and if further statistical tests are performed on the data set, the batch factor must still be included.
2.1.2 removeBatchEffect The method removeBatchEffect found in the popular limma (Smyth and Speed, 2003) package adjusts for batch effects using an ANOVA model with batch and groups included. The methods help description states that its use is for visualization and not for further linear modelling.

ber
The recent R package ber (Giordan, 2013) uses, in addition to ANOVA, bagging to better estimate the batch effects. Inclusion of group labels for which the effects should be preserved seems to be its recommended use. So far 4 citations are listed in google scholar

Applications and pipelines that use ComBat
The inclusion of ComBat in numerous pipelines makes its use easier, but may also make it more difficult to identify the potential problems we discuss. We found several pipelines or libraries with ComBat included, some are soft wrappers, while others have a more worrisome implementation. The short descriptions below are made from a shallow review of the article, tutorial or code. We have not tried them out.
2.2.1 ChAMP ChAMP (Morris and others, 2014, version 1.2.8) is a pipeline for analysing the llumina Infinium HumanMethylation450 BeadChip. Batch effect adjustment is optional with a modified version of ComBat specifically implemented to use the BeadChip, which consist of 12 samples, as batch. From the tutorial example, no parameters are passed to their ComBat implementation. However, inspection of the code shows that sample information entered earlier will automatically be used as covariates. This recent tool has already 15 citations in google scholar.

intCor
The intCor package (Fernández-Albert and others, 2014, version 1.03) may be used to analyse data from liquid chromatography coupled to mass spectrometry experiments. ComBat correction is optional and it seems that group assignments entered earlier in the pipeline is automatically used as covariates. This new pipeline has not many users so far.

2.2.3
AltAnalyze AltAnalyze (Emig and others, 2010, version 2.0.8) is a tool for analysing alternative splicing. ComBat as an option was included in 2013. The user guide states that prior entered group assignments will be used as covariates. ComBat is here implemented in Python rather than R. The tool has 59 citations in Google scholar, most prior to the inclusion of ComBat.

inSilicoMerging
The R package inSilicoMerging (Taminau and others, 2012, version 1.8.7) combines public available microarray gene expression data. The implementation of ComBat is presently without covariates as an option. It has 12 citations according to google scholar. (Reich and others, 2006, version 3.9.0) is a popular platform for analysing gene expression data. It incorporates a lot of tools including ComBat. The implementation is a soft wrapper around ComBat, and the inclusion of covariates is handled similar.

GenePattern GenePattern
2.2.6 SCAN.UPC SCAN.UPC (Piccolo and others, 2013, version 2.6.3) is a microarray normalization method d to facilitate personalized-medicine workflows. It includes a soft wrapper around ComBat, which seems to be made out of convince and operates as the original. Currently it has 7 citations in google scholar.
2.2.7 TCGA The Cancer Genome Atlas generates and shares lots of data, some with batch effects. Several batch adjustment tools are implemented, including ComBat. However, it seems that the inclusion of covariates is not an option.