Reference-free cell mixture adjustments in analysis of DNA methylation data

Motivation: Recently there has been increasing interest in the effects of cell mixture on the measurement of DNA methylation, specifically the extent to which small perturbations in cell mixture proportions can register as changes in DNA methylation. A recently published set of statistical methods exploits this association to infer changes in cell mixture proportions, and these methods are presently being applied to adjust for cell mixture effect in the context of epigenome-wide association studies. However, these adjustments require the existence of reference datasets, which may be laborious or expensive to collect. For some tissues such as placenta, saliva, adipose or tumor tissue, the relevant underlying cell types may not be known. Results: We propose a method for conducting epigenome-wide association studies analysis when a reference dataset is unavailable, including a bootstrap method for estimating standard errors. We demonstrate via simulation study and several real data analyses that our proposed method can perform as well as or better than methods that make explicit use of reference datasets. In particular, it may adjust for detailed cell type differences that may be unavailable even in existing reference datasets. Availability and implementation: Software is available in the R package RefFreeEWAS. Data for three of four examples were obtained from Gene Expression Omnibus (GEO), accession numbers GSE37008, GSE42861 and GSE30601, while reference data were obtained from GEO accession number GSE39981. Contact: andres.houseman@oregonstate.edu Supplementary information: Supplementary data are available at Bioinformatics online.


INTRODUCTION
Recently there has been increasing interest in the effects of cell mixture on the measurement of DNA methylation, specifically the extent to which small perturbations in cell mixture proportions can register as changes in DNA methylation (Adalsteinsson et al., 2012;Bock, 2012;Houseman et al., 2012;Lam et al., 2012;Liu et al., 2013;Reinius et al., 2012). DNA methylation, tightly associated with alterations in the nucleosome DNA scaffold (and hence chromatin), is in part responsible for coordination of gene expression in individual cells (Ji et al., 2011;Khavari et al., 2011;Natoli, 2011). It is now appreciated that differentially methylated DNA regions (DMRs) distinguish cell lineages with high sensitivity and specificity (Baron et al., 2006), and considerable research is now underway to delineate precise DMRs that define and specify a particular cell lineage. A recently published set of statistical methods exploits this association to infer changes in cell mixture proportions solely on the basis of a DNA methylation profile (Houseman et al., 2012). These methods may broadly be conceived as the projection of DNA methylation data from a target dataset S 1 onto a reference dataset S 0 consisting of DNA methylation profiles for isolated cell types. For example, S 1 may consist of DNA methylation measured in whole blood from a case-control study (Langevin et al., 2012;Liu et al., 2013;Koestler et al., 2012;Wang et al., 2013), while S 0 may consist of corresponding DNA methylation profiles from isolated leukocyte cell types (e.g. CD4þ leukocytes, B cell lymphocytes, granulocytes) as provided by Houseman et al. (2012) and Reinius et al. (2012). Recently, some investigators have used these methods to adjust for cell mixture effect in the context of epigenomewide association studies (EWAS, Liu et al., 2013). However, these adjustments require the existence of reference datasets S 0 , and these sets may be laborious or expensive to collect. In addition, for some tissues such as placenta, saliva, adipose or tumor tissue, the relevant underlying cell types may not be known. In this article, we propose a method for conducting EWAS analysis when a reference dataset is unavailable, demonstrating that it can perform as well as the methods that make explicit use of a reference dataset.

STATISTICAL METHODS
Our proposed method, closely related to surrogate variable analysis (SVA, Leek and Storey, 2007), relies on a simple projection based on singular value decomposition (SVD), as does SVA. In SVA, the residuals of a linear model are decomposed into a factor-analytic structure and the factors are used subsequently in a regression model, with iteration resulting in a final set of surrogate variables. Our approach, which does not require iteration to obtain estimates, includes unadjusted linear coefficient estimates as columns of the matrix to be decomposed. As we demonstrate later in the text, this construction associates the residuals of the unadjusted model with the unadjusted coefficient estimates in a manner consistent with a linear mixing assumption.
We assume DNA methylation array results Y, an m Â n matrix representing DNA methylation measurements for m CpG sites and n subjects. We assume that the measurements are on a 'beta value' scale, having an interpretation as estimates of the proportion of methylated molecules corresponding to a given locus. In addition, we assume an n Â p matrix X of covariates, including an intercept (in the first column), the phenotype of interest and potential confounders. The standard unadjusted EWAS analysis (on beta values) posits the linear model where B Ã is an m Â p matrix of coefficients and E Ã is an m Â n matrix of errors. Generally, analysis proceeds as if the rows of E Ã were independent, although permutation-based inference is sometimes used to account for potential correlation among rows, and B Ã can sometimes include 'surrogate variable' confounders (Leek and Storey, 2007;Teschendorff et al., 2011) that account for technical error. However, DNA methylation effects may be mediated through covariate effects on cell mixtures, i.e. : ¼ X! þ . and Y ¼ BX T þ M: T þ E, where : is an n Â k matrix of subject-specific cell proportions for k cell types (with rows summing to values 1), ! is a p Â k coefficient matrix representing cell-proportion effects, . is an n Â k matrix of errors, B is an m Â p matrix of direct epigenetic effects (not mediated by effects on cell type), M is an m Â k matrix of cellspecific mean methylation values (falling between 0 and 1) and E is an m Â n matrix of errors. The goal of an adjusted EWAS analysis is to estimate the direct effects B. Note that M can be obtained from reference data, but it is unknown if a reference dataset does not exist. We explicitly posit a model on the betavalue scale because the effects are expected to be linear and additive only on this scale, not on the M-value scale often used (Liu et al., 2013). Substitution results in the following linear model: where it becomes evident that Although a naive analysis would treat the error matrix E Ã as having independent rows, an alternative model is to assume a factor-analytic structure on E Ã or E, specifically where L is an m Â q matrix of CpG-specific factor loadings, U is an n Â q matrix of latent effects and ( is an m Â n matrix of 'uniqueness' errors. This formulation is implicit in methods such as SVA (Leek and Storey, 2007) and independent surrogate variable analysis (ISVA, Teschendorff et al., 2011), which are techniques proposed for addressing batch effects and confounders as well as cell mixture effects, although neither SVA nor ISVA explicitly posits this structure. Substituting (2) in (3) results in the following model: which expresses the explicit dependence of the latent structure of E Ã on the unknown cell-specific methylation matrix M. M appears twice in (4), once in the error structure, and once as a random effect on X. The SVA method extracts latent subjectspecific effects using an SVD, which might also be used in this formulation as well, as long as the double appearance of M can be addressed. Careful inspection of (4) reveals that for some m Â ðk þ qÞ matrix , ¼ ðM, LÞ and ðk þ qÞ Â ðp þ nÞ matrixŨ T ¼ ½ð!, 0 pÂq Þ T , ð., UÞ T , withŨ T having at least q orthogonal columns. Although the k rows of the biologically determined matrix ½! T , . T need not be orthogonal, the products M. and M! T in (2) are not fully identified, in the sense that M. ¼ MAA À1 . and M! T ¼ MAA À1 ! T for any invertible k Â k matrix A, including any that orthogonalizes the rows of ½! T , . T ; thus, it is possible to find a fully orthogonal U T that still results in quantities that satisfy (2) element of " 0 is less than every diagonal element of"; , is thus obtained asL". As M is a submatrix of ,, we propose the following estimator for B, obtained as the residual of the projection of B Ã onto the column space of ,: For any non-singular d Â d matrix D, ð,DÞ½ð,DÞ T ð,DÞ À1 ð,DÞ T ¼ ,ð, T ,Þ À1 , T so that (5) is independent of the scales chosen for each column of ,. Although it is impossible to distinguish M from L within ,, L has an interpretation that is identical with the surrogate variables extracted by SVA, and that SVA would also be unable to distinguish M from L. Although this estimator is motivated by an explicit statistical model, its construction is somewhat ad hoc. Inference, therefore, demands an appropriate bootstrap estimate of coefficient standard errors. To adequately account for correlation in the error structure, we propose the following approach. First, B Ã ¼ B þ M! T represents all systematic variation (cell-mediated and non-cell-mediated) and E Ã ¼ M. T þ E represents all unexplained variation (in both cell composition and non-cellmediated variation), so that sampling with replacement from the columns of E Ã should form the basis of the bootstrap. However, the variance of the elements of E Ã will depend on the corresponding elements of M Ã ¼ B Ã X T , as the biologically determined values are approximately beta-distributed.
[M Ã ¼ EðYjXÞ is a matrix of mean values Ã ji , not the cell-specific methylation matrix M.] As the variance of a beta-distributed variable with mean scales by ð1 À Þ, we obtain a matrix Q Ã of mean-standardized errors by dividing each elementê Ã ji of E Ã by ½ Ã ji ð1 À Ã ji Þ 1=2 , where Ã ji is the corresponding element of M Ã . Thus, we construct bootstrap sample Y ðrÞ as Y ðrÞ ¼M Ã þ E ÃðrÞ , where the elements of E ÃðrÞ are obtained by first sampling with replacement from the columns ofQ Ã , then multiplying the result, element-wise, by the elements ½ Ã ji ð1 À Ã ji Þ 1=2 (of the un-resampled matrix). This standardization approach preserves the relationship between the mean and variance of a beta distribution, while sampling column-wise preserves correlations across CpGs. As shown in the Supplementary Material, the latter property allows the bootstrap samples to form the basis of an omnibus test of significance over the entire array. Bootstrap estimates B ÃðrÞ and B ðrÞ are thus obtained by fitting (1) on Y ðrÞ , and subsequently recomputing (5). Standard errors for the elements ofB are obtained by calculating the corresponding standard deviations over bootstrap samples B ðrÞ , r 2 1, :::, R (with, e.g. R ¼ 250 or R ¼ 500). Bootstrap-based standard errors forB Ã can be obtained simply by computing standard deviations from the bootstrap samples obtained by fitting (1) to the bootstrap samples Y ðrÞ .
The proposed methods can easily be adapted to accommodate missing values in Y as follows: (i) rows of B Ã corresponding to rows of Y having missing values can be estimated on a row-by-row basis via complete-case analysis; (ii) the SVD producingL andŨ can be obtained from completely observed rows of ðB Ã ,Ê Ã Þ; , is then constructed fromL for completely observed rows, or else for rows with missing values by projecting ðB Ã ,Ê Ã Þ ontoŨ using a complete-case regression analysis. The bootstrap procedure introduces missing values by sampling from replacement from the columns ofQ Ã , which will contain missing values for elements of Y that are missing.
A remaining issue is appropriate selection of the dimension d. Teschendorff et al. (2011) propose a method for estimating the dimension of a latent surrogate variable using random matrix theory (RMT); application of their algorithm to the matrixÊ Ã is one simple approach for estimating d. The simulation studies presented in the Supplementary Material suggest that this method performs well. As shown in the Supplementary Material, it outperforms a simple approach based on minimizing Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), and data analysis results suggest that it produces estimates reasonably similar to those produced by the method suggested by Buja and Eyuboglu (1992) and implemented in the R package sva. In the case whereÊ Ã possesses missing values, only the completely observed rows are used.
Finally, we once again emphasize that the proposed methodology necessarily requires that analysis proceed on a linear (-value) scale, because the logit-transform used to construct M-values destroys the linear mixing assumption. However, we acknowledge that a substantial portion of error may occur on the logit scale, owing to the fact that the individual probes used to interrogate the molecular methylation states are expected to have approximately lognormal distributions. We address this concern in the next section.

SIMULATION STUDY
We conducted a simulation study to confirm that the proposed methodology produces unbiased results. Simulation parameters were chosen to produce datasets as similar as possible to realistic DNA methylation datasets, although the dimensions were small enough to make the simulation study feasible. Fixing m ¼ 1000, n ¼ 250 and k ¼ 4, we constructed m Â k matrix M by selecting each element of the first 250 rows of M as a Betað0:25, 0:25Þ random variable, setting the remaining rows equal to zero; thus, the first 250 rows of M correspond to CpGs that are DMRs for k ¼ 4 cell types. M was held constant over all simulations. The m Â 2 matrix B was constructed as B ¼ ðb 1 , b 2 Þ, where 2 f0, 1g and B ¼ ðb 1 , b 2 Þ were generated (once for all simulated datasets) from a 3-part mixture model as described in the Supplementary Material, in such a way that non-zero direct effects tended to occur only for CpGs with mid-range values, and their signs tended to correlate inversely with the intercept. Details appear in the Supplementary Material (Section I), but for each simulated dataset there were 30 negative effects, 23 positive effects and 947 null effects. For each simulated dataset, the phenotype of interest, x i (i 2 f1, :::, ng), was generated as a UðÀ0:5, 0:5Þ uniform random variable. Each row of the cell proportion matrix : was generated as Dirichletðx i Þ, where the dispersion parameter ¼ 100 resulted in biologically plausible variation in cell proportions for a specimen such as blood, x i ¼ ð0:2, 0:1, 0:1, 0:6Þ T þ ð0:05, À0:05, 0, 0Þ T x i , and 2 f0, 1g was a simulation parameter controlling the strength of the phenotype effect on cell mixture. Generation of the error component of the model was implemented in a manner that allowed us to investigate the effects of error on different scales, logit and linear. First, we presume that biological variation occurs on the linear scale, with error arising from a beta distribution. Thus, with ð 1i , ::: q j ¼ minfn À1 P n i¼1 ji ð1 À ji Þ= 2 À 1, 1g and ¼ 0:008; in other words, we chose row-specific dispersion parameters such that the average variance of b ji for row j over i 2 f1, :::, ng was about 2 . Simulations demonstrate relative insensitivity to , as shown in Supplementary Material (Section IV; To obtain the measured methylation values, we added a microarray error of the factor-analytic form described by (3) but incorporated on a logit scale. Beta values y ji are typically constructed as the ratio y jiM =ðy jiM þ y jiU þ "Þ, where y jiM is the measurement obtained from a probe designed to interrogate a methylated molecule, y jiU is the measurement obtained from the corresponding unmethylated probe and " % 0 is a small constant chosen in advance. A common assumption in microarray analysis is that measurements such as y jiM and y jiU are lognormally distributed; consequently, logitðy ji Þ % logðy jiM Þ À logðy jiU Þ is normally distributed, and we expect the technical error introduced by the microarray to occur on the logit scale. For a latent error component on the logit (M-value) scale of measurement, we set q ¼ 2 and generated the elements of the n Â q matrix U ¼ ðu 1 , :::, u n Þ in (3) as standard normal variables, while the corresponding L matrix was generated as Z", with the elements of the m Â q matrix Z generated as standard normal variables and diagð"Þ ¼ ð0:25, 0:01Þ T . For the matrix ( in (3), the elements of each row j were simulated as Nð0, 2 j Þ, with 2 j ¼ 0:25 2 À diagð" 2 Þ (i.e. the standard deviation of each value was 0.25, but the errors were correlated across CpGs). Thus, with microarray errors ( ¼ ð ji Þ ji , the simulated value y ji for row j and column i was generated as y ji ¼ ½1 þ expflogðb ji ÞÀ logð1 À b ji Þ þ ji g À1 . Table 1 describes the combinations of simulation parameters 2 f0, 1g and 2 f0, 1g used for each of four scenarios used to investigate basic properties of proposed estimator. Supplementary Figure S2(b) provides a clustering heatmap displaying a typical dataset simulated under scenario #1. In each simulation (except those conducted to compare methods of dimension estimation), the RMT method of Teschendorff et al. (2011) was used to estimate the latent dimension.
The Supplementary Material describes additional simulations used to investigate several specific issues, as well as providing additional graphical results for the four main simulation scenarios reported here. Supplementary Material (Section III) reports the effects of larger sample sizes (n ¼ 500). Section IV describes a simulation experiment designed to investigate the effect of different scales of error variability on the accuracy of different methods of dimension estimation (both in estimating the dimension itself and in the impact on Root-Mean-Squared-Error (RMSE)). Section IV also provides a comparison of our proposed methodology with SVA. Section V describes a simulation experiment designed to investigate power and Type I error. In every scenario considered, we simulated 100 separate datasets, and for each simulated dataset, we used 250 bootstrap samples for inference.
For simulation #1, Figure 1 compares slope estimatesb 2 versus b 2 , the SVA-adjusted variant ofb 2 versus b 2 andb Ã 2 versus b 2 , on each of the m ¼ 1000 features. This figure demonstrates that although we expect the naive unadjusted estimator to provide an unbiased estimator of the total effect b Ã 2 , its estimates of direct effects are somewhat biased in comparison to our proposed estimatorb 2 , especially for the null slopes. It also demonstrates that SVA produces biases similar to the unadjusted analysis. This figure is consistent with Table 2, which reports the total RMSE [e.g. the square root of the simulation average of m À1 P m j¼1 ðb j À b j Þ 2 ] for each of the four comparisons across all four scenarios. Table 2 suggests similar behavior for null direct effects when the cell mixture effect is non-null (simulation # 3). When the mixture effect is null,b 2 andb Ã 2 estimate b 2 with about the same precision, as one would anticipate. Under simulation scenario #1, for each of 1000 features, Figure 2 plots simulation SD versus median bootstrap estimate (across 100 simulations) of the direct effect estimator. The bootstrap procedure appears tolerably unbiased, although our bootstrap standard error estimator yields apparently inflated estimates for some DMRs and a handful of non-DMR CpGs having non-null effect. The two very biased estimates result from intercepts lying near the zero boundary for mean methylation , resulting in non-linear effects (because of truncation) for some subjects having strongly negative values of x; in the Supplementary Material we provide evidence that the bias decreases in larger samples (Section III). Table 2 reports the median (over CpGs) of the ratio of median bootstrap standard error (over simulations) to simulation SD for all four scenarios, for bothb 2 andb Ã 2 , standard errors for the latter being computed using the standard linear model theory approach. Although the proposed bootstrap standard error methodology is imperfect, it   appears to be as good as or better than the standard asymptotic methods used to compute standard errors for unadjusted effect estimatesB Ã . Supplementary Material (Section III) provides plots similar to those provided in Figures 1 and 2 for simulation scenarios #2, #3 and #4; they are consistent with the results and interpretations given here.
The results of the simulation experiment described in Section IV suggest that the RMT method of dimension estimation has good accuracy in estimating the correct latent variable dimension when the magnitude of the latent effects is sufficiently large relative to other error components, superior to simpler methods based on AIC and BIC. RMT was able, in general, to estimate the dimension correctly even though the errors were incorporated on different scales (linear beta and logit M-value). In addition, the results suggest that when the microarray variation is relatively large, the reference-free and SVA methods have about the same level of error, presumably because the microarray error swamps the biological error. However, when the microarray error is smaller [but still consistent with realistic datasets, as shown in Supplementary Fig. S2(a)], our proposed referencefree method outperforms SVA, particular in estimating coefficients for DMRs; this latter phenomenon presumably occurs because the cell-mixture property is explicitly used in supervising the deconvolution. Section V describes a simulation experiment to investigate power and Type I error, using a bootstrap-based method for testing omnibus significance. It suggests appropriate Type I error control when b 2 ¼ 0, and reasonable power when the effects of b 2 are large enough.

DATA EXAMPLES
We demonstrate our proposed methodology on four datasets. The first consists of Illumina Infinium HumanMethylation450 BeadChip array data on which bisulfite-converted DNA from whole blood was hybridized; the data were obtained from Gene Expression Omnibus (GEO), Accession number GSE42861, and consisted of n ¼ 689 subjects: 354 rheumatoid arthritis patients (cases) and 335 normal controls, originally published by Liu et al. (2013). Using the data available on GEO, we obtained four different estimates for the difference in DNA methylation measured on the beta scale between case and control at 384 410 autosomal CpGs whose Infinium probes contained no single nucleotide polymorphism (SNP) and had no SNP at a flanking G site ('non-SNP CpG sites'). In the unadjusted analysis, we simply applied the limma procedure (Smyth, 2004), with design matrix consisting of an intercept and an indicator variable for case status. We then applied the method of Houseman et al. (2012) to data from 387 CpG sites overlapping between non-SNP CpG sites on the HumanMethylation450 and the 500 leukocyte differentially methylated regions made available publicly to infer leukocyte proportions (the full Illumina Infinium HumanMethylation27 dataset is available on GEO, Accession number GSE39981); in the reference-based analysis, we used limma to estimate casecontrol differences adjusted for leukocyte type by using five of the six available types: B-cell, CD4þ T, CD8þT, granulocyte and NK (monocyte proportions were dropped to avoid an illconditioned design matrix). In the SVA-adjusted approach, we used the R package SVA (version 3.6.0) both to compute the dimension of the surrogate variables and to determine the surrogate variables themselves. Using the method of Buja and Eyuboglu (1992) implemented in sva, we found d ¼ 53 surrogate variables; after estimating the 53 surrogate variables, we adjusted for them using limma in a manner similar to the previous analysis. Finally, we applied our proposed reference-free analysis, with X equal to the same design matrix used in the unadjusted analysis. In the latter analysis, the latent variable dimension was estimated to be 37 by the RMT method of Teschendorff et al. (2011); because simulations suggest accurate dimension estimation by the RMT method, we used d ¼ 37. 500 bootstrap samples were used for inference. The DNA methylation dataset available on GEO contains no missing values. Figure 3 shows volcano plots of the arthritis case coefficient for the three different analyses, demonstrating diminished significant for both adjusted analyses shown in the figure (reference-based and referencefree). Interestingly, significance for the reference-free analysis is diminished relative to the reference-based analysis, suggesting that the six leukocyte types profiled by Houseman et al. (2012) and available in GEO Accession GSE39981 may be insufficient for analysis of blood data. This interpretation is further reinforced by Figure 4, which shows reference-free coefficient estimates by their corresponding reference-based estimates; there is general agreement between methods for coefficients with larger magnitude, except for a single CpG whose magnitude appears larger by the reference-based method than by the reference-free method; in general, for the relatively null CpGs, the referencebased method produces estimates of larger magnitude than the reference-free method. Supplementary Material (Section VI) provides volcano and scatter plots similar to those shown in Figures 3 and 4, but showing the SVA results. In addition, Supplementary Figure S9 shows a comparison of RMSE between our reference-free approach and SVA, where the 500-DMR reference-based analysis was used as a presumed gold standard. In general, the SVA results were dissimilar from both of the other adjusted analyses, with greater significance relative to both, and more similarity with the unadjusted analysis. This suggests inadequate adjustment by SVA. SVA results To increase legibility of the plot, SE estimates for two CpGs producing extreme bias have been moved to the left, as indicated with d ¼ 37 were similar to those obtained using d ¼ 53 (data not shown in detail, but a summary appears in Supplementary Fig.  S9). For each of the four analysis types, Table 3 summarizes overall significance as measured by q-value methodology (Storey et al., 2004) implemented in the R package qvalue, including an estimate of 0 , the proportion of nulls, as well as the number of CpG sites for which q50:1. Interestingly, the unadjusted and reference-based analyses produced about the same estimated proportion of null CpGs as well as a relatively large number of CpGs for which q50:1. Despite apparently increased significance shown in the volcano plot appearing in the Supplementary Material, the SVA-adjusted approach produced a higher value of 0 and fewer CpGs (though still a substantial number) for which q50:1. The reference-free analysis produced a much higher estimated value of 0 and no CpGs for which q50:1. Although there were no significant q-values after reference-free adjustment, an omnibus test of significance proposed in Section V of the Supplementary Material results in an overall P50:002 even after reference-free adjustment. As mentioned in Leek and Storey (2007), q-values can be misleading when applied to multiple correlated tests.
A possible explanation for decreased significance of the reference-free method relative to the other methods is that the reference-free approach led to less precise estimates. However, the median ratio of reference-free to reference-based standard error was 0.79, so that the reference-free approach led to estimates that were apparently more precise, overall, than the reference-based adjustment. Supplementary Material (Section VI) provides plots comparing estimates and standard errors for the three different methods. On the basis of these statistics and results, it appears that the reference-free method more precisely estimates effects that are smaller than those produced by the other two methods; thus, inflation of standard errors is an inadequate explanation. Another explanation for decreased significance of the referencefree method is that the reference-free approach potentially accounts for many more cell types than either the reference-based or reference-free approaches. The reference set provided by Houseman et al. (2012) differentiates granulocytes from other cell types, but does not differentiate neutrophils, basophils and eosinophils within the granulocyte category; it distinguishes CD4þ and CD8þ T-cells from other types of cells that are not T cell lymphocytes, but does not differentiate T helper cells, regulatory T cells or memory T cells. These types may be important differentiators of rheumatoid arthritis.
Our second analysis consists of array data for 92 independent peripheral blood mononuclear cell (PBMC) samples, assayed using the Illumina Infinium HumanMethylation27 (27K) technology, originally published in Lam et al. (2012) and available in GEO, Accession number GSE37008. For the purposes of this analysis, PBMC samples can be thought of as whole blood with granulocytes removed. In addition to DNA methylation data, complete blood count differential data were available for each sample, thus providing gold standard estimates for the fraction  of the PBMC sample consisting of monocytes, assumed to be one minus the fraction of lymphocytes. Using several different approaches applied to the subset of autosomal CpGs, we examined the association between DNA methylation and the logarithm of il6 response to phorbol-12-myristate-13-acetatein ('log pma'), a potent cell division promoter. The first analysis was unadjusted; the second analysis was adjusted for monocyte fraction; the third and fourth analyses were adjusted for blood cell fractions estimated using the approach of Houseman et al. (2012), similar to the approach described above, with the top 100 or 500 DMRs published in Houseman et al. (2012). Koestler et al. (2013) provide a detailed study of the application of reference-based estimates of cell proportion using this dataset.
In the fifth approach, we adjusted for d ¼ 11 surrogate variables using SVA, with d obtained by the method of Buja and Eyuboglu (1992). In the sixth approach, we applied the reference-free approach proposed in this article with d ¼ 10, estimated via RMT. Both the third and fourth approaches produced similar results, and results reasonably similar to the second approach. The reference-free approach produced slightly increased significance over approaches 2-4. The SVA results were similar to the reference-free approach, although the resultant RMSE values were larger for the SVA approach than for the reference-free approach, when monocyte-adjustment or reference-based adjustment was used as a gold standard [ Supplementary Fig. S10(o)]. The unadjusted analysis produced results that were substantially more significant than those produced by the other four approaches. Details of this analysis appear in Section VII of the Supplementary Material. In summary, our proposed reference-free approach produces results similar to (though slightly less variable than) those produced by SVA, to reference-based adjustments as well as adjustment by a known (though coarsely differentiated) gold standard, but quite distinct from the unadjusted approach.
Our third analysis consists of 27K array data for 176 placenta samples originally published in Banister et al. (2011). All n ¼ 176 infants considered in this analysis were of gestational age greater than 37 weeks. Data were adjusted for BeadChip effect using ComBat (Johnson et al., 2007). Maternal-age adjusted differences in placental DNA methylation between 52 small-for-gestational age (SGA) infants and 124 normal infants at 21 551 autosomal non-SNP CpG loci were estimated using three methods. In the unadjusted analysis, limma was used with design matrix consisting of an intercept, an indicator variable for SGA, and maternal age. In the SVA-adjusted analysis, we additionally adjusted for d ¼ 13 surrogate variables, with d obtained by the method of Buja and Eyuboglu (1992). In the reference-free analysis, we used our proposed method with d ¼ 12 (estimated via RMT). The volcano plots shown in Figure 5 suggest that the SVA and reference-free analyses produce modestly diminished significance compared with the unadjusted analysis. Figure 6 shows adjusted effect estimates by their corresponding unadjusted estimates and by SVA-adjusted estimates; the figure suggests that a substantial fraction of CpGs demonstrates slightly larger effect magnitude when compared with their adjusted counterparts. As shown in Figures 5 and 6, as well as Supplementary Material (Section VIII), SVA and the referencefree methods produce almost identical results. In addition, Table 3 suggests diminished significance of results in the unadjusted analysis. Overall, the analysis suggests that much of the effect of SGA on DNA methylation may be explained by cell mixture, but that the mixture effect is substantially smaller than that occurring for DNA methylation measured in blood.
Our final analysis consisted of 27K data for 203 gastric tumors and 94 gastric non-malignant samples, originally published by Zouridis et al. (2012) and available in GEO, Accession number GSE30601. In each of the three analyses of DNA methylation at autosomal CpGs, we compared tumors to non-malignant samples. The first analysis was unadjusted; the second analysis was adjusted for 27 surrogate variables (with d ¼ 27 determined by the method of Buja and Eyuboglu, 1992); in the third, we applied the reference-free approach proposed in this article, with d ¼ 24, obtained via RMT. In general, the unadjusted, SVA and reference-free approaches produced different results. The SVA and reference-free approaches differed systematically (Wilcoxon P50:0001) at CpGs mapped to polycomb target genes . Placenta dataset: volcano plots. Volcano plots forB 2 unadjusted for leukocyte composition and adjusted using the proposed reference-free method with d ¼ 12 (Bracken et al., 2006;Lee et al., 2006;Schlesinger et al., 2007;Squazzo et al., 2006), with the reference-free approach demonstrating greater hypomethylation in tumors at PcG targets. Thus, compared with the other approaches, the reference-free method may demonstrate superiority in identifying biologically relevant alterations of DNA methylation in highly heterogeneous tumor samples. A summary of significance appears in Table 3; details of the analysis and graphical results appear in Supplementary Material (Section IX).

DISCUSSION AND CONCLUSIONS
We have proposed a reference-free method of conducting EWAS while adjusting for cell mixture. Following on work by Houseman et al. (2012), this method posits a statistical model that involves a latent variable representing mean methylation, together with a factor-analytic error model consistent with the surrogate variable approaches of Leek and Storey (2007) and Teschendorff et al. (2011). We have also proposed a companion bootstrap methodology for estimating standard errors.
Our simulation studies suggest that our proposed approach returns reasonably unbiased estimates of the direct effect, i.e. the effect not mediated by cell type, and reasonable standard error estimates. They also suggest adequate control of Type I error and reasonable power to detect direct effects of sufficient magnitude. Our simulations also suggest that the RMT method of dimension estimation, proposed by Teschendorff et al. (2011), performs well for estimating the dimension of the latent structure. Finally, our method performs about as well as surrogate variable analysis (SVA) when technical error is of large magnitude, and better than SVA when technical error is of smaller (but still realistic) magnitude. We have demonstrated our method on four separate datasets. The first was a rheumatoid arthritis dataset, where DNA methylation in blood shows substantial attenuation in effect after applying our reference-free approach, even in comparison to an approach similar to that used by Liu et al. (2013), wherein DNA methylation was adjusted for cell types profiled by Houseman et al. (2012) using methodology proposed in that paper. A possible explanation is that DNA methylation effects may be mediated by cell types not profiled by Houseman et al. (2012). The second set consisted of PBMC data and associated effects of il6 response to a potent tumor promoter, for which we demonstrate that our proposed reference-free approach produces results similar to reference-based approaches but dissimilar from an unadjusted analysis. The third set was a placental dataset for which no reference data were available for component cell types. In this dataset, effects of SGA showed modest attenuation after applying our referencefree method. This suggests that the effects of growth restriction on DNA methylation are likely less mediated by changes in cell distribution, a result that one might anticipate in a solid tissue sample such as placental tissue. The fourth analysis compared gastric tumors with non-malignant gastric tissue, showing differences in results between mixture-adjusted and unadjusted analyses, but with significant direct effects produced even in the adjusted analyses. In this last example, there is evidence that the results of the reference-free method may be more biologically meaningful than those produced by SVA.
An interesting aspect of our methodology is its tendency to produce 'spikes' in volcano plots, such as those shown in Figure 5 or Supplementary Figure S12(a). These are caused by shrinkage of standard errors, particularly at DMRs, i.e. CpGs that drive the cell mixtures in ,. Because these CpGs collectively borrow statistical strength from each other, their standard errors tend to be smaller. This phenomenon is evident in Supplementary Figures S7(b) and S10(d), which compare reference-free and unadjusted standard errors for the blood and PBMC datasets, respectively; these plots demonstrate that for many known leukocyte DMRs, the standard errors in the reference-free method are noticeably smaller than the corresponding standard errors from the unadjusted analysis. The same phenomenon is evident in Supplementary Figure S7(h), which plots the standard errors from the reference-adjusted analysis against the corresponding standard errors from the unadjusted analysis, and demonstrates shrinkage of standard errors at leukocyte DMRs.
Unlike the SVA approaches of Leek and Storey (2007) and Teschendorff et al. (2011), we posit a specific data generation model that incorporates a factor-analytic structure that is implicit in the previously published methods. However, we incorporate all CpG features in the factor-analytic structure, rather than attempting to select a subset of features that are optimally informative. While there may be some loss of precision in using all CpG features present on a given array, we view this as an acceptable sacrifice to accommodate an agnostic approach that permits any CpG to serve as a DMR. In addition, the surrogate variables for which our model implicitly adjusts have an explicit mixing interpretation. Our simulations suggest equivalent or better results using our proposed method; the data analyses demonstrate that our method can produce results that are similar to SVA, as in the PBMC and placenta examples, or results that are distinct, as in the arthritis and gastric tumor examples. In the gastric tumor analysis, there is a suggestion that our referencefree approach may produce results slightly more consistent with known biology. The reference-free approach is able to do this without the somewhat time-consuming comparison of candidate surrogate variables with potential confounders; however, it achieves this result by essentially projecting unadjusted effect estimates on an error matrix, with a linearity assumption that, while biologically motivated, may sometimes fail. In addition, our method is designed to deconvolute cell mixtures; it is not designed to uncover confounders that are specific to technical sources of variation. Thus, SVA might be applied on an Mvalue scale to extract surrogate variables that are specific only for technical variation (i.e. by removing from the set of estimated surrogate variables those that are associated with biological confounders) and are subsequently included in our reference-free approach. However, it is somewhat unclear how to proceed when batches are confounded with phenotypes; more detailed research is needed to develop methods that adjust simultaneously for linear mixing effects and non-linear technical effects, using datasets having fully annotated technical data such as chip number and position. However, we view this article as a concrete step in that direction, and our data samples demonstrate the practicality of our method.
Our approach now offers the possibility of conducting EWAS analysis adjusting for mediation by cell type even without the existence of reference datasets that may be expensive or infeasible to collect. If widely adopted, it could pave the way for EWA studies that are more robust with higher potential for replication of results.