Detection of epigenetic field defects using a weighted epigenetic distance-based method

Abstract Identifying epigenetic field defects, notably early DNA methylation alterations, is important for early cancer detection. Research has suggested these early methylation alterations are infrequent across samples and identifiable as outlier samples. Here we developed a weighted epigenetic distance-based method characterizing (dis)similarity in methylation measures at multiple CpGs in a gene or a genetic region between pairwise samples, with weights to up-weight signal CpGs and down-weight noise CpGs. Using distance-based approaches, weak signals that might be filtered out in a CpG site-level analysis could be accumulated and therefore boost the overall study power. In constructing epigenetic distances, we considered both differential methylation (DM) and differential variability (DV) signals. We demonstrated the superior performance of the proposed weighted epigenetic distance-based method over non-weighted versions and site-level EWAS (epigenome-wide association studies) methods in simulation studies. Application to breast cancer methylation data from Gene Expression Omnibus (GEO) comparing normal-adjacent tissue to tumor of breast cancer patients and normal tissue of independent age-matched cancer-free women identified novel epigenetic field defects that were missed by EWAS methods, when majority were previously reported to be associated with breast cancer and were confirmed the progression to breast cancer. We further replicated some of the identified epigenetic field defects.


INTRODUCTION
Identifying molecular alterations that happen early in carcinogenesis, known as field defects, is important for early cancer detection. One common approach is to compare normal tissue of healthy individuals to normal tissue adjacent to tumor (normal-adjacent tissue) of cancer patients as a surrogate of pre-cancer tissue that are difficult to collect. There have been studies in identifying epigenetic field defects (1)(2)(3)(4), notably early DNA methylation alterations. DNA methylation is an epigenetic modification that has been shown to be crucial in gene expression (5)(6)(7)(8) and cancers (9)(10)(11)(12). There are mainly two types of aberrant DNA methylation in cancers, local hyper-methylation in promoter-related CpGs that leads to the silencing of downstream tumor suppressor genes (13)(14)(15)(16)(17), and global hypomethylation that leads to chromosome instability (17)(18)(19)(20). Studies have successfully identified epigenetic field defects in breast cancer by comparing normal-adjacent tissue of breast cancer patients to normal tissue from healthy individuals. Teschendorff et al. identified epigenetic field defects in breast cancer based on differential variability (DV), i.e. variance signals in DNA methylation (3), using methylation site-level analyses. Our previous work (21) identified epigenetic field defects in breast cancer based on both differential methylation (DM), i.e. mean signals, and DV, using methylation region-level analyses. In both studies, epigenetic field defects were found to be mainly driven by increased variation in methylation due to several outlier normal-adjacent tissue samples.
Due to the fact that CpG site-level signals for epigenetic field defects may be very small, existing methods based on differences (DM or DV or both) on CpG site-level may not have good power. Standard epigenome-wide association studies (EWAS) that focus on mean signals (EWAS-DM) perform CpG site-level tests to identify differentially methylated CpGs between two experimental groups using standard tests such as a t-test, a regression-based test or its regularized versions (22)(23)(24), or a non-parametric Wilcoxon rank sum test (25). EWAS that focus on variance signals (EWAS-DV) perform CpG site-level tests to identify differential variation CpGs between two experimental groups using standard tests such as the F-test (26,27), the Bartlett's test or its regularized version (3,4), or an empirical Bayes extension of the Levene's test (28). The F-test and Bartlett's test are sensitive to departures from normality which is usually the case for methylation data, while the Levene's test is more robust to non-normality. On the other hand, distance-based methods that characterize (dis)similarity between pairwise samples across a gene, a genetic region, a pathway or an entire genome have been proven to be powerful in genetic and gene expression studies (29)(30)(31)(32)(33). While standard EWAS perform CpG site-level tests with stringent multiple comparisons adjustment, in a gene or a genetic region level, the common practice using non-distance-based methods is to select the minimum P-value out of all CpGs in that region. These methods will not be powerful when site-level effects are very small. Alternatively, the distancebased methods accumulate any CpG site-level signals from a gene or a genetic region via the (dis)similarity matrix thus boost the overall association power, making them the ideal methods for detection of epigenetic field defects.
Here, we developed a weighted epigenetic distance-based method to identify epigenetic field defects at gene or genetic-region levels using both DM and DV signals. CpG site-level weights were incorporated in the calculation of (dis)similarity matrix to further boost signals and reduce noises. Specifically, we used original DNA methylation measures to examine DM and centered quadratic methylation measures to examine DV and considered site-level weights based on strengths of site-level DM and DV signals. Simulation studies showed much improved performance of the proposed weighted epigenetic distance-based method over several comparing methods including non-weighted versions and methods that use either DM or DV signals as well as standard EWAS methods. We further demonstrated the performance of the proposed method through an application to the 450K DNA methylation data of normaladjacent tissue of breast invasive carcinoma (BRCA) patients and normal tissue from independent age-matched cancer-free women from Gene Expression Omnibus (GEO). The proposed method that accumulates weighted DM and DV signals identified genes with epigenetic field defects that were missed by standard EWAS methods and non-weighted distance-based methods. Many of these epigenetic field defects were previously reported to be associated with breast cancer. Further examination confirmed their enrichment in the progression to breast cancer and replicated some of these identified epigenetic field defects.

MATERIALS AND METHODS
Case-control designs using normal tissue from healthy individuals (Y = 0) and normal tissue adjacent to tumor from cancer patients (Y = 1) as a surrogate of pre-cancer tissue are widely used to identify epigenetic field defects in cancers. We therefore focused on case-control designs and illustrated and applied the proposed weighted epigenetic distance-based method on gene level. However, the proposed method can be easily adapted to other types of design and on genetic region or genome levels. There are three steps in the proposed distance-based method: (i) to define genelevel weighted epigenetic distance matrix; (ii) to calculate pseudo-Fstatistic and (iii) to assess statistical significance using permutations.
Step 1: Define gene-level weighted epigenetic distance matrix Define epigenetic distance matrix. For each gene, let X m be an 2N × n matrix with original DNA methylation measures for N cases and N controls of n CpG sites in a gene, where element x m i j harbors DNA methylation measure of the j th CpG site, j = 1, ..., n in the gene, for the i -th subject, i = 1, ..., N. This X m matrix will be used to examine differential methylation (DM) capturing methylation mean signals. Let X v be an 2N × n pseudo data matrix of variability score capturing methylation variance signals, which will be used to examine differential variability (DV). The element 2 harbors centered quadratic methylation measure of the same j th CpG site for the i th subject.
x m i j is the mean methylation measure of the j -th CpG site across N cases and N controls separately. The quadratic terms are centered to better capture variance signals. By using X mv = [X m , X v ], an 2N × 2n matrix, we will be able to capture both methylation mean and methylation variance signals of the n CpG sites. Before constructing the epigenetic distance between any pair of subjects, we performed normalization on each column of X mv such that each column has mean zero and unit standard deviation.
We define the 2N × 2N epigenetic distance matrix D DM−DV with element d DM−DV st that captures dissimilarities between any given pair of individuals s and t, s, t = 1, ..., 2N as Incorporate CpG site-level weights into epigenetic distance matrix. We construct CpG site-level weights aiming to upweight signal CpGs (mean or variance) and to down-weight noise CpGs in calculating distances between pairs of subjects. Therefore, we define weights for mean and variance signals at CpG site j as follows: where p m j and p v j are the P-values from the two-sided twosample t-test testing if the mean methylation measures are the same between cases and controls and from the one-sided Levene's test testing if the variance of the methylation measures in cases is greater than that in controls at CpG site j , that captures weighted dissimilarities between individuals s and t, s, t = 1, ..., 2N can be defined as (3) Step 2: Calculate pseudo-F statistic We apply distance-based regression originally developed in the field of ecology (31,32) where is an 2Ndimensional column vector with elements 1, and I is an 2N × 2N identity matrix. The pseudo-F statistic is used to evaluate the association between epigenetic similarity of a gene with n CpG sites and the case/control status.
Step 3: Assess statistical significance using permutations To access significance of all G genes tested, we use permutation procedures, where we randomly shuffle cases (Y = 1) and controls (Y = 0) and repeat Steps 1-2 on the permuted data. In order to have more granular P-values, we pool pseudo-F statistics of all G genes from all permutations, as well as those from the observed data, to compute the empirical P-value (34). We repeat the permutation procedure 999 times, and calculate the empirical P-value for gene g, g = 1, ..., G, as follows: In the real data application, we have G= 19 271 genes, which helps to have high resolution gene-level empirical Pvalues.

Comparing methods
We compare the performance of the proposed method D w−DM−DV that considers site-level weights for mean and variance signals to that of several comparing methods, including the weighted distance-based methods that consider mean signals only D w−DM or variance signals only D w−DV , and distance-based methods without weights that consider both mean and variance signals D DM−DV , mean signals only D DM , variance signals only D DV , and standard EWAS methods on each CpG site with multiple comparisons adjustment of number of CpGs in a gene based on mean signals EWAS DM or variance signals EWAS DV .

Simulation study
We conducted simulation studies to evaluate type I error rate and power of the proposed method D w−DM−DV and those of the comparing methods described above. Type I error rate is defined as the proportion of simulations with any significant genes when the data is generated under the null hypothesis of no genes are associated with case-control status. Power is defined as the proportion of simulations with observed pseudo-F statistics smaller than that of the permuted values from all genes across all permutations.

Simulation setup
We simulated methylation measures X of cases (Y = 1) and controls (Y = 0) at every CpG site in a gene from beta distributions: where shape parameters a 0 and b 0 for samples in the control group were chosen based on estimates from the 50 normal tissue samples from cancer-free women in the GEO BRCA data (GSE69914), and shape parameters a 1 and b 1 for samples in the case group were chosen based on estimates from the 42 normal-adjacent tissues in the GEO BRCA data. More specifically, the average of the methylation means and standard deviations (SDs) of all CpG sites with gene information for the 50 normal tissue samples is 0.47 and 0.05, respectively. Therefore, we set a 0 = 46.36 and b 0 = 52.28 for noise CpGs such that the corresponding mean and SD of the beta distribution are 0.47 and 0.05, respectively. We generated methylation measures for 40 cases and 40 controls to mimic the size of the GEO BRCA study. We set a 1 = a 0 and b 1 = b 0 for all CpG sites in case and control groups to evaluate type I error rates. For power scenarios, we considered scenarios when signal CpGs have different mean or variance signals through varying shape parameters a 1 and b 1 . We conducted 1000 simulations in each simulation setting.
Simulation settings with one gene. We first considered one gene with different number of CpGs with different signalto-noise ratios of the CpGs. That is, the ratio between number of signal CpGs and number of noise CpGs in this gene ranges from 1:0, 1:24, 1:49, 3:47, to 5:45. We considered scenarios when signal CpGs have different mean or variance signals by varying shape parameters a 1 and b 1 such that the mean differences in methylation measures between cases and controls are 0.02, 0.04 0.06, 0.08 and 0.1, and the ratios of SDs for cases and controls are 1.25, 1.50, 1.75, 2, 2.25 and 2.50, respectively. The values of a 1 , b 1 in those scenarios and the corresponding effect sizes are summarized in the Supplementary Table S1. We consider a gene to be significant at the 0.05 significance level.
Simulation settings with 10 genes. We then considered 10 genes with one gene having signals when there are 25 CpGs in each of the 10 genes. In the signal gene, we set one CpG to have mean or/and variance signals with different effect sizes. Here we test for the global null and consider a simulation study to be significant if any gene is significant after Bonferroni adjustment for testing 10 genes. The empirical P-value for each gene is calculated using formula 5, where G = 10.
Simulation settings with outliers. Since epigenetic field defects are often characterized by increased variation in DNA methylation due to a few outlier normal-adjacent tissue samples (3,21), we considered simulation scenarios with outlier samples. Here, we only considered one gene with 50 CpGs for illustration purposes. We considered two signalto-noise ratios in this gene to be either 5:45 or 10:40. We set 10%, 15% or 20% of cases to be outlier samples with DNA methylation alterations at some signal CpGs, while the rest cases have the same methylation measures as controls at those signal CpGs when different outlier samples could have DNA methylation alterations at different signal CpGs, For each signal CpG, we generated methylation measures X for cases from a mixture distribution X = (1 − Z)X 1 + ZX 2 , and methylation measures for controls from X 1 ∼ Beta(a 0 , b 0 ). Specifically, at each signal CpG, we randomly assigned 40 cases to be either outlier samples (Z = 1) or non-outlier samples (Z = 0) by Z ∼ Bernoulli( p), where p is the probability of any case being an outlier sample. We then generated methylation measures of outlier samples from X 2 ∼ Beta(a 2 , b 2 ) and non-outlier samples from Simulation settings with one gene considering correlations among CpGs. Since neighboring CpGs are known to be correlated, we considered simulation scenarios that assume an AR(1) correlation among CpGs in a gene with a correlation coefficient 0.5. The detailed information for simulation setup for this scenario is summarized in the Supplementary File section 2 Simulation settings with one gene considering correlations among CpGs.

Simulation results
Type I error rate. Type I error rates are well controlled at the 0.05 significance level in settings with one gene and 10 genes after Bonferroni adjustment for multiple comparisons (Table 1), respectively.
Power for simulation settings with one gene. Power results for simulation settings with one gene are summarized in Figure 1. When there are only mean signals at signal CpGs, D w−DV , D DV and EWAS DV that consider variance signals only do not have any power as expected. When there is only one CpG in the gene, the non-weighted distance-based methods are the same as the weighted versions, as well as the EWAS method as expected. When there is one signal CpG and increasing number of noise CpGs in the gene, power of D DM decreases drastically while power of the weighted version D w−DM are well maintained. This suggests that incorporating weights to CpGs indeed helps to up-weight signal CpGs and down-weight noise CpGs in constructing the distance matrix, thus improves the performance. When the size of a gene, i.e., number of CpGs in a gene, is fixed, among which when the number of signal CpGs increases, power of D w−DM increases much slower than that of D DM while D w−DM always has greater power than that of D DM . This implies that adding weights is most effective when a small percent of CpGs in a gene are signals. Similar power patterns are observed between weighted and non-weighted versions of the distance-based methods that consider both mean and variance signals, D w−DM−DV and D DM−DV . We also notice that D w−DM−DV is slightly less powerful than D w−DM because the overall mean signals are diluted by the inclusion of pseudo-sites for variance when there are only mean signals in the data. Moreover, D w−DM slightly outperform EWAS DM when there are several signal CpGs. This is because the distance-based method has the advantage to accumulate weak signals and thus boost the overall power.
Similar power patterns are observed when signal CpGs are set to have variance signals only. D w−DM , D DM and EWAS DM that consider mean signals only do not have any power, and the weighted distance-based methods outperform the non-weighted versions in the presence of noise CpGs, and D w−DV performs better than D w−DM−DV , and D w−DV outperforms EWAS DV when there are several signal CpGs.
Power for simulation settings with 10 genes. Power results for simulation settings with 10 genes are summarized in Figure 2. When signal CpGs have either mean or variance signals, we observed similar patterns as in the simulation settings with one gene. When signal CpGs have non-negligible mean signals and variance signals ranging from weak to strong, D w−DM−DV performs the best when variance signals are also weak to moderate as expected. This confirms that the potential area of usage for distance-based methods to be most effective is when there are weak signals that could be accumulated to boost the study power. When there are very strong signals at some sites, any methods will perform well. One observation that we need to point out is, powers of D w−DM , D DM and EWAS DM that only consider mean signals actually decrease as variance signals increase when mean signals exist. This is due to the fact that we worked on the standardized data in X mv = [X m , X v ], and the effect sizes of mean signals (standardized mean difference) decrease as the effect sizes of variance signals (ratio of standard deviation for cases and controls) increase after standardization.
Power for simulation settings with outliers. Power results for simulation settings with outlier samples are summarized in Figure 3. We observe that power of all methods increases as the signal-to-noise ratio increases and as the proportion of outlier samples increases as expected, and distancebased methods outperform non-distance-based methods while EWAS DM and EWAS DV have very little power when there are only 10% outlier samples. Among distance-based methods, D w−DM and D DM that consider mean signals only have lower power compare to other methods as the mean signals introduced by a few outlier samples are usually too weak to be detected by methods that consider mean signals only. On the other hand, D DM−DV that considers both mean and variance signals outperforms methods that consider variance signals only, D DV .The two weighted distancebased methods D w−DM−DV and D w−DV are among the best performed methods consistently. This implies the superiority of D w−DM−DV in the presence of weak signals in both DM and DV.
Power for simulation settings with one gene considering correlations among CpGs. The type I error rates under this scenario are summarized in Supplementary Table S2. The power results are summarized in Supplementary Figure S1. We note that the power patterns are very similar to those observed in simulations ignoring correlations among CpG sites. This implies that the correlations among neighboring CpGs do not have much impact on the performance of the proposed distance-based methods.

Real data application
We

Discovery analysis using the GEO BRCA data
We applied the proposed method D w−DM−DV and the comparing methods to the GEO 450K DNA methylation data of normal-adjacent tissue of breast invasive carcinoma (BRCA) patients and normal tissue from independent agematched cancer-free women (GSE69914). In the original GEO BRCA data, there are DNA methylation measures on 485,512 CpGs for 42 tumor and normal-adjacent pairs from breast cancer patients, 50 normal tissue of independent age-matched cancer-free women and 263 additional tumor tissue of independent breast cancer patients. We conducted standard quality control steps where we removed CpGs on sex chromosomes and those contain either a SNP at the CpG interrogation or at the single nucleotide extension (SBE) based on UCSC dbSNP table version 147 using the R package 'IlluminaHumanMethyla-tion450kanno.ilmn12.hg19' (35). We also required at least 95% CpG coverage per sample and 70% sample coverage per CpG, and only kept CpGs with gene annotations. We ended up with 344 947 CpGs, covering 19 271 genes, from 42 normal-adjacent tissues, 50 normal tissues and 263 independent tumor tissues.
Since Bonferroni adjustment for multiple comparisons of the 19 271 genes is too conservative, especially with the small sample size in the GEO BRCA dataset, we used a less stringent threshold 0.0005 on empirical gene-level Pvalues obtained from the permutation procedure ( Figure  4). Our main purpose is to demonstrate the superior performance of the proposed method D w−DM−DV over several comparing methods, especially the EWAS methods. Results using D w−DM−DV and EWAS min −P comparing 42 normal-adjacent tissues to 50 normal tissues are shown in the Manhattan plots in Figure 4. At the 0.0005 threshold for gene-level P-values, D w−DM−DV identified 21 genes (Table 2), of which 18 were previously reported to be associated with breast cancer; EWAS min −P identified 14 genes (Table 3), of which 9 were previously reported to be associated with breast cancer. There are 7 overlapping genes, TMC4, NAA35, THY1, CXCL6, KDM5A, FKBP4, and TMEM200B that were identified by both methods. Except for the PLS1 gene, the 7 genes uniquely identified by EWAS min −P all rank very high in D w−DM−DV results out of the 19 271 genes (Table 3) (Supplementary Tables S3-S7), respectively.
We further examined the 14 and 7 genes uniquely identified by D w−DM−DV and EWAS min −P , respectively. We plotted heatmaps of the original DNA methylation measures of CpG sites on these genes for the 50 normal tissues, 42 normal-adjacent tissues together with the 42 matched tumor tissues ( Supplementary Figures S2 and S3). In general, the 14 genes uniquely identified by D w−DM−DV are genes with multiple CpGs of weak signals, i.e. weak dense signals. Moreover, some of these weak dense signals were mainly due to a few outlier normal-adjacent tissue samples, thus were missed by EWAS min −P . The seven genes uniquely identified by EWAS min −P are those with just one or two CpGs with very strong signals, i.e. strong sparse signals. We also plotted heatmaps of seven genes identified by both D w−DM−DV and EWAS min −P (Supplementary Figure S4).
We then investigated the two genes, CFTR and PLS1, that were uniquely identified by D w−DM−DV and EWAS min −P , respectively, but ranked the last using the other method among all uniquely identified genes. We similarly plotted the heatmap of the original DNA methylation measures of CpG sites in these two genes ( Figure 5A). For the CFTR gene that has 16 CpGs, it is clear that variation in methylation measures increases in the progression from normal tissues to normal-adjacent tissues and to tumor tissues in multiple CpGs when there are several samples among the 42 normal-adjacent tissue samples that are very different from the normal samples.
On the other hand, for the PLS1 gene that also has 16 CpGs, it was identified uniquely by EWAS min −P because of one signal CpG site cg00137209 ( Figure 5A), mainly due to the very small variation in the methylation measures of the normal tissues. We then plotted DNA methylation measures of the top 4 P-value ranked CpGs, ranked by CpG site-level P-values from both mean and variance tests each after adjusting for multiple comparisons for the number of CpGs in the CFTR gene ( Figure 5B), which      clearly shows elevated methylation levels in the progression to tumor. For the PLS1 gene, we similarly plotted the DNA methylation measures of the top 2 P-value ranked CpGs ( Figure 5B), where the #1 ranked CpG cg00137209 is the one that shows strong variance signal due to very small variation in the methylation measures of the normal tissues, when neither CpGs showed any enrichment in methylation measures in the progression to tumor. This suggests that genes uniquely identified by EWAS min −P due to extreme P-values at one or two CpGs may not be reliable, while genes identified uniquely by D w−DM−DV are generally characterized by multiple signal CpGs, thus are more reliable.
We also plotted the DNA methylation measures of all CpGs in these two genes CFTR and PLS1 (Supplementary Figures S5 and S6, respectively). It is again clear that almost half of the CpGs in the CFTR gene have weak mean signals and weak variance signals, thus missed by EWAS min −P due to stringent multiple comparisons adjustment. In addition, we plotted the weighted distance matrices of the 50 normal tissues and the 42 normal-adjacent tissues for the CFTR gene and the PLS1 gene (Supplementary Figure S7). For the CFTR gene, we observe little variation in distances among normal samples and increased variation in distances between several pairs of normal and normal-adjacent samples, while for the PLS1 gene, we observe no clear pattern.
We also plotted the DNA methylation measures of CpGs in the TMC4 gene (Supplementary Figure S8) that was identified by both D w−DM−DV and EWAS min −P and ranked #1 and #2 in the two methods, respectively. There are 13 CpGs in the TMC4 gene, 3 CpGs have strong variance signals when two of the three CpGs also have mean signals. Thus, the TMC4 gene was identified by both D w−DM−DV and EWAS min −P .
In our previous work on differentially methylated regions (DMRs) using the same GEO BRCA data, we identified 2 DMRs of epigenetic field defects using both mean and variance signals (21). The two DMRs cover two genes, NKX6−2 and CCND2, which rank #113 and #359 in the D w−DM−DV results. Further investigation revealed that the two DMRs only cover part of the two genes. We therefore broke down the two genes into smaller parts so that there is one part that covers exactly the identified DMR. We then treated these smaller parts as individual regions and repeated D w−DM−DV across the whole genome. The rank of the NKX6−2 part that matches with the DMR moved up to #90 from #113 while the other two parts rank #107 and #4855, respectively. The rank of the CCND2 part that matches with the other DMR moved up to #154 from #359 and the other part ranks #1116. Overall, the 2 DMR-covered genes previously identified as epigenetic field defects also rank on top in the results of D w−DM−DV . This suggests that we may combine DMR detection techniques with distance-based methods to first better define 'regions of interest' using DMR ideas and then assess significance more powerfully with distancebased methods.
We also investigated the relation between the number of CpGs in a gene and the probability that the gene is selected, where we binned genes based on their sizes and calculated the selection probability of a gene in a bin as the proportion of genes identified out of all genes in the bin. We plotted the selection probabilities against gene sizes (Supplementary Figure S9) and found that the selection probabilities for different versions of the distance-based methods and EWAS min −P method are not systematically affected by gene sizes.

Validation of the identified epigenetic field defects in the GEO BRCA data
We further validated the 21 genes of epigenetic field defects identified by D w−DM−DV through comparing methylation measures of the 21-gene-covered CpGs between 263 independent tumor tissues and 42 normal-adjacent tissues to examine if the methylation levels at these CpGs exhibit progression to tumor. Specifically, we performed the two-sample t-test at each of these CpGs and plotted the −log 10 ( p−value) from the two comparisons, 50 normal tissues versus 42 normal-adjacent tissues and 42 normaladjacent tissues vs. 263 tumor tissues ( Supplementary Figure S10). In general, the majority of these CpGs show more significant signals in the progression from normal tissues to normal-adjacent tissues to tumor.

Replication analysis using an independent data of normal tissues
As epigenetic field defects identified in one set of normal vs. normal-adjacent comparison maybe driven by a few 'outlier' normal-adjacent samples (3,4,21), different epigenetic field defects could be identified in a different set of normal versus normal-adjacent comparison that are driven by different 'outlier' normal-adjacent samples. Therefore, we propose to conduct a replication analysis that uses the same normal-adjacent tissue samples but compare to an independent data of normal samples. We used 450K DNA methylation data of 18 normal tissue of 18 breast reduction mammoplasty subjects (GSE67919) (36). The original data have methylation measures on 485 577 CpG sites. We followed the same quality control steps as for the discovery GEO At the same 0.0005 threshold for gene-level P-values, 7 out of the 21 previously identified genes with epigenetic field defects in the discovery analysis using the GEO BRCA data were replicated by D w−DM−DV . The seven genes are DPH3B, NAA35, ANKRD13B, CXCL6, FKBP4, PRSS48 and CFTR. We similarly validated these 7 genes by comparing P-values from the two-sample t-tests comparing the 18 replication normal samples to the 42 GEO BRCA normaladjacent samples and P-values from the two-sample t-tests comparing the 42 GEO BRCA normal-adjacent samples to the 263 independent GEO BRCA tumor samples (Supplementary Figure S11). All 7 genes, except the NAA35 and To investigate our hypothesis that different epigenetic field defects maybe identified when comparing normal samples to a different set of normal-adjacent samples, we obtained a new set of BRCA normal-adjacent samples (n = 90) from the Cancer Genome Atlas (TCGA) project together with their matched tumor samples (n = 90). We plotted DNA methylation measures of CpGs in the 7 replicated genes (Supplementary File 2) of the 18 replication normal samples, the 50 discovery GEO BRCA normal samples, the 42 discovery GEO BRCA normal-adjacent samples, the 42 discovery GEO BRCA matched tumor samples, and the 90 TCGA normal-adjacent samples and the 90 TCGA matched tumor samples. It is clear that methylation patterns of the TCGA normal-adjacent tissues are very different from that of the discovery GEO BRCA normaladjacent tissues in most of these CpGs. This supports our hypothesis that methylation patterns can be very different in different pre-cancer tissues (using normal-adjacent tissue as a surrogate) thus different epigenetic field defects maybe identified when normal samples are compared to different sets of pre-cancer tissues.

DISCUSSION
In this study, we developed a weighted epigenetic distancebased method D w−DM−DV that accumulates both DM (mean) and DV (variance) signals across CpGs in a gene or a genetic region. One known advantage of distancebased methods is, there is no need to preselect outcomeassociated features, avoiding the potential to mis-screen features with weak signals. In our proposed weighted epigenetic distance-based method D w−DM−DV , we used CpG sitelevel association strengths as weights for individual CpGs aiming to up-weight signal CpGs and down-weight noise CpGs. If the feature preselection step could be conducted perfectly, it is equivalent to the case when weight '0' is correctly assigned to noise CpGs and weight '1' is correctly assigned to signal CpGs. Results from simulation studies suggest that when the signal-to-noise ratio in a gene decreases, power of non-weighted epigenetic distance-based methods decreased drastically, while power of the weighted version was well maintained. This suggests that incorporating CpG-site-level association strengths as weights for individual CpGs indeed help to up-weight signal CpGs and down-weight noise CpGs, thus improve the overall study performance. Simulation results also suggest that the weighted epigenetic distance-based methods will be most effective when applied to genes or genetic regions with a small percentage of CpGs having weak signals. This makes the detection of epigenetic field defects, i.e., early epigenetic alterations that are usually infrequent across samples and identifiable as outlier samples, the ideal application of the proposed method D w−DM−DV . Using the GEO BRCA 450K DNA methylation data, D w−DM−DV identified 21 genes with epigenetic field defects, when 7 out of the 21 genes overlap with the genes identified by EWAS min −P . Majority of the genes uniquely identified by D w−DM−DV were previously reported to be associated with breast cancer. Most of the genes uniquely identified by EWAS min −P also ranked on top in the D w−DM−DV results except for the PLS1 gene. However, further investigations suggested that the PLS1 gene may not be a real epigenetic field defect. On the other hand, most of the genes uniquely identified by D w−DM−DV also ranked on top in the EWAS min −P results except for the CFTR gene, in which the enrichment in the progression to breast cancer was confirmed in further analyses. This suggests that genes identified by D w−DM−DV , which are generally characterized by multiple signal CpGs, are more reliable. It is worth noticing that the 2 DMR-covered genes identified in our previous work (21) also ranked on top in the D w−DM−DV results. We validated the identified epigenetic field defects by showing a progression to tumor in an independent dataset of tumor tissues. We also conducted a replication analysis by comparing the same set of normaladjacent tissues to an independent set of normal tissues, and found that 7 out of the 21 genes of epigenetic field defects identified by D w−DM−DV in the discovery analysis were replicated.
In general, distance-based methods have a better performance than that of site-level EWAS methods when site-level signals are weak. As discussed in our previous work (21) and work of others (3,4), epigenetic field defects are often characterized by increased variation in DNA methylation measures due to a few outlier normal-adjacent tissue samples. So the site-level EWAS methods are usually underpowered due to small mean differences as well as stringent multiple comparisons adjustment. Distance-based methods accumulate weak signals to improve power. Distance-based methods are flexible and can be applied to a CpG site, a gene, a pathway, or an entire genome. A closer investigation on what we identified in our previous work (21) in DMR detection and the current work suggests that we may take advantages of the techniques in DMR detection and combine that with distance-based methods in future works to more efficiently identify regions of epigenetic field defects.
In summary, we proposed a new weighted distancebased method D w−DM−DV that considers both DM and DV in DNA methylation and incorporates site-level association strengths as weights on individual CpGs to up-