Winter-dormant shoot apical meristem in poplar trees shows environmental epigenetic memory

The winter-dormant shoot apical meristem of the tree species Populus keeps an epigenetic memory of environmental variations that arose during the preceding vegetative period.

Genes localized in 'hypo-methylated' DMRs in Experiment 1 and DEGs; DMRs common to Experiments 1 and 2 and DEGs; and DMRs common to the 3 Experiments and DEGs. DEGs have been recently reported (Lafon-Placette et al., 2018) and are available on GEO GSE46605).  Rectangle size is adjusted to reflect the abs log10 p-value of the GO term in the underlying GOA database. 'n' corresponds to homologous Arabidopsis gene annotations.     Hypo-methylated DMRs (C) Hyper-methylated DMRs between favorable and unfavorable growth conditions. Scores for DNA variation are indicated as 1 for hyper-methylated, 0 for non-significant and -1 for hypo-methylated. 'Common' DMRs means that the same locus was detected as a DMR between favorable and unfavorable growth conditions in Experiment 2 and Experiment 3. Each rectangle is a single cluster representative of the corresponding TAIR10 poplar V 3.0 annotations. The representatives are joined into 'superclusters' of loosely related terms, visualized with different colors. Rectangle size is adjusted to reflect the abs log10 p-value of the GO term in the underlying GOA database. 'n' corresponds to homologous Arabidopsis gene annotations.  Scores for DNA variation are indicated as 1 for hyper-methylated, 0 for non-significant and -1 for hypo-methylated. 'Common' DMRs means that the same locus was detected as DMR between favorable and unfavorable growth conditions in Experiment 1 and Experiment 2.
'Conserved' versus 'inversed' mean that the direction of variation (hypo-or hyper-methylation) of the 'common' DMRs between favorable and unfavorable conditions is identical ('conserved') or different ('inversed') between Experiments 1 and 2. Each rectangle is a single cluster representative of TAIR10 corresponding to poplar V 3.0 annotation. The representatives are joined into 'superclusters' of loosely related terms, visualized with different colors. Rectangle size is adjusted to reflect the abs log10 p-value of the GO term in the underlying GOA database.
'n' corresponds to homologous Arabidopsis gene annotations. DRMs were shown to preferentially affect genes responding to abiotic stress. Potentially, some of the DMRs playing a role in immediate phenotypic plasticity could be maintained during the next vegetative period and may contribute to acclimation and stress memory (priming); they may also be transmitted to the next generation resulting in trans-generational stress memory and adaptive potential.
Methods S1 FDR estimation for Experiments 1 and 2. This document describes the statistical study conducted to estimate the False Discovery Rate (FDR) of the multiple tests performed for

Mixture models and False Discovery Rate (FDR) estimation
Abstract This document describes in details the statistical study conducted for estimating the False Discovery Rate (FDR) of the multiple tests performed for both Experiments 1 and 2: (i) estimation of reference means based on mixture models on a subsample; (ii) multiple tests with FDR control per windows and scaffolds; (iii) mixture-based clustering of window means, for comparisons with results from the FDR procedure; (iv) computation of the Differentially Methylated Regions (DMR). Some additional summary plots are displayed.
1 Experiment 1: ORL W W and ORL W D

Determination of reference means from Gaussian mixtures
The purpose of this preliminary step is to obtain "reference means" for each of the conditions (called treatments), and for all or each scaffold. The two responses for this experiment are ORL W W and ORL W D (see the main document). Denote µ t 0 this reference mean for treatment t, against which the mean response for each window w, that we denote µ w , will be compared using a statistical test for a null hypothesis such as H 0 : µ w = µ t 0 in the parametric case, or distribution equality in nonparametric cases (see below). To insure independence between the mean responses per window and the reference mean (since both need to be estimated from the data), we first randomly select 20% of the data as a "reference sample", n = 61, 439 responses. The remaining dataset (n = 245, 756) is used for building the multiple test procedure.
We consider the responses for each treatment separately, i.e. univariate models only. The plots of empirical distributions (histograms in Fig.1) show that 2-component mixture models are suitable (see McLachlan and Peel, 2000, for references on mixture models). We further assume here that the responses for each treatment form an iid sample from a m = 2 component univariate Gaussian mixture. Informally, the distribution associated to each treatment t is given by , where the statistical model parameters are component 1 weight λ t 1 , component means (µ t 1 , µ t 2 ) and variances (v t 1 , v t 2 ), hence 5 scalar parameters per treatment. Other mixture models, e.g., nonparametric as in Bordes et al. (2007) or Chauveau et al. (2015) could be tried as well.
For each treatment, the parameters of the model are then estimated using a standard EM algorithm. We use here the normalmixEM function of the Benaglia et al. (2009) mixtools package for the R statistical software (R Core Team, 2016). The initialization of the EM algorithm is data-driven, based on an initial k-means clustering of the data to which we provide initial centroïd corresponding approximately to the modes visible in the histograms. Alternative initialization methods have also been checked to insure that the estimates correspond to the maximum of the loglikelihood of the model for these data. These estimates are computed on the data from all the 19 scaffolds. We have also fit mixture models for each scaffold separately. Fig.2 shows that, for the 2 treatments, the reference means of the leftmost component per scaffold are comparable to the global reference.

Multiple tests on consecutive windows
We assume that the µ t 0 's estimated in Section 1.1 from the 20% reference sample provide reference mean parameters that are now considered non-random. The next step consists in building windows of consecutive observations (spots) for some given "nominal" size, and to test equality of means or distributions between each window and the reference. For a nominal size set to 50 kb (see the main document), we obtain the configuration of the windows per scaffold summarized in Fig. 3. More precisely, Fig. 4 illustrates typical responses for one treatment, ORL W D , Scaffold 1, and the first ten windows w = 1, . . . , 10. This shows that window responses may vary in dispersion (as e.g., between scaffolds 3 and 5) or/and in localization (as e.g., between scaffolds 4 and 10), and that the null hypothesis of mean equality may be accepted even though the spots in a window are not close to the reference mean, because of averaging.

Statistical tests per window
In view of some boxplots in Fig. 4, distribution within windows often show some skewness (e.g., window 2), and heavy tails. Hence it seems not reasonnable to assume Gaussian distributions of responses within each window, so that Student t-tests cannot be used. When the number of observations per window is large enough (n w ≥ 30 would be typically acceptable) tests using the asymptotic normality (also known as z-tests) may be used. But in most windows, depending on the nominal size, there are fewer obervations, so that nonparametric framework must be used, at least in these cases (for the nominal size 50 kb, about 56% of the windows have fewer than 30 observations). A standard test for this situation is the Wilcoxon signed rank test of a null hypothesis that the distribution of the sample from each window is symmetric about µ t 0 . This test can be viewed as a nonparametric version of the t-or z-tests for localization. However, note that this test may reject (i.e. declare significant) windows appropriately centered toward the reference mean (i.e. for whichμ w ≈ µ t 0 ), in case of lack of symmetry (skewness). In the sequel, we only apply Wilcoxon signed rank tests on all windows for simplicity (parametric and nonparametric tests per windows could also be mixed, depending on the number of observations per window).

FDR control
The purpose of the study is to declare as significant the windows (i.e. the spots within the windows) for which the null hypotheses have been rejected, according to the p-values of the tests. As stated before, rejection means significant shift (or non symmetry) from the reference distribution located on µ t 0 for each treatment t. For each scaffold, this experiment results in multiple hypothesis testing problem with hundreds of test responses to consider simultaneously (see Fig. 3, top, for the number of windows per scaffold; for instance the largest configuration with windows of size 50 is 965 windows for scaffold 1). Note also that decreasing the nominal window size (e.g., from 50 kb to 20 kb) results in a larger number of windows per scaffold, but with less observations per window, resulting in less powerful tests.
In this multiple inference setup, the unguarded use of single-inference procedures may results in an increased false positive rate among the simultaneous tests. The first criterion was the FamilyWise Error Rate (FWER), that is the probability of observing at least one false rejection among the n tests. The historical approach since the early 1950s, called the Bonferroni approximation procedure (see e.g. Benjamini and Hochberg (1995) and references therein) consists in applying each test at a level α/n, resulting in a FWER lesser than α. However, Bonferroni-type procedures appear to be far too conservative when n gets large because α/n gets too small, leading to too few rejections. Since then, the point of view on the problem has changed, focusing in the number (or ratio) of erroneous rejections instead of controlling the FWER. In this vein, the most common concept of error control in such multiple testing inference is the expected proportion of falsely rejected hypotheses among the simultaneous tests, or False Discovery Rate (FDR, Benjamini and Hochberg, 1995). Several statistical algorithms have been proposed in the literature for estimating the FDR, the recent and unified procedure based on a nonparametric approach from Strimmer (2008) appearing to be one of the current standard for practitioners. Alternative approaches to FDR estimation consists in viewing the problem as the statistical estimation of the parameters of a finite mixture model. Parametric mixtures for FDR estimation have been considered in, e.g., McLachlan et al. (2006), and semiparametric alternatives have also been proposed (see, e.g., Robin et al., 2007;Chauveau et al., 2014).
For simplicity and easy reproducibility, we only apply here FDR control from the most conservative procedure given by the fdrtool package Strimmer (2008), the local fdr labeled loc fdr in the figures. Typical results for the first 4 scaffolds are displayed in Fig. 6. The legend in this Figure also provides the number of cases corresponding to the smallest p-values, that can be rejected at the FDR level α = 5%. This number correspond to the first "time" the curve of the adjusted p-values crosses the horizontal line corresponding to α. Running this procedure on all scaffolds gives the table below, of number of rejected windows per scaffold and treatment, to be compared with number of windows i.e. of multiple tests.

Mixture-based clustering of means per window
It is also possible to perform a MAP clustering based on the posterior probabilities obtained by fitting a Gaussian mixture model on the sample of window empirical means (μ w , w = 1, 2, . . .) up to the number of windows for a scaffold, or the total number of windows, for each treatment. This information can be added to the results of the multiple tests procedure and FDR control. Windows leading to significant rejection of H 0 with an empirical window meanμ w > µ t 0 should belong to the rightmost component. Fig. 5 gives these EM fits done from the window means for the 19 scaffolds (7556 windows).

Output results and DMR statistics for FDR level 5%
The final result is a data frame with 7556 rows (one row per window and per scaffold). The columns give Start, number of spots, mean responses and p-values per window, columns for MAP indicators plus a summarizing decision rule in {−1, 0, 1} based on the FDR and the sign of the mean shift for each treatment. The decision indicator is: −1 if the FDR lead to significant case (reject) and the sign is negative; +1 if the FDR lead to significant case (reject) and the sign is positive; 0 if the FDR concludes to non significant case (nosig).
Then the DMR (Differentially Methylated Regions) between the two treatments are computed: these equal TRUE only if the two treatments decision rules are different. All these statistics are provided for (and depends upon) the desired FDR level set to 5% here. A summary detailed for each or all scafflod can be also proposed through "Manhattan-like" plots. Fig. 7 displays these Manhattan plots for all the scaffolds.

Experiment 2: ECH-SCV-GMN
We follow in this Section the method already presented in Section 1. Thus we just present below the results and specific details.

Determination of reference means from Gaussian mixtures
We first randomly select 20% of the data as a "reference sample", n = 47, 651 responses after removal of the 1961 LOCI in Scaffold 19. The remaining dataset (n = 198, 448) is used later for building the multiple test procedure. In view of Fig.8, it is reasonable to assume that the responses for each treatment form an iid sample from a m = 2 components univariate Gaussian mixture. For each treatment, the parameters of the model are estimated using the gaussian EM algorithm in the normalmixEM function of the Benaglia et al. (2009) mixtools package. Fig.8 displays the mixture model fits on top of the empirical distributions. The estimated mean of the leftmost, smallest component mean labeled "1" here, is assumed to be the reference for each treatment "t". Results for the reference meansμ t 0 are indicated on Fig.8  These estimates are computed on the data from all the 19 scaffolds. We have also fit mixture models for each scaffold separately. Fig.9 shows that, for the 3 conditions, the reference means of the leftmost component per scaffold are comparable to the global result.

Multiple tests on consecutive windows
We assume that the µ t 0 's estimated in Section 1.1 from the 20% reference sample provide reference mean parameters that are now considered non-random. We then build windows of consecutive observations (spots) for the same "nominal" size as before, and test equality of means or distributions between each window and the reference. The windows configuration per scaffold is summarized in Fig. 10. Fig. 11 illustrates typical responses for treatment, SCV, Scaffold 1, and the first windows w = 1, . . . , 10. This shows that window responses may vary in dispersion or/and in localization, and that the null hypothesis of mean equality may be accepted even though the spots in a window are not close to the reference mean, because of averaging.

Statistical tests per window
As in Experiment 1, it seems not reasonnable to assume Gaussian distributions of responses within each window, so that Student t-tests cannot be used. In addition, about 71% of the windows have fewer than 30 observations here so that asymptotic normality cannot be used either. We use as previously the (non-parametric) Wilcoxon signed rank test of a null hypothesis that the distribution of the sample from each window is symmetric about µ t 0 .

FDR control
We apply as for Experiment 1 the FDR control using the most conservative local fdr procedure given by the fdrtool package Strimmer (2008). Running this procedure on all scaffolds gives the table below, of number of rejected (R) windows per scaffold and treatment, to be compared with number of windows i.e. of multiple tests. We also perform the MAP clustering based on the posterior probabilities obtained by fitting a Gaussian mixture model on the sample of window empirical means (μ w , w = 1, 2, . . .) up to the total number of windows, for each treatment. Fig. 12 gives these EM fits for the 19 scaffolds (7550 windows).

Output results and DMR statistics for FDR level 5%
The final result for this experiment is a data frame with 7550 rows (one row per window for each scaffold). The columns give Start, number of spots, mean responses and p-values per window, columns for MAP indicators plus a summarizing decision rule in {−1, 0, 1} based on the FDR and the sign of the mean shift for each treatment, already defined in Section 1.3. Then the DMR (Differentially Methylated Regions) between treatments two-by-two (resp. for the 3 treatments) are computed: these equal TRUE only if the two (resp. 3) treatments decision rules are different. All these statistics are provided for (and depend upon) the desired FDR level set to 5% here. Note that for these data only a single case returns a TRUE DMR between the 3 treatments: Scaffold 12, window number 50, Start = 1053390, decision rules SVC:0, ECH:1, and GMN:-1. Fig. 13 gives the "Manhattan-like" plot for scaffold 12, the one for which DMR for the 3 treatments has the single true value.