QTL Mapping on a Background of Variance Heterogeneity

Standard QTL mapping procedures seek to identify genetic loci affecting the phenotypic mean while assuming that all individuals have the same residual variance. But when the residual variance differs systematically between groups, perhaps due to a genetic or environmental factor, such standard procedures can falter: in testing for QTL associations, they attribute too much weight to observations that are noisy and too little to those that are precise, resulting in reduced power and and increased susceptibility to false positives. The negative effects of such “background variance heterogeneity” (BVH) on standard QTL mapping have received little attention until now, although the subject is closely related to work on the detection of variance-controlling genes. Here we use simulation to examine how BVH affects power and false positive rate for detecting QTL affecting the mean (mQTL), the variance (vQTL), or both (mvQTL). We compare linear regression for mQTL and Levene’s test for vQTL, with tests more recently developed, including tests based on the double generalized linear model (DGLM), which can model BVH explicitly. We show that, when used in conjunction with a suitable permutation procedure, the DGLM-based tests accurately control false positive rate and are more powerful than the other tests. We also find that some adverse effects of BVH can be mitigated by applying a rank inverse normal transform. We apply our novel approach, which we term “mean-variance QTL mapping”, to publicly available data on a mouse backcross and, after accommodating BVH driven by sire, detect a new mQTL for bodyweight.


KEYWORDS
Cao's tests reweighting vQTL variable transformation background variance heterogeneity heteroskedastic heteroscedastic A standard modeling assumption in quantitative trait locus (QTL) mapping is that all individuals, regardless of differences in their phenotypic mean, have the same residual variance. In reality, the residual variance-sometimes termed the environmental variance, and in general relating to the apparent noisiness of the phenotype-can differ between individuals. These differences in residual variance can arise from many sources, both extrinsic, such as environmental factors, and intrinsic, such as sex, or, more broadly, genetics. Environmental sources of residual variance heterogeneity have been well-documented, and include, for example, soil nitrogen and irrigation (Makumburage and Stapleton 2011), temperature (Shen et al. 2014), and even the age at which young birds begin to experience the environmental insults outside of the nest (Snell-Rood et al. 2015). Genetic sources of residual variance heterogeneity have attracted increasing interest, with multiple studies finding instances of the residual variance being heritable (Sorensen and Waagepetersen 2003;Hill and Mulder 2010;Sørensen et al. 2015;Gonzalez et al. 2016;Lin et al. 2016;Mitchell et al. 2016), and in some cases substantially attributable to allelic variation in individual genes (Paré et al. 2010;Wolc et al. 2012;Yang et al. 2012;Hulse and Cai 2013;Wang et al. 2014;Ayroles et al. 2015;Forsberg et al. 2015;Yadav et al. 2016;Ivarsdottir et al. 2017).
The presence of residual variance heterogeneity, however, regardless of its source, can be problematic for analysis protocols that disregard it. Differences in residual variance between groups of individuals affect the precision of estimated means and, in turn, tests of significance or association (Cochran 1937;Yates and Cochran 1938). In the context of QTL mapping, ignoring such differences discards information that could be exploited to increase the power to detect QTL; and in the case of mapping vQTL, it can covertly increase the false positive rate to well above the nominal level.
Specifically, the background presence of a major variance-controlling factor (e.g., sex, housing, strain, a vQTL, etc.) implies that inferences about any other effect (e.g., that of a QTL elsewhere in the genome) occur against a backdrop of systematically heterogeneous residual variance. This "background variance heterogeneity" (BVH) acts to disrupt the natural observation weights: rather than every individual being subject to equal noise variance and therefore meriting equal weight, with BVH present some individuals' phenotypes are inherently more (or less) noisy and so due less (or more) weight. And just as reweighting accordingly should lead to a more powerful analysis, then assuming all weights are equal (i.e., variance homogeneity) risks overleveraging outliers and increasing the potential for both false negatives and false positives. This is likely to be true not only for studies detecting mQTL but also those detecting vQTL, which rely on the accurate attribution of residual noise.
Nonetheless, consideration of variance effects-whether as the target of inference or as a feature of the data to be accommodated-has thus far remained outside of routine genetic analysis. This could be in part because vQTL are sometimes considered of esoteric secondary interest, intrinsically controversial in their interpretation (Sun et al. 2013;Shen and Ronnegard 2013), or a priori too hard to detect (Visscher and Posthuma 2010). But it is also likely to be in part because standard protocols for finding and reporting vQTL are currently lacking, and because the advantages of modeling heterogeneous variance, even when targeting mQTL, remain under-appreciated and largely undemonstrated.
A number of statistical models and methods have been developed or adapted specifically to detect vQTL. These include: Levene's test (Struchalin et al. 2010) and its generalizations (Soave et al. 2015;Soave and Sun 2017); the Fligner-Killeen test (Fraser and Schadt 2010); Bartlett's test (Freund et al. 2013); and methods based on, or related to, the double generalized linear model (DGLM) and similar (Rönnegård and Valdar 2011;Cao et al. 2014;Dumitrascu et al. 2018). Tests have also been developed to detect genotype associations with arbitrary functions of the phenotype, for example higher moments, and these include a variant of the Komolgorov-Smirnov test (Aschard et al. 2013) and a semi-parametric exponential tilt model (Hong et al. 2016).
Of the above methods, the ability to accommodate BVH of known source is limited to the DGLM of Rönnegård and Valdar (2011) (as well as a very recent Bayesian counterpart, described in Dumitrascu et al. 2018), which can include variance effects of arbitrary covariates as well as those belonging to the target (or foreground) QTL.
When the source of BVH is unknown, strategies to protect against it are less obvious. Since the threat manifests through sensitivity to distributional assumptions, possible remedies include side-stepping such assumptions via non-parametric approaches, e.g., permutation testing, or reshaping the distribution prior to analysis through variable transformation. Both have been considered in the vQTL context, with permutation used in Hulse and Cai (2013) and Yang et al. (2012) and transformation in Rönnegård and Valdar (2011), Yang et al. (2012), Sun et al. (2013), and Shen and Carlborg (2013), but not specifically for controlling mQTL or vQTL false positives in the presence of BVH.
Here we examine the effect of modeled and unmodeled BVH on power and false positive rate when mapping QTL affecting the mean, the variance, or both. In doing so we: 1. Describe how the DGLM can be used develop a robust, straightforward procedure for routine mQTL and vQTL analysis, which we term "mean-variance QTL mapping"; 2. Compare alternative proposed methods for mQTL and vQTL analysis; 3. Show how accommodating BVH with the DGLM can improve power for detecting mQTL, vQTL, and mvQTL compared with other methods; 4. Show how sensitivity to model assumptions can be rescued by variable transformation and/or permutation; and 5. Demonstrate the discovery of a new QTL for mouse bodyweight from an existing F2 intercross data resource (Leamy et al. 2000).
In two companion papers, we describe R package vqtl, which implements our procedure , and in  apply it to two published QTL mapping experiments detecting a novel mQTL in one and a novel vQTL in the other. In particular,  demonstrates a principle investigated here: that when an mQTL also has variance effects, those variance effects induce a type of proximal BVH, and modeling them explicitly therefore improves mQTL detection.

STATISTICAL METHODS
This section reviews the tests and evaluation procedures that we studied through simulation. First, we describe eight statistical tests that can be used to model the effect of a single locus on phenotype mean and/or variance: the standard linear model, Levene's test, Cao's three tests, and three DGLMbased tests. We also describe four procedures for evaluating the statistical significance (i.e., calculating p-values) of these tests-a standard asymptotic evaluation and three procedures that reasonably could be expected to provide protection against violations of model assumptions.

Definitions
We start by defining three partially overlapping classes of QTL: mQTL: a locus containing a genetic factor that causes heterogeneity of phenotype mean, vQTL: a locus containing a genetic factor that causes heterogeneity of phenotype variance, and mvQTL: a locus containing a genetic factor that causes heterogeneity of either phenotype mean, variance, or botha generalization that includes the other two classes. [Note: this usage is distinct from that of Yadav et al. (2016)] In addition, since we restrict our attention to QTL mapping methods that test genetic association with a phenotype one locus at a time, we distinguish two sources of variance effects: Foreground Variance Heterogeneity (FVH): effects on the phenotype variance that arise from the locus under consideration (the focal locus); Background Variance Heterogeneity (BVH): effects on the phenotype variance that arise from outside of the focal locus, e.g., from another locus or an experimental covariate.
Procedures to evaluate the significance of a single test In comparing different statistical tests and their sensitivity to BVH, namely the effect of BVH on power and false positive rate (FPR), it is important to acknowledge that various measures could be taken to make significance testing procedures more robust to model misspecification in general and to BVH specifically. The significance testing methods considered here are frequentist, involving the calculation of a test statistic T on the observed data followed by an estimation of statistical significance based on a conception of T's distribution under the null. BVH, however, will often constitute a departure of distributional assumptions, and in any rigorous applied statistical analysis when departures are expected it would be typical to consider protective measures such as, for example, transforming the response to make asymptotic assumptions more reasonable, or the use of computationally intensive procedures to evaluate significance empirically, such as those based on bootstrapping or permutation. Nominal significance (i.e., the p-value for a single hypothesis test) is evaluated using four distinct procedures. The first two rely on asymptotics: 1. Standard: The test statistic T is computed on the observed data and compared with its asymptotic distribution under the null. 2. Rank-based inverse normal transform (RINT): As for standard, except observed phenotypes fy i g n i¼1 are first transformed to strict normality using the function where F is the normal c.d.f. and rankðy i Þ is gives the rank (from 1; . . . ; n) (Beasley et al. 2009).
The second two determine significance empirically based on randomization: the test statistic T is recomputed as T ðrÞ under randomizations of the data r ¼ 1; . . . ; R, and the resulting set of statistics fT ðrÞ g R r¼1 is used as the empirical distribution of T under the randomized null. Two alternative randomizations are considered: 3. Residperm: we generate a pseudo-null response fy ðrÞ i g n i¼1 based on permuting the residuals of the fitted null model, (Freedman and Lane 1983;Good 2013), a process recently applied in the field of QTL mapping by Cao et al. (2014). 4. Locusperm: we leave the response intact, instead permuting the rows of the design matrix (or matrices) that differentiate(s) the null from alternative model.

Procedure to evaluate genomewide significance
In the context of a genome scan, where many hypotheses are tested, we aim to control FPR genomewide through a family-wise error rate (FWER), the probability of making at least one false positive finding across the whole genome. This is done following the general approach of Churchill and Doerge (1994), which is closely related to the locusperm procedure described above, and which we refer to as genomeperm. Briefly, we perform an initial genome scan, recording test statistics fT l g L l¼1 for all L loci. Then for each randomization r ¼ 1; . . . ; R; and for only the parts of the model that distinguish the null from the alternative model, the genomes are permuted among the individuals; the scan is then repeated to yield simulated null test statistics fT ðrÞ l g L l¼1 of which the maximum, T ðrÞ max ; is recorded. The collection of fT ðrÞ max g R r¼1 from all R such permutations is then used to fit a generalized extreme value distribution (GEV) (Dudbridge and Koeleman 2004;Valdar et al. 2006), and the quantiles of this are used to estimate FWER-adjusted p-values for each fT l g L l¼1 : Standard linear model (SLM) for detecting mQTL The standard model of quantitative trait mapping uses a linear regression based on the approximation of Haley and Knott (1992) and Martínez and Curnow (1992) to interval mapping of Lander and Botstein (1989). The effect of a given QTL on quantitative phenotype y i of individual i ¼ 1; . . . ; n is modeled as where s 2 is the residual variance and m i is a linear predictor for the mean, defined, in what we term the "full model", as where m is the intercept, x i is a vector of covariates with effects b, and q i is a vector encoding the genetic state at the putative mQTL with corresponding mQTL effects a. In the case considered here of biallelic loci arising from a cross of two founders, A and B, the genetic state vector q i ¼ ða i ; d i Þ T is defined as follows: when genotype is known, for genotypes ðAA; AB; BBÞ; the additive dosage is a i ¼ ð0; 1; 2Þ and the dominance predictor is d i ¼ ð0; 1; 0Þ; when genotype is available only as estimated probabilities pðAAÞ; pðABÞ and pðBBÞ; following Haley and Knott (1992) and Martínez and Curnow (1992), we use the corresponding expectations, a i ¼ 2pðAAÞ þ pðABÞ and d i ¼ pðABÞ: The test statistic for an mQTL is based on comparing the fit of the full model, acting as an alternative model, with that of a null that omits the locus effect, namely, Since the regression in each case provides a maximum likelihood fit, the test statistic used here is likelihood ratio (LR) statistic, T ¼ 2ðℓ 1 2 ℓ 0 Þ; where ℓ 1 and ℓ 0 are the log-likelihoods under the alternative and the null respectively. For the biallelic model, the asymptotic test is the likelihood ratio test (LRT) whereby under the null, T x 2 2 : (Note: Alternative evaluation using the F-test is in general more precise but for our purposes provides equivalent results.) The residperm approach to empirical significance evaluation of T proceeds as follows. We first fit the null model (Equation 3) to obtain predicted valuesm i ¼ x T ib and estimated residualsê i such that y i ¼m i þê i : Then, for each randomization r ¼ 1; . . . ; R; we generate pseudo-null phenotypes fy where if p r is a vector containing a random permutation of the indices i ¼ 1; . . . ; n, then p r ðiÞ is its ith element, mapping index i to its rth permuted version. The null and alternative models are then fitted to fy In the locusperm approach to empirical significance, the response is unchanged but permutations are applied to the locus genotypes. For each randomization r, the full model m i is where p r ðiÞ is as defined for residperm above. This full model fit yields ℓ ðrÞ 1 ; and then T ðrÞ ¼ 2ðℓ ðrÞ 1 2 ℓ 0 Þ: Note that ℓ ðrÞ 0 need not be recomputed after randomization because because only the rows of the design matrices that are unique to the alternative model are permuted and thus ℓ ðrÞ 0 ¼ ℓ 0 : Levene's Test (LV) for detecting vQTL Levene's test is a procedure for differences in variance between groups that can be used to detect vQTL. Suppose individuals are in G mutually exclusive groups g ¼ 1; . . . ; G: Let g½i denote the group to which individual i belongs, denote gth group size as n g ¼ P n i¼1 I fg½i¼gg ; and gth group mean as y g ¼ n 21 g P n i¼1 y i I fg½i¼gg : Then denote the ith absolute deviation as z i ¼ jy i 2 y g½i j; the group mean of these as z g ¼ n 21 g P n i¼1 z i I fg½i¼gg and overall mean z ¼ n 21 P n i¼1 z i : Levene's W statistic is then W ¼ P G g¼1 n g z g 2 z 2 G 2 1 which under the null model of no variance effect follows the F distribution as W FðN 2 G; G 2 1Þ (Levene 1960). Note that replacing means of y with medians gives the related Brown-Forsythe test (Brown and Forsythe 1973), and replacing all instances of z with y in Equation 5 gives the ANOVA F statistic. Levene's test does not lend itself naturally to the residperm approach because it does not explicitly involve a null model to split the data into hat values and residuals. We therefore use the null model from the SLM (Equation 3) to approximate the residperm procedure with Levene's test. To execute the locusperm procedure, for each randomization r, the group labels are permuted among the individuals, which is equivalent to replacing all instances of g½i above with g½p r ðiÞ; with p r ðiÞ defined as above. A corresponding genomewide procedure, although not performed here, would ensure that each randomization r applies the same permutation p r across all loci. Cao et al. (2014) elaborates the SLM to have a variance parameter that differs by genotype, i.e.,

Cao's Tests
where m i is the linear predictor, s 2 i is the variance of the ith individual. These are defined in what we term the "full model" as where g½i indexes the genotype group to which i belongs, and ff g g G g¼1 are the variances of the g ¼ 1; . . . ; G genotype groups. Thus an individual's variance is entirely dictated by its genotype, and that genotype must be categorically known (or otherwise assigned). Cao et al. (2014) fits this model using a two-step, profile likelihood method, which in our applications we observe to be indistinguishable from full maximum likelihood ( Figure S8). Cao et al. (2014) describes tests for mQTL, vQTL and mvQTL based on comparing a full model against three different null models; we detail these tests below in our notation, denoting them respectively Cao M , Cao V , and Cao MV .

Cao M test for detection of mQTL: The Cao M test involves an LRT between Cao's full model and Cao's no-mQTL model:
Cao's no-mQTL model : To execute the residperm procedure for Cao M , pseudo-null phenotypes are generated usingm i andê i from Cao's no-mQTL model (Equation 8). The locusperm procedure respecifies the full model (Equation 7), leaving the variance model unchanged and specifying the mean predictor as Cao V for detection of vQTL: The Cao V test involves an LRT between Cao's full model and Cao's no-vQTL model: Cao's no-vQTL model : where the unsubscripted s 2 is a single, overall residual variance. This null model is identical to the alternative model in the SLM (Equation 2). To execute the residperm procedure for Cao V , pseudo-null phenotypes are generating usingm i andê i from Cao's no-mQTL model (Equation 9). The locusperm procedure respecifies the full model (Equation 7), leaving the mean sub-model unchanged and specifying the variance predictor as s 2 i ¼ f g½pðiÞ : Cao MV for detection of generalized mvQTL: The Cao MV test involves an LRT between Cao's full model and Cao's no-QTL model: Cao's no-QTL model : This null model is identical to the null model in the SLM (Equation 3).
To execute the residperm procedure for Cao MV , pseudo-null phenotypes are generated usingm i andê i from Cao's no-QTL model (Equation 10). The locusperm procedure specifies the mean predictor as m i ¼ m þ x T i b þ q pðiÞ and the variance predictor as s 2 g½i ¼ f pðiÞ :

Double Generalized Linear Model (DGLM)
The DGLM models the phenotype y i via two linear predictors as where m i predicts the phenotype mean and v i predicts the extent to which the baseline residual variance s 2 is increased in individual i. In what we term the "DGLM full model", these are specified as Full model : where m is the intercept, z i is a vector of covariates (which may be identical to x i ), g is a vector of covariate effects on v i , and u is a vector of locus effects on v i . As with Cao's full model, the DGLM full model can be compared, in a likelihood ratio test, with various null models to test for mQTL, vQTL (Rönnegård and Valdar 2011), or mvQTL. A full maximum likelihood fitting procedure for the DGLM was provided by Smyth (1989).
DGLM M for detecting mQTL: For detecting mQTL, we use an LRT of the DGLM full model in Equation 11 against the no-mQTL model: where the LR statistic has asymptotic distribution T x 2 2 . To execute the residperm procedure for DGLM M , pseudo-null phenotypes are generated usingm i andê i from Equation 12. The locusperm procedure respecifies the mean predictor as a and does not modify the variance predictor.
DGLM V for detecting vQTL: For detecting vQTL, we use an LRT of the DGLM full model in Equation 11 against the no-vQTL model: where the LR statistic has asymptotic distribution T x 2 2 . To execute the residperm procedure for DGLM V , pseudo-null phenotypes are generated usingm i andê i from the Equation 13. The locusperm procedure does not modify the variance predictor and respecifies the mean predictor as DGLM MV for detecting mvQTL: For detecting mvQTL, we use an LRT of the DGLM full model in Equation 11 against the no-QTL model: where the LR statistic has asymptotic distribution T x 2 4 . To execute the residperm procedure for DGLM MV , pseudo-null phenotypes are generated usingm i andê i from the Equation 14. The locusperm procedure respecifies the mean predictor as

SIMULATION METHODS
The eight methods and four significance testing procedures described in the previous section, amounting to 32 test-procedure combinations in total, were compared by simulation. The simulations examined the performance of each combination, in terms of false and true positive rate, under eight distinct scenarios relating to the presence or absence of a QTL (and if present, then what type), and the presence or absence of BVH. We describe the general simulation setup below, followed by a detailed description of the eight scenarios and then describe the metrics by which performance was judged.

Simulating locus and covariate
Each simulated experiment consisted of 300 individuals, where each individual was defined by one single-locus genotype, one covariate, and one phenotype. The genotype for individual i, denoted by q i , was simulated according to a random process to mimic an F2 intercross: q i f21; 0; 1g with probability f0:25; 0:5; 0:25g The covariate for individual i, denoted z i , was specified as a five-level categorical factor, with levels assigned to individuals as where z i is an indicator vector such that, for example, z i ¼ ð1; 0; 0; 0; 0Þ denotes membership of level 1. This covariate, which was fixed across simulations, was intended to mimic a generic, fixed aspect of experimental design in a typical QTL mapping study (for example, batch, technician, housing, etc.) that could plausibly influence the precision of the observations. When BVH is simulated, it is driven by this covariate.

Scenarios
We conducted simulated experiments under eight different scenarios. These scenarios varied conceptually across two dimensions. First, we considered four types of locus: 1. null locus: The locus has no effect on phenotype. 2. pure mQTL: The locus has an additive effect on the phenotype mean. 3. pure vQTL: The locus has an additive effect on the log of the residual phenotype variance. 4. mixed mvQTL: The locus has both an additive effect on phenotype mean and an additive effect on the log of residual phenotype variance.
Then, we considered whether or not BVH was present, i.e.: 1. BVH absent: The covariate does not influence the residual variance of the phenotype. 2. BVH present: The covariate influences the residual variance of the phenotype (in addition to the locus, if a vQTL or mvQTL).
The resulting eight scenarios (i.e., all combinations) were realized in silico with three parameters: the effect of the locus on phenotype mean (a), the effect of the locus on phenotype variance (u), and the effect of the covariate on phenotype variance (g). Values assigned to these parameters are listed in Table 1. The rationale for selecting values of a and u was as follows: 1. pure mQTL: The effect size of the pure mQTL was chosen so that it always explains 5% of the phenotype variance, which is consistent with smaller effect sizes typically sought and identified in QTL mapping experiments. Such an mQTL is detectable with approximately 70% power at a 5% false positive rate by the traditional mQTL test (the standard linear model) when 300 individuals are simulated, a typical population size for QTL mapping experiments. 2. pure vQTL: vQTL analysis is much less established, so the vQTL effect size was chosen to match the detectability of the mQTL. Thus, the vQTL effect size was defined such that the traditional vQTL test (Levene's test) has 70% power at 5% FPR in a population of 300 individuals in the absence of BVH. 3. mixed mvQTL: The mvQTL effect sizes were chosen such that the mean and variance signals are equally detectable, and the aggregate signal is detectable by Cao MV and DGLM MV with 70% power at an FPR of 5% in a population of 300 individuals in the absence of BVH.
The values of g used for simulating BVH were 0 ¼ ½0; 0; 0; 0; 0 and g BVH ¼ ½20:4; 2 0:2; 0; 0:2; 0:4. The former chosen to ensure constant residual variance for simulations where BVH is absent; the latter to mirror the extent of BVH we noted in experimental data, while having a concise expression as equally spaced effects centered at zero. In null locus and mQTL simulations, g BVH results in group-wise standard deviations of approximately ½0:67; 0:82; 1:00; 1:22; 1:49. In vQTL and mvQTL simulations, g BVH and u combine additively on the log standard deviation scale and result in fifteen unique variances as detailed in the Supplementary Materials.

Phenotype simulation
For each of the eight scenarios, we conducted 10; 000 simulated experiment. For scenario s, the phenotype for individual i, denoted y i , was simulated from a normal distribution based on the genotype and covariate (q i and x i ) and the scenario parameters (a s , u s , and g s ) as: where m i ¼ q i a s , and (Further details in Supplementary Materials.)

Testing significance
To each simulated experiment, eight tests were applied, and four procedures were used to assess the statistical significance of each test, for a total of 32 test-procedures. The eight tests comprise three tests for detecting mQTL: SLM, Cao M , and DGLM M ; three for detecting vQTL: Levene's test, Cao V , DGLM V ; and two for detecting mvQTL: Cao MV and DGLM MV . These tests are detailed in the Statistical Methods and summarized in Table 2.
The four procedures for evaluating the statistical significance of results were: standard, RINT, residperm, and locusperm, as described in the Statistical Methods. The RINT procedure was selected because it returns any phenotype distribution, no matter how exotic, to a standard normal distribution. The fact that it is commonly used in genetics research demands that its properties, and its effects on QTL mapping, be better understood. The residperm was selected because it was recently proposed for use in mQTL, vQTL, and mvQTL mapping studies (Cao et al. 2014). The locusperm procedure was developed in response to suspected shortcomings of the above robustifying procedures.

Evaluation of tests and procedures
Tests and procedures for assessing statistical significance were evaluated based on their empirical false positive rate (FPR) and power at a nominal FPR of 0.05. The empirical FPR of a given test-procedure combination in a given scenario was taken as the fraction of null simulations (where the phenotype was simulated with no dependence on genotype) that resulted in p , 0:05. Similarly, the empirical power was computed as the fraction of non-null simulations that resulted in p , 0:05. These quantities are naturally considered as estimates of a binomial proportion, so their standard errors were calculated by the method of Clopper and Pearson (1934).
The above evaluations focused only on the cutoff of p ¼ 0:05. Also considered, however, were all possible cutoffs, using QQ plots and ROC plots, which allow examination of the empirical FPR as a function of nominal FPR and the empirical power as a function of empirical FPR, respectively; these illustrate the spectrum of trade-offs that each test makes available, but do not meaningfully change the overall interpretation of the results, so we relegate them to the Supplementary Materials.

DATA AND SOFTWARE
Leamy et al. summary of original study Leamy et al. (2000) backcrossed mice from strain CAST/Ei, a small, lean strain, into mouse strain M16i, a large, obese strain. Nine F1 males were bred with 54 M16i females to produce a total of 421 offspring (208 female, 213 male), which were genotyped at 92 microsatellite markers across the 19 autosomes and phenotyped for body composition and morphometric traits. We retrieved all available data on this cross, which included marker genotypes, covariates, and eight phenotypes (body weight at five ages, liver weight, subcutaneous fat pad thickness, and gonadal fat pad thickness), from the Mouse Phenome Database (Grubb et al. 2014), and estimated genotype probabilities at 2cM intervals across the genome using the hidden Markov model in R/ qtl (Broman et al. 2003).
This mapping population has been studied for association with several phenotypes: asymmetry of mandible geometry (Leamy et al. 2000), limb bone length (Leamy et al. 2002;Wolf et al. 2006), organ weight (Leamy et al. 2002;Wolf et al. 2006;Yi et al. 2006), fat pad thickness (Yi et al. 2005(Yi et al. , 2006(Yi et al. , 2007, and body weight (Yi et al. 2006). The most relevant prior study to this reanalysis, Yi et al. (2006), used standard methods to identify QTL for body weight at three weeks on chromosomes 1 and 18. However, we were not able to reproduce this result, despite following their analysis as described.
Availability of data and software Analyses were conducted in the R statistical programming language (R Core Team 2017). The simulation studies used the implementation of the standard linear model from package stats, Levene's test from car, Cao's tests as published in Cao et al. (2014) and the DGLM tests in package dglm. Files S1, S2, and S3 contain the R scripts necessary to replicate the simulation studies and their analysis, relying on the plotROC package to make ROC plots (Sachs 2017). File S4 contains the data from Leamy et al. (2000) that was reanalyzed. File S5 contains the attempted replication of the original analysis (Yi et al. 2006) and file S6 contains the new analysis, using package vqtl .
The reanalyzed dataset is available on the Mouse Phenome Database (Grubb et al. 2014)

RESULTS
Simulation study on single locus testing Simulations were performed to examine the ability of the eight tests listed in Table 2 to detect nonzero effects belonging to their target QTL types (mQTL, vQTL, mvQTL), and to control the number of false positives when no such QTL effects were present. Simulations were conducted in the presence and absence of background variance heterogeneity (BVH), and for each test, with p-values calculated by each of the four significance assessment procedures (standard, RINT, residperm, locusperm). The full combination of settings is listed in Table 3, which also lists results pertaining to a nominal FPR of 0.05, and described in more detail in Simulation Methods section.
n Table 1 Eight scenarios were simulated, as determined by the values of three parameters. a indicates the additive effect of the locus on phenotype mean, u the additive effect of the locus on phenotype variance, and g the effect of the covariate on phenotype variance. The two possible values of g are 0 ¼ ½0; 0; 0; 0; 0 and g BVH ¼ ½20:4; 2 0:2; 0; 0:2; 0:4   (Figure 1 and Table 3). DGLM M was slightly anti-conservative under the standard and RINT procedures with FPR = 0.057 and 0.055, respectively. With either permutation procedure used to assess significance, however, DGLM M accurately controlled FPR. SLM and Cao M had indistinguishable power in the detection of mQTL under all significance assessment procedures (Figure 2). DGLM M , however, had equal power to those tests only under the standard and RINT procedures, which have inflated FPR. Under the permutation-based procedures, DGLM M was less powerful than the other test-procedures.
These results reflect the reality that, when a simple model is exactly true, a more elaborate model tends to be less powerful. Additionally, they highlight the capability of the permutation-based procedures to accurately control FPR even when the standard and RINT procedures fail to do so (as in the case of DGLM M ).
Testing for mQTL with BVH present: DGLM M dominates: SLM and Cao M accurately controlled FPR under all four procedures to assess statistical significance ( Figure 1). As in the absence of BVH, DGLM M exhibited a slightly inflated FPR under the standard and RINT procedures (0.059 and 0.057, respectively), but accurately controlled FPR under the permutation-based procedures (Table 3).
Under all four procedures, DGLM M was more powerful than SLM and Cao M (Figure 2). The two procedures under which DGLM M accurately controlled FPR had power of 0.822 and 0.818, greatly exceeding the power of Cao M and SLM, which were in the range [0.694, 0.719] (Table 3).
Based on the results of these simulations, DGLM M -residperm and DGLM M -locusperm are the recommended test-procedure combinations for mQTL testing in the presence of BVH.
For each mQTL test-procedure combination, the AUC (Table S1), standard error of the positive rate at a ¼ 0:05 (Table S2), QQ plots illustrating the empirical FPR at each nominal FPR level ( Figure S4), and ROC curves illustrating the spectrum of trade-offs between available FPR and power ( Figure S1) are provided in the Supplementary Materials.
Testing for vQTL with BVH absent: Cao V and DGLM V outperform Levene's test: In the absence of BVH, all vQTL tests had nearly-accurate n Table 3 Empirical positive rates of all tests under all significance assessment procedures in all scenarios based on 10,000 simulations, 1,000 permutations each to estimate empirical null distributions (residperm and locusperm), and a nominal false positive rate (FPR) of a ¼ 0:05. Entries in column 1 and 5 through all rows, columns 3 and 7 in the top third, and columns 2 and 6 in the middle third represent empirical FPR. Where the empirical FPR is within one standard error of the nominal FPR of 0.05, it is written in normal font. Where it is overly conservative, it is underlined. Where it is anti-conservative, it is in boldface. The entries in the rest of the table represent power. Given the sample size of 10,000, the standard error for the values in this table are all between 0.005 and 0.01. Generally, values near 0.05 have a standard error near 0.005 and values near 0.5 have a standard error near 0.01. All standard errors are listed in Table S2 and    FPR control (Figure 1). All tests had FPR within one standard error of 0.05 under both empirical significance assessment procedures (Table 3  and Table S2) But under either asymptotic procedure, Levene's test was slightly conservative. And Cao V and DGLM V were both slightly anticonservative under the standard procedure and conservative under the RINT procedure. Despite the variation in FPR control among the test-procedure combinations, Cao V and DGLM V had more power to detect vQTL than Levene's test under all procedures. Specifically, under the well-calibrated (empirical) procedures, Cao V and DGLM V had power in the range [0.698, 0.721], while under those same well-calibrated (empirical) procedures, Levene's test had power in the range [0.664, 0.665] (Table 3).
Thus, in the specific situations simulated here, the empirical procedures of Cao V and DGLM V are the preferred vQTL tests in the absence of BVH. The additional power of Cao V and DGLM V relative to Levene's test is consistent with the fact that they make strong parametric assumptions that are exactly true in these simulations and Levene's test does not.
Testing for vQTL with BVH present: DGLM V outperforms Levene's test and Cao V : In the presence of BVH, there were three test-procedure combinations with major departures from accurate FPR control ( Figure 3). Cao V under the standard procedure was drastically anticonservative with FPR of 0.135 (Table 3). DGLM V under both the RINT and residperm procedures was drastically conservative with FPR of 0.023 and 0.020, respectively. Additionally, DGLM V under the standard procedure was moderately anti-conservative with FPR of 0.058. The remaining test-procedure combinations accurately controlled FPR, namely Levene's test under all procedures, Cao V under the RINT, residperm, and locusperm procedures, and DGLM V under the locusperm procedure.
Of the tests that accurately controlled FPR, DGLM V under the locusperm procedure was uniquely powerful, with power of 0.708, while the others had power in the range [0.539, 0.576] (Figure 3 and Table 3).
Direct interpretation of these results might lead one to consider the trade-off between DGLM V -standard and DGLM V -locusperm. DGLM V -locusperm requires considerable computational effort and serves only to reduce the FPR from a modestly-inflated level of 0.058 to accurate control at 0.052. Application of the (computationally non-intensive) DGLM V -standard, however, comes with a caveat: if there were some additional, unknown (and therefore unmodeled) BVH-driving factor, DGLM V -standard would be anti-conservative anti-conservative-similar to Cao V -standard with BVH present. The locusperm procedure, in contrast, ensures accurate FPR control whether all BVH-driving factors are modeled (as in DGLM V ) or not (as in Cao V ). DGLM V -locusperm therefore emerges as the most robust test-procedure for vQTL mapping in the presence of BVH.
For each vQTL test-procedure combination, the AUC (Table S1), standard error of the positive rate at a ¼ 0:05 (Table S2), QQ plots illustrating the empirical FPR at each nominal FPR level ( Figure S5), and ROC curves illustrating the spectrum of trade-offs between available FPR and power ( Figure S2) are provided in the Supplementary Materials.
Testing mvQTL with BVH absent: Cao MV and DGLM MV similar: Continuing the pattern from the vQTL tests, in the absence of BVH most mvQTL tests accurately control FPR (Figure 1). The exceptions are similar to the vQTL tests as well, with Cao MV -RINT slightly conservative and DGLM MV -standard slightly anti-conservative (Table 3).
The standard version of Cao MV and DGLM MV were similarly powerful (Figure 4), both exceeding the power of the other mvQTL testprocedures.
Testing mvQTL with BVH present: DGLM MV dominates Cao MV : In the presence of BVH, Cao MV accurately controlled FPR with the RINT, residperm, and locusperm procedures, whereas DGLM MV did so only under the locusperm procedure (Figure 1).
Of the test-procedure combinations that accurately controlled FPR, DGLM MV -locusperm was the most powerful with power of 0.790 as compared to the others in the range [0.632, 0.650].
As with the vQTL tests, the DGLM MV -standard is attractive is terms of computational effort and good statistical properties, but it is expected to have drastically inflated FPR in the presence of any unmodeled BVH-driving factor, similar to Cao MV -standard. DGLM MV -locusperm therefore emerges as the most robust test-procedure for mvQTL testing.
For each mvQTL test-procedure combination, the AUC (Table S1), standard error of the positive rate at a ¼ 0:05 (Table S2), QQ plots illustrating the empirical FPR at each nominal FPR level ( Figure S6), and ROC curves illustrating the spectrum of trade-offs between available FPR and power ( Figure S3) are provided in the Supplementary Materials.
In the presence of BVH, the rank-based inverse normal transformation fails to correct anti-conservative behavior of DGLM M and overcorrects that of DGLM V and DGLM MV : A consistent feature of the simulations involving detection of variance effects, whether vQTL or mvQTL, is that FPR control and power is affected, for better or worse, by applying the RINT to the response.
In the presence of BVH, DGLM M under the standard procedure was anti-conservative (FPR = 0.059 at a ¼ 0:05). The RINT procedure had little efficacy in returning this test to accurate FPR control (FPR = 0.057).
In the case of vQTL detection in the presence of BVH, Cao V under the standard procedure had a drastically inflated FPR (0.135) and the RINT procedure slightly over-corrected it (FPR = 0.046). Similarly, the RINT procedure disrupted DGLM V , which was modestly anti-conservative under the standard procedure, causing overly conservative behavior (FPR = 0.023).
As always, in the presence of BVH, the mvQTL tests exhibited a mixture of the patterns observed in mQTL tests and vQTL tests. Both Cao MV and DGLM MV were anti-conservative under the standard procedure, illustrating their relations to Cao V and DGLM M respectively. In the case of Cao MV , the RINT procedure corrected the FPR, but in in the case of DGLM MV , it resulted in an over-correction into the realm of over conservatism (FPR = 0.049 and 0.038 respectively).
In summary, the RINT procedure was unhelpful in the context of the DGLM M : it did not repair the modest FPR inflation present under the standard procedure. But, in the context of vQTL testing with BVH, it had one useful and important property: pre-processing the phenotype with the RINT, led to vQTL tests that were conservative rather than anti-conservative, decreasing the probability of false positives at the expense of false negatives.

Genomewide reanalysis of bodyweight in Leamy et al. backcross
To understand the impact of BVH on mean and variance QTL mapping in real data, we applied both traditional QTL mapping, using SLM, and mean-variance QTL mapping, using Cao's tests and the DGLM, to body weight at three weeks in the mouse backcross dataset of Leamy et al. (2000).
Analysis with traditional QTL mapping identifies no QTL: We first used a traditional, linear modeling-based QTL analysis, with sex and father as additive covariates and genomewide significance based on 1000 genome permutations (Churchill and Doerge 1994). Although sex was found not to be a statistically significant predictor of body weight (p ¼ 0:093 by the likelihood ratio test with 1 degree of freedom), it was included in the mapping model because, based on the known importance of sex in determining body weight, any QTL that could only be identified in the absence of modeling sex effects would be highly questionable. Father was found to be a significant predictor of body weight in the baseline fitting of the SLM (p ¼ 9:6 · 10 25 by the likelihood ratio test with 8 degrees of freedom) and therefore was included in the mapping model.
No associations rose above the threshold that controls family-wise error rate to 5% ( Figure 5, green line). One region on the distal part of chromosome 11 could be considered "suggestive" with FWER-adjusted p 0:17.
To test the sensitivity of the results to the inclusion/exclusion of covariates, the analysis was repeated without sex as a covariate, without father as a covariate, and with no covariates. No QTL were identified in any of these sensitivity analyses.
Analysis with Cao's tests identifies no QTL: The same phenotype was analyzed with Cao's tests, again including sex and father as mean covariates, and using the genome permutation procedures described in Statistical Methods were used to control FWER. No statistically significant mQTL, vQTL, nor mvQTL were identified ( Figure S10).
Analysis with DGLM-based tests identifies an mQTL: The same phenotype was analyzed with the DGLM-based tests. In a baseline fitting of the DGLM, sex was found not to be a statistically significant predictor of mean or residual variance (mean effect p ¼ 0:18, variance effect p ¼ 0:22, and joint p ¼ 0:19 by the LRT with 1, 1, and 2 d.f.). But father was found to be a statistically significant predictor of both mean and variance (mean effect p ¼ 2:0 · 10 27 ; variance effect p ¼ 1:8 · 10 211 , and p ¼ 4:8 · 10 214 by the LRT with 8, 8, and 16 d.f.). Therefore, following the same reasoning as in the mean model described above, both sex and father were included in the mapping model Figure 1 Empirical false positive rate (FPR) of all tests and significance assessment procedures at a nominal FPR of 0.05, as assessed through simulation of non-associated loci and phenotypes both with and without BVH. Dot indicates point estimate and line indicates 95% confidence interval. The vertical line indicates the ideal empirical FPR of 0.05. Some test-procedure combinations led to FPR outside the plotted range. In such cases the FPR is written on the left edge of the plotting area if the value was too low to plot, and the right edge if it was too high. An un-zoomed version of this plot is available in Figure S7. as covariates of both the mean and the variance. As with the other tests, the genome permutation procedures described in Statistical Methods were used to control FWER.
A genomewide significant mQTL was identified on chromosome 11 ( Figure 5, blue line). The peak was at 69.6 cM with FWER-adjusted p ¼ 0:011; with the closest marker being D11MIT11 at 75.7 cM with FWER-adjusted p ¼ 0:016: Nonparametric bootstrap resampling, using 1,000 resamples (after Visscher et al. 1996), established a 90% confidence interval for the QTL from 50 to 75 cM. This region overlaps with the "suggestive" region identified in the traditional analysis.
By the traditional definition of percent variance explained, following from a fitting of the standard linear model, this QTL explains 2.1% of phenotype variance. Though, given the variance heterogeneity inherent in the DGLM that was used to detect this QTL, this quantity is better considered the "average" percent variance explained. The ratio of the QTL variance to the sum of QTL variance, covariate variance, and residual variance ranges from 1 to 6% across the population, based on the heterogeneity of residual variance.
Understanding the novel QTL: The mQTL on chromosome 11 was identified by the DGLM M test, but not by the standard linear model or Cao's mQTL test. The additional power of the DGLM M test over these other tests relates to its accommodation of background variance heterogeneity (BVH).
Specifically, the DGLM reweighted each observation based on its residual variance, according to the sex and F1 father of the mouse. This BVH is visually apparent when the residuals from the standard linear model are plotted, separated out by father ( Figure 6). Some fathers, for example fathers 2 and 7, appear to have offspring with less residual variance than average, whereas others, for example father 1, seem to have offspring with more residual variance than average. The DGLM captured these patterns of variance heterogeneity, and estimated the effect of each father on the log standard deviation of the observations (Figure 7). Based on these estimated variance effects, observations were upweighted (e.g., fathers 2 and 7) and downweighted (e.g., father 1). This weighting gave the DGLM-based mapping approach more power to reject the null as compared with the SLM.
Other phenotypes: For brevity, we described in detail only the results of the DGLM-based analysis of body weight at three weeks; but, of the eight phenotypes from this cross available on the Mouse Phenome Database, the mean-variance approach to QTL mapping discovered new QTL in four. Five of the eight phenotypesbody weight at twelve days, three weeks, and six weeks, as well as subcutaneous and gonadal fat pad thicknessexhibited BVH due to father, and for each we performed both traditional QTL mapping using the SLM and mean-variance QTL mapping using the DGLM. This reweighting changed the results in three cases: For body weight at three weeks ( Figure S15) and six weeks ( Figure S16), we identified one new mQTL and two new vQTL respectively. For subcutaneous fat pad thickness, we discovered one mQTL and "undiscovered" one mQTL ( Figure S17). That is, after reweighting the observations based on the observed variance of each father, one locus that was overlooked by SLM was identified as an mQTL and one locus that was identified by SLM as an mQTL was no longer found to have a statistically significant association with the phenotype.

DISCUSSION
Since the recognition that variance effects can be attributable to individual genes, a growing body of research has asked questions about the prevalence of such effects , their evolutionary origins (canalization, robustness), their ramifications (decanalization in disease, increased variation) (Gibson 2009;Freund et al. 2013;Lin et al. 2016), and how the identification of such genes can provide a signal of, and thereby serve as a route to identify, higher order interactions such as epistasis or GxE (Struchalin et al. 2010;Rönnegård and Valdar 2012;Forsberg and Carlborg 2017). These studies have promoted detection of variance heterogeneity as path to new biological discovery. But less attention has been paid to this corollary: if a phenotype is subject to variance-controlling factors, then, whether or not identifying those factors is of direct interest, they will induce background variance heterogeneity that can affect inference of more standard targets, including mean-affecting QTL. In other words, interest in identifying sources of BVH may be of most interest to a subset of researchers, but interest in accommodating it should be more widespread.  Our simulation studies showed that modeling BVH when it is present increases power to detect mQTL, vQTL and mvQTL. Our reanalysis of the Leamy et al. dataset demonstrated that accommodating BVH can lead to detection of mQTL that would otherwise be overlooked.
In both cases, of the methods compared, the most powerful were those based on the DGLM, with the most robust versions of those using the locusperm significance procedure. These results should not be too surprising. The DGLM was the only method examined that can accommodate variance effects arising from both the locus and from other covariates; and the locusperm method (and genomeperm, its genomewide analog) is least reliant on parametric assumptions. We would expect other methods that allow flexible modeling of covariate effects on variance to be competitive in these regards, e.g., the recent Bayesian hierarchical model of Dumitrascu et al. (2018).
Beyond advocating any particular method, however, our results can be used to draw attention to a number of more general points about 1) the relationship between increased residual variance, observation weighting and downstream inference; 2) how knowledge of variance effects can be exploited in experimental design, analysis and reanalysis; 3) the sensitivity of variance effect detection to distributional assumptions and how this can be mitigated by strategies such as variable transformation or permutation; and, 4) how to report quantitative genetic parameters under heteroskedasticity.
Residual variance, weighting, and inference for mean effect QTL The additional power of mean-variance QTL mapping to detect mQTL in general-and of DGLM M to detect mQTL in the presence of BVH in particular-can be seen as deriving from how data are reweighted. Consider heteroskedastic data modeled as y i Nðm i ; s 2 =w i Þ, with known weights w 1 ; . . . ; w n and known baseline variance s 2 .
The log-likelihood can be written as ℓ ¼ const 2 WRSS=2s 2 , such that the key quantity to be minimized in a maximum likelihood fit is the weighted residual sum of squares, that is, the squared discrepancies between the observed phenotype y i and its predicted value m i , weighted by w i . The weights therefore affect how much, relatively speaking, each data point contributes to the likelihood: highly imprecise measurements, such as from individuals whose phenotypes are expected to have high variance, have low weight and diminished contribution, whereas as more precise measurements are correspondingly upweighted. In the DGLM, the weight of each observation is determined in the model-fitting process based on the phenotype, the experimental covariates, and the QTL genotype. In the SLM, weights can be specified, but they cannot be co-estimated with covariate and QTL effects. The improvement of the DGLM over the SLM and Cao M under BVH stems entirely from its greater ability to capture this additional information, and thereby give more credence to phenotype values that are more precise. We note a related approach to correcting inference of mean effects in the face of heteroskedasticity not considered here is the use of heteroskedastic consistent covariance matrix estimators (HCCMEs) [Long and Ervin (2000) and refs therein]. Also known as "sandwich" estimators, these use estimated residuals from the SLM to characterize heteroskedasticity empirically and thereby estimate adjusted, heteroskedastic-consistent versions of the effect standard errors. Importantly, HCCMEs do not require the source of heteroskedasticity to be identified, and they have seen recent use in genetic association [e.g., Barton et al. (2013); Rao and Province (2016)]. However, this comes at a cost: when a variable that does predict heteroskedasticity can be identified, Figure 5 FWER-controlling association statistic at each genomic locus for body weight at three weeks. The linear model (green, "traditional") does not detect any statistically-significant associations. The mQTL test takes into account the heterogeneity of both mean and variance due to which F1 male fathered each mouse in the mapping population and detects one mQTL on chromosome 11. Figure 6 Residuals from the standard linear model for body weight at three weeks, with sex and father as covariates, stratified by father. It is evident that fathers differed in the residual variance of the offspring they produced. For example, the residual variance of offspring from fathers 1 and 2 is less than that of offspring from fathers 8 and 9. Here, points are colored by their predicted residual variance in the fitted DGLM with sex and father as mean and variance covariates.
HCCMEs will tend to be inefficient compared with a model-based estimator (Wakefield 2013), such as the DGLM.

Implications for experimental design, analysis and reanalysis
The possibility that some individuals could be predictably more variable than others has clear implications for experimental design. A key parameter in the design of experiments is the number of replicates, typically specified to provide adequate precision of, and thereby power to detect, an estimated effect. But foreknowledge that residual variance will differ for certain groups suggests a more nuanced approach that explicitly weighs replicates against intrinsic variability.
For example, when designing an experiment on a population that happens to have a known, segregating vQTL that is not itself the focus of interest but would induce BVH, it may be preferable to allocate a disproportionate share of the replication to individuals in the highvariability genotype class. In such cases, it then becomes additionally helpful to understand at what level(s) the heterogeneous variance manifests. Specifically, increased variability could arise from greater between-individual variation or greater within-individual variation [cf more levels of variability described in Table 1 of Rönnegård and Valdar (2011)]; whereas the between-individual case warrants additional biological replicates, the within-individual case could be addressable (potentially more cheaply) with additional technical replicates.
Alternatively, the recognition that some individuals are predictably high variance may be a reason to exclude them entirely, or, more generally, to opt for conditions and population subsets for which residual variance is predicted to be minimal. If such a variance-minimizing population can be achieved without changing the genetic effects present, it would have an improved signal-to-noise ratio and provide better power to detect genetic effects.
A more standard situation is that a vQTL (or other BVH factor) is not recognized until the experiment is first analyzed. In this case, it would make sense to perform a re-analysis, with the vQTL included as a variance-affecting covariate. Doing so should increase power to detect both mQTL and other vQTL.
vQTL mapping: pros and cons of the rank inverse normal transformation The presence of BVH can be disruptive to a test for a vQTL. A simplistic test compares a heteroskedastic alternative model with a homoskedastic null. BVH confuses the comparison by making the true null heteroskedastic. In doing so, it increases the false positive rate for asymptotic tests that disregard BVH and reduces power when FPR is empirically controlled (see, e.g., Cao V results in Table 3).
In this context it is therefore interesting to consider the crude-but often used-device of the rank inverse normal transformation. The RINT reshapes away any kurtosis (fatter tails), a key signature of heteroskedasticity, without any reference to its source. As such, it is logical that in the detection of vQTL it would have both beneficial and harmful properties.
In the case where there is no known driver of BVH, represented by the simulations examining Cao V , the RINT procedure acts as an insurance policy: if there truly is no BVH, the test suffers a modest decrease in power; but if there truly is BVH from an unknown source, it averts the drastic FPR inflation under the standard (i.e., non-empirical) p-value procedure.
In the case where researchers are confident that, after correcting for known BVH drivers, the residuals are homoskedastic (represented by the DGLM V simulations), the RINT procedure is unnecessary, costing power with its conservatism in the absence of BVH and paradoxically creating even more conservative behavior in the presence of BVH.
The aforementioned disadvantages of RINT, however, assume the phenotype data has an underlying normal distribution, either as given or after a deducible transformation [e.g., via the Box-Cox procedure or similar; (Box and Cox 1964)]. When the data are highly non-normal, both the RINT and the locusperm procedure would provide valid inference, and perhaps the most robust approach would be to use the two in combination. Nonetheless, where normality approximately holds, whether as given or after a simple transformation, we strongly prefer the locusperm procedure without RINT: across all simulation scenarios it exhibited at worst slight conservatism when applied to DGLM-based tests and represents a useful step toward FWER control.

Permutation schemes for other populations
Our preferred permutation scheme, locusperm (or its genomewide equivalent, genomeperm), is applicable to populations in which genotypes under the null are exchangeable. As such, it holds not only for F2 and backcrosses but also, for example, in approximately equally-related recombinant inbred line panels such as the Collaborative Cross and other similar replicable multiparent populations. For example, in the (mQTL) study of Mosedale et al. (2017), the use of locus genotypes (or genotype probabilities) would simply be replaced by founder haplotypes that could then be randomly exchanged across lines.
In non-exchangeable populations, however, such as those requiring polygenic random effect terms [e.g., Kennedy et al. (1992)], although the DGLM could be applied via its random effects generalization, DHGLM (Felleki et al. 2012), the permutation scheme may need revision. In Figure 7 The predictive mean and standard deviation of mice in the mapping population based on father and genotype at the top marker, D11MIT11 on chromosome 11. The genotype effect, illustrated by the colored ribbons is almost entirely horizontal, indicating a difference in means across genotype groups but no difference in variance, consistent with the identification of this QTL as a pure mQTL. The father effects, illustrated by the spread of colored crossbars, have both mean and variance components. For example, father 1 (red) has the highest predictive mean and lowest predictive standard deviation. His offspring were upweighted in the QTL analysis based on their low standard deviation. Father 9 (pink) has an average predictive mean and the highest predictive standard deviation. His offspring were downweighted in the QTL analysis based on their high standard deviation. Note: the effect of sex on phenotype mean and variance was modeled, then marginalized out for readability. particular, a permutation scheme in which all permutations are equally likely may not comport with a reasonable null, and it may be more appropriate to allocate higher probabilities to permutations that preserve overall genetic similarity (Abney 2015;Roach and Valdar 2018;Berrett et al. 2018). Although we not have a specific solution, we suspect that the necessity of such revisions, at least for the DGLM V test, will depend on the extent to which the observed heteroskedasticity is polygenic.

Percent variance explained
Variance heterogeneity complicates the notion of percent variance explained (PVE) by a QTL. Assuming the QTL has the same effect on the expected value of the phenotype of all individuals, it will explain a larger percent of total variance for individuals with lower than average residual variance, and vice versa for individuals with higher than average residual variance. In light of this observation, the percent variance explained can either be reported as "average percent variance explained" or can be calculated for some representative sub-groups. For example, if there is variance heterogeneity across sexes, it would be reasonable to report the PVE of a QTL for both males and females, or if a vQTL is known to be present elsewhere in the genome, report the PVE for each vQTL genotype as in Yang et al. (2012).

Guidelines for detecting and QTL mapping in the presence of BVH
To select the right test and procedure to assess significance, it is important to establish whether there is any BVH present. We advocate fitting the DGLM with all potential BVH drivers as variance covariates, then including any that are statistically significant as variance covariates in the mapping model to improve power to detect QTL. Then, given that 1. The DGLM-based tests dominate all other tests in the presence of BVH, 2. the locusperm procedure accurately controls the FPR of the DGLM-based tests in the presence of BVH, whether the source is known or not, and 3. the locusperm procedure can be extended into the genomeperm procedure to control FWER, we advocate for the analysis of experimental crosses that exhibit BVH with the three DGLM-based tests (DGLM M , DGLM V , and DGLM MV ) and, where the individuals in the population are exchangeable (as in an F2 or backcross) or where partial exchangeability can be suitably identified [e.g., see (Churchill and Doerge 1994;Zou et al. 2006;Churchill and Doerge 2008)], the use of our described genomeperm procedures, which permute the genome in selective parts of the model, to assess genomewide significance. Because this procedure involves three families of tests rather than one family as would be typical with an SLM-based analysis, an additional correction may be desired to control experiment-wise error rate. DGLM M and DGLM V are orthogonal tests (Smyth 1989), but DGLM MV is neither orthogonal nor identical to either, so the effective number of families is between two and three. One reasonable, heuristic approach to control experiment-wise error rate is simply to lower the acceptable FWER, e.g., replacing the standard 0.05 with 0.02.

Conclusion
In summary, we demonstrate the effect of BVH on QTL mapping of both mQTL and vQTL, and the value of accommodating it using the DGLM. In doing so, we propose a standard protocol for mapping mQTL, vQTL and mvQTL in standard genetics crosses.

CALCULATION OF AN ADDITIVE EFFECT TO EXPLAIN A GIVEN PROPORTION OF TOTAL VARIANCE IN AN F2 INTERCROSS
The variance attributable to a genetic factor with alleles (AA, AB, BB) at frequency (0.25, 0.5, 0.25), additive effect a and no dominance effect is: For a genetic factor that explains a fraction p of total phenotype variance: Combining and solving for a gives a ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2ps 2 =ð1 2 pÞ p .