Accounting for selection and correlation in the analysis of two-stage genome-wide association studies

The problem of selection bias has long been recognized in the analysis of two-stage trials, where promising candidates are selected in stage 1 for confirmatory analysis in stage 2. To efficiently correct for bias, uniformly minimum variance conditionally unbiased estimators (UMVCUEs) have been proposed for a wide variety of trial settings, but where the population parameter estimates are assumed to be independent. We relax this assumption and derive the UMVCUE in the multivariate normal setting with an arbitrary known covariance structure. One area of application is the estimation of odds ratios (ORs) when combining a genome-wide scan with a replication study. Our framework explicitly accounts for correlated single nucleotide polymorphisms, as might occur due to linkage disequilibrium. We illustrate our approach on the measurement of the association between 11 genetic variants and the risk of Crohn's disease, as reported in Parkes and others (2007. Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Nat. Gen. 39(7), 830–832.), and show that the estimated ORs can vary substantially if both selection and correlation are taken into account.


INTRODUCTION
The two-stage "learn and confirm" strategy is widely implemented across biomedical research. For example, in the sphere of randomized clinical trials, the vast majority of adaptive trial designs follow a two-stage strategy, where design adaptations are made at the first interim analysis, with the trial then proceeding under a fixed design until its completion. See Bauer and others (2016), Bretz and others (2009), and Posch and others (2005) for a thorough review of the literature.
A key application has been in two-stage adaptive seamless trials incorporating mid-trial treatment selection (Bowden and Glimm, 2014;Kimani and others, 2013;Sampson and Sill, 2005). Typically, multiple experimental treatments are simultaneously compared with a control in stage 1, and the most promising treatment is selected for confirmatory analysis in stage 2.
In biomedical science more generally, there is an increasing focus on the identification of genetic markers predictive of phenotypic variation, in order to better understand disease etiology and to target future drug development. The design of choice is the genome-wide association study (GWAS), in which upwards of one million single nucleotide polymorphisms (SNPs) can be tested simultaneously for their association with a given phenotype. Those SNPs achieving "genome-wide" significance (typically defined as a p-value of < 10 −8 ) are then validated in a replication study (Ioannidis and others, 2009).
A whole variety of formal two-stage design procedures have been proposed in this setting; see, for example, Satagopan and Elston (2003), others (2004, 2002), Thomas and others (2004), with an overview provided in Thomas and others (2009). In practice, the two-stage GWAS approach has been successfully applied to many disease areas, including sclerosis (Chiò and others, 2009), epilepsy (Guo and others, 2012), and breast cancer (Sapkota and others, 2012).
However, ranking and selecting candidates in a two-stage trial can induce bias in the final estimates of the parameter of interest. Indeed, as a general rule, the more aggressive the selection, the worse is the bias (Bauer and others, 2010). Specifically, a candidate has to perform "well" in stage 1 in order to proceed to stage 2, which often leads to overly optimistic estimates, or the so-called "winner's curse". When there is heavy selection from many candidates of roughly equivalent parameters, then the selected candidate is typically based on chance variability rather than true superiority (Sill and Sampson, 2007). All this calls into question the generalizability of the trial's findings to future patients.
The need to correct for this selection bias in the analysis of two-stage trials has long been recognized. An efficient and unbiased estimator can be obtained through the technique of Rao-Blackwellization. This involves taking the unbiased stage 2 data and conditioning on a complete, sufficient statistic for the parameter in question. By the Lehmann-Scheffé theorem, this gives the uniformly minimum variance conditionally unbiased estimator (UMVCUE).
This two-stage framework was introduced by Cohen and Sackrowitz (1989), who derived the UMVCUE for the selected mean, where the stage 1 populations are independent and normally distributed. The method lends itself naturally to estimation in two-stage adaptive seamless phase II/III clinical trials (Bowden and Glimm, 2008;Kimani and others, 2013), with equivalent work carried out in biomarker research (Pepe and others, 2009;Koopmeiners and others, 2012;Robertson and others, 2015). In the GWAS setting, Bowden and Dudbridge (2009) extended the Cohen and Sackrowitz method to derive the UMVCUE for the SNP-disease log odds ratios (ORs), where either a one-sided or two-sided p-value test is used for ranking the SNPs.
A crucial assumption in all of the above analyses is that the stage 1 population parameter estimates are independent random variables. However, this may not be a reasonable assumption to make. For example, in the GWAS setting, SNPs in the same chromosomal region may be in linkage disequilibrium (LD).
The aim of this paper is to present general theory to enable the calculation of UMVCUEs for each setting that explicitly accounts for correlation. We achieve this by deriving the UMVCUE in the multivariate normal setting with an arbitrary known covariance structure. We illustrate the utility of the new theory using an application in the GWAS setting.
Since the focus of this paper is on point estimation, our primary criterion for comparing the performance of different estimators is the magnitude of the bias. Of course, bias is not the only criterion with which to judge an estimator, especially considering the bias-variance trade-off. However, it is important to quantify the bias of an estimator, not least due to regulatory concerns. As a secondary criterion, we also compare the mean squared error (MSE) of the estimators, which takes into account the variance. In practice, the relative importance attached to the criterion of bias and MSE will vary from setting to setting.
In Section 2, we describe the general model framework and derive the form of the multivariate normal UMVCUE. We then present a short simulation study in Section 3 for the bivariate normal case. We extend our framework to estimating ORs in genome-wide scans in Section 4, and apply the resulting UMVCUE to data reported in Parkes and others (2007). We conclude with a discussion in Section 5.

Model description
Consider the following two-stage trial design. Suppose that we have K correlated continuous stage 1 parameter estimates, defined by X = (X 1 , . . . , X K ), for the candidate treatments, biomarkers or genetic variants C 1 , . . . , C K . Let X follow a multivariate normal distribution, defined by X ∼ N (µ, V ), where µ is the vector of (unknown) means and V = (V i j ), i, j ∈ {1, . . . , K } is the covariance matrix. We assume that V is known and is positive definite. For notational convenience, we let V ii = σ 2 i for i = 1, . . . , K . We denote the ordered stage 1 estimates as X (i) , where the ordering is simply by effect size: X (1) > X (2) > · · · > X (K ) . We consider other ordering mechanisms in due course. Let Y i be the stage 2 estimate of the ith ranked candidate, and so Y i ∼ N (μ (i) , τ 2 i ). At the end of stage 2, the aim is to efficiently estimate μ ( j) for some j ∈ {1, . . . , K }.
The MLE for μ ( j) is a weighted average of the first-and second-stage data: This estimator may be biased, because it does not take into account the first-stage selection procedures or the correlation. An unbiased estimator can easily be found by just using the stage 2 data Y j . However, this estimator suffers from lower precision since we are neglecting the stage 1 data. Hence we look for an unbiased estimator that uses data from both stages.

Calculating the UMVCUE
Let Q be the event {X : X 1 > X 2 > · · · > X K }, which implies that X (i) = X i for i = 1, . . . , K . Without loss of generality, we condition on Q for the remainder of the derivation of the UMVCUE.
To start with, consider estimating the mean of the highest ranked population, μ 1 . For notational convenience, we let Y 1 = Y and τ 1 = τ . In Section 1 of the supplementary materials (available at Biostatistics online), we prove the following theorem for the complete and sufficient statistic for µ = (μ 1 , . . . , μ K ).
Then, by the Lehmann-Scheffé theorem,Û = E(Y | Z = z, Q) is the UMVCUE for μ 1 under Q. Hence we have the following result, as proved in Section 2 of the supplementary materials (available at Biostatistics online).
REMARK In the independent normal setting, Cohen and Sackrowitz (1989) note that all dependency on observations X 3 , X 4 , . . . , X K vanished in the expression for the UMVCUE-which they interpreted as a negative result when K > 2. However, we see in Equation (2.1) that accounting for correlation means that all of the observations can contribute to the sets A 1 and A 2 and be implicitly used in the estimator. In fact, any of the observations X 3 , X 4 , . . . , X K can be used explicitly, depending on which feature in min(A 1 ) and max(A 2 ).
REMARK 2 It is often the case that the stage 1 estimates are asymptotically normal rather than precisely normal. For example, log ORs have an asymptotic normal distribution. In these settings, our theorems hold asymptotically.
Suppose that, instead of only selecting the single top-ranking treatment or biomarker for continuation to stage 2, we take forward the top K from a larger group of K . For example, in the GWAS setting, typically K ≈ 500 000 and K < 100. We now require the UMVCUE for the jth best candidate out of K and the following corollary covers this case.
REMARK As a simple check of our methodological development's consistency with previous results, if we set the off-diagonal terms of the covariance matrix V equal to zero, then we recover the independent normal setting. The UMVCUE was derived in this special case by Bowden and Glimm (2008). We compare our results in Section 3 of the supplementary materials (available at Biostatistics online) and show that they are the same.
In Section 4, we show the form of the UMVCUE when ranking by p-value, and subsequently show how to construct bootstrapped confidence intervals (CIs) in this setting.

Bivariate normal UMVCUE
We now consider the UMVCUE derived in the bivariate normal setting, where K = 2. Suppose we have two correlated continuous stage 1 outcome measures, defined by X = (X 1 , X 2 ), for the candidates C 1 and C 2 . We let X follow a bivariate normal distribution, with where ρ is the correlation coefficient between X 1 and X 2 , and we assume ρ = ±1. From Equations (2.1), the statistic (Z 1 , Z 2 ) is sufficient for (μ 1 , μ 2 ), where REMARK Note that if σ 1 /σ 2 = ρ, then the UMVCUE is in fact equal to the MLEμ 1 .

Simulation results
In this subsection, we compare various estimators for the largest mean μ 1 in the bivariate normal setting: Y = the observed stage 2 data for the highest ranked population. μ 1 = the MLE for μ 1 , which ignores selection and correlation. U B = the UMVCUE derived by Bowden and Glimm (2008), which accounts for selection but ignores correlation. Note that this is equivalent toÛ when we set ρ = 0. U = our UMVCUE given in Equation (3.1), which accounts for selection and correlation.
To evaluate the performance of a generic estimator for μ 1 , say μ * 1 , we calculate the bias and MSE as defined in Posch and others (2005), Sill and Sampson (2007), and Bowden and Glimm (2008): In our simulation study, we set σ 1 = 0.05 and σ 2 = 0.1 and then consider four simple scenarios (i)-(iv) with varying values of the true means µ = (μ 1 , μ 2 ) and stage 2 variance τ 2 : Note that, for scenarios (i) and (ii), MSE sel (Y ) = 0.05 2 . For scenarios (iii) and (iv), the exact MSE of Y is as follows: Figure 1 shows the mean relative bias (defined as b sel /μ 1 ) of the MLE (μ 1 ) and the UMVCUE (Û B ) (which ignores correlation) as a function of ρ for the four scenarios. Since the mean bias of the stage 2 data (Y ) and the UMVCUE (Û ) are virtually identical to zero for all values of ρ (by definition), we do not show the simulation results here.
The plots show that the bias of both estimators is at a maximum for ρ close to −1. The MLE is positively biased for ρ < 0.5, and in scenarios (ii) is in fact substantially biased for all values of ρ. We see thatÛ B is positively biased for ρ < 0, unbiased when ρ = 0 (as would be expected), and negatively biased for ρ > 0. The bias of both estimators is worse when τ = σ 1 compared with τ = 0.05. Figure 2 shows the MSE of the estimators (μ 1 , Y,Û B ,Û ) for the four scenarios. As would be expected, the MLE always has the lowest MSE, while Y always has the highest. We also see that there is a substantial reduction in the MSE when using the UMVCUEsÛ orÛ B .
In scenarios (i)-(iii), the MSE ofÛ decreases as ρ approaches 0.5, and then increases again. This is to be expected given Equation (3.1), which shows thatÛ reduces down to the MLE when σ 1 = ρσ 2 . In scenarios (iii) and (iv), we also see that, for ρ close to 1, the MSE ofÛ and the MLE are very close.
In all of the scenarios, the MSE ofÛ is greater than that forÛ B when ρ < 0, equal when ρ = 0 (as would be expected), and less than that forÛ B when ρ > 0. However, these differences in the MSE are much less than the differences between the MSE of Y and eitherÛ orÛ B . In conclusion, there is little or no advantage in terms of MSE when using the estimatorÛ B as opposed to our UMVCUEÛ , which properly accounts for correlation.

APPLICATION TO THE PARKES AND OTHERS' GWAS DATA
We now apply our methodology to data from the genome-wide association scan for Crohn's disease published in 2007 by the Wellcome Trust Case Control Consortium (2007) (WTCCC). There were 1748 cases for the disease, and 2938 controls in this cohort, which identified 12 SNPs associated with disease status at genome-wide significance.
A replication study was then reported by Parkes and others (2007) in a follow-up cohort of 1182 cases and 2024 controls, in which 12 SNPs were successfully replicated in this study. Following Bowden and Dudbridge (2009), we exclude one (rs6887695) from our illustration (as its significance in the WTCCC scan was severely reduced after data cleaning). Table 1 shows the estimated ORs for stages 1 and 2 (i.e. the WTCCC data and the replication data), as well as the overall MLE. This is then followed by two previously described estimators that combine the data from the two stages in a way that attempts to correct for bias. The first estimator (denoted by ZP) is the corrected MLE of Zhong and Prentice (2008). The second (denoted byÛ B ) is the estimator considered in Bowden and Dudbridge (2009), who calculated the UMVCUE assuming that all of the log ORs for the SNPs are uncorrelated.
As the bold entries in Table 1 indicate, two pairs of SNPs are on the same chromosomal region. The first-and second-ranked SNPs (rs17234657 and rs9292777) are both on chromosomal region 5p13, while the fifth-and seventh-ranked SNPs (rs13361189 and rs4958847) are both on chromosomal region 5q33.
It can be safely assumed that SNPs from distinct genomic regions are in linkage equilibrium. However, in an effort to assess the potential correlation between SNPs in the same genomic region, we used the data from 498 Europeans in the 1000 genomes project (1000Genomes Project Consortium, 2012. The empirical correlation between the SNPs (encoded as 0,1,2 to indicate the number of risk increasing alleles) on chromosomal regions 5p13 and 5q33 were −0.28 and 0.78, respectively.  Parkes and others (2007) in an initial scan (stage 1) and replication study (stage 2). SNPs are ranked in order of stage 1 statistical significance, with those on the same chromosomal region highlighted in bold Although the populations considered by Parkes and others (2007) and the 1000 genomes project are different, these correlations do suggest LD exists between these SNPs. Hence a natural question to ask is how the OR estimates are affected by correlation. To do so, we first need to extend our framework to account for ranking by p-value.

Calculating the UMVCUE
Consider a two-stage design with a genome-wide association scan (stage 1) followed by a replication study (stage 2). Only those SNPs that meet selection criteria (see below) in stage 1 continue on to stage 2. A common approach in the genome-wide setting is to rank the SNPs according to the statistical significance of the effects, with a p-value threshold p crit . We assume that K of the SNPs pass p crit and are then ranked.
For a one-sided test, we condition (WLOG) on the event The UMVCUE in this case is given in Section 4 the supplementary materials (available at Biostatistics online). For a two-sided test, we instead condition (WLOG) on the event Then the UMVCUE for μ j is as follows, with the proof found in Section 5 of the supplementary materials (available at Biostatistics online): THEOREM 4.1 The UMVCUE for μ j given Q 2 iŝ and we define min{∅} = +∞ and max{∅} = −∞.
REMARK The introduction of the variable M above is for notational convenience. In particular, we reexpress the intersection of sets For either test, if all the off-diagonal terms of the covariance matrix are equal to zero, then we recover the independent normal setting as in Bowden and Dudbridge (2009). In Section 6 of the supplementary materials (available at Biostatistics online), we compare our estimators and show they are the same.

Results
Rather than assuming we know ρ, we take each pair of SNPs that are on the same chromosomal region and calculate the UMVCUEs as the correlation between the log ORs varies between ±1. More explicitly, given a pair of SNPs on the same chromosomal region which have stage 1 ranks j 1 and j 2 , respectively, then we assume that the log ORs X j 1 and X j 2 follow a bivariate normal distribution, with correlation coefficient ρ Starting with the SNPs on chromosomal region 5p13, the upper half of Figure 3 shows the UMVCUEs for the ORs as a function of the correlation coefficient ρ. The first plot for SNP rs17234657 shows that the UMVCUE can vary substantially if there is positive correlation between the SNPs. As ρ increases from 0 to 0.88, the UMVCUE increases from 1.16 to 1.32, with a particularly rapid increase for ρ greater than about 0.5. For ρ > 0.88 we actually see a (small) decrease in the UMVCUE. In contrast, for negative values of ρ, the UMVCUE is hardly affected. The second plot for SNP rs9292777 shows that the UMVCUE decreases as ρ increases, but in absolute terms correlation has little effect. The main reason for this is because the stage 1 and stage 2 estimates for the OR were very similar (1.38 and 1.34, respectively), so we would not expect the UMVCUE to be dramatically different regardless of any correlation.
Performing the analysis for the SNPs on chromosomal region 5q33 gives the results displayed in the lower half of Figure 3. Looking at the first plot for SNP rs13361189, in absolute terms correlation has little effect on the UMVCUE. However, note the piecewise nature of the UMVCUE: it increases for −1 < ρ < −0.58, decreases for −0.57 < ρ < −0.33, remains constant for −0.33 < ρ < 0.1, and then (rapidly) decreases again for ρ > 0.1. Also note that this time, there is a large discrepancy between the stage 1 and stage 2 log ORs (1.51 and 1.38, respectively). The fact that the UMVCUE hardly varies indicates that the substantial correction to the stage 1 estimate is in fact robust to correlation.
As for the second plot for SNP rs4958847, again the correlation hardly changes the value of the UMVCUE (in fact, the UMVCUE remains constant until ρ > 0.5). However, this should be expected since the stage 1 and stage 2 estimates for the OR were very similar (1.35 and 1.36, respectively). Constructing CIs. After constructing a point estimate for the OR, it is natural to construct CIs as well. We use a parametric bootstrap procedure inspired by Bowden and Glimm (2008) and Pepe and others (2009), where we assume that the true log OR mean μ j =Û j (the UMVCUE calculated from the observed data).
Starting with the SNPs on chromosomal region 5q33, these had the fifth and seventh stage 1 ranks. Given the observed stage 1 log ORs x = (x 1 , . . . , x 11 ) and the assumed correlation coefficient ρ between Further details of how the variables are generated in step 1 can be found in Section 7 of the supplementary materials (available at Biostatistics online), as well as an extension to the general multivariate case.
As for the SNPs on chromosomal region 5p13, these had the first and second stage 1 ranks. The same bootstrap procedure described above can be used, except that in step 1, we condition on |X (B) 1 |/σ 1 |X (B) 2 |/σ 2 |x 3 |/σ 3 . Figure 4 shows the bootstrapped 95% CIs for SNP rs17234657 on chromosomal region 5p13. When the correlation ρ = 0, the CI for the UMVCUE is virtually identical to that achieved by just using the stage 2 data. However, we see that, for ρ > 0.5, the CIs become appreciably different, since the stage 2 OR ignores the correlation.  Table 2 compares the bootstrapped 95% CIs for the various estimators for the two highest ranked SNPs (rs17234657 and rs9292777) when the correlation ρ = 0. For the SNP rs9292777, there is a 12% reduction in the CI width (from 0.304 to 0.268) when using the UMVCUE compared to just using the stage 2 data. For both SNPs, the ZP estimator and the MLE give very similar CIs.

DISCUSSION
In this paper, we present a general framework for accounting for selection and correlation in two-stage trials, with an emphasis on the GWAS setting. We achieve this by deriving the UMVCUE in the multivariate normal setting with an arbitrary known covariance structure. Our framework relaxes the assumption present in the literature that the stage 1 population parameter estimates are independent.
As the bivariate normal simulation study demonstrated, it is important to correctly account for correlation. Indeed, if the correlation coefficient is not close to zero, then the UMVCUE that ignores correlation can be substantially biased. Due to the bias-variance trade-off, the MSE of the UMVCUE is greater than that of the MLE, but is substantially less than the MSE of using the (unbiased) stage 2 data alone.
The GWAS example showed how our estimation strategy can be applied in practice. We also described how to construct CIs using a parametric bootstrap procedure. Our results demonstrate how correcting for correlation is necessary in the GWAS setting. Indeed, for the Parkes and others (2007) data, the OR estimate for the highest ranking SNP varied substantially for high positive values of the correlation coefficient.
In the GWAS setting, our framework assumes that the LD structure of the SNPs is known. In practice, we would envisage using large external datasets such as the International HapMap Project (The International HapMap Consortium, 2003) to give an estimate of SNP LD. Alternatively, the correlation could be estimated based on the stage 1 data itself if individual participant data were available, but further research is needed to give guidance as to whether doing so would induce bias into our UMVCUE and/or appreciably increase the MSE. Even if it is not possible to have reliable estimates of the correlation between the (log) ORs, a sensitivity analysis (like the one we carried out) is still useful to probe the robustness of the reported ORs to any LD between the SNPs.
Due to the structure of the Parkes and others (2007) data, our results in Section 4.2 focused on the bivariate setting. Of course, in other GWAS studies the data may mean that it is necessary to consider multiple (i.e. K > 2) SNPs in a local region that are in LD. Our UMVCUE, although still tractable, would then have a much more complex dependence on K (K − 1)/2 correlation coefficients.
We note that some progress was made to account for correlated outcomes by Sill and Sampson (2007), who looked at two-stage trials with mid-trial treatment selection based on a surrogate endpoint. They derived the UMVCUE for the final endpoint when the surrogate and final endpoint follow a bivariate normal distribution. However, our framework is a more general one, and allows application to other trial settings. Indeed, as current work, we have applied our estimator to two-stage adaptive seamless phase II/III clinical trials, allowing for unequal stage 1 (and stage 2) variances as well as generalized selection rules.
One of the limitations of our work is the construction of CIs. In the bivariate case, we can simply use a parametric bootstrap procedure. As for the general multivariate case, in the supplementary materials (available at Biostatistics online) we show theoretically how to extend our method. However, it is not clear what the coverage of such CIs would be in practice. An alternative approach could be based on the work of Sampson and Sill (2005), who developed exact CIs in the independent normal setting. It may be possible to extend their analysis to correlated multivariate normal outcomes.
In the framework for our UMVCUE, we treat the stage 2 data as coming from independent univariate normal distributions, i.e. we ignore the potential stage 2 correlation structure. As current work, we are deriving efficient unbiased estimators that make use of, and fully account for, all the available correlated stage 2 outcomes. However, preliminary results appear to show that no UMVCUE exists for this case, despite the obvious appeal of making use of more data, when available.
Finally, another extension would be trials with more than two stages. Bowden and Glimm (2014) derived Rao-Blackwellized estimators for multistage drop-the-loser trials assuming independent normal outcomes, and this could be a starting point for looking at multistage trials with multiply correlated outcomes.