Evaluating and comparing biomarkers with respect to the area under the receiver operating characteristics curve in two-phase case–control studies

Abstract Two-phase sampling design, where biomarkers are subsampled from a phase-one cohort sample representative of the target population, has become the gold standard in biomarker evaluation. Many two-phase case–control studies involve biased sampling of cases and/or controls in the second phase. For example, controls are often frequency-matched to cases with respect to other covariates. Ignoring biased sampling of cases and/or controls can lead to biased inference regarding biomarkers' classification accuracy. Considering the problems of estimating and comparing the area under the receiver operating characteristics curve (AUC) for a binary disease outcome, the impact of biased sampling of cases and/or controls on inference and the strategy to efficiently account for the sampling scheme have not been well studied. In this project, we investigate the inverse-probability-weighted method to adjust for biased sampling in estimating and comparing AUC. Asymptotic properties of the estimator and its inference procedure are developed for both Bernoulli sampling and finite-population stratified sampling. In simulation studies, the weighted estimators provide valid inference for estimation and hypothesis testing, while the standard empirical estimators can generate invalid inference. We demonstrate the use of the analytical variance formula for optimizing sampling schemes in biomarker study design and the application of the proposed AUC estimators to examples in HIV vaccine research and prostate cancer research.

account for the sub-sampling of cases and controls in the second phase, we can construct an IPW AUC estimator where the contribution of each participant to a case-control pair is weighted by the known sampling probability of the participant in phase two:

Y. Huang
(II) Second, note that an IPW estimator of N D ND, the total number of case-control pairs in the phase-one sample, can be constructed as ND j=1 δ Di δD j /(p Di pD j ). Therefore, one could replace N D ND in (0.1) with its IPW estimator. This leads to an alternative AUC estimator An alternative way to view this is that AU C where v Di and vD j are weights assigned to case i and control j respectively, with v Di = N D / p Di ND k=1 δ Dk /p Dk and vD j = ND/ pD j ND k=1 δD k /pD k . Since Finally, note the AU C x (p) as presented in (2.1) of main paper replaces known sampling probability in AU C(p) with their estimated sampling probabilities.
The four AUC estimators: AU C x (p), AU C x (p), AU C x (p), and AU C x (p) in general differ from each other. In the special case where the phase-two sampling probabilities for cases and controls are constant within each covariate stratum with p D and pD estimated empirically within discrete covariate strata, AU C(p) in (0.3) and AU C(p) in (2.1) of main paper are equivalent as shown below. Moreover, if in phase two cases and controls are each randomly sampled, then AU C(p), AU C(p), and AU C(p) are equivalent to the empirical AUC estimator.
Let W D and WD indicate the discrete stratum for sampling the biomarker, for a case and a control respectively. The strata defined can be same or different between cases and controls. Suppose there are K D strata for cases and KD strata for controls such that W and thus Appendix B: Proof of Theorem 1

B1. Asymptotic normality of AU C(p)
In Appendix B, we use AU C to indicate AU C x . The subscript x is omitted for simplicity. Let σ ND indicate the sigma field of information, or complete data, potentially available in all cases in phase-one sample, and σ ND the sigma field of information for all controls in phase-one sample. We have Supplementary Material for "Evaluating and Comparing AUC in Two-Phase Case-Control Studies"5 B2. Asymptotic normality of AU C(p) Next we consider the asymptotic distribution of Note that Note that A + B E + F can be written as the sum of two independent terms where each term follows bivariate normal distribution following multivariate central limit theorem.
We thus have

Y. Huang
The asymptotic normality of √ N AU C(p) − AU C follows according to Delta method. That is, as N → ∞, √ N { AU C(p) − AU C} converges to a normal random variable with mean 0 and variance Σ (II) x , which is computed as below. We have Similarly, we can show that var Thus by Delta method, Supplementary Material for "Evaluating and Comparing AUC in Two-Phase Case-Control Studies"7 The comparison of efficiency between AU C x (p) and AU C x (p) depends on the sampling design and the biomarker distribution. Among other conditions, one condition that would ensure Σ x is the non-negativity of the covariances between 1/p D and FD x (X D ) and between 1/pD and S Dx (XD). A special case when this holds is when cases and controls each are randomly sampled in the second phase such that the phase-two sampling probability of a subject is independent of one's biomarker value.

B3. Asymptotic normality of AU C(p)
Suppose we modelp D as a function of covariates with finite-dimensional parameters θ D = {θ D1 , . . . , θ DKD }, and model pD as a function of covariates with finite-dimensional parameters θD = {θD 1 , . . . , θD KD }. Letθ D andθD be maximum likelihood estimators of θ D and θD, and let p D andpD be corresponding estimators of p D and pD. Through Taylor's expansion, we have Considering estimatingθD by maximizing the likelihood of observing markers in phase two. Loglikelihood for sampling a control from the phase-one sample is The corresponding score function is the Hessian is and the information matrix is Supplementary Material for "Evaluating and Comparing AUC in Two-Phase Case-Control Studies"9 Similarly we can show that B4. Asymptotic normality of AU C(p) Next we derive asymptotic distribution of √ N { AU C(p) − AU C}. Following similar arguments as in B2 and B3, is asymptotically bivariate normal, where And the asymptotic normality of √ N AU C(p) − AU C follows. Denote its asymptotic variance as Σ

Y. Huang
By Delta method, In addition, Supplementary Material for "Evaluating and Comparing AUC in Two-Phase Case-Control Studies"11 x . That is, the asymptotic variance of AU C x (p) is smaller than or equal to the asymptotic variance of AU C x (p), and the asymptotic variance of AU C x (p) is smaller than or equal to the asymptotic variance of AU C x (p). So estimating the sampling weight can lead to improvement in efficiency in Bernoulli sampling even if the sampling probability is known.

Appendix C. Proofs of Theorem 2
We sketch the proofs for Theorem 2, which follow similar argument as the proofs of Theorem 1.
, which is asymptotically normally distributed with mean zero and variance var( Supplementary Material for "Evaluating and Comparing AUC in Two-Phase Case-Control Studies"13 As a result, From similar argument as in single marker case, we have Again, we can see that asymptotic variance of ∆ AU C(p) is less than or equal to the asymptotic variance of ∆ AU C(p) and asymptotic variance of ∆ AU C(p) is less than or equal to the asymptotic variance of ∆ AU C(p).

Appendix D. Finite-Population Stratified Sampling
Here we demonstrate the equivalence in asymptotic variance of AU C(p) between Bernoulli sampling and finite-population stratified sampling whenp is estimated from discrete stratum and when sampling fraction for cases and controls in each stratum in finite-population stratified sampling converges to the corresponding sampling probability in Bernoulli sampling.
In particular, suppose there are K D strata among cases with stratum indicator W D = w D1 , . . . , w DKD and KD strata among controls with stratum indicator WD = wD 1 , . . . , wD KD .
Let N DkD , k D = 1, . . . , K D be number of cases in stratum k D and ND kD , kD = 1, . . . , KD be number of controls in stratum kD in the phase-one cohort sample. In second phase of the study, we sample fixed number n DkD of cases without replacement from stratum k D and sampling fixed number nD kD of controls without replacement from stratum kD. Suppose as N → ∞, the sampling fractions for cases among stratum k D ∈ (1, . . . , K D ) converge with n DkD /N DkD → π DkD ∈ (0, 1], and the sampling fractions for controls among stratum kD ∈ (1, . . . , KD) converge with nD kD /ND kD → πD kD ∈ (0, 1]

Supplementary Material for "Evaluating and Comparing AUC in Two-Phase Case-Control Studies"15
Here we use subscript DkD,i to indicate the i th cases among stratum k D , and use subscript DkD,j to indicate the j th controls among stratum kD. We have Next we derive the asymptotic variance of AA. The derivation follows the strategy as in Breslow and Wellner (2007). The asymptotic variance of BB can be derived following similar arguments.
Let P π ND be the IPW empirical measure of cases.
δ DkD ,j ζ XDk D ,i /N DkD is a 'finite sampling empirical measure' for stratum k D among cases and ζ XDk D ,i is the Dirac measure placing unit mass on X DkD ,i . Let P ND denote the empirical measure of cases: ζ XDk D ,i /N DkD . Let P D0 denote the true marker distribution among cases. Let G ND = √ N D (P ND − P D0 ) denote the standard empirical process for cases. We have Y. Huang

as shown in Breslow
and Wellner (2007). Thus according to theorem 3.6.13 and 1.12.4 in Van der Vaart and Wellner (1996), for almost every sequence of complete data, G δ where G DkD denote the P D0|kD Brownian Bridge. Let σ ND be the sigma field of information, potentially available for N D cases. Conditional on σ ND , the process G δ kD ,Nk D are mutually independent because of the independence of the {δ DkD ,j } in different strata. Furthermore, they are (unconditionally) uncorrelated with G ND = √ N D (P ND − P 0 ), following corollary 2.9.3 of Van der Vaart and Wellner (1996) and Breslow and Wellner (2007). The vector of processes ) converges weakly to the vector of independent Brownian bridge processes (G, G δ 1 , . . . , G δ KD ), which leads to by continuous mapping theorem. Therefore, Now look at the asymptotic variance of AU C(p) or equivalently AU C(p) in Bernoulli sam-

Supplementary Material for "Evaluating and Comparing AUC in Two-Phase Case-Control Studies"17
pling wherep DkD andpD kD are empirical estimates from each stratum. From Appendix B2, ity contributed by cases and var(B) − var(D) is variability contributed by controls. We have and therefore, which equals (0.4).
Appendix E. Comparing variance of AU C(p) and AU C(p) in Bernoulli Sampling

With Discrete Strata
Consider Bernoulli sampling design whereπ DkD andπD kD are empirical estimates of phase-two sampling proportion from each case and control stratum. From appendix B2, analytical variance where (0.5) and (0.6) are variance components due to variability of cases and controls respectively.
We take a close look at (0.6) since results about (0.5) follow similarly. We made the following observations: i) oftentimes decreasing sampling probability for controls tends to increase (0.6); ii) (0.6) will be non-negative if 1/pD − 1 and E S XD (XD kD ) are independent or positively correlated; iii) there exist scenarios where (0.6) is negative. For example, if AU C x = 0.664, π DkD = 0.9, 0.1 for kD = 1, 2, E S XD (XD kD ) = (0.254, 0.701) for kD = 1, 2, then the term in (0.6) equals -0.0479. In general, we found AU C(p) to be more efficient or have very similar variability compared to AU C(p). Supplementary Figure 1 explored the magnitude of (0.6) as a function of several parameters for the following setting.
Appendix E1: Setting for Supplementary Figures 1 and 2 We consider binary disease D with prevalence λ = 0.1. Biomarker X and covariate W * are jointly normally distributed conditional on D, with ρ xw * = 0.5 conditional on D. Among controls, X and W * each follows a standard normal distribution. Among cases, X follows N (1, 1) and W follows N (0.6, 1). We have AU C x = 0.76. Let W be a discrete covariate stratum derived from W * , which Supplementary Material for "Evaluating and Comparing AUC in Two-Phase Case-Control Studies"19 takes two levels: where Φ is the CDF of standard normal. Consider data from two-phase studies. In the first phase, N = 5, 000 subjects are randomly sampled from the population, whose D and W values are measured. In the second phase, Bernoulli sampling of cases and controls are performed for measuring marker X stratified on strata W , assuming on average nD = 250 controls are sampled from the two strata. Note η = P (W = 1|D = 0) is the probability a control in the population is from stratum 1, and let α = P (W = 1|D = 0, Sampled in phase two) be the proportion of phase-two control samples from stratum 1. We have πD In Supplementary Figure 1, we plot (0.6) as function of η, α, and ρ * xw individually while keeping the other two factors constant. It appears that (0.6) tends to increase with η or α getting closer to 0 or 1.
Appendix F. Variance reduction due to weight estimation in Bernoulli Sampling from Discrete Strata Consider Bernoulli sampling design whereπ DkD andπD kD are empirical estimates of phase-two sampling proportion from each case and control stratum.
Similarly, reduction in analytical variance of Note that analytic variance reduction due to weights estimation for AU C or ∆ AU C can be represented as a weighted sum of squared terms across strata; the term for each stratum is the AU C comparing cases in this particular stratum with a random control or vice versa for estimation of AUC, or the difference in these AUCs between markers for estimation of ∆AU C.
The latter can be much smaller in magnitude since AUC difference is smaller than AUC itself.
Similarly, reduction in analytical variance of √ N ∆ AU C(p) − ∆AU C compared to variance of √ N ∆ AU C(p) − ∆AU C equals the sum of following two terms: We take a close look at (0.8), i.e. reduction in asymptotic variance of AU C(p) attributed to variability of controls by estimating sampling probability. Note that (0.8) tends to be big if there exist some large strata kD with small sampling probability and if there is large difference between E{S XD (XD kD )} and AU C x . Using the same setting as presented in Supplementary Appendix E1, we plot (0.8) as function of η = P (W = 1|D = 0), α = P (W = 1|D = 0, Sampled in phase two), and ρ * xw = cor(X, W * |D = 0) individually while keeping the other two factors constant. In

Supplementary Material for "Evaluating and Comparing AUC in Two-Phase Case-Control Studies"21
general, we see a U-shape for (0.8) as functions of α or ρ xw * and a upside down U-shape for (0.8) as a function of η (Supplementary Figure 2).

Appendix G. Expanded Tables for Simulation Studies with bi-binormal marker
This section present expanded tables for simulation studies described in Section 3 of the main text by including AU C(p) and AU C(p) in addition to AU C(p) and AU C em .
Supplementary  We consider a binary disease outcome D with prevalence P (D = 1) = 0.1 in the population. Let W * be a continuous covariate, which belongs to standard normal among controls (D = 0) and belongs to N (0.6, 1) among cases (D = 1). Let W be a discrete covariate stratum derived from W * , which takes three levels: We consider two biomarkers X and Y , each follows gamma distribution conditional on D (Dorfman and others, 1997). We assume shape parameter for each marker is equal between cases and controls, denoted as κ x and κ y for X and Y . Scale parameters for X and Y are σ Dx and σ Dy among cases and equal to 1 among controls. Let ρ xy , ρ xw * , and ρ yw * be correlations between X

22
Y. Huang and Y , between X and W * and between Y and W * respectively conditional on D. We simulate the joint distribution of X, Y , W * conditional on D using Gaussian copula model (Nelsen, 2013).
The sampling scheme for Monte-Carlo simulation studies is the same as that in Section 3 of the main text. Supplementary Table 9 shows performance of various estimators of AU C x .
Supplementary Appendix H2. Implication on efficiency of sampling scheme for biomarkers following bi-gamma model Consider a binary disease outcome D with prevalence P (D = 1) = 0.1 in the population. Let W * be a continuous covariate, which belongs to standard normal among controls (D = 0) and belongs to N (0.6, 1) among cases (D = 1). Let W be a discrete covariate stratum derived from W * , which takes two levels: W = 1 if W * < Φ −1 (1/2), W = 2 otherwise. Suppose marker X follows gamma distribution conditional on D, with shape parameter 1/3, scale parameter 1 among controls, and scale parameter σ Dx among cases. Correlation between X and W * is ρ x,w * conditional on D.
We compare two sampling designs. Both are two-phase studies with a random cohort sample of size N drawn in the first phase. In the second phase, both designs include all cases from phaseone sample, i.e., π D = 1; a simple random sample of controls of size nD = n D are drawn without  Figure   3 shows the relative asymptotic efficiency of AU C(p) in Design 2 versus AU C em in Design 1 for two different AU C x values, as the correlation ρ x,w * changes.   Table 3: Performance of different estimators of ∆AU C = AU Cy − AU Cx when the two markers have same variability but different correlation with covariate W * conditional on D, for the bi-normal marker model described in Section 3. Disease prevalence is 0.1. Marker X and Y is each standard normal among controls. Here we have µDy = 1, σDx = σDy = 1, AU Cy = 0.76, ρyw * = 0.5, ρxy = 0.5. nD and nD indicate expected number of cases and controls sampled in phase two for Bernoulli sampling and exact number of cases and controls sampled for finitepopulation stratified sampling. Results are based on 5,000 Monte-Carlo Simulations.
Bernoulli Sampling * FPS sampling  Table 4: Performance of different estimators of ∆AU C = AU Cy − AU Cx when the two markers have different variability among cases, for the binormal model described in Section 3. Disease prevalence is 0.2. Marker X and Y is each standard normal among controls. Here we have µDy = 1, σDy = 1, AU Cy = 0.76, ρyw * = 0.5, ρxy = 0.5. nD and nD indicate expected number of cases and controls sampled in phase two for Bernoulli sampling and exact number of cases and controls sampled for finitepopulation stratified sampling. Results are based on 5,000 Monte-Carlo Simulations.
Bernoulli Sampling * FPS sampling 30 Y. Huang  Table 5: Performance of different AU Cx estimators for the bi-gamma marker model described in Appendix H1. Disease prevalence is 0.1. Biomarker X follows gamma distribution with shape parameter κx conditional on D, scale parameter 1 among controls, and scale parameter σDx among cases. nD and nD indicate expected number of cases and controls sampled in phase two for Bernoulli sampling and exact number of cases and controls sampled for finite-population stratified sampling. Results are based on 5,000 Monte-Carlo Simulations.  Table 6: Performance of various estimators of ∆AU C = AU Cy − AU Cx when the two markers have same shape parameter and same correlation with covariate W * conditional on D, for the bi-gamma model described in Appendix H1. Disease prevalence is 0.1. Marker X and Y each follows gamma distribution conditional on D with shape parameter κx = κy. Scale parameter is 1 among controls for each marker and is σDx for X and σDy for Y . Here we have ρxw * = ρyw * = 0.5, and ρxy = 0.5. nD and nD indicate expected number of cases and controls sampled in phase two for Bernoulli sampling and exact number of cases and controls sampled for finite-population stratified sampling. Results are based on 5,000 Monte-Carlo Simulations. 99.9 99.9 100 99.9 100 * finite-population stratified sampling Table 7: Performance of different estimators of ∆AU C = AU Cy − AU Cx where the two markers have same shape parameter but different correlation with covariate W * conditional on D, for the bi-gamma marker model described in Appendix H1. Disease prevalence is 0.1. Marker X and Y each follows gamma distribution conditional on D with shape parameter κx = κy. Scale parameter is 1 among controls for each marker and is σDx for X and σDy for Y . Here we have ρyw * = 0.5 and ρxy = 0.5. nD and nD indicate expected number of cases and controls sampled in phase two for Bernoulli sampling and exact number of cases and controls sampled for finite-population stratified sampling. Results are based on 5,000 Monte-Carlo Simulations.

Bernoulli Sampling
Bernoulli Sampling * FPS sampling κx σDy σDx AU Cy AU Cx ∆AU C ρxw * nD Supplementary Material for "Evaluating and Comparing AUC in Two-Phase Case-Control Studies"39 Table 8: Performance of different estimators of ∆AU C = AU Cy − AU Cx when the two markers have different shape parameter conditional on D, for the bi-gamma model described in Appendix H1, with κy = 3, κx = 1/3. Disease prevalence is 0.1. Here we have ρyw * = 0.5, ρxy = 0.5. nD and nD indicate expected number of cases and controls sampled in phase two for Bernoulli sampling and exact number of cases and controls sampled for finite-population stratified sampling. Results are based on 5,000 Monte-Carlo Simulations.
Bernoulli Sampling * FPS sampling σDy σDx AU Cy AU Cx ∆AU C ρxw * nD ∆ AU C(p) ∆ AU C(p) ∆ AU C(p) ∆ AU C  for the variance component that is due to variability in controls, as a function of η, α, and ρxw , for the setting described in Appendix E1. The value presented in Y-axis is the value in (0.6), assuming AU Cx = 0.76. Here η = P (W = 1|D = 0), α = P (W = 1|D = 0, Sampled in phase two), and ρxw * = cor(X, W * |D = 0).  for the variance component that is due to variability in controls, as a function of η, α, and ρxw , for the setting described in Appendix E1. The value presented in Y-axis is the value in (0.8) assuming AU Cx = 0.76. Here Here η = P (W = 1|D = 0), α = P (W = 1|D = 0, Sampled in phase two), and ρxw * = cor(X, W * |D = 0).