Abstract

With the increasing availability of data in the public domain, there has been a growing interest in exploiting information from external sources to improve the analysis of smaller scale studies. An emerging challenge in the era of big data is that the subject-level data are high dimensional, but the external information is at an aggregate level and of a lower dimension. Moreover, heterogeneity and uncertainty in the auxiliary information are often not accounted for in information synthesis. In this paper, we propose a unified framework to summarize various forms of aggregated information via estimating equations and develop a penalized empirical likelihood approach to incorporate such information in logistic regression. When the homogeneity assumption is violated, we extend the method to account for population heterogeneity among different sources of information. When the uncertainty in the external information is not negligible, we propose a variance estimator adjusting for the uncertainty. The proposed estimators are asymptotically more efficient than the conventional penalized maximum likelihood estimator and enjoy the oracle property even with a diverging number of predictors. Simulation studies show that the proposed approaches yield higher accuracy in variable selection compared with competitors. We illustrate the proposed methodologies with a pediatric kidney transplant study.

1 Introduction

Personalized medicine has gained a great deal of attention over the past decade. The objective of personalized medicine is to tailor the treatment or intervention plan to individual characteristics. To this end, summarizing evidence regarding treatment effect heterogeneity is a crucial step. In practice, however, individual studies are usually underpowered to detect important interactions or perform subgroup analyses. To increase the power, researchers have employed meta-analysis to synthesize information from different data sources. Traditional meta-analysis methods are concerned with combining aggregate study-level data, such as odds ratios and mean differences. The major drawback of aggregate data analyses is that covariate-treatment interactions are usually not provided in the reports of primary analysis findings, making it difficult to conduct subgroup analysis (Simmonds and Higgins, 2007). An increasingly popular approach is meta-analysis of individual patient data (IPD), where the raw data from each study are obtained and analyzed directly. However, acquiring data from different institutions and generating a consistent data format across studies usually require a substantial amount of time and resources. Moreover, critical information about factors that influence the choice of treatment and clinical outcome may not be available for all studies.

In this paper, we present a unified framework for synthesizing information from a mixture of subject-level data and aggregate data. Various approaches have been proposed in the literature to leverage information from external sources to improve efficiency in analyzing smaller scale studies (Thall and Simon, 1990; Boonstra et al., 2016) and to build more accurate risk prediction models (Gail et al., 1989; Costantino et al., 1999; Chen et al., 2006; Gail, 2011; Liu et al., 2014; Chatterjee et al., 2016). An emerging challenge in the era of big data is that the subject-level information, such as genomic and proteomic profiles, is high dimensional, whereas external information is typically at an aggregate level and low dimensional. Although meta-analysis has gained popularity in the discovery of weak or sparse genetic features (Evangelou and Ioannidis, 2013; He et al., 2016; Pasaniuc and Price, 2017), existing approaches are not readily applicable when the external information is at an aggregate level and of a lower dimension. To tackle this problem, we propose a unified penalized empirical likelihood (PEL) approach for synthesizing information from different sources in a high-dimensional setting.

The empirical likelihood method was originally developed to obtain confidence regions that respect the boundaries of the support of the target parameter (Owen, 1988, 1990; Qin and Lawless, 1994). It has been applied to incorporate auxiliary information to improve estimation efficiency (Qin, 2000; Wu and Sitter, 2001; Chen et al., 2002; Chen and Qin, 2014; Qin et al., 2015; Huang et al., 2016; Han and Lawless, 2019). In a high-dimensional setting, Chen et al. (2009) and Hjort et al. (2009) studied the asymptotic properties of empirical likelihood-based inferences, while others, including Tang and Leng (2010), Leng and Tang (2012), and Chang et al. (2018), considered PEL methods for parameter estimation and variable selection. The existing work, however, is not applicable to a mixture of high-dimensional subject-level data and low-dimensional aggregate data. In this paper, we demonstrate that various forms of auxiliary aggregated information can be re-expressed as population estimating equations under a postulated logistic regression model for the subject-level data and thus can be readily incorporated as constraints under the proposed PEL framework.

The issue of whether the results from an external study are applicable to the current study arises when combining information from different sources. In particular, study design and inclusion and exclusion criteria often vary across studies, leading to population heterogeneity among different sources of information. Such heterogeneity, if not addressed properly, can lead to biased estimates of predictor effects. To tackle this problem, we construct a PEL ratio test statistic to check if the external information is consistent with the subject-level data. When the homogeneity assumption fails, we postulate a semiparametric density ratio model to account for population heterogeneity between different sources of information. The regression parameters in the density ratio model quantify the degree of heterogeneity in the predictor distributions and can be incorporated in the estimating equations derived from the auxiliary information. Moreover, as will be shown later, the additional set of parameters in the density ratio model is identifiable with a sufficient number of summary statistics of the predictors in the external studies. The proposed PEL approach yields higher variable selection accuracy and improved estimation efficiency; moreover, it enjoys the oracle property in the sense that it can achieve sparsity and optimal estimation efficiency simultaneously.

Finally, it is possible that the external aggregated information was derived from studies of similar sample sizes to the internal study, so that the uncertainty in the auxiliary information is not negligible. In this case, we study the asymptotic results to show explicitly the impact of uncertainty on the efficiency gain of the proposed PEL approach. We further propose a modified variance estimator that properly accounts for the uncertainty in the external information. To the best of our knowledge, this is the first paper to consider population heterogeneity and uncertainty in the auxiliary information when combining information from both high-dimensional subject-level data and low-dimensional aggregate data.

2 Proposed Methodologies

2.1 Different Forms of Auxiliary Information

Let formula denote a binary outcome of interest. Consider a logistic regression model

(1)

where formula contains p predictors and formula is a formula vector of regression parameters. To lay the groundwork, we begin with a brief discussion of the general method for summarizing different forms of auxiliary information. For ease of discussion, we assume that the external aggregated information is consistent with the subject-level data, that is, the internal and external studies were conducted on the same population. Moreover, we further assume that the auxiliary information was derived from large databases so that its uncertainty can be ignored. The two assumptions will be relaxed later in Sections 2.3 and 2.4. Our goal is to approximate the population moments by the sample moments with the available subject-level data. We will then treat the sample moment estimating equations as constraints under the empirical likelihood framework to synthesize information from external sources. We give three examples for illustration.

Example 1. For a binary outcome, the most common form of auxiliary information is the event rate in subgroups. For example, in kidney transplant studies, the acute rejection rates in patients who received living versus cadaveric donor organs are usually reported. Note that the subgroups are not necessarily mutually exclusive. Write formula, where ϕ is the event rate in the subgroup and Ω is the subgroup whose event rate was reported in an external study. This equation can be re-expressed as formula. By double expectation and under Model (1), the auxiliary information can be expressed by the population estimating equation formula, where formula.
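
The double-expectation step in Example 1 can be written out as follows. This is a sketch under assumed notation, since the symbols of the original displays are not reproduced here: Model (1) is taken to have the usual logistic form with intercept $\beta_0$ and slope vector $\beta_1$, and $\operatorname{expit}(u) = e^u/(1+e^u)$. The first equivalence requires $P(X \in \Omega) > 0$.

```latex
\begin{align*}
P(Y = 1 \mid X \in \Omega) = \phi
  &\iff E\{ I(X \in \Omega)\,(Y - \phi) \} = 0 \\
  &\iff E\big[ I(X \in \Omega)\,\{ E(Y \mid X) - \phi \} \big] = 0 \\
  &\iff E\big[ I(X \in \Omega)\,\{ \operatorname{expit}(\beta_0 + X^\top \beta_1) - \phi \} \big] = 0 .
\end{align*}
```

The last line involves only the predictors and the regression parameters, which is why it can be approximated by a sample average over the subject-level data.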

Example 2. Summary statistics for the predictors are commonly available in study reports. For example, the proportion of patients who received cadaveric transplants and the mean donor age are usually available in kidney transplant research. Obviously, summary statistics can be expressed as population estimating equations formula, where formula. Common choices of formula include formula, formula, and formula, representing, respectively, the proportion, mean, and variance of the predictor in the target population. Note that in this example, formula does not involve formula directly. However, efficiency gain may be expected under the empirical likelihood framework due to correlation between constraints formed using the same set of subject-level data.

Example 3. Many publications also report the results of univariate and multivariate logistic regression analyses. However, covariates included in these models may not be the same as those considered in Model (1). For example, in studying the effect of steroid avoidance protocols on the risk of acute rejection in kidney transplant patients, the previous studies may have included a smaller set of potential confounders than what are considered in the current data analysis. Write formula, where formula and formula are disjoint subsets of formula with dimensions K and formula, respectively. Consider the case that the coefficient formula of formula in a reduced logistic regression model formula was available. It is known that the true regression coefficient formula is the solution to the expected score equation derived from the reduced model, that is, formula. Under the full model (1) and following the law of total expectation, the auxiliary information can be expressed as formula, where formula.
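
To make the three examples concrete, the following minimal sketch evaluates the corresponding estimating functions for a single subject. It is an illustration under stated assumptions rather than the implementation used in this paper: the function names, the argument layout, and the convention that the reduced model in Example 3 includes an intercept are all hypothetical.

```python
import math

def expit(u):
    """Logistic function, written to avoid overflow for large |u|."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def full_model_prob(x, beta0, beta):
    """P(Y = 1 | X = x) under the assumed full logistic model."""
    return expit(beta0 + sum(b * xj for b, xj in zip(beta, x)))

def g_subgroup_rate(x, beta0, beta, in_subgroup, phi):
    """Example 1: I(x in Omega) * {P(Y = 1 | x) - phi}."""
    return [float(in_subgroup(x)) * (full_model_prob(x, beta0, beta) - phi)]

def g_summary_stat(x, h, mu):
    """Example 2: h(x) - mu, e.g. h(x) = x[0] for a mean or a proportion."""
    return [h(x) - mu]

def g_reduced_model(x, beta0, beta, idx, theta):
    """Example 3: (1, x_A) * {P_full(Y = 1 | x) - P_reduced(Y = 1 | x_A)}.

    idx lists the positions of the reduced-model covariates x_A, and
    theta = (intercept, slopes) holds the reported reduced-model
    coefficients (the intercept is an assumption of this sketch).
    """
    x_a = [x[j] for j in idx]
    p_reduced = expit(theta[0] + sum(t * xj for t, xj in zip(theta[1:], x_a)))
    diff = full_model_prob(x, beta0, beta) - p_reduced
    return [diff] + [xj * diff for xj in x_a]
```

Stacking the outputs of these functions for each subject gives the vector whose weighted sample average is constrained to zero in the empirical likelihood step.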

2.2 PEL for Synthesizing Auxiliary Information

In this section, we introduce a unified PEL method that can synthesize various forms of external aggregated information in variable selection and parameter estimation. Suppose the subject-level data formula, are independent and identically distributed (i.i.d.) realizations of formula. Denote by G the marginal distribution function of formula. Consider the log full likelihood formula, where formula is the log conditional likelihood of Y given formula and formula is the log marginal likelihood of formula. Denote by formula the jump size of G at formula, formula, and write formula. Then the log marginal likelihood can be re-expressed as formula. We are interested in a sparse high-dimensional setting where p is allowed to increase with n. Write formula, where formula is a formula vector of parameters that includes the intercept β0 and the nonzero coefficients, and formula is a collection of zero parameters.

As illustrated above, various forms of aggregated information can be summarized as a system of K-dimensional population estimating equations formula, where formula and formula. Note that, although the subject-level data are high dimensional, the external information is often of a low dimension and thus we assume that K is fixed and finite. To combine information from different sources, one can minimize the negative log full likelihood formula subject to the constraints,

The first two constraints ensure G is a proper distribution function. The last constraint is constructed by approximating the population estimating equation derived from the external aggregated information, formula, using the observed subject-level data. For stability in implementations, a nested optimization algorithm based on the corresponding dual problem is recommended to solve the constrained minimization problem (Chen et al., 2002; Imbens, 2002; Donald et al., 2003; Han and Lawless, 2019).
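
For a fixed value of the regression parameters, the inner step of such a nested algorithm is a standard empirical likelihood problem. Writing $p_i$ for the jump sizes and $g_i$ for the estimating function evaluated at subject $i$ (assumed notation), the constraints are $p_i \ge 0$, $\sum_i p_i = 1$, and $\sum_i p_i g_i = 0$, and the maximizing weights have the form $p_i = 1/\{n(1 + \nu g_i)\}$ with $\nu$ solving $\sum_i g_i/(1 + \nu g_i) = 0$. The sketch below handles only the scalar case $K = 1$ and uses bisection for robustness; the function name `el_weights` is hypothetical, and this is not the authors' implementation.

```python
def el_weights(g, tol=1e-12, max_iter=200):
    """Empirical likelihood weights for one scalar constraint sum_i p_i * g_i = 0.

    Returns (nu, p) where nu solves sum_i g_i / (1 + nu * g_i) = 0 and
    p_i = 1 / (n * (1 + nu * g_i)).
    """
    n = len(g)
    g_min, g_max = min(g), max(g)
    if not (g_min < 0.0 < g_max):
        raise ValueError("0 must lie strictly inside the range of g")

    # On this open interval every 1 + nu * g_i is strictly positive,
    # and the score below is strictly decreasing in nu.
    lo, hi = -1.0 / g_max, -1.0 / g_min

    def score(nu):
        return sum(gi / (1.0 + nu * gi) for gi in g)

    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break

    nu = 0.5 * (lo + hi)
    p = [1.0 / (n * (1.0 + nu * gi)) for gi in g]
    return nu, p


nu, p = el_weights([-1.0, 0.5, 2.0, -0.3])
print(sum(p), sum(pi * gi for pi, gi in zip(p, [-1.0, 0.5, 2.0, -0.3])))
```

In the outer step, the resulting profile likelihood would be minimized over the regression parameters; with $K > 1$ the bisection is replaced by a multivariate root finder such as a damped Newton iteration.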

With a diverging number of predictors, direct minimization of formula subject to the constraints often leads to estimators with large variances; as a result, the fitted model could have poor prediction performance. To tackle this problem, we consider incorporating the adaptive least absolute shrinkage and selection operator (lasso) penalty (Zou, 2006) into the constrained minimization for variable selection and parameter estimation simultaneously. Specifically, the adaptive lasso PEL estimator is defined by

(2)

where formula is a tuning parameter and formula are adaptive weights. Following Zou and Zhang (2009), we propose to use the adaptive weights formula, where formula is a constant and formula is the penalized maximum likelihood estimator (MLE) with elastic net (Zou and Hastie, 2005). Since formula is a consistent estimator of formula, the value of formula reasonably reflects the importance of formula. With proper weights, the adaptive lasso method puts smaller penalties on important predictors than unimportant ones and thus the important predictors are more likely to be selected.
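
As a small illustration of how the adaptive weights act, the sketch below computes weights of the standard adaptive lasso form $w_j = |\tilde\beta_j|^{-\gamma}$ from an initial estimate and evaluates the weighted $\ell_1$ penalty. The exact form of the weights, the function names, and the handling of zero initial estimates (infinite weight, so the coefficient is forced to zero) are assumptions of this sketch.

```python
import math

def adaptive_weights(beta_init, gamma=1.0):
    """w_j = |beta_init_j| ** (-gamma); a zero initial estimate gets weight +inf."""
    return [math.inf if b == 0.0 else abs(b) ** (-gamma) for b in beta_init]

def adaptive_lasso_penalty(beta, weights, lam):
    """lam * sum_j w_j * |beta_j|, with zero coefficients contributing nothing."""
    total = 0.0
    for b, w in zip(beta, weights):
        if b != 0.0:  # skip zeros so that inf * 0 never arises
            total += w * abs(b)
    return lam * total


w = adaptive_weights([0.5, -0.1, 0.0])
print(w)
print(adaptive_lasso_penalty([1.0, 0.0, 0.0], w, lam=0.5))
```

A predictor with a large initial estimate receives a small weight and is penalized lightly, whereas a predictor with an initial estimate near zero is penalized heavily.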

In what follows, we tackle the penalized constrained minimization problem in (2) by solving the corresponding dual problem. The Lagrange function for the penalized constrained minimization problem in (2) is given by formula, where η and formula are Lagrange multipliers. With some calculations, we can derive formula and formula, where the Lagrange multiplier formula satisfies formula. Hence the penalized negative constrained log full likelihood function can be expressed as formula, where, up to a constant,

(3)

Arguing as in Newey and Smith (2004), the penalized constrained minimization can be carried out equivalently by

(4)

The large sample properties of formula are summarized below in Theorem 1 and Theorem 2, with the proofs given in the Supporting Information.

Theorem 1.  Under conditions (C1)–(C3) in the Appendix, suppose that as formula, formula, formula and formula, then formula.

Theorem 2.  Write formula, where formula and formula are estimators of formula and formula, respectively. Let formula be the formula predictors corresponding to formula. Suppose that the conditions specified in Theorem 1 are satisfied and formula as formula. Moreover, assume that formula and formula are positive definite. Then, as formula, we have (i) formula with probability tending to 1; and (ii) formula converges in distribution to N(0, 1), where formula is an arbitrary formula vector satisfying formula and formula.

Under the proposed PEL framework, both the dimensionality of predictors and the number of nonzero coefficients are allowed to increase with the sample size n; moreover, the minimal signal strength formula is allowed to decrease with n. Theorem 1 shows that the proposed estimator formula is a root-formula consistent estimator and Theorem 2 establishes the sparsity property and asymptotic normality. By Theorem 2, the proposed estimator formula enjoys the oracle property; that is, formula can be estimated as zero with probability tending to one and formula is asymptotically as efficient as the empirical likelihood estimator for estimating formula by incorporating auxiliary information as if we knew formula in advance. Define

(5)

which is the adaptive lasso penalized MLE without incorporating the auxiliary information. Write formula, where formula and formula are estimators of formula and formula, respectively. Following Zou and Zhang (2009), we can show that formula converges in distribution to N(0, 1) as formula. Hence Theorem 2 implies that the proposed PEL estimator formula is asymptotically more efficient than the conventional penalized MLE formula.

Applying the local quadratic algorithm (Fan and Li, 2001; Fan and Peng, 2004), the covariance matrix of formula can be estimated by

(6)

For the nonzero parameters, the local quadratic algorithm is applied to approximate the adaptive lasso penalty and thus leads to formula in (6). As formula, we have formula and (6) converges in probability to the asymptotic covariance matrix in Theorem 2. More details are given in the Supporting Information.

2.3 Synthesizing Auxiliary Information in the Presence of Population Heterogeneity

The premise of information synthesis is that the underlying relationship between the response and predictors is the same across data sources. The above estimation procedure additionally assumes homogeneity in the predictor distribution, which is what justifies approximating the population estimating equations by the sample estimating equations. Due to differences in the inclusion and exclusion criteria, however, this assumption may not hold. Following Qin and Lawless (1994, 1995), we develop a PEL ratio test to check the homogeneity assumption, that is, formula. Note that when formula, formula is minimized by formula, where formula is defined by (3) and formula is defined by (5). Therefore, we consider a PEL ratio test with the test statistic formula, where formula is the solution to formula. By Theorem 3 below, we reject the null hypothesis of homogeneity at a type I error rate of 5% if formula, where formula is the 95th percentile of the chi-squared distribution with K degrees of freedom.

Theorem 3.  Under the conditions specified in Theorem 2 and the null hypothesis that formula, the test statistic formula converges in distribution to a χ2 random variable with K degrees of freedom as formula.
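
The decision rule implied by Theorem 3 can be sketched as follows. The code takes an already computed value of the PEL ratio statistic and compares it with the chi-squared reference distribution with K degrees of freedom; it does not compute the statistic itself. The function names are hypothetical, and the tail probability is obtained from the series expansion of the regularized lower incomplete gamma function so that only the standard library is needed.

```python
import math

def chi2_sf(t, df, terms=500):
    """Upper tail probability P(chi2_df > t) for t >= 0."""
    if t <= 0.0:
        return 1.0
    a, x = df / 2.0, t / 2.0
    # Series: sum_{n >= 0} x^n / (a * (a + 1) * ... * (a + n))
    term = 1.0 / a
    total = term
    for n in range(1, terms):
        term *= x / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def reject_homogeneity(stat, k, level=0.05):
    """True if the statistic exceeds the (1 - level) chi-squared percentile."""
    return chi2_sf(stat, k) < level


print(reject_homogeneity(10.0, k=3), reject_homogeneity(2.0, k=3))
```

Rejecting when the tail probability falls below 0.05 is equivalent to rejecting when the statistic exceeds the 95th percentile of the chi-squared distribution with K degrees of freedom.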

When the homogeneity assumption fails to hold, we extend the proposed procedure to allow the predictors in the external source data to have a different distribution. In most applications, a subvector of formula accounts for the predictor distribution heterogeneity between the data sources. In practice, summary statistics available for the predictors in the external source usually inform the selection of the subvector that accounts for the predictor distributional differences. With formula denoting the fixed-length p0-dimensional predictors accounting for the predictor distributional differences, we propose a semiparametric density ratio model

(7)

where formula and formula denote the density functions of formula in the external data and the subject-level data, respectively, with formula unspecified. The model is equivalent to imposing a logistic regression model for the odds of membership in the external study relative to that in the internal study, where the only nonzero regression coefficients are those of the subvector formula. The coefficient formula characterizes the degree of heterogeneity in covariate distribution relative to the internal study. Obviously, formula indicates fidelity to the homogeneity assumption. For model identifiability, we assume that formula. In most situations, aggregated information in the form of summary statistics for predictors is available for checking the heterogeneity in the predictor distribution. Such aggregated information, as illustrated below, can be readily incorporated under the proposed PEL framework and thus the identifiability condition can be easily met.

As illustrated in Section 2.1, the external auxiliary information can be summarized by population estimating equations formula, where the expectation formula is evaluated with respect to the covariate distribution in the external study. In other words, the external information is given by formula. Under the density ratio model (7), the population moment constraints can be expressed as formula, where formula and the expectation is evaluated with respect to the covariate distribution in the internal study.
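
The change of measure behind this re-expression can be sketched as follows, under assumed notation: the density ratio in model (7) is written as $f_{\mathrm{ext}}(x)/f_{\mathrm{int}}(x) = \exp(\alpha_0 + z^\top \alpha)$, where $z$ is the subvector of $x$ that accounts for the distributional differences, and $g$ denotes the estimating function. The symbols in the original displays may differ.

```latex
\begin{align*}
E_{\mathrm{ext}}\{ g(X; \beta) \}
  &= \int g(x; \beta)\, f_{\mathrm{ext}}(x)\, dx \\
  &= \int g(x; \beta)\, \exp(\alpha_0 + z^\top \alpha)\, f_{\mathrm{int}}(x)\, dx \\
  &= E_{\mathrm{int}}\{ \exp(\alpha_0 + Z^\top \alpha)\, g(X; \beta) \} .
\end{align*}
```

Taking $g \equiv 1$ gives $E_{\mathrm{int}}\{\exp(\alpha_0 + Z^\top \alpha)\} = 1$, which expresses the requirement that the external density integrate to one.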

To synthesize information from a mixture of the subject-level and aggregate data in the presence of population heterogeneity, we consider minimizing the penalized negative log full likelihood formula subject to the constraints

The third constraint follows from the density ratio model (7) such that formula and the last constraint is obtained by approximating formula using the observed subject-level data in the presence of population heterogeneity. In this case, the Lagrange function for the penalized constrained minimization is given by formula, where formula, and formula are Lagrange multipliers. Some calculations lead to formula and formula, where ξ0 and formula are determined by formula and formula. Then the penalized negative constrained log full likelihood function is formula, where, up to a constant, formula. Hence the PEL estimator that accounts for the population heterogeneity is defined as

(8)

Write formula, where formula and formula are the corresponding estimators for formula and formula, respectively. Theorem 4 below shows that the proposed estimator formula also enjoys the oracle property and is asymptotically more efficient than the conventional penalized MLE formula defined by (5). The proof is given in the Supporting Information.

Theorem 4.  Under the conditions specified in Theorem 2, as formula, we have (i) formula with probability tending to 1 and (ii) formula converges in distribution to N(0, 1) for an arbitrary formula vector formula satisfying formula with formula and formula being defined in the Supporting Information.

2.4 Synthesizing Auxiliary Information with Uncertainty

We have assumed in the previous discussions that the auxiliary information comes from a large database where the sample size of the external data, denoted by m, satisfies formula as formula. Therefore, the variability in the estimates derived from the external data is much smaller than that in the estimation of formula using the subject-level data and hence can be ignored. We now consider the case where m is of the same order as n, that is, assume that formula converges to a constant κ as formula. In this case, it is desirable to incorporate the uncertainty of the auxiliary information in the proposed inference procedure. Denote by formula the root-m consistent estimator of the population parameter formula using the external data and assume that formula is asymptotically normal. Along the same line as in Section 2.2, one can replace formula with formula in the derivation and estimate formula by

The large sample properties of the proposed estimator formula are summarized below in Theorem 5, with the proof given in the Supporting Information.

Theorem 5.  Under the conditions specified in Theorem 2, as formula, we have (i) formula with probability tending to 1 and (ii) formula converges in distribution to N(0, 1) for an arbitrary formula vector formula satisfying formula with formula and formula being defined in the Supporting Information.

By Theorem 5, the proposed estimator formula enjoys the oracle property and is asymptotically more efficient than the penalized MLE formula given in (5). The efficiency improvement decreases with κ. When formula, that is, κ is a very small constant, formula is approximately formula and thus formula enjoys a substantial efficiency gain. When formula, the efficiency gain is close to 0. Moreover, we propose a modified variance estimator that properly adjusts for the uncertainty in the auxiliary information. Specifically, the covariance matrix of nonzero coefficients can be consistently estimated by

where formula, formula, and formula are given in the Supporting Information.

3 Simulations and Data Analysis

3.1 Numerical Simulations

Simulation studies were conducted to evaluate the finite-sample performances of the proposed PEL approaches and the penalized maximum likelihood (PML) approach that does not incorporate the auxiliary information. In each simulation setting, we generated X1 from a Bernoulli distribution with formula and formula independently from the standard normal distribution with formula. The binary outcome Y was generated from the logistic regression model formula, where formula, formula, and thus the first five predictors are considered important. We assumed formula under the density ratio model (7). In each simulation, we generated 500 simulated data sets, each with a sample size of 200.
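
A data-generating step of this kind can be sketched as follows. Several design values are not recoverable from the text above and are therefore assumptions of the sketch, not the settings used in the simulations: the number of predictors (`p=20`), the intercept (`beta0=-1.0`), the Bernoulli success probability (0.5), and independence among the normal predictors. The five nonzero coefficients are the true values listed in the note to Table 2, and the sample size of 200 is as stated above.

```python
import math
import random

def simulate_dataset(n=200, p=20, beta0=-1.0,
                     beta_signal=(0.3, 0.2, 0.2, 0.4, 0.4), seed=None):
    """Draw one data set from a logistic model with five important predictors.

    p, beta0, the Bernoulli probability 0.5, and independent normals are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    beta = list(beta_signal) + [0.0] * (p - len(beta_signal))
    X, Y = [], []
    for _ in range(n):
        # X1 is binary; the remaining p - 1 predictors are standard normal.
        x = [float(rng.random() < 0.5)] + [rng.gauss(0.0, 1.0) for _ in range(p - 1)]
        eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
        prob = 1.0 / (1.0 + math.exp(-eta))
        X.append(x)
        Y.append(1 if rng.random() < prob else 0)
    return X, Y, beta


X, Y, beta = simulate_dataset(seed=1)
print(len(X), len(X[0]), sum(Y) / len(Y))
```

Repeating the call with 500 different seeds would give one replicate set of simulated data under these assumed settings.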

We first assumed that the homogeneity assumption holds, that is, formula in the density ratio model, and considered two forms of external aggregated information: (I) the conditional probabilities of formula for two subgroups formula and formula and the means of X1 and X2 and (II) the regression coefficients in a reduced logistic model with predictors formula. For the PEL method and its extension that accounts for the population heterogeneity (PELformula), the sample size of the external study was large and the uncertainty in the auxiliary information was ignored. For the external information form (I), the probabilities of formula for the two subgroups are approximately 0.31 and 0.38, and the means of X1 and X2 are approximately 0.5 and 0. For the external information form (II), the coefficients in the reduced model given formula are approximately ( − 0.95, 0.28, 0.19). For the approach that accounts for uncertainty in the auxiliary information (PEL*), the auxiliary information was estimated from an independent sample with sample size formula.

Table 1 summarizes the variable selection results by percentages of selecting each important predictor (i.e., true positive percentages), and the average percentage of selecting unimportant predictors (i.e., average false positive percentage). As shown in Table 1, the proposed PEL approaches yield higher variable selection accuracy when compared with the PML approach that does not incorporate auxiliary information. The variable selection accuracy improvement is most clear over X1 and X2 where the auxiliary information was available, ranging from 6.6% to 12.0% in selecting X1 and from 4.0% to 8.8% in selecting X2. Table 2 summarizes the parameter estimation results. As shown in Table 2, all three proposed PEL approaches yield a substantial efficiency gain over the PML approach. The relative efficiency ranges from 1.81 to 2.29 in estimating β1 and ranges from 1.21 to 1.66 in estimating β2.
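
The quoted improvement ranges for X1 and X2 follow directly from the entries of Table 1, as the short arithmetic check below shows. The dictionary simply transcribes the X1 and X2 rows of Table 1 in the order PML, PEL, PELformula, PEL*; the variable and function names are hypothetical.

```python
# Selection percentages from Table 1: (PML, PEL, PELformula, PEL*).
table1 = {
    "form I":  {"X1": (32.8, 43.0, 39.4, 41.6), "X2": (43.2, 49.4, 51.4, 47.2)},
    "form II": {"X1": (32.8, 43.4, 42.4, 44.8), "X2": (43.2, 48.4, 52.0, 49.0)},
}

def improvement_range(var):
    """Smallest and largest gain over PML, in percentage points."""
    gains = [round(v - row[var][0], 1)
             for row in table1.values() for v in row[var][1:]]
    return min(gains), max(gains)


print(improvement_range("X1"))  # (6.6, 12.0)
print(improvement_range("X2"))  # (4.0, 8.8)
```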

TABLE 1

Summary of variable selection results when formula

                     PML     PEL     PELformula   PEL*
External information form (I)
  X1                 32.8    43.0    39.4         41.6
  X2                 43.2    49.4    51.4         47.2
  X3                 43.2    47.0    44.8         46.2
  X4                 76.0    81.4    77.2         81.6
  X5                 76.2    80.2    79.0         80.8
  Average FP         20.7    20.6    20.5         21.1
External information form (II)
  X1                 32.8    43.4    42.4         44.8
  X2                 43.2    48.4    52.0         49.0
  X3                 43.2    44.0    47.8         44.4
  X4                 76.0    81.8    77.4         82.0
  X5                 76.2    80.2    78.8         80.0
  Average FP         20.7    21.0    21.2         20.8

Note. PML, the adaptive lasso penalized maximum likelihood approach; PEL, the adaptive lasso penalized empirical likelihood approach; PELformula, the adaptive lasso penalized empirical likelihood approach accounting for population heterogeneity; PEL*, the adaptive lasso penalized empirical likelihood approach accounting for uncertainty in the auxiliary information; FP, false positive (%). Entries are selection percentages.

TABLE 2

Summary of parameter estimation results when formula

        PML          PEL                PELformula         PEL*
Coef    Bias   SD    Bias   SD   SEE    Bias   SD   SEE    Bias   SD   SEE
External information form (I)
  β1    −16    28    −18    20   24     −18    20   26     −17    21   25
  β2    −8     18    −7     16   7      −7     16   9      −7     16   10
  β3    −8     18    −7     16   16     −8     16   16     −8     16   16
  β4    −7     25    −10    20   16     −10    21   16     −10    20   17
  β5    −7     24    −10    20   16     −10    21   16     −10    20   17
External information form (II)
  β1    −16    28    −18    19   8      −19    19   25     −17    19   12
  β2    −8     18    −9     14   5      −9     14   12     −9     14   8
  β3    −8     18    −9     15   16     −8     16   16     −10    15   16
  β4    −7     25    −11    19   16     −10    22   16     −12    19   17
  β5    −7     24    −10    21   17     −10    21   16     −10    21   18

Note. formula are the regression parameters with the true values of (0.3, 0.2, 0.2, 0.4, 0.4); PML, the adaptive lasso penalized maximum likelihood approach; PEL, the adaptive lasso penalized empirical likelihood approach; PELformula, the adaptive lasso penalized empirical likelihood approach accounting for population heterogeneity; PEL*, the adaptive lasso penalized empirical likelihood approach accounting for uncertainty in the auxiliary information; Bias, SD, and SEE, empirical bias (× 100), empirical standard deviation (× 100), and empirical mean of standard error estimates (× 100).

We then considered the case in which population heterogeneity is present. Two important predictors, formula, and one unimportant predictor, X20, accounted for the predictor distributional differences in the density ratio model with formula. For the proposed PEL approach and its extension PELformula, the uncertainty in the auxiliary information was ignored. Under the heterogeneous population setting, for the external information form (I), the probabilities of formula for the two subgroups are approximately 0.32 and 0.39, and the means of X1 and X2 are approximately 0.62 and 0.5. For the external information form (II), the coefficients in the reduced model given formula are approximately ( − 0.92, 0.27, 0.18). For the proposed PEL* method, the auxiliary information was estimated from independent samples of formula subjects in each simulation. Tables 3 and 4 summarize the variable selection and parameter estimation results, respectively. As expected, the proposed PELformula method performs well, maintaining variable selection accuracy and estimation efficiency gains over the PML method. The variable selection improvement ranges from 8.8% to 9.0% in selecting X1 and from 6.0% to 7.0% in selecting X2. The relative efficiency ranges from 2.17 to 2.48 in estimating β1 and from 1.07 to 1.59 in estimating β2. The two approaches developed under the homogeneity assumption (PEL and PEL*), however, yield larger biases.

TABLE 3

Summary of variable selection results when formula

              PML     PEL     PELformula   PEL*
External information form (I)
X1            32.8    42.4    41.8         42.8
X2            43.2    45.2    50.2         47.8
X3            43.2    45.4    44.6         46.6
X4            76.0    79.6    78.2         81.6
X5            76.2    79.4    79.2         81.8
Average FP    20.7    22.2    21.4         21.9
External information form (II)
X1            32.8    39.6    41.6         39.6
X2            43.2    46.2    49.2         46.2
X3            43.2    42.2    45.2         42.2
X4            76.0    79.6    77.2         79.4
X5            76.2    77.4    79.0         78.0
Average FP    20.7    22.9    21.1         22.9

Note. PML, the adaptive lasso penalized maximum likelihood approach; PEL, the adaptive lasso penalized empirical likelihood approach; PELformula, the adaptive lasso penalized empirical likelihood approach accounting for population heterogeneity; PEL*, the adaptive lasso penalized empirical likelihood approach accounting for uncertainty in the auxiliary information; FP, false positive (%).

TABLE 4

Summary of parameter estimation results when formula

Coef    PML           PEL                  PELformula           PEL*
        Bias   SD     Bias   SD    SEE     Bias   SD    SEE     Bias   SD    SEE
External information form (I)
β1      −16    28     −20    17    23      −18    19    26      −20    17    24
β2      −8     18     −12    11    6       −7     17    12      −12    11    9
β3      −8     18     −12    11    16      −9     15    16      −12    11    16
β4      −7     25     −21    14    16      −11    21    16      −21    13    17
β5      −7     24     −21    14    16      −11    21    16      −21    14    16
External information form (II)
β1      −16    28     −21    16    7       −19    18    26      −20    17    10
β2      −8     18     −13    11    4       −9     14    12      −13    10    7
β3      −8     18     −13    11    16      −8     16    16      −13    11    17
β4      −7     25     −19    17    16      −10    22    16      −19    17    16
β5      −7     24     −18    18    16      −11    21    16      −18    18    17

Note. formula are the regression parameters with the true values of (0.3,0.2,0.2,0.4,0.4); PML, the adaptive lasso penalized maximum likelihood approach; PEL, the adaptive lasso penalized empirical likelihood approach; PELformula, the adaptive lasso penalized empirical likelihood approach accounting for population heterogeneity; PEL*, the adaptive lasso penalized empirical likelihood approach accounting for uncertainty in the auxiliary information; Bias, SD, and SEE, empirical bias (× 100), empirical standard deviation (× 100), and empirical mean of standard error estimates (× 100).

3.2 Pediatric Kidney Transplantation Data Analysis

End-stage renal disease (ESRD) is the last stage of chronic kidney disease. The incidence of ESRD in children and adolescents, 0 to 21 years of age, steadily decreased from 17.5 per million population in 2004 to 13.8 per million population in 2016. Over the past few decades, kidney transplantation (KT) has emerged as the optimal treatment of ESRD, providing a significant survival advantage over dialysis. Each year, approximately 800 children in the United States undergo KT, representing about 5% of all kidney transplants in the nation. Outcomes of pediatric KT have improved significantly owing to advances in surgical techniques and in immunosuppressants, including tacrolimus (TAC) and mycophenolate mofetil (MMF). However, rejection after transplant in children remains a serious concern, as it can lead to significant morbidity, graft loss, and death. We investigated important predictors of acute rejection in the pediatric population following the wide adoption of these advanced immunosuppressants.

We analyzed data collected at transplant centers in California from July 2004 to June 2014. The binary outcome, acute rejection, was defined as a recorded rejection episode within 6 months post transplantation. Of a total of 833 eligible pediatric kidney transplants, 13 had undetermined outcomes due to loss to follow-up and were excluded. Among the remaining 820 pediatric kidney transplant recipients (<18 years of age) who received TAC and MMF immunosuppression for prevention of rejection, 61 (7.4%) experienced early acute rejection. We considered 18 potential predictors, including demographic and immunological factors and key interaction terms with donor type, and applied the proposed method for variable selection. As auxiliary information, we used the proportions of deceased donors and older donors, along with the subgroup acute rejection rates by donor type, reported for the entire Organ Procurement and Transplantation Network (OPTN) data. In the OPTN data, the rate of acute rejection was 9.07% among deceased donor transplants and 6.28% among living donor transplants; 61.16% of transplants were from deceased donors, and 25.89% of recipients received organs from donors older than 35 years.
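
As an illustration of how the aggregate summaries quoted above can be written as estimating functions, the following Python sketch evaluates, for a single subject, one function per external summary. The specific form of each function, the function names, and the toy coefficient values are assumptions made for illustration under the homogeneity assumption; they are not taken from the estimating equations defined earlier in the paper.

```python
import math

# Aggregate summaries from the OPTN data quoted in the text.
RATE_DECEASED = 0.0907     # acute rejection rate among deceased donor transplants
RATE_LIVING = 0.0628       # acute rejection rate among living donor transplants
PROP_DECEASED = 0.6116     # proportion of deceased donor transplants
PROP_OLDER_DONOR = 0.2589  # proportion with donors older than 35 years


def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


def fitted_probability(x, beta):
    """Logistic model probability for covariate vector x (intercept included)."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def aux_estimating_functions(prob, deceased, older_donor):
    """Hypothetical estimating functions (an assumed form, for illustration).

    Each entry has mean zero when the fitted model and the external
    summaries describe the same population.
    """
    return [
        deceased * (prob - RATE_DECEASED),      # subgroup rate, deceased donors
        (1 - deceased) * (prob - RATE_LIVING),  # subgroup rate, living donors
        deceased - PROP_DECEASED,               # proportion of deceased donors
        older_donor - PROP_OLDER_DONOR,         # proportion of older donors
    ]


# Toy model with an intercept and a deceased donor indicator only, with
# coefficients chosen so the fitted rates equal the external subgroup rates.
beta = [logit(RATE_LIVING), logit(RATE_DECEASED) - logit(RATE_LIVING)]
prob = fitted_probability([1.0, 1.0], beta)  # a deceased donor recipient
g = aux_estimating_functions(prob, deceased=1, older_donor=0)
```

In an empirical likelihood analysis, weighted sample averages of functions of this kind would be constrained to equal zero, which is how the external summaries restrict the regression parameters.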

We applied the PEL ratio test proposed in Theorem 3 to test the homogeneity assumption. The null hypothesis was rejected at a type I error rate of 5% (formula), suggesting heterogeneity between the subject-level predictor data collected in California and the entire OPTN data. We hence applied the proposed PELformula approach in Section 2.3 to account for the population heterogeneity. Four predictors (deceased donor, African American [AA] race, induction therapy, and the per center total number of pediatric transplants conducted during the study period [>100 versus ⩽100]) were included in the density ratio model (7) to account for heterogeneity in the predictor distributions. These predictors were selected by comparing the reported summary statistics of the predictors in the entire OPTN data (Nehus et al., 2017) with the corresponding sample statistics in the California data (see Table S1 in the Supporting Information). For comparison, we also analyzed the data with the PML method, which does not incorporate the auxiliary information.

Coefficient estimates derived using penalization methods change as a function of the tuning parameter in the penalty term. Figure 1 shows the coefficient estimates as functions of formula for the PML approach and the proposed PELformula approach. Notably, the coefficient estimate curve of donor age (≥35 years) derived using the proposed PELformula method decreases more slowly than that derived using the PML method. The coefficient estimates at the tuning parameter values selected using the Bayesian information criterion (BIC)-type criteria are located where the respective curves cross the vertical lines in the figure. The PML method selects four predictors: delayed graft function, panel reactive antibody (formula), deceased donor, and its interaction with recipient age (≥6 years). By incorporating the external aggregated information, the proposed PELformula method additionally selects donor age (≥35 years) as an important predictor. Donor age has been recognized as a prognostic factor. In October 2005, the OPTN implemented a new allocation policy, known as Share-35 (S35), which gives pediatric recipients high-priority access to kidneys from deceased donors aged <35 years.

Figure 1

The solution paths for the pediatric kidney transplant study

Note. The vertical lines indicate the tuning parameter values selected for the two approaches using the BIC-type criteria. Each curve represents the coefficient estimate trajectory of one predictor (labeled on the right) as a function of formula. PRA, panel reactive antibodies; induction therapy, including interleukin-2 receptor antibody and lymphocyte-depleting agents; number of pediatric transplants, per center total number of pediatric transplants performed during the study period; HLA mismatch, human leukocyte antigen mismatches.

We refitted the model with only the five selected variables, using the proposed extended empirical likelihood (ELformula) approach without penalization. The results obtained using the usual maximum likelihood (ML) approach are reported for comparison. The bootstrap method was used to estimate the standard errors of the regression coefficient estimates and the corresponding confidence intervals. As shown in Table 5, the two methods yield similar coefficient estimates. More importantly, the proposed ELformula method yields a substantial efficiency gain, with the relative efficiency ranging from 1.334 to 4.561. This result is expected because the ELformula method incorporates the external aggregated information. It is worth pointing out that the effects of two known important predictors, donor age (≥35 years) and delayed graft function, do not reach statistical significance under the ML method but are statistically significant under the proposed ELformula method. Specifically, donor age (≥35 years) and delayed graft function are significantly associated with higher acute rejection rates, with odds ratios of formula (95% CI, [1.226, 2.743]) and formula (95% CI, [1.811, 5.490]), respectively, estimated by the ELformula method.
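
The bootstrap standard errors and percentile intervals described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes nonparametric resampling of subjects with replacement; the function name percentile_bootstrap, the seed, and the quantile convention are assumptions; and a simple sample proportion (61 rejections among 820 recipients, as in the study cohort) stands in for the refitted regression coefficients.

```python
import random
import statistics


def percentile_bootstrap(data, estimator, n_boot=500, seed=2024):
    """Return the bootstrap SE and a 95% percentile interval for estimator(data).

    The data are resampled with replacement n_boot times. The SE is the
    standard deviation of the bootstrap estimates, and the interval is given
    by their 2.5% and 97.5% quantiles.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    se = statistics.stdev(estimates)
    # n=40 gives cut points at 2.5%, 5%, ..., 97.5%; keep the first and last.
    cuts = statistics.quantiles(estimates, n=40)
    return se, (cuts[0], cuts[-1])


# Stand-in data: 61 acute rejections among 820 recipients.
outcomes = [1] * 61 + [0] * 759
se, (lower, upper) = percentile_bootstrap(outcomes, statistics.fmean)
```

To mimic the analysis in the text, the estimator argument would be replaced by a function that refits the ML or ELformula model on each resample and returns one regression coefficient.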

TABLE 5

Regression coefficient estimates for the pediatric kidney transplant study

                                          ML                                   ELformula
                                          Est      BSE     95% CI              Est      BSE     95% CI
Deceased donor                            1.067    0.425   (0.017, 1.645)      1.094    0.199   (0.738, 1.534)
Donor age ≥35 years                       0.675    0.343   (−0.207, 1.061)     0.742    0.197   (0.204, 1.009)
Panel reactive antibody formula           0.784    0.272   (0.078, 1.182)      0.810    0.236   (0.330, 1.217)
Delayed graft function                    1.086    0.576   (−0.314, 1.887)     1.092    0.272   (0.594, 1.703)
Deceased donor & recipient age ≥6 years   −0.881   0.319   (−1.288, −0.055)    −0.872   0.230   (−1.349, −0.416)

Note. ML, the unpenalized maximum likelihood approach; ELformula, the unpenalized extended empirical likelihood approach with auxiliary information; Est, the estimated regression coefficient; BSE, bootstrap standard error given by the standard deviation of the 500 bootstrap estimates; 95% CI, the 95% bootstrap confidence interval given by the 2.5th and 97.5th percentiles of the 500 bootstrap estimates.
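
The efficiency and odds ratio figures quoted in the text can be checked against Table 5 with a short Python sketch. It assumes that relative efficiency is the ratio of bootstrap variances (ML over ELformula) and that the odds ratio interval is obtained by exponentiating the coefficient interval endpoints; both are assumptions consistent with the reported numbers rather than definitions stated in this section, and the helper names are hypothetical. Because the table entries are rounded, recomputed values can differ slightly from the reported range.

```python
import math

# (Est, BSE, CI lower, CI upper) copied from Table 5.
ML = {
    "Deceased donor": (1.067, 0.425, 0.017, 1.645),
    "Donor age >=35 years": (0.675, 0.343, -0.207, 1.061),
    "Panel reactive antibody": (0.784, 0.272, 0.078, 1.182),
    "Delayed graft function": (1.086, 0.576, -0.314, 1.887),
    "Deceased donor & recipient age >=6 years": (-0.881, 0.319, -1.288, -0.055),
}
EL = {
    "Deceased donor": (1.094, 0.199, 0.738, 1.534),
    "Donor age >=35 years": (0.742, 0.197, 0.204, 1.009),
    "Panel reactive antibody": (0.810, 0.236, 0.330, 1.217),
    "Delayed graft function": (1.092, 0.272, 0.594, 1.703),
    "Deceased donor & recipient age >=6 years": (-0.872, 0.230, -1.349, -0.416),
}


def relative_efficiency(se_ml, se_el):
    """Assumed definition: ratio of bootstrap variances, ML over EL."""
    return (se_ml / se_el) ** 2


def odds_ratio_interval(lower, upper):
    """Exponentiate log odds ratio interval endpoints to the odds ratio scale."""
    return math.exp(lower), math.exp(upper)


rel_eff = {name: relative_efficiency(ML[name][1], EL[name][1]) for name in ML}
or_ci_donor_age = odds_ratio_interval(*EL["Donor age >=35 years"][2:])
or_ci_dgf = odds_ratio_interval(*EL["Delayed graft function"][2:])

for name, value in rel_eff.items():
    print(f"{name}: relative efficiency = {value:.3f}")
print("Donor age OR 95% CI:", tuple(round(v, 3) for v in or_ci_donor_age))
print("Delayed graft function OR 95% CI:", tuple(round(v, 3) for v in or_ci_dgf))
```

With the rounded table entries, the largest recomputed relative efficiency is for deceased donor and the smallest is for panel reactive antibody.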

4 Discussion

In this paper, we have proposed a PEL approach to improve variable selection accuracy and estimation efficiency by synthesizing information from a mixture of high-dimensional subject-level data and low-dimensional aggregate data. The proposed approach is extended to account for population heterogeneity and uncertainty in the auxiliary information. Although we focus on logistic regression, the proposed approaches can be easily extended to handle more general parametric models. Moreover, our work assumes that the covariates formula in the density ratio model are of a fixed dimension and are correctly specified. When such knowledge is not available, one can set formula and impose a sparsity assumption that only a small subset of formula accounts for the heterogeneity in the predictor distributions. One can then include a proper penalty for formula in the penalized log likelihood for variable selection in formula. This will be investigated in our future research.

Data Availability Statement

The data that support the findings of the pediatric kidney transplant data analysis in this paper are available from the Organ Procurement and Transplantation Network (OPTN). The general public can submit a request to OPTN for the data at the following URL: https://optn.transplant.hrsa.gov/data/request-data/.

Acknowledgments

This work was supported by National Institutes of Health grant R01CA193888. This work was supported in part by Health Resources and Services Administration contract 234-2005-37011C. The authors thank Dr. Edward at the Cincinnati Children's Hospital Medical Center for providing clinical insight to the analysis of the pediatric kidney transplant study.

References

Boonstra, P.S., Taylor, J.M. and Mukherjee, B. (2016) Increasing efficiency for estimating treatment–biomarker interactions with historical data. Statistical Methods in Medical Research, 25, 2959–2971.

Chang, J., Tang, C.Y. and Wu, T.T. (2018) A new scope of penalized empirical likelihood with high-dimensional estimating equations. The Annals of Statistics, 46, 3185–3216.

Chatterjee, N., Chen, Y.-H., Maas, P. and Carroll, R.J. (2016) Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111, 107–117.

Chen, B. and Qin, J. (2014) Use of empirical likelihood to calibrate auxiliary information in partly linear monotone regression models. Statistics in Medicine, 33, 1713–1722.

Chen, J., Pee, D., Ayyagari, R., Graubard, B., Schairer, C., Byrne, C., Benichou, J. and Gail, M.H. (2006) Projecting absolute invasive breast cancer risk in white women with a model that includes mammographic density. Journal of the National Cancer Institute, 98, 1215–1226.

Chen, J., Sitter, R. and Wu, C. (2002) Using empirical likelihood methods to obtain range restricted weights in regression estimators for surveys. Biometrika, 89, 230–237.

Chen, S.X., Peng, L. and Qin, Y.-L. (2009) Effects of data dimension on empirical likelihood. Biometrika, 96, 711–722.

Costantino, J.P., Gail, M.H., Pee, D., Anderson, S., Redmond, C.K., Benichou, J. and Wieand, H.S. (1999) Validation studies for models projecting the risk of invasive and total breast cancer incidence. Journal of the National Cancer Institute, 91, 1541–1548.

Donald, S.G., Imbens, G.W. and Newey, W.K. (2003) Empirical likelihood estimation and consistent tests with conditional moment restrictions. Journal of Econometrics, 117, 55–93.

Evangelou, E. and Ioannidis, J.P.A. (2013) Meta-analysis methods for genome-wide association studies and beyond. Nature Reviews Genetics, 14, 379–389.

Fan, J. and Li, R. (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.

Fan, J. and Peng, H. (2004) Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32, 928–961.

Gail, M.H. (2011) Personalized estimates of breast cancer risk in clinical practice and public health. Statistics in Medicine, 30, 1090–1104.

Gail, M.H., Brinton, L.A., Byar, D.P., Corle, D.K., Green, S.B., Schairer, C. and Mulvihill, J.J. (1989) Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. Journal of the National Cancer Institute, 81, 1879–1886.

Han, P. and Lawless, J.F. (2019) Empirical likelihood estimation using auxiliary summary information with different covariate distributions. Statistica Sinica, 29, 1321–1342.

He, Q., Zhang, H.H., Avery, C.L. and Lin, D. (2016) Sparse meta-analysis with high-dimensional data. Biostatistics, 17, 205–220.

Hjort, N.L., McKeague, I.W. and Van Keilegom, I. (2009) Extending the scope of empirical likelihood. The Annals of Statistics, 37, 1079–1111.

Huang, C.-Y., Qin, J. and Tsai, H.-T. (2016) Efficient estimation of the Cox model with auxiliary subgroup survival information. Journal of the American Statistical Association, 111, 787–799.

Imbens, G.W. (2002) Generalized method of moments and empirical likelihood. Journal of Business & Economic Statistics, 20, 493–506.

Leng, C. and Tang, C.Y. (2012) Penalized empirical likelihood and growing dimensional general estimating equations. Biometrika, 99, 703–716.

Liu, D., Zheng, Y., Prentice, R.L. and Hsu, L. (2014) Estimating risk with time-to-event data: an application to the Women's Health Initiative. Journal of the American Statistical Association, 109, 514–524.

Nehus, E.J., Liu, C., Lu, B., Macaluso, M. and Kim, M.-O. (2017) Graft survival of pediatric kidney transplant recipients selected for de novo steroid avoidance: a propensity score-matched study. Nephrology Dialysis Transplantation, 32, 1424–1431.

Newey, W.K. and Smith, R.J. (2004) Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica, 72, 219–255.

Owen, A.B. (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75, 237–249.

Owen, A.B. (1990) Empirical likelihood ratio confidence regions. The Annals of Statistics, 18, 90–120.

Pasaniuc, B. and Price, A.L. (2017) Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics, 18, 117–127.

Qin, J. (2000) Combining parametric and empirical likelihoods. Biometrika, 87, 484–490.

Qin, J. and Lawless, J. (1994) Empirical likelihood and general estimating equations. The Annals of Statistics, 22, 300–325.

Qin, J. and Lawless, J. (1995) Estimating equations, empirical likelihood and constraints on parameters. Canadian Journal of Statistics, 23, 145–159.

Qin, J., Zhang, H., Li, P., Albanes, D. and Yu, K. (2015) Using covariate-specific disease prevalence information to increase the power of case-control studies. Biometrika, 102, 169–180.

Simmonds, M. and Higgins, J. (2007) Covariate heterogeneity in meta-analysis: criteria for deciding between meta-regression and individual patient data. Statistics in Medicine, 26, 2982–2999.

Tang, C.Y. and Leng, C. (2010) Penalized high-dimensional empirical likelihood. Biometrika, 97, 905–920.

Thall, P.F. and Simon, R. (1990) Incorporating historical control data in planning phase II clinical trials. Statistics in Medicine, 9, 215–228.

Wu, C. and Sitter, R.R. (2001) A model-calibration approach to using complete auxiliary information from survey data. Journal of the American Statistical Association, 96, 185–193.

Zou, H. (2006) The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.

Zou, H. and Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67, 301–320.

Zou, H. and Zhang, H.H. (2009) On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37, 1733–1751.

Appendix

(C1) The covariate vector formula is bounded with probability one. The true value formula lies in a compact subset of formula.

(C2) The functions formula and formula are continuous in a neighborhood of the true value formula. Moreover, the functions formula, formula, and formula are bounded by some integrable functions in this neighborhood.

(C3) Let formula and formula denote the smallest and largest eigenvalues of a positive definite matrix M, respectively. Assume that there exist constants formula and formula such that

Conditions (C1) and (C2) are commonly imposed for estimating functions under the empirical likelihood framework. Condition (C3) assumes that the Fisher information matrix under the logistic regression model is positive definite and its eigenvalues are bounded. Similar conditions have been assumed in Fan and Peng (2004) and Zou and Zhang (2009) under different model assumptions.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
