-
PDF
- Split View
-
Views
-
Cite
Cite
Thuan Nguyen, Jiming Jiang, Restricted fence method for covariate selection in longitudinal data analysis, Biostatistics, Volume 13, Issue 2, April 2012, Pages 303–314, https://doi.org/10.1093/biostatistics/kxr046
Close -
Share
Abstract
Fence method (Jiang and others 2008. Fence methods for mixed model selection. Annals of Statistics36, 1669–1692) is a recently proposed strategy for model selection. It was motivated by the limitation of the traditional information criteria in selecting parsimonious models in some nonconventional situations, such as mixed model selection. Jiang and others (2009. A simplified adaptive fence procedure, Statistics & Probability Letters79, 625–629) simplified the adaptive fence method of Jiang and others (2008) to make it more suitable and convenient to use in a wide variety of problems. Still, the current modification encounters computational difficulties when applied to high-dimensional and complex problems. To address this concern, we proposed a restricted fence procedure that combines the idea of the fence with that of the restricted maximum likelihood. Furthermore, we propose to use the wild bootstrap for choosing adaptively the tuning parameter used in the restricted fence. We focus on problems of longitudinal studies and demonstrate the performance of the new procedure and its comparison with other procedures of variable selection, including the information criteria and shrinkage methods, in simulation studies. The method is further illustrated by an example of real-data analysis.
1. INTRODUCTION
Recently, Jiang and others (2008) developed a new strategy for model selection, known as the “fence” methods. The basic idea is to build a statistical fence, or barrier, to carefully isolate a subgroup of what are known as the correct models. Once the fence is constructed, the optimal model is selected from those within the fence according to a criterion which can incorporate quantities of practical interest. Jiang and others (2009) developed a simplified adaptive fence (SAF) procedure to reduce the computational burden of the adaptive fence of Jiang and others (2008) (see Section 2.2 for more details). A summary of the fence methods is provided in the supplementary appendix available at Biostatistics online. On the other hand, even with the SAF, one may still encounter computational difficulties when applying the fence to high-dimensional and complex problems. The main difficulty rests in the evaluation of a large number of measures of lack-of-fit for every bootstrap sample if, for example, the number of candidate variables is fairly large. Furthermore, as in Jiang and others (2008), the adaptive fence or SAF involve bootstrapping under the full model. Such a procedure may not be robust and can be time consuming if the full model is complex.
To address these concerns, we propose a restricted fence procedure that combines the idea of the fence with that of the restricted maximum likelihood (REML). We show how to implement the restricted fence via a wild bootstrap procedure, whose validity is discussed. Finite sample performance of the restricted fence is studied as well as its comparison with the information criteria and shrinkage methods of variable selection in a number of simulation studies. The method is further illustrated using a real-data example. Further results and technical derivations are deferred to the supplementary appendix available at Biostatistics online.
2. RESTRICTED FENCE PROCEDURE
2.1 Method
The REML is well known in mixed model analysis (e.g. Jiang, 2007). The idea is to first apply a transformation to the data to get rid of the (nuisance) fixed effects. Maximum likelihood (ML) is then applied with the transformed data to estimate the variance components. The transformation is constructed so that there is no loss of information in estimating the variance components. Our idea is to combine the ideas of REML and SAF to come up with a strategy for variable selection. We focus on longitudinal studies in which the mean response is often of main interest. As a result, selection of the fixed covariates that are directly associated with the mean is of main interest. Quite often in such studies, the number of candidate covariates, or variables, is fairly large. Thus, as noted, direct application of the fence may encounter computational difficulties. To reduce the computational difficulty, we first apply a transformation to the data that is orthogonal to a (large) subset of the candidate variables to make them “disappear.” The SAF is then applied to the remaining (small) subset of the candidate variables. The term “restricted” is used because the first step of the proposed procedure involves the same transformation of the data as in REML (e.g. Jiang, 2007, p. 13); however, there is no estimation of the variance components.





The point is that S1 can be any subset of the candidate variables. Thus, by dividing S into a number of subsets and applying the above method to every subset, a number of variables are selected from each subset (or no variable is selected from the subset). Finally, the SAF is applied to all the variables that are picked up from the subsets to select the final set of covariate variables.
2.2 Algorithm
A numerical algorithm for the restricted fence procedure is given below:
For the candidate variables xj,1 ≤ j ≤ J, determine a division S = {1,…,J} = 𝒮1∪…∪𝒮q, where 𝒮r,1 ≤ r ≤ q are subsets of S (not necessarily disjoint).
Let S1 = 𝒮1 and S2 = S∖S1. Apply the SAF using the measure of lack-of-fit (2.1) to select the variables among xj,j∈S1. The SAF consists of the following steps:
2.1. Estimating the parameters under the restricted full model ( = {xj,j∈S1}).
2.2. Bootstrapping under the restricted full model; for each bootstrapped sample, select the optimal model using the fence [see (A.2) of the supplementary appendix available at Biostatistics online] for each c among a grid 0 < c1 < ⋯ < cK.
2.3. For each c among the grid, compute the frequency, over the bootstrap samples, that each candidate model is selected as the optimal model; compute the maximum frequency, denoted by p*. Note that p* depends on c.
2.4. Find a peak in the middle of the plot of p* against c; let c* be the c corresponding to the peak; use the fence with c = c* to select the final optimal model.
Apply the same procedure as Step 2 to 𝒮2,…,𝒮q.
Apply another SAF to the subset of variables selected in Step 2 and Step 3 (combined; considered as the new candidate variables) to select the final variables.
3. RESTRICTED FENCE FOR LONGITUDINAL DATA

While there is an extensive literature on modeling the correlation structures, parameter estimation, and inference about the mean response (e.g. Diggle and others, 2002, Jiang, 2007), longitudinal model selection has received much less attention. In particular, there is a lack of theoretical development regarding model selection criteria due to the nonconventional features of the longitudinal data (see Jiang and others, 2008). Although a practioners may employ a number of heuristic selection criteria, such as the AIC (Akaike, 1973), BIC (Schwarz, 1978), HQ (Hannan and Quinn, 1979), and CAIC (or consistent AIC; see Bozdogan, 1987), the theoretical bases for these methods have not been justified in the longitudinal setting. In fact, our simulation results (see below) show that some of these methods may perform poorly in selecting parsimonious models for longitudinal studies. On the other hand, the fence methods (Jiang and others, 2008) were developed for dealing with nonconventional model selection problems. In particular, the restricted fence method introduced in Section 2 applies naturally to longitudinal variable selection problems. Such a problem is motivated by practical problems of longitudinal studies, which often involve many potential variables.
The measure of lack-of-fit, QM, for the restricted fence is chosen as the RSS as in Section 2. The measure is computationally easy to operate, which is very important for high-dimensional model selection problems. Note that an explicit expression of is given by (2.1). Furthermore, in our simulation study, restricted fence based on RSS performs very well as compared with other methods (see below for further discussion).
3.1 Simulation study
We consider the following linear mixed model: yij = xij′β + υi + ϵij, i = 1,…,n; j = 1,…,T, where i represents the ith subject, and j the jth time point; υi∼N(0,σv2), ϵij∼N(0,σe2), and υis and ϵijs are independent.
The data were simulated to mimic a real data set regarding a bone turnover study. The data for the study were collected over 3 time points (within 12 months study period). The participants were women (18–40 years of age) in 2 dietary groups, vegan, or omnivore. The outcome of interest was a marker of bone formation (Osteocalcin), measured over time with respect to dietary groups. The covariates including 30 variables. See supplementary appendix available at Biostatistics online for a list of these variables.
As measurements were collected at 3 time points, we have T = 3. We consider 3 cases: n = 50,n = 100, and n = 150, where n is the number of subjects. In the real data set, there were 48 participants, 24 of them are vegan and the others are omnivore. Thus, we set up the simulations in a similar way so that half of the subjects are in each dietary group. The continuous covariates are generated under the normal distribution with the mean and standard deviation (SD) equal to those obtained from the real data set with respect to time and group categories, disregarding the missing values.
The true model used for the simulation includes the following variables: time, dietary group, height, N-telopeptide, and crude calcium balance. The true regression coefficients are β = (1,1,1,1,0.05,0.25,0.001)′, corresponding to the intercept, 2 time-point indicators, the dietary indicator, and the 3 continuous variables. These coefficients are set to be similar to those obtained from the real data under the full model except for the coefficient of crude calcium balance. The latter variable is known to be associated with the bone metabolism (Anderson and Garner, 1995). Yet, it is not a significant variable according to the real-data analysis in that its coefficient under the full model was quite small. Thus, in the simulation, we bring this variable into the true model (due to its practical interest) by increasing the value of its coefficient to 0.001. The variances of the subject-specific random effects and random errors, σv2 and σe2 are set to be 1, which is close to their estimates from the real data. The number of bootstrap samples for the restricted fence is 100. A total of 100 simulations are run under each sample size.
For the restricted fence, we divide all potential predictors into 4 groups according to biological considerations. The variable groups are Group A-1: 1–8, Group A-2: 9–15, Group A-3: 16–22, and Group A-4: 23–30, where the variable numbers correspond to those listed in Section A.3 of the supplementary appendix available at Biostatistics online. The true variables correspond to the numbers 1, 2, 3, 4, 9, 10, and 27 on the list. We then apply the SAF procedure to each group based on the transformed data (see Section 2). The results are reported in Table 1. Same data comparisons are made with 4 of the traditional information criteria, AIC, BIC, HQ, and CAIC. Due to the high dimensional and complex data structure, the forward/backward (F/B) procedure of Broman and Speed (2002) is applied, where the forward selection stops when 50% of the candidate variables are selected, which is then followed by the backward elimination. We then apply AIC, BIC, HQ, and CAIC to the sequence of models generated by the F/B procedure and choose the model with minimum AIC, BIC, HQ, or CAIC, respectively, as the optimal model.
Empirical results. See Section 3.1 for notation
| n | Summary | RF | BIC | CAIC | HQ | AIC | ALASSO | SCAD |
| 50 | TP | 53 | 37 | 33 | 28 | 0 | 15 | 0 |
| UF | 26 | 46 | 39 | 39 | 14 | 32 | 29 | |
| OF | 21 | 17 | 28 | 33 | 86 | 53 | 71 | |
| MC | 6.73 | 6.54 | 6.61 | 6.61 | 6.86 | 6.66 | 6.63 | |
| SD | (0.46) | (0.50) | (0.49) | (0.49) | (0.34) | (0.52) | (0.80) | |
| MIC | 0.34 | 0.39 | 0.68 | 0.77 | 4.28 | 1.84 | 1.30 | |
| SD | (0.60) | (0.69) | (0.88) | (0.88) | (1.93) | (1.98) | (0.48) | |
| 100 | TP | 85 | 65 | 54 | 47 | 0 | 30 | 0 |
| UF | 5 | 7 | 4 | 3 | 1 | 2 | 44 | |
| OF | 10 | 28 | 42 | 50 | 99 | 68 | 56 | |
| MC | 6.94 | 6.92 | 6.96 | 6.97 | 6.99 | 6.98 | 6.28 | |
| SD | (0.27) | (0.30) | (0.19) | (0.17) | (0.10) | (0.14) | (1.05) | |
| MIC | 0.12 | 0.33 | 0.52 | 0.72 | 4.70 | 1.76 | 6.72 | |
| SD | (0.38) | (0.53) | (0.65) | (0.87) | (2.03) | (2.05) | (0.58) | |
| 150 | TP | 96 | 82 | 69 | 56 | 0 | 37 | 0 |
| UF | 3 | 0 | 0 | 0 | 2 | 1 | 86 | |
| OF | 1 | 18 | 31 | 44 | 98 | 62 | 14 | |
| MC | 6.97 | 7.00 | 7.00 | 7.00 | 6.98 | 6.99 | 5.39 | |
| SD | (0.17) | (0.00) | (0.00) | (0.00) | (0.14) | (0.10) | (1.43) | |
| MIC | 0.01 | 0.22 | 0.38 | 0.58 | 4.14 | 1.45 | 10.58 | |
| SD | (0.10) | (0.50) | (0.63) | (0.76) | (2.01) | (1.85) | (0.93) |
| n | Summary | RF | BIC | CAIC | HQ | AIC | ALASSO | SCAD |
| 50 | TP | 53 | 37 | 33 | 28 | 0 | 15 | 0 |
| UF | 26 | 46 | 39 | 39 | 14 | 32 | 29 | |
| OF | 21 | 17 | 28 | 33 | 86 | 53 | 71 | |
| MC | 6.73 | 6.54 | 6.61 | 6.61 | 6.86 | 6.66 | 6.63 | |
| SD | (0.46) | (0.50) | (0.49) | (0.49) | (0.34) | (0.52) | (0.80) | |
| MIC | 0.34 | 0.39 | 0.68 | 0.77 | 4.28 | 1.84 | 1.30 | |
| SD | (0.60) | (0.69) | (0.88) | (0.88) | (1.93) | (1.98) | (0.48) | |
| 100 | TP | 85 | 65 | 54 | 47 | 0 | 30 | 0 |
| UF | 5 | 7 | 4 | 3 | 1 | 2 | 44 | |
| OF | 10 | 28 | 42 | 50 | 99 | 68 | 56 | |
| MC | 6.94 | 6.92 | 6.96 | 6.97 | 6.99 | 6.98 | 6.28 | |
| SD | (0.27) | (0.30) | (0.19) | (0.17) | (0.10) | (0.14) | (1.05) | |
| MIC | 0.12 | 0.33 | 0.52 | 0.72 | 4.70 | 1.76 | 6.72 | |
| SD | (0.38) | (0.53) | (0.65) | (0.87) | (2.03) | (2.05) | (0.58) | |
| 150 | TP | 96 | 82 | 69 | 56 | 0 | 37 | 0 |
| UF | 3 | 0 | 0 | 0 | 2 | 1 | 86 | |
| OF | 1 | 18 | 31 | 44 | 98 | 62 | 14 | |
| MC | 6.97 | 7.00 | 7.00 | 7.00 | 6.98 | 6.99 | 5.39 | |
| SD | (0.17) | (0.00) | (0.00) | (0.00) | (0.14) | (0.10) | (1.43) | |
| MIC | 0.01 | 0.22 | 0.38 | 0.58 | 4.14 | 1.45 | 10.58 | |
| SD | (0.10) | (0.50) | (0.63) | (0.76) | (2.01) | (1.85) | (0.93) |
RF, restricted fence procedure; BIC, the F/B BIC procedure; CAIC, the F/B CAIC procedure; HQ, the F/B HQ procedure; AIC, the F/B AIC procedure; ALASSO, the adaptive Lasso procedure; SCAD, the SCAD procedure.
Empirical results. See Section 3.1 for notation
| n | Summary | RF | BIC | CAIC | HQ | AIC | ALASSO | SCAD |
| 50 | TP | 53 | 37 | 33 | 28 | 0 | 15 | 0 |
| UF | 26 | 46 | 39 | 39 | 14 | 32 | 29 | |
| OF | 21 | 17 | 28 | 33 | 86 | 53 | 71 | |
| MC | 6.73 | 6.54 | 6.61 | 6.61 | 6.86 | 6.66 | 6.63 | |
| SD | (0.46) | (0.50) | (0.49) | (0.49) | (0.34) | (0.52) | (0.80) | |
| MIC | 0.34 | 0.39 | 0.68 | 0.77 | 4.28 | 1.84 | 1.30 | |
| SD | (0.60) | (0.69) | (0.88) | (0.88) | (1.93) | (1.98) | (0.48) | |
| 100 | TP | 85 | 65 | 54 | 47 | 0 | 30 | 0 |
| UF | 5 | 7 | 4 | 3 | 1 | 2 | 44 | |
| OF | 10 | 28 | 42 | 50 | 99 | 68 | 56 | |
| MC | 6.94 | 6.92 | 6.96 | 6.97 | 6.99 | 6.98 | 6.28 | |
| SD | (0.27) | (0.30) | (0.19) | (0.17) | (0.10) | (0.14) | (1.05) | |
| MIC | 0.12 | 0.33 | 0.52 | 0.72 | 4.70 | 1.76 | 6.72 | |
| SD | (0.38) | (0.53) | (0.65) | (0.87) | (2.03) | (2.05) | (0.58) | |
| 150 | TP | 96 | 82 | 69 | 56 | 0 | 37 | 0 |
| UF | 3 | 0 | 0 | 0 | 2 | 1 | 86 | |
| OF | 1 | 18 | 31 | 44 | 98 | 62 | 14 | |
| MC | 6.97 | 7.00 | 7.00 | 7.00 | 6.98 | 6.99 | 5.39 | |
| SD | (0.17) | (0.00) | (0.00) | (0.00) | (0.14) | (0.10) | (1.43) | |
| MIC | 0.01 | 0.22 | 0.38 | 0.58 | 4.14 | 1.45 | 10.58 | |
| SD | (0.10) | (0.50) | (0.63) | (0.76) | (2.01) | (1.85) | (0.93) |
| n | Summary | RF | BIC | CAIC | HQ | AIC | ALASSO | SCAD |
| 50 | TP | 53 | 37 | 33 | 28 | 0 | 15 | 0 |
| UF | 26 | 46 | 39 | 39 | 14 | 32 | 29 | |
| OF | 21 | 17 | 28 | 33 | 86 | 53 | 71 | |
| MC | 6.73 | 6.54 | 6.61 | 6.61 | 6.86 | 6.66 | 6.63 | |
| SD | (0.46) | (0.50) | (0.49) | (0.49) | (0.34) | (0.52) | (0.80) | |
| MIC | 0.34 | 0.39 | 0.68 | 0.77 | 4.28 | 1.84 | 1.30 | |
| SD | (0.60) | (0.69) | (0.88) | (0.88) | (1.93) | (1.98) | (0.48) | |
| 100 | TP | 85 | 65 | 54 | 47 | 0 | 30 | 0 |
| UF | 5 | 7 | 4 | 3 | 1 | 2 | 44 | |
| OF | 10 | 28 | 42 | 50 | 99 | 68 | 56 | |
| MC | 6.94 | 6.92 | 6.96 | 6.97 | 6.99 | 6.98 | 6.28 | |
| SD | (0.27) | (0.30) | (0.19) | (0.17) | (0.10) | (0.14) | (1.05) | |
| MIC | 0.12 | 0.33 | 0.52 | 0.72 | 4.70 | 1.76 | 6.72 | |
| SD | (0.38) | (0.53) | (0.65) | (0.87) | (2.03) | (2.05) | (0.58) | |
| 150 | TP | 96 | 82 | 69 | 56 | 0 | 37 | 0 |
| UF | 3 | 0 | 0 | 0 | 2 | 1 | 86 | |
| OF | 1 | 18 | 31 | 44 | 98 | 62 | 14 | |
| MC | 6.97 | 7.00 | 7.00 | 7.00 | 6.98 | 6.99 | 5.39 | |
| SD | (0.17) | (0.00) | (0.00) | (0.00) | (0.14) | (0.10) | (1.43) | |
| MIC | 0.01 | 0.22 | 0.38 | 0.58 | 4.14 | 1.45 | 10.58 | |
| SD | (0.10) | (0.50) | (0.63) | (0.76) | (2.01) | (1.85) | (0.93) |
RF, restricted fence procedure; BIC, the F/B BIC procedure; CAIC, the F/B CAIC procedure; HQ, the F/B HQ procedure; AIC, the F/B AIC procedure; ALASSO, the adaptive Lasso procedure; SCAD, the SCAD procedure.
In addition, there have been several shrinkage variable selection methods, following the Lasso (Tibshirani, 1996). We make the same data comparisons of our method with 2 of the most popular shrinkage methods, the adaptive Lasso (Zou, 2006) and the smoothly clipped absolute deviation method (SCAD; Fan and Li, 2001; also see Fan and Lv, 2008). It has been shown (Zou, 2006) that the Lasso is not consistent for model selection while the adaptive Lasso is. Therefore, our comparison focuses on the latter. It should also be pointed out that there have been recent work on simultaneous selection of fixed and random effects in linear mixed effects models using the shrinkage methods (Bondell and others, 2010, Ibrahim and others, 2011). However, because our focus is selection of the fixed covariates only, it seems more fair to compare with the shrinkage methods that focus on the fixed covariates, namely the adaptive Lasso and SCAD, even though the latter use regression-based measures of lack-of-fit. Interestingly, all the methods being compared, including our method (see the last paragraph before Section 3.1), AIC, BIC, HQ, and CAIC, use regression-based measures of lack-of-fit.
Table 1 summarizes the performance of the restricted fence comparing with those of the (F/B) BIC, CAIC, HQ, and AIC procedures as well as the adaptive Lasso and SCAD. For the adaptive Lasso, we use the function adalasso() from the parcor package in R, with the regularization parameter chosen by the cross-validation (which is the default method). For SCAD, we use the function GLMvanISISscad() in the SIS package in R, with the regularization parameter chosen by the BIC method. In Table 1, RF, AIC, BIC, HQ, CAIC, and ALASSO stand for the restricted fence, the F/B AIC, BIC, HQ, CAIC, and adaptive Lasso, respectively. Note that the true model includes 7 variables. Here, true positive (TP) means identifying exactly the true variables; underfitting (UF) means that at least one true variable is missing in the selected model (which may also include extraneous variables); overfitting (OF) means that the selected model includes all the true variables plus at least one extraneous variable; TP, UF, and OF are in percentages of empirical probabilities; and MC (MIC) is the empirical mean number of correctly (incorrectly) selected variables. The corresponding empirical SD are in the parentheses. Overall, the restricted fence seems to outperform, significantly, all the other procedures, both in terms of the (empirical) probability of correct selection and in terms of the (empirical) mean and SD of the number of incorrectly selected variables. Some plots of p* vs. c are shown in Figures 1 and 2 as illustrations.
Plots from the group stage of the restricted fence in one simulation. The four p* vs. c plots correspond to the 4 bins of variables. Five variables are picked up from the upper left plot; 1 variable is selected from the upper right plot; no variable is selected from the lower left plot; and one more variable is picked from the lower right plot.1
Plots from the group stage of the restricted fence in one simulation. The four p* vs. c plots correspond to the 4 bins of variables. Five variables are picked up from the upper left plot; 1 variable is selected from the upper right plot; no variable is selected from the lower left plot; and one more variable is picked from the lower right plot.1
Plot of p* vs. c from the final stage of the restricted fence for the same simulation (as the one in Fig. 1). Seven variables are selected from this plot.
Plot of p* vs. c from the final stage of the restricted fence for the same simulation (as the one in Fig. 1). Seven variables are selected from this plot.
3.2 Wild bootstrapping
As mentioned, a key step of the restricted fence is bootstrapping. This is relatively straightforward if the random effects are not present, that is, if the components of ϵ in Section 2.1 are i.i.d. In fact, in this case, all one needs to do is to (i) obtain an estimate of the variance of ϵi under the full model, say, ; and (ii) bootstrap the components of ϵ* independently from the distribution. However, under the mixed linear model (3.1), the situation is more complicated.
Ideally, the bootstrapping should be done under the full model of (3.1). To do so, one needs to (a) estimate the parameters, which include the fixed effects β and all the variance components associated with the distributions of αi and ϵi; (b) draw samples αi*,ϵi*,i = 1,…,n, from the assumed distributions of αi and ϵi, respectively, treating the estimated variance components as the true parameters; and (c) use , i = 1,…,n, to generate the bootstrap samples, where Xf,i is the covariate matrix under the full model, and the estimator of β under the full model. We call such a procedure linear mixed model bootstrapping.
However, there are practical reasons that bootstrapping under the full linear mixed model as above may not be robust. For example, the standard procedures of fitting the linear mixed model (3.1), which are ML and REML, involve numerically solving nonlinear maximization problems or equations. Although these procedures are available in standard software packages, such as SAS, S-plus and R, nonconvergence, false convergence, and convergence to local maximums often occur in practice. In such cases, the variance components under the full linear mixed model may be poorly estimated, which results in poor bootstrap approximations, as in step (b) above. This is confirmed, for example, in our simulation studies in which we found that the restricted fence performs significantly better using the wild bootstrapping method, described below, than using the linear mixed model boostrapping.
The “wild bootstrap” was proposed by Liu (1988) following a suggestion of Wu (1986). Also see Beran (1986). Suppose that we are interested in estimating the mean function of the data, which are independent but not identically distributed. Suppose that we apply the classical bootstrap based on i.i.d. samples from the empirical population to estimate the sampling distribution of the estimator. Can the result still be asymptotically correct? Liu (1988) showed that the answer is yes even though the classical bootstrap does not seem intuitively appropriate here for the simple reason that the original data are not i.i.d. More specifically, it was shown that this wild bootstrap not only captures the first-order limit but also retains the second-order asymptotic properties. This suggests that the wild bootstrap is robust, at least to some extent, against distributional misspecifications.
In our bootstrapping procedure, we first estimate the fixed effects β1 under the restricted full model. This is naturally done by minimizing the QM that is the RSS. The estimator is given by (2.2) with X1 = Xf,1 and X2 = Xf,2 and is denoted by , where Xf,j is the full Xj, j = 1,2. Thus, Xf,1 corresponds to the restricted full model, which has much fewer covariates than the full model. Here, for simplicity, we assume that Xf,1′PXf,2⊥Xf,1 is nonsingular. For example, in our simulation study, the full model of X has 30 fixed covariates, while the full model of X1 has either 7 or 8 fixed covariates. Next, we write (3.1) as y = Xβ + ζ, where y = (yi)1 ≤ i ≤ m, X = (Xi)1 ≤ i ≤ m, and ζ represents the rest of the model involving the random effects and errors. We then assume a “ working distribution” for the error vector ζ such that under the working distribution, the components of ζ are independent and distributed as N(0,σ2), where σ2 is an unknown variance that is estimated by the standard unbiased estimator, , where is given by (2.1), and pf = pf,1 + pf,2 with pf,j = rank(Xf,j),j = 1,2 (see supplementary appendix available at Biostatistics online). Given , we generate ϵ* by , where the components of ξ are generated independently from the N(0,1) distribution. We then use (2.3) to compute , the bootstrap version of for the SAF in selecting the covariates for X1.
In a way, our case is similar to what Liu considered. The underlying model is a linear mixed model, but we are bootstrapping under a regression model that has the same mean function asymptotically. In other words, the bootstrap draws samples that have the correct mean vector but potentially incorrect covariance matrix. Thus, we refer our bootstrap procedure also as wild bootstrap following Liu (1988). By using similar arguments as the latter, it can be shown that our wild bootstrap captures the first-order limit, which is what matters for the consistency of model selection.
There is also a similarity between our wild bootstrap procedure and the GEE (Liang and Zeger, 1986). In GEE, the means of the responses are correctly specified but the covariance matrices may be misspecified. Nevertheless, the GEE estimator is consistent, even thought it may not be efficient. In our wild bootstrap procedure, the bootstrapped , that is, (2.3), depends on . The first term is correctly specified. This is because the LS estimator of βf,1, which is a special GEE estimator, is consistent. On the other hand, the covariance matrix of ϵ* may be misspecified, but this does not affect the consistency property of the model selection. Note that only selection of the fixed covariates are considered here. By a very similar argument as that in Jiang and others (2008) (or Jiang and others, 2009), the consistency property of the restricted fence using the wild bootstrap can be rigorously established. For the most part, the consistency of fence rests on a single requirement, that is, the values of are well separated between correct and incorrect models. It can be shown that the given by (2.1) has the latter property. The technical conditions and proof are omitted (see supplementary appendix available at Biostatistics online for some empirical results). The results further support our conclusion that when the parameter estimation is unreliable due to computational instability, the simple and much more stable wild bootstrap may have an advantage.
3.3 A real-data example
A clinical trial, Soy Isoflavones for Reducing Bone Loss (SIRBL), was conducted at multicenters (Iowa State University and University of California at Davis-UCD). Only part of the data collected at UCD will be analyzed here. The data include 56 healthy postmenopausal women (45–65 years of age) as part of a randomized, double-blind, and placebo-controlled study. The data were collected over 3 time points—baseline, after 6 and 12 months. One problem of interest is to model the Cytokines (IL1BBLLA, TNFABLLA, and IL6BLLA)—inflammatory markers—over time on gene expression for IFNb and cFos along with other variables listed in Section A.3 of the supplementary appendix available at Biostatistics online.
We are interested in finding a subset of relevant variables/covariates that contribute to the variation of Cytokines. Here, we only report the results of data analysis for IL1BBLLA. The covariate variables are grouped into 4 groups according to biological interest. More specifically, one of the authors worked closely with an expert scientist in the field, Dr Marta Van Loan of the USDA Western Human Nutrition Research Center located at UCD to determine what variables should be grouped together and finally came up with the grouping (see Section A.3 of the supplementary appendix available at Biostatistics online for details). The restricted fence method is then applied in very much the same way as in Section 3.1. The results are compared with other procedures, reported in Table 2.
| RF | BIC | CAIC | HQ | AIC | ALASSO | SCAD | |
| TP | 100 | 70 | 57 | 49 | 0 | 62 | 0 |
| UF | 0 | 0 | 0 | 0 | 1 | 0 | 100 |
| OF | 0 | 30 | 43 | 51 | 99 | 38 | 0 |
| MC | 7.00 | 7.00 | 7.00 | 7.00 | 6.99 | 7.00 | 4.00 |
| SD | (0.00) | (0.00) | (0.00) | (0.00) | (0.10) | (0.00) | (0.00) |
| MIC | 0.00 | 0.33 | 0.52 | 0.71 | 4.70 | 0.92 | 6.24 |
| SD | (0.00) | (0.53) | (0.65) | (0.88) | (2.03) | (1.60) | (0.69) |
| RF | BIC | CAIC | HQ | AIC | ALASSO | SCAD | |
| TP | 100 | 70 | 57 | 49 | 0 | 62 | 0 |
| UF | 0 | 0 | 0 | 0 | 1 | 0 | 100 |
| OF | 0 | 30 | 43 | 51 | 99 | 38 | 0 |
| MC | 7.00 | 7.00 | 7.00 | 7.00 | 6.99 | 7.00 | 4.00 |
| SD | (0.00) | (0.00) | (0.00) | (0.00) | (0.10) | (0.00) | (0.00) |
| MIC | 0.00 | 0.33 | 0.52 | 0.71 | 4.70 | 0.92 | 6.24 |
| SD | (0.00) | (0.53) | (0.65) | (0.88) | (2.03) | (1.60) | (0.69) |
RF, restricted fence procedure; BIC, the F/B BIC procedure; CAIC, the F/B CAIC procedure; HQ, the F/B HQ procedure; AIC, the F/B AIC procedure; ALASSO, the adaptive Lasso procedure; SCAD, the SCAD procedure.
| RF | BIC | CAIC | HQ | AIC | ALASSO | SCAD | |
| TP | 100 | 70 | 57 | 49 | 0 | 62 | 0 |
| UF | 0 | 0 | 0 | 0 | 1 | 0 | 100 |
| OF | 0 | 30 | 43 | 51 | 99 | 38 | 0 |
| MC | 7.00 | 7.00 | 7.00 | 7.00 | 6.99 | 7.00 | 4.00 |
| SD | (0.00) | (0.00) | (0.00) | (0.00) | (0.10) | (0.00) | (0.00) |
| MIC | 0.00 | 0.33 | 0.52 | 0.71 | 4.70 | 0.92 | 6.24 |
| SD | (0.00) | (0.53) | (0.65) | (0.88) | (2.03) | (1.60) | (0.69) |
| RF | BIC | CAIC | HQ | AIC | ALASSO | SCAD | |
| TP | 100 | 70 | 57 | 49 | 0 | 62 | 0 |
| UF | 0 | 0 | 0 | 0 | 1 | 0 | 100 |
| OF | 0 | 30 | 43 | 51 | 99 | 38 | 0 |
| MC | 7.00 | 7.00 | 7.00 | 7.00 | 6.99 | 7.00 | 4.00 |
| SD | (0.00) | (0.00) | (0.00) | (0.00) | (0.10) | (0.00) | (0.00) |
| MIC | 0.00 | 0.33 | 0.52 | 0.71 | 4.70 | 0.92 | 6.24 |
| SD | (0.00) | (0.53) | (0.65) | (0.88) | (2.03) | (1.60) | (0.69) |
RF, restricted fence procedure; BIC, the F/B BIC procedure; CAIC, the F/B CAIC procedure; HQ, the F/B HQ procedure; AIC, the F/B AIC procedure; ALASSO, the adaptive Lasso procedure; SCAD, the SCAD procedure.
The main objective of the study was to examine whether Soy Isoflavones treatment affects the bone metabolism. This treatment effect is selected by the restricted fence, AIC and SCAD, but not by the other methods. The Weight variable was thought to be relevant and is picked up by AIC and HQ but not by other procedures; however, the BMI variable, which is a function of weight and height, is picked up by the restricted fence and SCAD. As also seen in the same table, BMD for lumbar and spine measures (LSTBMD) is picked up by the restricted fence but not by any other procedure. Apparently, in this analysis, BIC, CAIC, HQ, and the adaptive Lasso have overpenalized; as a result, their optimal models do not pick up some relevant covariates, such as BMD and BMC (adaptive Lasso did not pick up any of the variables). As for AIC, it is able to pick up femoral neck area (FNArea) and lumbar spine total area (LSTArea), which are related to bone areal size (i.e. prefix-Area) and considered relevant. However, after consulting with the expert scientist in this field, we are confirmed that BMD and BMC are more important variables than area measures in this case. Thus, the results of the restricted fence data analysis are more clinically relevant. Although SCAD has selected the most variables, it has missed the important variable LSTBMD. As for the total body area (WBodArea) that is uniquely picked up by SCAD, the variable is relatively less important, compared with the BMD and BMC, as noted. Our simulation study (see Tables 1 and ) has suggested that SCAD has the tendency of missing important variables as well as selecting extraneous variables.
4. DISCUSSION
The restricted fence begins with the division of the candidate variables into several groups (see Step 1 of Section 2.2). We have indicated that, in practice, the grouping should be based on biological information (see Sections 3.1, 3.3). Nevertheless, there is concern on sensitivity of the variable selection result to the grouping. Suppose that the biological information is ignored in the grouping. Will the result be dramatically different? We carry out additional simulation studies to investigate this problem. Recall that, in the simulation study of Section 3.1, the candidate variables were divided into 4 Groups, A-1–A-4, according to biological considerations. Call this grouping Strategy A. In our additional simulation study, we ignore the biological consideration (which is something that we would not recommend in practice). Instead, we consider a different grouping strategy, called grouping Strategy B, by shuffling the variable numbers randomly (keeping 3 and 4 together, which are the time-point indicators—it does not make sense to separate them). We then divide the variables into 4 groups, with the same numbers of variables in the groups as Strategy A (i.e. 8, 7, 7, 8). The new groups are Group B-1: 5, 8, 16, 18, 22, 23, 24, 28; Group B-2: 1, 2, 9, 10, 19, 20, 21; Group B-3: 3, 4, 13, 14, 17, 29, 30; and Group B-4: 6, 7, 11, 12, 15, 25, 26, 27. We then run the simulations based on the new grouping. The results for the restricted fence corresponding to the part n = 100 in Table 1 are TP: 77; UF: 2; OF: 21; MC: 6.76 (0.49); MIC: 0.02 (0.14). (The corresponding results for the competing methods do not change, of course.) Comparing with Table 1, it is seen that grouping makes some difference, which suggests that information such as biological interest may help. On the other hand, even with the completely randomized grouping, the results have not changed dramatically; in particular, the restricted fence still outperforms the competing methods. This suggests that the restricted fence is somewhat robust with respect to the grouping.
The robustness of the restricted fence can be argued theoretically in large sample. Because of the consistency of the restricted fence (Jiang and others, 2008), in large sample, the procedure will select the correct variables (and nothing else) with high probability regardless of the grouping. Equivalently, one may argue in terms of “signal consistency” (Jiang and others, 2011), which is more appropriate in cases where the number of variables is comparable to the sample size. For the most part, signal consistency means that, as the signals (i.e. the absolute values of the true regression coefficients) increase (but with the sample size fixed), the probability of identifying the true variables (and nothing else) goes to one. As argued in Jiang and others (2011), it can be shown that the restricted fence is signal-consistent regardless of the grouping.
To verify the signal consistency of the restricted fence empirically, we consider again the 2 different grouping strategies. We run simulations with n = 100 and the following increased signals for the true variables: 1, 1, 1, 1, 0.5, 0.5, and 0.01 (in other words, the first 4 coefficients are unchanged, the fifth and seventh are 10 times as strong, and the sixth is twice as strong). The simulation results are presented in Table 3. In particular, the results for the restricted fence are exactly the same (which are perfect) under the 2 grouping strategies, A and B, indicating signal consistency of the restricted fence as aforementioned. Interestingly, the results also seem to suggest that the competing methods improve at a much slower rate as the signals increase, compared to the restricted fence.
Modeling IL1BBLLA
| Variable | RF | BIC | CAIC | HQ | AIC | ALASSO | SCAD |
| Soy treatment | × | × | × | ||||
| Weight | × | × | |||||
| BMI | × | × | |||||
| WaistCir | × | × | × | ||||
| HipBMD | × | ||||||
| LSTBMC | × | × | |||||
| LSTBMD | × | ||||||
| TibTrBMC | × | ||||||
| TibTrBMD | × | × | × | × | × | × | |
| FNArea | × | ||||||
| LSTArea | × | ||||||
| WBodArea | × |
| Variable | RF | BIC | CAIC | HQ | AIC | ALASSO | SCAD |
| Soy treatment | × | × | × | ||||
| Weight | × | × | |||||
| BMI | × | × | |||||
| WaistCir | × | × | × | ||||
| HipBMD | × | ||||||
| LSTBMC | × | × | |||||
| LSTBMD | × | ||||||
| TibTrBMC | × | ||||||
| TibTrBMD | × | × | × | × | × | × | |
| FNArea | × | ||||||
| LSTArea | × | ||||||
| WBodArea | × |
RF, restricted fence procedure; BIC, the F/B BIC procedure; CAIC, the F/B CAIC procedure; HQ, the F/B HQ procedure; AIC, the F/B AIC procedure; ALASSO, the adaptive Lasso procedure; SCAD, the SCAD procedure. The × indicates variable selected. Variables not listed were not selected by any of the methods.
Modeling IL1BBLLA
| Variable | RF | BIC | CAIC | HQ | AIC | ALASSO | SCAD |
| Soy treatment | × | × | × | ||||
| Weight | × | × | |||||
| BMI | × | × | |||||
| WaistCir | × | × | × | ||||
| HipBMD | × | ||||||
| LSTBMC | × | × | |||||
| LSTBMD | × | ||||||
| TibTrBMC | × | ||||||
| TibTrBMD | × | × | × | × | × | × | |
| FNArea | × | ||||||
| LSTArea | × | ||||||
| WBodArea | × |
| Variable | RF | BIC | CAIC | HQ | AIC | ALASSO | SCAD |
| Soy treatment | × | × | × | ||||
| Weight | × | × | |||||
| BMI | × | × | |||||
| WaistCir | × | × | × | ||||
| HipBMD | × | ||||||
| LSTBMC | × | × | |||||
| LSTBMD | × | ||||||
| TibTrBMC | × | ||||||
| TibTrBMD | × | × | × | × | × | × | |
| FNArea | × | ||||||
| LSTArea | × | ||||||
| WBodArea | × |
RF, restricted fence procedure; BIC, the F/B BIC procedure; CAIC, the F/B CAIC procedure; HQ, the F/B HQ procedure; AIC, the F/B AIC procedure; ALASSO, the adaptive Lasso procedure; SCAD, the SCAD procedure. The × indicates variable selected. Variables not listed were not selected by any of the methods.
It should be noted that consistency or signal consistency are theoretical properties indicating what to expect in the “ideal” situations. In the practical and most likely less ideal situations, a careful design for the grouping could make a difference as is shown. In short, if there is knowledge about the candidate variables, such as biological interests, the knowledge should be used in the grouping. This was illustrated in Section 3.3 with a data example. For the most part, we recommend that the (bio)statistician work closely with the expert scientist(s) in the field to determine what grouping strategy is reasonable. It is also important to take into account the relationships between the variables (see Section A.4 of the supplementary appendix available at Biostatistics online). Keep in mind that, by definition (see Step 1 of Section 2.2), the groups need not be disjoint. Finally, for computational efficiency, the group sizes should be kept relatively small, typically 5–10 variables in each group if possible.
See Section A.4 of the supplementary appendix available at Biostatistics online for further simulation results that show another aspect of robustness of the restricted fence to “bad” groupings.
Finally, there is interest in comparing the restricted fence with the adaptive fence (Jiang and others, 2008) or SAF (Jiang and others, 2009). Although the latter are not computationally feasibles for the simulation setting considered in Section 3.1, which has 30 candidate variables, we have provided limited simulation results for a (much) lower dimensional problem. The comparison is in terms of both the selection performance and the computational costs. See Section A.5 of the supplementary appendix available at Biostatistics online.
SUPPLEMENTARY MATERIAL
Supplementary material is available at http://biostatistics.oxfordjournals.org.
FUNDING
The research was supported, in part, by National Science Foundation (DMS-02-03676 and DMS-04-02824).
The authors are grateful to Dr Marta Van Loan for kindly providing 2 data sets from her research laboratory at the USDA Western Human Nutrition Research Center located at UC-Davis and for consultation and helpful discussions. The authors also wish to thank an Associate Editor and 2 referees for their insightful comments. Conflict of Interest: None declared.


