On inference in high-dimensional regression



Introduction
We consider inference for the coefficient parameters of a high-dimensional linear regression model in which the number of potential explanatory variables p is larger than the sample size n. In this context, the debiased lasso (Zhang and Zhang, 2014; van de Geer et al., 2014) and decorrelated score (Ning and Liu, 2017) require two sparsity assumptions: a natural one on the parameter vector, and a less natural one on the inverse Fisher information matrix.
The key contribution of the present paper is an approach that induces sparsity on relevant blocks of the notional Fisher information matrix, removing the second assumption without affecting interpretation of the parameter of interest. The induced sparsity is exploited through a marginal least squares analysis for each variable, as in a factorial experiment. Since regularized regression is not used, there is no need to standardize the columns of the covariate matrix to have unit length. Thus, the physical interpretation of coefficients is retained.
Primary inspiration for the proposal came from the analysis of matched comparison problems, in which it is sometimes feasible to eliminate a large number of nuisance parameters by simple operations on the sample space. There are also connections to Cox and Reid (1987), as discussed in §2. Section 4 provides a more detailed comparison to the debiased lasso and related proposals.
The remainder of the present section focusses on the broader inferential strategy. Usual usage of the debiased lasso and its extensions, as described for example in Bühlmann et al. (2014), involves hypothesis tests for all the parameters followed by a correction for multiple testing. The conclusion is typically that the null hypothesis of no effect is rejected for a very small number of variables; the joint explanatory power of sets of variables tends not to be assessed. The present paper instead proposes confidence intervals for the regression coefficient parameters as an adjunct to confidence sets of models (Cox and Battey, 2017; Battey and Cox, 2018), as detailed in §5. The latter papers emphasized the construction of statistically indistinguishable sets of variables based on their fit to the data, but did not consider inferential statements about the individual parameters or their associated compatibility with the models in the confidence set. This strategy is illustrated on a set of gene-expression data from Bühlmann et al. (2014).
Several quite different approaches to inference in high-dimensional regression have been proposed. In Lockhart et al. (2014), Lee et al. (2014), and Tian and Taylor (2018), inferential statements are corrected for variable selection by conditioning on the region of the sample space that led to the selection being made. Sample-splitting is a simple alternative that avoids detailed characterization of the selection event. The efficiency loss of sample-splitting was calculated in a simple example by Cox (1975), and several more elaborate splitting strategies have been proposed in Rasines and Young (2021) and Leiner et al. (2021). For a different perspective see Zhao et al. (2021). The approach advocated by Berk et al. (2013), and further discussed in Leeb et al. (2015), aims instead for inferential statements that are universally valid under all possible model selection procedures; this leads to highly conservative confidence regions.
In contrast with much of the earlier literature on this topic, the analysis of §3 is conditional on X, as would be dictated by ancillarity when p < n. In addition to ancillarity, there are substantive reasons for preferring a conditional analysis. In observational studies X, although random, is frequently realized in advance of the response variable Y, in which case it may be possible to interpret the effect of X on Y as causal. For example, X may be information on the genome or epi-genome, and Y a measure of some condition expected to be caused by changes in X. However, even in cases where the values of (Y_i, X_i), i = 1, …, n, are realized contemporaneously, it is appropriate to condition on X for studying its association with the outcome variable Y.
Otherwise it would be possible that an inference justified on the basis of its unconditional properties performed poorly for any given sample of X.

Approximate orthogonalization
Let x_i be a vector of measurements on p covariates for individual i. Outcomes are assumed to be generated from a linear regression model

Y_i = x_i^T β + ε_i, (i = 1, …, n), (1)

where (ε_i)_{i=1}^n are independent with mean zero and variance τ > 0, and β = (β_1, …, β_p)^T is of dimension p ≫ n, but satisfying the sparsity condition ‖β‖_0 = Σ_{u=1}^p 1{β_u ≠ 0} = s ≪ p. The set of signal indices is S = {v : β_v ≠ 0}. The arbitrary designated interest parameter is β_v, so that the vector β_{−v} of nuisance parameters has elements {β_1, …, β_p} \ {β_v}. Let Y = (Y_1, …, Y_n)^T and let X be the matrix whose ith row is x_i^T. The column of X corresponding to the interest parameter β_v is written x_v = (x_{1,v}, …, x_{n,v})^T, where indices u, v, w, … from the end of the alphabet distinguish arbitrary columns of X from arbitrary transposed rows. The remaining columns are those of the n × (p − 1) matrix X_{−v}. If the columns in X_{−v} are orthogonal to x_v then a simple regression of Y on x_v estimates β_v without bias, this being the motivation for factorial experiments and more elaborate experimental designs. In fact only the columns corresponding to signal variables need be orthogonal to x_v, but it is not known which columns these may be.
Since the parameter β in (1) is unchanged by premultiplication of both sides by an n × n matrix, for each interest parameter β_v a matrix A_v is sought such that the columns x_w^v of the transformed covariate matrix X^v = A_v X are as orthogonal as possible to x_v^v. The parameter of interest β_v is then estimated by simple linear regression of A_v Y on x_v^v. The associated least squares estimator is

β̂_v = (q_v^T x_v)^{-1} q_v^T Y,  where q_v = A_v^T A_v x_v. (2)

Under model (1),

β̂_v = β_v + b_v + (q_v^T x_v)^{-1} q_v^T ε, (3)

where

b_v = Σ_{w≠v} ϑ_w β_w,  ϑ_w = q_v^T x_w / (q_v^T x_v), (4)

and

var(β̂_v) = τ σ_vv,  σ_vv = q_v^T q_v / (q_v^T x_v)^2, (5)

and by the Cauchy–Schwarz inequality, b_v^2 ≤ ‖β_{−v}‖_2^2 Σ_{w∈S} ϑ_w^2. Since ‖β_{−v}‖_2 and S are unknown, this suggests choosing A_v to minimize σ_vv(A_v) + Σ_{w≠v} ϑ_w^2(A_v), an upper bound on the observable components of the mean squared error. In terms of q_v, the sum of the variance τ σ_vv(q_v) and the squared potential biasing terms Σ_{w≠v} ϑ_w^2(q_v) is minimized when q_v solves the simple unconstrained optimization problem

argmin_{q ∈ R^n} {q^T q + Σ_{w≠v} (q^T x_w)^2} / (q^T x_v)^2.

This estimator, which weights squared bias and variance equally, is a natural choice. However, a generalization is helpful for gaining geometric insight into the problem and for comparison with other work. Thus define q_v by

q_v ∈ argmin_{q ∈ R^n} {δ q^T q + Σ_{w≠v} (q^T x_w)^2} / (q^T x_v)^2 (6)

for some fixed δ > 0 to be chosen. In §3, such solutions will be shown to exist in closed form under a condition on X defined below at Proposition 1. If q_v is a solution, so is any non-zero scalar multiple of q_v, as can be seen from the form of the objective function. Hereafter, q_v refers to an arbitrary solution to the minimization problem of equation (6) unless otherwise specified.
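As a concrete sketch of the estimator defined through (2) and (6), the following code uses the closed-form solution anticipated at Proposition 1, namely q_v = δ(δI_n + X_{−v}X_{−v}^T)^{−1}x_v with the scale choice a = δ. The code and all numerical values are ours and purely illustrative.

```python
import numpy as np

def q_vec(X, v, delta):
    # Closed-form solution q_v = delta * (delta*I_n + X_{-v} X_{-v}^T)^{-1} x_v,
    # anticipating Proposition 1 with the scale choice a = delta.
    x_v = X[:, v]
    X_mv = np.delete(X, v, axis=1)
    M_delta = delta * np.eye(X.shape[0]) + X_mv @ X_mv.T
    return delta * np.linalg.solve(M_delta, x_v)

def beta_hat(X, Y, v, delta):
    # Simple linear regression of the transformed response on the transformed
    # column: solves the estimating equation q_v^T (Y - x_v * beta_v) = 0.
    q = q_vec(X, v, delta)
    return (q @ Y) / (q @ X[:, v])

# Illustrative high-dimensional example (values are not from the paper).
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 1.0          # s = 3 signal variables
Y = X @ beta + rng.standard_normal(n)
estimate = beta_hat(X, Y, 0, delta=1.0)
```

When p < n and δ → 0+, q_v tends to the residual of x_v after projection on the columns of X_{−v}, so β̂_v recovers the ordinary least squares coefficient; this anticipates Proposition 2 and the discussion of §4.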
The connection to parameter orthogonalization (Cox and Reid, 1987) is that, for Gaussian linear regression, both approaches seek to induce sparsity on the relevant row and column of the Fisher information matrix. With β_v identified as the interest parameter, a so-called interest-respecting orthogonal parameterization of model (1) is (β_v, η), where η may take either of the forms (7) or (8), as orthogonal parameters are only defined up to linear transformation. Expressing (1) in either of the parameterizations (7) or (8) requires p < n. This suggests an alternative to (8), an approach discussed briefly in §7. Section 3 derives some theoretical aspects of the above approximate orthogonalization procedure, particularly the interpretation of q_v, which clarifies the relationship between the present proposal and that of Zhang and Zhang (2014). Proofs of the main results are given in Appendix A.
Remark 1. For any two parameters β_u and β_v, the covariance between β̂_u and β̂_v, constructed according to equations (2) and (6), is cov(β̂_u, β̂_v) = τ q_u^T q_v / {(q_u^T x_u)(q_v^T x_v)}, which is known up to τ.
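Because each β̂_w = q_w^T Y/(q_w^T x_w) is linear in Y, the covariance of Remark 1 follows from the usual formula for linear estimators under independent errors of variance τ. A small numerical check of the identity cov(β̂_u, β̂_v) = τ q_u^T q_v/{(q_u^T x_u)(q_v^T x_v)} (our reconstruction; all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, delta, tau = 40, 60, 1.0, 2.0
X = rng.standard_normal((n, p))

def q_vec(X, v, delta):
    # Closed-form q_v (Proposition 1 with a = delta).
    x_v = X[:, v]
    X_mv = np.delete(X, v, axis=1)
    return delta * np.linalg.solve(delta * np.eye(X.shape[0]) + X_mv @ X_mv.T, x_v)

u, v = 0, 1
q_u, q_v = q_vec(X, u, delta), q_vec(X, v, delta)

# Each estimator is linear in Y: beta_hat_w = c_w^T Y with weight vector c_w.
c_u = q_u / (q_u @ X[:, u])
c_v = q_v / (q_v @ X[:, v])

# For independent errors of variance tau, cov(c_u^T Y, c_v^T Y) = tau * c_u^T c_v,
# which coincides with the Remark 1 expression and is known up to tau.
cov_remark = tau * (q_u @ q_v) / ((q_u @ X[:, u]) * (q_v @ X[:, v]))
cov_linear = tau * (c_u @ c_v)
```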
Remark 2. There is a simple transformation of X, X̃^v say, whose columns are exactly orthogonal to a given one, x̃_v^v say. This is X̃^v = P_v X + C_v, where P_v = I_n − x_v x_v^T/(x_v^T x_v) projects the columns of X into the subspace of R^n orthogonal to x_v, and C_v is a matrix of zeros apart from the column c_v^v = x̃_v^v = x_v, which replaces the single column of zeros produced by P_v X. Since this transformation to X̃^v is not of the form A_v X, it is not interest-respecting. An interest-respecting version could be obtained by projection of X̃^v, in a suitable matrix norm, into the space of matrices of the form AX, where A ∈ R^{n×n}. This is however much less convenient than (6), both computationally and theoretically.
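The exact transformation of Remark 2 can be sketched as follows, under the reconstruction X̃^v = P_v X + C_v with P_v = I_n − x_v x_v^T/(x_v^T x_v); the code and data are ours and illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, v = 20, 8, 2
X = rng.standard_normal((n, p))
x_v = X[:, v]

# P_v projects each column of X onto the orthogonal complement of x_v in R^n.
P_v = np.eye(n) - np.outer(x_v, x_v) / (x_v @ x_v)

# C_v restores x_v itself in position v, replacing the zero column of P_v X.
C_v = np.zeros((n, p))
C_v[:, v] = x_v

X_tilde = P_v @ X + C_v
```

Every column w ≠ v of X̃^v is exactly orthogonal to x_v, but since X̃^v is not of the form A_v X, the transformation is not interest-respecting.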
A nominal α-level confidence interval for β_v is

β̂_v ± z_{1−α/2} {τ σ_vv(q_v)}^{1/2}, (10)

where Φ(z_{1−α}) = 1 − α. As shown in §3, the confidence interval (10) is valid as n → ∞, to the extent that q_v is successful in orthogonalizing x_v^v to x_w^v for all w ∈ S. The qualification here is because the objective function in (6) does not ensure x_v^v is exactly orthogonal to x_w^v for all w ∈ S, only that the cumulative non-orthogonality to x_w^v over w ≠ v is small. For some v this non-orthogonality will concentrate relatively more on the unknown signal indices, resulting in larger bias for those parameters. However, as shown in Proposition 3, this bias decays rapidly with the sample size.
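A sketch of the interval construction, reading var(β̂_v) = τ q_v^T q_v/(q_v^T x_v)^2 off the linear form of the estimator and using the closed-form q_v with a = δ (our reconstruction; names and data are illustrative, and τ is treated as known here):

```python
import numpy as np
from statistics import NormalDist

def q_vec(X, v, delta):
    # Closed-form q_v (Proposition 1 with a = delta).
    x_v = X[:, v]
    X_mv = np.delete(X, v, axis=1)
    return delta * np.linalg.solve(delta * np.eye(X.shape[0]) + X_mv @ X_mv.T, x_v)

def conf_int(X, Y, v, delta, tau, alpha):
    # Nominal alpha-level interval: beta_hat_v +/- z_{1-alpha/2} * se, with
    # se^2 = tau * q_v^T q_v / (q_v^T x_v)^2 and tau treated as known.
    q = q_vec(X, v, delta)
    bh = (q @ Y) / (q @ X[:, v])
    se = np.sqrt(tau * (q @ q)) / abs(q @ X[:, v])
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return bh - z * se, bh + z * se

rng = np.random.default_rng(4)
n, p = 50, 120
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[0] = 1.0
Y = X @ beta + rng.standard_normal(n)
lo, hi = conf_int(X, Y, 0, delta=1.0, tau=1.0, alpha=0.05)
```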
One could in principle attempt to identify positions where the signal variables are most likely to be and adjust the objective function (6) accordingly, a suggestion to which we return in §7. A variant of this idea has been proposed recently by Li et al. (2021). Since their procedure is based on the debiased lasso (Zhang and Zhang, 2014; van de Geer et al., 2014), the resulting inference is not scale-invariant, and the theoretical guarantees entail a similar condition of sparsity of the inverse Fisher information matrix, which in their random design setting corresponds to sparsity of the inverse covariance matrix of the covariates.

Inferential properties
In setting up the objective function in equation (6), δ > 0 was introduced to enable comparison to other work. However, the procedure is remarkably robust to this choice, a phenomenon discussed in further detail following Proposition 3. Let M_δ = δI_n + X_{−v}X_{−v}^T, write the objective function of (6) as

m(q) = q^T M_δ q / (q^T x_v)^2, (11)

and define

L_δ = M_δ − x_v x_v^T / (x_v^T M_δ^{-1} x_v). (12)

Proposition 1 specifies the set of solutions to (6) in closed form.
Proposition 1. Suppose δ > 0 is such that the eigenvalues of L_δ are non-negative. Then any

q_v = a M_δ^{-1} x_v (13)

is a minimizer of (11), where a is any non-zero real number. If the eigenvalue condition is violated, then such a q_v is a saddlepoint of (11).
Modulo multiplication by a positive scalar, L_δ is the matrix field ∇∇^T m evaluated at any q_v of the form (13). Writing z = x_v/(x_v^T M_δ^{-1} x_v)^{1/2},

det(L_δ) = det(M_δ − z z^T) = det(M_δ)(1 − z^T M_δ^{-1} z) = 0,

where the second equality is the so-called matrix determinant lemma for determinants of rank one perturbations and the last equality follows as z^T M_δ^{-1} z = 1. Thus, the matrix L_δ has at least one zero eigenvalue, showing that, if the condition of Proposition 1 is satisfied, m is locally weakly convex at any q_v of the form (13) but not strictly convex (e.g. Boyd and Vandenberghe, 2004). Let

P_v(a, δ) = a M_δ^{-1}, (14)

so that q_v = P_v(a, δ) x_v. Although any choice of a yields a minimizer of (11), the choice a = δ improves interpretability of (14) in view of equation (15) of Proposition 2. It is also of some theoretical value to consider a = δ → 0+.
Proposition 2. Set a = δ and define X_{−v}^+ to be the Moore–Penrose pseudo-inverse of X_{−v} given in (36) of Appendix A.2. Then

P_v(δ, δ) = δ(δ I_n + X_{−v} X_{−v}^T)^{-1}, (15)

and lim_{δ→0+} P_v(δ, δ) = I_n − X_{−v} X_{−v}^+ is an orthogonal projection operator that maps any vector in R^n into the left null space of X_{−v} or, equivalently, the kernel of X_{−v}^T. This limiting case only serves to supply insight and is of low practical relevance when p > n. Equation (15) can be viewed as a ridge-regularized version of this projection (Hoerl and Kennard, 1970). See §4 for further discussion.
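The limit in Proposition 2 can be checked numerically, for p < n so that the left null space of X_{−v} is non-trivial (our sketch; dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 15, 6
X = rng.standard_normal((n, p))
X_mv = np.delete(X, 0, axis=1)   # n x (p - 1) with p < n here

delta = 1e-8
# Ridge-regularized projection P_v(delta, delta) of equation (15):
P_ridge = delta * np.linalg.inv(delta * np.eye(n) + X_mv @ X_mv.T)

# Its delta -> 0+ limit: the orthogonal projection onto the left null
# space of X_{-v}, i.e. the kernel of X_{-v}^T.
P_limit = np.eye(n) - X_mv @ np.linalg.pinv(X_mv)
```

When p > n and X_{−v} has full row rank, the limit is the zero matrix, which is why the limiting case is of low practical relevance there.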
From (5), (6), and Proposition 4 below, τ^{−1/2} S_n(β_v) is an asymptotically pivotal quantity, where

S_n(β_v) = q_v^T (Y − x_v β_v) / (q_v^T q_v)^{1/2}.

However the confidence intervals in (10) ignore the bias term (q_v^T x_v) b_v / (τ q_v^T q_v)^{1/2}. While, for sufficiently small δ, q_v effectively minimizes the observable part of an upper bound on b_v^2, it is inevitable that |b_v| will be larger for some components β_v than for others, due to lack of knowledge of S. Proposition 3 establishes the behaviour of the bias b_v in (3) and the quantity (q_v^T x_v) b_v / (τ q_v^T q_v)^{1/2} as a function of the sample size. The simulations in §6.2 are consistent with these conclusions.
By Proposition 2, P_v(δ, δ)z is a regularized projection of z ∈ R^n into the kernel of X_{−v}^T, so by construction the elements of X_{−v}^T P_v(δ, δ)z are small, and bounded away from zero provided that δ is. To avoid explicitly quantifying the dependence on n, p and δ, write h(n, p, δ) for a generic bound on these elements.

Proposition 3. Let δ > 0 be bounded away from zero. Provided that the non-zero elements of β are bounded, b_v = O(s/n), and the standardized bias (q_v^T x_v) b_v / (τ q_v^T q_v)^{1/2} decays with n.

Proposition 3 suggests that one sensible choice of δ would solve the fixed point equation δ = h(n, p, δ). We have found empirically that the stability of the conclusions to the choice of δ is remarkably high. Such stability can always be checked using the data at hand.
In applying Proposition 3, confidence intervals are based on the asymptotically pivotal quantity V_n^{−1/2} S_n(β_v), where V_n is an estimator of τ discussed below. The validity of this is established via Proposition 3, combined with the limiting distribution of S_n(β_v) derived below. We also see that, for fixed sample size, the bias plays a relatively larger role for components whose associated variances are smaller. This is visually apparent from Figure 3 of §6.2.
Proposition 4. Suppose that (ε_i)_{i=1}^n are independent mean zero random variables of variance τ and bounded moments of order 2 + γ for some γ > 0. Then

sup_z |pr{τ^{−1/2} S_n(β_v) ≤ z} − Φ(z)| → 0,

where Φ is the standard normal cumulative distribution function. If the above condition holds with γ = 1,

sup_z |pr{τ^{−1/2} S_n(β_v) ≤ z} − Φ(z)| ≤ e_n,

where C ∈ (0.4097, 0.5606], (q_{j,v})_{j=1}^n are the elements of q_v, and

e_n = C E(|ε_1|^3) Σ_{j=1}^n |q_{j,v}|^3 / {τ^{3/2} (q_v^T q_v)^{3/2}}.

With any consistent estimator V_n of the error variance, the limiting normal approximation to the Studentized statistic V_n^{−1/2} S_n(β_v) remains valid by Slutsky's theorem. Proposition 5 provides more precise quantification.
Proposition 5. Suppose that the conditions of Proposition 4 hold with γ = 1 and let V_n be a consistent estimator of τ. On writing the estimation error of V_n in terms of quantities t_n and r_n(t_n), both decreasing in n, the error in the normal approximation to the distribution of V_n^{−1/2} S_n(β_v) is of order

max{e_n, c t_n g_n, r_n(t_n)}, (21)

where c is a constant that we do not quantify, g_n is any nondecreasing sequence subject to a growth restriction, and e_n is defined in Proposition 4.
While it is unclear which choice of g_n delivers the sharpest bound in (21), a reasonably tight bound is obtained by choosing g_n such that 1 − Φ(g_n) is roughly of the same order as e_n. By Mills' ratio,

1 − Φ(g_n) ≤ g_n^{−1} φ(g_n),

where φ is the standard normal density function. Thus, using the approximation e_n = g_n^{−1} φ(g_n) and solving approximately for g_n using Lagrange's inversion formula (e.g. de Bruijn, 1981, p. 25–29), we obtain an expression for g_n that is a very slowly growing function of n. Thus, asymptotically, the confidence interval based on the infeasible statistic τ^{−1/2} S_n(β_v) has nominal coverage probability 1 − α, and for any fixed n the error in this approximation is roughly of order max{e_n, t_n g_n, r_n(t_n)}.
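The Mills' ratio bounds invoked here are classical and can be checked numerically; for g > 0 one has g φ(g)/(1 + g²) ≤ 1 − Φ(g) ≤ φ(g)/g, the upper bound becoming sharp as g grows, which justifies the approximation e_n = g_n^{−1}φ(g_n). A quick check (standard bounds; not specific to this paper):

```python
from math import exp, pi, sqrt
from statistics import NormalDist

nd = NormalDist()

def phi(x):
    # Standard normal density.
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def mills_bounds(g):
    # Classical Mills' ratio bounds for the upper tail, valid for g > 0:
    #   g*phi(g)/(1 + g^2) <= 1 - Phi(g) <= phi(g)/g.
    tail = 1.0 - nd.cdf(g)
    return g * phi(g) / (1.0 + g * g), tail, phi(g) / g

lower, tail, upper = mills_bounds(4.0)
```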

The debiased lasso
Our proposal is related to inferential methods based on inverting the Karush–Kuhn–Tucker conditions associated with a lasso solution. In one respect, the most general procedure is that of van de Geer et al. (2014), which covers penalized likelihood inference. Their debiased lasso estimator, β̂^d say, reduces essentially to the proposal of Zhang and Zhang (2014) in the case of the linear model. Zhang and Zhang (2014) modified the score equation for the least squares estimator to z_v^T(Y − x_v β_v) = 0, with z_v to be chosen. By comparison, the estimating equation that defines β̂_v is q_v^T(Y − x_v β_v) = 0. Zhang and Zhang (2014) suggested a bias correction based on a pilot estimator β^{(init)}. Thus the bias is controlled by ‖β^{(init)}_{−v} − β_{−v}‖_1, and Zhang and Zhang (2014) recommended that β^{(init)} be taken as the lasso estimator, so that ‖β^{(init)}_{−v} − β_{−v}‖_1 is small with high probability under regularity conditions. They also recommended that z_v be chosen using the lasso, motivated by ordinary least squares when p < n, where the score equation z_v^T(Y − x_v β_v) = 0 holds for the least squares estimator with

z_v = {I_n − X_{−v}(X_{−v}^T X_{−v})^{-1} X_{−v}^T} x_v,

i.e., the projection of x_v on the orthogonal complement of the space spanned by the columns of X_{−v}. Equation (15) in Proposition 2 shows the parallels between the two approaches, although they were derived from different perspectives. In particular, when a = δ our operations on the sample space, designed to minimize a feasible upper bound on the mean squared error, recover the ridge regression residual vector for z_v, rather than the nodewise lasso residuals from an ℓ1-constrained linear regression of x_v on X_{−v} recommended by Zhang and Zhang (2014) and van de Geer et al. (2014).
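When p < n, the role of the projection residual z_v can be verified directly: solving z_v^T(Y − x_v β_v) = 0 with z_v the residual of x_v after regression on X_{−v} recovers the ordinary least squares coefficient, by the Frisch–Waugh–Lovell identity (a standard fact; data and names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, v = 40, 6, 0
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)

X_mv = np.delete(X, v, axis=1)
# Residual of x_v after projection onto the column space of X_{-v}:
H = X_mv @ np.linalg.solve(X_mv.T @ X_mv, X_mv.T)
z_v = (np.eye(n) - H) @ X[:, v]

# Solving the modified score equation z_v^T (Y - x_v * beta_v) = 0:
beta_v_score = (z_v @ Y) / (z_v @ X[:, v])

# Full ordinary least squares fit, for comparison:
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
```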
While there is no obvious direct analogue of our quantities b_v or (q_v^T x_v) b_v / (τ q_v^T q_v)^{1/2}, van de Geer et al. (2014) show that their Studentized test statistic is the sum of a standard normally distributed random variable and a quantity ∆_v, for which they provide an explicit bound. In that bound, φ_0^2 is the so-called compatibility constant of the lasso, γ̂_v is the nodewise lasso estimator of the coefficient vector from a regression of x_v on X_{−v} with regularization parameter λ_v, and λ (chosen as a function of t > 0) is the lasso tuning parameter associated with the regression of Y on X. They also show that their debiased lasso estimator satisfies √n(β̂^d − β) = W + ∆, where W is a normal random variable of zero mean and ∆ satisfies ‖∆‖_∞ = o_pr(1). This theoretical guarantee is established under an assumption of row-sparsity of the inverse Fisher information matrix. Our result of Proposition 3 shows that b_v = O(s/n) without the extra sparsity condition (beyond sparsity of β), and in a Gaussian setting our construction induces sparsity on the relevant row and column of the Fisher information matrix, as in Cox and Reid (1987). Thus our approach seeks to achieve by construction what is implicitly assumed in van de Geer et al. (2014) and Zhang and Zhang (2014). Inducement of sparsity cannot be accomplished simultaneously for all parameters, but is achievable by considering each coefficient in turn as the interest parameter.
Another important advantage of our approach is that it does not require rescaling the columns of X to unit length, so the physical interpretation of the estimates and confidence intervals is maintained. Neither the lasso estimator nor its debiased counterpart is invariant to componentwise rescaling, and simulations in §6.1 indicate that the debiased lasso is rather conservative after transformation to the original scale.
Javanmard and Montanari (2014) also consider a version of the debiased lasso estimator for the linear model. Their estimator is of a similar form to that of van de Geer et al. (2014) and Zhang and Zhang (2014), where, in a similar spirit to our proposal, M is chosen to minimize the variance of the resulting estimator subject to a constraint on the bias. Specifically, M = (m_1, …, m_p)^T, where m_v solves

minimize m^T Σ̂ m subject to ‖Σ̂ m − e_v‖_∞ ≤ µ,  Σ̂ = n^{-1} Σ_i X_i X_i^T,

with e_v the zero vector with a 1 in the vth position. Thus M can be viewed as a regularized inverse of n^{-1} Σ_i X_i X_i^T. A key difficulty with this proposal is that the tuning parameter µ needs to decay sufficiently quickly in order that the bias of the estimator (the analogue of our b_v) may be controlled. Javanmard and Montanari (2014) state in their Algorithm 1 that if any of the constraints is violated, then M should be set to the identity matrix. This seems problematic since the constraint is used to derive the theoretical guarantees. We found that it was often violated in examples we tried.

Background
For sparse high-dimensional regression problems, Cox and Battey (2017) emphasised that several or even many low-dimensional models are likely to fit the data indistinguishably well. They argued that while an arbitrary choice between them would be reasonable for prediction, for subject-matter understanding it is more appropriate to report a confidence set of models. Our suggested use for the estimates and confidence intervals obtained as described in §2 is as an adjunct to such confidence sets, the two procedures being performed in isolation, thereby avoiding issues of post-selection inference.
The first phase in the construction of confidence sets of models is to identify a set of retained explanatory variables, Ŝ, where |Ŝ| is reasonably large, but considerably less than n. The model that includes all variables in Ŝ is called the encompassing model. A confidence set of models, M, is obtained by comparing the fit of the encompassing model to the fit of any model using s^# or fewer variables from Ŝ, and retaining any such models that are statistically indistinguishable from the encompassing one. The choice of s^# is arbitrary and based on an assumption of sparsity.
The confidence intervals in §2 can be used to refine this large confidence set, M, to a smaller set R ⊂ M, by removing from M any models which have coefficient estimates that are not included in the relevant confidence interval (10) (with τ replaced by V_n). Specifically, if any of the estimated coefficients from fitting a model in M exceed such confidence limits, this casts doubt on the plausibility of that model. If none of the models in M is consistent with the collection of confidence intervals, then this suggests that |Ŝ| is too small.
We assume that M is exactly calibrated, i.e. that pr(S ∈ M) = 1 − ϑ. Asymptotic calibration of M is established under certain conditions in Lewis and Battey (2022). As discussed in §3, while the individual confidence intervals are all calibrated at their nominal levels for sufficiently large n, small-sample bias affects inference on some parameters more than others. The following calculations indicate the implications of this bias for the small-sample coverage properties of R.
Let q = q(u) be the proportion of confidence intervals whose coverage probability is below some acceptable threshold, 1 − u say. Since poor calibration is unrelated to whether a variable is in S, the indices of signal variables can be treated as a simple random sample of size s from {1, …, p}. Thus, the number of poorly calibrated intervals among these s is binomial of index s and parameter q, and an approximate lower bound on the probability that no signal variable is excluded from M is ((1 − q)(1 − u))^s. The analysis of artificially generated responses and real covariate data in §6.1 has q = 23/4088 = 0.0056 intervals with coverage smaller than 1 − u = 0.95 when the nominal level of our intervals is 0.01, q = 118/4088 = 0.0289 with coverage smaller than 1 − u = 0.98 at nominal level 0.005, and q = 20/4088 = 0.0049 with coverage smaller than 1 − u = 0.99 at nominal level 0.001 (see Table 3). Thus, for s = 5 and pr(S ∈ M) = 0.99, the coverage probability of R using the 0.001-level intervals is approximately ((1 − 0.0049)(1 − 0.01))^5 × 0.99 = 0.9186. The above arguments would not work for the debiased lasso, as simulations in §6.2 show that, while the coverage probabilities are high on average, the debiased lasso intervals are more systematically miscalibrated for signal variables.
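The arithmetic of the binomial argument can be reproduced directly; the expression ((1 − q)(1 − u))^s below is our reconstruction of the truncated bound in the source, using the values quoted from Table 3.

```python
# Approximate lower bound ((1 - q)(1 - u))^s on the probability that no
# signal interval is poorly calibrated or fails to cover, multiplied by
# the nominal coverage of M.
s = 5
coverage_M = 0.99

q, u = 20 / 4088, 0.01   # 0.001-level intervals, threshold 1 - u = 0.99
bound = ((1.0 - q) * (1.0 - u)) ** s
coverage_R = bound * coverage_M
```

With these inputs the effective coverage probability of R is approximately 0.9186, matching the figure quoted for the gene-expression illustration.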
Whether the refined set R is more useful than a version of M whose coverage matches R depends on q and its associated value of u. This can be assessed through an analysis similar to that in §6.1.

Bühlmann et al. (2014) describe an investigation to study the association of the response variable, vitamin B2 production, with the logarithmic expression levels of 4088 genes. In the sample of size n = 71 the correlation among columns of X ranges from −0.926 to 0.991, with a mean correlation of 0.029. See Bühlmann et al. (2014) for original references.

Illustration
The first analysis is the estimation of each of the regression coefficients and their confidence limits at nominal level α = 0.001 using the approximate orthogonalization approach of §2. A separate analysis was carried out to choose the encompassing model Ŝ and a confidence set of models M. The variables in Ŝ were identified through a series of model fits based on partially balanced incomplete block arrangements (Yates, 1936; Cox and Battey, 2017). Then each possible low-dimensional sub-model of Ŝ was tested against the encompassing model using an F-test at level ϑ = 0.01. Models not rejected by the F-test comprise the confidence set M.
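A minimal sketch of the model-comparison step: one submodel is compared with the encompassing model via the usual F statistic. The code, the choice of Ŝ, and the data are all our own illustrative simplification of the procedure described above.

```python
import numpy as np

def rss(Xsub, Y):
    # Residual sum of squares from least squares on the given columns.
    coef, *_ = np.linalg.lstsq(Xsub, Y, rcond=None)
    r = Y - Xsub @ coef
    return r @ r

def f_stat(X, Y, sub, enc):
    # F statistic for testing a submodel against the encompassing model.
    rss_sub, rss_enc = rss(X[:, sub], Y), rss(X[:, enc], Y)
    df1 = len(enc) - len(sub)
    df2 = len(Y) - len(enc)
    return ((rss_sub - rss_enc) / df1) / (rss_enc / df2)

rng = np.random.default_rng(7)
n = 50
S_hat = [0, 1, 2, 3, 4, 5]          # retained variables (encompassing model)
X = rng.standard_normal((n, 10))
Y = 3.0 * X[:, 0] + 3.0 * X[:, 1] + rng.standard_normal(n)

F_true = f_stat(X, Y, [0, 1], S_hat)   # a submodel containing the signal
F_wrong = f_stat(X, Y, [2, 3], S_hat)  # a submodel omitting the signal
```

Submodels whose F statistic falls below the level-ϑ critical value would be retained in M; in this toy example the submodel omitting the signal is decisively rejected.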
The confidence intervals constructed as described in §2 for the 22 variables in Ŝ are reported in Table 1 for α = 0.05 and α = 0.001. These have been ordered according to the proportion of models from M to which they belong. For comparison, the lasso and the elastic net, fitted to all 71 observations and tuned to retain 22 variables, find nine and fourteen of the same variables respectively. These are indicated in the first column of Table 1. Further explanation of Ŝ and M is provided in Appendix B, including details of error variance estimation and how asymptotic calibration of M is ensured.
Table 1: Indices for variables in Ŝ, ordered by prevalence in M, plus 95% and 99.9% confidence limits for the associated regression coefficients. Superscripts L and E indicate that the variable was also found by the lasso and the elastic net fitted to the full sample.

There are 34555 models in M, but only 773 of these give estimates of the regression coefficients that are compatible with our nominal 99.9% confidence intervals for the individual parameters in Table 1. Let R denote this set of 773 models. Since the decision to report a model in R depends on the simultaneous correctness of the associated confidence intervals, a Bonferroni-type correction should be applied as in §5.1. In fact, to account for a degree of finite-sample miscalibration of our intervals due to the term b_v, we use the argument of §5.1 with q and the corresponding value of u estimated from Table 3. Thus the effective coverage probability of R assuming, optimistically, that the coverage probability for M is exactly calibrated at its nominal level, is approximately 0.9186. For comparison, the size of the model confidence set based only on a likelihood ratio test at nominal level 1 − 0.9186 = 0.0814 is 15084, illustrating the advantage of our intervals as an adjunct to confidence sets of models. Any choice between the 773 models in R would require either additional data or subject-matter expertise. Table 2 reports central regions of the confidence "distribution" of models (in the sense of Fisher, 1930; Cox, 1958), obtained as a nested sequence of confidence sets at different levels. The full confidence set of models is reported in the supplementary material.

Table 2: Central region of a confidence distribution of models, obtained from successively smaller values of α. Shading corresponds to (from dark to light): α ∈ {0.5, 0.4, 0.3, 0.2}. The full confidence set of models R with coverage probability 0.9186 is reported as supplementary material.

Using covariate data from Bühlmann et al. (2014)
In this section we use the covariate data from Section 5.2 to study the coverage and bias in a moderately realistic example, and compare the results to the debiased lasso. Simulations of this nature are a helpful adjunct to the analysis above, to assess the security of the results. We drew indices for five signal variables at random from the index set {1, …, p}, giving S = {313, 689, 724, 1747, 2470}, fixed across 1000 ensuing Monte Carlo replications. In each, an artificial outcome was generated as Y = Xβ + ε, where ε is a vector of independent standard normally distributed random variables. The signal strength was 1 for all signal variables. The assumption of known error variance used here was relaxed in §5.2. Confidence intervals at the α ∈ {0.05, 0.01, 0.005, 0.001} nominal levels for each of the 4088 coefficients were constructed as described in §2, and the coverage probabilities and mean lengths for each coefficient were estimated as simulation averages. Modal and median simulated coverage probabilities over the 4088 variables and the median of the simulated mean lengths are reported in Table 3, with detailed summary information depicted in Figure 1.

Table 3: Summary statistics over the 4088 coefficients of the simulated coverage probabilities and simulated mean lengths for each confidence interval. The last four columns are the values of 1 − q used in the calculation of §5.1.
For comparison, we also recorded the coverage properties and lengths for the debiased lasso estimator β̂^d (van de Geer et al., 2014) which, for the linear model, is essentially the same estimator as that of Zhang and Zhang (2014). Since lasso-based procedures require the columns of X to be on the same scale, we standardized them to unit length before constructing the debiased lasso in the usual way. In order to compare them to our approach, we then converted the conclusions back to the scale of interest by dividing the value of each entry of β̂^d by the length of the corresponding column of the original X matrix. Estimates of standard errors were adjusted by dividing them by the square root of this length. While more appropriate than a debiased lasso analysis on the original scale, the rescaling operation should not be expected to be adequate, as neither Wald-based inference nor the debiased lasso is invariant to rescaling. Summary statistics are reported in Table 4 and Figure 2, where Table 4 is as Table 3, for debiased-lasso-based confidence intervals. These demonstrate a degree of overcoverage, with the lengths of each interval tending to be larger than is necessary to achieve the nominal level (by comparison with Table 3). Section 6.2 gives a further comparison to the debiased lasso when the explanatory variables are all of similar magnitudes, so that no rescaling is required. The empirical coverage probabilities in that idealized setting are close to nominal, although the procedure appears to be systematically miscalibrated for signal variables, which is problematic for the type of analysis discussed in §5.
We close this comparison with a comment on computation. Construction of our confidence intervals for a single component β_v only entails construction of q_v, which is essentially instantaneous because of its closed-form solution. The computational burden associated with construction of a single interval is therefore approximately independent of p. By contrast, specification of the debiased lasso estimator and its variance estimate for β_v entails construction of the whole matrix Θ̂_Lasso from equation (8) of van de Geer et al. (2014). This involves solving p nodewise lasso regressions, whose tuning parameters are ideally chosen by cross-validation as in the experiments above. Specifically, there are p + 1000 tuning parameters chosen by cross-validation involved in the generation of Figure 2, compared to one tuning parameter fixed at δ = 1 across Monte Carlo replicates for Figure 1, since the performance of our approach was found to be stable with respect to δ, provided that a value extremely close to zero was not used. The computation of the confidence set M of §5 is more intensive, as this entails checking all sub-models of dimension ≤ s^# of the encompassing model for their compatibility with the data. With s^# = 5, computation can be performed within a minute on a standard computer for |Ŝ| = 22, but computation time increases rapidly with |Ŝ| and, to a lesser extent, with s^#.

A factorial experiment
In this section we assess how coverage properties and lengths of the proposed intervals are affected by p, n and the correlation between columns of X. In particular, we seek to understand the small-sample behaviour of the procedure in various circumstances. Our Monte Carlo experiment was designed as follows. An X matrix was obtained by generating each row x_i^T for i = 1, …, n from a normal distribution of mean zero and covariance matrix ρ1_p1_p^T + (1 − ρ)I_p, and the columns were each centered to have sample mean zero. The matrix X was fixed across subsequent Monte Carlo replications, as was a sparse vector β, consisting of ones in the last five entries and zeros elsewhere. The values of n, p and ρ were taken at all 2^3 combinations of high and low levels as in Table 5, the value of n/p being the same in two of the four (n, p) configurations at high and low values of ρ. In each replication, a vector ε of independent standard normally distributed errors was generated anew and a vector of outcomes constructed as Y = Xβ + ε. Each of the first 1000 entries of β was treated in turn as the interest parameter, and the coverage properties and lengths, obtained as in §6.1, are reported in Table 5. These set a = δ = 1 in formula (13). Simulations (not reported here) for a = 1 and δ = n^{−1} produced identical numbers to those in Table 5 up to three significant figures.
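The design generation described above can be sketched as follows, for one configuration; n = 35 is one of the levels used in the experiment, while p here is reduced for speed and all other specifics are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, rho = 35, 250, 0.9

# Equicorrelation covariance rho * 1_p 1_p^T + (1 - rho) * I_p.
Sigma = rho * np.ones((p, p)) + (1.0 - rho) * np.eye(p)
L = np.linalg.cholesky(Sigma)

# Rows x_i^T drawn from N(0, Sigma); columns then centered to sample mean zero.
X = rng.standard_normal((n, p)) @ L.T
X = X - X.mean(axis=0)

# Sparse coefficient vector: ones in the last five entries, zeros elsewhere.
beta = np.zeros(p)
beta[-5:] = 1.0
Y = X @ beta + rng.standard_normal(n)
```

X is then held fixed across replications, with only ε and hence Y regenerated each time.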
Table 6 reports point estimates of the main effects of ρ, n and p, from logistic regression of the coverage probabilities and linear regression of the lengths, using the values in Table 5. For the coverage probabilities, multiplicative effects on the odds scale are reported, with estimates close to one indicating a negligible effect on coverage. The modal coverage probabilities are essentially unaffected by ρ, n and p, as would be expected: for most of the coefficients the approximate orthogonalization is effective and the statistic used is close to the pivotal one. At these small values of n, an increase from n = 35 to n = 70 reduces the length of the intervals but appears to have a negligible effect on the median coverage probability, while increasing p has no effect on the length but multiplies the odds associated with the median coverage probabilities by 0.624. This is to be expected: in optimizing the composite non-orthogonality over more variables, the signal variables play a relatively smaller role in the optimization problem. By far the largest effect on the median coverage is due to ρ. The median length of the confidence intervals also increases with the correlation, reflecting expression (5) for the variance of β̂_v. See Table 7 for further analysis of the role of n. Detailed summary information is in Figure 3, where we plot coverage probabilities in each of the eight settings as a function of the absolute bias |b_v|, with the mean length of the intervals over simulations indicated by colour. Coverage probabilities for signal variables and their associated values of |b_v| are represented as yellow stars. As expected in view of §3, the coverage probabilities associated with the ρ = 0.1 settings decay more sharply with absolute bias than those for ρ = 0.9. The decrease in |b_v| with increasing n, as quantified in Proposition 3, is also reflected in these plots.
Proposition 3 indicates that the procedure is well calibrated asymptotically for all β v .We confirm this empirically by taking the worst case from Table 5 (p = 2450 and ρ = 0.1) and considering the performance over a sequence of n values.
The experiments suggest that while the procedure is valuable in settings with high correlations between columns of X, it is relatively less secure in the almost uncorrelated case when the sample size is small. For any given data set, we suggest performing preliminary checks as in §6.1 as an indication of the reliability of the method on the data at hand. Performance of the debiased lasso on the same data is summarized in Table 8 and Figure 4. Since the equicorrelation matrix used to generate X does not have a sparse inverse, a key assumption for the theoretical guarantees of van de Geer et al. (2014) is violated. Their approach nevertheless performs favourably for variables with no real effect (see Table 8, which concerns null coefficients only), although it appears to be more systematically miscalibrated for signal variables, as illustrated in Figure 4.

Concluding remarks and some open problems
We have proposed a new asymptotically unbiased estimator of each component of β in a linear regression model. Our estimator has a simple closed-form expression and effectively no tuning parameters, since stability with respect to δ was found to be high. Importantly, the estimator does not require all columns of X to be on the same scale. Theoretical conclusions associated with the procedure are confirmed by simulation in §6.2. We used these estimates and their associated confidence intervals to propose an inferential framework based on refined confidence sets of models. The logical argument used in §5 seems new and is of the following form. Suppose, for a possible contradiction, that model m is true. This leads to standard least squares estimates β̂(m) of the low-dimensional vector β(m) associated with model m. Under model m, β̂(m) is unbiased and consistent for β(m). Let β_v be an arbitrary entry of β(m). To the extent that the orthogonalization makes b_v = 0, our estimator of β_v, and hence the associated confidence interval, is valid whichever model holds. Thus extremity of any entry of β̂(m), as calibrated by the limits of the corresponding confidence interval, contradicts the validity of m. Our intervals are asymptotically calibrated by Proposition 3; allowance for finite-sample miscalibration is discussed in §5.1 and §5.2. In contrast, the usual proposed use of the debiased lasso requires adjustments to the confidence limits on account of the large number of comparisons.
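The contradiction argument can be sketched in code, taking the marginal confidence intervals as given inputs; the function below is an illustration of the logic only, not the paper's implementation, and `incompatible` is a name of ours.

```python
import numpy as np

def incompatible(X, Y, model, intervals):
    """Flag model m as contradicted by the data if any least-squares
    coefficient fitted under m falls outside the marginal confidence
    interval for that variable.
    `intervals` maps variable index -> (lower, upper), taken as given."""
    Xm = X[:, model]
    beta_m, *_ = np.linalg.lstsq(Xm, Y, rcond=None)  # OLS under model m
    for b, v in zip(beta_m, model):
        lo, hi = intervals[v]
        if not (lo <= b <= hi):
            return True     # extremity of an entry contradicts validity of m
    return False
```

A model surviving this check for every one of its entries remains compatible with the data in the sense of §5.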
An alternative to the orthogonalization described in §2 is first to estimate the set S of signal variables and to modify (6) accordingly, with q_v = q_v({1, . . ., p}). This proposal was assessed in simulations (not reported) in which S was constructed using the lasso. The coverage properties depended strongly on the lasso tuning parameter. A tuning parameter could always be found to yield a median coverage probability close to the nominal level; however, the coverage probabilities corresponding to the signal variables were appreciably reduced. In §5.2 we used a simple estimator of the error variance, obtained as part of the construction of calibrated confidence sets of models; see Appendix B. A similar approach, following Fan et al. (2012), could be based on cross-validation. In factorial experiments, estimates of the variance are obtained from the estimates of the null effects: if (ε_i)_{i=1}^n are normally distributed and the approximate orthogonalization via equation (6) were completely effective, the resulting statistics would provide estimates of τ. The composite of p such statistics is, however, contaminated by those corresponding to signal variables and by those for which the optimization problem (6) fails to orthogonalize x_v^v to x_w^v for one or more w ∈ S. One might, however, construct a robust estimator of the variance from something like a trimmed mean of these p estimates.
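The trimmed-mean idea at the end of the previous paragraph can be sketched as follows, taking the p per-variable variance estimates as given inputs; the function name and the default trimming fraction are illustrative choices of ours, not the paper's.

```python
import numpy as np

def trimmed_mean_variance(var_estimates, trim=0.1):
    """Robust composite of p per-variable variance estimates: discard the
    lower and upper `trim` fraction (the upper tail being contaminated by
    signal variables and by failed orthogonalizations) and average the rest."""
    v = np.sort(np.asarray(var_estimates, dtype=float))
    p = v.size
    k = int(np.floor(trim * p))
    return v[k:p - k].mean()
```

With most variables null and the orthogonalization effective for them, the trimmed mean is driven by the uncontaminated bulk of the p estimates.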
The extent to which the ideas of §2 apply beyond the linear regression model remains unclear. In a generalized linear model with canonical link, the log-likelihood function can be written in terms of an interest parameter β_v, the function linear in the true parameter β involving μ_i = E(Y_i). An estimate η̂ = (η̂_1, . . ., η̂_n)^T can be expressed in terms of the true η and an error term ε̂ capturing the error in estimating η. In the linear model, η̂ = Y is an estimator of Xβ with error satisfying E(ε̂) = 0, so that (28) is a generalization. Following §2, we might seek a matrix A_v that approximately orthogonalizes the vth column, x_v^v, of X^v = A_v X to its remaining columns, thereby justifying a marginal regression of A_v η̂ on x_v^v to provide inference on β_v. The same reparameterization arguments of §2 apply. The challenge is to specify an estimator η̂, perhaps highly inefficient, but whose properties are sufficiently well understood.
An alternative generalization might be based on the ideas of Cox and Reid (1987). The possibility for this in the linear model is apparent from equation (8), where the factor multiplying β_v is the least squares estimate of the coefficient vector in a regression of x_v on X_{−v}. The introduction of δI_n would accommodate p > n by replacing this least squares estimate by a ridge regression estimate (Hoerl and Kennard, 1970), and would yield a parameter approximately orthogonal to β_v when p > n. Further analysis is needed to understand how far this idea generalizes, as our preliminary calculations for a logistic regression setting (not reported) suggested some difficulties. To be useful, any subsequent operation would have to exploit the induced sparsity in the relevant column of the Fisher information matrix, similarly to Ning and Liu (2017), who assumed such sparsity in the inverse of this matrix without preliminary parameter-orthogonalizing steps.
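The ridge replacement for the least squares regression of x_v on X_{−v} can be sketched as follows. The paper introduces the regularization as δI_n in its equation (8); here we write the standard Gram-matrix form of ridge regression, which is an assumption on our part, and the function name is ours.

```python
import numpy as np

def ridge_direction(X, v, delta):
    """Ridge estimate (Hoerl & Kennard, 1970) of the coefficient vector in a
    regression of x_v on the remaining columns X_{-v}, well defined even when
    p > n, where ordinary least squares is not."""
    x_v = X[:, v]
    X_mv = np.delete(X, v, axis=1)
    p1 = X_mv.shape[1]
    # Solve (X_{-v}^T X_{-v} + delta I) gamma = X_{-v}^T x_v.
    gamma = np.linalg.solve(X_mv.T @ X_mv + delta * np.eye(p1), X_mv.T @ x_v)
    # The residual x_v - X_{-v} gamma is then approximately orthogonal
    # to the columns of X_{-v}.
    return gamma
```

As δ → 0 with p < n this recovers the least squares estimate appearing in equation (8).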

B Details of the analysis of §5.2
The cross-correlations of all potential explanatory variables were examined, and any adjacent variables whose correlations exceeded 0.95 were treated as a single variable for the purpose of constructing S, with the first of each such pair being retained in the analysis. Among the 4088 variables initially present, 61 were discarded on these grounds, leaving 4027 variables.
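This preliminary pruning step can be sketched as follows; comparing each column only with its immediate predecessor is a simplification of ours, and the function name is illustrative.

```python
import numpy as np

def prune_adjacent(X, threshold=0.95):
    """Treat runs of adjacent columns whose pairwise correlations exceed the
    threshold as a single variable, retaining the first column of each run."""
    n, p = X.shape
    keep = [0]
    for j in range(1, p):
        r = np.corrcoef(X[:, j - 1], X[:, j])[0, 1]
        if abs(r) <= threshold:   # not near-duplicated: start of a new run
            keep.append(j)
    return np.array(keep)
```

Applied with threshold 0.95 to the 4088 original variables, a step of this kind discards the later members of each highly correlated run.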
We implemented a refinement of Cox and Battey's method, as Battey and Cox (2018) demonstrated that using all the available data to identify first the encompassing model S and then the confidence set of models M leads to lower than nominal coverage for the confidence set of models. We therefore used a sample-splitting procedure, which serves to give a better-calibrated set M and also enables unbiased estimation of the error variance for construction of the confidence intervals for components of β. The sample of n = 71 observations was randomly partitioned into subsets I_1 and I_2, say, of sizes n_1 = 35 and n_2 = 36. In the terminology of Cox and Battey (2017), the encompassing model was identified using I_1 for the cube reduction and I_2 for the square reduction. The confidence set of models and the estimate of the error variance were constructed using I_1.
To improve the stability of this sample-splitting procedure, we repeated it 1000 times, re-randomizing the positions of the variable indices in the cube each time. The subset S was taken to be all those variables that survived the two-stage procedure in at least half of these randomizations. The confidence intervals constructed as described in §2 for the 22 variables in S are reported in Table 1, ordered according to the proportion of models in M to which they belong. For comparison, the lasso and the elastic net, fitted to all 71 observations and tuned to retain 22 variables, find nine and fourteen of the same variables respectively; these are indicated in the first column of Table 1.
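The aggregation over repeated randomizations can be sketched as follows. The selector inside the loop is a deliberately crude marginal screen standing in for the cube and square reductions of Cox and Battey (2017), which we do not reproduce here; both function names are ours.

```python
import numpy as np

def marginal_screen(X, Y, k=10):
    """Stand-in selector (NOT the cube/square reduction of Cox & Battey):
    the k columns most correlated with the centred outcome."""
    scores = np.abs(X.T @ (Y - Y.mean()))
    return set(np.argsort(scores)[-k:])

def stable_set(X, Y, n_splits=100, k=10, seed=0):
    """Repeat the split-and-select step and retain the variables surviving
    in at least half of the randomizations."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_splits):
        idx = rng.permutation(n)
        half = idx[: n // 2]                 # selection uses one half of the data
        for v in marginal_screen(X[half], Y[half], k):
            counts[v] += 1
    return np.flatnonzero(counts >= n_splits / 2)
```

The majority-vote threshold of one half matches the rule described above; other thresholds trade the size of S against its stability.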
Of the variables constituting the models in Table 2, three were the first in a string of two or more variables whose empirical correlations exceeded 0.95 (see the first paragraph of this appendix). As far as M is concerned, such variables are essentially interchangeable. The affected variables are 1278 (paired with 1279), 1285 (paired with 1286, 1287, 1288 and 1289) and 4002 (paired with 4003, 4004, 4005 and 4006).

Figure 1 :
Figure 1: Real covariate data with artificially generated outcomes: simulated coverage probabilities plotted against the absolute bias |b_v| of the estimator, with the mean length of the confidence interval over Monte Carlo replications indicated by the shading. Nominal levels are 0.05 (left) and 0.01 (right).

Figure 3:
Coverage probabilities (vertical axis) for each β_v plotted against the associated absolute bias |b_v| (horizontal axis) and the mean length of the confidence interval (colour axis). High and low values of p are 2450 and 1225 and those of n are 70 and 35. Yellow stars correspond to signal variables.

Figure 4:
As Figure 3, but for the debiased lasso: the coverage probabilities are plotted against |b_v| and mean length, with yellow stars indicating signal variables. As expected, intervals with higher coverage are on average longer, and the actual coverage decreases as the bias increases, although relatively slowly.

Table 4 :
Analogue of first two rows of Table

Table 5 :
Summary statistics for simulated coverage probabilities and mean lengths of α = 0.01 confidence intervals for the first 1000 coefficients (null variables only). The last two columns are the median and 95th percentile of the 1000 × (p − 1) values of ϑ_w² for the first 1000 coefficients.

Table 6 :
Estimated main effects of ρ, n and p on the properties of our α = 0.01 nominal confidence intervals. Estimated effects on coverage are multiplicative on the odds scale.

Table 8 :
Modal and median Monte Carlo coverage probabilities, and the median of the Monte Carlo mean length of confidence intervals, across the first 1000 coefficients (corresponding to null variables only) for the debiased lasso with tuning parameter chosen by cross-validation. The nominal coverage probability is 0.99.