On inference in high-dimensional logistic regression models with separated data

Direct use of the likelihood function typically produces severely biased estimates when the dimension of the parameter vector is large relative to the effective sample size. With linearly separable data generated from a logistic regression model, the log-likelihood function asymptotes and the maximum likelihood estimator does not exist. We show that an exact analysis for each


INTRODUCTION
1.1. Background

The analysis of binary response data commonly assumes a logistic regression model, under which the response variables Y_1, …, Y_n are independent with distribution

pr(Y_i = y_i) = {1 + exp(−y_i x_i^T β*)}^{−1},  y_i ∈ {−1, 1},   (1)

for some unknown parameter β* ∈ R^p and covariates x_1, …, x_n ∈ R^p, treated as fixed. This model was proposed by Cox (1958) and is the unique model for binary data yielding the same simple sufficient statistics for the regression coefficients as in a normal-theory linear model. His exact conditional inference based on combinatorial calculations evades maximum likelihood fitting and simultaneously achieves relevance and elimination of nuisance parameters. See Chapter 4 of Cox (1970) or Mehta & Patel (1995) for a more explicit and general exposition than that of Cox (1958). Motivated by high-dimensional models arising in modern scientific applications, notably genomics, there has been increased interest in theoretical treatments that allow for a notional double asymptotic regime p, n → ∞. Even prior to the genomics applications, this setting interested Bartlett (1936, 1937), who used it to illustrate serious difficulties with maximum-likelihood-based inference. It is known that the maximum likelihood estimator does not exist if and only if the data can be separated, that is, whenever the outcome-covariate pairs (y_1, x_1^T)^T, …, (y_n, x_n^T)^T are such that y_i x_i^T β ≥ 0 for all i and for some non-zero β ∈ R^p (Albert & Anderson, 1984). For centred Gaussian covariates, Candès & Sur (2020) derived the limiting probability that the data can be separated in terms of the relative dimension p/n → κ and a function of the signal strength. This probability converges to one when κ exceeds a threshold, illustrating the difficulties encountered in high-dimensional regimes.
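The separation condition above can be checked numerically. The following sketch, which is an illustration and not part of the paper's methodology, tests for complete separation via a linear-programming feasibility problem: by rescaling β, strict separation y_i x_i^T β > 0 for all i is equivalent to the feasibility of y_i x_i^T β ≥ 1 for all i.

```python
import numpy as np
from scipy.optimize import linprog

def completely_separated(X, y):
    """Check complete separation: does some beta give y_i * x_i^T beta >= 1 for all i?

    Scaling beta shows this is equivalent to y_i * x_i^T beta > 0 for all i.
    Feasibility is checked with a linear program (zero objective).
    """
    n, p = X.shape
    A_ub = -(y[:, None] * X)          # -(y_i x_i)^T beta <= -1  <=>  y_i x_i^T beta >= 1
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(p), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * p, method="highs")
    return res.status == 0            # status 0 means a feasible beta was found

# Completely separated toy data: y = sign(x) with a single covariate
X_sep = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_sep = np.array([-1.0, -1.0, 1.0, 1.0])

# Overlapping data: identical covariates carry both labels, so no separating direction
X_mix = np.array([[-1.0], [1.0], [-1.0], [1.0]])
y_mix = np.array([1.0, -1.0, -1.0, 1.0])

print(completely_separated(X_sep, y_sep))  # True
print(completely_separated(X_mix, y_mix))  # False
```

The same check underlies the simulations of section 7, where separation becomes near-certain as p/n grows.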
Issues are also encountered when the maximum likelihood estimator exists. In the same limiting setting p, n → ∞ with p/n → κ > 0, Sur & Candès (2019) showed that the maximum likelihood estimator can be severely biased when the design matrix is treated as random with independent and identically distributed entries. They further showed that standard error estimates based on fixed-p maximum likelihood theory underestimate the true variability, and that the χ²_1 limiting approximation to the distribution of the likelihood ratio statistic is poor. Related work is due to Zhao et al. (2020), who obtained similar results for Gaussian designs with arbitrary covariance structures. Similar ideas have been explored in more general models by Coolen et al. (2020), who sought to correct the average bias in the p maximum likelihood estimates using ideas from statistical physics, and Tang & Reid (2020), who clarified the extent to which classical higher-order inference based on the so-called r* statistic continues to hold under the p, n → ∞ regime.
Motivated by the issues summarised above, this work clarifies the extent to which inference is possible in logistic regression models with separated data and proposes an alternative to maximum likelihood estimation valid for these settings. We begin by studying the exact conditional inference of Cox (1958), showing that in the presence of data separation, at least one of the exact conditional confidence intervals is of infinite length in at least one direction. The results are then extended to arbitrary exact confidence sets. Such conclusions are not vacuous and are the best that could be hoped for without further assumptions or data. In high-dimensional regimes, however, it is common to make further restrictions that allow for consistent estimation of the unknown regression parameter. We introduce an approach based on least squares that is consistent in both the ℓ_∞ and ℓ_2 norms when p, n → ∞ with p < n, under weak conditions on the design matrix. These guarantees are shown to apply to cases with separated data.

Our approach
Our work is concerned with settings in which the true model is thought to be logistic but separation precludes logistic likelihood fitting. Because the sufficient statistics are the same in logistic regression as in a notional linear in probability model, the maximum likelihood estimates of the logistic coefficients, if they exist, are recoverable from the ordinary least squares estimates obtained by treating the model for the probabilities as linear, as shown in Proposition 6. This suggests an approach to inference on the logistic coefficients based on the ordinary least squares estimator.
We establish a relationship between the logistic regression coefficients and the limiting values of the ordinary least squares estimates, and use this as a basis for componentwise estimation of β*. In particular, assuming the existence of a consistent estimator of Xβ*, this typically being easier to obtain than an estimate of the entries of β*, we manipulate the least squares estimator to obtain a corrected least squares estimator whose entries converge uniformly to those of the parameter of interest. We then show that the LASSO (Tibshirani, 1996) estimate of β*, whilst biased, produces a consistent estimator of Xβ* when the unknown parameter is suitably sparse and max_{i=1,…,n} |x_i^T β*| ≤ c_1 (log n)^{1/2} for some c_1 > 0. These conditions bound the entries of the unknown parameter, thereby avoiding the issues caused by separation. Assumptions of this form are natural in high-dimensional regimes; see for example van de Geer et al. (2014).
Least-squares fitting of a linear regression model to binary data has been explored by Cox & Wermuth (1992) and Battey et al. (2019). The latter work parameterised the linear in probability model as pr(Y_i = y) = (1 + y x_i^T β_0)/2 under the restriction that, for all data x, |x^T β_0| ≤ 1. Ordinary least squares was the recommended approach for estimating the unknown parameter β_0, as it is more robust than maximum likelihood estimation to observations that invalidate the condition |x^T β_0| ≤ 1. While there are advantages, notably of interpretation and existence of estimates, there are difficulties in treating the linear in probability model as generative. Indeed, the restriction to data x satisfying |x^T β_0| ≤ 1 violates McCullagh's (2002) formal definition of a statistical model. For this reason we consider the generative model as logistic and use a relationship between the logistic coefficients and the probability limit of the ordinary least squares estimator to obtain a consistent estimator of the logistic parameters.

Related work
To avoid the issues encountered by maximum likelihood estimation in the logistic regression model, a number of methods have been proposed. When the maximum likelihood estimator exists, Sur & Candès (2019) introduced the Probe-Frontier method to correct the bias and consistently estimate β* when p is large. Yadlowsky et al. (2021) remarked on the computational difficulties involved in using the Probe-Frontier method and proposed an alternative named SLOE. Both approaches rely on the existence of the maximum likelihood estimator and so are unsuitable for settings with separated data.
When the data are separated, Firth's (1993) bias-reduced estimator has been recommended for use; see for example Heinze & Schemper (2002). Kosmidis & Firth (2021) showed that Firth's (1993) estimator always exists and established an analogous result for a more general version obtained by penalising the logistic log-likelihood function using a Jeffreys-prior penalty with arbitrary tuning parameter. Additionally, when p is fixed and n → ∞, the first-order asymptotic distribution of Firth's estimator coincides with that of the maximum likelihood estimator (Firth, 1993). It is unclear how Firth's (1993) estimator behaves when the maximum likelihood estimator does not exist, and there are currently no theoretical guarantees when p, n → ∞.

Although not proposed with this situation in mind, maximum likelihood estimation with certain forms of penalty on β* ensures existence of an estimator when data are separated. Such estimators have been shown to have low composite estimation and prediction errors with high probability under sparsity assumptions (e.g. Duffy & Santner, 1989; van de Geer, 2008; Meier et al., 2008; Fan & Peng, 2004), however their components are biased for β*_j, j = 1, …, p. We make use of this observation to construct a consistent estimator of the logistic regression coefficients using a consistent estimator of Xβ*, the latter typically being easier to obtain. Unlike the approaches of Ning & Liu (2017), Ma et al. (2021), Shi et al. (2021) and Cai et al. (2021), which entail correcting the bias of penalised estimators and require a consistent estimator of β* in either the ℓ_1 or ℓ_2 norm, our procedure only requires that Xβ* be estimated consistently, making it applicable to a broader range of settings; see for example Raskutti et al. (2011).

NOTATION AND LIKELIHOOD FRAMEWORK
Let n observations on p variables be represented as vectors x_1, …, x_n ∈ R^p, and let X ∈ R^{n×p} be the matrix with rows x_i^T. We assume throughout that X has full rank, a condition that can always be checked once the data have been observed and which does not affect the presence of separation. Let Col-Sp(X) denote the column span of X and P_X = X(X^T X)^{−1} X^T the projection matrix onto Col-Sp(X). Each element of the response vector Y = (Y_1, …, Y_n)^T, taking values in {−1, 1}, is assumed to be an independent random variable with distribution given in (1). A realisation of Y is written in lower case. Define Γ ∈ R^{n×n} to be the diagonal matrix with (i, i)-th entry Γ_ii = Var(Y_i). Let β̂ denote the maximum likelihood estimator, or MLE, of β*, when it exists, and let β̂_0 = (X^T X)^{−1} X^T Y be the ordinary least squares, or OLS, estimator. Define β_0 to be the limiting value of β̂_0 as p, n → ∞ with p < n. Unless otherwise specified, this is the notional limiting operation assumed throughout. For a function f : R → R and a vector v, we use f(v) to denote the vector with ith entry f(v_i).
The vector ℓ_1, ℓ_2 and ℓ_∞ norms are given by ∥v∥_1 = Σ_i |v_i|, ∥v∥_2 = (Σ_i v_i²)^{1/2} and ∥v∥_∞ = max_i |v_i|. If the argument is a matrix, these refer to the matrix norms induced by the corresponding vector norms. The Frobenius norm is written ∥·∥_F. The minimum and maximum eigenvalues of a square matrix are written λ_min(·) and λ_max(·) respectively. For a set S ⊆ R^n, the notation S^⊥ refers to its orthogonal complement. For a univariate random variable Z, the sub-Gaussian norm is given by ∥Z∥_{ψ_2} = inf{t > 0 : E exp(Z²/t²) ≤ 2}.

The exact conditional analysis of Cox (1958)

In the logistic regression model, the log-likelihood function at an observation y = (y_1, …, y_n)^T is given by

ℓ(β) = Σ_{j=1}^p β_j t_j − Σ_{i=1}^n log{1 + exp(x_i^T β)},

where t_j = Σ_{i=1}^n x_ij z_i and z_i = (y_i + 1)/2 ∈ {0, 1}. Let T_j and Z_i be the random versions of these quantities, obtained by replacing y_i by Y_i. When the data are separated, the log-likelihood function asymptotes and so inference via maximum likelihood fitting is unavailable. Suppose that only inference on the first component β*_1 is of interest, the other entries β*_2, …, β*_p being regarded as nuisance parameters. When all entries of β* are of interest, each entry may be treated in turn as the single interest parameter. Cox (1958) based inference for β*_1 on the conditional distribution of T_1 given T_2 = t_2, …, T_p = t_p,

pr(T_1 = t_1 | T_2 = t_2, …, T_p = t_p) = c(t_1, t_2, …, t_p) exp(β*_1 t_1) / Σ_{u ∈ T_1} c(u, t_2, …, t_p) exp(β*_1 u),   (2)

where c(t_1, …, t_p) is the number of realisations of the outcome variable that produce the same observed values of the sufficient statistics T_1, …, T_p, and T_1 denotes the set of attainable values of T_1 given t_2, …, t_p. Let

p_b(v) = c(v, t_2, …, t_p) e^{bv} / Σ_{u ∈ T_1} c(u, t_2, …, t_p) e^{bu}

be the conditional probability that T_1 = v when β*_1 = b, and let p_b^≥(t_1) = Σ_{v ≥ t_1} p_b(v) and p_b^≤(t_1) = Σ_{v ≤ t_1} p_b(v) be the conditional probabilities that T_1 ≥ t_1 or T_1 ≤ t_1. On replacing t_1 by T_1 in the definitions above, Cox (1970) constructed a (1 − ϑ)-level confidence interval for β*_1 whose lower limit solves p_b^≥(t_1) = ϑ/2 in b when t_1 > t_min and equals −∞ otherwise, and whose upper limit solves p_b^≤(t_1) = ϑ/2 in b when t_1 < t_max and equals +∞ otherwise, with t_min and t_max the minimum and maximum values of the set T_1. Let (β^−_1(t_1), β^+_1(t_1)) be the observed confidence interval for β*_1. We show that the observed value of t_1 coincides with either or both of t_min and t_max when the data are linearly separable. It follows that the exact conditional confidence interval defined above is of infinite length. When the data are completely separated, that is, when there exists β ∈ R^p \ {0} such that y_i x_i^T β > 0 for all i = 1, …, n, the interval may equal the whole real line.
PROPOSITION 1. Suppose the data are separated by some vector β ∈ R^p \ {0}. If β_1 > 0 then the upper limit of the confidence interval satisfies β^+_1(t_1) = ∞, and if β_1 < 0 then the lower limit of the confidence interval satisfies β^−_1(t_1) = −∞. If additionally the data are completely separated with β_1 = 0, the conditional likelihood satisfies p_b(t_1) = 1 for all b ∈ R.

The result above only requires the existence of one such β. If the data can be separated by multiple vectors, then the result may be applied to each vector separately; if the first components of these vectors take both signs, it follows that (β^−_1(t_1), β^+_1(t_1)) = (−∞, ∞). These restrictions apply to exact conditional intervals irrespective of how they are constructed, and thus the limitations outlined above are limitations of the data and not of the method of analysis.
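The mechanism behind Proposition 1 can be seen by brute-force enumeration of the conditional distribution (2) on toy data. The sketch below is an illustration, not the paper's algorithm: with completely separated data, the observed t_1 sits at the extreme of its conditional support, so the tail probability p_b^≤(t_1) cannot be driven below ϑ/2 for any finite b.

```python
import numpy as np
from itertools import product

def conditional_support(X, z):
    """Observed t1 and the conditional support of T1 given T2,...,Tp.

    T_j = sum_i x_{ij} z_i with z_i in {0,1}; c(u, t2,...,tp) counts the
    outcome vectors matching the observed nuisance statistics.
    Brute force over 2^n outcomes, so only feasible for very small n.
    """
    n, p = X.shape
    t = X.T @ z
    counts = {}
    for u in product([0, 1], repeat=n):
        tu = X.T @ np.array(u)
        if np.allclose(tu[1:], t[1:]):                 # match nuisance statistics
            counts[float(tu[0])] = counts.get(float(tu[0]), 0) + 1
    return float(t[0]), counts

# Completely separated toy data: y_i = sign(x_i1), plus an intercept column
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
z = np.array([0, 0, 1, 1])                             # z_i = (y_i + 1)/2

t1, counts = conditional_support(X, z)
print(t1, max(counts))   # observed t1 attains the maximum of the support
```

Here conditioning on the intercept statistic fixes the number of successes at two, and the observed configuration is the unique one attaining the maximal value of T_1.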

Other forms of exact analysis
Define CS^{(1)}_ϑ(T_1) ⊆ R to be a (1 − ϑ)-level exact confidence set for β*_1 based on the conditional distribution of T_1 given T_2 = t_2, …, T_p = t_p. Let CS^{(1)}_ϑ(t_1) be its observed value. The following result outlines the form of these sets in the presence of separation.

THEOREM 1. Suppose the observed data are separated by β ∈ R^p \ {0}. Then CS^{(1)}_ϑ(t_1) contains an unbounded interval of the form [B, ∞) or (−∞, −B] for some B > 0. If additionally the data are completely separated with β_1 = 0, the conditional likelihood is constant in β*_1 and the exact confidence set is CS^{(1)}_ϑ(t_1) = R.

Similar results are obtained when an unconditional analysis is performed. This makes use of the distribution of the response vector Y rather than the conditional distribution of T_1 given T_2, …, T_p. Define CS_ϑ(Y) ⊆ R^p to be a (1 − ϑ)-level exact unconditional confidence set for β*, with observed value CS_ϑ(y) ⊆ R^p, and let CS^{(1)}_ϑ(y) be the projection of CS_ϑ(y) onto its first component.

THEOREM 2. Suppose the observed data are separated by β ∈ R^p \ {0} and let m ∈ {0, 1, …, n} be the number of observations satisfying y_i x_i^T β = 0. Then CS^{(1)}_ϑ(y) contains an unbounded interval of the form [B, ∞) or (−∞, −B] for some B > 0. If additionally the data are completely separated by β and β_1 = 0, then CS^{(1)}_ϑ(y) = R.

Theorems 1 and 2 show that all confidence sets with exact coverage guarantees, either conditional or unconditional, contain at least one unbounded interval of the form [B, ∞) or (−∞, −B], where B > 0.
In some settings, these sets are equal to the whole real line. As a result, only limited information about the unknown parameter is available from data that are linearly separable. The most severe setting occurs when the data can be completely separated by some β ∈ R^p \ {0} with β_1 = 0, in which case there is never enough evidence to reject a null hypothesis concerning only β*_1, whatever this might be. Even when the data can be separated but not completely separated, one-sided hypotheses of the form H_0 : β*_1 ≥ b or H_0 : β*_1 ≤ −b can never be rejected, however large b > 0 is taken. The non-existence of finite confidence intervals also affects estimation as, for example, it is impossible to guarantee that an estimate of an entry of β* lies in a small region about the unknown parameter with a pre-determined probability. Indeed, if there existed an estimate β̂_1 lying within ϵ of β*_1 with probability at least 1 − ϑ, for some ϵ > 0 and ϑ ∈ (0, 1), then [β̂_1 − ϵ, β̂_1 + ϵ] would be an exact (1 − ϑ)-level confidence interval for β*_1 of bounded length. Markov's inequality implies that the variance of any such estimate is unbounded as a function of the unknown parameter.

In high-dimensional settings, restrictions on Xβ* are natural and often made. These justify our approach to estimation based on least squares, to be presented in section 5, which has statistical guarantees even when the maximum likelihood estimator does not exist or exhibits poor performance. Our results are asymptotic, allowing both the dimension p and the sample size n to diverge simultaneously.

PRELIMINARY RESULTS
We begin by studying the limiting behaviour of the least squares estimator β̂_0 in the logistic regression model. This motivates a construction that allows consistent estimation of the logistic coefficient of interest. Since the dimension p is allowed to grow under the notional operation n → ∞, the limit distribution of β̂_0 is not well-defined. Instead we consider the behaviour of linear functions α^T β̂_0, where choices of particular interest are α equal to one of the canonical basis vectors for R^p or representing simple contrasts of the entries of β̂_0. Thus assume α ∈ B_d for some d > 0, where B_d = {v ∈ R^p : ∥v∥_2 ≤ 1, ∥v∥_0 ≤ d} is the sparse ℓ_2-ball of radius one in R^p. The following result shows that linear functions of the least squares estimator converge in probability to the corresponding linear functions of the limit β_0. The result in Proposition 2 is stated uniformly over all design matrices contained in the set X_B. Justification for this will be provided at a later stage. For now, it is sufficient to consider a single design matrix satisfying ∥α^T (X^T X)^{−1} X^T∥_2^2 = O(n^{−1}), for which it follows that α^T β̂_0 − α^T β_0 = o_P(1), provided the diagonal entries of (X^T X/n)^{−1} are asymptotically bounded above, under no restrictions on p beyond p < n. This may be seen by setting d = 1 in Proposition 2. Proposition 3 strengthens the latter result, showing that the rate of convergence in the scaled ℓ_2-norm p^{−1/2}∥·∥_2 is of order n^{−1/2}.

PROPOSITION 3. Suppose λ_max{(X^T X/n)^{−1}} = O(1). Then p^{−1/2} ∥β̂_0 − β_0∥_2 = O_P(n^{−1/2}).

For inference on the entries of β*, the limiting distribution of the least squares estimator is of interest. Proposition 4 shows that, after suitable normalisation, the distribution of α^T (β̂_0 − β_0) is asymptotically standard normal. Assumption (5) arises when the quantity of interest is expressed as a sum of independent random variables and central limit type arguments are used to derive its asymptotic distribution. It is closely related to a Lindeberg condition. Similar assumptions are made by Huber (1973) and Lei et al. (2018) to establish the asymptotic normality of the least squares estimator in a different context. To understand when this assumption holds, suppose the rows of X are independently and identically distributed as centred multivariate normal random variables with covariance matrix Σ, and focus on the limiting setting where p, n → ∞ with p/n → κ ∈ [0, 1) and β*^T Σ β* → γ² for some γ > 0. This is a setting that will be considered further in section 7. Let R consist of the standard basis vectors of R^p. Theorem 2.16 of Bai (1999) provides the necessary control of the extreme eigenvalues of X^T X/n for all constants c > 0. Although the distribution of (X^T X)^{−1} X^T is unknown, if we assume that the entries of (X^T X/n)^{−1} X^T are sub-Gaussian with bounded norm, then the required tail bound holds for all c > 0 and so the condition is satisfied.

MAIN RESULTS
The previous results motivate a corrected least squares estimator, which is shown in the present section to be consistent in both the ℓ_∞ and scaled ℓ_2 norms, and to have some predictive guarantees. The considerations involved in obtaining stronger inferential guarantees are also briefly discussed and assessed by simulation.

The corrected least squares estimator
Section 4 showed that the probability limit of a linear function of the least squares estimator is a linear function of

β_0 = (X^T X)^{−1} X^T tanh(Xβ*/2) = f(β*),

say. If this function f were invertible, then a consistent estimator of each entry of β* could potentially be obtained using f^{−1}(β̂_0). The function is not invertible, however we show that it can be rewritten as

f(β*) = ς β* + δ

for some ς ∈ R and δ ∈ R^p depending only on Xβ*. Whilst the entries of β* are difficult to estimate, estimation of Xβ* is simpler and leads to an estimator of α^T β* in the form ς̂^{−1} α^T (β̂_0 − δ̂), where ς̂ and δ̂ are estimates of ς and δ to be defined. Decompose

tanh(Xβ*/2) = ς Xβ* + ∆ + r,

where ς Xβ* = P_{Xβ*} tanh(Xβ*/2), ∆ = (P_X − P_{Xβ*}) tanh(Xβ*/2) and r ∈ Col-Sp(X)^⊥. Such a decomposition exists uniquely because the subspaces Col-Sp(Xβ*), Col-Sp(X) ∩ Col-Sp(Xβ*)^⊥ and Col-Sp(X)^⊥ are orthogonal and span the whole of R^n. It follows that

β_0 = ς β* + (X^T X)^{−1} X^T ∆,

where, to correspond with the definition of ς, we use the notation P_{Xβ*} tanh(Xβ*/2) to mean Xβ*/2 when Xβ* is the zero vector. On defining δ = (X^T X)^{−1} X^T ∆, β_0 = ς β* + δ. Based on this observation, define the corrected least squares estimator to be

β̃ = ς̂^{−1} (β̂_0 − δ̂),

where ς̂ and δ̂ are given by ς̂ η̂ = P_η̂ tanh(η̂/2) and δ̂ = (X^T X)^{−1} X^T P̂ tanh(η̂/2), with P̂ = P_X − P_η̂ and η̂ a consistent estimator of η* = Xβ* to be discussed next.
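The construction can be prototyped in a few lines. In the sketch below, the forms of ς̂ and δ̂ follow the projection decomposition described above as we have reconstructed it, and the oracle choice η̂ = Xβ* is an assumption made purely to illustrate consistency; in practice η̂ comes from section 6.

```python
import numpy as np

def corrected_ols(X, y, eta_hat):
    """Corrected least squares estimator based on the decomposition
    tanh(X beta*/2) = sigma * X beta* + Delta (+ a component outside Col(X)).

    sigma_hat projects tanh(eta_hat/2) onto span(eta_hat); delta_hat maps the
    within-Col(X) remainder through the least squares operator.
    """
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]    # OLS on the +/-1 responses
    m = np.tanh(eta_hat / 2.0)
    sigma_hat = eta_hat @ m / (eta_hat @ eta_hat)   # projection coefficient
    delta_hat = np.linalg.lstsq(X, m - sigma_hat * eta_hat, rcond=None)[0]
    return (beta0 - delta_hat) / sigma_hat

# Illustrative check with an oracle eta_hat = X beta* (assumed settings)
rng = np.random.default_rng(0)
n, p = 20000, 3
beta_star = np.array([1.0, -0.5, 0.25])
X = rng.standard_normal((n, p))
prob = 1.0 / (1.0 + np.exp(-X @ beta_star))         # pr(Y_i = 1)
y = np.where(rng.random(n) < prob, 1.0, -1.0)

beta_tilde = corrected_ols(X, y, X @ beta_star)
print(np.max(np.abs(beta_tilde - beta_star)))       # small: entries close to beta*
```

With the oracle η̂, the only error left is the sampling error in β̂_0, so the entrywise error is of order n^{−1/2}, in line with Theorem 3.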

Assumptions
To ensure that the corrected least squares estimator may be used to estimate β*, assumptions are required in addition to X being full rank with p < n. Define the class X^{(1)}_B of design matrices through a uniform tail condition holding for all t > 0 and involving η* = Xβ*.

Condition 1 makes restrictions on the unknown parameter that avoid the issues outlined in section 3. Part a) follows from part b), although there will be settings where only part a) is needed and we identify these throughout. Assumptions of this form are common in high-dimensional regimes; for example, Condition 1 is implied when β* contains at most s = O(1) non-zero entries of bounded magnitude, but may also hold for dense vectors.
Condition 2 ensures the existence of a design matrix X ∈ H_B for some B > 0. For such a matrix, the eigenvalues of X^T X/n are suitably controlled and there exists an estimator η̂ with h(η̂, η*) = o_P(1). This second statement guarantees that a consistent estimator of η* = Xβ* is available for estimation of ς and δ. The former allows the behaviour of the least squares estimator to be controlled, as in Propositions 2 and 3, and limits the accumulation of error when correcting the estimator using ς̂ and δ̂. It is sometimes sufficient to replace the eigenvalue condition by a weaker one but, for notational simplicity, we do not do this here.
When the rows of X are treated as independent and identically distributed observations from an appropriate distribution, the weak law of large numbers may be applied to both the numerator and denominator of ς̂ to deduce that ς̂^{−1} = O_P(1); see Lemma S8 for details. In section 6 we identify suitable choices of η̂ and H_B that ensure Condition 2 is met. For most of the analysis, it will be sufficient to focus on a single X ∈ H_B. However, in section 7, we consider the validity of our results in settings with separated data, focusing on the accuracy of our estimator in terms of the ℓ_∞-norm. As our results are asymptotic and the limiting probability of data separation has not yet been considered for fixed designs, section 7 treats the design matrix as random. To extend our analysis to this framework, we show that consistency in terms of the ℓ_∞-norm in the fixed design setting holds uniformly over all X ∈ H_B. It follows that our estimator consistently estimates the entries of β* with respect to certain joint distributions for X and Y, even in settings where data separation occurs with probability converging to one. For clarity, we use pr(·) to denote probability conditional on the observed value of the design and pr_{Y,X}(·) when considering the joint distribution. The latter only appears in section 7.
Our assumptions make no explicit restrictions on p relative to n beyond p < n. However, there may be some implicit constraints. Section 7 considers a more refined setting where p/n → κ ∈ [0, 1) to evaluate the validity of our approach when the data are linearly separable.

Consistent estimation
Under the assumptions in section 5.2, the corrected least squares estimator β̃ is consistent, both entry-wise and in terms of the ℓ_2-norm. We refer to the latter as the composite estimation error. This is established in Theorems 3 and 4. In light of the comments in section 5.2, consistency in terms of the ℓ_∞-norm is established uniformly over design matrices X ∈ H_B in Theorem 3. All other results focus on pointwise convergence.

THEOREM 3. Suppose conditions 1b) and 2 hold. For all t > 0, sup_{X ∈ H_B} pr(∥β̃ − β*∥_∞ > t) → 0 as p, n → ∞ with p < n and d = O(1). In particular, ∥β̃ − β*∥_∞ = o_P(1).

THEOREM 4. Suppose conditions 1a) and 2 hold. If there exist B, N > 0 such that X ∈ H_B for all n ≥ N, then p^{−1/2} ∥β̃ − β*∥_2 = o_P(1).

To derive these results, the estimation errors were decomposed into terms involving β̂_0 − β_0, ς̂ − ς, and δ̂ − δ. The results in section 4 and Condition 2 ensure that these quantities converge to zero in probability.

For the purpose of variable selection, it is the first result that is of most interest. Provided the non-zero entries of β* are sufficiently large, a variable selection procedure that selects the ŝ indices corresponding to the entries of β̃ with the largest magnitudes will asymptotically prioritise signal variables over noise variables. For a more formal Wald-based test of β*_j = 0 with appropriate calibration, the asymptotic distribution of the corrected least squares estimator is needed.

A remark on inference
Proposition 5 provides initial insights into the limiting distribution of the corrected least squares estimator, probed further by simulation in section 8. Proposition 5 shows that, up to the term Π_2, the limiting distribution of a scaled version of our estimator is standard normal. In section 8, we conduct simulations whose results suggest that Π_2 is negligible in the particular cases considered. They also suggest that a normal approximation to the distribution of the corrected least squares estimator may be accurate even when α^T β* ≠ 0. Neither observation has yet been established theoretically. Part of the difficulty arises in the term B_n^{−1}, which, up to the quantity Γ, is of order n^{1/2}. For Π_2 to converge to zero in probability, the remainder it scales would need to vanish faster than n^{−1/2}; the bound we have established is weaker and, after scaling by B_n^{−1}, is not expected to be o_P(1) in high-dimensional regimes.
If a closed-form approximation to the distribution of the corrected least squares estimator cannot be derived, a bootstrap algorithm may serve to estimate p-values. This has not yet been considered in detail and presents an avenue for future work.

Prediction error
Although the primary aim was estimation of the unknown parameter, the prediction error of the corrected least squares estimator, defined by n^{−1} ∥X(β̃ − β*)∥_2^2, may be of interest. This is somewhat misleading terminology, as Xβ* is the vector of logistic-transformed probabilities. Theorem 5 shows that the above quantity converges in probability to a value that depends on the relative dimension p/n and the signal strength.

THEOREM 5. Suppose there exist B, N > 0 such that X ∈ H_B for all n ≥ N. Then n^{−1} ∥X(β̃ − β*)∥_2^2 converges in probability to a non-zero limit depending on p/n and the signal strength.

A consequence is that Xβ̃ should not be used to estimate the individual probabilities. This is unproblematic as Condition 2 assumed the existence of an alternative estimator suitable for this purpose.
In view of Theorem 5, the iterative version need not perform better than the once-corrected version, as η̂ = Xβ̃ violates the convergence condition in Condition 2. This conclusion was also checked by simulation, but the results are not reported.
6. POSSIBLE CHOICES OF η̂

6.1. Maximum likelihood estimation

When the data cannot be separated, the maximum likelihood estimator exists and may be used to correct the least squares estimator. Proposition 6 shows that our estimator obtained by setting η̂ to Xβ̂ recovers the original maximum likelihood estimator.

PROPOSITION 6. Suppose the logistic maximum likelihood estimator β̂ exists and let η̂ = Xβ̂. Define ς̂ and δ̂ as in section 5.1, with P̂ = P_X − P_η̂. Then β̃ = ς̂^{−1}(β̂_0 − δ̂) = β̂. That is, the corrected OLS estimator recovers the logistic maximum likelihood estimator.
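Proposition 6 can be checked numerically on non-separated data. The sketch below, an illustration under assumed settings, fits the unpenalised logistic MLE directly and verifies that the corrected OLS estimator with η̂ = Xβ̂, using the projection forms of ς̂ and δ̂ as reconstructed in section 5.1, reproduces it.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.standard_normal((n, p))
beta_star = np.array([0.5, -0.5, 0.25, 0.0])
y = np.where(rng.random(n) < 1 / (1 + np.exp(-X @ beta_star)), 1.0, -1.0)

# Unpenalised logistic MLE for y in {-1, 1}: minimise sum_i log(1 + e^{-y_i x_i'b})
def nll(b):
    return np.sum(np.logaddexp(0.0, -y * (X @ b)))

def grad(b):
    # score equation in tanh form: -X' (y - tanh(Xb/2)) / 2
    return -X.T @ ((y - np.tanh(X @ b / 2.0)) / 2.0)

beta_mle = minimize(nll, np.zeros(p), jac=grad, method="BFGS").x

# Corrected OLS with eta_hat = X beta_mle
eta = X @ beta_mle
m = np.tanh(eta / 2.0)
sigma_hat = eta @ m / (eta @ eta)
beta0 = np.linalg.lstsq(X, y, rcond=None)[0]
delta_hat = np.linalg.lstsq(X, m - sigma_hat * eta, rcond=None)[0]
beta_tilde = (beta0 - delta_hat) / sigma_hat

print(np.max(np.abs(beta_tilde - beta_mle)))  # essentially zero: the estimators agree
```

The agreement follows from the score equation X^T y = X^T tanh(Xβ̂/2), which makes the correction step undo the least squares fit exactly.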
This result only serves to supply insight and is of no practical relevance. The equivalence to maximum likelihood estimation in this setting demonstrates that the corrected OLS estimator, without exploiting further assumptions, is subject to the same considerable bias as maximum likelihood estimation. In the next section, we show that when this estimator is combined with a version of η̂ that exploits constraints on Xβ*, bias is substantially reduced.

Penalised regression
To operationalise the corrected least squares estimator, a consistent estimator η̂ of η* = Xβ* satisfying h(η̂, η*) = o_P(1) is needed. A penalised regression may be used to obtain such an estimator even when the maximum likelihood estimator does not exist. This entails setting η̂ = Xβ̂(λ), where

β̂(λ) = arg max_{β ∈ R^p} {ℓ(β) − λ p(β)},   (8)

and p : R^p → R is a penalty function that does not depend on the data but ensures a unique maximiser exists. See Duffy & Santner (1989), Meier et al. (2008), Kosmidis & Firth (2021) and Fan & Peng (2004) for examples including the ridge, group LASSO, Jeffreys-prior and non-concave penalty functions respectively. Unless strong conditions are imposed on the design matrix that limit the amount of correlation between covariates, β̂(λ) rarely provides an accurate estimate of the individual entries of β*.
As an example, consider the LASSO estimator that maximises (8) with p(β) = ∥β∥_1. Lemma 1 in Meier et al. (2008) shows that β̂(λ) exists whenever λ > 0 and 0 < Σ_{i=1}^n z_i < n, that is, whenever both response classes are observed; this includes cases with separated data. Further, under a suitable sparsity assumption, the LASSO estimator produces consistent predictions whenever the design matrix is contained in the set X^{(2)}_B, defined through a compatibility condition involving S = {j : β*_j ≠ 0}, s = |S|, and β_S, the vector with entries equal to those of β for all indices in S and 0 otherwise. The proof of this result closely follows the arguments in Theorem 6.4 and Lemma 6.8 of Bühlmann & van de Geer (2011, pp. 130-134). We make minor modifications to account for our slightly weaker assumptions; see the definition of X^{(2)}_B compared to the assumptions in Lemma 6.8 of Bühlmann & van de Geer (2011).
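As a concrete sketch, an ℓ_1-penalised logistic fit can supply η̂; the data-generating settings and tuning constant below are illustrative choices, not the paper's, and the quality measure is the relative prediction error for η* = Xβ*.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p, s = 400, 40, 3
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 1.5                    # sparse true parameter
y = np.where(rng.random(n) < 1 / (1 + np.exp(-X @ beta_star)), 1, 0)

# l1-penalised logistic regression; the fit exists whenever the penalty is
# active and both response classes occur, even under separation
fit = LogisticRegression(penalty="l1", C=1.0, solver="liblinear",
                         fit_intercept=False).fit(X, y)
eta_hat = X @ fit.coef_.ravel()

eta_star = X @ beta_star
rel_err = np.linalg.norm(eta_hat - eta_star) / np.linalg.norm(eta_star)
print(rel_err)   # well below 1: eta* is tracked even though beta* entries are shrunk
```

The fitted vector η̂ = Xβ̂(λ) would then be passed to the correction step of section 5.1 in place of the oracle Xβ*.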
Proposition 7 establishes conditions under which Condition 2 is satisfied with η̂ obtained from the LASSO. As a result, the least squares estimator, after correction by the LASSO, is consistent with respect to both the ℓ_2 and ℓ_∞ norms whenever there exist B, N > 0 such that X ∈ X^{(1)}_B for all n ≥ N. We call this the OLS-LASSO estimator. In the following section, we show that when the rows of X follow a multivariate Gaussian distribution, there exist suitable constants such that the probability that X ∈ X^{(1)}_B is arbitrarily close to one for large enough sample sizes. Thus, X^{(1)}_B is non-empty.

Other approaches
Any consistent estimator of η* = Xβ* can be used for correction. Thus, if β* is sparse, Proposition 7 shows that a LASSO penalty yields an estimator with appropriate behaviour. If instead β* is dense but Xβ* is well approximated by a small number of left singular vectors of X, then it may be possible to consistently estimate η* using a sparse singular value decomposition of X. We do not explore this here, although it highlights our method's potential validity under a variety of sparsity assumptions.

RELEVANCE TO SEPARATED DATA
The applicability of our method to separated data is considered here, focusing on the OLS-LASSO estimator defined in section 6.2. As no restrictions on the observed response y are made in section 5.2, and section 6.2 only assumes 0 < Σ_{i=1}^n z_i < n, the occurrence of separability does not affect the existence of our estimator. To ensure that our asymptotic guarantees are also valid, it is necessary to verify that the limiting probability of separation is non-zero. If this were not the case then, for any t > 0, the probability that the data are separated and the estimation error exceeds t would converge to zero trivially, and so consistency could be achieved irrespective of the value of the estimator when the data are separable. The probability of data separation has not yet been studied for fixed designs and so we adopt the random design setting introduced by Candès & Sur (2020) and summarised in Condition 3.
Condition 3. The joint distribution of (Y, X) is given by (1), with the rows of X independently and identically distributed as centred multivariate normal random variables with covariance matrix Σ, and, for some κ ∈ [0, 1) and γ² ≥ 0, p/n → κ and β*^T Σ β* → γ² as p, n → ∞.

Let pr_{Y,X} denote the probability under the joint distribution of Y and X. Candès & Sur (2020, Theorem 2.2) show that there exists a decreasing function h(γ) such that the probability that the data can be separated converges to zero when κ < h(γ) and to one when κ > h(γ). Our aim is to show that the OLS-LASSO estimator is consistent with respect to the ℓ_∞-norm even in these settings where the limiting probability of separation is one. For this, we also assume Condition 4.
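The phase transition can be seen in a small simulation. With β* = 0, so γ = 0 and the threshold is κ = 1/2 (Cover's classical count, consistent with the Candès & Sur result), separation is rare for κ well below 1/2 and near-certain for κ above it. The LP feasibility check and sample sizes below are illustrative, not the authors' code.

```python
import numpy as np
from scipy.optimize import linprog

def separated(X, y):
    """LP feasibility check for complete separation: y_i x_i^T b >= 1 for all i."""
    n, p = X.shape
    res = linprog(np.zeros(p), A_ub=-(y[:, None] * X), b_ub=-np.ones(n),
                  bounds=[(None, None)] * p, method="highs")
    return res.status == 0

def separation_freq(n, p, reps, rng):
    """Frequency of separation with Gaussian X and pure-noise labels (beta* = 0)."""
    hits = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = rng.choice([-1.0, 1.0], size=n)
        hits += separated(X, y)
    return hits / reps

rng = np.random.default_rng(3)
freq_low = separation_freq(n=100, p=30, reps=20, rng=rng)   # kappa = 0.3 < 1/2
freq_high = separation_freq(n=100, p=60, reps=20, rng=rng)  # kappa = 0.6 > 1/2
print(freq_low, freq_high)  # low near 0, high near 1
```

In the regime of the second line, the MLE fails to exist on almost every replicate, yet the OLS-LASSO estimator remains well-defined.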
Proposition 8 shows that, in this random design setting, the corrected least squares estimator is consistent. If κ is sufficiently large compared to γ, then consistency is achieved alongside separation.
In particular, when κ > h(γ), consistency in the ℓ_∞-norm holds even though the data are separated with probability converging to one. Similar results are expected to hold for other quantities of interest, for example the composite estimation error, although we do not derive a result of this form. Instead, given our preference for conditional analyses, we highlight the importance of characterising the limiting probability of separation for fixed designs to ensure similar guarantees can also be provided in these settings.
8. NUMERICAL PERFORMANCE

Results of extensive numerical experimentation are reported in the supplementary material. In section 4.1 of the supplementary material, different versions of the corrected least squares estimator were obtained from various estimators of η* = Xβ*. For each estimator, error rates were examined as a function of the sample size n when p, n → ∞ with p/n kept constant. Cases with separated data were included. The results show that both the component-wise and composite estimation errors converged to zero, whilst the prediction error remained stable at a non-zero value, coinciding with the analysis in section 5. The small-sample performance was analysed in section 4.2 of the supplementary material, where the average composite estimation and prediction errors of the corrected least squares estimators were recorded in multiple contexts, keeping n fixed. The results were compared to those of the maximum likelihood estimator and Firth's (1993) estimator.
The remainder of this section fixes n = 700 and p = 70 and provides a comparison to other available methods. Based on the results in section 4.1 of the supplementary material, we focus on the SCAD (Fan & Li, 2001) correction to the least squares estimator, denoted OLS-SCAD, as it outperformed other corrections. The following four approaches were also considered: a SCAD penalised regression (Fan & Li, 2001), Firth's bias-reduced estimator (Firth, 1993), the desparsified LASSO (van de Geer et al., 2014) and the LSW estimator (Cai et al., 2021). The first comparison serves to illustrate bias removal from penalised estimators. Other penalties were examined and exhibited similar performance, therefore the results are not reported. The methods of van de Geer et al. (2014) and Cai et al. (2021) also aim to remove bias from penalised estimators in high-dimensional settings, although data separation was not explicitly considered. The data were generated as defined in section 7, with the rows of X sampled independently from a p-dimensional multivariate Gaussian distribution with mean zero and covariance matrix Σ to be specified. The outcome Y was generated from a logistic regression model with log-odds equal to Xβ*. The following three examples were considered.

Example 1. The covariance matrix and parameter vector were specified to ensure the presence of a pair of highly correlated signal variables with small marginal correlation with the response.

Example 2. The covariance matrix and parameter vector were specified to ensure the presence of a signal variable that is highly correlated with a noise variable.
Example 3. The covariance matrix and parameter vector were specified to ensure the presence of a pair of highly correlated signal variables with equal signal strength.
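The exact covariance matrices and parameter vectors of Examples 1–3 are not reproduced in this excerpt. The following hedged sketch generates data in the spirit of Example 3 under assumed values (a correlated pair with ρ = 0.9 and equal unit signal strengths, both our choices), with the logistic response generated as described above:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 700, 70   # the dimensions fixed in this section

# Assumed structure in the spirit of Example 3: a pair of highly
# correlated signal variables with equal signal strength.
rho = 0.9
Sigma = np.eye(p)
Sigma[0, 1] = Sigma[1, 0] = rho          # correlated pair (assumption)
beta_star = np.zeros(p)
beta_star[0] = beta_star[1] = 1.0        # equal signal strengths (assumption)

# Draw rows of X from N(0, Sigma) and Y from the logistic model
# with log-odds X beta_star.
L = np.linalg.cholesky(Sigma)
X = rng.normal(size=(n, p)) @ L.T
eta = X @ beta_star
Y = (rng.random(n) < 1 / (1 + np.exp(-eta))).astype(int)
print(X.shape == (n, p))
```

Examples 1 and 2 differ only in which entries of Σ and β* are non-zero.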

Due to the dependence structure among the covariates, these scenarios exemplify situations where the individual entries of β* are difficult to estimate accurately. In R = 500 Monte Carlo replications, we obtained the aforementioned estimates of β*. To probe the inferential capabilities beyond point estimation, for all except the SCAD penalised regression, we also computed 95% confidence intervals for β*_1 and β*_4, the former corresponding to a signal variable and the latter to a noise variable. Motivated by the results in section 5.4, we defined a ϑ-level test for the hypothesis H_0 : α^T β* = b_0 based on the least squares estimator, rejecting when the standardised statistic exceeds z_{1−ϑ/2}, the (1 − ϑ/2) quantile of the N(0, 1) distribution, where Γ̂ is the diagonal matrix with entries Γ̂_ii = 1 − tanh²(η̂_i/2) for i = 1, . . ., n. Approximate confidence intervals were constructed by inverting this test. No guarantees have been provided for this construction to date; the numerical results serve only to gain insights into the relevance of the term Π_2 in Proposition 5. The R functions cv.ncvreg, logistf, lasso.proj in the hdi package and LF in the SIHR package were used to compute the estimates and confidence intervals. As ncvreg necessarily includes an intercept term whereas lasso.proj does not, we fitted all models without an intercept except SCAD. This is likely to marginally favour the SCAD and OLS-SCAD results. The left column of Figure 1 shows the average estimated signal strength of entries 1–6 of β* for each example. The right column shows the distribution of estimates of β*_1 obtained via SCAD and OLS-SCAD. The results show that OLS-SCAD was able to correct the bias in the SCAD estimates. In Example 1, the SCAD estimate of β* failed to accurately characterise the two signal variables that were only weakly marginally related to the response variable, whereas OLS-SCAD was able to estimate all signal strengths accurately. In Example 2, the estimates of the highly correlated
noise and signal variables were less biased for OLS-SCAD than for SCAD. This was because SCAD estimated the signal strength of the second signal variable to be zero in a non-negligible portion of cases. Finally, in Example 3, SCAD often assigned most of the signal strength corresponding to the two highly correlated signal variables to a single signal variable, whereas OLS-SCAD spread the signal more evenly across the two variables. The other three estimators performed similarly to OLS-SCAD in terms of estimation.
Whilst the examples show that OLS-SCAD is able to estimate the effects of signal variables with improved accuracy over the estimator obtained directly via SCAD, it is often the case that the latter has smaller cumulative estimation and prediction error, particularly when p is large and β* is sparse. This is because the corrected versions of the OLS estimator are not sparse and so error is accumulated across all entries of the parameter vector.
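One plausible reading of the ϑ-level test described earlier is a Wald-type statistic for H₀: α^T β* = b₀ with a sandwich variance built from Γ̂. The exact statistic, including any ς̂ scaling, is not reproduced in this excerpt, so the following is a hedged sketch of the construction only:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_star = np.zeros(p)
beta_star[0] = 1.0
eta = X @ beta_star
Y = (rng.random(n) < 1 / (1 + np.exp(-eta))).astype(float)

beta0 = np.linalg.lstsq(X, Y, rcond=None)[0]    # least squares fit
eta_hat = X @ beta0                             # assumed plug-in for eta
Gamma = np.diag(1 - np.tanh(eta_hat / 2) ** 2)  # Gamma_ii = 1 - tanh^2(eta_hat_i / 2)

# Sandwich-type variance for alpha^T beta0 (one plausible reading,
# omitting any additional scaling used in the paper).
XtX_inv = np.linalg.inv(X.T @ X)
alpha = np.zeros(p)
alpha[3] = 1.0                                  # test a noise coordinate
var_hat = alpha @ XtX_inv @ X.T @ Gamma @ X @ XtX_inv @ alpha
z = (alpha @ beta0 - 0.0) / np.sqrt(var_hat)
reject = abs(z) > norm.ppf(0.975)               # 5%-level two-sided test
print(np.isfinite(z))
```

A confidence interval follows by inverting the test, collecting the values b₀ that are not rejected.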
Table 1 shows the average length and coverage probability of the confidence intervals. The power is also recorded. No intervals were obtained for SCAD due to the bias and highly non-Gaussian distribution of its estimates observed in Figure 1. The intervals obtained from OLS-SCAD and Firth's approach performed similarly, and moderately better than the other two approaches, with OLS-SCAD producing slightly shorter intervals than Firth's approach on average for a coverage close to 95%. The LSW method produced the largest confidence intervals, resulting in uninformative intervals that contained both zero and the true non-zero signal strength in a large number of cases. The desparsified LASSO produced the shortest intervals; however, this sometimes resulted in coverage probabilities substantially below the nominal level 0.95.

9. DISCUSSION

9.1. Extension to cases with p ≥ n

A limitation of our analysis is that corrections of the least squares estimator can only be used for inference in settings with p < n. The result is nevertheless relevant in the sparse p > n setting. For instance, in Cox & Battey (2017) a large number of low-dimensional regressions are fitted, the motivation being that if a variable is causal, its explanatory power is, to a certain extent, preserved regardless of which other variables are included. Indeed, the original motivation for studying the problem of the present paper came from the difficulties in applying logistic regression in the context of Cox & Battey (2017) due to separable data. The corrected least squares estimator may be used as an alternative to logistic regression in this context.

9.2. Extensions to other models
Beyond its relevance to settings with separated data, there are additional benefits of the new approach that may be of interest beyond the logistic regression model. The method converts an estimation and inferential problem on the entries of β* to a predictive one on Xβ*, the latter typically being easier to solve without assuming strong conditions on the design matrix. As a result, the performance is favourable even in settings where covariates are highly correlated in sample. The numerical results of section 8 exhibit this most clearly, providing examples where a penalised regression estimator fails to accurately characterise the entries of the unknown parameter, yet when this estimator is used alongside the least squares estimator, the performance is improved.
Another favourable aspect is the method's adaptability to different forms of sparsity. In our analysis, sparsity is only assumed to ensure that a LASSO penalised regression produces a consistent estimator of Xβ*. However, by making use of an alternative estimator of Xβ*, there is considerable flexibility in the form that this assumption takes. This was briefly outlined in section 6.3. It would be of interest to determine whether a version of the corrected least squares estimator can be used with similar benefits in other models.

SUPPLEMENTARY MATERIAL
The supplementary file contains proofs of the theoretical results stated in the main paper and additional numerical simulations.
as B → ∞. In particular, there exists B > 0 such that the stated inequality holds. However, by definition, CS_ϑ(Y) satisfies the coverage requirement, and so this inequality should hold for ϑ(y) by construction, whence β* ∉ CS_ϑ(y). It follows that, by definition of the confidence set, inequalities (S3) and (S4) cannot both hold and so we reach a contradiction. Thus, there exists B > 0 such that [B, ∞) ⊆ CS. Complete separation ensures c is well-defined. Consider β*T = cβT + (b, 0, . . ., 0). By definition of the confidence interval and the fact that b ∉ CS^(1), the stated bound follows, where we have applied a union bound to obtain the last inequality. Let ε = Y − E(Y) and write the decomposition with v_i = α^T(X^TX)^{−1}x_i. The random variables ε_i lie in the range [−2, 2] and hence are sub-Gaussian with ∥ε_i∥_{ψ₂} ≤ 2. As they are also independent, α^T(β̂₀ − β₀) is controlled up to a constant. Then, there exist constants C, c₁, c₂ > 0 not depending on t such that for all X ∈ X_B, the tail bound holds for some constant C_ϵ depending only on ϵ. By assumption, this bound converges to zero as n → ∞ and so the result follows.
Each ε_i is an independent random variable with zero mean and variance bounded above by one. By Theorem 5.8 in Petrov (1995, p. 154), there exists some absolute constant A > 0 such that the stated bound holds. As ε_k is a bounded random variable, the remaining terms may be bounded directly, and so the result follows.

□
Proof of Theorem 3. Fix t > 0. For ξ < 1, define A_ξ to be the event that h(η̂, η*) ≤ ξ. By definition of H_B, the supremum over this class is controlled. Our aim is to show that, for appropriately chosen ξ, the corresponding bound holds, in which case the result follows. Assume the event A_ξ holds and recall that β* = ς^{−1}(β₀ − δ) and β̂* = ς̂^{−1}(β̂₀ − δ̂). By Lemmas S5 and S6, and our assumptions, there exist constants C, N > 0 such that Π₂ < t/3 and Π₃ < t/3 when n ≥ N, where we have used the stated bound on α. Further, by Proposition 2 and equation (S5), the remaining term is controlled. Applying a union bound, the result in (S6) follows.

□
Proof of Theorem 4. The estimation error may be bounded by a sum of three terms, which we consider in turn. The first is controlled by Lemma S5, together with our assumptions and the fact that ς ≤ 1/2. For the second term, Proposition 3 applies; the third is bounded as in the proof of Lemma S6. Combining results, we conclude that p^{−1/2}∥β̂* − β*∥₂ = o_P(1).

□
Proof of Proposition 7. Deferred to section 3 of the supplementary material.

□
Proof of Proposition 8. Throughout this proof, we use pr_{Y|X}(·), pr_{Y,X}(·) and pr_X(·) to denote the probabilities under the conditional distribution of Y given X, the joint distribution of Y and X, and the marginal distribution of X. Fix t₁, t₂ > 0. Our aim is to show that the stated bound holds for every B > 0. Lemmas S8–S11 show that there exist B, N₁ with the required properties; we focus on this choice of B from now on. When s = ∥β*∥₀ = O(n^{1/2−ξ}), Proposition 7 shows that the supremum converges for all t₁ > 0, by assumption. Applying Theorem 3, there exists N₂ such that the corresponding bound holds. Let N = max{N₁, N₂} and assume n ≥ N. Then, by (S9) and (S10), the first result follows, where F is the distribution function of X. To obtain the second statement, let E be the event that the data (Y, X) are separated. Candès & Sur (2020) showed that pr(E) converges to one in the setting of interest, and so the conditional probability converges to one.
□

2. PROOFS OF ADDITIONAL RESULTS

LEMMA S1. Suppose the observed data are separated by β ∈ R^p \ {0}. If β₁ > 0 then t₁ is the largest element in the set T₁. If β₁ < 0 then t₁ is the smallest element in the set T₁.
Proof. Fix z = (z₁, . . ., z_n)^T ∈ C₁, let A = {i : z_i ≠ z̃_i} denote the set of indices where z and z̃ differ, and let ỹ_i = 2z̃_i − 1. Then, for all k ≠ 1, the separation inequality gives ∑_{i∈A} y_i x_{i1} ≥ 0, and so it follows that ∑_{i=1}^n x_{i1} z̃_i ≥ ∑_{i=1}^n x_{i1} z_i. As z was arbitrary, the result holds for all z ∈ C₁ and so t₁ = ∑_{i=1}^n x_{i1} z̃_i is the largest element of the set T₁. When β₁ < 0, we must have ∑_{i∈A} y_i x_{i1} ≤ 0. The result follows analogously, this time showing that t₁ is the smallest element of the set T₁. □

LEMMA S2. Suppose the observed data are completely separated by β ∈ R^p \ {0}. If β₁ = 0 then t₁ is the unique element of the set T₁.

Proof. Using the notation and results in Lemma S1, when the data are separated and β₁ = 0 it must hold that y_i x_i^T β = 0 for all i ∈ A by (S11). This establishes that z and z̃ ∈ C₁ can only differ at indices i where x_i^T β = 0. When the data are completely separated, x_i^T β > 0 for all i = 1, . . ., n. Thus C₁, and hence T₁, contains a unique element. It follows that the conditional probability is equal to one for all values of the unknown parameter. By taking the maximum over all sets S and noting that the upper bound does not depend on S, the result follows.
Then the stated bounds hold for any x ∈ R.

Proof. The result holds for the case x = 0 by inspection. When x ≠ 0, it is sufficient to consider x > 0 as f is even. As the functions −tanh(x) and (3 + x²)tanh(x) are convex over the positive real line, they may be bounded below by the linear terms in their respective Taylor expansions. It follows that 3x/(3 + x²) ≤ tanh(x) ≤ x, which establishes the result.
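The inequality in Lemma S4, reconstructed here as 3x/(3 + x²) ≤ tanh(x) ≤ x for x ≥ 0, can be checked numerically. This is a sanity check of the reconstruction, not part of the paper:

```python
import numpy as np

# Numerical check of 3x/(3 + x^2) <= tanh(x) <= x on a fine grid of
# positive x; by symmetry (both sides odd) this covers x < 0 as well.
x = np.linspace(1e-9, 20, 10_000)
lower = 3 * x / (3 + x ** 2)
assert np.all(lower <= np.tanh(x) + 1e-12)
assert np.all(np.tanh(x) <= x + 1e-12)
print("bounds hold")
```

The lower bound is the first truncation of the continued fraction for tanh, consistent with the convexity argument in the proof.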
Then there exists a universal constant C > 0 not depending on X such that the stated bound holds, where we have used Lemma S4 to bound |f(η_i/2) − 1| by η_i²/12. When η* ≠ 0, we have |ς̂ − ς| ≤ Π₁ + Π₂ + Π₃. Using the Cauchy–Schwarz inequality and the fact that f ≤ 1, the first term is controlled. The function xf(x/2) = 2tanh(x/2) is Lipschitz continuous with constant one, which controls the remaining terms. The result follows on combining the bounds.
Then there exists a constant C > 0 such that the bound holds for all X ∈ X_B^(1), by definition of X_B^(1). As the largest eigenvalue of a projection matrix is one and tanh(·) is Lipschitz continuous with constant one, the first term is controlled. Further, there exists some constant C > 0 bounding the second term, by Lemma S5 and the fact that |ς| ≤ 1/2. The result follows on combining these inequalities. □

LEMMA S7. For any X ∈ R^{n×p} of rank p < n, the prediction error of the OLS estimator satisfies the stated rate as p, n → ∞ with p < n.
Proof. The unscaled prediction error can be written as ∥X(β̂₀ − β₀)∥² = ∥P_X ε∥², where ε = Y − E(Y) consists of independent and centred sub-Gaussian random variables with max_{i} ∥ε_i∥_{ψ₂} ≤ 2. The expected prediction error is bounded by a multiple of p, and so by the Hanson–Wright inequality (Theorem 6.2.1 in Vershynin (2018)), for any t > 0, the tail bound holds for some universal constant C > 0. As P_X is a projection matrix of rank p, ∥P_X∥²_F = p and ∥P_X∥₂ = 1. Thus, as p ≤ n, the result follows.
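Lemma S7 says that ∥P_X ε∥² concentrates around its mean, which equals tr(P_X)·var(ε_i) = p when var(ε_i) = 1. A minimal Monte Carlo check, assuming bounded Rademacher-type errors purely for illustration:

```python
import numpy as np

# Monte Carlo check that the OLS prediction error ||P_X eps||^2 scales
# with p = rank(P_X), as in Lemma S7 (eps here Rademacher, bounded).
rng = np.random.default_rng(4)
n, p, reps = 400, 40, 200
X = rng.normal(size=(n, p))
P = X @ np.linalg.inv(X.T @ X) @ X.T       # projection onto col(X)
vals = []
for _ in range(reps):
    eps = rng.choice([-1.0, 1.0], size=n)  # centred, bounded errors
    vals.append(eps @ P @ eps)             # = ||P_X eps||^2
mean_val = np.mean(vals)
print(abs(mean_val / p - 1) < 0.2)         # E||P_X eps||^2 = tr(P) = p here
```

Dividing by n, the prediction error behaves like p/n, which is why it does not vanish when p/n is bounded away from zero.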
On the event E_λ we have (2n)^{−1}∥X^T ε∥_∞ ≤ λ, and the triangle inequality then yields the stated bound. Proof. Using a Taylor expansion, and because (3e^x + 1)(e^x − 1) ≥ 0 for all x ≥ 0, the required inequality follows. There exists N > 0 such that, when X ∈ X_B^(2) and n ≥ N, the bound holds.

R. M. LEWIS AND H. S. BATTEY
By convexity of the logistic log-likelihood and the ℓ₁-norm, and by definition of φ₀²(X), the compatibility condition yields the stated bound.
Combining this with equation (S12), we obtain the stated bound for n large enough, where the last inequality follows because M* = o(1) under the assumptions of this lemma. By Lemma S14, the quadratic term is controlled, the last inequality following because ab ≤ a² + b²/4 for all a, b ∈ R. Combining this with equation (S14) and rearranging, we obtain the desired inequality; in particular, using this and equation (S13), the result follows. The desired result can then be obtained by repeating the arguments with β̂ replaced by β̃(λ).
In particular, replacing β̂ with β̃(λ) in equations (S13) and (S15) yields the corresponding bounds. Using Lemma S14, this implies the stated inequality, and the result is obtained by replacing M* and λ by their defined values.

Proof of Proposition 7. By Lemmas S12 and S13, the stated bound holds as long as A is large enough. Now suppose β* ≠ 0; the argument proceeds analogously.

The results show that the composite estimation error decreased as a function of n. On the other hand, the prediction error remained relatively stable at a non-zero value, which coincides with the analysis in section 5.5. The average biases of the corrected least squares estimators were often close to zero for null variables. For signal variables, the bias increased slightly for the ridge, LASSO and SVD estimators, but remained small for the oracle, SCAD and MCP estimators. All estimators controlled the Type-I error close to the intended level of 0.05. The power of the test increased with the sample size and, for large enough sample sizes, was very close to one. The plots displayed in Figure S3 show that the distribution of the p-values under the null hypothesis was in close agreement with a uniform distribution.

4.2. Small-sample performance
The finite-sample performance of our estimator was tested by computing average composite estimation and prediction errors in various settings with n fixed. The data were generated as in section 4.1 with n = 100, ρ ∈ {0.5, 0.9} and γ ∈ {3, 8}. The parameter β* consisted of exactly s = 5 randomly chosen non-zero entries with equal and positive signal strength. The intercept effect was allowed to be zero in some cases.
For each combination of parameter values, R = 100 Monte Carlo replications were performed, where the design matrix was kept fixed but a new random response variable was sampled each time. In each repetition, multiple estimators of β* were obtained. The first was the usual logistic maximum likelihood estimator. We also calculated Firth's bias-reduced estimator (Firth, 1993). The others were corrections to the least squares estimator obtained by using the oracle, LASSO, ridge, SVD, SCAD and MCP estimators of η. The results are given in Table S1. For each estimate β̂ of β*, the proportion of times the estimator existed, the average relative composite estimation error ∥β̂ − β*∥₂/(√p ∥β*∥₂) and the average relative prediction error ∥X(β̂ − β*)∥₂/∥Xβ*∥₂ were recorded, as well as the standard errors for these quantities over replications. The estimation error was divided by √p ∥β*∥₂ to make the entries of Table S1 comparable on account of the varying dimension p and signal strength. Note that in some cases the maximum likelihood estimator did not exist, and so the errors for the maximum likelihood estimator were averaged only over the simulations that returned a solution.
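The two relative error summaries just defined can be computed directly. A short sketch, with the helper name `relative_errors` being ours:

```python
import numpy as np

# The error summaries recorded in Table S1, as defined in the text:
# relative composite estimation error  ||b - b*||_2 / (sqrt(p) ||b*||_2)
# relative prediction error            ||X(b - b*)||_2 / ||X b*||_2
def relative_errors(beta_hat, beta_star, X):
    est = np.linalg.norm(beta_hat - beta_star) / (
        np.sqrt(len(beta_star)) * np.linalg.norm(beta_star))
    pred = (np.linalg.norm(X @ (beta_hat - beta_star))
            / np.linalg.norm(X @ beta_star))
    return est, pred

rng = np.random.default_rng(5)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_star = np.zeros(p)
beta_star[:5] = 1.0                        # s = 5 equal positive signals
est, pred = relative_errors(beta_star, beta_star, X)
print(est == 0.0 and pred == 0.0)          # zero error at the truth
```

The √p ∥β*∥₂ denominator makes errors comparable across the varying dimensions and signal strengths of Table S1, as noted above.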
The results show that the corrected least squares estimators perform favourably in comparison to the maximum likelihood estimator and Firth's (1993) estimator, both in cases with and without data separation. This was most evident for the prediction error. In view of Theorem 5, this suggests that Firth's estimator may produce inconsistent predictions when κ ≠ 0. A more formal analysis of Firth's estimator in high-dimensional settings is required to establish this theoretically. Comparing the various corrections to the least squares estimator, the LASSO version performed the best in terms of estimation error, often outperforming even the oracle estimator. This is possible as the oracle estimator obtains β* exactly from β₀, but it may not be the best correction of the estimator β̂₀. Increasing the sample size severely increased the time required to compute Firth's estimator, whilst the times required to compute the corrected least squares estimators were less affected.
We also obtained the Probe-Frontier correction (Sur & Candès, 2019), although the results are omitted from Table S1. This gave very similar results to those obtained using Firth's estimator when the maximum likelihood estimator existed, but provided no estimate when the data were separated. It was also considerably more computationally demanding. SLOE (Yadlowsky et al., 2021) may be used to reduce the computational burden, although SLOE is also unavailable when the data are separated and so is omitted from this study.
all n ≥ N. Assume α ∈ B_d with

√(log n log p log n/n) max{1, e^{2B} √(log n log p log n/n)} = o(1).

Fig. 1: Left column: average estimated signal strengths of entries 1–6 of β* in each example obtained using OLS-SCAD (black), SCAD penalised regression (orange), Firth's estimator (red), desparsified LASSO (blue) and LSW (green). Error bars show one estimated standard deviation. True signal strengths are marked with black crosses. Right column: histogram of estimates of β*_1 obtained via OLS-SCAD (black) and SCAD penalised regression (orange) for the second signal variable in each example. The black dashed line represents the true signal strength.

by definition of the confidence set. Inequalities (S1) and (S2) cannot both hold, and so we reach a contradiction. It follows that there exists B > 0 such that [B, ∞) ⊆ CS^(1)_ϑ(y). A similar argument establishes the case where β₁ < 0. □

Proof of Theorem 2. The proof closely follows the ideas in the proof of Theorem 1. Suppose β₁ > 0 and suppose for a contradiction that for all B > 0 there exists b_B ≥ B with b_B ∉ CS^(1)_ϑ(y). Taking β* = (b_B/β₁)β for some B > 0 yields a contradiction. A similar argument establishes the case where β₁ < 0. Now assume the data are completely separated by β with β₁ = 0 and suppose for a contradiction that there exists b ∈ R with b ∉ CS^(1)_ϑ(y). Defining c and c_ϑ accordingly leads to a contradiction, and so CS^(1)_ϑ(y) = R. □

Proof of Proposition 2. Fix t > 0.
By Lemma S3, for all ϵ > 0 there exists a set N_{ϵ,d} ⊆ B_d of cardinality at most ((2 + ϵ)/ϵ)^d such that, for all possible matrices X, the supremum bound holds.

(S11) follows by the arguments above and the definition of C₁. Consider the possible cases for β₁. When β₁ > 0, inequality (S11) implies ∑_{i∈A} y_i x_{i1} ≥ 0.

Fig. S2: Average Type I error and power of the test ψ(Y; 0.05, α) for various values of n, with s = 5 and α equal to a standard basis vector. The left column corresponds to κ = 0.1 and the right column to κ = 0.5. Various estimators of η were used: oracle (black), LASSO (red), ridge regression (blue), SVD (green), SCAD (orange), MCP (purple). Error bars represent empirical standard errors.
depending on the sign of the first entry of the separating parameter. Any refinement requires either additional data or further assumptions.
If p/n does not converge to zero, there exist values of β* where the prediction error does not decay to zero. As a result, whilst the corrected least squares estimator may be usefully used for estimation and variable screening, it is less suitable for inference on the logistic transforms of η*; see section 8 for numerical examples. Nevertheless, under much weaker conditions, Xβ̃(λ) consistently estimates η*.