Jeffreys-prior penalty, finiteness and shrinkage in binomial-response generalized linear models

Penalization of the likelihood by Jeffreys' invariant prior, or by a positive power thereof, is shown to produce finite-valued maximum penalized likelihood estimates in a broad class of binomial generalized linear models. The class of models includes logistic regression, where the Jeffreys-prior penalty is known additionally to reduce the asymptotic bias of the maximum likelihood estimator; and also models with other commonly used link functions such as probit and log-log. Shrinkage towards equiprobability across observations, relative to the maximum likelihood estimator, is established theoretically and is studied through illustrative examples. Some implications of finiteness and shrinkage for inference are discussed, particularly when inference is based on Wald-type procedures. A widely applicable procedure is developed for computation of maximum penalized likelihood estimates, by using repeated maximum likelihood fits with iteratively adjusted binomial responses and totals. These theoretical results and methods underpin the increasingly widespread use of reduced-bias and similarly penalized binomial regression models in many applied fields.


Introduction
Logistic regression is one of the most frequently applied generalized linear models in statistical practice, both for inference about covariate effects on binomial probabilities and for prediction. Consider realizations y_1, . . . , y_n of independent binomial random variables Y_1, . . . , Y_n with success probabilities π_1, . . . , π_n and totals m_1, . . . , m_n, respectively. Suppose that each y_i is accompanied by a p-dimensional covariate vector x_i and that the model matrix X with rows x_1, . . . , x_n has full rank. A logistic regression model has

π_i = G(η_i(β)), with G(η) = e^η / (1 + e^η) and η_i(β) = Σ_{t=1}^p β_t x_{it} (i = 1, . . . , n), (1)

where β = (β_1, . . . , β_p)ᵀ is the p-dimensional parameter vector and x_{it} is the tth element of x_i (i = 1, . . . , n); if an intercept parameter is present in the model then the first column of X is a vector of ones. The maximum likelihood estimator β̂ of β in (1) maximizes the log-likelihood

l(β) = Σ_{i=1}^n [ y_i η_i(β) − m_i log{1 + e^{η_i(β)}} ]. (2)
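As a concrete sketch of the quantities just defined (in Python/NumPy rather than the R used later in the paper; the data and function name are invented for illustration), the logistic log-likelihood can be evaluated as:

```python
import numpy as np

def logit_loglik(beta, X, y, m):
    """Binomial logistic log-likelihood l(beta), up to the additive
    constant sum(log choose(m_i, y_i)) that does not involve beta."""
    eta = X @ beta                                 # linear predictors eta_i
    return np.sum(y * eta - m * np.log1p(np.exp(eta)))

# toy data: intercept plus one covariate
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([2.0, 5.0, 9.0])
m = np.array([10.0, 10.0, 10.0])
print(logit_loglik(np.zeros(2), X, y, m))          # at beta = 0, each pi_i = 1/2
```

At β = 0 every π_i = 1/2, so the value is −(Σ m_i) log 2 = −30 log 2 here.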
Albert & Anderson (1984) categorized the possible settings for the sample points (y_1, x_1), . . . , (y_n, x_n) into complete separation, quasi-complete separation and overlap. Specifically, if there exists a vector γ ∈ ℝᵖ such that γᵀx_i > 0 for all i with y_i > 0 and γᵀx_i < 0 for all i with y_i = 0, then there is complete separation in the sample points; if there exists a vector γ ∈ ℝᵖ such that γᵀx_i ≥ 0 for all i with y_i > 0 and γᵀx_i ≤ 0 for all i with y_i = 0, then there is quasi-complete separation in the sample points; and if neither complete nor quasi-complete separation is present, then the sample points overlap. Albert & Anderson (1984) showed that separation is necessary and sufficient for the maximum likelihood estimate to have at least one infinite-valued component. A parallel result appears in Silvapulle (1981), where it is shown that if G(η) in (1) is any strictly increasing distribution function such that − log G(η) and − log{1 − G(η)} are convex, and x_{i1} = 1 (i = 1, . . . , n), then the maximum likelihood estimate has all components finite if and only if there is overlap.
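The complete-separation condition can be checked mechanically as a linear-programming feasibility problem. The following sketch (our own function, assuming binary responses m_i = 1; production use should prefer dedicated tools such as the detect_separation method of brglm2, mentioned later) rescales the strict inequalities to a margin of one, which is equivalent by homogeneity in γ:

```python
import numpy as np
from scipy.optimize import linprog

def completely_separated(X, y):
    """Feasibility check for complete separation with binary y:
    does gamma exist with gamma'x_i > 0 when y_i = 1 and
    gamma'x_i < 0 when y_i = 0?  The strict inequalities are
    rescaled to margin 1, which is equivalent by homogeneity."""
    signs = np.where(y > 0, -1.0, 1.0)       # flip the success rows
    A_ub = signs[:, None] * X                # signs_i * gamma'x_i <= -1
    b_ub = -np.ones(len(y))
    res = linprog(c=np.zeros(X.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * X.shape[1])
    return res.status == 0                   # status 0: feasible point found

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
print(completely_separated(X, np.array([0, 0, 1, 1])))   # separated at x = 0
print(completely_separated(X, np.array([0, 1, 0, 1])))   # overlap
```

Quasi-complete separation needs a more careful treatment of the boundary cases, which is why dedicated implementations are preferable in practice.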
When data separation occurs, standard maximum-likelihood estimation procedures, such as iteratively reweighted least squares (Green, 1984), can be numerically unstable because of the large parameter values that arise as the procedures attempt to maximize (2). In addition, inferential procedures that depend directly on the estimates and the estimated standard errors, such as Wald tests, can give misleading results. For a recent review of such problems and some solutions, see Mansournia et al. (2018). Firth (1993) showed that if the logistic regression likelihood is penalized by Jeffreys' invariant prior, then the resulting maximum penalized likelihood estimator has bias of smaller asymptotic order than that of the maximum likelihood estimator in general. Specifically, for logistic regressions the reduced-bias estimator β̃ results from maximization of

l̃(β) = l(β) + (1/2) log |XᵀW(β)X|, (3)

with W(β) = diag{w_1(β), . . . , w_n(β)}, where w_i(β) = m_i ω(η_i(β)) is the working weight for the ith observation and ω(η) = e^η / (1 + e^η)² (i = 1, . . . , n). Heinze & Schemper (2002), in extensive numerical studies, observed that the reduced-bias estimates have finite values even when data separation occurs. Based on an argument about parameter-dependent adjustments to y_1, . . . , y_n and m_1, . . . , m_n stemming from the form of the gradient of (3), Heinze & Schemper (2002) conjectured that finiteness of the reduced-bias estimates holds for every combination of data and logistic regression model. Heinze & Schemper (2002) also observed that the reduced-bias estimates are typically smaller in absolute value than the corresponding maximum likelihood estimates, when the latter are finite. These observations are in agreement with the asymptotic bias of the maximum likelihood estimator in logistic regressions being approximately collinear with the parameter vector (see, for example, Cordeiro & McCullagh, 1991).
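The penalized log-likelihood (3) is easy to evaluate directly, and doing so on a separated data set illustrates why the penalty restores finiteness: along the separating direction the ordinary log-likelihood increases towards zero, but the log-determinant penalty diverges to −∞. A sketch (our own function and invented data):

```python
import numpy as np

def penalized_loglik(beta, X, y, m):
    """l(beta) + (1/2) log|X' W(beta) X| for the logit link, as in (3)."""
    eta = X @ beta
    pi = 1.0 / (1.0 + np.exp(-eta))
    w = m * pi * (1.0 - pi)                          # working weights w_i(beta)
    sign, logdet = np.linalg.slogdet(X.T @ (w[:, None] * X))
    return np.sum(y * eta - m * np.log1p(np.exp(eta))) + 0.5 * logdet

# completely separated binary data; (0, t) moves along the separating direction
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
m = np.ones(4)
print([round(penalized_loglik(np.array([0.0, t]), X, y, m), 2)
       for t in (0.0, 2.0, 20.0)])   # rises, then falls towards -infinity
```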
Example 1 illustrates the finiteness and shrinkage properties of the maximum penalized likelihood estimator in the context of estimating the strength of NBA basketball teams using a Bradley-Terry model (Bradley & Terry, 1952).
Example 1: Suppose that y_ij = 1 when team i beats team j, and y_ij = 0 otherwise. The Bradley-Terry model assumes that the contest outcome y_ij is the realization of a Bernoulli random variable with probability π_ij = exp(β_i − β_j)/{1 + exp(β_i − β_j)}, and that the outcomes for the available contests are independent. The Bradley-Terry model is a logistic regression with probabilities as in (1), for the particular X matrix whose rows are indexed by contest identifiers (i, j) and whose general element is x_{ij,t} = δ_{it} − δ_{jt} (t = 1, . . . , p). Here, δ_{it} is the Kronecker delta, with value one when t = i and zero otherwise. The parameter β_t can be thought of as measuring the ability or strength of team t (t = 1, . . . , p). Only contrasts are estimable, and an identifiable parameterization can be achieved by setting one of the abilities to zero. See, for example, Agresti (2013, § 11.6) for a general discussion of the model. We use the Bradley-Terry model to estimate the abilities of basketball teams from game outcomes in the 2014-2015 NBA regular season. For illustrative purposes, we use only the 262 games that took place before 3 December 2014, up to which date the Philadelphia 76ers had recorded 17 straight losses and no win. The dataset was obtained from www.basketball-reference.com and is also available as part of the Supplementary Material. The ability of the San Antonio Spurs, the 2013-2014 champions, is set to zero, so that each β_i is the contrast of the ability of team i with that of the San Antonio Spurs. The model is estimated via iteratively reweighted least squares, as implemented in the glm function of R (R Core Team, 2020) with default settings for the optimization. No warning or error was returned during the fitting process.
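The Bradley-Terry design can be assembled directly from its definition x_{ij,t} = δ_{it} − δ_{jt}. A minimal sketch (team indices and contests invented; the function name is ours), with the reference team's column dropped to fix its ability to zero:

```python
import numpy as np

def bradley_terry_X(contests, n_teams, ref=0):
    """Rows indexed by contests (i, j); column t holds delta_it - delta_jt.
    Dropping the column of the reference team `ref` fixes its ability at
    zero and makes the remaining ability contrasts identifiable."""
    X = np.zeros((len(contests), n_teams))
    for row, (i, j) in enumerate(contests):
        X[row, i] = 1.0      # team i listed first in the pair
        X[row, j] = -1.0
    return np.delete(X, ref, axis=1)

contests = [(0, 1), (1, 2), (0, 2)]
print(bradley_terry_X(contests, 3))   # 3 contests, 2 identifiable contrasts
```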
The top panel in the left plot of Figure 1 shows the reported maximum likelihood estimates of the contrasts, along with their corresponding nominally 95% individual Wald-type confidence intervals. The contrast for the Philadelphia 76ers stands out in the output from glm, with a value of −19.24 and a corresponding estimated standard error of 844.97. These values are in fact representations of −∞ and ∞, respectively, as confirmed by the detect_separation method of the brglm2 R package (Kosmidis, 2020), which implements separation-detection algorithms from a 2007 University of Oxford Department of Statistics PhD thesis by K. Konis. The data are separated, with the maximum likelihood estimates for all teams being finite except that for the Philadelphia 76ers, which is minus infinity. A particularly worrying side-effect of data separation here is that, if the computer output is used naively, a Wald test for the difference in ability between the Philadelphia 76ers and the San Antonio Spurs results in no apparent evidence of a difference, which is counterintuitive given that the former had no wins in 17 games and the latter had 13 wins in 17 games. In contrast, the reduced-bias estimates in the bottom panel of the left plot of Figure 1 all have finite values and finite standard errors. The right plot in Figure 1 illustrates the shrinkage of the reduced-bias estimates towards zero that has also been discussed in a range of different settings, for example in Heinze & Schemper (2002) and Zorn (2005).
The apparent finiteness and shrinkage properties of the reduced-bias estimator, coupled with the fact that the estimator has the same first-order asymptotic distribution as the maximum likelihood estimator, are key reasons for the increasingly widespread use of Jeffreys-prior penalized logistic regression in applied work. At the time of writing, Google Scholar records approximately 2700 citations of Firth (1993), more than half of which are from 2015 or later. The list of application areas is diverse, including for example agriculture and fisheries research, animal and plant ecology, criminology, commerce, economics, psychology, health and medical sciences, politics and many more. The particularly strong uptake of the method in the health and medical sciences, and in politics, stems largely from the work of Heinze & Schemper (2002) and Zorn (2005), respectively. The reduced-bias estimator is also implemented in dedicated open-source software, such as the brglm2 (Kosmidis, 2020) and logistf (Heinze & Ploner, 2018) R packages, and it has now become part of textbook treatments of logistic regression; see, for example, Agresti (2013, § 7.4) or Hosmer et al. (2013).
However, a definitive theoretical account of the empirically evident finiteness and shrinkage properties has yet to appear in the literature. Such a formal account is much needed, particularly in light of recent advances that demonstrate benefits of the reduced-bias estimator in wider contexts than the ones for which it was originally developed. An example of such an advance is Lunardon (2018), which explores the performance of bias reduction in stratified settings and shows that bias reduction is particularly effective for inference about a low-dimensional parameter of interest in the presence of high-dimensional nuisance parameters. For the estimation of high-dimensional logistic regression models with p/n → κ, κ ∈ (0, 1), experiments reported in the supplementary information of Sur & Candès (2019) (see also Section S3.3 in the Supplementary Material) show that bias reduction performs similarly to their newly proposed method, and markedly better than maximum likelihood. These new theoretical and empirical results justify and motivate use of the reduced-bias estimator in even more complex applied settings than the ones covered by the framework of Firth (1993); in such settings, more involved methods such as modified profile likelihoods (see, for example, Sartori, 2003) and approximate message-passing algorithms have also been proposed for recovering inferential accuracy. This paper formally derives the finiteness and shrinkage properties of reduced-bias estimators for logistic regressions under only the condition that the model matrix X has full rank. We also provide geometric insight into how penalized likelihood estimators shrink towards zero, and discuss the implications of finiteness and shrinkage for inference, especially in hypothesis tests and confidence regions based on Wald-type procedures.
It is shown how the results extend in a direct way to other commonly-used link functions, such as the probit, log-log, complementary log-log and cauchit, whenever the Jeffreys prior is used as a likelihood penalty. The work presented here thus complements earlier work of Ibrahim & Laud (1991) and especially Chen et al. (2008), which studies the same models from a Bayesian perspective. Here we study the behaviour of the posterior mode and thereby derive results that add to those earlier findings, whose focus was instead on important Bayesian aspects such as propriety and moments of the posterior distribution.
The results in this paper also readily extend to situations where penalized log-likelihoods of the form

l†(β; a) = l(β) + a log |XᵀW(β)X| (4)

are used, with a allowed to take values other than 1/2. Such penalized log-likelihoods have proved useful in prediction contexts, where the value of a can be tuned to deliver better estimates of the binomial probabilities, and they are the subject of ongoing research (see, for example, Elgmati et al., 2015; Puhr et al., 2017). The repeated maximum-likelihood fits procedure with iteratively adjusted binomial responses and totals, derived in Section 4, maximizes l†(β; a) for general binomial-response generalized linear models and any a > 0.

Preamble
Results on finiteness and shrinkage of the maximum penalized likelihood estimator are derived first for logistic regression, which is the leading case in applications and also the case for which maximum penalized likelihood, with Jeffreys-prior penalty, coincides with asymptotic bias reduction. These results provide a platform for the generalization to link functions other than logit in Section 3.

Finiteness
Let W*(r) denote W(β) evaluated at β = β(r), r ∈ ℝ, where β(r) is a path in ℝᵖ such that β(r) → β₀ as r → ∞, with β₀ having at least one infinite component. Theorem 1 below describes the limiting behaviour of the determinant of the expected information matrix XᵀW*(r)X as r diverges to infinity, under only the assumption that X is of full rank. An important implication of Theorem 1 is Corollary 1, which shows that the reduced-bias estimates for logistic regressions are always finite. These new results formalize a sketch argument made in Firth (1993, § 3.3).

Theorem 1. Suppose that X has full rank. Then lim_{r→∞} |XᵀW*(r)X| = 0.
Corollary 1. Suppose that X has full rank. Then the vector β̃ that maximizes l̃(β) in (3) has all of its components finite.
The proofs of Theorem 1 and Corollary 1 are given in the Supplementary Material. Corollary 1 also holds for any fixed a > 0 in (4). As a result, the maximum penalized likelihood estimators from the maximization of l † (β; a) in (4) have finite components, for any a > 0.
Despite its practical utility, the finiteness of the reduced-bias estimator has some notable, and perhaps undesirable, side-effects on Wald-type inferences that have been largely overlooked in the literature. The finiteness of β̃ implies that the estimated standard errors s_t(β̃) (t = 1, . . . , p), calculated as the square roots of the diagonal elements of the inverse of XᵀW(β̃)X, are also always finite. Since y_1, . . . , y_n are realizations of binomial random variables, there is only a finite number of values that the estimator β̃ can take for any given x_1, . . . , x_n. Hence, there will always be a parameter vector with large enough components that the usual Wald-type confidence intervals β̃_t ± z_{1−α/2} s_t(β̃), or confidence regions in general, will fail to cover, regardless of the nominal level α that is used. This has also been observed in the complete enumerations of Kosmidis (2014) for proportional odds models, which are extensions of logistic regression to ordinal responses; and it is also true when the penalized likelihood is profiled for the construction of confidence intervals, as proposed, for example, in Heinze & Schemper (2002), and in Bull et al. (2007) for multinomial regression models.
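For reference, the Wald quantities in question can be computed as follows (a sketch with invented data; β here is simply a supplied estimate, not a fitted one, and the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def wald_intervals(beta, X, m, alpha=0.05):
    """Wald intervals beta_t +/- z_{1-alpha/2} s_t(beta), where s_t(beta)
    are square roots of the diagonal of {X' W(beta) X}^{-1} (logit link)."""
    eta = X @ beta
    pi = 1.0 / (1.0 + np.exp(-eta))
    w = m * pi * (1.0 - pi)                     # working weights
    cov = np.linalg.inv(X.T @ (w[:, None] * X))
    se = np.sqrt(np.diag(cov))
    z = norm.ppf(1.0 - alpha / 2.0)
    return np.column_stack([beta - z * se, beta + z * se])

X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
m = np.array([10.0, 10.0, 10.0])
print(wald_intervals(np.array([0.2, 0.8]), X, m))
```

Because a finite β̃ always yields finite s_t(β̃), these intervals always have finite endpoints, which is exactly what drives the coverage failure described above for parameters with large components.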

Shrinkage
The following theorem is key when exploring the shrinkage properties of the reduced-bias estimator that have been illustrated in Example 1.
Theorem 2: Suppose that X has full rank. Then i) |XᵀW(π)X|, viewed as a function of the vector of probabilities π = (π_1, . . . , π_n)ᵀ, is globally maximized at π = (1/2, . . . , 1/2)ᵀ; and ii) log |XᵀW(π)X| is a concave function of π.

Table 1: Common link functions and the corresponding forms for G(η) and ω(η); Φ and φ denote the standard normal distribution and density functions. For all the displayed link functions, ω(η) vanishes as η diverges.

Link function          G(η)                    ω(η)
logit                  e^η/(1 + e^η)           e^η/(1 + e^η)²
probit                 Φ(η)                    φ(η)²/[Φ(η){1 − Φ(η)}]
log-log                exp(−e^{−η})            e^{−2η} exp(−e^{−η})/{1 − exp(−e^{−η})}
complementary log-log  1 − exp(−e^η)           e^{2η} exp(−e^η)/{1 − exp(−e^η)}
cauchit                1/2 + arctan(η)/π       1/[π²(1 + η²)² G(η){1 − G(η)}]

A complete proof of Theorem 2 is in the Supplementary Material. Part i) also follows directly from Chen et al. (2008, Theorem 1).
Consider estimation by maximization of the penalized log-likelihood l†(β; a) in (4) for a = a₁ and a = a₂ with a₁ > a₂ ≥ 0. Let β(a₁) and β(a₂) be the maximizers of l†(β; a₁) and l†(β; a₂), respectively, and π(a₁) and π(a₂) the corresponding estimated n-vectors of probabilities. Then, by the concavity of log |XᵀW(π)X|, the vector π(a₁) is closer to (1/2, . . . , 1/2)ᵀ than is π(a₂), in the sense that π(a₁) lies within the hull of the convex contour of log |XᵀW(π)X| that contains π(a₂). With the specific values a₁ = 1/2 and a₂ = 0, the last result compares maximization of the likelihood penalized by Jeffreys' prior with maximization of the unpenalized likelihood. Hence, use of reduced-bias estimators for logistic regressions has the effect of shrinking towards the model that implies equiprobability across observations, relative to maximum likelihood. Shrinkage here is according to a metric based on the expected information matrix rather than Euclidean distance; hence, the reduced-bias estimates are only typically, rather than always, smaller in absolute value than the corresponding maximum likelihood estimates.
If the determinant of the inverse of the expected information matrix is considered as a generalized measure of the asymptotic variance, then the estimated generalized asymptotic variance at the reduced-bias estimates is always smaller than the corresponding estimated variance at the maximum likelihood estimates. Hence approximate confidence ellipsoids, based on asymptotic normality of the reduced-bias estimator, are reduced in volume.
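This determinant comparison can be checked numerically with a generic optimizer (a sketch in Python/NumPy with invented, overlapping data so that the maximum likelihood estimate is finite; the paper's own computations use R):

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, X, y, m, a):
    """-(l(beta) + a log|X' W(beta) X|); a = 0 gives plain maximum likelihood."""
    eta = X @ beta
    pi = 1.0 / (1.0 + np.exp(-eta))
    w = m * pi * (1.0 - pi)
    _, logdet = np.linalg.slogdet(X.T @ (w[:, None] * X))
    return -(np.sum(y * eta - m * np.log1p(np.exp(eta))) + a * logdet)

X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([3.0, 5.0, 8.0])
m = np.array([10.0, 10.0, 10.0])
b_ml = minimize(neg_loglik, np.zeros(2), args=(X, y, m, 0.0)).x
b_rb = minimize(neg_loglik, np.zeros(2), args=(X, y, m, 0.5)).x

def logdet_info(beta):
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = m * pi * (1.0 - pi)
    return np.linalg.slogdet(X.T @ (w[:, None] * X))[1]

# |X' W X| is larger at the penalized estimate, so the generalized
# asymptotic variance |{X' W X}^{-1}| is smaller there
print(logdet_info(b_rb) >= logdet_info(b_ml))
```

The inequality follows directly from the two maximizations: l(β̂) ≥ l(β̃) and l(β̃) + (1/2) log|XᵀW(β̃)X| ≥ l(β̂) + (1/2) log|XᵀW(β̂)X| together force log|XᵀW(β̃)X| ≥ log|XᵀW(β̂)X|.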

Non-logistic link functions

Preamble
The results here generalize Sections 2.2 and 2.3 beyond the logit link, still for estimators from penalized likelihoods of form (4). For non-logistic links, such estimators no longer coincide with the bias-reduced estimator of Firth (1993).

Finiteness
The results in Theorem 1 and Corollary 1 readily extend to link functions other than the logistic. Specifically, if G(η) = e^η/(1 + e^η) in model (1) is replaced by an at least twice differentiable, invertible function G : ℝ → (0, 1), then the expected information matrix again has the form XᵀW(β)X, but with working weights w_i(β) = m_i ω(η_i(β)), where ω(η) = g(η)² / [G(η){1 − G(η)}] and g(η) = dG(η)/dη. If the link function is such that ω(η) → 0 as η diverges to either −∞ or ∞, then the proofs of Theorem 1 and Corollary 1 in the Supplementary Material apply unaltered to show that lim_{r→∞} |XᵀW*(r)X| = 0 and that, when the penalty is a positive power of Jeffreys' invariant prior, the maximum penalized likelihood estimates have finite components. The logit, probit, complementary log-log, log-log and cauchit links are commonly-used link functions for which ω(η) → 0. The functions G(η) and ω(η) for each of these link functions are shown in Table 1.

Shrinkage
If the link function is such that the weight expressed as a function of the probability, ω̄(z) = ω(G⁻¹(z)), is maximized at some value z₀ ∈ (0, 1), then the same arguments as in the proof of result i) of Theorem 2 can be used to show that |XᵀW(π)X| is globally maximized at (z₀, . . . , z₀)ᵀ. The left plot of Figure 2 illustrates that this condition is satisfied for the logit, probit, log-log and complementary log-log link functions. If x_{i1} = 1 (i = 1, . . . , n), then the maximum of |XᵀW(β)X| is achieved at β = (b₀, 0, . . . , 0)ᵀ, where b₀ = G⁻¹(z₀). In addition, directly from the proof of Theorem 2, a sufficient condition for the log-concavity of |XᵀW(π)X| for non-logit link functions is that ω̄(z) is concave.
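The maximizer z₀ can be found numerically by evaluating ω̄(z) = ω(G⁻¹(z)) on a grid. A sketch for the two symmetric links, where z₀ = 1/2 and hence b₀ = G⁻¹(1/2) = 0 (function names ours):

```python
import numpy as np
from scipy.special import expit, logit
from scipy.stats import norm

def omega_bar(z, G_inv, g):
    """Weight omega expressed as a function of the probability z = G(eta)."""
    eta = G_inv(z)
    return g(eta) ** 2 / (z * (1.0 - z))

z = np.linspace(1e-4, 1.0 - 1e-4, 100001)
# logit: omega_bar(z) = z(1 - z), maximized at z0 = 1/2
z0_logit = z[np.argmax(omega_bar(z, logit, lambda e: expit(e) * (1.0 - expit(e))))]
z0_probit = z[np.argmax(omega_bar(z, norm.ppf, norm.pdf))]
print(round(z0_logit, 3), round(z0_probit, 3))   # both 0.5 by symmetry
```

For the asymmetric log-log and complementary log-log links the same grid search gives a z₀ away from 1/2, consistent with the left plot of Figure 2.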

Maximum penalized likelihood as repeated maximum likelihood
The maximum penalized likelihood estimates, for full-rank X, can be computed by direct numerical optimization of the penalized log-likelihood l†(β; a) in (4), or by using a quasi-Newton-Raphson iteration as in Kosmidis & Firth (2010). Nevertheless, the particular form of the Jeffreys prior allows penalized likelihood estimates to be computed conveniently by leveraging readily available maximum-likelihood implementations for binomial-response generalized linear models.
If G(η) = e^η/(1 + e^η) in model (1) is replaced by any invertible function G : ℝ → (0, 1) that is at least twice differentiable, then differentiation of l†(β; a) with respect to β_t (t = 1, . . . , p) shows that the penalized likelihood estimates are the solutions of the adjusted score equations (5). If we temporarily omit the observation index and suppress the dependence of the various quantities on β, the derivatives of l†(β; a) are the derivatives of the binomial log-likelihood l(β) with link function G(η), after the binomial response y is adjusted to y + 2ah(q − 1/2), for a leverage-type quantity h and a link-dependent quantity q. Hence, the penalized likelihood estimates can be conveniently computed through repeated maximum-likelihood fits, where each repetition consists of two steps: P1) the adjusted responses are computed at the current parameter values; and P2) the maximum likelihood estimates of β are computed at the current value of the adjusted responses.
However, depending on the sign and magnitude of 2ah(q − 1/2), the adjusted response can be either negative or greater than the binomial total m. In such cases, standard implementations of maximum likelihood estimation are either unstable or report an error, because the binomial log-likelihood is not necessarily concave when y < 0 or y > m for at least one observation, even when a link function with concave log G(η) and log{1 − G(η)} is used. Logit, probit, log-log and complementary log-log are link functions of this kind. See, for example, Pratt (1981, § 5) for results and discussion on concavity of the log-likelihood.
Such issues with the use of repeated maximum-likelihood fits can be avoided by noting that expression (5) results if, in the derivatives of the log-likelihood, y and m are replaced, respectively, by their adjusted versions

ỹ = y + 2ah(q − 1/2 + πc) and m̃ = m + 2ahc. (6)
Here c is some arbitrarily chosen function of β. The following theorem identifies one function c for which 0 ≤ ỹ ≤ m̃. The proof of Theorem 3 is given in the Supplementary Material, which also provides pseudocode (see Algorithm S1) and R code for Algorithm JeffreysMPL, which implements repeated maximum-likelihood fits to maximize l†(β; a) for any supplied a and link function G(η).
The variance-covariance matrix of the penalized likelihood estimator can be obtained as (RᵀR)⁻¹, where R is the upper triangular matrix from the QR decomposition of W(β)^{1/2} X at the final iteration of the procedure. That decomposition is a by-product of JeffreysMPL.
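The equivalence of (RᵀR)⁻¹ with the usual inverse information {XᵀW(β)X}⁻¹ is immediate, since RᵀR = RᵀQᵀQR = XᵀWX; a quick numerical check (values invented for illustration):

```python
import numpy as np

# (R'R)^{-1} from the QR factorization of W^{1/2} X equals {X' W X}^{-1}
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=6)])
w = rng.uniform(0.1, 1.0, size=6)                # working weights w_i(beta)
R = np.linalg.qr(np.sqrt(w)[:, None] * X, mode="r")
cov_qr = np.linalg.inv(R.T @ R)
cov_direct = np.linalg.inv(X.T @ (w[:, None] * X))
print(np.allclose(cov_qr, cov_direct))           # True
```

Working from R rather than forming XᵀWX directly is the numerically preferable route, which is why the QR by-product of the final fit is convenient.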
If, in addition to full rank X, we require that X has a column of ones and that g(η) is a unimodal density function, then it can be shown that if the starting value of β in the repeated maximum-likelihood fits procedure has finite components, then the values of β computed in step P2 also have finite components at all repetitions. This is because, with a column of ones in the full-rank X, the adjusted responses and totals in (6) satisfy 0 < ỹ < m̃, and hence maximum likelihood estimates with infinite components are not possible. The strict inequalities 0 < ỹ < m̃ hold because, under the aforementioned conditions, w_i(β) > 0 and XᵀW(β)X is positive definite for β with finite components. Then, Magnus & Neudecker (1999, Chapter 11, Theorem 4) on bounds for the Rayleigh quotient gives the inequality h_i(β) ≥ w_i(β) x_iᵀx_i λ(β) > 0 (i = 1, . . . , n), where λ(β) > 0 is the minimum eigenvalue of {XᵀW(β)X}⁻¹.
The repeated maximum-likelihood fits procedure has the correct fixed point even if, at step P2, full maximum-likelihood estimation is replaced by a procedure that merely increases the log-likelihood, such as a single step of iteratively reweighted least squares for the adjusted responses and totals. Firth (1992) suggested such a scheme for logistic regressions with a = 1/2. There is currently no conclusive result on whether full maximum-likelihood iteration with a reasonable stopping criterion is better or worse than, for example, one step of iteratively reweighted least squares, in terms of computational efficiency. A satisfactory starting value for the above procedure is the maximum likelihood estimate of β, after adding a small positive constant and twice that constant to the actual binomial responses and totals, respectively.
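In the logistic case the adjustments take a particularly transparent form: the penalty with exponent a corresponds to replacing y_i by y_i + a h_i and m_i by m_i + 2a h_i, where h_i = w_i(β) x_iᵀ{XᵀW(β)X}⁻¹x_i is the ith leverage; for a = 1/2 this is the familiar augmentation of Firth (1993), adding h_i/2 successes and h_i/2 failures to each observation. A minimal Python/NumPy sketch of the repeated-fits procedure under this logit-specific form (function names are ours; the paper's general algorithm is JeffreysMPL, in R):

```python
import numpy as np

def logistic_mle(X, y, m, beta=None, tol=1e-10, maxit=50):
    """Newton iteration for binomial-logit maximum likelihood;
    y and m may be non-integer (as with adjusted responses/totals)."""
    beta = np.zeros(X.shape[1]) if beta is None else beta.copy()
    def loglik(b):
        eta = X @ b
        return np.sum(y * eta - m * np.log1p(np.exp(eta)))
    for _ in range(maxit):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))
        w = m * pi * (1.0 - pi)
        step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - m * pi))
        while loglik(beta + step) < loglik(beta):   # simple step-halving
            step /= 2.0
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def jeffreys_mpl_logit(X, y, m, a=0.5, tol=1e-8, maxit=100):
    """Repeated ML fits: P1 adjust responses/totals, P2 refit by ML."""
    beta = logistic_mle(X, y + 0.01, m + 0.02)      # finite starting value
    for _ in range(maxit):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))
        w = m * pi * (1.0 - pi)
        XtWXinv = np.linalg.inv(X.T @ (w[:, None] * X))
        h = w * np.einsum("ij,jk,ik->i", X, XtWXinv, X)    # leverages h_i
        new = logistic_mle(X, y + a * h, m + 2.0 * a * h, beta)
        if np.max(np.abs(new - beta)) < tol:
            return new
        beta = new
    return beta

# completely separated data: the MLE is infinite, the penalized fit is not
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
m = np.ones(4)
print(jeffreys_mpl_logit(X, y, m))
```

Because 0 ≤ h_i ≤ 1, the adjusted responses y + a h and totals m + 2a h always satisfy 0 < ỹ < m̃, so every inner maximum-likelihood fit is finite, in line with the discussion above.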
Finally, for a = 1/2, repeated maximum-likelihood fits can be used to compute the posterior normalizing constant when implementing the importance sampling algorithm in Chen et al. (2008, § 5) for posterior sampling of the parameters of Bayesian binomial-response generalized linear models with the Jeffreys prior.
Section S3 of the Supplementary Material illustrates the evolution of the adjusted responses and totals through the iterations of JeffreysMPL, for the first 6 games of the Philadelphia 76ers in Example 1. Section S3 also computes the reduced-bias estimates for a logistic regression model with n = 1000 binary responses and p = 200 covariates, as considered in the supplementary information of Sur & Candès (2019).

Illustrations
The left plot of Figure 2 shows ω̄(z) and z₀ for the various links. The plot for the log-log link is the reflection of the one for the complementary log-log through z = 0.5. As is apparent, ω̄(z) is concave for the logit, probit and complementary log-log links, but not for the cauchit link. The right plot of Figure 2 visualizes the shrinkage induced by penalization by Jeffreys' invariant prior for the logit, probit, complementary log-log and cauchit links. For each link function, we obtain all possible fitted probabilities from a complete enumeration of a saturated model with π_i = G(β₁ + β₂x_i) (i = 1, 2), where x₁ = −1, x₂ = 1, m₁ = 9 and m₂ = 9. The grey curves are the contours of log |XᵀW(π)X|. An arrow is drawn from each pair of estimated probabilities based on the maximum likelihood estimates to the corresponding pair based on the penalized likelihood estimates, to demonstrate the induced shrinkage towards (z₀, z₀)ᵀ in accord with the results in Section 3. Despite the fact that ω̄(z) is not concave for the cauchit link, the fitted probabilities still shrink towards (z₀, z₀)ᵀ = (1/2, 1/2)ᵀ. The plots in Figure 2 are invariant to the particular choice of x₁ and x₂, as long as x₁ ≠ x₂. For either maximum likelihood or maximum penalized likelihood, if the estimates of β₁ and β₂ are b₁ and b₂ for x₁ = −1 and x₂ = 1, then the estimates for any x₁, x₂ ∈ ℝ with x₁ ≠ x₂ are b₁ − b₂(x₁ + x₂)/(x₂ − x₁) and 2b₂/(x₂ − x₁), respectively. Hence, the fitted probabilities are identical. Another illustration of finiteness and shrinkage follows from Example 1. Figure 3 shows the paths of the team ability contrasts as a varies from 0 to 5. The estimates are obtained using JeffreysMPL, starting at the maximum likelihood estimates of the ability contrasts after adding 0.01 and 0.02 to the actual responses and totals, respectively.
In accord with the theoretical results in Section 2.2, the estimated ability contrasts are finite for every a > 0; and, as expected from the results in Section 2.3, shrinkage towards equiprobability becomes stronger as a increases.
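The invariance claim for the saturated two-point model (identical fitted probabilities under relocation of x₁ and x₂) can be verified directly; the numbers below are arbitrary:

```python
import numpy as np
from scipy.special import expit

# If (b1, b2) are the estimates for x1 = -1, x2 = 1, then for general
# x1 != x2 the estimates b1 - b2 (x1 + x2)/(x2 - x1) and 2 b2/(x2 - x1)
# reproduce the same linear predictors, hence the same fitted probabilities.
b1, b2 = 0.3, 1.7
x1, x2 = 2.5, 4.0
b1_new = b1 - b2 * (x1 + x2) / (x2 - x1)
b2_new = 2.0 * b2 / (x2 - x1)
old = expit(np.array([b1 - b2, b1 + b2]))               # at x = -1, +1
new = expit(np.array([b1_new + b2_new * x1, b1_new + b2_new * x2]))
print(np.allclose(old, new))                             # True
```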

Concluding remarks
A recent stream of literature investigates the use of the coefficient path defined by maximization of the penalized log-likelihood (4) for the prediction of rare events through logistic regression. Elgmati et al. (2015) study that path for a ∈ (0, 1/2] and propose taking a to be around 0.1, in order to handle issues related to infinite estimates while obtaining predicted probabilities that are less biased than those based on the reduced-bias estimates (a = 0.5). More recently, Puhr et al. (2017) proposed two new methods for the prediction of rare events, and performed extensive simulation studies comparing their performance with that of various methods, including maximum penalized likelihood with a = 0.1 and a = 0.5. The coefficient path can be computed efficiently by using repeated maximum-likelihood fits with "warm" starts. For a grid of values a₁ < . . . < a_k with a_j > 0 (j = 1, . . . , k), JeffreysMPL (Algorithm S1 in the Supplementary Material) is first applied with a = a₁ to obtain the maximum penalized likelihood estimates β(a₁); then JeffreysMPL is applied again with a = a₂ and starting values b = β(a₁), and so on, until β(a_k) has been computed. This process supplies JeffreysMPL with the best available starting values as the algorithm walks through the grid. The finiteness of the components of β(a₁), . . . , β(a_k) and the shrinkage properties described in Sections 2.3 and 3 contribute to the stability of the overall process. The properties of the coefficient path for inference and prediction from binomial regression models, and the development of general procedures for selecting a, are interesting open research topics.
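A warm-started path can be sketched with a generic optimizer in place of JeffreysMPL (Python/NumPy with invented, separated data; grid values are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

def neg_pen_loglik(beta, X, y, m, a):
    """-(l(beta) + a log|X' W(beta) X|) for logistic regression, as in (4)."""
    eta = X @ beta
    pi = 1.0 / (1.0 + np.exp(-eta))
    w = m * pi * (1.0 - pi)
    ll = np.sum(y * eta - m * np.log1p(np.exp(eta)))
    return -(ll + a * np.linalg.slogdet(X.T @ (w[:, None] * X))[1])

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])     # completely separated data
m = np.ones(4)
path, start = {}, np.zeros(2)
for a in (0.1, 0.25, 0.5, 1.0, 2.0):
    start = minimize(neg_pen_loglik, start, args=(X, y, m, a)).x
    path[a] = start.copy()             # warm start for the next a
print({a: np.round(b, 2) for a, b in path.items()})
```

Every estimate along the path is finite even though the maximum likelihood estimate is not, and the slope shrinks as a grows, matching the behaviour of Figure 3.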
Kenne Pagui et al. (2017) develop a method that can reduce the median bias of the components of the maximum likelihood estimator. According to the results therein, median bias reduction for one-parameter logistic regression models is equivalent to maximizing (4) with a = 1/6. Hence, the results in Section 2 also establish the finiteness of the estimate from median bias reduction in one-parameter logistic regression, and that the induced shrinkage towards equiprobability is less strong than under penalization by the Jeffreys prior. Kenne Pagui et al. (2017) observed such properties in numerical studies for p > 1. When p > 1, though, median bias reduction is no longer equivalent to maximizing (4) with a = 1/6.

Acknowledgments
Ioannis Kosmidis and David Firth are supported by The Alan Turing Institute under EPSRC grant EP/N510129/1. David Firth was also partly supported by EPSRC programme grant EP/K014463/1, Intractable Likelihood: New Challenges from Modern Applications.

Supplementary material
The Supplementary Material is available for download at http://www.ikosmidis.com/files/ finiteness-jeffreys-supplementary-v1.4.zip and includes: a document with proofs for Theorems 1, 2, 3 and Corollary 1; Algorithm S1, and some additional numerical results; and R code and data to reproduce all of the numerical work and graphs.

S1 Supplementary material
All labels for the sections, equations, the table, the algorithm and the figure in the current document have been prefixed by "S" (e.g. Section S2, Table S1, Algorithm S1, etc).
The supplementary material for "Jeffreys-prior penalty, finiteness and shrinkage in binomial-response generalized linear models" provides i) the proofs of Theorem 1, Corollary 1, Theorem 2 and Theorem 3 in the main text (see Section S2); ii) the pseudocode for Algorithm JeffreysMPL (see Algorithm S1), which implements the repeated maximum-likelihood fits procedure of Section 4 in the main text to maximize the penalized log-likelihood l†(β; a) in (4) for any supplied a and link function G(η); iii) illustrations using an R implementation of Algorithm S1; and iv) R scripts to reproduce all numerical results and figures in the main text and the current document. The current supplementary material document and the R scripts are available for download at http://www.ikosmidis.com/files/finiteness-jeffreys-supplementary-v1.3.zip.
In particular, the script nba-1415-case-study.R reproduces the numerical results in Example 1, Figure 1 and Figure 3 in the manuscript, and in Table S1 in the current document; the files nba-1415-functions.R and nba-1415-regular-season.csv provide the R functions and the data, respectively, for the case study in the main text; the script jeffreys-shrinkage.R reproduces Figure 2 in the main text; the file jeffreys-MPL.R provides an R implementation of Algorithm JeffreysMPL; and the file sur-candes-2019.R computes the timings for Algorithm JeffreysMPL that are reported in Section S3.3, and reproduces Figure 2b on page 11 of the supplementary information appendix of Sur & Candès (2019) (Figure S1 here).

S2.2 Proof of Corollary 1
The binomial log-likelihood l(β) in (2) is bounded above by zero. Hence, according to Theorem 1 and expression (3), l†(β(r); a) → −∞ as β(r) → β0. Any such β is therefore dominated by a choice b with finite components, for which l†(b; a) takes a finite value, and so the maximizer of l†(β; a) must have finite components.

S2.3 Proof of Theorem 2
The sum of m independent Bernoulli random variables, each with success probability π, has a binomial distribution with index m and probability π. Hence, without any loss of generality, the proof proceeds with m_i = 1, so that w_i(β) = ω(η_i(β)) (i = 1, . . . , n).
To prove part i), decompose X as X = QR, where Q is an n × p matrix with orthonormal columns (QᵀQ = I_p, where I_p is the p × p identity matrix) and R is a p × p non-singular matrix. This decomposition is always possible because X has full rank by assumption. Then |XᵀW(β)X| = |QᵀW(β)Q||R|². The functions |XᵀW(β)X| and |QᵀW(β)Q| will have stationary points of the same kind and at the same values of β, because |R|² > 0 does not depend on β.
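The determinant factorization used here is straightforward to verify numerically; a minimal sketch with numpy, using an arbitrary full-rank X and an arbitrary diagonal weight matrix W:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 4
X = rng.standard_normal((n, p))         # full-rank model matrix (with probability 1)
W = np.diag(rng.uniform(0.1, 1.0, n))   # diagonal matrix of working weights

Q, R = np.linalg.qr(X)                  # thin QR: Q is n x p, R is p x p

# |X' W X| = |Q' W Q| |R|^2
lhs = np.linalg.det(X.T @ W @ X)
rhs = np.linalg.det(Q.T @ W @ Q) * np.linalg.det(R) ** 2
assert np.isclose(lhs, rhs)
```

The check holds for any positive weights, since the identity only uses X = QR and the non-singularity of R.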
Proof. Since A ≥ B, elementwise, A = B + C, where C is a diagonal matrix with non-negative entries. Furthermore, XᵀAX, XᵀBX and XᵀCX are positive semidefinite, by the non-negativity of the diagonal elements of A, B and C, respectively. Hence, by Magnus and Neudecker (1999, Chapter 11, Theorem 9), λ_t(XᵀAX) ≥ λ_t(XᵀBX) (t = 1, . . . , p), where λ_t(D) denotes the tth eigenvalue of the matrix D. Since the determinant of a matrix is the product of its eigenvalues, the result follows.
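The eigenvalue comparison can likewise be illustrated numerically; a sketch with numpy, constructing arbitrary non-negative diagonal matrices with A ≥ B elementwise:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 4
X = rng.standard_normal((n, p))
b = rng.uniform(0.0, 0.5, n)        # diagonal of B (non-negative)
a = b + rng.uniform(0.0, 0.5, n)    # diagonal of A, so A >= B elementwise
A, B = np.diag(a), np.diag(b)

eig_A = np.linalg.eigvalsh(X.T @ A @ X)   # eigenvalues in ascending order
eig_B = np.linalg.eigvalsh(X.T @ B @ X)

assert np.all(eig_A >= eig_B - 1e-10)     # elementwise eigenvalue dominance
assert np.linalg.det(X.T @ A @ X) >= np.linalg.det(X.T @ B @ X)
```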

S2.4 Proof of Theorem 3
For c as in Theorem 3, the adjusted responses and totals take the form given in (6). The result then follows for any value of q, because 0 ≤ h ≤ 1 and 0 ≤ π ≤ 1.
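The bound 0 ≤ h ≤ 1 invoked here can be checked numerically, assuming h denotes the diagonal of the usual GLM hat matrix H = W^(1/2) X (XᵀWX)⁻¹ XᵀW^(1/2); a sketch with numpy, with arbitrary probabilities standing in for a fitted model:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 5
X = rng.standard_normal((n, p))
pi = rng.uniform(0.05, 0.95, n)     # fitted probabilities (m_i = 1)
w = pi * (1 - pi)                   # logistic working weights

# hat matrix H = W^(1/2) X (X' W X)^(-1) X' W^(1/2)
WX = np.sqrt(w)[:, None] * X        # W^(1/2) X
H = WX @ np.linalg.solve(WX.T @ WX, WX.T)
h = np.diag(H)                      # leverages

assert np.all((h >= 0) & (h <= 1))  # leverages lie in [0, 1]
assert np.isclose(h.sum(), p)       # and sum to rank(X) = p
```

Both assertions reflect the fact that H is an orthogonal projection of rank p, so its diagonal elements are bounded by 0 and 1 and sum to p.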

S3.1 Details
Algorithm JeffreysMPL (see Algorithm S1) implements the repeated maximum-likelihood fits procedure of Section 4 in the main text, to maximize the penalized log-likelihood l † (β; a) in (4) for any supplied a and link function G(η). A satisfactory starting value b for JeffreysMPL is the maximum likelihood estimate of β, after adding a small positive constant and twice that constant to the actual binomial responses and totals, respectively.
Step 22 of JeffreysMPL can be carried out using readily available maximum-likelihood implementations for binomial-response generalized linear models, such as the glm function in R (R Core Team, 2020) and the various implementations in the Python modules statsmodels (Seabold and Perktold, 2010) and scikit-learn (Pedregosa et al., 2011).
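For the logistic link with a = 1/2, the adjusted responses and totals reduce to Firth's (1993) pseudo-data y + h/2 and m + h, where h are the leverages. The repeated maximum-likelihood-fits idea can then be sketched in numpy, with a basic Newton/IRLS routine standing in for the ML fitter; the function names here are illustrative, not those of Algorithm S1:

```python
import numpy as np

def ml_logistic(y, m, X, beta=None, tol=1e-10, maxit=200):
    """Binomial logistic ML fit by Newton/IRLS; y and m may be
    non-integer, as required for adjusted responses and totals."""
    beta = np.zeros(X.shape[1]) if beta is None else beta.copy()
    for _ in range(maxit):
        pi = 1 / (1 + np.exp(-(X @ beta)))
        w = m * pi * (1 - pi)                              # working weights
        step = np.linalg.solve(X.T @ (w[:, None] * X),
                               X.T @ (y - m * pi))         # Newton step
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def firth_logistic(y, m, X, tol=1e-8, maxit=100):
    """Repeated ML fits with adjusted data (logistic link, a = 1/2)."""
    # crude starting value: ML fit after adding a small constant and
    # twice that constant to the responses and totals, respectively
    beta = ml_logistic(y + 0.01, m + 0.02, X)
    for _ in range(maxit):
        pi = 1 / (1 + np.exp(-(X @ beta)))
        WX = np.sqrt(m * pi * (1 - pi))[:, None] * X
        h = np.diag(WX @ np.linalg.solve(WX.T @ WX, WX.T))  # leverages
        beta_new = ml_logistic(y + h / 2, m + h, X, beta=beta)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

At a fixed point, the inner ML fit solves the Firth-adjusted score equations Σ {y_i − m_i π_i + h_i (1/2 − π_i)} x_i = 0, since the leverages are evaluated at the current estimate. Algorithm S1 handles general links; this sketch covers only the logistic special case.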
[Algorithm S1: pseudo-code for procedure JeffreysMPL(y, m, X, a, G, g, gdash, ML, b, …)]

The estimated variance-covariance matrix of the penalized likelihood estimator can be obtained as (RᵀR)⁻¹, where R is the upper triangular matrix from the QR decomposition of W(β)^(1/2)X at the final iteration of the procedure. That decomposition is a by-product of step 17 of JeffreysMPL.

S3.2 Illustration: evolution of adjusted responses and totals
Table S1 shows the values of the adjusted responses and totals for the first 6 games of Philadelphia 76ers in Example 1 of the main text, at the first 6 iterations of Algorithm JeffreysMPL, when computing the reduced-bias fit shown in Figure 1 (see script nba-1415-case-study.R for code to reproduce Table S1). The starting values (iteration 0) are the maximum likelihood estimates of the ability contrasts after adding 0.01 and 0.02 to the actual responses and totals, respectively.

Table S1: The adjusted responses (top) and totals (bottom) for the first 6 games of Philadelphia 76ers (P76) at the first 6 iterations of Algorithm S1, when computing the reduced-bias fit in Figure 1. Figures are shown in 3 decimal places.

S3.3 Timings for the setting of Sur and Candès (2019)
The R implementation of JeffreysMPL in the supplementary material (see script jeffreys-MPL.R for code) is used here to compute the reduced-bias estimates for a logistic regression model with n = 1000 binary responses and p = 200 covariates, as considered in Figure 2(b) of the supplementary information appendix of Sur and Candès (2019) (see script sur-candes-2019.R for reproducible code). In particular, we construct a 1000 × 200 model matrix X by simulating 200 000 independent random variables from a normal distribution with mean 0 and variance 10⁻³. Then, we simulate 1000 Bernoulli random variables from a logistic regression model with linear predictors Xβ, where β1 = . . . = β25 = 10, β26 = . . . = β50 = −10, and β51 = . . . = β200 = 0.

The R implementation of JeffreysMPL relies on a full maximum-likelihood iteration for step 22, using the R function glm.fit, and takes approximately 2.73 seconds to converge to the reduced-bias estimates of the 200 parameters in 4 decimal places, on a MacBook Pro laptop with a 3.5GHz processor and 16GB of memory. The resulting maximum penalized likelihood estimates are shown in Figure S1 along with the corresponding maximum likelihood estimates.
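The data-generating process described above can be sketched in a few lines of numpy; the R script sur-candes-2019.R remains the reference implementation, and the seed used here is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2019)
n, p = 1000, 200

# model matrix: 200 000 independent N(0, 1e-3) draws
X = rng.normal(0.0, np.sqrt(1e-3), size=(n, p))

# coefficient vector: 25 entries at +10, 25 at -10, the remaining 150 at 0
beta = np.concatenate([np.full(25, 10.0), np.full(25, -10.0), np.zeros(150)])

# Bernoulli responses from the logistic regression model
pi = 1 / (1 + np.exp(-(X @ beta)))
y = rng.binomial(1, pi)
```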