On shrinkage and model extrapolation in the evaluation of clinical center performance

We consider statistical methods for benchmarking clinical centers based on a dichotomous outcome indicator. Borrowing ideas from the causal inference literature, we aim to reveal how the entire study population would have fared under the current care level of each center. To this end, we evaluate direct standardization based on fixed versus random center effects outcome models that incorporate patient-specific baseline covariates to adjust for differential case-mix. We explore fixed effects (FE) regression with Firth correction and normal mixed effects (ME) regression to maintain convergence in the presence of very small centers. Moreover, we study doubly robust FE regression to avoid outcome model extrapolation. Simulation studies show that shrinkage following standard ME modeling can result in substantial power loss relative to the considered alternatives, especially for small centers. Results are consistent with findings in the analysis of 30-day mortality risk following acute stroke across 90 centers in the Swedish Stroke Register.

A simple interpretation of equation (1.2) is that it specifies an amount of bias to be introduced into the score function in order to remove the leading term in the asymptotic bias ofθ. The introduced bias function −I(θ)b(θ) is O(1) as n → ∞, an order of magnitude smaller in probability than u(θ) itself (Firth, 1992). The Firth approach eliminates the O(n −1 ) bias of the maximum likelihood estimate by adjusting the score function, rather than the estimate itself. An advantage is that the definition of the Firth corrected estimates does not depend on the existence of the non-corrected estimates, of which some components may be infinite in binomial problems (Firth, 1992).
Simulation studies in different settings for logistic regression in Heinze (2006) show promising results for the Firth method in case of separated or nearly separated data. Based on simulations in Kessels and others (2013), the Firth corrected multinomial logistic regression outperforms ordinary maximum likelihood as the Firth corrected method converges in case of data separation, it removes bias of the ML estimates and reduces variance.

Fixed effects logistic regression: Asymptotic variance
To simplify notation we denote the vector (L, I(C = 1), . . . , I(C = m)) by X. To assess the uncertainty on the potential full population riskÊ{Y (c)}, note that the MLEθ = (β ,ψ ) of θ in satisfies the following set of estimating equations To obtain the asymptotic variance ofÊ{Y (c)}, we first perform a Taylor expansion of Second, we perform a Taylor expansion of (1.4) around the true parameter value θ and obtain Finally, we combine (1.7) and (1.6) and estimate the variance ofÊ{Y (c)} via Confidence interval calculations assume normality of logitÊ{Y (c)}, so in our calculations we have transformed these estimators to the logit scale via the delta method. This allows us to obtain confidence intervals with boundaries inside the [0, 1] interval. The above expressions continue to hold when the parameters θ are estimated according to Firth's modified estimating equations (1.2). This is because bias reduction affects the covariance matrix of the estimates only in the O(n −2 ) and higher order terms (Firth, 1993). Note that standard errors of the Firth corrected estimates may be slightly smaller, since they are evaluated in a different estimate for θ (Firth, 1992).

Doubly robust PS method: Asymptotic variance
We denote the outcome model parameters by θ = (β , ψ ) and the propensity score parameters by ρ = (γ , δ ) . We estimate E{Y (c)} again as in (1.5), but now using a weighted regression to fit the FE model (1.3), with weights equal to one over the propensity score of the observed center (1.9) The MLEρ of ρ satisfies the following set of estimating equations (1.10) First, we perform a Taylor expansion of (1.10) around ρ and obtain Analogously, we perform a Taylor expansion of the outcome model estimating equations then we define the estimator for the potential full population risk as follows: (1.16) Combining (1.11) and (1.13), the variance of the estimated potential full population riskÊ{Y (c)} on the risk scale is estimated via (1.17) Again, the above expressions can be transformed to the logit scale and continue to hold when the parameters θ are estimated according to Firth's modified estimating equations (1.2).

Additional results
An illustration of model extrapolation considering two centers with strongly differential case-mix is given in Figure 1.

Simulation study application: Belgian Colorectal Cancer register
The ME models were fitted in R using the rjags package (Plummer, 2003) where we assigned the following independent non-informative hyperpriors: Classical normal ME model Clustered normal ME model where K = 3 denotes the number of center clusters and is the length of the vector of patient characteristics. The Firth corrected FE model was fitted in R using the brglm package (Kosmidis, 2011). The analysis of 5 simulated datasets took on average 3 hours for the classical normal ME method and 15 hours for the clustered analog, as compared to at most 3 minutes for the FE and doubly robust methods. First we determine 'true' center classifications based on a large sample of simulated data for 10 6 patients in 63 centers. Next, we generate 1000 samples of each 2355 patients under the same models. In this data-generating process, we first generated independent patient characteristics age, gender and initial disease status cStage based on descriptive statistics in Goetghebeur and others (2011), such that age was 67 years on average (SD 9 years), that 61% of the patients were male and that 13%, 18%, 51%, 14% respectively had cStage I to IV; 4% had missing cStage. Next, we use the PS and outcome models that were fitted on the observed data as data-generating models. So, the patients' center choice and outcome are generated, mimicking the center-specific distribution of patients in the register. Then, based on the large random sample of 10 6 patients we obtain precise estimates of E(Y ) and E{Y (c)} as sample averages. Following equations (2.2) and (2.3) in the main text, this results in the true center classification. For each center this true classification is then compared to the classification based on a sample of n = 2355 patients and using one of the regression methods to estimateÊ(Y ) andÊ{Y (c)}. In the analysis, we excluded centers treating less than 10 patients, resulting in a median of 33 registered patients per center and an average of 53 centers.
About half of the centers are labelled as having low/high risk based on the initial large sample (see main text Figure 1, ignoring the simulation results for now) although for many centers the 'true' potential full population risk is close to the decision limits, especially on the low side. Note that results depend highly on the values of the clinical (λ) and statistical (k) tolerance levels. When early detection is most important, such as for confidential feedback to clinical centers, large power is preferred over small type I errors. In contrast, when publishing results openly, mistakenly classifying a center as high risk may have unduly severe implications for that center (decrease in number of patients, lower funding) compared to mistakenly failing to recognize a center as potentially high risk.
Tables 1 and 2 provide extra results on simulations modelled on the Belgian Colorectal Cancer register when the Firth correction is not applied and when the sample size is doubled. We performed additional simulations to investigate the effect of model misspecification. We generate data under an outcome regression model with quadratic centered age effect and a PS model with the absolute value of centered age effect. Misspecification is introduced in the fitted outcome and/or PS model by estimating a linear centered age effect instead. The original population average risk E(Y ) = 23% based on Goetghebeur and others (2011) is retained. Center effects were multiplied by 2 to obtain a similar spread of E{Y (c)} values compared to the original simulations. The original age coefficients in the PS models are multiplied by 2 to obtain more extreme differences in age distribution and thus stronger extrapolation across centers.
Descriptives under the new and original data generating models for a random sample of n = 2355 patients are respectively given in Figures 2 and 3, where centers with size smaller than 10 are indicated in red. The quadratic age effect is quite pronounced while the corresponding estimated linear age effect is approximately zero (See bottom of figure 2). Comparing the age distribution under both simulation scenarios, we see more overlap in the age distribution across centers under the new data generating PS model.
Simulation results in table 3 show that the Firth corrected fixed effects and doubly robust PS methods have similar performance when the regression models are correctly specified. We summarize the bias on E{Y (c)} over all centers by taking the difference of the mean of the squared bias per center and the mean of the empirical variance of the bias over simulations and next taking the square root, that is where S is the number of simulations and E c is the average over all centers c = 1, . . . , m. Note that we correct for the fact that the squared bias also depends on the variance of the bias. In general the coverage of the 95% CI for E{Y (c)} is smaller than 95%, but this may be due to extrapolation across centers, which is also reflected in the bias. For both methods the power to detect high mortality risk centers decreases when the outcome model is misspecified, but the Type I errors do not decrease accordingly. Here, the doubly robust PS method outperforms the fixed effects method in terms of coverage of the 95% CI for E{Y (c)}, percentage of correct classification and bias on E{Y (c)}, although it does not gain power to detect high mortality risk centers. Misspecification of the PS model does not seem to have a large impact on the results in our simulation study, although this may be different for other data generating models. When both models are misspecified, the doubly robust PS method performs worse than the fixed effects method with misspecified outcome model.

Analysis of the Swedish Stroke Register
We give descriptive statistics for both the original and reduced dataset in Table 4. To assess the extent of differential patient case-mix across centers, we perform a univariate analysis on the complete cases (Figures 6 to 8). In general case-mix does not differ much across centers, but we detect considerable variation with respect to treatment for high blood pressure (Figure 4), education level ( Figure 5) and time of admission ( Figure 6). The minimum number of patients for whom we have records per hospital is 104 (median 1383.5 and max 7260) for the full data and 49 (median 957.5 and max 4341) for the complete cases. Some centers have no records for several years because they no longer treat stroke patients, have been merged with another center, or only started treating stroke patients during the study. There are no missing values for the outcome and the 30-day mortality risk is lower for the complete cases (8%) than for the full data (12%), which is due to the complete cases being generally more healthy as can be seen from the patient characteristics measured on admission (see Table 4). Figure 8 shows lower centerspecific mortality risks for the complete cases compared to all cases, so results will probably be too optimistic for all centers. To describe the effect of risk factors on outcome, we report the estimated odds ratios based on fitting a logistic normal ME model (see Table 5). The analysis is based on a logistic model for the outcome in which we allowed for a piecewise linear spline effect of standardized age (mean age 75 years, SD = 12 years) with a knot at age 80. Besides the main effects listed in Table 5, we include three interactions between patient-specific covariates which are of primary interest: education level by age exceeding 80 to account for a different effect of education level on 30-day mortality for elderly or young patients, institution by living alone to account for a different effect of living alone for patients in an institution or at home, and living alone by gender to account for a different effect of living alone for men and women. Note that we do not include interactions with center in the model. The time-dependent PS model includes the same risk factors, with exception of the interactions, and it does not allow for a piecewise linear effect of age to avoid model fitting problems. Results for the analysis of the Swedish Stroke Register when applying the Firth correction to the outcome model for the FE method and the doubly robust method are given in Figures 9 to 11.

R-code
Software in the form of R-code is available at https://github.ugent.be/mmvarewy/Biostatistics2014. Firth, D. (1992). Bias reduction, the Jeffreys prior and GLIM. In: Fahrmeir, Ed. L., Francis, B., Gilchrist, R. and Tutz, G.  Table 2. Center classification (Low/High risk or Accepted) based on 1000 simulations, except for the Bayesian normal ME analyses which were evaluated on 100 simulations, with double sample size. The true percentage of low and high risk centers is respectively 30% and 22%. Fig. 1. Artificial example illustrating extrapolation under outcome regression with main effects for age and center. It visualizes outcome data for two centers with strongly differential case-mix and such that smaller outcomes are seen in center 1 (black dots) than in center 2 (blue triangles) in each patient subgroup. Direct standardization under seemingly well fitting linear regression models involves strong extrapolation (dashed line) of the outcome distribution in the given center to what it is expected to achieve on the full patient population under study. In truth, the observed data carry no information about E{Y (2)} because no patients in center 2 have age below 60.      Table 5. Model parameters for a descriptive logistic normal ME model with outcome 30-day mortality based on the Swedish Stroke Register, indicating (*) if measured prior to stroke and (vs reference category) for categorical covariates. SE = standard error, OR = odds ratio = exp(β), 95% CI for OR = 95% confidence interval for odds ratio. Fig. 10. The observed center-specific risk versus the potential full population risk for all 90 centers of Riks-Stroke for the FE method, the doubly robust PS method and the classical normal ME method, distinguishing between low (blue filled circles) and high (red triangles) mortality risk (with Firth correction).