Using LASSO Regression to Estimate the Population-Level Impact of Pneumococcal Conjugate Vaccines

Abstract Pneumococcal conjugate vaccines (PCVs) protect against diseases caused by Streptococcus pneumoniae, such as meningitis, bacteremia, and pneumonia. It is challenging to estimate their population-level impact due to the lack of a perfect control population and the subtleness of signals when the endpoint—such as all-cause pneumonia—is nonspecific. Here we present a new approach for estimating the impact of PCVs: using least absolute shrinkage and selection operator (LASSO) regression to select variables in a synthetic control model to predict the counterfactual outcome for vaccine impact inference. We first used a simulation study based on hospitalization data from Mexico (2000–2013) to test the performance of LASSO and established methods, including the synthetic control model with Bayesian variable selection (SC). We found that LASSO achieved accurate and precise estimation, even in complex simulation scenarios where the association between the outcome and all control variables was noncausal. We then applied LASSO to real-world hospitalization data from Chile (2001–2012), Ecuador (2001–2012), Mexico (2000–2013), and the United States (1996–2005), and found that it yielded estimates of vaccine impact similar to SC. The LASSO method is accurate and easily implementable and can be applied to study the impact of PCVs and other vaccines.


WEB APPENDIX 1 The Input Variables
We simulated the outcome based on a combination of control variables, one seasonal variable, and an offset. First, we used 5 randomly selected control variables to generate the outcome, we repeated this process five times, each time randomly selecting a different set of 5 control variables. Then, we increased the number of causal control variables to 10 and again, we repeated the process five times. Web Table 1 shows the complete list of variables, where highlighted in gray are the control variables that were ever selected to simulate the outcome. In the non-causal framework, we generated the outcome using 3 control variables, and then removed these 3 causal control variables alongside the control variables that belonged to the same chapter under the International Classification of Diseases, Tenth Revision (ICD-10). In Web Table 1, the control variables that were ever selected to simulate the outcome in the non-causal framework are marked with "*" and the control variables that were then removed are marked with " †".

Statistical model -LASSO regression
LASSO is an extension of generalized linear regression that decreases the variance of regression coefficients and the prediction error by adding a term to the log-likelihood to penalize the complexity of the model 1 . This leads to a parsimonious model with a subset of control variables that best predicts the outcome. To estimate the penalty parameter, we first generated a grid of 100 values for the penalty parameter and fitted LASSO regression to the pre-vaccine period data for each value in the grid. Next, we selected the best value for the penalty using either 10-fold cross validation (CV) or Akaike Information Criterion (AIC) 2 . In a 10-fold CV, the prevaccine data period was randomly divided into 10 groups of equal size, with 9 groups forming the training set and 1 group forming the test set. A model was fit on the training set and the minimized mean squared error (MSE) was obtained when tested on the test set. This was repeated 10 times to yield an average MSE. This was repeated 100 times on each value in the grid of penalty parameter and the penalty with the lowest MSE was selected. Using the AIC for the penalty selection, we fitted LASSO regression to the pre-vaccine data period and the penalty with the lowest AIC was selected.
We tested two variants of the LASSO regression model: the first one included all seasonal variables by default (season-forced, SF); the second one treated seasonal variables as control variables and allowed LASSO regression to select from them (season-unforced, SU). The selected model was re-fitted onto the entire pre-vaccine period to predict the counterfactual outcome (̃) during the evaluation periodthat is, the hospitalization counts that would have occurred in the population if PCV had not been introduced, assuming the distribution and associations of the population features captured in the pre-vaccine period data remained unchanged. With the LASSO-predicted counterfactual under the no-vaccine scenario and the observed outcome ( ), we calculated the vaccine impact using equation 1. An IRR less than 1 indicates a reduction in all-cause pneumonia hospitalization due to the vaccination program.

=̃= ∑ ∈ ∑∈
(1) where is the set of time points during the evaluation period.

Statistical modelother methods
We compared LASSO regression to three established methods in the field of vaccine impact estimation. The key features of the implementation of LASSO regression and all comparator methods are summarized in Web Table 2. The three comparator methods are described below: 1. Interrupted Time Series (ITS) is a method that includes an indicator variable for vaccination, secular trends before and after PCV introduction, and background seasonality 3-6 . Following the procedures where ̂ is the fitted outcome and ̃ is the counterfactual outcome during the evaluation period. We used the fitted model to predict the counterfactual outcome ̃ during the evaluation period in the absence of vaccination (i.e., indicator variable set to "0" for all time points).

2.
In accordance with the synthetic control (SC) method 7,8 , time series of different control variables were weighted according to their fit to the outcome time series in the pre-vaccine period using Bayesian variable selection. The weighted time series were jointly used to predict the counterfactual outcome ̃.
The model was adjusted for background seasonality using 11 monthly indicator variables and the logarithm of non-respiratory hospitalization was included as a covariate. The vaccine impact was calculated using equation 1.

=̂= ∑∈ ∑∈
(2) method, a smoothed trend for each of the control variable's time-series was extracted with seasonal-trend decomposition using locally-estimated scatterplot smoothing (LOESS) 9 . A PCA was performed on the extracted smoothed trends, and the first principal component was selected as the composite trend, which was used as a covariate in a regression model to predict the counterfactual outcome ̃. The vaccine impact was calculated using equation 1.

WEB APPENDIX 3 Outcome Simulation and Performance Assessment
Outcome simulation Web Figure 1 shows the procedure for outcome simulation.
Web Figure 1. Flowchart illustrating the procedure of outcome simulation.

Performance assessment with simulated data
To assess the performance of all methods to estimate vaccine impact, we designed a simulation study.
We generated the outcome, monthly pneumonia hospitalization ( ), based on a combination of n (n = {5, 10}) control variables ( 1 , 2 , … , ) randomly selected from the list of control variables available in the Mexico data set 7 . We then incorporated an intercept ( ), the logarithm of non-respiratory hospitalization as the offset  Table 1) together with their associated control variables (marked with " †" in Web Table 1) were then removed from the list of control variables for model testing.

WEB APPENDIX 4 Using Maximum Entropy Bootstrap to Obtain 95% Confidence Intervals for Lasso Estimates
The ways to obtain different 95% uncertainty intervals (UI) for different methods in the application to real-world data are summarized in Web Table 2.
Web Table 2. A summary of how 95% uncertainty intervals of estimates were obtained for different methods in the application to real-world data.  To calculate approximate 95% CI of the IRR for LASSO, we explored a non-parametric bootstrap approach. Specifically, we used the maximum entropy algorithm (implemented in the R package "meboot" 10 ) to construct bootstrap replicates of the log-transformed time series for the outcome, the offset, and every control variable. We then converted the bootstrapped time series back to the natural scale, and assembled them to form bootstrap replicates of the original data set. We then fit LASSO-SF and estimated the IRR for each bootstrap data set, and calculated the 95% CI based on the 2.5 th and 97.5 th percentiles of the resulting IRR distribution. The results, presented in Web Figure 10, were based on a sample of 100 bootstrapped data sets.

Web Figure 4. Seasonal and control variables selected by LASSO-SU for simulation 1 to 100 in simulation sets
A to E (5 causal control variables).
Each panel represents a scenario with outcome simulated with a different set of five causal control variables Web Figure 10. Age-group-specific incidence rate ratios (IRR) for four countries, estimated by LASSO-SF (with different uncertainty intervals) and SC.
Each panel shows the age-group-specific IRR for all-cause pneumonia in a population whose infants were vaccinated with pneumococcal conjugate vaccines (PCV) compared to a counterfactual population in which PCV was never introduced, estimated by LASSO-SF (dark green circle: 95% prediction interval from Poisson distribution; light green square: 95% confidence interval from maximum entropy bootstrapping 10

WEB APPENDIX 5 Sensitivity Test After Removing Bronchitis and Bronchiolitis
As a sensitivity test, we removed "bronchitis and bronchiolitis" from the list of control variables that LASSO regression and SC could choose from and re-analyzed the pneumonia hospitalization data from Chile, Ecuador, Mexico and the US. In Chile, the results remained similar. In Ecuador, the reduction in all-cause pneumonia hospitalization detected in the main analysis ( Figure 4, panel B) attenuated in the youngest age group, but remained statistically significant in age groups 18 to 64 years (Web Figure 11, panel B). In Mexico, the reduction in the youngest two age groups detected by SC in the main analysis ( Figure 4, panel C) was also attenuated and was no longer statistically significant (Web Figure 11, panel C). In the US, only a marginal reduction was detected by LASSO-SF and LASSO-SU in the age group 40 to 64 years and not in older adults before removing "bronchitis and bronchiolitis" (Figure 4, panel D), but after doing so, a more pronounced reduction was observed in all the age groups from 18 to 79 years (Web Figure 11, panel D).
Web Figure 11. Age-group-specific incidence risk ratios (IRR) for all-cause pneumonia in four countries after removing "bronchitis and bronchiolitis", estimated by two LASSO methods and SC.
Each panel shows the age-group-specific IRR for all-cause pneumonia in a population whose infants were vaccinated with pneumococcal conjugate vaccines (PCV) compared to a counterfactual population in which PCV was never introduced, estimated by LASSO-SF (green), LASSO-SU (orange), and SC (blue). The four countries are A) Chile; B) Ecuador; C) Mexico; and D) the US. The 95% prediction intervals (PI) of estimates by LASSO-SF and LASSO-SU are shown by the error bars joined at a circle; the 95% credible intervals (CrI) of estimates by SC are shown by the error bars joined at a square. 95% PI and 95% CrI are different uncertainty measures and are thus not directly comparable.