Abstract

We develop a proportional incidence model that estimates vaccine effectiveness (VE) at the population level using conditional likelihood for aggregated data. Our model assumes that the population counts of clinical outcomes for an infectious disease arise from a superposition of Poisson processes with different vaccination statuses. The intensity function in the model is calculated as the product of per capita incidence rate and the at-risk population size, both of which are time-dependent. We formulate a log-linear regression model with respect to the relative risk, defined as the ratio between the per capita incidence rates of vaccinated and unvaccinated individuals. In the regression analysis, we treat the baseline incidence rate as a nuisance parameter, similar to the Cox proportional hazard model in survival analysis. We then apply the proposed models and methods to age-stratified weekly counts of COVID-19–related hospital and ICU admissions among adults in Ontario, Canada. The data spanned from 2021 to February 2022, encompassing the Omicron era and the rollout of booster vaccine doses. We also discuss the limitations and confounding effects while advocating for the necessity of more comprehensive and up-to-date individual-level data that document the clinical outcomes and measure potential confounders.

1 Introduction

During a public health emergency such as the COVID-19 pandemic, when vaccination programs are dynamic (e.g., expanding coverage and administering additional doses), public health policy makers require timely information on vaccine effectiveness (VE) in light of the current vaccine coverage. They may also want to assess counterfactual scenarios, such as what may have happened in terms of hospitalizations and other health outcomes in the absence of vaccination programs (Ogden et al., 2022). These assessments must be ongoing and conducted quickly based on available surveillance data. To meet these needs, we propose statistical methods that leverage aggregated count data from various public sources, such as routine case monitoring systems, vaccination registries, and demographic statistics.

Typically, VE studies are based on observational cohorts. Both prospective and retrospective longitudinal cohorts utilize detailed individual-level event history data. These data enable the estimation of risk functions by vaccination status, which define the relative risk (RR), and thus VE as one minus the RR, either as discrete probabilities or person-time incidence rates. However, since such data are often costly to collect, many studies rely on extensive linkage of multiple large administrative databases to create pseudo-cohorts.

A systematic review of COVID-19 VE (Teerawattananon et al., 2022) analyzed 42 peer-reviewed studies covering the period up to late 2021. All of these studies involved extensive data linkages using administrative data sets. For example, Lin et al. (2022) studied VE for three vaccine products (Pfizer, Moderna, and Johnson & Johnson) in an observational cohort assembled from the linkage of multiple large administrative databases in North Carolina that captured important information on vaccination events and measured waning of VE. Andrews et al. (2022) compared the effectiveness of two vaccines (Pfizer and AstraZeneca) in England against symptomatic COVID-19, hospitalization, and death. In Canada, Nasreen et al. (2022) estimated the effectiveness of mRNA vaccines (Pfizer, Moderna) and AstraZeneca against COVID-19 hospitalization and deaths using multiprovincial linked databases. Meanwhile, Buchan et al. (2021) studied the effectiveness of mixed vaccine schedules (Pfizer, Moderna, and AstraZeneca) against severe outcomes (hospitalization or death), stratified by circulating variant (Delta or Omicron) and time since last dose in Ontario, Canada, up to December 26, 2021. These linked databases allowed for the creation of observational cohorts with sufficient detail, enabling the use of test-negative case-control designs and logistic regression models, as in Andrews et al. (2022), Nasreen et al. (2022), and Buchan et al. (2021), or Cox regression models, as in Lin et al. (2022).

Individual-level data linkages can be time-consuming and raise privacy concerns. In our study, we use aggregated counts from publicly available administrative registries without the need for data linkage. These counts include COVID-19 hospital and intensive care unit (ICU) admissions, aggregated into small time intervals and stratified by vaccination status, age, and other covariates. They are as timely as the reporting of ongoing disease surveillance and monitoring. To our knowledge, no study has been conducted or published on the methods for assessing VE based on such aggregated registry data. Therefore, we have developed methods and models that differ from the logistic and Cox regressions used in the studies mentioned above.

Assuming that the occurrence of events arise from a Poisson process, we formulated a proportional incidence rate model for aggregated marginal counts. We then applied these methods to analyze weekly incidence counts of hospital and ICU admissions in Ontario between January 4, 2021, and February 20, 2022, in conjunction with information on at-risk populations categorized by age and vaccination status. A thorough discussion of confounding factors and limitations is also provided. Although the aggregate data do not contain the detailed information available in individual-level longitudinal data, such as individual vaccination status, our findings are consistent with those of the cited studies and corroborate evidence from related immunological research.

The organization of this paper is as follows: Section 2 outlines the development of our models and methods. In Section 3, we apply our statistical methods to study the efficacy of COVID-19 vaccines. Finally, we conclude with a discussion of our statistical methods and results in Section 4.

2 Methods

2.1 Specifying the RR Based on Aggregate Data

Suppose that severe outcomes resulting from SARS-CoV-2 infections arise from a counting process formula with an instantaneous intensity formula, where formula is the incidence rate per person, and formula is the size of the at-risk population consisting of individuals who are susceptible to infection at time t and may progress to being admitted to the hospital or ICU. The at-risk population is stratified by vaccination status.

In simple settings with only two vaccination groups, the unvaccinated (formula) and the vaccinated (formula), we denote the size of at-risk population in each group by formula for formula. We consider formula as a superposition of two counting processes, each with intensity functions formula and formula, respectively. The instantaneous RR is the ratio formula, and the VE is defined as formula.

We consider data that are only available at the aggregate level, with time divided into intervals formula, where k ranges from 1 to m. When the length of the intervals is short, it is reasonable to assume that the at-risk population sizes formula and formula remain constant during formula so that formula, for formula. The number of events during formula is represented by formula with mean values formula. These mean values are given by formula and formula, respectively, where formula and formula.

In the application of this paper, we use weekly data. Without loss of generality, we designate interval formula as week k and refer to formula as the baseline per person weekly incidence rate. The key parameter of interest is the weekly aggregated RR

1

A crude estimate of formula is given by the ratio

2

where formula is the observed realization of the random variable formula. This corresponds to the crude estimates formula and formula, with formula and formula. However, in some instances, the crude estimates may be undefined due to the lack of baseline count for a week (i.e., when formula). Additionally, when vaccine coverage or incidence is low, both formula and the crude estimates formula and formula tend to fluctuate considerably from week to week. Given that the infectious disease incidence rates are driven by smooth functions reflecting a complex dynamic system, it is desirable to ensure some smoothness in the estimates for formula and formula. Bearing this in mind, we propose the following likelihood-based methods for estimating the RR.

2.2 A Direct Likelihood-based Approach to RR Estimation

We assume that the counting process arises from a Poisson process so that the weekly data formula, formula are Poisson random variables with formula and formula. The RRs formula are specified by a vector of parameters β and denoted by formula for formula. The expected values are given by formula and formula. The log-likelihood function can be arranged as

3

through data partitioning as formula.

The direct method for estimating RR does not require modeling the baseline incidence formula, because RRs do not depend on it. In (3), the likelihood function is partitioned in such as way that

  • (i)
    the first term is a conditional likelihood based on the conditional distribution of formula, which follows a binomial distribution: formula. The corresponding likelihood function is
    4
  • (ii)

    The second term is the log-likelihood based on the Poisson distribution of formula, with a mean value of formula, which cannot differentiate formula from β.

Following the principles outlined in Kalbfleisch and Sprott (1970) and Sprott (1975), the first term is conditionally sufficient for β when knowledge of formula is absent. Thus, this approach eliminates the baseline incidence formula as a nuisance parameter. The direct estimation of formula can be achieved by maximizing the likelihood (4), which is equivalent to solving the unbiased estimating equations based on the score functions

5

where formula and formula are the mean and variance of the conditional distribution of formula, respectively. Solving (5) for formula, yields asymptotically unbiased estimates for β (Godambe, 1980; Godambe & Thompson, 1974). These estimates are also optimal as they possess the smallest asymptotic variances (Godambe, 1980).

Upon estimating β, denoted by formula, the baseline incidence rate can be estimated by

6

so that

7

The incidence rates for the vaccinated group are estimated by formula. The estimate formula in (6) is referred to as a semiparametric estimate because, although it does not rely on the parameterization of the incidence rate function, it depends on the parametric RR estimate formula. In contrast to the crude estimate formula that could be undefined if formula for some k, the formula are defined for all formula. Although data formula from the vaccinated group do not provide information about formula without knowledge of the RR, formula in (6) utilizes formula by assuming that the true RRs formula are equal to their estimated values formula.

The population incidence in the absence of vaccination is estimated by formula. The expressions in (7) enable the validation of model predictions for incidence data. Regression residuals for both the vaccinated and unvaccinated groups can be easily obtained (see Web Appendix A) and used to check model adequacy.

2.3 A Joint Likelihood Approach for Estimating the RR and Baseline Incidence

Although the primary objective is to estimate the RR formula, joint estimation of formula and the baseline incidence rate formula is useful, as when multiplied by the total population size, formula provides important counterfactual information, such as what would have happened in a population without vaccination. We specify the baseline incidence rate for the unvaccinated population using a vector of parameters θ, represented as formula. By partitioning the data into formula, the full log-likelihood function (3) can be rearranged as

8

The joint estimation of formula can be achieved by maximizing the full likelihood, as expressed in (8). It is important to note that the direct estimate formula from (4) and the semiparametric estimate formula found in (6) can be viewed as maximum likelihood estimates (MLEs) in relation to (8) if a saturated model formulaformula is used. However, formula fluctuates from week to week, similar to the crude incidence estimates formula, and is only defined up to week m without the capability to extrapolate into short-term forecasts.

Public health researchers often prefer smooth baseline incidence rate estimates that can be extrapolated into short-term forecasts. This requires model specification and parameterization of formula. If formula is correctly specified, maximizing formula will yield efficient estimates. However, formula is governed by a complex dynamic system, making it challenging to capture all aspects of the data-generating process. Furthermore, the computational demands are prohibitive.

We propose a two-step approach. In this approach, we model formula as a smooth function of k by using thin plate regression splines as described by Wood (2003). Our two-step approach is based on the likelihood arrangement in Equation (8), and we argue that data from the unvaccinated group formula is marginally sufficient for formula, because in (8):

  • (i)

    the distributions of formula, for any specified θ, are jointly ancillary for β;

  • (ii)

    in the absence of knowledge regarding β, utilizing data solely from the unvaccinated group does not result in the loss of any available information about θ.

These arguments are also following the principles outlined in Kalbfleisch and Sprott (1970) and Sprott (1975).

  • The first step: Assuming Poisson counts formula, the first term in (8) represents the marginal likelihood
    9

    Data from the unvaccinated population formula are used to estimate the baseline incidence rate formula. We denote the estimated baseline incidence rate as formula and refer to it as the penalized spline function in accordance with our model specification.

  • The second step: We insert formula into the full log-likelihood function formula (8) as fully specified. The likelihood function is then reduced to that for β only and is based on data from the vaccinated group formula. The log-likelihood takes the following form:
    10

    which leads to the estimation of formula by maximization.

The distributions of formula and formula may exhibit extra-Poisson variation. In such cases, the Poisson log-likelihoods formula (9) and formula (10) can be substituted with those derived from negative-binomial distributions Lawless, J.F. (1987). A detailed explanation of fitting thin plate regression spline models to estimate the baseline incidence rate in the first step is provided in Mullah and Yan (2022) and Wood (2004). The key points are summarized in Web Appendix B.

Concerning bias, the estimation of formula in the second step crucially depends on the model specification of formula in the first step. Misspecification of formula leads to biased estimates and the amount of bias depends on the goodness-of-fit of the model formula to data.

Regarding variance estimation, we recommend against deriving it naively from (10), because simply applying score functions and second-order derivatives to (10) would result in an underestimation of the standard errors (SEs). Meanwhile, the estimate formula, ascertained through the penalized spline method, brings its own uncertainties, necessitating estimation through bootstrapping. Therefore, the bootstrapping method becomes essential for the SE estimation of the RR.

2.4 Some Further Comments about the Two Approaches

Both approaches include the crude estimate (2) as a special case. In the direct approach, if a saturated model formula is employed as a piecewise constant function defined for each week k, the conditional likelihood (4) is maximized when formula given by (2). In the joint likelihood approach, the saturated model formulaformula concurrently with formula also returns the crude estimation of formula.

If the primary objective is to estimate the RR, we recommend using the direct approach. This method proves effective by utilizing both data sets, namely formula and formula, while also accounting for the inherent randomness present in formula. Moreover, it provides an appropriate estimation of the SE.

The thin plate regression splines, as implemented within the two-step approach, offer sufficient flexibility and potential for calibration. Our simulations (detailed in Web Appendix C) demonstrate that well-calibrated penalized splines yield estimated values of β that are very close to those obtained through the direct approach. Both estimates approximate the true parameter β very well. Furthermore, the confidence intervals (CIs) estimated for β in the two-step approach, using Wald-based bootstrapping, align well with those derived from the direct approach. In contrast, when the two-step approach employs naively estimated model-based SEs to obtain CIs for β, the resulting intervals are relatively narrower, leading to lower coverage probabilities.

Therefore, if the goal is to simultaneously estimate both the baseline incidence and the RR, we suggest calibrating the spline functions in a way that ensures the estimated β from the two-step approach aligns well with that obtained from the direct approach. To account for uncertainties associated with the estimated baseline incidence, bootstrapping should be used. Similarly, the uncertainties around the estimated RR can be addressed based on the direct method, with support from bootstrapping.

2.5 Regression Analysis

To assess VE among various groups based on factors like age, preexisting health conditions, or distinct time frames, we consider a covariate vector formula that could be time-dependent. The RR formula is formulated using a log-linear model

11

Under the generalized linear model (11), the conditional likelihood (4) in the direct approach takes the form

12

whereas the likelihood (10) in the second step of the two-step approach becomes

13

The MLEs of β corresponding to the direct method and two-step approach can be obtained by directly maximizing the likelihood functions (12) and (13), respectively, using a nonlinear optimization tool (e.g., optim or nlm in R Venebles, W.N. & Ripley, B.D. (2002)). Alternatively, MLEs can be obtained by iteratively solving the estimating equations provided in Web Appendix A. The variances of these estimates are derived using the Fisher information matrix.

Upon estimating the parameters, we can compute the predicted values for the direct estimation approach based on (7) as

Regression residuals for the direct estimation method are summarized in Web Appendix A.

2.6 Extension to Multiple Vaccination Statuses

Our main objective here is to assess the comparative risks associated with different vaccination statuses. We will focus solely on the direct method. We consider two distinct vaccination statuses labelled as formula and formula. Each status is associated with a respective response, formula and formula, compared against a baseline represented by formula. The baseline signifies the unvaccinated group, and their corresponding response is denoted as formula.

In the following Section 3.2, we incorporate the booster vaccine into this framework. We classify recipients of a third vaccine dose as a separate vaccination status, in contrast to those who have been fully vaccinated with two doses. This adjustment impacts the associated at-risk population sizes, leading to revised values of formula, and formula, respectively.

Let formula be the count of event occurrences within time interval formula for individuals who received two vaccine doses (from a population of size formula). The RR against the unvaccinated is denoted as formula. Similarly, let formula be the count of event occurrences within the same time interval formula for those who received a third (booster) dose (from a population of size formula).

The addition of a third dose to those who have already received two doses has a multiplicative effect formula, rendering the RR against the unvaccinated as formula. If formula is statistically significantly less than 1, it implies that the third dose further reduces the RR among those vaccinated with two doses compared to the unvaccinated.

In the regression analysis, covariates z may have distinct impacts on formula and formula. We can consider a pair of log-linear models

14

Given the marginal incidence numbers formula, the conditional distributions are multinomial with expectations

The conditional likelihood (12) is extended to a function of formula as

15

With MLE formula, the fitted baseline incidence rate is given by

Residual estimates are given in Web Appendix A.

3 Applications

We obtained daily counts of vaccinated and unvaccinated individuals, as well as daily data on new hospital and ICU admissions stratified by age and vaccine status from the Ontario Case and Contact Management System (Government of Ontario, 2021; Public Health of Ontario, 2022) between January 2021 and February 2022. To determine the at-risk populations, we used age-stratified population statistics for Ontario from Q2 2021, sourced from Statistics Canada (Statistics Canada, 2021).

We aggregated daily counts into weekly totals, starting with Week 1 on January 4, 2021, and ending with Week 52 on January 2, 2022. To make the time-series easier to present, we continued the labelling of weeks in 2022 from the previous year, meaning that Week 53 was the first week of 2022. Our analysis was truncated by the end of Week 59, which was on Sunday, February 20, 2022.

The first part of this section demonstrates the performance of the statistical methods by analyzing a subset of COVID-19 hospital admission data for two 10-year age cohorts: 30–39 and 70–79 years, assuming that formula is piecewise constant. The second part focuses on applying the statistical analysis to the entire adult population in Ontario, considering both hospital and ICU admissions as endpoint events. These analyses are stratified by age cohorts, with formula modeled as a linear function of time and other covariates, including the additional time effect during the Omicron era and the effect of the booster vaccine.

We do not differentiate among vaccine types or brands. We use the term “at-risk populations” to convey that individuals with varying vaccination statuses are truly susceptible each week. Unfortunately, we cannot account for those with immunity due to recent infections. A more comprehensive discussion on potential bias will be presented in Section 3.3.

3.1 Performance of the Statistical Methods Based on Analysis of Data in Two Age Cohorts

We study two age cohorts, 30–39 years and 70–79 years, as well as two vaccination groups: vaccinated and unvaccinated. Vaccinated individuals are defined as those who have received at least two doses of COVID-19 vaccine. Data for individuals with only one dose are not analyzed, because this group was very small during the latter half of 2021, accounting for less than 5% of the whole population by Week 30 and 1% by Week 59. Of the 92% who received a minimum of one dose, 98.7% received their second or third dose by the end of the study period.

The RR formula is modeled as a piecewise constant function across three distinct periods:

where Θ0 is the period from the beginning of 2021 until the last week when the two-dose coverage is less than 50% in a given age cohort; Θ1 spans from the first week when two-dose coverage reaches 50% or more until Week 49; and Θ2 covers the period from Week 50 of 2021 to February 20, 2022 (Week 59), which is the end of the time series for this analysis. For the 30–39 age group, Θ1 starts at Week 29, while for the 70–79 age group, Θ1 starts at Week 25 due to age-based vaccination prioritization in Ontario. We compare results using the two estimation approaches.

The Direct Estimation

The RRs β0, β1, and β2 are estimated by maximizing the conditional likelihood (4), which is separately defined for each of formula and formula as

The 95% CIs for formula can be directly obtained from the conditional likelihood using the likelihood ratio statistics (Kalbfleisch, 1985; Sprott, 2000). Based on the MLEs formula, the incidence rates for unvaccinated and vaccinated populations are fitted by

16

The Two-step Approach

Assuming the count data for all 59 weeks among unvaccinated individuals follow negative binomial distributions, a penalized spline estimate formula is obtained in the first step. In the second step, the RRs formula, formula, are estimated for three distinct periods by maximizing (10), which has an explicit solution given by

We use the Wald-based bootstrapping procedure to obtain CIs. The incidence rate in the vaccinated group is calculated as formula; formula, formula.

Comparison

The results from both methods are presented in Table 1. The point estimates by the two methods are nearly identical for both age groups and across the three periods. The CIs from the two approaches are in agreement in most cases. However, some discrepancies are expected because the direct approach captures the randomness in the data formula from the unvaccinated group in the conditional likelihood (4), whereas the two step approach does not involve data formula in its “likelihood” (10), and the randomness of these data is reflected in the uncertainty of the estimate formula via bootstrapping.

TABLE 1

Numerical comparisons of the two estimation approaches. Numbers in brackets are estimated 95% confidence intervals. The intervals correspond to the direct approach are based on the likelihood ratio statistics derived from the conditional likelihood (4). Those correspond to the two-step approach are calculated based on the Wald-based bootstrap standard errors.

Direct estimationTwo-step approach
30–39 years
Θ1: Week 1–28formula (0.016,0.127)formula (0.001,0.107)
Θ2: Week 29–49formula (0.013,0.031)formula (0.013,0.030)
Θ3: Week 50–59formula (0.179,0.275)formula (0.190,0.256)
70–79 years
Θ1: Week 1–24formula (0.026,0.084)formula (0.021,0.081)
Θ2: Week 25–49formula (0.057,0.079)formula (0.057,0.077)
Θ3: Week 50–59formula (0.161,0.193)formula (0.164,0.198)
Direct estimationTwo-step approach
30–39 years
Θ1: Week 1–28formula (0.016,0.127)formula (0.001,0.107)
Θ2: Week 29–49formula (0.013,0.031)formula (0.013,0.030)
Θ3: Week 50–59formula (0.179,0.275)formula (0.190,0.256)
70–79 years
Θ1: Week 1–24formula (0.026,0.084)formula (0.021,0.081)
Θ2: Week 25–49formula (0.057,0.079)formula (0.057,0.077)
Θ3: Week 50–59formula (0.161,0.193)formula (0.164,0.198)
TABLE 1

Numerical comparisons of the two estimation approaches. Numbers in brackets are estimated 95% confidence intervals. The intervals correspond to the direct approach are based on the likelihood ratio statistics derived from the conditional likelihood (4). Those correspond to the two-step approach are calculated based on the Wald-based bootstrap standard errors.

Direct estimationTwo-step approach
30–39 years
Θ1: Week 1–28formula (0.016,0.127)formula (0.001,0.107)
Θ2: Week 29–49formula (0.013,0.031)formula (0.013,0.030)
Θ3: Week 50–59formula (0.179,0.275)formula (0.190,0.256)
70–79 years
Θ1: Week 1–24formula (0.026,0.084)formula (0.021,0.081)
Θ2: Week 25–49formula (0.057,0.079)formula (0.057,0.077)
Θ3: Week 50–59formula (0.161,0.193)formula (0.164,0.198)
Direct estimationTwo-step approach
30–39 years
Θ1: Week 1–28formula (0.016,0.127)formula (0.001,0.107)
Θ2: Week 29–49formula (0.013,0.031)formula (0.013,0.030)
Θ3: Week 50–59formula (0.179,0.275)formula (0.190,0.256)
70–79 years
Θ1: Week 1–24formula (0.026,0.084)formula (0.021,0.081)
Θ2: Week 25–49formula (0.057,0.079)formula (0.057,0.077)
Θ3: Week 50–59formula (0.161,0.193)formula (0.164,0.198)

Figure 1 displays the predicted incidence rates for both unvaccinated and vaccinated populations. The “raw data” are presented as circles in the form of the crude incidence rates formula and formula. The dotted lines correspond to estimates based on the direct method as calculated by (16), and closely track the crude incidence rates. The smoother solid trend lines represent formula and formula from the two-step approach, and they agree well with the direct estimates.

Predicted weekly hospital admission rates  and  (lines) against crude incidence rates (circles). Note that the scale of the y-axis for the 70–79 years is 10 times of that for the 30–39 years.
Figure 1

Predicted weekly hospital admission rates formula and formula (lines) against crude incidence rates (circles). Note that the scale of the y-axis for the 70–79 years is 10 times of that for the 30–39 years.

3.2 Full Analysis of Hospital and ICU Admission Data for the Adult Population

In the full analysis, we consider four different age groups: 20–49 years, 50–69 years, 70–79 years, and 80+ years, with two vaccination statuses: those who received only two doses of the vaccine versus those who received a third booster dose. The questions to be investigated are as follows:

  • (1)

    Are there significant drifts over time in the RRs for those who received only two doses?

  • (2)

    How significant is the Omicron era in influencing changes in these RR s for those who received only two doses?

  • (3)

    What is the effect of a third (booster) dose on the time-trend of the RR?

The Omicron variant in Ontario was first identified on November 21, 2021 (Week 47) and its prevalence exceeded 90% among cases by December 21 (Week 51). In Ontario, the ramp up of the third dose began around Week 50 (December 13) when all individuals aged 50 years and older became eligible. Therefore, the period Θ2 from Week 50 of 2021 until February 20, 2022 (Week 59) encompasses the Omicron era while coinciding with the ramping up of the third dose.

To address Question 1, we define the time-varying covariate formula. For Question 2, we define formula if formula, and 0 otherwise. To address Question 3, we use the same covariate formula. The regression models (14) are given by

17

where β1 captures the drift, β2 captures the Omicron effect, and γ signifies the booster effect, which addresses Question 3. The conditional likelihood (15) is given by

18

where formula is the weekly incidence counts for those who only received two doses, and formula is the weekly incidence counts for those who received the third dose. The total incidence count is formula. The at-risk populations are defined accordingly and denoted as formula, and formula. The MLEs are formula.

We conducted separate analyses for weekly hospital admissions and weekly ICU admissions, and the results are presented in Table 2. The positive drift over time β1 is significant in all age groups, and the significance level increases with age (i.e. smaller p-values). This finding aligns well with VE studies Lin et al. (2022), Andrews et al. (2022), and Buchan et al. (2021), which indicate faster vaccine waning among older populations.

TABLE 2

Results for weekly hospital and intensive care unit (ICU) admissions in two separate analyses, where formula signifies the drift over time; formula signifies the Omicron effect and formula signifies the booster (third dose) effect.

Weekly hospital admissions
20–49 years50–69 years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−5.3780.605−4.5600.359
formula0.0450.0150.00250.0300.0090.0008
formula1.8690.245formula1.5380.152formula
formula−1.0390.114formula−1.3980.065formula
70–79 years80+ years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−4.0670.394−3.3020.203
formula0.0340.0100.00040.0410.006formula
formula1.4250.168formula0.7960.140formula
formula−1.8190.060formula−1.6730.046formula
Weekly ICU admissions
20–49 years50–69 years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−9.3731.815−6.3400.851
formula0.1350.0420.0010.0620.0200.002
formula0.2980.5710.61.1670.3160.0002
formula−0.5620.3130.3−1.4940.147formula
70–79 years80+ years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−5.1921.043−5.8281.058
formula0.0430.0250.090.0870.0260.0009
formula1.7490.4050.000020.2930.4910.55
formula−2.0760.161formula−1.8200.185formula
Weekly hospital admissions
20–49 years50–69 years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−5.3780.605−4.5600.359
formula0.0450.0150.00250.0300.0090.0008
formula1.8690.245formula1.5380.152formula
formula−1.0390.114formula−1.3980.065formula
70–79 years80+ years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−4.0670.394−3.3020.203
formula0.0340.0100.00040.0410.006formula
formula1.4250.168formula0.7960.140formula
formula−1.8190.060formula−1.6730.046formula
Weekly ICU admissions
20–49 years50–69 years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−9.3731.815−6.3400.851
formula0.1350.0420.0010.0620.0200.002
formula0.2980.5710.61.1670.3160.0002
formula−0.5620.3130.3−1.4940.147formula
70–79 years80+ years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−5.1921.043−5.8281.058
formula0.0430.0250.090.0870.0260.0009
formula1.7490.4050.000020.2930.4910.55
formula−2.0760.161formula−1.8200.185formula
TABLE 2

Results for weekly hospital and intensive care unit (ICU) admissions in two separate analyses, where formula signifies the drift over time; formula signifies the Omicron effect and formula signifies the booster (third dose) effect.

Weekly hospital admissions
20–49 years50–69 years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−5.3780.605−4.5600.359
formula0.0450.0150.00250.0300.0090.0008
formula1.8690.245formula1.5380.152formula
formula−1.0390.114formula−1.3980.065formula
70–79 years80+ years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−4.0670.394−3.3020.203
formula0.0340.0100.00040.0410.006formula
formula1.4250.168formula0.7960.140formula
formula−1.8190.060formula−1.6730.046formula
Weekly ICU admissions
20–49 years50–69 years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−9.3731.815−6.3400.851
formula0.1350.0420.0010.0620.0200.002
formula0.2980.5710.61.1670.3160.0002
formula−0.5620.3130.3−1.4940.147formula
70–79 years80+ years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−5.1921.043−5.8281.058
formula0.0430.0250.090.0870.0260.0009
formula1.7490.4050.000020.2930.4910.55
formula−2.0760.161formula−1.8200.185formula
Weekly hospital admissions
20–49 years50–69 years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−5.3780.605−4.5600.359
formula0.0450.0150.00250.0300.0090.0008
formula1.8690.245formula1.5380.152formula
formula−1.0390.114formula−1.3980.065formula
70–79 years80+ years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−4.0670.394−3.3020.203
formula0.0340.0100.00040.0410.006formula
formula1.4250.168formula0.7960.140formula
formula−1.8190.060formula−1.6730.046formula
Weekly ICU admissions
20–49 years50–69 years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−9.3731.815−6.3400.851
formula0.1350.0420.0010.0620.0200.002
formula0.2980.5710.61.1670.3160.0002
formula−0.5620.3130.3−1.4940.147formula
70–79 years80+ years
formulas.e.(formula)p -valueformulas.e.(formula)p-value
formula−5.1921.043−5.8281.058
formula0.0430.0250.090.0870.0260.0009
formula1.7490.4050.000020.2930.4910.55
formula−2.0760.161formula−1.8200.185formula

The VE is calculated by

19

The estimated parameters in Table 2 suggest that, beyond the existing linear time-drift of the declining VE, there was significant further reduction of VE during the Omicron era among individuals who received only two doses. However, the third dose offered good protection against hospital and ICU admissions, as illustrated in Figure 2. For Week 59 (between February 14 and February 20, 2022):

  • (1)

    the effectiveness of the third vaccine dose (formula) against hospital admissions increased from 58% (2 doses only) to 85% (3 doses) for 20- to 49-year-olds, 72% to 93% for 50- to 69-year-olds, 47% to 91% for 70- to 79-year-olds, and 11% to 81% for those aged 80 and above;

  • (2)

    the effectiveness of the booster vaccine (formula) against ICU admissions increased from 68% to 82% in the 20–49 age group, 78% to 95% in the 50–69 age group, 61% to 95% in the 70–79 age group, and 32% to 89% in individuals 80 years and older.

Vaccine effectiveness with respect to hospital and intensive care unit (ICU) admissions, calculated by (19) based on parameters from Table 2.
Figure 2

Vaccine effectiveness with respect to hospital and intensive care unit (ICU) admissions, calculated by (19) based on parameters from Table 2.

The model fitted baseline weekly incidence rate in the unvaccinated is given by

and the model predicted incidence counts are

Using the estimates from Table 2, predicted values are plotted in Figures 3 and 4. It is evident that the predicted values align well with the observed data. Moreover, the residual plots (not displayed) exhibit random dispersion around zero without discernible patterns, suggesting that the model is adequately capturing the data.

Predicted weekly hospital admission rates based on parameters in Table 2 against crude estimates based on reported data (shown as circles).
Figure 3

Predicted weekly hospital admission rates based on parameters in Table 2 against crude estimates based on reported data (shown as circles).

Predicted weekly intensive care unit (ICU) admission rates based on parameters in Table 2 against crude estimates based on reported data (shown as circles).
Figure 4

Predicted weekly intensive care unit (ICU) admission rates based on parameters in Table 2 against crude estimates based on reported data (shown as circles).

3.3 Limitations

3.3.1 Natural Immunity Due to Prior Encounters with Infectious Agents

The aggregated data analyzed in our study lack information on prior exposure to infections. As such, we could not determine the percentage of individuals with infection-induced immunity within the “at-risk” populations with sizes formula, formula, and formula. Assuming that each of these subpopulations shares an equal proportion of individuals with preexisting immunity, this proportion would be cancelled out in the score functions corresponding to the conditional likelihoods, leaving the estimated RRs unchanged. However, a substantial bias arises if the proportions of individuals with preexisting immunity vary among these subpopulations. For instance, if formula contains a higher percentage of individuals with immunity acquired from recent infections compared to formula and formula, RRs would be overestimated, and VE would be underestimated.

Our analysis may appear to indicate that vaccinations (including the third dose) in the 20- to 49-year age group have been less effective after Week 49 than in the 50- to 69-year age group and even less so compared to the 70- to 79-year age group in some settings (see Table 2 and Figure 2). However, publicly reported case surveillance data from Open Government Licence-Ontario (Web-1) and test positivity rates from the Open Laboratory Information System (Web-2) indicate that the population under 49 years of age had the highest proportion of test-positive cases from July to October 2021. Although these data do not provide a complete picture of the transmission due to unknown factors such as testing and ascertainment rates (Lawless & Yan, 2021), they seem to suggest that individuals aged 20 to 49 were predominantly affected by infections during that period. If most of these infections during this period occurred among the unvaccinated (according to anecdote) and if the circulating strains during these months resulted in at least partial immunity against the Omicron variant, which emerged in early December 2021, this may partially explain the apparent reduction in VE for this age group. However, these hypotheses can be only verified using comprehensive longitudinal cohorts that document vaccination timing, dose numbers, and infection events, including infecting variant. Unfortunately, the available data in this study do not have the required information.

3.3.2 Differential Risk-taking and Adherence to Other Public Health Measures

Vaccinated and unvaccinated individuals may exhibit different attitudes towards risk-taking and compliance with supplementary public health measures, including mask-wearing and social distancing. These behaviors could be further confounded with age and may also correlate with the differential acquired immunity resulting from virus exposure. However, this aspect is more difficult to quantify, even within longitudinal cohorts.

3.3.3 Incidental Hospital and ICU Admissions

Our model and method (described in Section 3.2) can be used to extend the response variables formula and formula to multiple types, not necessarily limited to vaccine types. There have been discussions concerning the distinction between hospital and ICU admission resulting directly from COVID-19 infection and “incidental admissions” (referring to patients admitted to the hospital or ICU for other medical reasons who are incidentally diagnosed with COVID-19 infection). Unfortunately, the data used in this study do not contain an indicator to distinguish the reason for admission. If a longitudinal database were available, these indicators could have been identified. If such data were available at an aggregate level, our method could be extended to analyze them.

3.3.4 Comorbidity

Comorbidity is a crucial confounding factor that operates at multiple levels. First, it affects the trends of at-risk populations formula, formula, and formula over time, since each new vaccine rollout (such as the second, third, and fourth doses) prioritizes individuals with comorbidities before expanding to the general population. Second, it is associated with incidental hospital and ICU admissions. Third, it is correlated with age. Notably, comorbidity is a prominent feature in the 80+ years age group, as clearly demonstrated in Figure 2. Longitudinal individual-based cohort data offer superior control over this confounding factor compared to coarsely aggregated count data.

3.3.5 Waning of Acquired Immunity

In Section 3.2, we intentionally avoided using the term “waning” to describe the decline of VE over time. This phenomenon cannot be accurately assessed without a high-quality longitudinally followed cohort, because waning is a function of time that must be measured for each individual from a defined vaccination date. Moreover, it also depends on the circulating variants since the time of vaccination. Lin et al. (2022) measured the waning effect starting from the vaccination date of the first dose, while Buchan et al. (2021) used the time since the last dose and stratified their analysis by Delta and Omicron variants. Unfortunately, the aggregated data we use in our study lack such detailed information. Hence, the best we can provide is an empirical observation of the diminishing VE at the population level over time.

3.3.6 Age

Age is inherently associated with all the factors discussed above, making it a confounding factor as well. Therefore, we deliberately refrained from incorporating age as a covariate in our regression analysis. Instead, we conducted age-stratified analyses to account for its influence.

4 Discussions and Conclusion

There is no substitute for high-quality, individual-based longitudinal cohort data combined with an individual-based statistical model when analyzing event history processes in a population. It is also crucial to account for potential confounders. However, obtaining representative high-quality data and controlling for confounding factors pose significant challenges. Separating the “pure” VE from the influence of other public health measures proves to be a difficult task.

Considering the readily available data sources and acknowledging their limitations, we have developed novel statistical models and methods to demonstrate their application using aggregate-level COVID-19 data. Our proposed proportional incidence rate model is based on the RRs formula (aggregated as formula in (1)), which determine VE. This model differs subtly from the proportional intensity model of the two counting processes with intensities formula and formula. Given the discrete (i.e., aggregated) nature of the data and the model formulation, we presented the conditional likelihoods (4), (12), and (15), treating the baseline incidence rate formula as a nuisance parameter. Similar to the proportional hazards model in survival analysis, formula can be estimated semiparametrically after estimating the RR function using the conditional likelihood. Although we tailored these models and methods for a specific application, they can be extended to other applications with similar settings and data features.

We have demonstrated that event counts (e.g., hospital and ICU admissions) within a vaccinated cohort are conditionally sufficient for estimating RR (when it is the only objective), given the total numbers of these occurrences across the entire population. Additional trend information for the risk functions that define the RR can be enhanced by incorporating a separate penalized spline model. This model utilizes the aggregated counts of events in the unvaccinated group as the baseline incidence rate for the counterfactuals.

The applications presented in Section 3 not only illustrate the statistical models and methods proposed in Section 2, but also extend the study period to include the Omicron era and the introduction of third vaccine dose until February 2022. These aspects have not been addressed in the existing literature we reviewed. Our findings are in agreement with other studies, including test-negative case-control and cohort studies Teerawattananon et al. (2022), Andrews et al. (2022), Buchan et al. (2021), and Lin et al. (2022). In particular, our study and Buchan et al. (2021) covered the same population using an identical mixed vaccine schedule, yielding highly consistent results.

We extensively discussed the limitations and confounding factors in Section 3.3, emphasizing the need for improved data collection practices. This includes ensuring representativeness, comprehensive coverage, individual-level documentation of event history, and capturing potential confounding variables. Our recommendation extends beyond COVID-19 and applies to other infectious diseases, promoting the development of a more resilient and adaptable public health infrastructure and response. Despite its limitations, this paper offers a valuable approach for examining time-related factors at an aggregate level.

Data Availability Statement

The data that support the findings in this paper were collected from public domains, specifically the Ontario Case and Contact Management System (https://www.health.gov.on.ca/en/pro/programs/publichealth/coronavirus/docs/contact_mngmt/management_cases_contacts.pdf) and Statistics Canada (https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1710000501). Regarding the former, it is important to note that the data were continuously evolving during the pandemic and updated beyond our study period. To obtain the original data used in the analysis, please contact directly Ping Yan, email: [email protected].

Acknowledgments

We received ethics approval for this study from the Research Ethics Board at the University of Toronto. We thank Professor J.F. Lawless, Department of Statistics and Actuarial Science, University of Waterloo, for insightful discussions and statistical advices.

References

Andrews
,
N.
,
Tessier
,
E.
,
Stowe
,
J.
,
Gower
,
C.
,
Kirsebom
,
F.
,
Simmons
,
R.
et al. (
2022
)
Duration of protection against mild and severe disease by Covid-19 vaccines
.
New England Journal of Medicine
,
386
,
340
350
.

Buchan
,
S.A.
,
Chung
,
H.
,
Brown
,
K.A.
,
Austin
,
P.C.
,
Fell
,
D.B.
,
Nasreen
,
S.
et al. (
2021
)
Estimated Effectiveness of COVID-19 Vaccines Against Omicron or Delta Symptomatic Infection and Severe Outcomes
.
JAMA Netw Open
.
5
(
9
), e2232760. https://doi.org/10.1001/jamanetworkopen.2022.32760

Godambe
,
V.P.
&
Thompson
,
M.E.
(
1974
)
Estimating equations in the presence of a nuisance parameter
.
The Annals of Statistics
,
2
(
3
),
568
571
.

Godambe
,
V.P.
(
1980
)
On sufficiency and ancillarity in the presence of a nuisance parameter
.
Biometrika
,
67
,
155
162
.

Kalbfleisch
,
J.D.
&
Sprott
,
D.A.
(
1970
)
Application of likelihood methods to models involving large numbers of parameters
.
Journal of Royal Statistical Society Series B
,
32
(
2
),
175
208
.

Kalbfleisch
,
J.G.
(
1985
)
Probability and statistical inference, volume 2: statistical inference
.
New York
:
Springer
.

Lawless
,
J.F.
(
1987
)
Negative binomial and mixed Poisson regression
.
Canadian Journal of Statistics
,
15
(
3
),
209
225
.

Lawless
,
J.F.
&
Yan
,
P.
(
2021
)
On testing for infections during epidemics, with application to Covid-19 in Ontario, Canada
.
Infectious Disease Modelling
,
6
,
930
941
.

Lin
,
D.Y.
,
Gu
,
Y.
,
Wheeler
,
B.
,
Young
,
H.
,
Holloway
,
S.
,
Sunny
,
S.K.
et al. (
2022
)
Effectiveness of Covid-19 vaccines over a 9-month period in North Carolina
.
New England Journal of Medicine
,
386
,
933
941
.

Mullah
,
M.A.S.
&
Yan
,
P.
(
2022
)
A semi-parametric mixed model for short-term projection of daily COVID-19 incidence in Canada
.
Epidemics
,
38
,
100537
, https://doi.org/10.1016/j.epidem.2022.100537.

Nasreen
,
S.
,
Febriani
,
Y.
,
García
,
H.A.V.
,
Zhang
,
G.
,
Tadrous
,
M.
,
Buchan
,
S.A.
,
Christiaan
,
H.
,
Righolt
,
C.H.
et al. (
2022
)
on behalf of the Canadian Immunization Research Network Provincial Collaborative Network Investigators
, Effectiveness of Coronavirus Disease 2019 Vaccines Against Hospitalization and Death in Canada: A Multiprovincial, Test-Negative Design Study, Clinical Infectious Diseases, Volume 76, Issue 4, 15 February 2023, Pages 640–648, https://doi.org/10.1093/cid/ciac634

Ogden
,
N.H.
,
Turgeon
,
P.
,
Fazil
,
A.
,
Clark
,
J.
,
Gabriele-Rivet
,
V.
,
Tam
,
T.
&
Ng
,
V.
(
2022
)
Counterfactuals of effects of vaccination and public health measures on COVID-19 cases in Canada: what could have happened?
Canada Communicable Disease Report
,
48
(
7/8
),
292
302
. https://doi.org/10.14745/ccdr.v48i78a01

Sprott
,
D.A.
(
1975
)
Marginal and conditional sufficiency
.
Biometrika
,
62
,
599
605
.

Sprott
,
D.A.
(
2000
)
Statistical inference in science
.
New York
:
Springer
.

Statistics Canada
. (
2021
)
Table 17-10-0005-01
. Population estimates on July 1st, by age and sex. Available at: https://doi.org/10.25318/1710000501-eng (last accessed: Octorber 2021).

Teerawattananon
,
Y.
,
Anothaisintawee
,
T.
,
Pheerapanyawaranun
,
C.
,
Botwright
,
S.
,
Akksilp
,
K.
,
Sirichumroonwit
,
N.
et al. (
2022
)
A systematic review of methodological approaches for evaluating real-world effectiveness of COVID-19 vaccines: advising resource-constrained settings
.
PLoS ONE
,
17
(
1
), e0261930, https://doi.org/10.1371/journal.pone.0261930.

Venebles
,
W.N.
&
Ripley
,
B.D.
(
2002
)
Modern applied statistics with S
.
New York
:
Springer
.

Wood
,
S.N.
(
2003
)
Thin plate regression splines
.
Journal of the Royal Statistical Society, Series B
,
65
,
95
114
.

Wood
,
S.N.
(
2004
)
Stable and efficient multiple smoothing parameter estimation for generalized additive models
.
Journal of the American Statistical Association
,
99
(
467
),
673
686
.

This is an open access article under the terms of the http://creativecommons.org/licenses/by-nc-nd/4.0/ License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.

Supplementary data