A Bayesian approach to estimating COVID-19 incidence and infection fatality rates

Summary Naive estimates of incidence and infection fatality rates (IFR) of coronavirus disease 2019 suffer from a variety of biases, many of which relate to preferential testing. This has motivated epidemiologists from around the globe to conduct serosurveys that measure the immunity of individuals by testing for the presence of SARS-CoV-2 antibodies in the blood. These quantitative measures (titer values) are then used as a proxy for previous or current infection. However, statistical methods that use this data to its full potential have yet to be developed. Previous researchers have discretized these continuous values, discarding potentially useful information. In this article, we demonstrate how multivariate mixture models can be used in combination with post-stratification to estimate cumulative incidence and IFR in an approximate Bayesian framework without discretization. In doing so, we account for uncertainty from both the estimated number of infections and incomplete deaths data to provide estimates of IFR. This method is demonstrated using data from the Action to Beat Coronavirus erosurvey in Canada.


SUMMARY
Naive estimates of incidence and infection fatality rates (IFR) of coronavirus disease 2019 suffer from a variety of biases, many of which relate to preferential testing.This has motivated epidemiologists from around the globe to conduct serosurveys that measure the immunity of individuals by testing for the presence of SARS-CoV-2 antibodies in the blood.These quantitative measures (titer values) are then used as a proxy for previous or current infection.However, statistical methods that use this data to its full potential have yet to be developed.Previous researchers have discretized these continuous values, discarding potentially useful information.In this article, we demonstrate how

Introduction
As of April 1, 2022, there have been close to 500 million confirmed cases of coronavirus disease 2019  worldwide (World Health Organization, 2022).However, the general consensus is that this number is an underestimate of the true cumulative incidence of the disease, as this estimate is largely dependent on the number of tests being administered, the accuracy of testing (Burstyn and others, 2020a,b), and to whom these tests are being issued.If testing is extensive enough, and a correction is made for underreporting of asymptomatic cases, then a test-based case fatality rate may be a reasonable proxy for the infection fatality rate (IFR) (Luo and others, 2021).However, given that the testing early in the pandemic was sparse, and estimating IFR accurately is of the utmost importance, epidemiologists across the globe are conducting serosurveys that measure immunity of individuals by testing for the presence of SARS-CoV-2 antibodies in the blood (Chen and others, 2021).This quantitative measure (which we will call a titer value) is then used as a proxy for previous or current infection.However, how exactly these data should be used to accurately estimate important epidemiological quantities (like incidence and IFR) is an active area of research.
The standard approach is to label everyone who has a titer value above some threshold as "infected" and consider everyone else not infected.This leads to the problem of selecting the cutoff, which can be made based on known cases/controls and analysis of the receiver operating characteristic (ROC) curve.The ROC plots the true positive rate (sensitivity) versus the false positive rate (1-specificity), and it is typical to select the cutoff that results in the highest Youden Index (sensitivity + specificity − 1) (Krzanowski and Hand, 2009).Gelman and Carpenter (2020) suggest that the uncertainty in sensitivity and specificity can be considered parameters to be estimated in a Bayesian hierarchical model assuming that informative priors are used for the sensitivity and specificity.Although this method accounts for uncertainty in the sensitivity and specificity, it still suffers from the loss of information in the discretization process.Particularly in COVID-19 applications, a subject with an extremely high level of antibodies should have a lower probability of being a falsepositive than someone who is just barely above the threshold.This could be partially remedied by allowing sensitivity and specificity to be a function of covariates, but ideally methods that avoid these issues all together are preferable.
Mixture models are a natural choice to overcome the limitations of using a fixed cutoff, as they allow infection status and associated uncertainty to depend on the magnitude of individuals' titer values.Mixture models have been widely applied when studying the prevalence of infectious diseases in animals (Ødegård andothers, 2003, 2005;Nielsen and others, 2007) and in humans (Vink andothers, 2015, 2016;Kyomuhangi and Giorgi, 2022).There are several other papers that have modeled the COVID-19 antibody levels directly to infer cumulative incidence through the use of mixture models.Bouman and others (2021) showed that mixture models can outperform the methods of Gelman and Carpenter (2020) for estimation of cumulative incidence of COVID-19.Furthermore, Bottomley and others (2021) apply mixture models to Kenyan serosurvey data and show that mixture of skew normal distributions more accurately estimates cumulative incidence than methods based on thresholds.However, the applications of these models thus far has been rather limited.For instance, some unexplored questions include: how do we use these mixture models to account for survey bias and get cumulative incidence rates for the general population?How do we incorporate multiple titer values per person?How do we estimate cumulative incidence in the presence of vaccinated individuals?How do we use these mixture models to estimate IFR while accounting for uncertainty in both the number of infections and deaths?
In this article, we demonstrate how mixture models can be used to estimate cumulative incidence in an approximate Bayesian framework without discretization.Specifically, we apply a mixture of multivariate t-distributions to the log of the titer values, using a logistic regression model for the mixing parameter to account for covariates.We then use post-stratification to obtain estimates of cumulative incidence and its associated uncertainty.Furthermore, we estimate the number of COVID-19-related deaths using partially complete data and use this in combination with incidence estimates to estimate the IFR across Canada.

Data
Dry blood spot (DBS) samples were collected from participants of the Action to Beat Coronavirus (Ab-C) study (https://www.abcstudy.ca/).This article is concerned with the first two phases of the study.In Phase 1, DBS samples from 9123 participants were collected from June to November 2020 and roughly corresponding to the first viral wave (April 1-July 31, 2020).In Phase 2, DBS samples from 7299 were collected from December 2020 to May 2021 and roughly correspond to the second viral wave (October 1, 2020-March 1, 2021).These blood spots were tested for prevalence of immunoglobin G (IgG) antibodies, measured using three antigens: Spike (SmT1), RBD, and nucleocapsid (NP).Two different versions of the SmT1 antigen test were used on the Phase 1 blood spots, while all three were applied to Phase 2 blood spots.All three titers will show larger values for participants who have been exposed to COVID-19, but only SmT1 and RBD will show larger values for mRNA vaccinated individuals.This is because the mRNA vaccines do not contain the nucleocapsid (NP) protein.Therefore, people who received an mRNA vaccine and did not have a history of prior infection, will not develop anti-NP antibodies.Those that were previously infected, regardless of vaccination status, will have anti-NP antibodies (Houlihan and Beale, 2020).This will be helpful for distinguishing between vaccinated and infected individuals in Section 3.3.In Phase 1, 8919 people had one SmT1 measurement, and 8704 had two SmT1 titer measurements, along with complete covariate information.In Phase 2, 7065 had all three measurements, along with complete covariate information.Of those 7065, 624 joined the study in Phase 2 (6441 participants had complete Phase 1 and Phase 2 data).These data have been previously analyzed by Tang and others (2022) using a simpler model.Additional medical details regarding these antigen tests can be found in their paper.Tang and others (2022) also investigated the representativeness of study participants when compared to the Canadian population.They found that the study population tended to be older, more university educated, more likely to be indigenous, etc. See eTable 3 in their paper for further reading.
Although serosurveys are a proven way to accurately measure seroprevalence, the notion of seroprevalence itself has several drawbacks.Firstly, there is a chance that participants got infected and returned their blood spots soon after.Antibodies generally take between 7 and 14 days to be measurable from the onset of infection (Centre for Disease Control and Prevention, 2022).This may cause a slight under-estimation of incidence.Secondly, antibodies wane slowly over time.However, they have been shown to remain elevated for many months after infection.In a study (Alfego and others, 2021) evaluating 39 086 individuals with confirmed positive COVID-19 infection by RT-PCR between March 2020 and January 2021, the anti-NP antibody remained elevated in 68.2% [95% Cl: 63.1-70.8] of participants after 293 days, while anti-SmT1 antibody remained elevated in 87.8%  of participants after 300 days.Note that the majority of people in our study were likely infected far less than 300 days prior to submitting their blood spots, so the maintenance percentage in our study was likely higher than those in Alfego and others (2021).At this point, we simply note these limitations of seroprevalence, and examine the potential impact of waning immunity on our results in Appendix F.
Population demographics (age, sex, province, ethnicity, education, and long-term care residency) were obtained from 2016 Census data from Statistics Canada (Statistics Canada, 2016).We are using the 2016 Census data because, at the time of writing, the 2021 Census data pertaining to education and ethnicity was not available.The age/sex/geographic data for 2021 were available and while the total population increased roughly 5% between 2016 and 2021, the age-sex and geographic distributions were nearly identical.This information will be used for post-stratification as described in Section 2.3.The long-term care (LTC) COVID-19 deaths were obtained from https://ltc-covid19-tracker.ca (Samir and others, 2022) between September 2020 and March 2021 for each province.The total deaths for each province by age and sex were obtained from the different provincial governments (Ontario, Alberta, and Quebec).For additional provinces, where deaths by age and sex could not be obtained, we used the distribution of nearby provinces to approximate those deaths.The age/sex distribution of deaths in Alberta was used to infer the distribution of deaths in British Columbia and Saskatchewan.The age/sex distribution of deaths in Quebec was used to infer the distribution for the Atlantic region (New Brunswick, Nova Scotia, Newfoundland, and Prince Edward Island).Manitoba reported different age groups than Ontario but seemed to have a similar distribution.Thus, we used Ontario data to infer Manitoba's age/sex deaths for the different age groups.This means that although the aggregate IFR estimates for the Atlantic region, Manitoba, British Columbia, and Saskatchewan are likely valid, the estimates by age/sex should be treated with caution due to the imputations noted above.

Methods
Our first goal is to estimate the cumulative incidence of SARS-CoV-2 in Canada.We define cumulative incidence in Phase 1 to be the number of SARS-CoV-2 infections up until September 30th 2020, divided by the population size.The cumulative incidence in Phase 1 and 2 has the cumulative number of infections up until March 31st 2021 as the numerator.We define the incidence proportion in Phase 2 to be the number of infections from October 1st 2020 to March 31st 2021, divided by the population size.We recognize that the terms cumulative incidence and incidence proportion are used interchangeably in the epidemiology literature, and we are avoiding the term "cumulative" when presenting estimates of incidence in Phase 2 alone.We estimate incidence in two steps.First, we will fit a Bayesian mixture model to the titer values, relating an individual's infection status, a latent variable, to their measured covariates via a logistic regression model.Second, we will use post-stratification to account for the disparity between the population of survey responders versus the general Canadian population.This will yield an estimate of the number of infections in Canada for each covariate combination, and hence, an estimate of the cumulative incidence.
Our second goal is to estimate the Infection Fatality Rate, which is defined as the number of COVID-19 related deaths divided by the number of infections.This will be estimated in Phase 1, Phases 1 and 2, and Phase 2 alone with the same time periods as mentioned previously.We do this by building a Bayesian model for the number of deaths in Canada by age/sex/province group, and dividing this by the estimated number of infections.This will allow for estimates of IFR in any age/sex/province category that we want, accounting for uncertainty in both the deaths and the infections.

Notation
Lower case Latin letters are used to represent (potentially vector-valued) observed data; x are observed covariates, w is observed titer values, and d is observed deaths.The exception is p, which is an unknown probability of infection.Upper-case Latin letters represent latent variables ("missing data"), such as the unknown number of infections Y , an unknown number of deaths D, and the latent infection status Z of an individual.Greek letters will be used for model parameters.

Mixture models
In this subsection we will introduce three mixture models that will be used to infer cumulative incidence.First, we will introduce a univariate (one titer value), two-component ("not infected" and "infected") mixture model, relating each study participant's covariates to their probability of infection.We will then extend this model to the bivariate case with two titer values in Section 2.2.2.These two models will be fit to the Phase 1 data.We will then present a trivariate, three-component ("unvaccinated, not infected," "unvaccinated, infected," and "vaccinated, not infected") mixture model that will be fit to the Phase 2 data.Note that the "infected" group here contains both vaccinated and unvaccinated people as our titers values are not precise enough to determine vaccination status if a person is infected.This is likely inconsequential as we will explain shortly.
2.2.1.Univariate mixture of t-distributions-Phase 1.The infectivity status, Z i , of an individual i is latent and is measured through an antibody lab test (titer), which is a quantitative measure.The density of the logged Phase 1 SmT1 titer values is shown in Figure 1.Notice that there is an approximately symmetric mound around 0.15 which is likely to be comprised of individuals who never had COVID-19.Previously, Gaussian distributions were used to model the logged titer values in noninfected individuals (Bottomley and others, 2021).However, we expected a heavier-tailed distribution would be needed, and employ a t-distribution for both the negative and positive individuals.
The univariate, two-component version of our mixture model can be written as follows: vector of regression coefficients which will be used for post-stratification as described in Section 2.3, f 1 is the univariate (shifted and scaled) t-density, and p i = logit −1 (β T x i ) is the probability that individual i has been infected with COVID-19.That is, the probability that someone had COVID-19 is a function of their covariates, but the parameters of the t-distributions are not.The covariates used in our mixture models were age (<20, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, and 80+), sex (male, female), province (Alberta, Atlantic Region, British Columbia, Manitoba, Ontario, Quebec, and Saskatchewan), ethnicity (white, indigenous, not white or indigenous), and education (university degree, college degree, and less than college degree), meaning that m = 18.Since Z i is a latent discrete variable, certain MCMC software programs cannot sample it directly.However, we can marginalize Z i out to obtain the following likelihood: where ξ = {μ 0 , μ 1 , σ 0 , σ 1 , ν 0 , ν 1 } is a vector of parameters which need to be estimated but are not used to infer incidence directly.
For both Phase 1 and Phase 2, we have continuous values for multiple titers and thus will now extend this univariate mixture model to a mixture of multivariate t-distributions.

A bivariate mixture model for Phase 1.
For Phase 1, we have two measurements of SmT1 for each sample.Using both titers should improve our ability to identify individuals who were infected.Our model naturally extends to the bivariate case by replacing the univariate t-distribution by a bivariate t-distribution (f 2 ): where μ k is a vector of length 2, k is a 2 × 2 covariance matrix, and the rest of the parameters are the same as Section 2.2.1.Note that the logistic regression model for Z i in the second level is still univariate.This allows the model to accommodate multiple titer values per person without the number of parameters getting out of control.We fit this bivariate model on the two Phase 1 titer values using MCMC to obtain posterior samples of β which will be used later for post-stratification.

A trivariate, three-component mixture model for Phase 2.
In Phase 1, vaccinations had not yet been made available and Z i could only take on two values: "infected" or "not infected".However, during Phase 2, a non-negligible proportion (≈ 2.5%) had claimed to have been vaccinated.Given that vaccinated people are distinguishable from infected people based on the three titer values that we have available, we now have three mutually exclusive values for Z i : "unvaccinated, not infected," "unvaccinated, infected," and "vaccinated, not infected."We did not include a fourth group "vaccinated, infected," as there were likely to be very few participants in this category.Note that we can differentiate between "vaccinated, not infected" and "unvaccinated, infected" individuals because infected individuals will tend to have high titer values for all three titers, while vaccinated individuals should not have an elevated titer value for NP.That is, if a participant shows a high value of SmT1 and RBD, and a low value for NP, it should predict a small probability of infection.If a participant has a large value for all three, then the model should predict a large probability of infection.Furthermore, we decided not to use self-reported vaccination status as data, as only about half of the participants who claimed to be vaccinated were showing large values of SmT1 and RBD.This may be because they had only received one dose, or perhaps they had provided their blood spot less than 2 weeks since their second dose.Either way, we want the data (titer values) to determine SARS-CoV-2 incidence, rather than rely on self-reported claims of vaccination.
In addition to having three infected statuses, we also now have three titer values which we can use to define a mixture of three trivariate t-distributions (f 3 ).The likelihood for this trivariate model is: where ρ = Prob(y i = 2|y i = 1).Here, Prob( We fit this trivariate model to Phase 2 data using Bayesian MCMC to obtain posterior samples of β which will be used for post-stratification.

Estimating incidence using post-stratification
Incidence is defined as the number of people with an infection in a given time frame, divided by the population.We estimate incidence of COVID-19 in a subgroup of Canadians G by taking posterior samples of I G where h is ethnicity/education, is age/sex, j is province, p h j is the probability of COVID-19 infection (as in Equation 2.2) for a person with covariate combination h j, Y h j is the number of people in Canada with covariate combination h j who were infected with COVID-19, and n h j is the number of people in Canada with covariate combination h j.To obtain samples of I G we first fit the mixture models presented in Section 2.2 to obtain T posterior samples of p h j .We then use post-stratification (Little, 1993) to generalize these results to the Canadian population.That is, we draw one sample from for each t = 1...T.We then compute h j h j∈G n h j for t = 1...T, which are then used to obtain point estimates and credible intervals for cumulative incidence in Phase 1 and Phases 1 and 2 combined.The incidence proportion in Phase 2 is estimated by computing these two cumulative incidence estimates for each t, then taking the difference.

Estimating infection fatality rates outside of long-term care homes
The infection fatality rate (IFR) is a measure of the deadliness of a disease.It is defined as The methods described in Sections 2.2 and 2.3 provide estimates of the denominator with associated uncertainty, but we still need to estimate the number of deaths in the numerator.The number of COVID-19 related deaths in Canada are publicly available, but include long-term care (LTC) residents.Our target of inference is the IFR for the "community-dwelling" Canadian population and does not apply to people living in LTC homes.The spread of COVID-19 is substantially different in LTC homes than in the general population and residents of LTC homes are particularly vulnerable to severe illness and death from infection; see Danis and others (2020).Indeed nearly 80% of the reported deaths from COVID-19 prior to September 2020 in Canada were in LTC homes (Samir and others, 2022).Modeling the spread and mortality of COVID-19 within LTC homes will require unique approaches and should be considered in a separate analysis; see the recommendations of Pillemer and others (2020).The Ab-C study excludes residents of LTC and thus we need to exclude this population from our numerator as well.To do this, we will extend our post-stratified mixture models to estimate the deaths outside of long-term care homes, using publicly available COVID-19 deaths data and long-term care deaths data described in Section 1.1.
In the rest of this section, we describe the extended mixture model and algorithm used to estimate IFR in this article.We start by displaying the full model with a description of each component.We then provide a directed acyclic graph (DAG) that displays the relationship between all quantities in the model.We then provide a full factorization of the posterior distribution and explain how our algorithm approximates this posterior.
2.4.1.The complete model.The full model is shown in Equations 2.3a-2.3h,followed by a description of each component.Equations 2.3a-2.3crepresent the mixture model and post-stratification described previously, and will be referred to as "Module 1" of our IFR model.Equations 2.3d-2.3hrepresent the model extension to estimate the number of deaths outside of long-term care and will be referred to as "Module 2." Left aligned are the model components, right aligned are the nomenclature used in the posterior factorization in Section 2.4.2. ) • Indices: h, , and j represent education/ethnicity, age/sex, and province groups, respectively.Subscripts 1 and 2 are used to distinguish between quantities outside and within long-term care respectively.
• 2.3a: The log of the titer values, w i , of individual i follow a (shifted and scaled) multivariate t-distribution, with parameters that depend on the infectious status Z i = k of that individual.k = 0: "unvaccinated, not infected," k = 1: "unvaccinated, infected," k = 2: "vaccinated, not infected" (for Phase 2 only).
• 2.3b: an individual's infection status, Z i , depends on the infection probability corresponding to that individual's covariate combination, p h j[i] .
• 2.3c: The number of infections in Canada with covariate combination h j is determined by the number of people in Canada with that covariate combination, n h j , and the probability, p h j , that a person with that covariate combination was infected.
• 2.3d: The number of deaths outside long-term care in age/sex/province group j, D 1 j , depends on the number of infections in that group, Y • j , and the infection fatality rate in that group, η j .Note that we do not attempt to estimate the deaths by education and ethnicity, which is why we sum over h in Y • j .
• 2.3e: The total number of COVID-related deaths in age/sex/province group j, d j , has death rate equal to the sum of the death rates outside long-term care, λ 1 j , and the death rate inside long-term care, λ 2 j .
• 2.3f: Outside long-term care, we only know the death rates aggregated by province (the age/sex distribution is unknown).If we assume that the number of deaths outside long-term care in age/sex group and province j follows an independent Poisson process with mean λ 2 j , then the deaths aggregated by province, d 2•j , will be Poisson distributed with mean λ 2 j .Note that if we knew d 2 j , there would be no need for Module 2.
• 2.3g: In each age/sex/province group, the mean number of deaths (death rate) outside longterm care, λ 1 j , is the product of the number of infections outside of long-term care Y j , and the infection fatality rate outside long-term care, η j .
• 2.3h: In each age/sex/province group, the mean number of deaths (death rate) within longterm care, λ 2 j , is the product of the number of people in Canada in long-term care n 2 j , and the COVID-19 death rate in long-term care, θ j .Fig. 2. Directed acyclic graph corresponding to the model presented in equations 2.3a-2.3h,with subscripts omitted.Lower case Latin letters are known, all other terms are unknown.Module 1 is the portion of the model concerned with estimating infections.Module 2 is the portion of the model concerned with estimating deaths.The red arrows indicate a one-directional flow of information, and are the reason we are sampling from the cut distribution as opposed to the Bayesian posterior.β is the effect of covariates, x, on the log(odds) of infection; Z is infection status, w represents titer values from the serosurvey; ξ are the parameters of the multivariate t-distributions; Y is the number of infections outside of long-term care; D is the number of deaths outside long-term care; d is the total number of deaths by age/sex/province; d 2 is the number of deaths inside long-term care by province; η is the population average probability of death given infection; θ is the COVID-19 death rate in long-term care.
2.4.2.Approximating the Bayesian posterior.Figure 2 displays the model represented in Equations 2.3a-2.3has a DAG.Based on this DAG, the full posterior can be factored as follows: (2.4)However, sampling from this posterior poses a computational challenge, as Y and D are both discrete latent variables, and all three terms in π(D|Y , η) are unknown.Instead, we sample from the "cut distribution" (Plummer, 2015), which is the same as Equation 2.4 but the dependence on d in π(Y |β, x, d) is dropped.The removal of this dependence is sometimes referred to as "cutting feedback."Since we are not allowing our deaths data to influence our infection estimates, we are only approximating Bayesian inference when computing IFR.The cut distribution has been shown to give more sensible results than the full posterior in some scenarios where certain portions (modules) of the model are misspecified, or data quality is poor (Lunn and others, 2009).It is important to note that our serosurvey data are very high quality individual level data, but our deaths data are partially imputed and is from an unofficial source.The cut model allows us to base our estimates of incidence solely on the serosurvey data (and census data), while still utilizing all data sources to estimate IFR.We sample from the cut distribution using the following two step algorithm: (1) We first sample from the joint posterior of the parameters in the first module: which is the same as the Module 1 portion of Equation 2.4 but with the dependence of d dropped in the first term.We sample from this distribution by obtaining T (post burn-in) posterior samples of each parameter using π(β, ξ , Z|x, W) = π(W|ξ , Z)π(Z|β, x)π(β)π(ξ ) as a target distribution in MCMC.We then draw a sample, Y (t) , from π(Y |β (t) , x) for t = 1...T.
We used this algorithm for both Phase 1 and Phase 2 data, obtaining T samples of (Y • j , D 1 j ) from π cut (Y , D).We then estimate IFR by computing samples from π cut (IFR G ) for any subgroup of Canadians G outside of long-term care: • j (2.5) for each t = 1...T.We can then compute point estimates with uncertainty for all of Canada, and any age/sex/province combination that we so please.We compute the IFR G for various age/sex/province combinations using univariate and bivariate models to estimate the denominators for the Phase 1 data, and the multivariate model for Phase 1 and 2 combined.We do not attempt to estimate IFR by education/ethnicity, so we sum over h in Y • j .Since individuals who were likely to be positive in Phase 1 were also likely to be positive in Phase 2, estimating incidence and deaths just based on Phase 2 data will also include people who were likely infected in Phase 1.In order to estimate the new infections and deaths (and as a result, IFR) in just Phase 2, we found posterior samples of Y from the multivariate model and subtracted the posterior samples from the bivariate model to get the denominator.The same was done for the deaths D for each posterior sample, allowing us to calculate IFR for any subgroup we desire.

Priors
In all three mixture models, a weakly informative prior of N(0, 1) was used for each β.This will stabilize estimates in groups with a small amount of data, and have little effect on those that have a lot of data.A weakly informative penalized complexity prior was put on the degrees of freedom in all three models (see Appendix A).In the multivariate cases, informative priors were used to overcome well-known computational challenges of fitting Bayesian mixture models as noted in the Stan documentation (Betancourt, 2017).We describe our informative priors and their justifications in detail in Appendix D.1.In the reproducible example that we provide in the Supplementary material available at Biostatistics online, we show that our results are not too sensitive to "mis-specified" informative priors on the mixture components.We also note that it is primarily the estimation of β's that influence the results of this article.A weakly informative prior was used on as recommended by Section 1.13 of the Stan User's Guide (Stan Development Team, 2021).A complete list of priors for all models is presented in Appendix D.

Inference
Each model was run using No-U-Turn sampling, a form of Hamiltonian Monte Carlo that is readily available in the Stan software (Carpenter and others, 2017;Stan Development Team, 2021).Four chains with 1000 iterations, with the first half being warmup, were used for each model component.Traceplots were used to visually assess convergence of Markov chains, alongside values of Rhat < 1.01 confirming an appropriate amount of mixing (Vehtari and others, 2021).Point estimates are taken to be the 50th percentile of the (approximate) posterior distributions, and credible intervals (CrI's) are computed using the 2.5th and 97.5th quantiles.

Univariate model-Phase 1
Estimated cumulative incidence and IFR by age group is presented in Figure 5.Using the univariate model, the overall estimated cumulative incidence in Phase 1 (February-Sept 2020) is 1.79% (95% CrI: 1.21-2.66),which is similar to the estimate presented in Tang and others (2022) of 1.9% (95% CI: 0.7-4.7).Using this model for the denominators in the IFR calculation leads to an estimated infection fatality rate of 0.35% (95% CrI: 0.24-0.52)for all Canadians outside of long-term care homes.This is, again, consistent with the estimates presented in Tang and others (2022) of 0.373 (95% CI: 0.153-1.024).
When we look at the age distribution of cumulative incidence, we see a general downward trend with increasing age, with estimates for the age group 70+ being the smallest at 0.71% (95% CrI: 0.24-1.74).However, the credible intervals all overlap which suggests that incidence is similar between age groups.We see an upward trend in IFR with increasing age, with non-overlapping credible intervals.This is to be expected, as COVID-19 is now known to be much deadlier in older populations (Williamson and others, 2020).
A plot of the two univariate t-distributions is shown in Figure 1.Notice that the density plot for the positive group has mass to the left of the cutoff used by Tang and others (2022), and the negative group has mass to the right of the cutoff.Large values of titers (>2) will show high probability of SARS-CoV-2 incidence from our model, but this is not true for titer values around 0.5.If these values had been discretized using a fixed cutoff, participants with very large titer values would be indistinguishable from those with values of ≈ 0.5, thus would have the same probability of being false positives.Although this univariate case works well to demonstrate our method, we will use the results from the bivariate model when computing estimates for Phase 1.

Bivariate model-Phase 1
Figure 5 presents estimated cumulative incidence and infection fatality rates for the bivariate model in Phase 1 using both SmT1 titers.The overall cumulative incidence for Canada was 1.60% (95% CrI: 1.15-2.23).This point estimate is somewhat consistent (slightly lower) with the univariate results, with a smaller credible interval.This is reassuring, since our uncertainty should decrease as more data is used in the model.Our Phase 1 estimates are comparable with the estimate for seroprevalence in Canada from O'Driscoll and others (2021) of 1.4% (CI: 1.16-1.68,as of September 1st 2020).The estimated overall infection fatality rates for residents outside of long-term care homes was 0.39% (95% CrI: 0.27-0.56),which is also consistent with our univariate results.We will use the bivariate results for Phase 1 going forward.
When broken down by age, we see very similar trends in both cumulative incidence and IFR as with the univariate model.We also see slightly reduced uncertainty in all age groups, which is to be expected since we are adding more information (an extra titer value) into the model.The decrease in uncertainty is small, suggesting that the additional assay did not provide much additional information when predicting infection.We can investigate which titer value had more influence on the probability of infection by computing That is, we compute the probability of infection given the titer values, which are easily computed based on results from (2.2).
Figure 3 shows the probability of infection given each individual's titer values using the Bivariate mixture of t-distributions.Our model seems to "trust" the Sinai titer value more, given that it predicts a high probability when the Sinai value is high, even if the Euroimmune titer value is low.Our model seems to be indeterminate around the cutoff (Sinai titer value ≈ 0.5) that was chosen by Tang and others (2022), which implies some agreement between the two methods.

Trivariate model-Phase 2
Estimates of cumulative incidences and infection fatality rates in Phase 2 are presented in Figure 5(c) and (d).Using a trivariate mixture of t-distributions with three latent groups and post-stratification, the estimated incidence proportion in Phase 2 was 6.81% (95% CrI: 5. 35-8.42).This is obviously much higher than our estimates in Phase 1, which is to be expected.The estimated infection fatality rate in Phase 2 was 0.31% (95% CrI 0.25-0.39),which is slightly lower than Phase 1.This is comparable, but slightly lower than other estimates for Canadian IFR (∼ 0.65% from O'Driscoll and others (2021)), which is unsurprising since our study excluded those in nursing homes.
The incidence proportion in Phase 2 was comparable across age groups, with the IFR again trending upwards with age.In Phase 2, see that each age category had a lower IFR than Phase 1.Our estimates of IFR by age were highly comparable to international estimates (see Table S3 of O'Driscoll and others (2021)).
The cumulative incidence and IFR's for Phase 1 and Phase 2 combined are shown in Figures 5(e) and (f).The cumulative incidence estimate is 8.41% (95% CrI: 7.04-9.92),with an IFR of approximately 0.31% (95% CrI: 0.27-0.37).The patterns in incidence and IFR by age are highly similar to those in Phase 2 alone.The probabilities of infection given the titer values of each participant are shown in Figure 4. Since our outcome is three-dimensional, three separate plots are required.Blue dots in the bottom right corner of Figures 4(a) and (b), and the top right corner of Figure 4(c), identify participants that are likely showing immunity due to being vaccinated, as vaccinated individuals should be low on NP and high on the other two.We see that our model tends to "trust" the NP and SmT1 titers more when predicting infection.People who are high on NP or SmT1 tend to have higher probabilities, while people with only high RBD values tend to have a low probability of infection.

Cumulative incidence and IFR by province
One advantage to the methods presented in this article, is that once we have posterior samples for infections and deaths outside of long-term care, we can break the results down by any covariate combination that we so please.Figure B2 shows the cumulative incidence and infection fatality rates by province in both phases.In Phase 1, Ontario had the highest point estimate for cumulative incidence, and Quebec had the highest IFR.Our estimated IFR in Ontario was 0.27% (95% CrI: 0.19-0.41) in

Probability Infected
Fig. 3. Probability of infection given each individual's titer values using the bivariate mixture of t-distributions in Phase 1.Each dot represents a participant in the Ab-C study.On the x-axis is the titer value that was used in the univariate model.On the y-axis is an second SmT1 protein assay.A red dot indicates that this model predicts a high probability of infection, with blue being a low probability of infection, and purple being indeterminate.
Phase 1, which is much lower than the estimate given by Public Health Ontario at the time (2.8% as of May 17, 2020(Public Health Ontario, 2020)).Although these numbers are not directly comparable, as our estimates do not include people in nursing homes, this likely doesn't account for all of the disparity.Public Health Ontario's number was estimated based on IFR numbers obtained using individual-level data from China (Verity and others, 2020) and was adjusted to match the age distribution of Ontario.We therefore remain somewhat skeptical of the numbers presented in Public Health Ontario (2020).When comparing our overall estimate to the estimate in Verity and others (2020) (0.657%, CI 0.389-1.33),our number is much more comparable.
In Phase 1, Quebec had a very high reported number of deaths, which was not proportional to the number of long-term-care home deaths, resulting in a high IFR.In Phase 2 Quebec's incidence went up substantially, while the IFR dropped significantly.In Phase 2, the credible intervals for both cumulative incidence and IFR overlap between provinces.
Estimates by age group in each province are shown in Figure B1.In all provinces, incidence in Phase 1 was highest in 18-to 39-year-old, and lowest in 70+ year old.With the exception of Alberta, this pattern did not hold in Phase 2, as incidence seems to be less predictable as a function of age.In each province and phase, IFR reliably trends upwards with age.
Estimates of incidence by ethnicity in each province are shown in Appendix C. In both phases, the white and indigenous groups have comparable incidences in each province.The "not white or indigenous" group (NWoI) has relatively high incidence in Ontario and British Columbia in both phases, and low incidence in the Atlantic region and Saskatchewan in Phase 2. Note that estimates of IFR are not reported by ethnicity, as we do not have (even aggregate) COVID-19 deaths data by ethnicity.

Discussion
In this article, we developed an approximate Bayesian approach to estimate cumulative incidence and IFR using a multivariate mixture of t-distributions.We used data from the Ab-C serosurvey to estimate the probability that individuals were infected with COVID-19 based on their titer values and covariate combinations, and used post-stratification to generalize our results to the Canadian population that resides outside of long-term care.Our Phase 1 cumulative incidence estimates were slightly lower than previous estimates based on fixed cutoffs.Our Phase 2 estimate was higher than the one in the literature.Furthermore, our method accounts for uncertainty in both the number of infections and the number of deaths, and is essentially a cut model where we do not allow the deaths data to affect the estimation of the number of infections.
Estimates of incidence by age do not show any noteworthy patterns other than a slight upward trend in Phase 1.In both Phase 1 and Phase 2, IFR increased with age.Furthermore, IFR was higher in Phase 2 than Phase 1 in each age group, although the overall IFR was the same.
The main strength of our approach is that it uses the exact titer values as outcomes in our model, as opposed to a discretized version which discards information.Furthermore, we can leverage multiple titer values in a multivariate model to improve estimated probabilities of infection, while being able to differentiate between previously infected and vaccinated individuals.An additional strength of our study is that error is correctly accounted for in both the calculation of the number of infections and deaths outside long-term care, and consequently, IFR.We have not considered under-reporting of COVID-19 deaths, and we acknowledge this could be a potential issue.One way to accommodate this would be to make an assumption that a known proportion of COVID-19 deaths go unreported and include draws of unreported deaths in each posterior sample of the IFR.In the absence of information of what this proportion should be, we have treated the reported death counts as correct with the caveat that the estimated IFRs only refer to deaths directly attributed to COVID-19.
A methodological limitation of this study is that we are assuming that both the infected and uninfected groups follow a multivariate t-distribution.This may not be the most appropriate distribution for these data, and perhaps a distribution that allows for skewness may be more appropriate.Although our model makes no direct assumption about sensitivity and specificity, these two quantities are directly related to the length of the tails of the t-distributions for any given cutoff.However, the parameters of the multivariate t-distribution are estimated from the data, so our method is analogous to a non-discretized version of the methodology presented in Gelman and Carpenter (2020), where sensitivity and specificity are parameters to be estimated in the model.
A second limitation is that some responses to the survey happened before the end of the survey, such that they could have returned a "negative" dry blood spot sample and subsequently gotten infected.This would lead to slightly underestimating incidence (overestimating IFR).On the other hand, there is a time lag between infection and death, so if we counted infections up until the end of September 2020, then those infected people could experience death several weeks later and not be recorded.However, given that the vast majority of participants returned their blood samples study more than two weeks prior to each Phase's end date (see Figure G1), we figured that accounting for this time lag was not necessary.
A third limitation of our methodology is that we were unable to incorporate information regarding Phase 1 infection probabilities (from SmT1 protein) into our Phase 2 estimates of incidence.Although Phase 1 and Phase 2 SmT1 protein titer values are not directly comparable (due to the assays being calibrated slightly differently), we recognize that there is some potential to treat the SmT1 titer longitudinally from Phase 1 to Phase 2. However, we figured that this would require a drastic reworking of our current model and inference framework, and thus we deemed it out of the scope of this article.The potential consequence of this is a slight underestimate of cumulative incidence at the end of Phase 2, as some "infected" individuals in Phase 1 may be overlooked by solely looking at Phase 2 titer values (see Appendix E for a sensitivity analysis and discussion), with waning being one potential cause.However, Tang and others (2022) show that roughly 80% of people retain their "seropositivity" status from Phase 1 to Phase 2. The exploratory analysis presented in Appendix F suggests that waning may not be a large issue.It is also possible that people who were infected in Phase 1 were reinfected in Phase 2. Reinfected individuals will likely have titer values that are exceptionally high, which would affect our estimates of the parameters for the mixture distributions.This also would make the interpretation of incidence murky, as reinfected people only count as one infection.We suspect this to be more of an issue when estimating incidence/IFR at later dates, as the number of reinfected individuals in our study is expected to be very small.
A direction for future work will be to apply these methods to upcoming Phase 3 and Phase 4 data that includes a much larger vaccinated population, as well as breakthrough infections in people who have been vaccinated.Furthermore, we will have to account for reinfections as the populations' immunity wanes and new variants emerge.This could involve a longitudinal mixture model or Hidden Markov Model.Furthermore, an improved serosurvey design and associated statistical methodology that allowed for estimation of incidence (and consequently, IFR) in real-time would be an ambitious and highly interesting area of future research.
This study only looks at humoral immune response, but cellular immunity also plays an important role in the immune response to SARS-CoV-2.Other studies have evaluated the effects of T-cell response in infected people (Guo and others, 2022;Moss, 2022).An interesting line of future work would be to develop similar methods to incorporate T-cell response data into estimates of incidence and IFR.
Although we focused on SARS-CoV-2 infections and deaths in this paper, the methods presented can be applied to a variety of outcomes for any infectious disease of interest in which serosurvey data are available.There are plenty of potential extensions to this model that can be implemented to suit a variety of problems in epidemiology and biostatistics.
Meaning that the d-dimensional integral can be reduced to one-dimensional integral.Since we are interested in the KLD between a multivariate T and a multivariate normal, we substitute ν = 200, and compute this integral numerically as a function of ν.We then approximate the distance, δ(ν) = √ 2 • D KL with a polynomial.For example, δ(ν) for the bivariate model was δ(ν) ∝ ν −1.3 .We then say that π(δ(ν)) ∼ exp(λ) with λ = − log(α)/δ(U), where α and U are chosen such that our prior belief is that there is a 50% chance that ν is greater than 30.

D. Prior distributions Table D1. Priors used in Phase 1 univariate model
Parameter Prior   As mentioned in the main text, we require informative priors for computational reasons.In this Section, we justify our choices of informative priors for the Phase 2 trivariate model.We note that these priors are not very sensitive to • μ 0 corresponds to the means of the "not infected" group.The first element of μ 0 corresponds to the mean NP titer values in "not infected individuals".Alongside the NP titer values collected from the survey, the lab also provided us with "control" samples of known negatives.We found that the vast majority of the control samples fell between −2.5 and −1 on the log scale.Therefore we are very confident that the mean of NP titer values from "not infected" people should be in this range.Therefore, we applied the conservative but informative prior N(−1.75, 0.25).Similar reasoning was used for the prior on the second element of μ 0 , corresponding to the mean of RBD titer values in "not infected" people.
• When setting priors for the "not vaccinated, not infected" and "infected" groups based on Smt1 titer values, we used the corresponding posterior distributions from Phase 1.Although the tests are calibrated slightly differently, and there will be a small amount of waning between phases, we do expect these values to be somewhat similar.

•
To determine the posterior of the mean of the infected group for NP titer values (first element of μ 1 ), we consider the fact that any titer value above mean+3SDs is likely a previous infection (this is how the cutoff was chosen in Tang et al.).We then ensure that the bulk of the prior distribution for the positive N group was above this value, with some overlap.We used similar reasoning for the RBD positive group.

•
To determine the prior for the mean RBD/SmT1 titer values in the vaccinated groups, we used similar reasoning as above, trying to ensure that the prior has most of it's mass above that of the infected group's with some overlap.

•
We used a weakly informative prior for k using the LKJ distribution with shape = 0.5.This provides a roughly uniform distribution across positive-semidefinite 3 × 3 matrices.We then add additional information for each off-diagonal by multiplying by normal densities.For instance, if we suspect that the correlation between two parameters should be positive (i.e., off-diagonal element c of k is positive), we multiply the prior for c by N(c|0.5, 0.2) which gently encourages the correlation to be positive but still has mass below 0.

E. Longitudinal sensitivity analysis
As mentioned in Section 4, there is potential for these data to be used in a longitudinal way, as roughly 6300 survey participants had titer values in both Phase 1 and Phase 2. SmT1 titer values are measured in both phases, while RBD and NP are only available in Phase 2. Thus in this section, we wanted

F. Potential waning immunity
It is well known that antibodies decay over time, but how much this effects our results is unclear.Unfortunately, we cannot simply compare antibody results from Phase 1 to Phase 2, as these numbers are not directly comparable.Instead, we compared the Phase 1 and Phase 2 probabilities of participants who had a high probability of infection in Phase 1.A comparison of these predicted probabilities is shown in Figure F1.It appears that those with large predicted probabilities in Phase 1 still had large predicted probabilities in Phase 2. This is largely because in Phase 2, we see relatively lower parameter estimates for the means of the infected group.This likely will also make estimates of infection noisier, as the variance will also increase.So although our model does not appear to be underestimating Cumulative Incidence due to waning, waning likely does cause more uncertainty when predicting infection.More work needs to be done to confirm this assertion.

Fig. 1 .
Fig. 1.Mixture of t-distributions for the Phase 1 univariate model fit to the SmT1 titer values.The posterior median for each parameter is used.The vertical dashed line represents the cutoff used in Tang and others (2022).Keep in mind that this plot does not display uncertainty in the model parameters of the t-distributions.

Fig. 4 .
Fig. 4. Probability of infection given each individual's titer values using the trivariate mixture of t-distributions in Phase 2. A red dot indicates that this model predicts a high probability of infection, with blue being a low probability of infection, and purple being indeterminate.In theory, participants who have never been infected or vaccinated should have low values for all three titers.Vaccinated, but never infected individuals should have high SmT1 and RBD, but low NP, and infected individuals have high values for all three.

Fig. 5 .
Fig. 5. Incidence/IFR by age (years) for each time period.Posterior medians are used as point estimates, and the 2.5th and 97.5th posterior quantiles define the error bars.

Fig
Fig. B1.Incidence/IFR by age (years) in each province.Posterior medians are used as point estimates, and the 2.5th and 97.5th posterior quantiles define the error bars.

Fig
Fig. B2.Incidence/IFR by province.Posterior medians are used as point estimates, and the 2.5th and 97.5th posterior quantiles define the error bars.

Fig. B3 .
Fig. B3.Incidence by ethnicity in each province.Posterior medians are used as point estimates, and the 2.5th and 97.5th posterior quantiles define the error bars.

Fig. E1 .
Fig. E1.Comparing infection probabilities between the bivariate (Phase 2) and longitudinal (Phases 1 and 2) models.In (a) and (b), blue points indicate low probability of infection, while red indicates a high probability of infection.In (c), blue indicates agreement between the two models, while a more red color indicates a high estimated infection probability from the longitudinal model.

Fig. F1 .
Fig.F1.Phase 1 versus Phase 2 predicted probabilities for participants who had large predicted probabilities in Phase 1. Points above the red line indicate that Phase 1 predicted probability was higher.

Fig. G1 .
Fig. G1.Distribution of dates of samples received for Phase 1 and Phase 2.

Table D3 .
Priors used in Phase 2 mixture model

Table D4 .
Priors used in deaths module (Section 2.4.2)