Refining Reproduction Number Estimates to Account for Unobserved Generations of Infection in Emerging Epidemics

Abstract Background Estimating the transmissibility of infectious diseases is key to inform situational awareness and for response planning. Several methods tend to overestimate the basic (R0) and effective (Rt) reproduction numbers during the initial phases of an epidemic. In this work we explore the impact of incomplete observations and underreporting of the first generations of infections during the initial epidemic phase. Methods We propose a debiasing procedure that utilizes a linear exponential growth model to infer unobserved initial generations of infections and apply it to EpiEstim. We assess the performance of our adjustment using simulated data, considering different levels of transmissibility and reporting rates. We also apply the proposed correction to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) incidence data reported in Italy, Sweden, the United Kingdom, and the United States. Results In all simulation scenarios, our adjustment outperforms the original EpiEstim method. The proposed correction reduces the systematic bias, and the quantification of uncertainty is more precise, as better coverage of the true R0 values is achieved with tighter credible intervals. When applied to real-world data, the proposed adjustment produces basic reproduction number estimates that closely match the estimates obtained in other studies while making use of a minimal amount of data. Conclusions The proposed adjustment refines the reproduction number estimates obtained with the current EpiEstim implementation by producing improved, more precise estimates earlier than with the original method. This has relevant public health implications.

The wide-ranging impacts of the current coronavirus disease 2019 (COVID- 19) pandemic have highlighted the threats posed by infectious diseases to our society. The impacts of such threats are not exclusively confined to the domain of public health: in addition to the millions of confirmed deaths worldwide, COVID-19 has caused severe economic and societal disruption around the world. As such, it is crucial that the properties of all emerging pathogens are adequately characterized as soon as possible, making the best use of the limited data that are typically available in the early phases of an epidemic. In particular, understanding the transmissibility of a novel pathogen allows for the evaluation of the risks involved, to inform public health decision making and the implementation of interventions. In this context, the basic and instantaneous reproduction numbers, R 0 and R t , represent key epidemiological parameters. R t represents the average number of secondary infections caused by a single infectious individual at time t, and R 0 represents the average number of secondary infections generated by a typical infection in a completely susceptible population. These parameters relate to important quantities such as the final size of an epidemic [1] and the critical herd immunity threshold [2,3] and are essential to project the expected future number of cases, hospitalizations and deaths. In the last couple of decades, several methods have been developed to estimate R t . EpiEstim, developed by Cori et al [4], has been recommended as one of the best methods for near real-time estimation of R t to detect changes in transmissibility patterns [5]. However, EpiEstim and other commonly used statistical methods can suffer from systematic overestimation of the basic reproduction number in the early stages of an epidemic [6]. In the initial epidemic stages, EpiEstim is outperformed by simpler inference methods based on exponential growth [7], which produce smaller bias and better quantify uncertainty in R 0 estimates. Following the theory of exponential growth [7], we propose an adjustment to EpiEstim to account for missing initial generations of infections and use simulated data to test its effectiveness. We show how the adjustment entails a reduction in the bias of both the R 0 and early R t estimates produced by the original EpiEstim method. Finally, we compare the estimates produced with and without the proposed adjustment on early COVID-19 data in Europe and in the United States.

METHODS
Several methods, based upon different frameworks, have been developed to estimate the reproduction number. One of the simplest methods assumes exponential growth of the number of new infections, which can be characterized by applying a simple linear regression on the log-transformed incidence data. Wallinga and Teunis [8] base their methodology on determining likelihoods of chains of infections, White and Pagano [9] use branching process theory, whereas Bettencourt and Ribeiro [10] base their approach on SIR differential equations. In this work we focus on EpiEstim, the method developed by Cori et al [4], which has been recommended for near real-time estimation [5] and has been applied extensively during the ongoing and past epidemics [11].
Despite differences in the mathematical formulations, all methods described above infer R t or R 0 from case incidence data using assumptions on the generation interval or the serial interval distributions. The generation interval is defined as the average length of time between the moment an individual becomes infected and the moment in which they infect a secondary case. Similarly, the serial interval corresponds to the difference between symptom onsets of the primary case and symptom onset of the secondary case. Under weak assumptions, the distributions of the 2 intervals are equivalent [4]. Therefore, we here focus on the generation interval distribution, denoted w.

Exponential Growth
One of the simplest methods to characterize the speed of an epidemic, as measured by the growth rate r, is to fit the logtransformed incidence data by linear regression. The growth rate r can then be used to estimate the reproduction number R 0 if the generation interval w is known [12]. The assumption of exponential growth is justifiable in a first epidemic period where the proportion of susceptible individuals in a population is large. As more individuals get infected and immunity accumulates in the population, the growth of the number of new cases slows down and deviates from being exponential. A rule of thumb to identify the time window of the exponential growth, proposed by White and Pagano [9] and based on the theoretical work of Ball and Donnelly [13], is that the cumulative number of infected individuals does not exceed the square root of the population size.

EpiEstim
Cori et al [4] model observed infections as a Poisson process, where the mean is defined via 2 quantities: the effective reproduction number R t , and population infectiousness Λ t = ∑ s = 0 n I t-s w s . The reproduction number is assumed to remain constant over sliding windows of time with length τ, allowing for data aggregation over time and reduced variance. Assuming a Gamma prior distribution on R t,τ , the posterior distribution is Gamma distributed with parameters: where a and b refer to the shape and scale parameters, respectively. Considering τ = 1 and uninformative priors (ie, letting a prior , b prior tend to 0) yields the posterior mean: Thus, we can effectively think of EpiEstim's estimates of R t as closely related to the ratio between the number of observed infections on day t over the population infectiousness on the same day. Given enough data, this estimator can accurately identify R t and detect abrupt changes in transmissibility [5]. However, the method can suffer from systematic bias in the initial period of estimation, as described by O'Driscoll et al [6]. When the first chains of infections are not observed, the estimator will tend to attribute all new cases to the first observed cases, overestimating the reproduction number.

EpiEstim: Proposed Adjustment
We propose an adjustment to account for unobserved initial infections and exponential growth similar to the one developed in Dorigatti et al [14]. Specifically, we assume that the epidemic is growing exponentially to back-impute infections in the period prior to the fist observation. Note that this assumption is theoretically justified in the early stages of an epidemic [7], and exponential growth inference methods are between the most accurate in this initial period, as observed in O'Driscoll et al [6]. Our procedure can be summarized in 3 simple steps and is visualized in Figure 1. First, we fit a linear model on the log-transformed incidence data to estimate the growth rate, as shown in Figure 1A. Second, we use this linear model to back impute incidence data, prior to the time of the first observed case. In particular, we obtain an estimate of the number of cases for S days, where S is the largest possible generation interval length, i.e., the largest value such that w s > 0. We highlight that it is not necessary to round the output of the inferred number of cases, and estimates lower than 1 should not be removed. Finally, we apply EpiEstim to the extended epidemic curve (including the back imputed incidence data).

Simulations
To evaluate the effects of our correction, we compared the performance of EpiEstim with and without the adjustment on simulated data. We additionally compared the results to those obtained by fitting a linear exponential growth model, as the performance of our correction strongly depends on the accuracy of the estimates of the growth rate. We used a stochastic We considered a large population of 10 6 individuals and epidemics initiated by 5 initial infections. We assumed 3 days and 3.5 days as the mean times spent in the exposed and infectious compartments respectively, yielding a 6.5 mean generation interval commonly used to model COVID-19 [6]. To account for imperfect reporting, we simulated 2 types of issues commonly affecting the observed data: underreporting and unobserved initial generations of infections. We simulate unobserved initial generations by considering the simulated data from day 15 to day 28, corresponding to two generation intervals worth of data after having missed the first two generations. Furthermore, we simulated underreporting by considering a constant reporting rate ρ in {0.15, 0.3, 0.5, 1}. Daily observations were sampled as binomial realizations of true incidence with success probability equal to the reporting rate ρ.

Method Comparison
We fitted the exponential growth, EpiEstim and adjusted EpiEstim methods to the simulated data assuming a Gamma distributed generation interval matching the mean and variance of the generating process (ie, 6.5 mean generation interval). Figure 2 shows the effect of our adjustment when the reporting rate is 50% and shows that the initial bias observed in the estimates obtained with EpiEstim is strongly reduced by the proposed adjustment, and the mean estimates are comparable with those produced by the exponential growth method. These trends are observed for each value of ρ, implying consistency in the estimates obtained with different reporting rates ( Figure 2 and Supplementary  Figures 4-6). Our adjustment not only improves the accuracy of the central estimates of R 0 , but also improves uncertainty quantification. Average credible/confidence interval widths and coverage are reported in Figure 3. Coverage remains constant across different values of the reporting rate for the estimates obtained with the exponential growth and adjusted EpiEstim methods. On the other hand, the large bias observed in the estimates obtained with EpiEstim implies that as credible intervals get narrower, coverage decreases dramatically ( Figure 3B and 3C). The EpiEstim adjustment proposed in this article produces generally narrower credible intervals as compared to the exponential growth method, at the price of a slight decrease in coverage. This trade-off is in favor of our adjustment for values of R 0 larger than 2, when coverage among the 2 methods is similar. Furthermore, we highlight that the proposed adjustment does not influence later estimates of R t produced by EpiEstim (Figure 1).

Impact of Missed Generations
Beyond simulating undetected cases, reflecting a surveillance system which may be unprepared or unaware of a newly unfolding epidemic, we investigated the impact of the number of unobserved generations on the estimates obtained with our adjustment using simulated data generated under a scenario with perfect reporting rate (ρ = 1) and R 0 = 2.5. From each epidemic trajectory, we considered 3 different left truncations of the data to account for 0, 1, and 2 unobserved generations. We then applied EpiEstim with and without the proposed adjustment using biweekly time windows starting on weeks 0, 1, and 2. Figure 4 shows the distribution of the mean estimates for the 100 simulations. Each row identifies the number of unobserved generations, whereas on the x-axis we show the time window used to obtain the estimate. Figure 4 also shows that the adjustment is particularly useful when larger numbers of initial generations are unobserved. Although the proposed adjustment produces comparable estimates to those obtained with EpiEstim when all generations are observed (top row of Figure  4), its dependence on the exponential growth method introduces a layer of stochasticity that increases the variance of the mean estimates. On the other hand, when 1 or 2 generations are unobserved, the proposed adjustment adequately compensates for EpiEstim's bias, and the median estimates become closer to the true R 0 value used to simulate the data.

Application to Reported COVID-19 Data
In addition to validating the proposed adjustment on simulated data, we applied it to real-world COVID-19 incidence case data reported in the John Hopkins Center for Systems Science and Engineering database [15,16]. During the initial phases of the outbreak, surveillance systems of several countries were unprepared to detect infectious individuals, and it is likely that the first generations of infections were not observed, justifying our back-imputation adjustment. We selected Italy, Sweden, United Kingdom, and the United States as case studies. For each country, we fitted the log-transformed incidence case count reported for the first sequential 7 days of sustained transmission (no days with 0 new cases in the selected time window) with a linear regression model. We then inferred the number of unobserved cases before the selected time window and applied EpiEstim with and without adjustment to the observed and imputed data, using weekly sliding windows and assuming a generation interval with mean 5.7 days and standard deviation of 1.72 days [17]. Figure 5 shows that the adjustment lowers the estimates of R 0 in every scenario, suggesting a role for unobserved generations of infections in overestimating the early R 0 estimates of SARS-CoV-2, which in turn also affect the early estimates of R t . We obtained average R 0 estimates of 3.6 with 95% credible interval (CrI) (2.8, 4.6) in the United Kingdom, 5.2 (95% CrI 4.9, 5.6) for Italy, 8.7 (95% CrI 7.8, 9.6) for the United States, and 3.9 (95% CrI 3.5, 4.3) in Sweden.

DISCUSSION
We propose a correction for the systematic overestimation of R 0 that occurs in the early stages of an epidemic when using EpiEstim and other common inferential methods, utilizing a back-imputation procedure that relies on exponential growth. The proposed correction aims to account for unobserved generations of infections. For this reason, we deem it to be most applicable in scenarios where generations of infections may have been missed due to emergency situations and limited testing, and generation intervals are relatively short. In practice, the adjustment will prove especially useful to evaluate transmissibility of diseases that are either asymptomatic or cause mild symptoms for a large proportion of infected individuals. Similarly, the adjustment may prove applicable to diseases characterized by long time lags between infection to symptom onset, especially if infectiousness develops significantly earlier than symptoms. In this article, we demonstrate the application of this adjustment to the EpiEstim method, although it can be applied to other statistical methods where this bias may occur. The resulting adjusted EpiEstim method combines the best features of EpiEstim and the exponential growth method, which is more adequate for the early phases of an epidemic. In particular, the long-run R t estimates of the original EpiEstim method are preserved, whereas the initial bias observed in the estimates is reduced by the proposed adjustment. This was confirmed in all scenarios of our simulation experiments, independently of the reproduction number and the reporting rate values used to simulate the data. The adjusted estimates of R 0 strongly outperformed the estimates obtained with the original EpiEstim method in terms of bias and coverage, and produced tighter 95% credible intervals. Our results show that the R 0 estimates obtained with the adjusted EpiEstim and linear exponential growth method are very similar. This is likely due, in part, to the fact that the proposed adjustment relies on linear regression, which is used for the back imputation of the unobserved generations of infections. The proposed method does not currently incorporate uncertainty in the growth rate estimate nor in the imputed cases. Further work is required to understand how the uncertainty in our correction may be best propagated throughout the R 0 or R t estimation process.
The effect of our adjustment was evident when working with early COVID-19 data from Italy, the United Kingdom, Sweden, and the United States. The adjusted estimates were significantly lower than the estimates obtained with the original EpiEstim method and were found to be largely consistent with estimates derived in Ke et al [18]. This is especially encouraging considering that our adjusted R 0 estimates were obtained by making use of 7 days' worth of data, while those of Ke et al [18]. were obtained on both case and death time series data spanning months. This suggests that our method can yield improved precision with limited information, which may prove valuable in emerging epidemics. Although several papers and meta-analyses reported estimates of the exponential growth rate for the 4 countries [19][20][21], the parametrization of the generation interval used and the reproduction number estimates were often lacking [22] thus hindering further comparisons. Furthermore, the adjustment alone was not enough to explain the estimated decrease in R t in the study period considered (see Figure 5), possibly suggesting that changes in individual behavior and governmental interventions lowered transmissibility of the disease. However, other biases may be playing a role, such as changes in reporting rates, overestimation due to importations [23], or model misspecification.
Critically, the results are sensitive to the choice of EpiEstim parameters, such as the generation interval and the length of the time window used. As expected, longer generation intervals produce larger reproduction number estimates, and we observe larger discrepancies between the estimates obtained with the original and adjusted EpiEstim methods, even if the coefficient of variation is kept constant (Supplementary Figures 1-3).
Concerning the choice of the sliding window, the larger time windows produce smoother estimates, which reduces the impact of the imputed cases. This means that our adjustment has a much smaller effect when considering sliding windows covering 2 generation intervals worth of data or more.
To test the sensitivity of the estimates to the exponential growth assumption, we explored the performance of the method applied to case data obtained assuming sub-exponential growth [24,25]. The results of this sensitivity analysis, which are reported in the Supplementary Information, show that our adjustment performs well in cases of moderate departures from the exponential growth assumption. Here we have used a linear exponential growth method to infer missing initial generations of infection, improving the accuracy of early R 0 and R t estimates produced by commonly used statistical methods such as EpiEstim. Our analysis shows how a simple adjustment can reduce the initial bias in reproduction number estimates in a newly emerging epidemic when the interpretation of reproduction number estimates is crucial for public health decision making.

Supplementary Data
Supplementary materials are available at Clinical Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author. Potential conflicts of interest. I. D. reports grants from Wellcome Trust, Medical Research Council, Janssen R&D, and Natural Environment Research Council, outside the submitted work. All other authors report no potential conflicts. All authors have submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Conflicts that the editors consider relevant to the content of the manuscript have been disclosed.