## Abstract

The basic reproduction number is a key parameter determining whether an infectious disease will persist. Its counterpart over time, the effective reproduction number, is of value in assessing in real time whether interventions have brought an outbreak under control. In this paper, we use theoretical arguments and simulation to understand the relationship between estimation of the reproduction number based on a full continuous time epidemic model and 2 other recently developed estimators. All these methods make use of “epidemic curve” data and require assumptions about the generation time distribution. The 2 simplest estimators do not require information about the—often difficult to obtain—population size. The simplest estimator is shown to require further assumptions that are rarely valid in practical settings and to produce severely biased estimates compared to the others. Furthermore, we show that in general the parameters of the generation time distribution and the reproduction number are non-identified in the early stages of an incomplete outbreak. On the basis of these results, we recommend that, wherever possible, estimation of the basic and effective reproduction numbers should be based on a well-defined epidemic model; moreover, if external information is available then it should be incorporated in a Bayesian analysis.

## INTRODUCTION

The emergence of the H1N1 pandemic influenza virus in April, 2009, highlighted the vulnerability of the world's population to new infectious disease threats. While the virus is less severe than originally feared (Presanis *and others*, 2009), its rapid global spread highlights the need to quickly understand the epidemiological characteristics of any new disease. One of the key quantities determining the speed and the extent of spread is the basic reproduction number, *R*_{0}, defined as the expected number of people that a typical infected person will infect in an entirely susceptible population. It has the threshold property that if *R*_{0} > 1, then the disease may carry on spreading indefinitely, whereas if *R*_{0} < 1, it will eventually die out without an intervention being necessary. For a homogenously mixing population, transmission needs to be reduced by a factor 1 − 1/*R*_{0} to control an outbreak. Higher values of *R*_{0} also mean that an outbreak will spread more rapidly (for a given timescale of each individual's infectivity); and, when there is complete immunity after recovery, *R*_{0} determines the number of people likely to be affected (the infection attack rate). The effective reproduction number *R*(*t*) is defined as the expected number of infections caused by a typical person who is infected at time *t* following the start of the outbreak.

Traditionally, estimation of the basic reproduction number for an infectious disease was undertaken in one of the 3 ways: by estimating the biological and behavioral quantities upon which it depends; by direct estimation from the early exponential growth of an outbreak (Lipsitch *and others*, 2003); and, for endemic diseases, through its relationship with the age at infection and life expectancy (Anderson and May, 1991). More recently, with the use of quantitative infectious disease epidemiology in real time to inform policy decisions (Lipsitch *and others*, 2003; Riley *and others*, 2003), more robust estimation methods have been proposed. For any outbreak, if the population at risk is known, then one can specify an epidemic model and use either likelihood-based method or Bayesian method to obtain parameter estimates. This approach has been adopted for outbreaks in animal populations where such denominator data are commonly available (Boender *and others*, 2007; Ferguson *and others*, 2001; Haydon *and others*, 2003). However, such data are often difficult to obtain, and inference is based on observed cases only.

During the Severe acute respiratory syndrome (SARS) outbreak in 2003, a simple method for retrospective reproduction number estimation based on epidemic tree reconstruction was proposed that does not require denominator data (Wallinga and Teunis, 2004). An extension of this method to jointly estimate the generation time and reproduction number has also been applied (Garske *and others*, 2007). More recently, another method not requiring denominator data was proposed; it is based on a branching process model and in theory is suitable for simultaneous real-time estimation of the reproduction number and the parameters of the generation time distribution (White and Pagano, 2008).

In this paper, we compare 3 methods for jointly estimating the reproduction number and generation time. These methods are based on (i) a fully specified epidemic model; (ii) a continuous-time version of the method of White and Pagano (2008); and (iii) the semiparametric method of Garske *and others* (2007). We first highlight the assumptions required by the last 2 methods by comparing the partial likelihoods of each to the likelihood under the full epidemic model. Then by simulation, we assess the sensitivity of estimates from both the simple methods to failure of these assumptions. In this way, we make a coherent link between all 3 methods and assess how useful the 2 simple methods are in practice.

## PRELIMINARIES

### Notation

We assume a closed, homogeneous population of *N* people (or units). The outbreak begins at time 0 with the simultaneous infection of *k* people by external sources, where *k*≪*N*. The epidemic develops in continuous time with *n* people infected by the end of the observation period at time *T*. Surveillance procedures are assumed to yield the ordered “epidemic curve” *t* = (*t*_{1},…,*t*_{n}), where *t*_{i} is the infection time of case *i*, and *t*_{1} = ⋯ = *t*_{k} = 0.

### The generation time

The generation time is defined as the interval between a case becoming infected and its subsequent infection of another case (Svensson, 2007). Each time interval *u* between the infection times of a primary and secondary case may be attributed as the generation time of the secondary case, so that each case other than the index case(s) has a uniquely defined generation time; the generation time distribution has probability density function *w*(*u*;*θ*) with parameter(s) *θ*. In practice, it is often the onset of symptoms rather than the times of infection which are observed. Intervals between any specific event in the time course of infection of the primary case and the corresponding event of the secondary case (such as the onset of clinical symptoms) are known as serial intervals. The distribution of serial intervals can be defined analogously to that of the generation time. In general, the 2 differ, but for simplicity, we concentrate on the generation time distribution.

## THE REPRODUCTION NUMBER ESTIMATORS

### Semiparametric method

##### Known generation time distribution.

Wallinga and Teunis (2004) proposed an estimator and applied it to data from SARS outbreaks. As originally formulated, it assumes that the parameters *θ* of the generation time density *w* are known. By considering a likelihood which we describe in the next section, they argue that

*j*was infected by case

*i*, given the observed set of infection times. Let

*R*(

*i*) be the “individual” reproduction number for case

*i*, namely, the number of secondary cases actually generated by

*i*. For a completed outbreak, (3.1) implies that the expected value of

*R*(

*i*)given the observed infection times is

##### Unknown generation time distribution.

More recently, the above method was applied to avian influenza outbreaks, with the likelihood function of Wallinga and Teunis used to estimate the generation time distribution parameters directly from epidemic curve data (Garske *and others*, 2007).

Denote the infection tree by the vector *v*, where *v*(*j*) is the infector of *j*. If *v* and *t* were both observed, the likelihood would be

*P*(

*v*|

*θ*,

*η*) is the probability of infection tree

*v*occurring and

*η*represents parameters not involved with the generation time distribution. Under the assumption that all possible infection trees for an outbreak of a given size are equally likely, the integrated likelihood taken over all infection trees is

*and others*(2007) use (3.4) to estimate the generation time parameters

*θ*, which are then used to estimate individual reproduction numbers using (3.2). The basic reproduction number can be estimated using the first or the average taken over the first few individual reproduction numbers.

The integrated likelihood is a form of partial likelihood and consistency of estimators based on (3.4) depends on the assumptions used to obtain it. However, we show in the Supplementary Material available at *Biostatistics* online that a crucial assumption—that all possible infection trees are equally likely, given the observed infection times and outbreak size—does not generally hold, even for simple transmission models. Specifically, it fails when there is heterogeneity between cases in the probability that they will infect any other randomly chosen member of the population during the time we observe the outbreak. Therefore, it will not hold in any of the following situations:

if there is variation in contact rates between people;

if there is variation in total infectivity (e.g. if there is a constant infection rate but variable duration of infectiousness);

if the outbreak grows to infect more than a small fraction of the population;

if there is an intervention at some point to reduce infectivity;

or if the outbreak is still in progress when we observe it (i.e. incomplete).

Unless there are a number of primary cases and *R*_{0} < 1, an outbreak which is large enough for us to make reasonable parameter estimates will usually go on to infect a large proportion of the population in the absence of an intervention. In practice, a large-scale outbreak without a proportionally scaled intervention is unlikely, and so the conditions for all infection trees to be equally probable will rarely be satisfied.

#### Methods based on a transmission model

The other 2 methods we investigate are based on the likelihood for a model of disease transmission in which infectivity varies over time but not between individuals. So at time *u* after infection, each case *i* has a hazard of infecting any other specific susceptible person *j* given by *h*(*u*) = *p**w*^{*}(*u*), where *w*^{*}(*u*) is the normalized timecourse of infectivity which integrates to one and *p* is the total infectivity of *i* to others. For small *p*, *w*^{*}(*u*) is related to the generation time distribution by *w*^{*}(*u*)≈*w*(*u*) (Svensson, 2007), and the basic reproduction number can be expressed as *R*_{0} = (*N* − 1)*p*. The model differs from the more widely used susceptible-infectious-recovered epidemic model in 2 ways: individuals do not explicitly enter a recovery state unless the generation time distribution has bounded support and total infectivity does not vary between people.

Ignoring saturation effects stemming from the finite population size, the total number of infections a primary case causes is Poisson distributed with mean *R*_{0}. The finite population effect is 2-fold: first, through the depletion of susceptibles, the number of secondary cases per primary case is reduced and second, when there are many infectious cases, competition to infect the remaining susceptibles shortens the actual generation time distribution (Svensson, 2007). In the remainder of the paper, we take the generation time distribution to mean the distribution in the absence of competing sources of infection or equivalently (for a large enough population) the normalized infectivity curve.

###### The model likelihood.

The likelihood for an outbreak observed up to time *T* is that for a survival model where each susceptible person is subject to a competing risk of infection from the currently infectious cases. Parameterizing in terms of *R*_{0}, the log-likelihood is

*W*

^{*}(

*u*;

*θ*) = ∫

_{0}

^{u}

*w*

^{*}(

*v*)d

*v*and

*ℓ*

_{w}(

*θ*) = ∑

_{j = k + 1}

^{n}log(∑

_{i = 1}

^{j − 1}

*w*

^{*}(

*t*

_{j}−

*t*

_{i};

*θ*)).

Consider the likelihood term-by-term. The second term *ℓ*_{w}(*θ*) is almost equal to the log of (3.4), the partial likelihood for *θ* devised by Wallinga and Teunis (2004). The latter involves the density function *w* rather than the normalized hazard *w**, which is a relatively minor difference when *p* is small. More seriously, however, the third and fourth terms of (3.5) also depend on *θ* but are ignored in (3.4). If *n*≪*N*, then the third term is negligible; if the outbreak is complete, with *W*^{*}(*T* − *t*_{n};*θ*)≈1, then the fourth term is independent of *θ*. Hence, when both these conditions are satisfied, for this model with the same time course of infectivity for different cases, the full model log-likelihood for *θ* reduces to *ℓ*_{w}(*θ*). This confirms the argument in Section 3.1 that these are the conditions under which (3.4) is correct.

###### Infinite population likelihood.

If the outbreak is small relative to population size *N* then we can ignore saturation effects and write down a likelihood that does not depend on *N*. Taking the limit as *N*→*∞* of (3.5) for fixed *R*_{0} gives log-likelihood

*W*(.) is the cumulative distribution function of

*w*(.). Note that

*W*(

*u*) =

*W*

^{*}(

*u*) in the limit for fixed

*R*

_{0}as

*N*→

*∞*and

*p*→0. White and Pagano (2008) recently derived a discrete-time form of this likelihood and assessed its properties when fitting to a still-growing outbreak. Herein, we refer to (3.6) as the “infinite population likelihood,” and (3.5) as the “finite population likelihood.” The corresponding estimators when there have been interventions to reduce transmission are given in the Supplementary Material at

*Biostatistics*online.

### SIMULATION STUDIES

We carried out simulation studies to answer the following 2 questions:

Are the basic reproduction number and generation time parameters identifiable using data from an outbreak in its initial growth stage?

How accurately do the methods estimate the parameters?

We simulated epidemics using the probability model described in Section 3.2, with differences in the simulation setup for each question as described in the following sections. The principal generation time distribution considered is gamma with a shape parameter of 3 and so has a similar shape to the serial interval distributions estimated for SARS (Lipsitch *and others*, 2003) and H1N1 pandemic influenza (Cauchemez *and others*, 2009; Ghani *and others*, 2009). As the models were specified in continuous time, and the outbreaks analyzed either at completion or after a prespecified number of cases had been reached, the assumed timescale of the generation time distribution is arbitrary. We also considered an exponential distribution with the same mean in Section 4.1 to show how the estimates are affected if the distribution has a higher coefficient of variation. The simulated values of *R*_{0} ranged from 1 to 10: it must be at least 1 to cause an outbreak, and the range includes the estimated values for the most important recent novel or reintroduced infectious diseases such as around 1.5 for H1N1 pandemic influenza or 2.4 in schools (Fraser *and others*, 2009; Yang *and others*, 2009), 2.7 for SARS (Riley *and others*, 2003), and 4.5 to 8 or more for foot and mouth disease (Ferguson *and others*, 2001).

#### Identifiability of parameters in an incomplete outbreak

Under many simple epidemic models, an outbreak which has only infected a small proportion of the population will after the first few cases exhibit approximately exponential growth, with the growth rate determined by both *R*_{0} and *θ*. Hence, with such data, it may be difficult to jointly estimate these parameters. As we show in the Supplementary Material at *Biostatistics* online, this lack of identifiability manifests itself as a ridge in the log-likelihood, with similar values of the log-likelihood across a wide range of combinations of *R*_{0} and the scale parameter of *w*(.).

We investigated how often the basic reproduction number and the generation time parameters were jointly identifiable from an incomplete outbreak with no interventions. Exponential and gamma generation time distributions were considered, with *R*_{0} equal to 1.5 or 4. Outbreak sizes of 20, 50, 100, and 200 were analyzed; for each outbreak size, scenarios were generated where the initial number infected was 1, 5, and 10. The population size was 10 000, large enough relative to the outbreak size so that there are negligible saturation effects. We fitted the models to simulated outbreaks which were still in progress (i.e. incomplete) when they reached the required size, with 2000 outbreaks generated for each scenario; the same distribution was used for fitting as was used to simulate the data. The parameter estimates for the finite and infinite population likelihoods (3.5 and 3.6) were almost identical, hence, only the infinite population results are shown here. Further simulations in the Supplementary Material available at *Biostatistics* online show that estimates of *R*_{0} that do not account for depletion of susceptibles (infinite population models) are lower than those that do, but that the difference is mostly less than 5% when up to 5% of the population has been infected.

Figure 1(a) shows the proportion of incomplete outbreaks for which the likelihood had a maximum in the interior of the parameter space (defined by an estimated *R*_{0} of less than 2000). For a given value of *R*_{0}, the parameters are much more likely to be identifiable if the generation time has a gamma distribution with shape parameter of 3 than if it is exponentially distributed. This is likely to be because when the generation time distribution has a low variance, the initial generations cluster together in time, allowing the timescale of the generations to be estimated. We also found that a lower *R*_{0} and more primary cases increased the probability of there being an interior maximum to the likelihood. In our simulations because the data are analyzed after a prespecified number of infections, with a low *R*_{0} there are likely to be more generations observed at that point than with a higher *R*_{0}. Thus, it seems that a number of cases spread over many generations is more informative than the same number of cases produced in a few generations.

In many cases, even when there was a maximum in the interior, the log-likelihood was not much greater at this point than it was for extreme parameter values. Thus, confidence intervals for these interior estimates will be extremely wide. The proportion of outbreaks where there was an interior maximum and the log-likelihood was at least 2 greater than at extreme parameter values in the ridge of the log-likelihood are shown in Figure 1(b); these correspond approximately to outbreaks with a finite upper 95% confidence limit for *R*_{0}. The parameter vector in the ridge at which the log-likelihood was calculated was chosen according to the procedure in the Supplementary Material at *Biostatistics* online (Section S4.2 of the Supplementary Material available at *Biostatistics* online). For the scenarios considered, usually fewer than half of the outbreaks had identifiable parameters in this more strict sense.

Using a discrete-time version of the infinite population likelihood (3.6), White and Pagano (2008) found by simulation that *R*_{0} and the mean and variance of the generation time distribution were mostly well estimated, with low median bias and narrow interquartile range of the point estimates, although the bias was greater with smaller gamma shape parameters. However, their scenarios were ones where we found that the parameters are more likely to be identifiable, with low values of *R*_{0} (0.9, 1.25, or 2), gamma distributed generation times with shape parameters from 1.8 to 9, more than one index case, and outbreak sizes of more than 100.

#### Accuracy of estimators

To assess the bias of the estimators, we applied both the full epidemic model likelihood and the semiparametric method using the likelihood of (3.4) to simulated outbreaks. We here consider only a gamma distributed generation time with a shape parameter of 3: as noted, at the beginning of Section 4, this is more similar in shape to the serial interval distributions observed for SARS and H1N1 than is the exponential distribution. The outbreaks were uncontrolled for the results in the main text. Additional simulation results with an intervention to control the outbreak are included in the Supplementary Material at *Biostatistics* online. Here, *R*_{0} took the values 1–3 in steps of 0.25 and 3–10 in steps of 0.5. There was one index case in a population of 800, and 4000 data sets were simulated for each set of parameters. We only fitted to outbreaks where at least 10 people became infected. In the results, “ML” (for maximum likelihood) refers to the finite population maximum likelihood estimates, as the infinite population likelihood is not applicable to completed uncontrolled outbreaks.

For the semiparametric method (referred to here as “SP”), having estimated the generation time parameters *θ* using the likelihood (3.4), we estimated *R*_{0} as the mean from *i* = 1 to 10 of the estimated individual reproduction numbers *R*(*i*) found using the Wallinga and Teunis formula (3.2). For comparison, we also applied the same formula with the true parameter vector *θ*. This is the Wallinga and Teunis method as usually implemented (referred to here as “WT”), although in reality *θ* would not be known exactly. Results are the median of the point estimates for each method.

###### Results.

The SP method badly underestimates *R*_{0} for 2 reasons (Figure 2(a)). First, the mean generation time is underestimated (Figure 2(b)) because one of the main assumptions required for the likelihood (3.4) to be correct is violated, namely, only a small proportion of the population should be infected. Second, *R*(*i*) in (3.2) is estimating the actual number of cases that *i* infected. The larger the value of *R*_{0}, the faster the outbreak will grow relative to the generation timescale, and so even for the first few cases, depletion of susceptibles will mean that on average fewer than *R*_{0} secondary cases are infected. The estimates of *R*_{0} using the WT method show much less bias; as this differs from the SP method only in that the generation time parameters *θ* are assumed to be known, the underestimation of the mean generation time must be the primary reason for the poor performance of the SP method. However, the WT estimates are somewhat biased downward at high *R*_{0}, due to the second source of bias, which is common to both methods. The ML have negligible median bias for *R*_{0} below 7 (Figure 2). Much of the information about the value of *R*_{0} comes from the final size of the outbreak. With a larger *R*_{0}, all or almost all the population is usually infected and so it is more difficult to determine the precise value of *R*_{0} and hence the parameters in *θ*.

Simulation results with an intervention to control the outbreak after the 10th infection are included in the Supplementary Material at *Biostatistics* online. These show that the ML are biased upward for small *R*_{0} when only outbreaks above a certain size are analyzed (truncation bias), whereas SP estimates of *R*_{0} are again badly biased downward. Both the ML and SP methods performed reasonably well at estimating the reproduction number after the intervention.

### DISCUSSION

Real-time estimation of the basic reproduction number is of paramount importance if appropriate advice is to be given to policy makers during an infectious disease outbreak. Our results, however, demonstrate that it is difficult to jointly identify the parameters of the generation time distribution and the basic reproduction number during the early exponential growth phase of the epidemic. During this phase, there is a clear ridge in the log-likelihood under plausible generation time distributions and reproduction number values. However, it becomes easier to jointly identify both sets of parameters when the reproduction number is low, the generation time distribution has low variance (and hence cases appear in distinct generations), and there are several index cases infected at exactly the same time. These are the types of scenarios examined by White and Pagano (2008) and hence this explains why their results did not reveal the identifiability problems demonstrated here. An additional drawback of the method is that there may be periodicity in the reporting of cases, for example, day of the week effects, which makes inference about the generation time dangerous if it is based purely on the time series of infections in a growing outbreak. The partial likelihood described by Wallinga and Teunis (2004) and applied by Garske *and others* (2007) depends on the assumption that all infection trees are equally likely (conditional on the observed outbreak size). This implies that there must be no variation in contact rates or infectivity between people, no intervention that changes infectivity, and the outbreak must be complete. These conditions are unlikely to be satisfied in practice.

To obtain unbiased estimates of key parameters, we have a number of options. If we know the size of the population at risk then a full epidemic model likelihood can be used and both sets of parameters can be jointly estimated with reasonable precision for completed epidemics. Specification of a full epidemic model likelihood makes any assumptions clear, and it provides a natural framework for the correct handling of missing data. However, specifying the population at risk is not trivial, particularly for new diseases with uncertain aetiology and mode of transmission.

If we are sure that the outbreak has reached at most a few percent of the population, then the infinite and finite population likelihoods will be similar. However, when there are no saturation effects, we have the problem of identifiability. This can be dealt with by using prior information on the generation time distribution from other settings or similar diseases, or inferring the timescale of infectivity from data such as viral shedding. The former approach has generally been employed in estimating *R*_{0} for H1N1 influenza, with most studies either using the initial estimate of the generation time distribution provided by Fraser *and others* (2009), or assuming a generation time distribution based on prior knowledge from past influenza outbreaks. Using a Bayesian framework is preferable to assuming fixed values, but few studies have adopted this approach, although most examine the sensitivity of results to the assumed generation time distribution mean (but not variance).

In our simulation study, we used the same model to simulate and to fit to the outbreaks. In reality, it is difficult to determine whether an assumed transmission model is close enough to the true process. Hence, estimates obtained using a fully specified epidemic model should be treated with caution and the plausibility of the model assessed. We also only considered simple scenarios in which all the times of infection are known. In practice, data emerging from epidemics will be more limited. Data are typically grouped by time units such as days. We do not expect this to change our conclusions, as long as the timescale of aggregation is shorter than the timescale of the disease process. Much work has already been undertaken in developing parameter estimation methods for situations in which infection times are not observed (Gibson and Renshaw, 1998; O'Neill and Roberts, 1999). Missing data may also be more widespread, for example, with unidentified cases, particularly if the infection does not always result in clinical symptoms or signs, as is the case for H1N1 influenza. The full likelihood based on the epidemic model provides a natural framework for incorporating these aspects, although further work remains to develop estimation methods that can be implemented rapidly.

In summary, we would recommend that wherever possible, estimation of the reproduction number and the parameters of the generation time distribution are jointly undertaken using the full epidemic model rather than using simpler methods which make heavy assumptions that are often violated.

### SUPPLEMENTARY MATERIAL

Supplementary material is available at http://biostatistics.oxfordjournals.org.

### FUNDING

Bill and Melinda Gates Foundation Vaccine Modelling Initiative (49276 to J.T.G., A.C.G.); Medical Research Council (G0600719 to T.G., A.C.G.).

We would like to thank the editor, associate editor, and anonymous reviewers for their comments which greatly improved this paper. *Conflict of Interest:* None declared.

## References

*PLoS Curr Influenza*RRN1130