Abstract

Epidemic forecasts are only as good as the accuracy of epidemic measurements. Is epidemic data, particularly COVID-19 epidemic data, clean, and devoid of noise? The complexity and variability inherent in data collection and reporting suggest otherwise. While we cannot evaluate the integrity of the COVID-19 epidemic data in a holistic fashion, we can assess the data for the presence of reporting delays. In our work, through the analysis of the first COVID-19 wave, we find substantial reporting delays in the published epidemic data. Motivated by the desire to enhance epidemic forecasts, we develop a statistical framework to detect, uncover, and remove reporting delays in the infectious, recovered, and deceased epidemic time series. Using our framework, we expose and analyze reporting delays in eight regions significantly affected by the first COVID-19 wave. Further, we demonstrate that removing reporting delays from epidemic data by using our statistical framework may decrease the error in epidemic forecasts. While our statistical framework can be used in combination with any epidemic forecast method that intakes infectious, recovered, and deceased data, to make a basic assessment, we employed the classical SIRD epidemic model. Our results indicate that the removal of reporting delays from the epidemic data may decrease the forecast error by up to 50%. We anticipate that our framework will be indispensable in the analysis of novel COVID-19 strains and other existing or novel infectious diseases.

Significance Statement

Accurate forecasts constitute the first step toward controlling an epidemic outbreak. However, the accuracy of such forecasts strongly depends on input data. Here, we develop a statistical framework to identify and remove reporting delays from the epidemic data. We use this framework to de-noise epidemic data of the first COVID-19 wave in eight regions worldwide. We demonstrate that the removal of reporting delays may increase the accuracy of epidemic forecasts by up to 50%. We anticipate that our framework will be invaluable in the analysis and forecasts of existing and future epidemics.

Introduction

The COVID-19 pandemic has been crippling the world’s health, economies, and quality of life for over 3 years. We are gradually understanding that COVID-19 is here to stay. Fast mutation rates of the virus and its overwhelming spreading capability make it extremely hard, if not impossible, to eradicate (1). COVID-19 is not the first and almost surely not the last pandemic to hit humanity. Therefore, in order to better prepare and withstand other contagious diseases in the future, we need to extract as many lessons as possible from the COVID-19 pandemic.

Public awareness is, arguably, the first line of defense against any infectious disease. Efficient collection of epidemic data and accurate epidemic forecasts allow for timely containment of the spread or flattening of the curve to win time for the development of pharmaceutical treatment methods. Due to the success of network epidemiology and the broad availability of data, significant advances in epidemic modeling and forecast methods (2–6) have been achieved. However, the accuracy of epidemic forecasts strongly depends on the accuracy and timeliness of the input data.

Are epidemic data—especially in the short-time period after the onset of the epidemic—devoid of inaccuracies? Multiple sources suggest the negative answer. Indeed, public health indicators are known to be biased, because they are sensitive to fluctuations in supply and demand for diagnostic testing (7–9). Likewise, inequalities in geographic accessibility have been documented to adversely affect the coverage and timelines of health indicators (10, 11). It is now well understood that ignoring epidemic data inaccuracies may lead to delays in deploying intervention policies resulting in dire consequences of an epidemic on the population’s health (12, 13).

One prominent challenge is the existence of temporal delays in data reporting (14, 15), defined as the time difference between the event when the person was affected by the virus and the time this event is accounted for. The reporting delays may occur due to many reasons, including patient hesitancy, medical testing delays, and reporting delays by public health authorities (16, 17). Consequently, reporting delays may vary not only across different diseases but also across different regions of interest. The median diagnosis delay for malaria is, for instance, approximately 4 days (18). Similarly, the research on the Middle East respiratory syndrome coronavirus (MERS-CoV) found a time difference between the symptom onset and confirmation of approximately four days (19). Hepatitis A, measles, and mumps data are usually reported after eight days (16). Reporting delays for some diseases are significantly longer: Hepatitis B, Shigellosis, and Salmonella can take on average 2–3 weeks (16). Recent COVID-19 studies reveal significant reporting delays of infections in China (20–25), Italy (26), Germany (27, 28), Singapore (29), the United States of America (30), and the United Kingdom (14).

While there is a plethora of scientific works aimed at the reconstruction of network data, including the reconstruction of network data for epidemic forecasts (31–33), we lack inference methods to assess and improve the accuracy of epidemic data itself. Instead, the prevalent epidemic forecast philosophy is focused either on data fusion (34–36)—or machine learning methods (37–39).

In our work, we develop a statistical framework to remove reporting delays. We demonstrate that the removal of reporting delays may significantly improve the accuracy of epidemic forecasts. Our statistical framework can be used in combination with any epidemic forecast method that takes infectious, recovered, and deceased epidemic data as input. Our work is organized as follows. We first present the evidence for the presence of reporting delays in the epidemic data extracted from the first COVID-19 wave. We then proceed to develop a statistical framework to remove reporting delays. After validating our framework on synthetic data, we move on to remove and analyze reporting delays in eight hotspots of the COVID-19 pandemic. We conclude our work with a discussion of the impact of delay removal on the accuracy of epidemic forecasts.

Results

Notation

Before presenting our findings, we introduce our notation for the epidemic data. Throughout the text, we operate with the infectious I, recovered R, and deceased D data. Each dataset is a time series of values, each corresponding to a specific observation time. For brevity, we refer to the triplet of infectious, recovered, and deceased data as Y={I,R,D}. All values contained in the Y time series are fractions of individuals found in the corresponding state on a specific day. For instance, I[k] corresponds to the fraction of individuals who are infectious on day k. Since COVID-19 reported data often is advertised in the form of changes in the number of epidemic cases, we find it convenient to introduce the daily changes in epidemic data as ΔY[k] for Y={I,R,D}. Further, in this work, we operate with reported epidemic data Y~ and inferred data Y^. Since reported and inferred data are expected to differ from the true data Y, we need to distinguish the three. We summarize our notation in Table 1.

Table 1.

Naming convention for the epidemic data, Y={I,R,D}.

Cumulative quantitiesQuantity increments
Y: fractions of casesΔY: fractions of new cases
Y~: fractions of reported casesΔY~: fractions of reported new cases
Y^: fractions of predicted casesΔY^: fractions of predicted new cases
Cumulative quantitiesQuantity increments
Y: fractions of casesΔY: fractions of new cases
Y~: fractions of reported casesΔY~: fractions of reported new cases
Y^: fractions of predicted casesΔY^: fractions of predicted new cases
Table 1.

Naming convention for the epidemic data, Y={I,R,D}.

Cumulative quantitiesQuantity increments
Y: fractions of casesΔY: fractions of new cases
Y~: fractions of reported casesΔY~: fractions of reported new cases
Y^: fractions of predicted casesΔY^: fractions of predicted new cases
Cumulative quantitiesQuantity increments
Y: fractions of casesΔY: fractions of new cases
Y~: fractions of reported casesΔY~: fractions of reported new cases
Y^: fractions of predicted casesΔY^: fractions of predicted new cases

Evidence for reporting delays in epidemiological data

We begin the exposition by considering daily reports on the fractions of infected ΔI~, recovered ΔR~, and deceased ΔD~ individuals in Spain. We used the daily epidemic reports in Spain to construct the fraction of infected individuals as I~[k]==0k1(ΔI~[]ΔR~[]ΔD~[]). While both the infected I~ and the deceased data D~ indicate that the first COVID-19 wave in Spain peaked in April 2020, the exact timings of the two peaks are, nevertheless, different. As observed in Fig. 1a, reported new deceased cases ΔD~ reached their peak on 2020 April 1st, while the highest fraction of infectious individuals I~ was observed 22 days later on 2020 April 23. This observation is not specific to Spain: the peaks in the number of reported new deceased cases ΔD~ precede those of infectious cases by more than 1 week in most regions, Fig. S1. We make similar observations for COVID-19 daily recovery reports ΔR~, which exhibit their peaks after those of the deceased cases ΔD~, Figs. 1a and S1. Furthermore, when plotted as a function of daily deceased cases ΔD~, infectious cases I~ and daily recovered cases ΔR~ form “loop” patterns, see Figs. 1b, c and S2.

Evidence for reporting delays in COVID-19 epidemic data. a) Reported infectious I~[k], recovered ΔR~[k+1] and deceased ΔD~[k+1] cases for the first COVID-19 wave in Spain. The k=0 day corresponds to 2020 February 25. Note the difference in peaks of the reported data. b) and c) display pairwise color-coded scatter plots of ΔD~[k+1] vs. I~[k], and ΔR~[k] vs. ΔD~[k] for Spain. Colors, from light to dark green, reflect different days in the data ranging, respectively, from k=0 to k=80. The scatter plots in b) and c) form, respectively, clockwise and counterclockwise loop patterns. d)–f) display the epidemic data in Spain after the removal of reporting delays. g) Synthetic epidemic data generated with the SIRD model by solving Eq. 1 with parameters β=0.5, γr=0.2, γd=0.05, R[0]=D[0]=0, I[0]=10−6, and S[0]=1−I[0]. Synthetic reporting delays were generated with the Pólya-Aeppli distributions using Eq. 3 with parameters λI=3.68, θI=0.5, E[TI]=7.36; λR=9.46, θR=0.4, E[TR]=23.65; λD=0.3, and θD=0.3, E[TD]=1. h) and i) display pairwise color-coded scatter plots of ΔD~[k+1] vs. I~[k], and ΔR~[k] vs. ΔD~[k] for the SIRD epidemic data. j)–l) display the synthetic epidemic data after the removal of reporting delays.
Fig. 1.

Evidence for reporting delays in COVID-19 epidemic data. a) Reported infectious I~[k], recovered ΔR~[k+1] and deceased ΔD~[k+1] cases for the first COVID-19 wave in Spain. The k=0 day corresponds to 2020 February 25. Note the difference in peaks of the reported data. b) and c) display pairwise color-coded scatter plots of ΔD~[k+1] vs. I~[k], and ΔR~[k] vs. ΔD~[k] for Spain. Colors, from light to dark green, reflect different days in the data ranging, respectively, from k=0 to k=80. The scatter plots in b) and c) form, respectively, clockwise and counterclockwise loop patterns. d)–f) display the epidemic data in Spain after the removal of reporting delays. g) Synthetic epidemic data generated with the SIRD model by solving Eq. 1 with parameters β=0.5, γr=0.2, γd=0.05, R[0]=D[0]=0, I[0]=106, and S[0]=1I[0]. Synthetic reporting delays were generated with the Pólya-Aeppli distributions using Eq. 3 with parameters λI=3.68, θI=0.5, E[TI]=7.36; λR=9.46, θR=0.4, E[TR]=23.65; λD=0.3, and θD=0.3, E[TD]=1. h) and i) display pairwise color-coded scatter plots of ΔD~[k+1] vs. I~[k], and ΔR~[k] vs. ΔD~[k] for the SIRD epidemic data. j)–l) display the synthetic epidemic data after the removal of reporting delays.

The patterns observed in Figs. 1a–c, S1, and S2 may indicate the presence of reporting delays. Indeed, epidemic models view recoveries and deaths of patients as stochastic processes with effective rates proportional to the number of infected individuals I, ΔR[k]γrI[k], ΔD[k]γdI[k]. Since the SARS-CoV-2 virus has hardly changed during the first wave of the pandemic, we expect that recovery γr and death γd rates are approximately constant during this period (40–42). This observation suggests that changes in the fractions of recovered ΔR, and deceased ΔD data are proportional to the fraction of infectious individuals I and, therefore, reach maximum values at the same time step, contradicting Fig. 1a–c.

We hypothesize that the disagreement between ΔR~, ΔD~, and I~ is due to reporting delays. If each ΔR~, ΔD~, and I~ time series are reported with different delays, their peaks are expected to differ. Likewise, the loop patterns of Fig. 1b, c may also be the result of an effective time shift between two nonmonotonous time series.

To check if reporting delays may result in the observed patterns, we consider the compartmental Susceptible-Infectious-Recovered-Deceased (SIRD) epidemic model. Within the SIRD model, the population is split into four compartments: susceptible S, infectious I, recovered R, and deceased D. Compartment S denotes the fraction of susceptible individuals, who can be infected by infectious individuals. Compartment I denotes the fraction of individuals, who have been infected but have not recovered or are deceased. Compartments R and D are respectively the fractions of individuals, who have recovered or are deceased. The SIRD model assumes that recovered individuals become immune and cannot be infected by the virus in the future. Further, the SIRD model assumes the uniform mixing of the Infectious and Susceptible sub-populations. As a result, the discrete-time transitions between the compartments are governed by first-order difference time equations

(1)

where β, γr, and γd are the infection, the recovery, and the deceased probabilities, respectively. We used Eq. 1 to generate infectious I[k], recovered R[k], and deceased D[k] time series epidemic data, and then added synthetic delays to the time series data, see Fig. 1g–i and Supplementary Material B. Since reporting delays were manually added, we observe that ΔR[k+1], ΔD[k+1], and I[k] reach maxima at different time steps, Fig. 1g.

Upholding our hypothesis, we observe the formation of loop patterns in the ΔD~[k+1] vs. I~[k] and ΔR~[k+1] vs. D~[k] in Fig. 1h, i, similar to those of Spain, Fig. 1b, c. The observed loop patterns are due to the effective horizontal shifts of the corresponding times series due to reporting delays. Indeed, let us split the observation time window into three windows formed by the maxima of the ΔD~[k+1] and I~[k] curves, as shown in Fig. 1g. In window i, both I~[k] and ΔD~[k+1] increase as a function of discrete time k. Since reporting delays in the synthetic ΔD~[k+1] data are smaller than those in I~[k], this time window corresponds to the upper branch of the ΔD~[k+1] vs. I~[k] loop in Fig. 1h. In window ii, I~[k] increases while ΔD~[k+1] decreases. Thus, window ii corresponds to the top (decreasing) section of the ΔD~[k+1] vs. I~[k] loop, Fig. 1h. Finally, in window iii both ΔD~[k+1] and I~[k] decrease as a function of time step k resulting in the lowest section of the ΔD~[k+1] vs. I~[k] loop, Fig. 1h. Combined, all sections correspond to the loop pattern of Fig. 1h with points progressing in the clockwise direction. Similar considerations explain the counterclockwise loop pattern in the ΔR~[k+1] vs. D~[k] scatter plot. The counterclockwise progression of points in this loop pattern is due to ΔR~[k] lagging behind the ΔD~[k] time series, Fig. 1i.

A null model for reporting delays

To uncover reporting delays, we employ the following null model: each individual i is endowed with random event time τYi of either getting infected, τIi, recovered, τRi, or deceased, τDi. Figure 2 depicts the random delay time for event Yi={Ii,Ri,Di} as TYi={TIi,TRi,TDi} and the corresponding random reported time as KYi=τYi+TYi. Within our discrete-time setting, these random times are discrete random variables. Assuming that the reporting delays TYi are independent of the events Yi, we obtain for the reported time KYi

(2)
A schematic representation of epidemic events and corresponding delay times.
Fig. 2.

A schematic representation of epidemic events and corresponding delay times.

By taking the arithmetic mean of Eq. 2, we find that the expected reported fraction ΔY~[k]=1NiPr[KYi=k] of individuals in state Y is a discrete convolution

(3)

where Y={I,R,D}. Equation 3 serves as the foundation for uncovering reporting delays and improving epidemic forecasts.

Statistical framework to uncover reporting delays

The delay distribution Pr[TY=m] in (3) can be determined, in principle, as the solution of a large set of quadratic equations, deduced from the governing epidemic equations, as shown in Supplementary Material D, provided that we assume a same distribution for each delay TYi={TIi,TRi,TDi}. Since that solution is rather demanding, we propose a simplification. We assume that the delay distribution Pr[TY=m] is generated by a two-parameter probability distribution with parameters κ={λ,θ}, which is sufficiently general to incorporate the real probability distribution of the delay.

We consider three families of two-parameter discrete probability distributions: the negative binomial distribution, the Neyman type A distribution, and the Pólya-Aeppli distribution, see Materials and methods A. These distributions are quite general and contain some well-known 1-parameter distributions as special cases. The logarithmic, geometric, and Poisson distributions are, for instance, all special cases of the negative binomial distribution (43). The three distributions are sufficiently general to generate data with variable mean and variance and of varying skewness (43, 44)—properties expected of the epidemic delay data, see Materials and methods A. For brevity, we focus on the Pólya-Aeppli distribution in the main text and report the results for the other two distributions in Fig. S4.

In our statistical framework, we assume that the reporting delays correspond to three datasets, Y={I,R,D}, which are all characterized by the Pólya-Aeppli distribution, albeit with different parameters κY{λY,θY}. Hence, there are in total six parameters for three Pólya-Aeppli distributions, κ=(λI,θI,λR,θR,λD,θD). With the choice of the Pólya-Aeppli distribution, the reporting delays ΔY~ in our basic equation Eq. 3 can be determined in the parameterized form ΔY~κ.

Given the incremental time series ΔY~κ, the cumulative time series Y~κ are given by

(4)

for k0, and Iκ~[0]=Rκ~[0]=Dκ~[0]=0.

The main assumption of our reporting delay removal framework is that the increments in new recovered ΔR and deceased ΔD individuals are proportional to the cumulative fraction of infectious individuals I.

Therefore, we determine the “best” parameters κ¯ that maximize the product of pairwise correlations among the three epidemic time series in Yκ~:

(5)

where ρ(X,Y) is the Pearson correlation coefficient between time series X and Y. The objective function Ob(Yκ~) in Eq. 5 reaches its maximum value of 1 when all three pairwise correlations among the Iκ~, ΔRκ~ and ΔDκ~ time series are 1, which we expect when recovery γr and deceased γd epidemic probabilities are constant. Due to the nature of the Pearson correlation coefficient, the objective function Ob(Yκ~[k])=Ob(Yκ~[kT]) is invariant under the constant time shift T of the epidemic data. As a result, we can only infer the reporting delays up to a constant time shift T.

To maximize Ob(Yκ~), the random search (45) method is applied: we conduct a large set of independent random iterations, =1,,L. At each iteration ℓ, we select the elements of the parameter vector κ¯ uniformly at random from the prescribed domain of values. Further, at each iteration ℓ, we use the selected parameter set κ¯ to reconstruct the original epidemic data ΔY~κ¯ by solving Eq. 3 and convert incremental data ΔY~κ¯ to cumulative data Y~κ¯ by solving Eq. 4. We then compute the pairwise correlations in Y~κ¯ to obtain the objective function Ob(Y~κ¯). After completing all random search iterations, the resulting delay parameter κ^ is the one maximizing the objective function, κ^=argmaxκ¯Ob(Y~κ¯), see Materials and methods B.

Tests on synthetic data

Before uncovering reporting delays in real epidemic data, we test our framework on synthetic datasets, which we generate with the SIRD compartmental model, Eq. 1.

To test our framework, we generated 50 SIRD epidemic model datasets with different epidemic parameters and added synthetic reported delays to the obtained times series, as prescribed by Eq. 3. The resulting delay-perturbed synthetic data is fed to our statistical framework that infers the reported synthetically added delays. Figure 1d–f and j–l display, respectively, the reconstructed real and synthetic epidemic data. We find strong correlations between the reconstructed ΔR~, ΔD~, and I~ data. Further, Fig. 3a, b indicate that the inferred delay parameters are in good agreement with the true parameters that generated the datasets. Figure 3 illustrates that the inference errors decrease fast as a function of the number of iteration steps saturating at the value of 2 days after 107 iterations. To assess the robustness of the inference procedure, we have conducted cross-inference experiments by generating synthetic data with one delay distribution and inferring delays using another distribution, arriving at similar results in Fig. S4.

Uncovering reporting delays in synthetic data. We generate epidemic data using the SIRD model, Eq. 1 with parameters β=0.23, γr=0.079, and γd=0.11 and variable delay parameters. The schematics on top illustrates the process of generation and inference on synthetic data. a) The mean absolute inference errors as functions of the number of iteration steps in the random search. Due to the nature of the Pearson correlation coefficient, it is only possible to infer relative delays, E[TR]−E[TD], E[TI]−E[TD], and E[TR]−E[TI]. b) The relationship between the true and the inferred values of delay parameters. c, d) An example of an SIRD synthetic data c) before and d) after the removal of synthetic reporting delays.
Fig. 3.

Uncovering reporting delays in synthetic data. We generate epidemic data using the SIRD model, Eq. 1 with parameters β=0.23, γr=0.079, and γd=0.11 and variable delay parameters. The schematics on top illustrates the process of generation and inference on synthetic data. a) The mean absolute inference errors as functions of the number of iteration steps in the random search. Due to the nature of the Pearson correlation coefficient, it is only possible to infer relative delays, E[TR]E[TD], E[TI]E[TD], and E[TR]E[TI]. b) The relationship between the true and the inferred values of delay parameters. c, d) An example of an SIRD synthetic data c) before and d) after the removal of synthetic reporting delays.

Uncovering reporting delays in real data

After testing our inference framework on synthetic data, we moved on to uncover reporting delays in real epidemic data that we have collected from eight regions worldwide, Supplementary Material A. Figure 1d–f and j–l display the reconstructed epidemic data for Spain and for synthetic SIRD model after the identification and removal of reporting delays, see Table S1 for the inferred parameters. Consistent with our expectations, the reconstructed time series of infected I[k], new recovered ΔR[k+1], and new deceased ΔD[k+1] individuals are strongly correlated, Fig. 1e, f and k, l, and their peaks co-occur, as seen in Fig. 1d, j.

We summarize the properties of uncovered reporting delays in the eight regions in Table 2. We find that the infectious and recovered data delays are longer than those for deceased cases and vary from several days to several weeks. This observation is hardly surprising. Indeed, there are several factors contributing to the delays of infectious cases. One factor is the delay between an individual becoming infectious, and the symptom onset (46). Another factor, particularly significant in the first COVID-19 wave, is the delay between the symptom onset and the test result (47). In their turn, delays in the recovered events are likely caused by hospital discharge policies. Indeed, the recovered data are usually derived from hospital discharge events, which occur after the patient’s recovery.

Table 2.

Inferred reporting delays.

RegionsE[TR]E[TD]E[TI]E[TD]σIσRσDρ(ΔR¯,ΔD¯)ρ(I¯,ΔR¯)ρ(I¯,ΔD¯)Ob(Y¯)
Italy28.526.595.0827.790.720.910.940.990.84
Spain22.666.368.6328.390.730.990.991.000.98
Wuhan14.8020.6451.6716.1323.870.780.890.910.63
Turkey14.1324.8543.8912.4711.210.910.950.980.85
Hubei9.9621.0778.6210.8954.140.850.900.920.71
Romania3.3019.0543.0890.120.290.810.890.970.70
Germany2.7216.5539.25106.894.310.870.880.990.76
Denmark0.1725.0736.902.7128.250.830.930.940.72
RegionsE[TR]E[TD]E[TI]E[TD]σIσRσDρ(ΔR¯,ΔD¯)ρ(I¯,ΔR¯)ρ(I¯,ΔD¯)Ob(Y¯)
Italy28.526.595.0827.790.720.910.940.990.84
Spain22.666.368.6328.390.730.990.991.000.98
Wuhan14.8020.6451.6716.1323.870.780.890.910.63
Turkey14.1324.8543.8912.4711.210.910.950.980.85
Hubei9.9621.0778.6210.8954.140.850.900.920.71
Romania3.3019.0543.0890.120.290.810.890.970.70
Germany2.7216.5539.25106.894.310.870.880.990.76
Denmark0.1725.0736.902.7128.250.830.930.940.72

The table displays the inferred differences between the expected delay times E[TR]E[TD], E[TI]E[TD], standard deviations σI, σR and σD, and optimized values for the objective function Ob(Y¯). See Tables S1 and S2 for the corresponding parameters of the Pólya-Aeppli distributions.

Table 2.

Inferred reporting delays.

RegionsE[TR]E[TD]E[TI]E[TD]σIσRσDρ(ΔR¯,ΔD¯)ρ(I¯,ΔR¯)ρ(I¯,ΔD¯)Ob(Y¯)
Italy28.526.595.0827.790.720.910.940.990.84
Spain22.666.368.6328.390.730.990.991.000.98
Wuhan14.8020.6451.6716.1323.870.780.890.910.63
Turkey14.1324.8543.8912.4711.210.910.950.980.85
Hubei9.9621.0778.6210.8954.140.850.900.920.71
Romania3.3019.0543.0890.120.290.810.890.970.70
Germany2.7216.5539.25106.894.310.870.880.990.76
Denmark0.1725.0736.902.7128.250.830.930.940.72
RegionsE[TR]E[TD]E[TI]E[TD]σIσRσDρ(ΔR¯,ΔD¯)ρ(I¯,ΔR¯)ρ(I¯,ΔD¯)Ob(Y¯)
Italy28.526.595.0827.790.720.910.940.990.84
Spain22.666.368.6328.390.730.990.991.000.98
Wuhan14.8020.6451.6716.1323.870.780.890.910.63
Turkey14.1324.8543.8912.4711.210.910.950.980.85
Hubei9.9621.0778.6210.8954.140.850.900.920.71
Romania3.3019.0543.0890.120.290.810.890.970.70
Germany2.7216.5539.25106.894.310.870.880.990.76
Denmark0.1725.0736.902.7128.250.830.930.940.72

The table displays the inferred differences between the expected delay times E[TR]E[TD], E[TI]E[TD], standard deviations σI, σR and σD, and optimized values for the objective function Ob(Y¯). See Tables S1 and S2 for the corresponding parameters of the Pólya-Aeppli distributions.

Based on the expected delay values, one can naturally split the eight ROIs into three categories: (i) large infectious and small recovered delays: Romania, Germany, and Denmark, (ii) small infectious and large recovered delays: Italy and Spain, and (iii) large infectious and recovered delays: Wuhan, Hubei, and Turkey.

Small infectious delays imply that COVID-19 tests are timely and accurate. This seems to be the case for Italy, which executed more tests per capita in April 2020 than other countries (48). On the contrary, the testing ability of the Hubei province was significantly insufficient during the first wave of the COVID-19 outbreak.

Large delays in the recovered data may be attributed to strict hospital discharge policies. Discharge policy in Italy, for instance, was based on the negative test (49), and it has been shown that COVID-19 tests may stay positive for an extended time after COVID-19 symptoms disappear. In contrast, discharge policies in Denmark were based not on the negative test but on patient symptoms (49), likely leading to shorter reporting delays in the recovered data.

Large standard deviations in the reporting delays may indicate irregularities in reporting mechanisms. As an example, Hubei province expanded its daily testing capacity from 200 to 2,000 individuals from the beginning of the pandemic until January 27th (50). As a result, fewer individuals were tested late at the end of the first COVID-19 wave compared to its beginning, likely resulting in the large standard deviation of infectious data delays.

The small standard deviation in the delays of recovered data observed in Hubei and its capital Wuhan is likely the consequence of the strict discharge criteria (51). Although strict discharge policies cause significant delays, these delays are similar, resulting in relatively smaller σR values. In contrast to Chinese regions, the recovery data for Germany are not reported directly but instead are estimated by a not explained algorithm (52), resulting in large errors and, consequently, larger σR values. Based on the optimized values Ob(Y¯) as shown in Table S2, the optimization performances of our algorithm for Italy, Spain, and Turkey are better than the other countries.

Improving epidemic forecasts

Is accounting for reporting delays likely to improve epidemic forecasts? To answer this question, we designed two experiments. Experiment 1 aims to forecast the epidemics ignoring reporting delays and serves as a baseline for Experiment 2, which uncovers reporting delays prior to forecasting epidemic data.

In both experiments, we split the reported epidemic data into two parts, which we call the training and the testing sets, respectively, see Fig. 4a. In Experiment 1, we fit the training set with the SIRD epidemic model, obtaining model parameters β, γr+γd, and the fraction of initial infected cases I[0]. We then use the SIRD model with the obtained parameters to forecast the epidemic data, Supplementary Material C.

Accounting for reporting delays improves epidemic forecasts. a) The schematic diagram for the two forecast experiments. b), c) display the results of epidemic forecasts in Spain with different forecast start dates. The forecasts of experiment 2 (yellow) are closer to the reported data (red) than the forecasts of experiment 1 (blue). The forecast results for all eight countries or regions are shown in Figs. S5 and S6. d), e) All forecast results for Spain and Germany. Shown are the RMSE forecast errors and their ratios. h) Average forecast errors ϵ for the 8 regions. The benefits of removing reporting delays vary by region. g) The average RMSE forecast error of experiment 2 as a function of SIRD model fitting error for all experiments. h) The mean relative forecast error as a function of SIRD model fitting error for all experiments.
Fig. 4.

Accounting for reporting delays improves epidemic forecasts. a) The schematic diagram for the two forecast experiments. b), c) display the results of epidemic forecasts in Spain with different forecast start dates. The forecasts of experiment 2 (yellow) are closer to the reported data (red) than the forecasts of experiment 1 (blue). The forecast results for all eight countries or regions are shown in Figs. S5 and S6. d), e) All forecast results for Spain and Germany. Shown are the RMSE forecast errors and their ratios. h) Average forecast errors ϵ for the 8 regions. The benefits of removing reporting delays vary by region. g) The average RMSE forecast error of experiment 2 as a function of SIRD model fitting error for all experiments. h) The mean relative forecast error as a function of SIRD model fitting error for all experiments.

In Experiment 2, we first use the reported data Y~ in the training set to infer the parameters of the delay distributions κ¯. We rely on these parameters to remove reporting delays from the testing set and obtain reconstructed data ΔI¯. In the next step, we fit the reconstructed data ΔI¯ with the SIRD model, obtaining β, γr+γd and I[0] spreading parameters. We use these spreading parameters to forecast the epidemic data ΔI^, which we compare to the reported data ΔI~ in the testing set. Since ΔI~ in the testing set contains the reporting delays, while the forecast ΔI^ does not, we added the inferred reporting delays back to the ΔI^ using Eq. 3, obtaining ΔI^*, See Fig. 4a and Supplementary Material C.

Figure 4b, c presents the results of the forecast experiments for Spain, indicating that correcting for reporting delays does improve the forecast accuracy. To evaluate the forecast errors in a systematic way, we evaluated the root mean square errors between the forecasts IF and the testing IT sets:

(6)

where n is the size of the testing set. Further, to quantify the benefits of accounting for reporting delays, we used the ratio of the root mean square errors measured in forecasts with and without reporting delays, ϵRMSE(I^2*,I)RMSE(I^1,I), where I^1 and I^2* are the epidemic forecasts obtained in experiments 1 and 2, respectively. The smaller the ratio, the smaller the relative forecast error.

We measured the accuracy of epidemic forecasts with different start dates for Spain in Germany. While we did observe a substantial improvement in forecast accuracy for Spain, Fig. 4d, this was not the case for Germany, where the removal of reporting delays only improved the forecasts by a small margin, Fig. 4e. On a broader scale, we observed that the benefits of correction for reporting delays vary across all regions of interest, Figs. 4f, S5, and S6. While there is nearly a two-fold improvement in the forecast accuracy for Denmark, there is little improvement for Wuhan and Hubei.

Apart from only measuring and testing a small part of the population—which might be too small or not sampled adequately—there are at least two factors that may hinder epidemic forecast accuracy. The first factor is the insufficient accuracy of reporting delay removals. The second one is the inability of the SIRD model to reproduce the COVID-19 dynamics accurately. We can quantify the former by the maximum attained value of the objective function Ob(Y¯)max that we aim to maximize when removing reporting delays. While there is no direct way to quantify the goodness of the SIRD model in epidemic forecasts, as an indirect measure, we fit the training set with the SIRD model and compute the fitting error.

Figure 4g shows that the forecast error is strongly correlated with the SIRD fitting error, Pearson r=0.89, p<1027, indicating that the highest epidemic forecast accuracy is attained when the model fit is accurate. The forecast error ratio depends both on the Ob(Y¯)max and SIRD model fitting, Fig. 4h. We observe the largest error ratio ϵ in Wuhan, Hubei, and Romania, Fig. 4h. Wuhan and Hubei correspond to the largest SIRD fitting errors, while Romania corresponds to the lowest SIRD fitting errors. At the same time, all three regions are quantified by the lowest Ob(Y¯)max values. The other five regions, corresponding to smaller error ratios, are characterized by significantly larger Ob(Y¯)max values, Fig. 4g and Table S2.

These observations indicate that Ob(Y¯)max may be used as an early indicator of the benefit of delay removals. Indeed, lower Ob(Y¯)max values may indicate that reporting delays were not removed successfully or the initial assumption of the proportionality among I, ΔR, and ΔD times series does not hold. We note that high Ob(Y¯)max values correspond to high benefits of delay removals regardless of the SIRD fitting error. In the case of Romania, for instance, SIRD fitting errors are among the smallest, resulting in low forecast errors. Yet, the removal of reporting delays does not result in even lower forecast errors.

There might be multiple reasons for suboptimal reporting delay removals in the case of Hubei, Wuhan, and Romania. One possibility is the nonstationary nature of delays. The main assumption of our framework is that delay times are independent and drawn from the same distribution. While COVID-19 did not significantly mutate over the short time span of the first wave, our ability to handle the infections improved significantly. The PCR test capacity in China has grown remarkably during the first COVID wave, possibly explaining the limited effect of reporting delays on improving epidemic forecasts in Hubei and Wuhan.

Discussion

Data delays are ubiquitous in data sciences and adversely affect data analysis (53). The unique property of the epidemic data, enabling us to identify and filter out reporting delays, is the proportionality between the number of infectious individuals I and the rates of change in the deceased ΔD and the recovered ΔR individuals.

We relied on this proportionality property to develop a parametric statistical framework to uncover reporting delays in the first COVID-19 wave and applied it to eight regions significantly affected during the first COVID-19 wave. The character of uncovered delays varies across studied regions and can be explained by region-specific medical capacities and epidemic policies.

Concerning the epidemic forecasts, we found that the benefits of using curated epidemic data, as opposed to the raw data, are maximized in situations when reporting delays are removed efficiently. One of the main factors hindering the efficiency of delay removal—the nonstationarity of reporting delays—can be caused by either varying spreading properties of the virus or by rapid changes in medical capacities or epidemic restrictions across the regions. While the spreading properties of COVID-19 did not change significantly during the first wave of the pandemic, both medical capacities and epidemic policies were the subjects of constant updates as the communities learned how to handle the pandemic.

We expect that our framework will prove most useful in the early stages of an epidemic when accurate epidemic forecasts are essential to plan intervention strategies and raise public awareness. Common sense dictates that data collection and reporting challenges are the most significant during that time.

Our statistical framework is based on the assumption that transition probabilities γr and γd quantifying transitions from the infectious to the recovered, IR and the infections to the deceased, ID states, respectively are constant across the population and do not change over the observational period. While these assumptions are approximations, we expect them to hold better in early epidemic stages than in later stages. In the later stages, the population becomes more heterogeneous, e.g. due to natural or vaccine-induced immunity (54). In these situations, our framework can be generalized by splitting the infectious compartment I into sub-compartments and considering different rates, e.g.

(7)

where I1[k] and I2[k] are two infectious sub-compartments characterized by different recovery and death probabilities.

At the same time, we stress that our statistical framework does not make strong assumptions about the spreading mechanisms of a pathogen and can be used in combination with any forecasting method that requires infectious, recovered, and deceased data as input.

In conclusion, our framework can be used as a preliminary filter for any epidemic forecast tool that takes infected, recovered, and deceased data as input. Our framework holds for any compartmental epidemic model, provided that each compartment can be measured, because a time series of each compartment is indispensable. Accurate and timely epidemic forecasts are of immense value for society and policymakers to minimize the adverse effects of the virus.

Materials and methods

Types of distributions

To determine the family of reporting delay distributions that best suit our data, we consider three different two-parameter discrete distributions (55) below:

(I) Negative binomial distribution.

(8)

The negative binomial distribution with parameters r>0 and p[0,1] has mean value E[T]=r(1p)/p and variance Var[T]=r(1p)/p2.

(II) Pólya-Aeppli distribution is also known as the geometric Poisson distribution.

(9)

The Pólya-Aeppli distribution with parameters λ>0 and θ[0,1] has mean value E[T]=λ/θ and variance Var[T]=λ(2θ)/θ2.

(III) Neyman type A distribution.

(10)

The Neyman type A distribution with parameters ξ>0 and μ>0 has mean value E[T]=ξμ and variance Var[T]=ξμ(1+μ).

Inferring reporting delays

To uncover reporting delays in epidemic data, we determine parameters of delays distributions κY maximizing the pairwise correlations between the epidemic time series, Eq. 5. This optimization problem can be compactly written as:

(11)

Here T is the size of epidemic time series, Y~{I~,R~,D~} and Y{I,R,D} are reported and reconstructed epidemic data, respectively, while ΔY=ΨY1ΔY~ are the matrix solutions of Eq. 3. Indeed, Eq. 3 can be written in the matrix form as ΔY~=ΨYΔY for Y={I,R,D}, where

Then, Ψ1 is the inverse of ΨY and ΔY=ΨY1ΔY~.

In the main text, we assume that reporting delays are iid random variables drawn from three Pólya-Aeppli distributions, Eq. 9, with distinct parameters {λI,θI}, {λR,θR}, and {λD,θD}. In the case of Pólya-Aeppli distributions, the parameter vector takes the form of κ{λI,θI,λR,θR,λD,θD}. We solve the optimization problem given by Eq. 11 and the random search (45). For each iteration =1,,L, we treat the expected values of the Pólya-Aeppli distributions, E[TY]=λY/θY, as iid random variables and draw from the uniform pdfs U[0,30] for Y={I,R,D}. Similarly, we draw θY parameters independently at random from uniform pdfs U[0,1] and then determine λY parameters as λY=θYE[TY], obtaining κ. In our experiments, we set the maximum number of search iterations to L=107.

For each κ, we use Eq. 3 to reconstruct original data Y, which we then use to compute the objective function Ob(Y). After completing all iterations, the thought parameter vector κ^ describing reporting delays is the one corresponding to the maximum Ob(Y) value, κ^=arg maxκOb(Y).

Supplementary Material

Supplementary material is available at PNAS Nexus online.

Funding

This research has been funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 101019718). L.M. is thankful for the support from the China Scholarship Council. M.K. has been supported by the Dutch Research Council (NWO) grant OCENW.M20.244.

Author Contributions

L.M. conceived the research idea. L.M., P.V.M., and M.K. designed research. L.M. and Z.Q. performed research. All authors discussed results and wrote the manuscript.

This manuscript has been posted as a preprint in the ArXiv repository https://arxiv.org/abs/2304.11863.

Data Availability

All epidemic data used in this work are publicly available from the original sources, see Supplementary Information A. The extracted epidemic time series for the eight regions of interest are deposited in FigShare (https://doi.org/10.6084/m9.figshare.22639519.v1). Code to generate SIRD epidemic time series and to uncover reporting delays is available on GitHub (https://github.com/qzhszl/Reporting-delays-a-widely-neglected-impact-factor-in-COVID-19-forecasts.git).

References

1

Phillips
 
N
.
2021
.
The coronavirus is here to stay-here’s what that means
.
Nature
.
590
(
7846
):
382
384
.

2

Pastor-Satorras
 
R
,
Castellano
 
C
,
Van Mieghem
 
P
,
Vespignani
 
A
.
2015
.
Epidemic processes in complex networks
.
Rev Mod Phys
.
87
(
3
):
925
979
.

3

Kiss
 
IZ
,
Miller
 
JC
,
Simon
 
PL
.
2017
.
Mathematics of epidemics on networks
. Vol.
598
.
Cham
:
Springer
.

4

Brauer
 
F
.
2017
.
Mathematical epidemiology: past, present, and future
.
Infect Dis Model
.
2
(
2
):
113
127
.

5

Achterberg
 
MA
, et al.  
2022
.
Comparing the accuracy of several network-based COVID-19 prediction algorithms
.
Int J Forecast
.
38
(
2
):
489
504
.

6

Prasse
 
B
,
Van Mieghem
 
P
.
2022
.
Predicting network dynamics without requiring the knowledge of the interaction graph
.
Proc Natl Acad Sci USA
.
119
(
44
):
e2205517119
.

7

Moss
 
R
,
Zarebski
 
AE
,
Carlson
 
SJ
,
McCaw
 
JM
.
2019
.
Accounting for healthcare-seeking behaviours and testing practices in real-time influenza forecasts
.
Trop Med Infect Dis
.
4
(
1
):
12
.

8

Ibrahim
 
NK
.
2020
.
Epidemiologic surveillance for controlling COVID-19 pandemic: types, challenges and implications
.
J Infect Public Health
.
13
(
11
):
1630
1638
.

9

Clipman
 
SJ
, et al.  
2022
.
Improvements in severe acute respiratory syndrome coronavirus 2 testing cascade in the united states: data from serial cross-sectional assessments
.
Clin Infect Dis
.
74
(
9
):
1534
1542
.

10

Rader
 
B
, et al.  
2020
.
Geographic access to United States SARS-CoV-2 testing sites highlights healthcare disparities and may bias transmission estimates
.
J Travel Med
.
27
(
7
):
taaa076
.

11

Holden
 
TM
, et al.  
2021
.
Geographic and demographic heterogeneity of SARS-CoV-2 diagnostic testing in Illinois, USA, March to December 2020
.
BMC Public Health
.
21
(
1
):
1
13
.

12

Runge
 
M
, et al.  
2022
.
Modeling robust COVID-19 intensive care unit occupancy thresholds for imposing mitigation to prevent exceeding capacities
.
PLOS Glob Public Health
.
2
(
5
):
e0000308
.

13

Toh
 
KB
,
Runge
 
M
,
Richardson
 
RAK
,
Hladish
 
TJ
,
Gerardin
 
J
.
2023
.
Design of effective outpatient sentinel surveillance for COVID-19 decision-making: a modeling study
.
BMC Infect Dis
.
23
(
1
):
287
.

14

Pellis
 
L
, et al.  
2021
.
Challenges in control of COVID-19: short doubling time and long delay to effect of interventions
.
Phil Trans R Soc B
.
376
(
1829
):
20200264
.

15

Larremore
 
DB
, et al.  
2021
.
Test sensitivity is secondary to frequency and turnaround time for COVID-19 screening
.
Sci Adv
.
7
(
1
):
eabd5393
.

16

Marinović
 
AB
,
Swaan
 
C
,
van Steenbergen
 
J
,
Kretzschmar
 
M
.
2015
.
Quantifying reporting timeliness to improve outbreak control
.
Emerging Infect Dis
.
21
(
2
):
209
216
.

17

Swaan
 
C
,
van den Broek
 
A
,
Kretzschmar
 
M
,
Richardus
 
JH
.
2018
.
Timeliness of notification systems for infectious diseases: a systematic literature review
.
PLoS One
.
13
(
6
):
e0198845
.

18

Bastaki
 
H
,
Carter
 
J
,
Marston
 
L
,
Cassell
 
J
,
Rait
 
G
.
2018
.
Time delays in the diagnosis and treatment of malaria in non-endemic countries: a systematic review
.
Travel Med Infect Dis
.
21
:
21
27
.

19

Ahmed
 
AE
.
2017
.
Diagnostic delays in 537 symptomatic cases of Middle East respiratory syndrome coronavirus infection in Saudi Arabia
.
Int J Infect Dis
.
62
:
47
51
.

20

Sun
 
K
,
Chen
 
J
,
Viboud
 
C
.
2020
.
Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study
.
Lancet Digital Health
.
2
(
4
):
e201
e208
.

21

Linton
 
NM
, et al.  
2020
.
Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: a statistical analysis of publicly available case data
.
J Clin Med
.
9
(
2
):
538
.

22

Lauer
 
SA
, et al.  
2020
.
The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application
.
Ann Intern Med
.
172
(
9
):
577
582
.

23

Kraemer
 
MUG
, et al.  
2020
.
The effect of human mobility and control measures on the COVID-19 epidemic in China
.
Science
.
368
(
6490
):
493
497
.

24

Leung
 
K
,
Wu
 
JT
,
Liu
 
D
,
Leung
 
GM
.
2020
.
First-wave COVID-19 transmissibility and severity in China outside Hubei after control measures, and second-wave scenario planning: a modelling impact assessment
.
Lancet
.
395
(
10233
):
1382
1393
.

25

Lin
 
Q
, et al.  
2020
.
A conceptual model for the coronavirus disease 2019 (COVID-19) outbreak in Wuhan, China with individual reaction and governmental action
.
Int J Infect Dis
.
93
:
211
216
.

26

Cereda
 
D
, et al.  
2020
.
The early phase of the COVID-19 outbreak in Lombardy, Italy
.
arXiv:2003.09320
.

27

Dehning
 
J
, et al.  
2020
.
Inferring change points in the spread of COVID-19 reveals the effectiveness of interventions
.
Science
.
369
(
6500
):
eabb9789
.

28

Günther
 
F
,
Bender
 
A
,
Katz
 
K
,
Küchenhoff
 
H
,
Höhle
 
M
.
2021
.
Nowcasting the COVID-19 pandemic in Bavaria
.
Biom J
.
63
(
3
):
490
502
.

29

Tariq
 
A
, et al.  
2020
.
Real-time monitoring the transmission potential of COVID-19 in Singapore, March 2020
.
BMC Med
.
18
(
1
):
1
14
.

30

Harris
 
JE
.
2022
.
Timely epidemic monitoring in the presence of reporting delays: anticipating the COVID-19 surge in New York city, September 2020
.
BMC Public Health
.
22
(
1
):
871
.

31

Génois
 
M
,
Vestergaard
 
CL
,
Cattuto
 
C
,
Barrat
 
A
.
2015
.
Compensating for population sampling in simulations of epidemic spread on temporal contact networks
.
Nat Commun
.
6
(
1
):
8860
.

32

Mastrandrea
 
R
,
Barrat
 
A
.
2016
.
How to estimate epidemic risk from incomplete contact diaries data?
 
PLoS Comput Biol
.
12
(
6
):
e1005002
.

33

Sapienza
 
A
,
Barrat
 
A
,
Cattuto
 
C
,
Gauvin
 
L
.
2018
.
Estimating the outcome of spreading processes on networks with incomplete information: a dimensionality reduction approach
.
Phys Rev E
.
98
(
1
):
012317
.

34

Lu
 
FS
,
Nguyen
 
AT
,
Link
 
NB
,
Lipsitch
 
M
,
Santillana
 
M
.
2021
.
Estimating the cumulative incidence of COVID-19 in the United States using influenza surveillance, virologic testing, and mortality data: Four complementary approaches
.
PLOS Comput Biol
.
17
(
6
):
e1008994
. https://doi.org/10.1371/journal.pcbi.1008994.

35

Aleta
 
A
, et al.  
2020
.
Modelling the impact of testing, contact tracing and household quarantine on second waves of COVID-19
.
Nat Hum Behav
.
4
(
9
):
964
971
.

36

Cramer
 
EY
, et al.  
2022
.
Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the united states
.
Proc Natl Acad Sci USA
.
119
(
15
):
e2113561119
.

37

Chimmula
 
VKR
,
Zhang
 
L
.
2020
.
Time series forecasting of COVID-19 transmission in Canada using LSTM networks
.
Chaos Solitons Fract
.
135
:
109864
.

38

Liu
 
D
, et al.  
2020
.
A machine learning methodology for real-time forecasting of the 2019-2020 COVID-19 outbreak using Internet searches, news alerts, and estimates from mechanistic models
.
arXiv:2004.04019
.

39

Rahimi
 
I
,
Chen
 
F
,
Gandomi
 
AH
.
2023
.
A review on COVID-19 forecasting models
.
Neural Comput App
.
35
(
33
):
23671
23681
. https://doi.org/10.1007/s00521-020-05626-8.

40

Chen
 
J
, et al.  
2020
.
Clinical progression of patients with COVID-19 in Shanghai, China
.
J Infect
.
80
(
5
):
e1
e6
.

41

Voinsky
 
I
,
Baristaite
 
G
,
Gurwitz
 
D
.
2020
.
Effects of age and sex on recovery from COVID-19: analysis of 5769 Israeli patients
.
J Infect
.
81
(
2
):
e102
e103
.

42

Faes
 
C
, et al.  
2020
.
Time between symptom onset, hospitalisation and recovery or death: statistical analysis of Belgian COVID-19 patients
.
Int J Environ Res Public Health
.
17
(
20
):
7560
.

43

Freeman
 
GH
.
1980
.
Fitting two-parameter discrete distributions to many data sets with one common parameter
.
J R Stat Soc: Series C (Appl Stat)
.
29
(
3
):
259
267
.

44

Johnson
 
NL
,
Kemp
 
AW
,
Kotz
 
S
.
2005
.
Univariate discrete distributions
.
New York (NY)
:
John Wiley & Sons
444
.

45

Bergstra
 
J
,
Bengio
 
Y
.
2012
.
Random search for hyper-parameter optimization
.
J Mach Learn Res
.
13
(
2
):
281
305
.

46

Sun
 
K
, et al.  
2021
.
Transmission heterogeneities, kinetics, and controllability of SARS-CoV-2
.
Science
.
371
(
6526
):
eabe2424
.

47

Kretzschmar
 
ME
, et al.  
2020
.
Impact of delays on effectiveness of contact tracing strategies for COVID-19: a modelling study
.
Lancet Public Health
.
5
(
8
):
e452
e459
.

49

Jespers
 
V
, et al.  
2020
.
International comparison of COVID-19 testing and contact tracing strategies. Fps Health, Food Chain Safety and Environment: Lieven De Raedt Sciensano: Ana Hoxha. COVID-19 KCE contributions. 29
.

51

Fu
 
H
, et al.  
2021
.
Database of epidemic trends and control measures during the first wave of COVID-19 in mainland China
.
Int J Infect Dis
.
102
:
463
471
.

53

Akbarov
 
A
,
Wu
 
S
.
2013
.
Warranty claims data analysis considering sales delay
.
Qual Reliab Eng Int
.
29
(
1
):
113
123
.

54

Nesteruk
 
I
.
2021
.
COVID-19 pandemic dynamics: mathematical simulations
.
Singapore
:
Springer Nature
.

55

Ord
 
J. K.
 
1974
.
Families of frequency distributions
.
Int Stat Rev
.
42
(
2
):
231
. https://doi.org/10.2307/1403084.

Author notes

Competing Interest: The authors declare no competing interest.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Editor: Rui Reis
Rui Reis
Editor
Search for other works by this author on:

Supplementary data