-
PDF
- Split View
-
Views
-
Cite
Cite
Long Ma, Zhihao Qiu, Piet Van Mieghem, Maksim Kitsak, Reporting delays: A widely neglected impact factor in COVID-19 forecasts, PNAS Nexus, Volume 3, Issue 6, June 2024, pgae204, https://doi.org/10.1093/pnasnexus/pgae204
- Share Icon Share
Abstract
Epidemic forecasts are only as good as the accuracy of epidemic measurements. Is epidemic data, particularly COVID-19 epidemic data, clean, and devoid of noise? The complexity and variability inherent in data collection and reporting suggest otherwise. While we cannot evaluate the integrity of the COVID-19 epidemic data in a holistic fashion, we can assess the data for the presence of reporting delays. In our work, through the analysis of the first COVID-19 wave, we find substantial reporting delays in the published epidemic data. Motivated by the desire to enhance epidemic forecasts, we develop a statistical framework to detect, uncover, and remove reporting delays in the infectious, recovered, and deceased epidemic time series. Using our framework, we expose and analyze reporting delays in eight regions significantly affected by the first COVID-19 wave. Further, we demonstrate that removing reporting delays from epidemic data by using our statistical framework may decrease the error in epidemic forecasts. While our statistical framework can be used in combination with any epidemic forecast method that intakes infectious, recovered, and deceased data, to make a basic assessment, we employed the classical SIRD epidemic model. Our results indicate that the removal of reporting delays from the epidemic data may decrease the forecast error by up to 50%. We anticipate that our framework will be indispensable in the analysis of novel COVID-19 strains and other existing or novel infectious diseases.
Accurate forecasts constitute the first step toward controlling an epidemic outbreak. However, the accuracy of such forecasts strongly depends on input data. Here, we develop a statistical framework to identify and remove reporting delays from the epidemic data. We use this framework to de-noise epidemic data of the first COVID-19 wave in eight regions worldwide. We demonstrate that the removal of reporting delays may increase the accuracy of epidemic forecasts by up to 50%. We anticipate that our framework will be invaluable in the analysis and forecasts of existing and future epidemics.
Introduction
The COVID-19 pandemic has been crippling the world’s health, economies, and quality of life for over 3 years. We are gradually understanding that COVID-19 is here to stay. Fast mutation rates of the virus and its overwhelming spreading capability make it extremely hard, if not impossible, to eradicate (1). COVID-19 is not the first and almost surely not the last pandemic to hit humanity. Therefore, in order to better prepare and withstand other contagious diseases in the future, we need to extract as many lessons as possible from the COVID-19 pandemic.
Public awareness is, arguably, the first line of defense against any infectious disease. Efficient collection of epidemic data and accurate epidemic forecasts allow for timely containment of the spread or flattening of the curve to win time for the development of pharmaceutical treatment methods. Due to the success of network epidemiology and the broad availability of data, significant advances in epidemic modeling and forecast methods (2–6) have been achieved. However, the accuracy of epidemic forecasts strongly depends on the accuracy and timeliness of the input data.
Are epidemic data—especially in the short-time period after the onset of the epidemic—devoid of inaccuracies? Multiple sources suggest the negative answer. Indeed, public health indicators are known to be biased, because they are sensitive to fluctuations in supply and demand for diagnostic testing (7–9). Likewise, inequalities in geographic accessibility have been documented to adversely affect the coverage and timelines of health indicators (10, 11). It is now well understood that ignoring epidemic data inaccuracies may lead to delays in deploying intervention policies resulting in dire consequences of an epidemic on the population’s health (12, 13).
One prominent challenge is the existence of temporal delays in data reporting (14, 15), defined as the time difference between the event when the person was affected by the virus and the time this event is accounted for. The reporting delays may occur due to many reasons, including patient hesitancy, medical testing delays, and reporting delays by public health authorities (16, 17). Consequently, reporting delays may vary not only across different diseases but also across different regions of interest. The median diagnosis delay for malaria is, for instance, approximately 4 days (18). Similarly, the research on the Middle East respiratory syndrome coronavirus (MERS-CoV) found a time difference between the symptom onset and confirmation of approximately four days (19). Hepatitis A, measles, and mumps data are usually reported after eight days (16). Reporting delays for some diseases are significantly longer: Hepatitis B, Shigellosis, and Salmonella can take on average 2–3 weeks (16). Recent COVID-19 studies reveal significant reporting delays of infections in China (20–25), Italy (26), Germany (27, 28), Singapore (29), the United States of America (30), and the United Kingdom (14).
While there is a plethora of scientific works aimed at the reconstruction of network data, including the reconstruction of network data for epidemic forecasts (31–33), we lack inference methods to assess and improve the accuracy of epidemic data itself. Instead, the prevalent epidemic forecast philosophy is focused either on data fusion (34–36)—or machine learning methods (37–39).
In our work, we develop a statistical framework to remove reporting delays. We demonstrate that the removal of reporting delays may significantly improve the accuracy of epidemic forecasts. Our statistical framework can be used in combination with any epidemic forecast method that takes infectious, recovered, and deceased epidemic data as input. Our work is organized as follows. We first present the evidence for the presence of reporting delays in the epidemic data extracted from the first COVID-19 wave. We then proceed to develop a statistical framework to remove reporting delays. After validating our framework on synthetic data, we move on to remove and analyze reporting delays in eight hotspots of the COVID-19 pandemic. We conclude our work with a discussion of the impact of delay removal on the accuracy of epidemic forecasts.
Results
Notation
Before presenting our findings, we introduce our notation for the epidemic data. Throughout the text, we operate with the infectious , recovered , and deceased data. Each dataset is a time series of values, each corresponding to a specific observation time. For brevity, we refer to the triplet of infectious, recovered, and deceased data as . All values contained in the Y time series are fractions of individuals found in the corresponding state on a specific day. For instance, corresponds to the fraction of individuals who are infectious on day k. Since COVID-19 reported data often is advertised in the form of changes in the number of epidemic cases, we find it convenient to introduce the daily changes in epidemic data as for . Further, in this work, we operate with reported epidemic data and inferred data . Since reported and inferred data are expected to differ from the true data Y, we need to distinguish the three. We summarize our notation in Table 1.
Cumulative quantities . | Quantity increments . |
---|---|
Y: fractions of cases | : fractions of new cases |
: fractions of reported cases | : fractions of reported new cases |
: fractions of predicted cases | : fractions of predicted new cases |
Cumulative quantities . | Quantity increments . |
---|---|
Y: fractions of cases | : fractions of new cases |
: fractions of reported cases | : fractions of reported new cases |
: fractions of predicted cases | : fractions of predicted new cases |
Cumulative quantities . | Quantity increments . |
---|---|
Y: fractions of cases | : fractions of new cases |
: fractions of reported cases | : fractions of reported new cases |
: fractions of predicted cases | : fractions of predicted new cases |
Cumulative quantities . | Quantity increments . |
---|---|
Y: fractions of cases | : fractions of new cases |
: fractions of reported cases | : fractions of reported new cases |
: fractions of predicted cases | : fractions of predicted new cases |
Evidence for reporting delays in epidemiological data
We begin the exposition by considering daily reports on the fractions of infected , recovered , and deceased individuals in Spain. We used the daily epidemic reports in Spain to construct the fraction of infected individuals as . While both the infected and the deceased data indicate that the first COVID-19 wave in Spain peaked in April 2020, the exact timings of the two peaks are, nevertheless, different. As observed in Fig. 1a, reported new deceased cases reached their peak on 2020 April 1st, while the highest fraction of infectious individuals was observed 22 days later on 2020 April 23. This observation is not specific to Spain: the peaks in the number of reported new deceased cases precede those of infectious cases by more than 1 week in most regions, Fig. S1. We make similar observations for COVID-19 daily recovery reports , which exhibit their peaks after those of the deceased cases , Figs. 1a and S1. Furthermore, when plotted as a function of daily deceased cases , infectious cases and daily recovered cases form “loop” patterns, see Figs. 1b, c and S2.
![Evidence for reporting delays in COVID-19 epidemic data. a) Reported infectious I~[k], recovered ΔR~[k+1] and deceased ΔD~[k+1] cases for the first COVID-19 wave in Spain. The k=0 day corresponds to 2020 February 25. Note the difference in peaks of the reported data. b) and c) display pairwise color-coded scatter plots of ΔD~[k+1] vs. I~[k], and ΔR~[k] vs. ΔD~[k] for Spain. Colors, from light to dark green, reflect different days in the data ranging, respectively, from k=0 to k=80. The scatter plots in b) and c) form, respectively, clockwise and counterclockwise loop patterns. d)–f) display the epidemic data in Spain after the removal of reporting delays. g) Synthetic epidemic data generated with the SIRD model by solving Eq. 1 with parameters β=0.5, γr=0.2, γd=0.05, R[0]=D[0]=0, I[0]=10−6, and S[0]=1−I[0]. Synthetic reporting delays were generated with the Pólya-Aeppli distributions using Eq. 3 with parameters λI=3.68, θI=0.5, E[TI]=7.36; λR=9.46, θR=0.4, E[TR]=23.65; λD=0.3, and θD=0.3, E[TD]=1. h) and i) display pairwise color-coded scatter plots of ΔD~[k+1] vs. I~[k], and ΔR~[k] vs. ΔD~[k] for the SIRD epidemic data. j)–l) display the synthetic epidemic data after the removal of reporting delays.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/pnasnexus/3/6/10.1093_pnasnexus_pgae204/6/m_pgae204f1.jpeg?Expires=1747879600&Signature=aAEgNT0PrgboD1qzGvAFeVWR1G638wapTj2hh~vd7MRlnk391XhevhXwvy7-o0o1NVvC-jcjBcIkogUIXAkRh4xoxDicm9CmlVYVWjqo2SLMUWZPKzHct9Phsxe19rsGMyBiCWvk3vevg75mT4tpBoXmwkj0e6Ks-K~AEVn2V-Nn2DrzCPeyVPScPZNe4dSku-IzwBb-I9H4l1THkOuvdEV6aEglruPu05IcgSwMOW4opA6oh4O7qIeENgIu9R7wiyRRj4opzGDpaGYv-9M8Nr6QypgY8UAg7VuwfyCPkXEvr3pqCQV3JbVhIP1Wzbw6wXGG6B1vDu2zLWV~vOR9cQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Evidence for reporting delays in COVID-19 epidemic data. a) Reported infectious , recovered and deceased cases for the first COVID-19 wave in Spain. The day corresponds to 2020 February 25. Note the difference in peaks of the reported data. b) and c) display pairwise color-coded scatter plots of vs. , and vs. for Spain. Colors, from light to dark green, reflect different days in the data ranging, respectively, from to . The scatter plots in b) and c) form, respectively, clockwise and counterclockwise loop patterns. d)–f) display the epidemic data in Spain after the removal of reporting delays. g) Synthetic epidemic data generated with the SIRD model by solving Eq. 1 with parameters , , , , , and . Synthetic reporting delays were generated with the Pólya-Aeppli distributions using Eq. 3 with parameters , , ; , , ; , and , . h) and i) display pairwise color-coded scatter plots of vs. , and vs. for the SIRD epidemic data. j)–l) display the synthetic epidemic data after the removal of reporting delays.
The patterns observed in Figs. 1a–c, S1, and S2 may indicate the presence of reporting delays. Indeed, epidemic models view recoveries and deaths of patients as stochastic processes with effective rates proportional to the number of infected individuals I, , . Since the SARS-CoV-2 virus has hardly changed during the first wave of the pandemic, we expect that recovery and death rates are approximately constant during this period (40–42). This observation suggests that changes in the fractions of recovered , and deceased data are proportional to the fraction of infectious individuals and, therefore, reach maximum values at the same time step, contradicting Fig. 1a–c.
We hypothesize that the disagreement between , , and is due to reporting delays. If each , , and time series are reported with different delays, their peaks are expected to differ. Likewise, the loop patterns of Fig. 1b, c may also be the result of an effective time shift between two nonmonotonous time series.
To check if reporting delays may result in the observed patterns, we consider the compartmental Susceptible-Infectious-Recovered-Deceased (SIRD) epidemic model. Within the SIRD model, the population is split into four compartments: susceptible S, infectious I, recovered R, and deceased D. Compartment S denotes the fraction of susceptible individuals, who can be infected by infectious individuals. Compartment I denotes the fraction of individuals, who have been infected but have not recovered or are deceased. Compartments R and D are respectively the fractions of individuals, who have recovered or are deceased. The SIRD model assumes that recovered individuals become immune and cannot be infected by the virus in the future. Further, the SIRD model assumes the uniform mixing of the Infectious and Susceptible sub-populations. As a result, the discrete-time transitions between the compartments are governed by first-order difference time equations
where β, , and are the infection, the recovery, and the deceased probabilities, respectively. We used Eq. 1 to generate infectious , recovered , and deceased time series epidemic data, and then added synthetic delays to the time series data, see Fig. 1g–i and Supplementary Material B. Since reporting delays were manually added, we observe that , , and reach maxima at different time steps, Fig. 1g.
Upholding our hypothesis, we observe the formation of loop patterns in the vs. and vs. in Fig. 1h, i, similar to those of Spain, Fig. 1b, c. The observed loop patterns are due to the effective horizontal shifts of the corresponding times series due to reporting delays. Indeed, let us split the observation time window into three windows formed by the maxima of the and curves, as shown in Fig. 1g. In window i, both and increase as a function of discrete time k. Since reporting delays in the synthetic data are smaller than those in , this time window corresponds to the upper branch of the vs. loop in Fig. 1h. In window ii, increases while decreases. Thus, window ii corresponds to the top (decreasing) section of the vs. loop, Fig. 1h. Finally, in window iii both and decrease as a function of time step k resulting in the lowest section of the vs. loop, Fig. 1h. Combined, all sections correspond to the loop pattern of Fig. 1h with points progressing in the clockwise direction. Similar considerations explain the counterclockwise loop pattern in the vs. scatter plot. The counterclockwise progression of points in this loop pattern is due to lagging behind the time series, Fig. 1i.
A null model for reporting delays
To uncover reporting delays, we employ the following null model: each individual i is endowed with random event time of either getting infected, , recovered, , or deceased, . Figure 2 depicts the random delay time for event as and the corresponding random reported time as . Within our discrete-time setting, these random times are discrete random variables. Assuming that the reporting delays are independent of the events , we obtain for the reported time

A schematic representation of epidemic events and corresponding delay times.
By taking the arithmetic mean of Eq. 2, we find that the expected reported fraction of individuals in state Y is a discrete convolution
where . Equation 3 serves as the foundation for uncovering reporting delays and improving epidemic forecasts.
Statistical framework to uncover reporting delays
The delay distribution in (3) can be determined, in principle, as the solution of a large set of quadratic equations, deduced from the governing epidemic equations, as shown in Supplementary Material D, provided that we assume a same distribution for each delay . Since that solution is rather demanding, we propose a simplification. We assume that the delay distribution is generated by a two-parameter probability distribution with parameters , which is sufficiently general to incorporate the real probability distribution of the delay.
We consider three families of two-parameter discrete probability distributions: the negative binomial distribution, the Neyman type A distribution, and the Pólya-Aeppli distribution, see Materials and methods A. These distributions are quite general and contain some well-known 1-parameter distributions as special cases. The logarithmic, geometric, and Poisson distributions are, for instance, all special cases of the negative binomial distribution (43). The three distributions are sufficiently general to generate data with variable mean and variance and of varying skewness (43, 44)—properties expected of the epidemic delay data, see Materials and methods A. For brevity, we focus on the Pólya-Aeppli distribution in the main text and report the results for the other two distributions in Fig. S4.
In our statistical framework, we assume that the reporting delays correspond to three datasets, , which are all characterized by the Pólya-Aeppli distribution, albeit with different parameters . Hence, there are in total six parameters for three Pólya-Aeppli distributions, . With the choice of the Pólya-Aeppli distribution, the reporting delays in our basic equation Eq. 3 can be determined in the parameterized form .
Given the incremental time series , the cumulative time series are given by
for , and .
The main assumption of our reporting delay removal framework is that the increments in new recovered and deceased individuals are proportional to the cumulative fraction of infectious individuals I.
Therefore, we determine the “best” parameters that maximize the product of pairwise correlations among the three epidemic time series in :
where is the Pearson correlation coefficient between time series X and Y. The objective function in Eq. 5 reaches its maximum value of 1 when all three pairwise correlations among the , and time series are 1, which we expect when recovery and deceased epidemic probabilities are constant. Due to the nature of the Pearson correlation coefficient, the objective function is invariant under the constant time shift T of the epidemic data. As a result, we can only infer the reporting delays up to a constant time shift T.
To maximize , the random search (45) method is applied: we conduct a large set of independent random iterations, . At each iteration ℓ, we select the elements of the parameter vector uniformly at random from the prescribed domain of values. Further, at each iteration ℓ, we use the selected parameter set to reconstruct the original epidemic data by solving Eq. 3 and convert incremental data to cumulative data by solving Eq. 4. We then compute the pairwise correlations in to obtain the objective function . After completing all random search iterations, the resulting delay parameter is the one maximizing the objective function, , see Materials and methods B.
Tests on synthetic data
Before uncovering reporting delays in real epidemic data, we test our framework on synthetic datasets, which we generate with the SIRD compartmental model, Eq. 1.
To test our framework, we generated 50 SIRD epidemic model datasets with different epidemic parameters and added synthetic reported delays to the obtained times series, as prescribed by Eq. 3. The resulting delay-perturbed synthetic data is fed to our statistical framework that infers the reported synthetically added delays. Figure 1d–f and j–l display, respectively, the reconstructed real and synthetic epidemic data. We find strong correlations between the reconstructed , , and data. Further, Fig. 3a, b indicate that the inferred delay parameters are in good agreement with the true parameters that generated the datasets. Figure 3 illustrates that the inference errors decrease fast as a function of the number of iteration steps saturating at the value of 2 days after iterations. To assess the robustness of the inference procedure, we have conducted cross-inference experiments by generating synthetic data with one delay distribution and inferring delays using another distribution, arriving at similar results in Fig. S4.
![Uncovering reporting delays in synthetic data. We generate epidemic data using the SIRD model, Eq. 1 with parameters β=0.23, γr=0.079, and γd=0.11 and variable delay parameters. The schematics on top illustrates the process of generation and inference on synthetic data. a) The mean absolute inference errors as functions of the number of iteration steps in the random search. Due to the nature of the Pearson correlation coefficient, it is only possible to infer relative delays, E[TR]−E[TD], E[TI]−E[TD], and E[TR]−E[TI]. b) The relationship between the true and the inferred values of delay parameters. c, d) An example of an SIRD synthetic data c) before and d) after the removal of synthetic reporting delays.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/pnasnexus/3/6/10.1093_pnasnexus_pgae204/6/m_pgae204f3.jpeg?Expires=1747879600&Signature=Xw1LPeBmUpuaK-VqLqHPPhxkYnWZnwyt70YL9aXO5Ua0ht59J0KG1yLa2lzr30fk9TA0j9iFm6B1aEXN132flqnD3bPoVxrs2KAwgGTLr0wgtPVUHAwBp~kS0emdAA1VZIw3VwKi0GRzu-PBA9ykGVaNDmGwPuTlVM9bUHZz7KfMtHw3tyggiXseqCGF5WZ912-s7IKnffGwReqeUbFXYhQcP8UapAVojISQ1jT-TghZ1UkcZocKxZ1gOgRaS5pGyU60NoMjvuMyG5nikzhVHxbZ70Ek0N4ZttBHsl1av~~-UbNsx5c~0FqKoemllKvTHwY2dcFz7OrMpVT~zLgRiw__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Uncovering reporting delays in synthetic data. We generate epidemic data using the SIRD model, Eq. 1 with parameters , , and and variable delay parameters. The schematics on top illustrates the process of generation and inference on synthetic data. a) The mean absolute inference errors as functions of the number of iteration steps in the random search. Due to the nature of the Pearson correlation coefficient, it is only possible to infer relative delays, , , and . b) The relationship between the true and the inferred values of delay parameters. c, d) An example of an SIRD synthetic data c) before and d) after the removal of synthetic reporting delays.
Uncovering reporting delays in real data
After testing our inference framework on synthetic data, we moved on to uncover reporting delays in real epidemic data that we have collected from eight regions worldwide, Supplementary Material A. Figure 1d–f and j–l display the reconstructed epidemic data for Spain and for synthetic SIRD model after the identification and removal of reporting delays, see Table S1 for the inferred parameters. Consistent with our expectations, the reconstructed time series of infected , new recovered , and new deceased individuals are strongly correlated, Fig. 1e, f and k, l, and their peaks co-occur, as seen in Fig. 1d, j.
We summarize the properties of uncovered reporting delays in the eight regions in Table 2. We find that the infectious and recovered data delays are longer than those for deceased cases and vary from several days to several weeks. This observation is hardly surprising. Indeed, there are several factors contributing to the delays of infectious cases. One factor is the delay between an individual becoming infectious, and the symptom onset (46). Another factor, particularly significant in the first COVID-19 wave, is the delay between the symptom onset and the test result (47). In their turn, delays in the recovered events are likely caused by hospital discharge policies. Indeed, the recovered data are usually derived from hospital discharge events, which occur after the patient’s recovery.
Regions . | . | . | . | . | . | . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Italy | 28.52 | 6.59 | 5.08 | 27.79 | 0.72 | 0.91 | 0.94 | 0.99 | 0.84 |
Spain | 22.66 | 6.36 | 8.63 | 28.39 | 0.73 | 0.99 | 0.99 | 1.00 | 0.98 |
Wuhan | 14.80 | 20.64 | 51.67 | 16.13 | 23.87 | 0.78 | 0.89 | 0.91 | 0.63 |
Turkey | 14.13 | 24.85 | 43.89 | 12.47 | 11.21 | 0.91 | 0.95 | 0.98 | 0.85 |
Hubei | 9.96 | 21.07 | 78.62 | 10.89 | 54.14 | 0.85 | 0.90 | 0.92 | 0.71 |
Romania | 3.30 | 19.05 | 43.08 | 90.12 | 0.29 | 0.81 | 0.89 | 0.97 | 0.70 |
Germany | 2.72 | 16.55 | 39.25 | 106.89 | 4.31 | 0.87 | 0.88 | 0.99 | 0.76 |
Denmark | 0.17 | 25.07 | 36.90 | 2.71 | 28.25 | 0.83 | 0.93 | 0.94 | 0.72 |
Regions . | . | . | . | . | . | . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Italy | 28.52 | 6.59 | 5.08 | 27.79 | 0.72 | 0.91 | 0.94 | 0.99 | 0.84 |
Spain | 22.66 | 6.36 | 8.63 | 28.39 | 0.73 | 0.99 | 0.99 | 1.00 | 0.98 |
Wuhan | 14.80 | 20.64 | 51.67 | 16.13 | 23.87 | 0.78 | 0.89 | 0.91 | 0.63 |
Turkey | 14.13 | 24.85 | 43.89 | 12.47 | 11.21 | 0.91 | 0.95 | 0.98 | 0.85 |
Hubei | 9.96 | 21.07 | 78.62 | 10.89 | 54.14 | 0.85 | 0.90 | 0.92 | 0.71 |
Romania | 3.30 | 19.05 | 43.08 | 90.12 | 0.29 | 0.81 | 0.89 | 0.97 | 0.70 |
Germany | 2.72 | 16.55 | 39.25 | 106.89 | 4.31 | 0.87 | 0.88 | 0.99 | 0.76 |
Denmark | 0.17 | 25.07 | 36.90 | 2.71 | 28.25 | 0.83 | 0.93 | 0.94 | 0.72 |
The table displays the inferred differences between the expected delay times , , standard deviations , and , and optimized values for the objective function . See Tables S1 and S2 for the corresponding parameters of the Pólya-Aeppli distributions.
Regions . | . | . | . | . | . | . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Italy | 28.52 | 6.59 | 5.08 | 27.79 | 0.72 | 0.91 | 0.94 | 0.99 | 0.84 |
Spain | 22.66 | 6.36 | 8.63 | 28.39 | 0.73 | 0.99 | 0.99 | 1.00 | 0.98 |
Wuhan | 14.80 | 20.64 | 51.67 | 16.13 | 23.87 | 0.78 | 0.89 | 0.91 | 0.63 |
Turkey | 14.13 | 24.85 | 43.89 | 12.47 | 11.21 | 0.91 | 0.95 | 0.98 | 0.85 |
Hubei | 9.96 | 21.07 | 78.62 | 10.89 | 54.14 | 0.85 | 0.90 | 0.92 | 0.71 |
Romania | 3.30 | 19.05 | 43.08 | 90.12 | 0.29 | 0.81 | 0.89 | 0.97 | 0.70 |
Germany | 2.72 | 16.55 | 39.25 | 106.89 | 4.31 | 0.87 | 0.88 | 0.99 | 0.76 |
Denmark | 0.17 | 25.07 | 36.90 | 2.71 | 28.25 | 0.83 | 0.93 | 0.94 | 0.72 |
Regions . | . | . | . | . | . | . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Italy | 28.52 | 6.59 | 5.08 | 27.79 | 0.72 | 0.91 | 0.94 | 0.99 | 0.84 |
Spain | 22.66 | 6.36 | 8.63 | 28.39 | 0.73 | 0.99 | 0.99 | 1.00 | 0.98 |
Wuhan | 14.80 | 20.64 | 51.67 | 16.13 | 23.87 | 0.78 | 0.89 | 0.91 | 0.63 |
Turkey | 14.13 | 24.85 | 43.89 | 12.47 | 11.21 | 0.91 | 0.95 | 0.98 | 0.85 |
Hubei | 9.96 | 21.07 | 78.62 | 10.89 | 54.14 | 0.85 | 0.90 | 0.92 | 0.71 |
Romania | 3.30 | 19.05 | 43.08 | 90.12 | 0.29 | 0.81 | 0.89 | 0.97 | 0.70 |
Germany | 2.72 | 16.55 | 39.25 | 106.89 | 4.31 | 0.87 | 0.88 | 0.99 | 0.76 |
Denmark | 0.17 | 25.07 | 36.90 | 2.71 | 28.25 | 0.83 | 0.93 | 0.94 | 0.72 |
The table displays the inferred differences between the expected delay times , , standard deviations , and , and optimized values for the objective function . See Tables S1 and S2 for the corresponding parameters of the Pólya-Aeppli distributions.
Based on the expected delay values, one can naturally split the eight ROIs into three categories: (i) large infectious and small recovered delays: Romania, Germany, and Denmark, (ii) small infectious and large recovered delays: Italy and Spain, and (iii) large infectious and recovered delays: Wuhan, Hubei, and Turkey.
Small infectious delays imply that COVID-19 tests are timely and accurate. This seems to be the case for Italy, which executed more tests per capita in April 2020 than other countries (48). On the contrary, the testing ability of the Hubei province was significantly insufficient during the first wave of the COVID-19 outbreak.
Large delays in the recovered data may be attributed to strict hospital discharge policies. Discharge policy in Italy, for instance, was based on the negative test (49), and it has been shown that COVID-19 tests may stay positive for an extended time after COVID-19 symptoms disappear. In contrast, discharge policies in Denmark were based not on the negative test but on patient symptoms (49), likely leading to shorter reporting delays in the recovered data.
Large standard deviations in the reporting delays may indicate irregularities in reporting mechanisms. As an example, Hubei province expanded its daily testing capacity from 200 to 2,000 individuals from the beginning of the pandemic until January 27th (50). As a result, fewer individuals were tested late at the end of the first COVID-19 wave compared to its beginning, likely resulting in the large standard deviation of infectious data delays.
The small standard deviation in the delays of recovered data observed in Hubei and its capital Wuhan is likely the consequence of the strict discharge criteria (51). Although strict discharge policies cause significant delays, these delays are similar, resulting in relatively smaller values. In contrast to Chinese regions, the recovery data for Germany are not reported directly but instead are estimated by a not explained algorithm (52), resulting in large errors and, consequently, larger values. Based on the optimized values as shown in Table S2, the optimization performances of our algorithm for Italy, Spain, and Turkey are better than the other countries.
Improving epidemic forecasts
Is accounting for reporting delays likely to improve epidemic forecasts? To answer this question, we designed two experiments. Experiment 1 aims to forecast the epidemics ignoring reporting delays and serves as a baseline for Experiment 2, which uncovers reporting delays prior to forecasting epidemic data.
In both experiments, we split the reported epidemic data into two parts, which we call the training and the testing sets, respectively, see Fig. 4a. In Experiment 1, we fit the training set with the SIRD epidemic model, obtaining model parameters β, , and the fraction of initial infected cases . We then use the SIRD model with the obtained parameters to forecast the epidemic data, Supplementary Material C.

Accounting for reporting delays improves epidemic forecasts. a) The schematic diagram for the two forecast experiments. b), c) display the results of epidemic forecasts in Spain with different forecast start dates. The forecasts of experiment 2 (yellow) are closer to the reported data (red) than the forecasts of experiment 1 (blue). The forecast results for all eight countries or regions are shown in Figs. S5 and S6. d), e) All forecast results for Spain and Germany. Shown are the RMSE forecast errors and their ratios. h) Average forecast errors for the 8 regions. The benefits of removing reporting delays vary by region. g) The average RMSE forecast error of experiment 2 as a function of SIRD model fitting error for all experiments. h) The mean relative forecast error as a function of SIRD model fitting error for all experiments.
In Experiment 2, we first use the reported data in the training set to infer the parameters of the delay distributions . We rely on these parameters to remove reporting delays from the testing set and obtain reconstructed data . In the next step, we fit the reconstructed data with the SIRD model, obtaining β, and spreading parameters. We use these spreading parameters to forecast the epidemic data , which we compare to the reported data in the testing set. Since in the testing set contains the reporting delays, while the forecast does not, we added the inferred reporting delays back to the using Eq. 3, obtaining , See Fig. 4a and Supplementary Material C.
Figure 4b, c presents the results of the forecast experiments for Spain, indicating that correcting for reporting delays does improve the forecast accuracy. To evaluate the forecast errors in a systematic way, we evaluated the root mean square errors between the forecasts and the testing sets:
where n is the size of the testing set. Further, to quantify the benefits of accounting for reporting delays, we used the ratio of the root mean square errors measured in forecasts with and without reporting delays, , where and are the epidemic forecasts obtained in experiments 1 and 2, respectively. The smaller the ratio, the smaller the relative forecast error.
We measured the accuracy of epidemic forecasts with different start dates for Spain in Germany. While we did observe a substantial improvement in forecast accuracy for Spain, Fig. 4d, this was not the case for Germany, where the removal of reporting delays only improved the forecasts by a small margin, Fig. 4e. On a broader scale, we observed that the benefits of correction for reporting delays vary across all regions of interest, Figs. 4f, S5, and S6. While there is nearly a two-fold improvement in the forecast accuracy for Denmark, there is little improvement for Wuhan and Hubei.
Apart from only measuring and testing a small part of the population—which might be too small or not sampled adequately—there are at least two factors that may hinder epidemic forecast accuracy. The first factor is the insufficient accuracy of reporting delay removals. The second one is the inability of the SIRD model to reproduce the COVID-19 dynamics accurately. We can quantify the former by the maximum attained value of the objective function that we aim to maximize when removing reporting delays. While there is no direct way to quantify the goodness of the SIRD model in epidemic forecasts, as an indirect measure, we fit the training set with the SIRD model and compute the fitting error.
Figure 4g shows that the forecast error is strongly correlated with the SIRD fitting error, Pearson , , indicating that the highest epidemic forecast accuracy is attained when the model fit is accurate. The forecast error ratio depends both on the and SIRD model fitting, Fig. 4h. We observe the largest error ratio in Wuhan, Hubei, and Romania, Fig. 4h. Wuhan and Hubei correspond to the largest SIRD fitting errors, while Romania corresponds to the lowest SIRD fitting errors. At the same time, all three regions are quantified by the lowest values. The other five regions, corresponding to smaller error ratios, are characterized by significantly larger values, Fig. 4g and Table S2.
These observations indicate that may be used as an early indicator of the benefit of delay removals. Indeed, lower values may indicate that reporting delays were not removed successfully or the initial assumption of the proportionality among I, , and times series does not hold. We note that high values correspond to high benefits of delay removals regardless of the SIRD fitting error. In the case of Romania, for instance, SIRD fitting errors are among the smallest, resulting in low forecast errors. Yet, the removal of reporting delays does not result in even lower forecast errors.
There might be multiple reasons for suboptimal reporting delay removals in the case of Hubei, Wuhan, and Romania. One possibility is the nonstationary nature of delays. The main assumption of our framework is that delay times are independent and drawn from the same distribution. While COVID-19 did not significantly mutate over the short time span of the first wave, our ability to handle the infections improved significantly. The PCR test capacity in China has grown remarkably during the first COVID wave, possibly explaining the limited effect of reporting delays on improving epidemic forecasts in Hubei and Wuhan.
Discussion
Data delays are ubiquitous in data sciences and adversely affect data analysis (53). The unique property of the epidemic data, enabling us to identify and filter out reporting delays, is the proportionality between the number of infectious individuals I and the rates of change in the deceased and the recovered individuals.
We relied on this proportionality property to develop a parametric statistical framework to uncover reporting delays in the first COVID-19 wave and applied it to eight regions significantly affected during the first COVID-19 wave. The character of uncovered delays varies across studied regions and can be explained by region-specific medical capacities and epidemic policies.
Concerning the epidemic forecasts, we found that the benefits of using curated epidemic data, as opposed to the raw data, are maximized in situations when reporting delays are removed efficiently. One of the main factors hindering the efficiency of delay removal—the nonstationarity of reporting delays—can be caused by either varying spreading properties of the virus or by rapid changes in medical capacities or epidemic restrictions across the regions. While the spreading properties of COVID-19 did not change significantly during the first wave of the pandemic, both medical capacities and epidemic policies were the subjects of constant updates as the communities learned how to handle the pandemic.
We expect that our framework will prove most useful in the early stages of an epidemic when accurate epidemic forecasts are essential to plan intervention strategies and raise public awareness. Common sense dictates that data collection and reporting challenges are the most significant during that time.
Our statistical framework is based on the assumption that transition probabilities and quantifying transitions from the infectious to the recovered, and the infections to the deceased, states, respectively are constant across the population and do not change over the observational period. While these assumptions are approximations, we expect them to hold better in early epidemic stages than in later stages. In the later stages, the population becomes more heterogeneous, e.g. due to natural or vaccine-induced immunity (54). In these situations, our framework can be generalized by splitting the infectious compartment I into sub-compartments and considering different rates, e.g.
where and are two infectious sub-compartments characterized by different recovery and death probabilities.
At the same time, we stress that our statistical framework does not make strong assumptions about the spreading mechanisms of a pathogen and can be used in combination with any forecasting method that requires infectious, recovered, and deceased data as input.
In conclusion, our framework can be used as a preliminary filter for any epidemic forecast tool that takes infected, recovered, and deceased data as input. Our framework holds for any compartmental epidemic model, provided that each compartment can be measured, because a time series of each compartment is indispensable. Accurate and timely epidemic forecasts are of immense value for society and policymakers to minimize the adverse effects of the virus.
Materials and methods
Types of distributions
To determine the family of reporting delay distributions that best suit our data, we consider three different two-parameter discrete distributions (55) below:
(I) Negative binomial distribution.
The negative binomial distribution with parameters and has mean value and variance .
(II) Pólya-Aeppli distribution is also known as the geometric Poisson distribution.
The Pólya-Aeppli distribution with parameters and has mean value and variance .
(III) Neyman type A distribution.
The Neyman type A distribution with parameters and has mean value and variance .
Inferring reporting delays
To uncover reporting delays in epidemic data, we determine parameters of delays distributions maximizing the pairwise correlations between the epidemic time series, Eq. 5. This optimization problem can be compactly written as:
Here T is the size of epidemic time series, and are reported and reconstructed epidemic data, respectively, while are the matrix solutions of Eq. 3. Indeed, Eq. 3 can be written in the matrix form as for , where
Then, is the inverse of and .
In the main text, we assume that reporting delays are iid random variables drawn from three Pólya-Aeppli distributions, Eq. 9, with distinct parameters , , and . In the case of Pólya-Aeppli distributions, the parameter vector takes the form of . We solve the optimization problem given by Eq. 11 and the random search (45). For each iteration , we treat the expected values of the Pólya-Aeppli distributions, , as iid random variables and draw from the uniform pdfs for . Similarly, we draw parameters independently at random from uniform pdfs and then determine parameters as , obtaining . In our experiments, we set the maximum number of search iterations to .
For each , we use Eq. 3 to reconstruct original data , which we then use to compute the objective function . After completing all iterations, the thought parameter vector describing reporting delays is the one corresponding to the maximum value, .
Supplementary Material
Supplementary material is available at PNAS Nexus online.
Funding
This research has been funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 101019718). L.M. is thankful for the support from the China Scholarship Council. M.K. has been supported by the Dutch Research Council (NWO) grant OCENW.M20.244.
Author Contributions
L.M. conceived the research idea. L.M., P.V.M., and M.K. designed research. L.M. and Z.Q. performed research. All authors discussed results and wrote the manuscript.
This manuscript has been posted as a preprint in the ArXiv repository https://arxiv.org/abs/2304.11863.
Data Availability
All epidemic data used in this work are publicly available from the original sources, see Supplementary Information A. The extracted epidemic time series for the eight regions of interest are deposited in FigShare (https://doi.org/10.6084/m9.figshare.22639519.v1). Code to generate SIRD epidemic time series and to uncover reporting delays is available on GitHub (https://github.com/qzhszl/Reporting-delays-a-widely-neglected-impact-factor-in-COVID-19-forecasts.git).
References
Author notes
Competing Interest: The authors declare no competing interest.