Estimated incidence of COVID-19 illness and hospitalization — United States, February–September, 2020

Abstract
Background: In the United States, laboratory-confirmed coronavirus disease 2019 (COVID-19) is nationally notifiable. However, reported case counts are recognized to be less than the true number of cases because detection and reporting are incomplete and can vary by disease severity, geography, and over time.
Methods: To estimate the cumulative incidence of SARS-CoV-2 infections, symptomatic illnesses, and hospitalizations, we adapted a simple probabilistic multiplier model. Laboratory-confirmed case counts that were reported nationally were adjusted for sources of under-detection based on testing practices in inpatient and outpatient settings and assay sensitivity.
Results: We estimated that through the end of September, 1 of every 2.5 (95% Uncertainty Interval (UI): 2.0–3.1) hospitalized infections and 1 of every 7.1 (95% UI: 5.8–9.0) non-hospitalized illnesses may have been nationally reported. Applying these multipliers to reported SARS-CoV-2 cases along with data on the prevalence of asymptomatic infection from published systematic reviews, we estimate that 2.4 million hospitalizations, 44.8 million symptomatic illnesses, and 52.9 million total infections may have occurred in the U.S. population from February 27–September 30, 2020.
Conclusions: These preliminary estimates help demonstrate the societal and healthcare burdens of the COVID-19 pandemic and can help inform resource allocation and mitigation planning.

A c c e p t e d M a n u s c r i p t 4

Background
In the United States, the earliest known patients with coronavirus disease 2019 (COVID-19), the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection, were associated with travel to affected countries or known contact with other infected persons [1]. By February 2020, persons with SARS-CoV-2 infection in the U.S. and no known exposure were detected [2]. Between February 27 and September 30, 2020, nearly 6.9 million laboratory-confirmed cases of domestically acquired infections were detected and reported nationally.
Persons with laboratory-confirmed SARS-CoV-2 infection reported through national surveillance do not represent all infected persons in the U.S. Seroprevalence studies have shown a higher level of SARS-CoV-2 infection than has been reflected by confirmed case counts [2][3][4][5][6][7]. Most unreported infections were in asymptomatic or mildly ill people who recovered without seeking medical care or testing [8][9][10]. However, even persons with SARS-CoV-2 infection in medical settings may not be tested or nationally reported as confirmed cases. Limited availability of tests, reagents, and laboratory capacity reduced case detection; in addition, patients may have avoided medical care settings or presented with non-specific symptoms and not been suspected to have SARS-CoV-2 infection. Furthermore, not all infected persons will test positive because of assay sensitivity, timing of specimen collection, or specimen quality [11]. Factors involved in detecting and reporting cases may vary by age, geographically, over time, across healthcare settings, and by severity of disease. Finally, some people may be infected with SARS-CoV-2 and never show clinical symptoms; these asymptomatic persons would be even less likely to be detected [9,10].
To better estimate the U.S. incidence of SARS-CoV-2 infection since the beginning of the pandemic, we adapted a probabilistic multiplier model to adjust nationally reported counts of confirmed cases for various sources of under-detection [12]; this model estimates total SARS-CoV-2 infections, symptomatic illnesses, and hospitalized patients in the U.S. population from February 27 to September 30, 2020.

Reported confirmed cases
Persons with laboratory-confirmed SARS-CoV-2 infection by molecular diagnostics are reported to CDC through the Nationally Notifiable Disease Surveillance System (NNDSS) at the person level or as aggregate counts at the reporting jurisdiction level (e.g., state, territory, New York City, District of Columbia) [13,14]. The NNDSS uses a standardized case report form, including state of residence, age, hospital admission, and other demographic and clinical characteristics. Given data entry delays and incomplete national reporting, jurisdictions reported aggregated counts daily for the previous day. Probable, asymptomatic, and travel-associated cases were excluded from counts of confirmed cases used in this analysis.

Analytic methods
We applied a probabilistic multiplier model to adjust the reported numbers of confirmed symptomatic cases for factors affecting detection of persons with SARS-CoV-2 infection, a method previously used to estimate the incidence of H1N1pdm09 during the 2009 influenza pandemic [12]. This method uses confirmed cases and data on case detection and the asymptomatic fraction to estimate the cumulative number of hospitalized patients with SARS-CoV-2 infection, the total number with symptomatic illness, and the total number of infected persons (Figure 1).
To account for variability in detection of SARS-CoV-2, we stratified reported cases into hospitalized and non-hospitalized symptomatic cases, and further by age group (0-4 years, 5-17 years, 18-49 years, 50-64 years, 65 years and older), time period when the case was reported (February-March, April-May, June-July, August-September), and U.S. Department of Health and Human Services (HHS) region [15]. Age group was imputed for cases with missing birth date according to the age distribution within each HHS region and reporting time period. If hospitalization status was missing, we imputed the percentage of patients who were hospitalized based on reported cases with complete data within each age group, HHS region, and reporting time period. More details on this process are available in the supplemental material.
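The imputation of missing hospitalization status can be illustrated with a minimal Python sketch. It assumes a simple proportional allocation within one stratum (age group, HHS region, and reporting period); the function name and the counts are hypothetical, and the exact procedure is described in the supplemental material rather than here.

```python
def impute_hospitalized(known_hosp, known_nonhosp, missing):
    """Split cases with missing status using the stratum's observed hospitalized fraction.

    Hypothetical helper: assumes status is missing at random within the stratum.
    """
    known_total = known_hosp + known_nonhosp
    if known_total == 0:
        raise ValueError("stratum has no cases with known hospitalization status")
    frac_hosp = known_hosp / known_total
    hosp = known_hosp + missing * frac_hosp
    nonhosp = known_nonhosp + missing * (1.0 - frac_hosp)
    return hosp, nonhosp


# Made-up stratum: 140 hospitalized, 860 not hospitalized, 500 with unknown status.
est_hosp, est_nonhosp = impute_hospitalized(140, 860, 500)
print(round(est_hosp), round(est_nonhosp))
```

The allocation preserves the stratum total, so imputation changes only the split between hospitalized and non-hospitalized cases, not the number of reported cases.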
We adjusted case counts for three factors that affected national case detection of symptomatic cases: if a patient is symptomatic, they may not have sought medical attention or testing for their illness (parameter C); if a patient sought medical care, they may not have had a SARS-CoV-2 test completed (parameter B); or if a patient was tested, the SARS-CoV-2 assay used may have produced a false negative result because of its sensitivity to detect SARS-CoV-2 in the specimen (parameter A). We used several data sources to describe these factors (Table 2), with under-detection multipliers calculated as the inverse of the product of factors A-C. Each multiplier was calculated within strata of hospitalization status, age group, and reporting time period, as data were available, and applied to the relevant stratified case counts to estimate the number of symptomatic cases within that stratum.
After adjustment, we summed the strata to obtain the number of estimated symptomatic cases and applied one more source of under-detection, namely that a person infected with SARS-CoV-2 may never show clinical symptoms (parameter D), to estimate the number of total infections in the population.
For all parameters and strata, we included a range of values; estimates were calculated using Latin hypercube sampling with 10,000 iterations, with 95% uncertainty intervals estimated as the 2.5th and 97.5th percentile range. Population rates were estimated using bridged-race population estimates from CDC Wonder [16].
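As a worked illustration of the multiplier for a single stratum, the Python sketch below computes the inverse of the product of parameters A, B, and C and applies it to a reported count. The probabilities and the case count are hypothetical values chosen for readability, not the Table 2 inputs, and the function name is an assumption.

```python
def under_detection_multiplier(p_test_positive, p_tested, p_sought_care):
    """Inverse of the product of detection probabilities A, B, and C."""
    for p in (p_test_positive, p_tested, p_sought_care):
        if not 0.0 < p <= 1.0:
            raise ValueError("each probability must be in (0, 1]")
    return 1.0 / (p_test_positive * p_tested * p_sought_care)


# Hypothetical stratum: A = 0.90, B = 0.50, C = 0.40.
multiplier = under_detection_multiplier(0.90, 0.50, 0.40)
reported_cases = 1000
estimated_symptomatic = reported_cases * multiplier
print(round(multiplier, 2), round(estimated_symptomatic))
```

Because the three probabilities multiply, a modest shortfall at each step compounds: here only 18% of symptomatic cases would be detected, giving a multiplier of about 5.6.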
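The final step can be sketched in Python as a division of estimated symptomatic illnesses by the probability that an infected person is symptomatic (parameter D). The value of D used below is an assumption: it is the fraction implied by the headline totals in the abstract (44.8 million illnesses and 52.9 million infections), not the Table 2 input, and the function name is hypothetical.

```python
def total_infections(symptomatic_illnesses, p_symptomatic):
    """Scale symptomatic illnesses up to all infections using parameter D."""
    if not 0.0 < p_symptomatic <= 1.0:
        raise ValueError("p_symptomatic must be in (0, 1]")
    return symptomatic_illnesses / p_symptomatic


# Illustrative D only: symptomatic fraction implied by the abstract's point estimates.
implied_d = round(44.8 / 52.9, 3)
infections = total_infections(44.8e6, implied_d)
print(implied_d, round(infections / 1e6, 1))
```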
Analyses were completed in R (version 3.6.1).
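The sampling scheme can be sketched as follows. This is a Python illustration rather than the authors' R code, and it makes two simplifying assumptions that should be read as such: the parameter ranges are hypothetical, and each parameter is drawn uniformly across its range instead of from the distributions described for Table 2.

```python
import random
from statistics import quantiles


def latin_hypercube(n_samples, n_params, rng):
    """Return points in [0, 1)^n_params with one point per equal-width stratum per dimension."""
    columns = []
    for _ in range(n_params):
        # One draw inside each of the n_samples strata, then shuffle the order.
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(column)
        columns.append(column)
    return list(zip(*columns))


# Hypothetical (low, high) ranges for parameters A, B, and C.
RANGES = [(0.80, 0.98), (0.30, 0.60), (0.25, 0.50)]

rng = random.Random(2020)
draws = latin_hypercube(10_000, len(RANGES), rng)

multipliers = []
for point in draws:
    product = 1.0
    for u, (low, high) in zip(point, RANGES):
        product *= low + u * (high - low)
    multipliers.append(1.0 / product)

# quantiles(n=40) returns cut points every 2.5%; first and last bound the 95% interval.
cuts = quantiles(multipliers, n=40)
lower, upper = cuts[0], cuts[-1]
print(round(lower, 2), round(upper, 2))
```

Compared with simple random sampling, stratifying each parameter's range guarantees that every part of the range is represented in the 10,000 iterations.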

Sources of under-detection of cases

Parameter A. SARS-CoV-2 assay sensitivity
Patients infected with SARS-CoV-2 may not always test positive. Sensitivity of approved molecular diagnostic assays may be affected by the limits of detection of specific assays [10], specimen quality, source, handling, and timing of collection [11]. In a systematic review, 2%-21% of patients ultimately confirmed to have SARS-CoV-2 infection did not have a positive result unless multiple tests were performed over several days [17]. This review was used to estimate the probability that a specimen with SARS-CoV-2 will test positive (Table 2). For simplicity, since reported assay specificity has been high, with false positive results ranging between 1% and 4% [18,19], we did not adjust for potential false positives.

Parameter B. SARS-CoV-2 assay ordered and test completed
Patients with SARS-CoV-2 infection who are not tested with molecular assays are not included in confirmed case counts. To characterize testing probabilities, we used data from two sources on healthcare visits and SARS-CoV-2 testing, and estimated this parameter separately for hospitalized and non-hospitalized patients. To capture the variability in testing practices across data sources, we represented this parameter using a beta PERT distribution centered on the median value and ranging between the minimum and maximum values reported across both data sources within each stratum of age (Table 2). The beta PERT distribution is a continuous probability distribution that emphasizes the most likely values in an acceptable range of parameter values (i.e., more often drawing closer to the middle value of the interval, with a smaller probability on the extremes of the interval).
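A beta PERT draw can be sketched in Python using the standard rescaled-beta construction. Two points are assumptions rather than details stated in the text: the conventional shape weight of 4, and the use of the median across data sources as the most likely value. The minimum, most likely, and maximum values below are hypothetical.

```python
import random


def sample_beta_pert(minimum, most_likely, maximum, rng, shape=4.0):
    """Draw one value from a beta PERT distribution on [minimum, maximum]."""
    if not minimum <= most_likely <= maximum or minimum == maximum:
        raise ValueError("require minimum <= most_likely <= maximum and minimum < maximum")
    span = maximum - minimum
    alpha = 1.0 + shape * (most_likely - minimum) / span
    beta = 1.0 + shape * (maximum - most_likely) / span
    # Rescale a standard beta draw from [0, 1] onto [minimum, maximum].
    return minimum + span * rng.betavariate(alpha, beta)


rng = random.Random(42)
# Hypothetical testing probabilities: minimum 0.20, most likely 0.35, maximum 0.60.
pert_draws = [sample_beta_pert(0.20, 0.35, 0.60, rng) for _ in range(5000)]
pert_mean = sum(pert_draws) / len(pert_draws)
print(round(pert_mean, 2))
```

With a shape weight of 4, the distribution's mean is (minimum + 4 × most likely + maximum) / 6, so draws concentrate near the central value while still covering the full reported range.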
The first source of data was the IBM Watson Health Explorys electronic health record (EHR) database (IBM, Armonk, NY), which includes >39 health system partners across the country. We identified visits with an ICD-10 diagnosis or SNOMED code that indicated an acute respiratory illness (ARI) (Supplemental Table 5) and the number of those with evidence of SARS-CoV-2 test results from LOINC codes for SARS-CoV-2 RT-PCR tests (Supplemental Table 6). For each setting (inpatient, outpatient ED), visits and tests performed were aggregated into strata for time period and age group.
We also included rates of testing in the COVID Near You (CNY) survey platform. CNY is a website application where participants can self-report symptoms, healthcare seeking behaviors, and SARS-CoV-2 testing information [20][21][22]. COVID-like illness (CLI) was defined using self-reported presence of shortness of breath or cough, or two or more of: self-reported fever, chills, sore throat, body ache, headache, or loss of taste or smell. Proportions of individuals who self-reported receiving a SARS-CoV-2 test among those who sought care for CLI were estimated for each time period with available data by HHS region and age group (Table 2, Supplemental Table 4).
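The CLI case definition can be written as a small classifier. The Python sketch below follows the definition as stated in the text; the symptom strings and the function name are hypothetical labels, not the survey's actual field names.

```python
# Either of these symptoms alone is sufficient.
PRIMARY_SYMPTOMS = {"shortness of breath", "cough"}
# Otherwise, at least two of these are required.
SECONDARY_SYMPTOMS = {
    "fever", "chills", "sore throat", "body ache", "headache", "loss of taste or smell",
}


def meets_cli(reported_symptoms):
    """Return True if the self-reported symptoms satisfy the CLI definition."""
    symptoms = set(reported_symptoms)
    if symptoms & PRIMARY_SYMPTOMS:
        return True
    return len(symptoms & SECONDARY_SYMPTOMS) >= 2


print(meets_cli({"cough"}))
print(meets_cli({"fever", "headache"}))
print(meets_cli({"fever"}))
```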

Parameter C. Symptomatic patient seeks care/testing
A symptomatic person with SARS-CoV-2 infection will not be included in confirmed case counts if they never sought medical attention or testing services. To estimate healthcare seeking, we used data obtained from both CNY and Flu Near You (FNY) [23], which has conducted participatory surveillance for influenza-like illnesses since 2011, to better capture the full time period and differences between participants of the two systems. We considered a range of symptomatic illness including: (1) CLI as described above, but excluding loss of taste or smell for FNY, which was not captured in that platform; (2) a more specific case definition of fever, and either cough or shortness of breath; and (3) a broader case definition of at least one of fever, cough, or shortness of breath. Among patients who met the given case definition, we calculated the proportion that reported visiting a doctor's office, urgent care clinic, outpatient clinic, emergency department, testing center, telemedicine, or other healthcare setting for symptoms. Care seeking proportions were included using a beta PERT distribution of the median and range of values across the three case definitions and two data sources, stratified by report date and age group (Table 2, Supplemental Table 2).
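To show how the three case definitions and two data sources could be condensed into inputs for one stratum, the Python sketch below takes the minimum, median, and maximum of six care-seeking proportions. All six proportions are invented for illustration, and the dictionary keys are hypothetical labels.

```python
from statistics import median

# Hypothetical care-seeking proportions for one age group and time period,
# keyed by (data source, case definition).
care_seeking = {
    ("CNY", "CLI"): 0.30,
    ("CNY", "specific"): 0.40,
    ("CNY", "broad"): 0.25,
    ("FNY", "CLI"): 0.28,
    ("FNY", "specific"): 0.35,
    ("FNY", "broad"): 0.22,
}

values = list(care_seeking.values())
# Minimum, central value, and maximum that would parameterize the distribution.
pert_minimum, pert_center, pert_maximum = min(values), median(values), max(values)
print(pert_minimum, round(pert_center, 2), pert_maximum)
```

Using the range across definitions and sources lets uncertainty about which definition best matches COVID-19 illness flow into the final uncertainty interval.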

Parameter D. Patient is symptomatic if infected with SARS-CoV-2
Some people infected with SARS-CoV-2 do not experience symptoms [24]. To estimate the number of infections in the population, we adjusted the sum of hospitalized and symptomatic non-hospitalized cases based on the proportion of persons with confirmed COVID-19 and no symptoms from a meta-analysis of available literature (Table 2) [17].

National case reporting
During February 27-September 30, 2020, there were 6,891,764 confirmed cases of symptomatic COVID-19 acquired domestically and reported nationally through individual or aggregate case counts. We estimated that approximately 14% of these patients had been hospitalized, with variation by age group, case report date, and HHS region (Table 1). [...] Table 3).

Non-hospitalized symptomatic illnesses
We estimated 7.1 (95% UI: 5.8-9.0) non-hospitalized symptomatic illnesses for every one non-hospitalized case reported nationally, with variation by age group, HHS region, and report date. Under-detection multipliers decreased over time and were consistently highest among children (Supplemental Table 3).
We summed the estimated hospitalized (Table 3) and non-hospitalized (Supplemental Table 5) illnesses for a total of 44.8 million symptomatic illnesses (Table 4). The highest rates of symptomatic illness were among adults 18- [...] (Table 5). This indicates that 1 in 7.7, or 13%, of total infections were identified and reported. Detection varied by age, with lower detection rates among children, but with improvements over time (Supplemental Table 4).
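The detection ratio quoted above can be checked with simple arithmetic. The Python sketch below uses the reported case count from the national case reporting section and the 52.9 million total infections given in the abstract; the variable names are illustrative.

```python
reported_cases = 6_891_764        # confirmed symptomatic cases reported nationally
estimated_infections = 52.9e6     # total infections estimated in the abstract

infections_per_reported_case = estimated_infections / reported_cases
percent_detected = 100.0 * reported_cases / estimated_infections

print(round(infections_per_reported_case, 1))  # 7.7
print(round(percent_detected))                 # 13
```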

Conclusions
We estimated that nearly 53 million SARS-CoV-2 infections, including 44.8 million symptomatic illnesses and 2.4 million associated hospitalizations, may have occurred in the U.S. through September 30, 2020, with variation by geographic region, age group, and time. These preliminary estimates demonstrate the large incidence of disease in the U.S. population and better quantify the impact of the COVID-19 pandemic on the healthcare system and society, and will be updated as more data on under-detection become available.
[...] at least 10 (range by U.S. site: 6-24) for every reported case [3], with improvements in this ratio by later time points. Severe cases were more likely to be detected and reported; we estimated 2.5 hospitalized patients for each hospitalized case reported. In the Explorys EHR data, the proportion of ICU patients tested for SARS-CoV-2 was >90% by the end of September, though testing remained lower among other inpatients with ARI, and even lower for ARI visits in outpatient settings (Supplemental Figure 1).
For comparison, COVID-NET is an active, population-based surveillance system for laboratory-confirmed SARS-CoV-2-associated hospitalizations in defined areas of 14 states [26]. While direct comparisons with COVID-NET are imperfect due to the narrower geographic area of the surveillance sites, in 10 of the 14 sites, our estimated hospitalization rates by region were 1.5-3.5 times higher than the reported rates from individual sites within those regions by the end of September, similar to the range of our estimated under-detection multiplier for confirmed hospitalizations. Likewise, COVID-NET showed similar trends across age; adults aged ≥65 years had 5-6 times higher rates of hospitalizations than younger adults aged 18-49 years [27]. Both also showed lower hospitalization rates among children [28,29]. For comparison of population-level incidence of infection, the estimated 52.9 million infections represent approximately 16% of the U.S. population, ranging from 9%-31% across regions of the country. This is higher than seroprevalence estimates from a nationwide commercial laboratory seroprevalence survey, which found that 1%-22% of various state populations had antibodies to SARS-CoV-2 by early August, though our estimates include two more months of circulation [31]. There remain uncertainties in the interpretation of seroprevalence estimates, including how they vary by the population surveyed, the serologic assays used, the proportion of infected cases with a detectable antibody response, and how long antibody detection persists after infection.
Additional studies and sources of data on population-based incidence will help resolve these concerns and provide better national estimates of illness and infection.
We recognize that our model has limitations. From almost a decade of monitoring data on testing practices for influenza [32,33], testing rates and the use of more sensitive molecular testing have varied by jurisdiction, care setting, age, and disease severity [34]. The availability and use of testing for SARS-CoV-2 has changed rapidly over time; thus far, data on the proportion of persons who are tested for COVID-19 and how this varies across all the previously described factors remain limited. Although data on testing by time, healthcare setting, and age were available, they lacked the coverage to allow for geographic-specific model inputs. These data limitations could have resulted in overestimation of cases from areas with higher testing rates, including some hospitals that are performing universal testing, or areas that have more outpatient testing facilities and active contact tracing. Likewise, we may underestimate cases in areas with lower testing and contact tracing. Additionally, some infections, such as those among healthcare workers or from outbreaks in congregate residential settings, may be more likely to be tested and nationally reported compared with the general population, which could lead to overestimation of non-hospitalized cases and infections. We continue to seek information on the proportion of cases and testing rates in various settings to improve estimates. With limited but growing information regarding the spectrum of clinical manifestations from SARS-CoV-2 infection, there could be a lower index of suspicion of COVID-19 for patients who present with nonspecific and non-respiratory symptoms; these cases may be less likely to be detected and reported. All of this highlights the importance of having data to monitor the proportions of patients with different clinical syndromes who are being tested for SARS-CoV-2 infection in a variety of healthcare and geographic settings, and not just total numbers of tests performed.
Finally, in some heavily affected areas, the size of the outbreaks exceeded capacities to complete detailed case reporting, including patient age and hospitalization status. For cases with missing hospitalization status, we imputed the proportion of reported cases that were hospitalized from the subset with complete data, but it is unclear if age and hospitalization status were missing at random [35]. If not random, and the data were more complete for hospitalized patients, the true hospitalization ratio would be lower than we imputed, and the number of hospitalized cases would be lower than we estimated. Furthermore, this was hospitalization status at the time of the case report, and would miss those who were diagnosed as outpatients but were hospitalized after they were reported as a case; thus our estimates of hospitalization may be an underestimate.
Despite these limitations, our model provides a relatively simple approach to illustrate why there are more persons who have had a SARS-CoV-2 infection than the reported confirmed case counts at multiple levels of disease severity. We used data currently available to provide a preliminary estimate of the overall incidence of SARS-CoV-2 infection, illness, and hospitalization in the U.S. CDC is actively working on refining methods to synthesize information across multiple data sources to better describe the national burden of SARS-CoV-2 infection on an ongoing basis and will update estimates as data become available.
In summary, we estimated that in the U.