Development and validation of a prognostic model based on comorbidities to predict COVID-19 severity: a population-based study

Abstract
Background: The prognosis of patients with COVID-19 infection is uncertain. We derived and validated a new risk model for predicting progression to disease severity, hospitalization, admission to intensive care unit (ICU) and mortality in patients with COVID-19 infection (Gal-COVID-19 scores).
Methods: This is a retrospective cohort study of patients with COVID-19 infection confirmed by reverse transcription polymerase chain reaction (RT-PCR) in Galicia, Spain. Data were extracted from electronic health records of patients, including age, sex and comorbidities according to International Classification of Primary Care codes (ICPC-2). Logistic regression models were used to estimate the probability of disease severity. Calibration and discrimination were evaluated to assess model performance.
Results: The incidence of infection was 0.39% (10 454 patients). A total of 2492 patients (23.8%) required hospitalization, 284 (2.7%) were admitted to the ICU and 544 (5.2%) died. The variables included in the models to predict severity included age, gender and chronic comorbidities such as cardiovascular disease, diabetes, obesity, hypertension, chronic obstructive pulmonary disease, asthma, liver disease, chronic kidney disease and haematological cancer. The models demonstrated a fair–good fit for predicting hospitalization {AUC [area under the receiver operating characteristics (ROC) curve] 0.77 [95% confidence interval (CI) 0.76, 0.78]}, admission to ICU [AUC 0.83 (95% CI 0.81, 0.85)] and death [AUC 0.89 (95% CI 0.88, 0.90)].
Conclusions: The Gal-COVID-19 scores provide risk estimates for predicting severity in COVID-19 patients. The ability to predict disease severity may help clinicians prioritize high-risk patients and facilitate the decision making of health authorities.


Introduction
In December 2019, China reported to the World Health Organization (WHO) several cases of pneumonia of unknown origin in Wuhan, in the province of Hubei. 1 These cases were later confirmed to be caused by a novel coronavirus, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and the disease was named coronavirus disease 2019 (COVID-19). 2 The disease rapidly spread to most countries in the world. By 31 May 2020, 5.9 million people had become infected and 367 166 had died. 3 The clinical manifestations range from asymptomatic infection to pneumonia, which can progress to acute respiratory distress syndrome, multiorgan failure and, ultimately, death. 1,4-6 About 80% of reported cases have mild symptoms, but 15-20% progress to severe pneumonia, which causes death in 1-5% of patients. According to predictive models, 7 age and the presence of particular comorbidities (hypertension, cardiorespiratory disease or diabetes) 5 are associated with a higher risk of disease progression.
No specific therapies or vaccines have yet been developed to prevent or reduce the risk of developing complications of COVID-19. For health authorities to be able to allocate the resources necessary in each health district, it is crucial that COVID-19 patients at a higher risk of developing severe disease [i.e. hospitalization, ICU (intensive care unit) admission or death] 8-10 are identified early and accurately.
The purpose of this study was to develop and validate a prognostic model to identify patients with COVID-19 infection at a higher risk of hospitalization, ICU admission and death, based on their age, gender, comorbidities and geographic place of residence. These data are available in the electronic health records (EHR) of Primary Care centres and are classified in accordance with the International Classification of Primary Care (ICPC-2).

Source of data
A retrospective cohort study was performed of patients diagnosed with COVID-19 in any of its clinical forms in Galicia, Spain, from 6 March 2020, when the first case in the region was reported, to 7 May 2020. Galicia is a region in the northwest of Spain [area: 29 574.4 km²; population: 2 700 441 inhabitants (1 303 453 males); population density: 91.3 inhabitants/km²] with a mean age of 47.2 years and 515 488 people >70 years old (19.1% of the total population).
Data were collected from the Galician Health Service (Servizo Galego de Saúde, SERGAS) database, which contains longitudinal data of the population in Galicia. The SERGAS database is based on data from 63 databases of healthcare services serving >95% of the population 11 including public healthcare services, hospitals, primary care centres, pharmacies, emergency services, state-subsidized health entities and stakeholders. Epidemiological data were obtained using ICPC-2 from EHR of Primary Care centres using an automated technique.
The study was conducted in accordance with the guidelines of the Declaration of Helsinki and the principles of good clinical practice and was approved by the Institutional Review Board (IRB) of the Galician Health Service on 3 April 2020 (#2020/194). Informed consent forms were waived by the IRB.

Definitions
A confirmed case of COVID-19 was defined as a positive reverse transcription polymerase chain reaction (RT-PCR) test on samples obtained from nasal or throat swabs performed in accordance with WHO protocol. 12 RT-PCR was performed in people with symptoms consistent with COVID-19 (i.e. fever, chills, severe tiredness, sore throat, cough, shortness of breath, headache, anosmia or ageusia, and nausea, vomiting or diarrhoea), or contact with suspected or confirmed cases. Only laboratory-confirmed cases are registered in a single database and were considered for analysis.
Patients with uncomplicated disease, an oxygen saturation (SaO₂) >95% and a respiratory rate <25 breaths/min, both those considered low-risk (<60 years of age and without comorbidities) and those considered high-risk (>60 years and with comorbidities), were monitored as follows. (i) At home by the TELEA system, a home monitoring platform for monitoring respiratory and heart rate, temperature and SaO₂. 13 (ii) Patients without internet connection at home were monitored via 2-3 telephone calls daily. If the clinical status of the patient deteriorated, a physician contacted them to decide whether hospitalization was required. (iii) Previously institutionalized patients or those without enough assistance at home were transferred to a socio-health centre adapted as a hospital. All patients diagnosed with COVID-19 pneumonia were hospitalized. Pneumonia was defined as an acute respiratory disorder characterized by cough, at least one new consolidation on chest X-ray, and fever lasting four or more days, or dyspnea/tachypnea. 14 COVID-19 was considered severe and the patient was a candidate for ICU admission if they required mechanical ventilation or had a fraction of inspired oxygen of ≥60%. 15 Each patient's total comorbidity burden was determined by the Charlson Comorbidity Index, which predicts 10-year survival in patients with multiple comorbidities. 16

Outcomes
We focused on three key outcomes: hospitalization, ICU admission and death from any cause after RT-PCR diagnosis during the study period.

Predictors
Based on a review of existing literature to identify comorbidities associated with COVID-19 prognosis, 5,9 the following data were extracted from the EHR of each patient: age, sex and ICPC-2 comorbidities [allergy, lymphoma/leukaemia, acquired immune deficiency syndrome (AIDS), malignant neoplasm, peptic ulcer, chronic enteritis, liver disease, ischaemic heart disease, heart failure, atrial fibrillation, heart valve disease, hypertension, cerebrovascular disease, peripheral vascular disease, rheumatoid arthritis, alcohol and drug abuse, tobacco abuse, dementia, psychosis, chronic obstructive pulmonary disease, asthma, malignant neoplasm of skin, psoriasis, obesity, diabetes, lipid disorder and chronic kidney disease]. An individual was considered to have any of these conditions if they had suffered it at some point of their life. A detailed description of comorbidities and their corresponding codes is available in Supplementary

Statistical analysis
We used a random sample of 70% of patients to derive the multivariable logistic regression models and assess their performance, and the remaining 30% for validation. First, in the derivation sample, all predictors described above were included in the models to estimate the probability of hospital admission, progression to critical disease or death. Final models included age, sex and a combination of comorbidities. Beginning with a model containing all potential covariates, the variable with the least significant P value was removed and the reduced model was tested using the likelihood-ratio test; this was repeated until all variables left in the model contributed significantly (at α = 0.05).
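The two building blocks of this procedure, the random 70/30 split and the likelihood-ratio test for dropping a single variable, can be sketched as follows. This is a minimal Python illustration and not the authors' R code: the function names, the seed and the use of integer patient identifiers are assumptions, and fitting the logistic models themselves (which would supply the two log-likelihoods) is left out.

```python
import math
import random


def split_derivation_validation(patient_ids, derivation_fraction=0.7, seed=2020):
    """Randomly assign patients to a derivation and a validation sample.

    The seed is an arbitrary, hypothetical choice that only makes the
    split reproducible.
    """
    ids = list(patient_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    cut = round(len(ids) * derivation_fraction)
    return ids[:cut], ids[cut:]


def likelihood_ratio_p_value(loglik_full, loglik_reduced):
    """P value of the likelihood-ratio test for removing ONE coefficient.

    The statistic 2 * (loglik_full - loglik_reduced) is referred to a
    chi-square distribution with 1 degree of freedom, whose upper tail
    probability equals erfc(sqrt(statistic / 2)).
    """
    statistic = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical identifiers 0..10453 standing in for the 10 454 patients.
derivation, validation = split_derivation_validation(range(10454))

# Hypothetical log-likelihoods of a model with and without one covariate.
p_value = likelihood_ratio_p_value(loglik_full=-100.0, loglik_reduced=-103.0)
keep_variable = p_value < 0.05
```

In a backward elimination loop, a variable would be dropped whenever its removal gives a P value above the chosen alpha, and the loop would stop once every remaining variable is retained.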
Results are presented as odds ratios (ORs) with 95% confidence intervals (CIs). The Nagelkerke R² was used to calculate the proportion of the variance of clinical outcomes explained by the selected predictors. The different aspects of model performance were studied, including calibration and discrimination. Calibration was assessed using the Brier score and by plotting the non-parametric estimate of the association between the observed frequencies and the predicted probabilities. 17 The receiver operating characteristics (ROC) curves [and the corresponding area under the ROC curve (AUC)] were calculated to test for discrimination. To correct for optimism, internal validation was performed for each model using the bootstrap procedure with 500 bootstrapped samples. 17 The final models were selected to derive scores for clinical use (Gal-COVID-19 score), and nomograms were created. Criteria for this selection included both discriminant ability (defined by the AUC) and model simplicity. Finally, the coefficients (scores) derived from the derivation cohort were also validated on the validation cohort.
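For readers less familiar with these performance measures, the sketch below shows how the Brier score, the AUC and the Nagelkerke R² can be computed from observed outcomes (0/1) and predicted probabilities. It is a plain-Python illustration under stated assumptions, not the rms routines used in the study: the function names and the toy data are hypothetical, predicted probabilities are assumed to lie strictly between 0 and 1, and the bootstrap optimism correction is not shown.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random case is ranked above a random non-case.

    Ties between a case and a non-case count as one half.
    """
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_cases = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((pc > pn) + 0.5 * (pc == pn) for pc in cases for pn in non_cases)
    return wins / (len(cases) * len(non_cases))


def nagelkerke_r2(y_true, y_prob):
    """Cox-Snell R2 rescaled by its maximum attainable value."""
    n = len(y_true)
    loglik_model = sum(
        math.log(p if y == 1 else 1.0 - p) for y, p in zip(y_true, y_prob)
    )
    prevalence = sum(y_true) / n  # intercept-only (null) model
    loglik_null = sum(
        math.log(prevalence if y == 1 else 1.0 - prevalence) for y in y_true
    )
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / max_cox_snell


# Toy data, purely illustrative.
outcomes = [0, 0, 1, 1]
predictions = [0.1, 0.4, 0.35, 0.8]

toy_brier = brier_score(outcomes, predictions)
toy_auc = auc(outcomes, predictions)
toy_r2 = nagelkerke_r2(outcomes, predictions)
```

A lower Brier score indicates better overall accuracy, an AUC of 0.5 corresponds to no discrimination, and a Nagelkerke R² of 0 corresponds to a model that predicts no better than the overall event rate.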
To verify the robustness of the models, additional regression analyses were performed to predict admission to ICU and mortality in those patients who had completed the course of the disease.
All statistical analyses were carried out in R version 3.5.1 using the packages BayesX and rms. These packages are freely available at http://cran.r-project.org. The analysis conforms to the reporting standards of TRIPOD. 18

Results
A total of 10 454 subjects [4172 men (40%); mean age 58 years] acquired the disease, which accounts for 0.39% of the population. Of them, 2492 cases (23.8%) required hospitalization, 284 (2.7%) were admitted to the ICU and 544 (5.2%) died [of whom 154 (28.3%) had not been hospitalized].
The median length of stay was 19 days (interquartile range 7, 38). At the end of the study, 291 (11.6%) hospitalized patients had not yet been discharged, been transferred to the ICU or died. The patients who were still hospitalized were older and had fewer comorbidities than hospitalized patients who had been discharged or died. The median ICU stay was 15 days (interquartile range 3, 28). Of the patients admitted to ICU, 43 (15.1%) had not yet been discharged or died. Figure 1 shows the distribution of COVID-19 by age and gender and the incidence of the disease by age group. Supplementary Table 2, available as Supplementary data at IJE online, shows the total population of Galicia, the number of COVID-19 positives and their distribution by age and gender. Figure 2 gives the distribution of all laboratory-confirmed cases of COVID-19 reported by municipalities in Galicia in accordance with official statistics, expressed as absolute values and as percentages of the population. The highest incidence was observed in municipalities located in the southeast of the region, which coincides with the highway that connects Madrid with Galicia. Table 1 displays the demographic characteristics (age and gender) and comorbidities of patients. Severity of disease increases with age and frequency of comorbidities, and is higher in men than in women (Supplementary Figure 1, available as Supplementary data at IJE online). Notably, the patients who died out of hospital were older and had a higher prevalence of chronic conditions such as dementia, dependence and immobilization. The median Charlson Comorbidity Index Score was 2 (interquartile range 1, 4).

Predictors in the derivation sample and performance of the models
The variables included in the model to predict hospitalization were age, gender, dependence, heart failure, hypertension, rheumatoid arthritis, tobacco abuse, dementia, chronic obstructive pulmonary disease, asthma, obesity and diabetes. The Nagelkerke R² of the Gal-COVID-19 scores and the Charlson index in the validation samples were 0.24 and 0.19, respectively, for predicting hospitalization, 0.17 and 0.004, respectively, for predicting ICU admission, and 0.25 and 0.23, respectively, for predicting death. We fitted regression models with the validation data to estimate the coefficients for each risk factor. None of these coefficients differed significantly from those in the derivation sample for hospitalization, ICU admission or death (Figure 3).
Participants in the validation dataset were divided into groups of predicted probabilities according to the distribution of the Gal-COVID-19 scores for risk of hospitalization, ICU admission and death. Overall, the observed rates of hospitalization were similar to those predicted by the Gal-COVID-19 scores in the groups of predicted risk in the validation dataset. There was slight under- and overestimation of risk in the highest-risk strata for ICU admission and death, respectively (Figure 4).
Figures in the supplementary material illustrate a method to estimate the risk of progression to hospitalization, ICU admission and death based on an overall score calculated by the sum of the individual scores obtained in the variables of the model (Supplementary Figures 2, 3 and 4, respectively, available as Supplementary data at IJE online). Table 3 shows the individual score of each of the predictors for predicting hospitalization, ICU admission and death. Supplementary material provides a spreadsheet in
The distribution of risk of hospitalization, ICU admission and death is shown in Figure 5. The risk thresholds of 5, 10 and 25% for hospitalization accounted for 7.4, 30.8 and 59.7% of the COVID-19 population, respectively. For predicting ICU admission, the risk thresholds of 1 and 5% accounted for 46.5 and 84.8% of the COVID-19 population, respectively. For predicting death, the risk thresholds of 1 and 5% accounted for 50.3 and 71.6% of the COVID-19 population, respectively.
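The comparison of observed and predicted event rates within groups of predicted risk can be sketched as below. This is an illustrative Python example rather than the procedure used in the study: the function name and the toy data are hypothetical, and the cut points simply reuse the 5, 10 and 25% thresholds quoted for hospitalization, with an additional open-ended top band.

```python
def calibration_by_risk_group(y_true, y_prob, cut_points):
    """Compare mean predicted risk with observed event rate per risk band.

    cut_points are ascending upper bounds (exclusive); a final open-ended
    band collects every prediction at or above the last cut point.
    Empty bands are reported with None for both rates.
    """
    bounds = list(cut_points) + [float("inf")]
    bands = [{"n": 0, "events": 0, "predicted_sum": 0.0} for _ in bounds]
    for y, p in zip(y_true, y_prob):
        index = next(i for i, upper in enumerate(bounds) if p < upper)
        bands[index]["n"] += 1
        bands[index]["events"] += y
        bands[index]["predicted_sum"] += p
    summary = []
    for upper, band in zip(bounds, bands):
        n = band["n"]
        summary.append({
            "upper_bound": upper,
            "n": n,
            "observed_rate": band["events"] / n if n else None,
            "mean_predicted": band["predicted_sum"] / n if n else None,
        })
    return summary


# Toy validation data, purely illustrative.
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
predictions = [0.02, 0.04, 0.08, 0.09, 0.2, 0.3, 0.6, 0.9]

groups = calibration_by_risk_group(outcomes, predictions, [0.05, 0.10, 0.25])
```

In a well-calibrated model, `observed_rate` and `mean_predicted` are close to each other in every band; systematic gaps in the top bands would correspond to the under- or overestimation described for the highest-risk strata.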
Of the hospitalized patients, 291 (11.6%) had not yet been discharged, admitted to the ICU or died. No differences in estimates were found between analyses performed to estimate the risk of ICU admission or death, respectively, on all patients and on those who had completed the course of the disease (Supplementary Tables 4 and 5, available as Supplementary data at IJE online).
Figure 4 Calibration plots of the final models for predicting hospitalization, ICU admission and death in the derivation cohort. The dotted line shows the actual relation between observed outcomes and predicted risks; the solid line shows the smoothed relation. Ideally, these lines equal the dashed diagonal line that represents perfect calibration.
Enter the scores for gender and comorbidities in the left column and the patient's age in the middle column. Sum the points obtained for each of the predictors. Enter the total score and the associated risk for hospitalization, admission and death in the right column. For example, a 60-year-old woman with asthma and diabetes without any other comorbidity would sum 132 points [(age = 60 years, 65 points) + (dependence = no, 14 points) + (dementia = no, 22 points) + (asthma = yes, 21 points) + (diabetes = yes, 10 points)]. The risk for hospitalization would be >30%. Likewise, proceed to estimate the risk for ICU admission and death.
b COPD, chronic obstructive pulmonary disease.
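The arithmetic of the worked example above can be checked with a few lines of Python. The point values are the ones quoted in the example for the hospitalization score; the dictionary layout and the function name are illustrative assumptions, and the conversion from total score to risk is not reproduced because it depends on the published table.

```python
# Points quoted in the worked example (hospitalization score).
example_points = {
    "age = 60 years": 65,
    "dependence = no": 14,
    "dementia = no": 22,
    "asthma = yes": 21,
    "diabetes = yes": 10,
}


def total_score(points):
    """Sum the points obtained for each predictor."""
    return sum(points.values())


total = total_score(example_points)
print(total)  # 132
```

The total of 132 points matches the value given in the example, for which the stated risk of hospitalization is >30%.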

Discussion
A population-based study was performed to derive and validate new risk models for predicting hospitalization, ICU admission and mortality in patients with COVID-19 infection. Data were extracted from EHR generated by general practitioners as part of their routine practice. The results of this study are relevant because the course of COVID-19 is unpredictable and the disease may be fatal. In most cases, the disease can be controlled by closely monitoring its course. However, critical patients require hospitalization, the administration of aggressive treatments and critical care.
In Galicia, 0.39% of the population tested positive for COVID-19 in RT-PCR tests. A seroepidemiological population-based survey sponsored by the Spanish Government (ENE-Covid19) revealed that 2.1% of this population was positive for COVID-19. 19 This inconsistency may be due to the fact that one in three infections seems to be asymptomatic, while a substantial number of symptomatic cases were not tested. It may affect the evaluation of the extent of the epidemic but not the models for predicting outcomes.
In total, 28.3% of patients who died (154 patients) had not been hospitalized. This may be explained by the fact that previously institutionalized patients, who had an older age, disabilities, and chronic terminal illnesses, were transferred to socio-health centres adapted as a hospital. Unlike other regions in Spain, the COVID-19 pandemic did not lead to shortages of hospitalization and ICU facilities. Since there were no clear guidelines in this regard, the decision for hospital admission and ICU admission was made following the primary role of beneficence and nonmaleficence in resource allocation according to pre-COVID criteria.
This predictive model is based on age, gender, presence of chronic diseases and risk factors (i.e. cardiovascular disease, neoplasm, diabetes, chronic obstructive pulmonary disease, obesity, hypertension, liver disease, chronic kidney disease). These factors have been demonstrated to be powerful predictors of progression and mortality 4,6,20-22 and are routinely recorded in primary care. The results of this study confirm that age is a risk factor for hospitalization, ICU admission and death. The effect of age on T- and B-cell function and excessive production of type 2 cytokines probably reduce control of viral replication and result in a prolonged inflammatory response, which facilitates disease progression. 23 In most studies, the disease had a higher prevalence in men, 1,4-7,24 whereas in our study the disease was more frequent in women (60.1% of cases). This may be due to the fact that studies generally include critical patients and, in this subgroup of patients, men are twice as likely as women to require hospitalization. Patients with severe disease have more comorbidities than patients with mild disease. 4,6,20 Thus, pneumonia may increase the risk of cardiovascular events, 25 and other diseases such as arterial hypertension, 26 diabetes 27,28 and obesity 29 may contribute to a poorer prognosis of COVID-19 infection.
The findings of this study have relevant clinical implications. Our prediction models may be useful to predict disease severity in patients with COVID-19 infection in primary care or community-based settings. This tool may help clinicians prioritize high-risk patients and decide whether they need to be referred to a hospital, where a diagnosis and appropriate treatment for the characteristics of the patient will be established. In addition, this predictive model identifies patients with severe disease who will probably need intensive care, and provides key information to the patient and their family on disease prognosis. Finally, these scores make it possible to establish risk levels, even arbitrarily, which may be useful to guide decision-making.
This study has two strengths. First, it is a population-based study that accounts for virtually the totality of cases of COVID-19 diagnosed in a well-defined region (Galicia, Spain). Second, this study is based on a high-quality, internally validated database of EHR that provides a large sample, reflects real-life conditions and includes individuals who are not generally recruited in cohort studies. There are other studies investigating the association between comorbidities and disease progression, but most have been conducted in hospitalized patients rather than in the general population. In population-based studies such as the UK-Biobank Cohort, subjects who had been previously evaluated were followed up until confirmation of COVID-19 infection or COVID-19-related admission. 30,31 Similar results were also found in an international study from six countries that developed and validated a risk score by analyzing electronic medical records. In the development database, the authors sampled 150 000 patients with influenza or flu-like symptoms. 32 In addition, a recent systematic literature review revealed ten prognostic models for predicting mortality or progression to severe disease, but only one study involved patients from countries other than China, and all studies had been categorized as being at a high risk of bias. 33
The study also has several limitations. First, since the model was developed based on a single population, the lack of external validation is a major limitation. In addition, the calibration results suggest that the model's performance should be assessed and recalibrated when used in other populations. Further studies are needed to generalize the clinical value of this predictive model in other geographic areas. Second, disease classification systems (ICPC-2 in this case) may lead to underdiagnosis. 34 Third, outcomes such as discharge disposition or death were not available for patients still in hospital at the end of the study, because they had not completed their hospital course. Although no differences were found in the estimates between analyses performed on the totality of patients and those who had completed the course of the disease, the probability of death or ICU admission may have been underestimated.

Conclusion
Our results provide evidence that age, gender and comorbidities, which are routinely recorded by general practitioners in EHR, may be useful to predict COVID-19 severity, need for hospitalization or ICU admission and death. This information may help clinicians to prioritize high-risk patients and facilitate the adoption of the appropriate healthcare strategies.

Supplementary data
Supplementary data are available at IJE online.

Data availability
The data underlying this article will be shared on reasonable request to the corresponding author.