A Systematic Review of Predictor Composition, Outcomes, Risk of Bias, and Validation of COVID-19 Prognostic Scores

Abstract Background Numerous prognostic scores have been published to support risk stratification for patients with coronavirus disease 2019 (COVID-19). Methods We performed a systematic review to identify the scores for confirmed or clinically assumed COVID-19 cases. An in-depth assessment and risk of bias (ROB) analysis (Prediction model Risk Of Bias ASsessment Tool [PROBAST]) was conducted for scores fulfilling predefined criteria: (I) area under the curve (AUC) ≥ 0.75; (II) a separate validation cohort present; (III) training data from a multicenter setting (≥2 centers); (IV) a point-scale scoring system. Results Out of 1522 studies extracted from MEDLINE/Web of Science (20/02/2023), we identified 242 scores for COVID-19 outcome prognosis (mortality 109, severity 116, hospitalization 14, long-term sequelae 3). Most scores were developed using retrospective (75.2%) or single-center (57.1%) cohorts. Predictor analysis revealed the primary use of laboratory data and sociodemographic information in mortality and severity scores. Forty-nine scores were included in the in-depth analysis. The results indicated heterogeneous quality and predictor selection, with only five scores featuring low ROB. Among those, based on the number and heterogeneity of validation studies, only the 4C Mortality Score can be recommended for clinical application so far. Conclusions The application and translation of most existing COVID-19 scores appear unreliable. Guided development and predictor selection would have improved the generalizability of the scores and may enhance pandemic preparedness in the future.

The coronavirus disease 2019 (COVID-19) pandemic has created a state of emergency in health systems across the globe [1]. Hospitals were overcrowded with patients, and decisions for their management had to be made quickly. At the same time, resource constraints limited the treatment of all patients with adequate therapies. Even in 2023, when the pandemic transitioned into an endemic state [2], the dynamically evolving variants of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) still cause severe disease, regardless of immunity, vaccines, and therapeutic interventions [3,4], especially in individuals of advanced age or with comorbidities [5].
Especially during the first wave of the pandemic, scientists and clinicians rushed their efforts to support decision making, often trying to define thresholds of defined symptoms or scores. Such clinical prognostic scores are derived from models that estimate an individual's probability for a particular condition by combining and weighting predictive factors, mainly in an easy-to-apply manner (eg, additive point systems). Compared to a more complex, information-intense, and accurate (statistical) outcome prediction model, a score is a clinical decision support tool that facilitates fast applicability and unambiguous communication. Clinicians use such scores as "prediction rules" daily to reduce severe outcomes by modifying therapeutic considerations according to given risks [6,7]. Although clinical judgments remain irreplaceable [5], a score's validity, reliability, and trustworthiness depend on the quality criteria applied during development and adequate validation. Scores can be developed for different scenarios (eg, predicting in-hospital mortality after admission or hospitalization at diagnosis) [7], making their application relevant in different settings.
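Such an additive point system can be sketched in a few lines of code. The predictors, cutoffs, weights, and strata below are hypothetical and chosen purely for illustration; they are not taken from any published COVID-19 score.

```python
# Hypothetical additive point score (illustration only, not a published score).
# Each predictor is mapped to integer points; the sum drives a risk stratum
# that is easy to compute and communicate at the bedside.

def risk_points(age: int, crp: float, spo2: float) -> int:
    """Sum weighted points for age (years), CRP (mg/L), and SpO2 (%)."""
    points = 0
    if age >= 70:
        points += 2
    elif age >= 50:
        points += 1
    if crp >= 100:
        points += 2
    if spo2 < 92:
        points += 2
    return points

def risk_stratum(points: int) -> str:
    """Translate the point total into a communicable risk category."""
    if points <= 1:
        return "low"
    if points <= 3:
        return "intermediate"
    return "high"
```

Under these invented cutoffs, a 75-year-old with a CRP of 120 mg/L and an SpO2 of 90% would accumulate 6 points and fall into the "high" stratum.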
Although numerous predictive models for COVID-19 were published [8][9][10], most are of heterogeneous methodological quality or remain unvalidated. The scores were not universally implemented in everyday clinical care and treatment instructions. The current Infectious Diseases Society of America (IDSA) guideline (05/2023) [11] does not recommend a specific tool for outcome prognosis. The World Health Organization's (WHO) guideline on Therapeutics and COVID-19 (01/2023) [4] reported that reliable tools are needed, especially for using available medication. Although it mentioned the ISARIC's (International Severe Acute Respiratory and emerging Infection Consortium) 4C Mortality Score (4C) [12], the "need for better evidence on prognosis" is emphasized [4]. The WHO's Living Guidance for clinical management of COVID-19 (01/2023) also suggests "clinical judgment […] rather than currently available prediction models" [13]. In summary, evidence for prognostic scores is poor [8][9][10], and the translation into clinical practice remains elusive. At the same time, the need for reliable stratification tools is emphasized in COVID-19 guidelines [4,13].
In our systematic review, we focus on the critical appraisal of predictors and the transferability of clinical scores to support implementation in routine care. We aim to identify scores for daily clinical care, provide an effective overview for decision-makers, and pave the way for future pandemic preparedness.

Systematic Review Question, Inclusion, and Exclusion Criteria
For this systematic review, we identified the COVID-19 prognostic clinical scores developed from the onset of the pandemic. We included original scores designed or modified for the management of COVID-19 based on individual patient data from clinically assumed or confirmed COVID-19 cases. We did not preselect publications on specific patient care levels, timings of predictor measurement, predictor types, or targeted specific outcomes. We excluded regression or other prediction models unsuitable for scoring, predictors based on single observations, scores focusing on specific subpopulations (eg, comorbidities, pharmaceutical trials), and mathematical virus transmission simulations. In the first step, we extracted information from all identified studies (termed "all scores") that fulfilled the primary inclusion criteria (see Table 1). Second, we selected scores for an in-depth analysis (Level 2 [L2]) based on predefined criteria: (I) area under the curve (AUC) ≥ 0.75, (II) a separate validation cohort, (III) training data from a multicenter setting (≥2 centers), and (IV) the result of the score mapped on a point scale (for details see Supplementary Text 1). Only scores fulfilling the L2 criteria were further evaluated for risk of bias (ROB). The other scores were assigned to Level 1 (L1).

Data Sources, Search Strategy, and Data Extraction
We searched MEDLINE and Web of Science on 14 April 2022 and 20 February 2023 using a prespecified search strategy combining domains regarding "COVID-19", "Prediction", "Scoring", and "Validation metrics" (Supplementary Text 2). Our processing was based on the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines [14]. For the extracted information, see Supplementary Text 3. All literature processing tasks, including screening, data extraction, and ROB assessment, were independently performed by two reviewers (K.A., R.G.). In case of disagreement, consensus was reached by discussion.
If not stated otherwise, the unit of analysis was one score per outcome and predictor set.We also provide an overview of external validations identified by an ad hoc search in the same literature retrieval with a reduced set of extraction items.
Extracted AUCs are presented with range or median and interquartile range (IQR); categorical information is reported in absolute numbers and percentages (n (%)).The sample size was evaluated using the (estimated) events per variable (EPV), with low EPVs indicating a higher risk for overfitting (see Supplementary Text 3) [15].
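The EPV heuristic used above reduces to a simple ratio of outcome events to candidate predictors; the cohort figures in the example are invented for illustration.

```python
# Events per variable (EPV): outcome events divided by candidate predictors.
# Low EPV (rules of thumb often cite >= 10-20 as desirable) signals a higher
# risk of overfitting in the developed model.

def events_per_variable(n_events: int, n_candidate_predictors: int) -> float:
    if n_candidate_predictors <= 0:
        raise ValueError("need at least one candidate predictor")
    return n_events / n_candidate_predictors

# A hypothetical cohort: 120 deaths screened over 30 candidate predictors.
epv = events_per_variable(120, 30)  # 4.0, well below common rules of thumb
```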

Outcomes and Categorization of Scores
Based on the identified literature, the following outcomes were present: fatal outcomes (in-hospital mortality, death within specified time intervals), disease severity (classified as composite outcomes, eg, need for mechanical ventilation, intensive care unit [ICU] admission, or death), hospitalization, and the post-COVID condition (PCC). We categorized the scores by the type of outcome and the timing of predictor measurement (Table 2).

Risk of Bias Assessment
Flaws in a study's design, conduct, or analysis methods can cause systematic errors (bias) of effect estimates. The Prediction model Risk Of Bias ASsessment Tool (PROBAST) specifies the adequacy of methods when developing a clinical prediction rule by assessing the ROB within its four subdomains: "participants," "predictors," "outcome," and "analysis." The ratings "low," "unclear," or "high" evaluate the validity of the study and condense to an overall ROB. A "high" rating within at least 1 question or subdomain leads to an overall ROB of "high" [15].
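The condensation rule described above can be expressed as a small function. This is a simplified sketch operating on the four domain-level ratings only; the full tool also rates signaling questions within each domain.

```python
# Simplified sketch of PROBAST's overall risk-of-bias (ROB) condensation:
# any "high" domain makes the overall ROB high; otherwise any "unclear"
# domain makes it unclear; only all-"low" domains yield an overall low ROB.

DOMAINS = ("participants", "predictors", "outcome", "analysis")

def overall_rob(ratings: dict[str, str]) -> str:
    values = [ratings[domain] for domain in DOMAINS]
    if "high" in values:
        return "high"
    if "unclear" in values:
        return "unclear"
    return "low"
```

For instance, a score rated "low" everywhere except an "unclear" outcome definition would receive an overall "unclear" ROB.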

RESULTS
The PRISMA flow chart (Figure 1) shows the literature evaluation procedure. Of 1522 studies extracted from the database, 242 original COVID-19 scores met the primary inclusion criteria, and 49 met the L2 criteria (details for all scores in Supplementary Table 1 and L2 in Supplementary Table 2). Comparative summary statistics matching this section are presented in Table 3.

Data Basis and General Study Characteristics (All Scores)
All studies were published between 2020 and 2023. Most scores were developed based on cohorts with <1000 participants (64.0%) in a retrospective (75.2%) and/or single-center (57.1%) design. Fifty-seven percent of the models were validated in a separate cohort, including random splits, temporal, or geographical (external) validation. The median AUC was 0.83 (IQR = [0.77, 0.90]).
The study populations included hospitalized cases without restriction to specific conditions (65.3%), patients with severe disease (15.3%), or patients admitted to the emergency department (ED) (10.3%). The major timing for prediction was admission to hospital or ED (79.8%). Predicted outcomes (all scores) were mortality (45.0%), severity (as composite endpoints) (47.9%), hospitalization (5.8%), and PCC (1.2%). Among the 188 different predictors (extracted from all scores), age (68.2%) was the most frequently included, followed by C-reactive protein (CRP) (29.8%). This also applied to mortality or severity scores, where the importance of laboratory data, demographics, and physiological information stood out. Hospitalization scores often included age (87.8%) and dyspnea (57.1%) (for the top 20 predictors in each category, see Supplementary Table 3). The number of predictors per score ranged from 2 to 29. Figure 2 shows the frequency of predictor use in relation to the overall AUC of the scores. We also present the predictor domains by score, category, and inclusion level (Figure 3, Supplementary Figures 1 and 2).
The scores predicting PCC primarily used symptom information. None of them met the L2 criterion of AUC ≥ 0.75, so they were not analyzed further.

Risk of Bias
Many studies did not adhere to general guidance for developing predictive models [7,15,16], so that information relevant to their evaluation was unavailable. Most scores raised at least one concern within one of the PROBAST domains, leading to an overall high ROB (low 10.2%, unclear 6.1%, high 83.7%) (Figure 4, Supplementary Table 4). The primary concerns pertained especially to the "analysis" domain, namely, the absence of calibration measures (eg, adaptation of the relation of estimated and observed event probabilities) [17], failure to account for over-optimism (which could be met by, eg, bootstrapping or cross-validation) [15,18], mishandling of missing values (eg, use of complete case analysis instead of imputation methods) [15], and insufficient validation techniques (eg, using random splits instead of geographical [external] validation) [15].

Table 3 .
We present n (%) for categorical information and the median (IQR) for continuous information. The column "All" includes all scores fulfilling the a priori inclusion criteria. In contrast, Level 1 merely includes scores that did not fulfill the selection criteria, and Level 2 only includes the scores fulfilling the criteria (see Methods section). As a result of two granularity levels of data extraction, some information is only available for Level 2 scores. Abbreviations: AUC, area under the curve; COVID, coronavirus disease; ICU, intensive care unit; IQR, interquartile range; PCC, post-COVID-19 condition; SARS-CoV-2, severe acute respiratory syndrome coronavirus 2. a Events per variable (EPV) were estimated using the absolute number of candidate predictors. Some studies did not precisely name the number of candidate predictors. To generate assumptions regarding the sample size, we counted predictors indicated as candidates in tables or texts (signed by "∼" in Supplementary Table 2), even though we acknowledge that using the number of regression coefficients instead is more precise [15].
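One of the remedies named above, bootstrapping against over-optimism (Harrell's optimism correction), can be sketched as follows. To stay self-contained, "model development" is reduced here to a deliberately simple procedure, selecting the single candidate predictor with the best apparent AUC; the essential point is that this whole development procedure is repeated inside every bootstrap sample. Data and procedure are illustrative only.

```python
import random

def auc(scores, labels):
    """Mann-Whitney AUC: probability that a random event outranks a random non-event."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_best_predictor(X, y):
    """Toy 'development': pick the column with the highest apparent AUC.
    This selection step is exactly what creates optimism."""
    return max(range(len(X[0])), key=lambda j: auc([row[j] for row in X], y))

def optimism_corrected_auc(X, y, n_boot=200, seed=0):
    """Harrell's bootstrap optimism correction for the apparent AUC."""
    rng = random.Random(seed)
    apparent = auc([row[fit_best_predictor(X, y)] for row in X], y)
    n, optimism = len(y), 0.0
    for _ in range(n_boot):
        yb = []
        while len(set(yb)) < 2:  # resample until both classes are present
            idx = [rng.randrange(n) for _ in range(n)]
            yb = [y[i] for i in idx]
        Xb = [X[i] for i in idx]
        jb = fit_best_predictor(Xb, yb)          # redevelop on the bootstrap sample
        boot = auc([row[jb] for row in Xb], yb)  # performance where it was developed
        test = auc([row[jb] for row in X], y)    # performance on the original data
        optimism += (boot - test) / n_boot
    return apparent - optimism
```

The mean of `boot - test` estimates how much the apparent AUC flatters the development procedure; subtracting it yields an optimism-corrected performance estimate.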

External Validations
Only a fraction of the scores (n = 33) were validated externally (see Supplementary Table 5). The 4C was replicated most frequently: 37 validations from 20 countries yielded a largely robust median AUC of 0.80 (range: 0.55 to 0.93) for different outcomes. Based on our literature search, most COVID-19 scores remained unvalidated.

DISCUSSION
This systematic review investigated the quantity and quality of clinical scores predicting COVID-19 outcomes. Although numerous scores were developed specifically for this purpose, none were implemented in COVID-19 treatment guidelines [4,11,13] to become part of the clinical routine. Our analysis showed that most scores insufficiently adhered to the quality criteria required to ascertain validity, reliability, and trustworthiness.

Scores Identified With Low or Unclear ROB
Most scores (n = 41) were found to carry a significant ROB due to methodological choices.We identified only 5 scores with low and 3 scores with unclear ROB (Table 4).
The 4C score [12] can be recommended for prognostication of mortality, as it is based on a large, prospective cohort, makes use of widely available predictors (in high-resource settings at hospital admission), and was frequently validated (at least 37 validations in 20 countries) [26]. Although the CCEDRRN COVID-19 Mortality [20] and SEIMC scores [25] were also developed on large cohorts, their development studies are based on retrospective data, and both would profit from additional validations (CCEDRRN: 0 validations; SEIMC: 4 validations, 4 countries). We identified the PRIEST score [23] and the LMIC-PRIEST [21], developed on large, multicenter cohorts, to be potentially suitable for the prognosis of COVID-19 severity. The LMIC-PRIEST might be especially relevant for low- and middle-income countries (LMIC) [21]. Although disease severity seems more relevant nowadays, only a few validation studies exist (PRIEST: 3 validations in 3 countries; LMIC-PRIEST: 0 validations, but external validation within the original study). A broad clinical application should thus wait for further validation. In general, an adjustment to altered frequencies of COVID-19-related deaths and severe disease courses since score development (eg, by immunity and vaccination) should be investigated.

Figure 3 .
Figure 3. The AUC is displayed on the right y-axis. The sorting of the scores is determined by (I) the absolute number of categories and (II) the relative proportion across all scores. The color gradient from green to blue indicates the availability of the category, although in case of doubt, this also depends on the level of care. Similar presentations of scores assigned to categories 2-4 are displayed in the Supplementary Material. Abbreviations: AUC, area under the curve; COVID-19, coronavirus disease 2019.

Predictor Selection, Applicability, and Complexities
Many scores included a wide range of predictors (Figure 2) from different domains (Figure 3). We may assume that data availability often impacted predictor choices more than best practice guidelines recommend [6,27]. This heterogeneity is most likely a result of differences across studies, such as the scope of data sources used, entry criteria for analyzed cohorts, slight differences in endpoints and definitions, and statistical approaches employed. The heterogeneous clinical appearance of COVID-19 and changing vaccination statuses may have added to that heterogeneity. However, the review reveals a common set of predictors used in many scores (eg, age and CRP), whereas others were included in only 1 or very few (eg, nausea or hypotension).
Our results indicate that COVID-19 mortality or severity scores should include age, respiratory conditions, laboratory data, and comorbidities to predict outcomes reliably [10,28]. Pre-hospital scores (eg, predicting hospital admission) primarily use information on comorbidities and sociodemographic information, applicable without diagnostic infrastructure. Overall, symptoms and imaging appeared to play a minor role. Among the 20 most frequently used predictors, 6 (age, sex, diabetes, hypertension [as part of metabolic syndrome], blood urea nitrogen [BUN], creatinine) represent components of baseline assessment for (organ-related) infection outcomes or differential diagnoses, and 12 (CRP, lactate dehydrogenase, oxygen saturation, respiratory rate, neutrophil-lymphocyte ratio, lymphopenia, dyspnea, thrombocytopenia, blood pressure, PaO2/FiO2, temperature, and leukocytes) are accepted markers of overall infection severity/sepsis [29][30][31][32]. In contrast, only 2 (D-dimer, albumin) may not be universally accepted as part of a baseline assessment for moderate to severe respiratory illnesses [33,34]. It is not surprising that studies primarily confirm prior knowledge, since clinical practice is based on existing evidence. This in turn leads to the selection of established markers for patient screening and thus limited availability of markers for score validation. Prospective determination of comprehensive metabolic panels might well lead to more effective models. This observation may partly be attributed to prior knowledge as a key criterion in defining the data sets that, in turn, were used in the analyses. It should be emphasized that this means the most used criteria are generally available during patient care in medium- to high-resource settings. However, given the considerable overlap of predictors of general infection and severity/sepsis, it also suggests that the scores might not add much to existing knowledge on respiratory infection outcomes and are probably not very specific to COVID-19.
Non-routine laboratory indicators such as D-dimer and interleukin-6 restrict score applicability to high-resource settings [10]. However, because D-dimer belonged to the top 10 predictors (Figure 2), further studies are warranted to define the incremental value of such parameters for successful patient management. Overall, laboratory tests for scores (eg, indicators of kidney function or protein metabolism [urea, BUN, creatinine] or indicators of inflammation such as CRP vs leukocytes) may restrict practicability to specific resource and management settings. Not all general practitioners and outpatient departments will perform comprehensive tests based on moderate respiratory symptoms [35]. The association between data availability, care setting, and regional standards is a likely source of bias that may limit transferability. Furthermore, nonconventional or time-dependent predictors such as arrival mode [20] or admission date [22] are less generalizable for validation in most cohorts. Only one score asked for the vaccination status [36], as most scores were developed on data from the early pandemic. Vaccination status may be among the most critical outcome predictors today [37,38].

Limitations of the Evidence Included in the Review
Differences in score development design may lead to varied performance [15]. Notably, we observed substantial variation in sample sizes, settings, and case definitions. Population characteristics, including age [28,39], ethnicity [40], and immunity, influence COVID-19 outcomes. Additionally, we noted differences in preconditions for specific therapies or hospital admissions among countries [41]. Further complicating matters, the comparability of composite outcomes was limited due to the variation in the combinations selected by different study groups. Good performance measures, in combination with small sample sizes or inconsistent reporting of both discrimination and calibration measures, indicate a higher risk for overfitting [15,42]. Regarding "all" scores, high AUCs (78.5% ≥ 0.75) often came together with relatively small sample sizes (64.0% with <1000 patients). Scores should not be applied in clinical practice until validations show generalizability, applicability, and robust performance across various patient characteristics that match regional circumstances [8,42,43].

Comparison to Other Studies
With the abundance of published models and scores, identifying "all" relevant items is demanding. Therefore, complementary approaches are needed. We identified a few reviews on COVID-19 predictions or scores, all focusing on different approaches and yielding a (slightly) different set of models, both overall and in terms of low ROB [8,9,44,45]. The 4C score [12], the PRIEST model [23], and the NEWS2 were repeatedly discussed as favorable prognostic tools.

Limitations of the Review Process
We restricted the detailed analysis to scores that fulfilled predefined criteria; thus, the L2 results refer to a selection not representative of "all" scores. A broader approach, including additional sources, might have revealed further relevant studies. We did not contact the studies' authors for missing information and used a restricted CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) [7] (see Supplementary Text 3) focused on the aspects considered most relevant to our research question. Well-established early warning scores were not within our scope but are reported to have robust performance in validation studies [46].

CONCLUSION
Our study is a comprehensive analysis of COVID-19 scores regarding predictor assessment and applicability. Most scores exhibited a marked ROB and lacked external validation. In future pandemics, data and resource sharing alongside the application of recommended model development and reporting guidelines [6,7,15,16] would improve score quality and visibility, leading to better implementation for the benefit of the patients.
With 3 years of COVID-19 investigation available at the time of data retrieval, we also recognize the absence of reliable scoring systems for the prognosis of PCC. Because outcomes have continuously improved since the first wave of the pandemic, many experts consider that PCC has surpassed severe illness and death as a health hazard. Reliable predictors of poor long-term outcomes would be an asset for decision making and the design of future clinical trials.
In conclusion, none of the numerous scores that have been developed received strong guideline recommendations on an international level. The current consensus is that predictive tools for COVID-19 are helpful but should only support, not replace, physicians' judgment.

Figure 1 .
Figure 1. PRISMA flow chart. *Some studies had no DOI, and some had multiple DOIs. Abbreviations: COVID-19, coronavirus disease 2019; DOI, digital object identifier; PRISMA, Preferred Reporting Items for Systematic reviews and Meta-Analyses.

b Regarding population characteristics, "severe disease" includes ICU patients and patients with respiratory complications, pneumonia, intubation, or other severe conditions. c Multiple options possible. d Level 1 (L1) includes those scores among "all" scores that did not fulfill the Level 2 selection criteria. e Or median with IQR.

A Systematic Review of COVID-19 Prognostic Scores • CID 2024:78 (15 April) • 893

Figure 2 .
Figure 2. Relationship between predictor frequency within all scores (irrespective of category) and AUC of the overall score. The predictors were grouped by predictor type. Only predictors that were integrated at least twice in a score are presented. Abbreviations: AUC, area under the curve; COVID-19, coronavirus disease 2019.

Table 1 . Inclusion and Exclusion Criteria for the Selection of Literature
Abbreviation: COVID-19, coronavirus disease 2019. a A radiological score without further combination with other clinical predictors was considered a single predictor.

Table 4 . Characteristics of Scores With Low or Unclear Risk of Bias (ROB) Rating
Results of the ROB assessment using PROBAST. The selected set of scores and all scores assessed in Level 2 are presented in Supplementary Table 2. Cumulative sample size consists of the development cohort plus VC(s). Outcomes refer to the development cohort or the whole cohort (*) if not otherwise stated. Abbreviations: AUC, area under the curve; BP, blood pressure; BUN, blood urea nitrogen; COVID-19, coronavirus disease 2019; CRP, C-reactive protein; DC, development cohort; ED, emergency department; GCS, Glasgow Coma Scale; GFR, glomerular filtration rate; ICU, intensive care unit; NIV, non-invasive ventilation; NLR, neutrophil-lymphocyte ratio; ROB, risk of bias; RR, respiratory rate; SaO2, oxygen saturation; VC, validation cohort.