Ensemble learning for poor prognosis predictions: A case study on SARS-CoV-2

Abstract

Objective: Risk prediction models are widely used to inform evidence-based clinical decision making. However, few models developed from single cohorts can perform consistently well at the population level, where diverse prognoses exist (as in the SARS-CoV-2 [severe acute respiratory syndrome coronavirus 2] pandemic). This study aims to tackle this challenge by synergizing prediction models from the literature using ensemble learning.

Materials and Methods: We selected and reimplemented 7 prediction models for COVID-19 (coronavirus disease 2019) that were derived from diverse cohorts and used different implementation techniques. A novel ensemble learning framework was proposed to synergize them for realizing personalized predictions for individual patients. Four diverse international cohorts (2 from the United Kingdom and 2 from China; N = 5394) were used to validate all 8 models on discrimination, calibration, and clinical usefulness.

Results: Individual prediction models performed well on some cohorts and poorly on others. Conversely, the ensemble model achieved the best performances consistently on all metrics quantifying discrimination, calibration, and clinical usefulness. Performance disparities were observed between cohorts from the 2 countries: all models achieved better performances on the China cohorts.

Discussion: When individual models were learned from complementary cohorts, the synergized model had the potential to achieve better performances than any individual model. Results indicate that blood parameters and physiological measurements might have better predictive powers when collected early, which remains to be confirmed by further studies.

Conclusions: By combining a diverse set of individual prediction models, the ensemble method can synergize a robust and well-performing model by choosing the most competent ones for individual patients.


INTRODUCTION
Risk prediction models are widely used in clinical practice to inform decision making. [1][2][3] Good models can not only improve health service efficiency, but also predict deterioration 4 in a proactive manner, 5 with great potential to improve outcomes and save lives. Such evidence-based decision-making support is particularly important in an epidemic or pandemic outbreak, not only for informing the treatment or management of those infected, but also for optimizing healthcare services to minimize indirect effects on the most vulnerable service users. For example, the recent severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has caused substantial excess mortality, 6,7 at least partly due to an indirect effect on healthcare systems, leading to a loss of capacity to provide elective and emergency care within the "golden window" of opportunity. [7][8][9] To mitigate excess mortality, more targeted inpatient care in future waves could be informed by (1) better risk prediction and (2) insights from international coronavirus disease 2019 (COVID-19) (we use the terms SARS-CoV-2 and COVID-19 interchangeably) datasets and experience, to validate models and learn from different countries' responses.
There have been numerous prediction models developed for COVID-19, 10-14 but most were derived from small datasets, had low methodological quality, and are unvalidated. 13 In addition, models learned from single cohorts (even from several centers) might not have the predictive power to achieve good performance when a disease spreads to the whole population, leading to greatly diverse prognoses. In this study, we reproduced various prediction models of reasonable quality and synergized them using ensemble learning 15 to assess their collective ability to accurately discriminate mild and severe patients in a diverse set of 4 patient cohorts from the United Kingdom and China with varying patterns of disease severity (Figure 1A). In particular, China and the United Kingdom had very different approaches to hospital admission for COVID-19. In Wuhan, admission was routine, with patients triaged to low-intensity (Fangcang hospitals) 16 or higher-dependency (designated hospitals) settings, whereas in the United Kingdom, admission of patients with more severe disease, or at perceived higher risk of severe disease, was prioritized. These differences enabled us to assess model performance in different settings. For outcomes, we primarily focused on poor prognosis defined by either death or intensive care unit stay. Figure 2 depicts the architecture of this work: synergizing individual models from the literature to prevent excess mortality. For prediction models (Figure 1B), 7 models (Dong, 10 Shi, 17 Gong, 18 Lu, 19 Yan, 20 Xie, 21 and Levy 22 ) were chosen, with different model types and diverse sets of predictors. Derivation cohorts were diverse, originating from 6 regions in 2 countries, with median ages ranging from 44 to 65 years and mortality varying between 7% and 52%. Such diversity provides leverage for synergizing insights from these derivation cohorts to obtain a collective, and hopefully improved, predictive power.

MATERIALS AND METHODS
To synergize models derived from multinational datasets, we used ensemble learning, 15,23 a machine learning methodology that is particularly effective when single models perform well on certain subsets of the data samples but none achieves good overall performance. The rationale is to partition the data samples into groups and choose the most suitable model(s) for each group (eg, giving more weight to models derived from older populations with more severe cases when predicting for a 78-year-old patient with a lymphocyte count of 0.7), so that an optimal overall prediction can be achieved. Figure 1C shows a synthetic and schematic illustration of such a situation. In conventional ensemble learning scenarios, weak predictors are usually trained on subsets of the same dataset. The key difference in this work is that the weak predictors were not trained locally on one particular dataset, but rather were selected from the literature (ie, learned from external datasets to which the ensemble model does not have access) and reimplemented for aggregation.
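To make the per-patient grouping idea concrete, here is a toy sketch; the models, cohort median ages, and the distance-based weight are all invented for illustration and are not the competence framework defined later in this paper.

```python
# Toy illustration: weight each model by how close the patient is to the
# median age of that model's (hypothetical) derivation cohort.

def weight(patient_age, cohort_median_age, scale=20.0):
    """Crude similarity weight: 1 at the cohort median, decaying to 0."""
    return max(0.0, 1.0 - abs(patient_age - cohort_median_age) / scale)

def ensemble_risk(patient_age, models):
    """models: {name: (cohort_median_age, predicted_risk)}."""
    weights = {name: weight(patient_age, med) for name, (med, _) in models.items()}
    total = sum(weights.values()) or 1.0
    return sum(weights[n] * risk for n, (_, risk) in models.items()) / total

# A 78-year-old patient leans entirely on the model from the older cohort.
models = {"young_cohort": (44, 0.10), "older_cohort": (65, 0.45)}
print(round(ensemble_risk(78, models), 3))  # → 0.45
```

In the actual framework, the weight is the competence score defined in the Methods, computed from predictor distributions and derivation cohort sizes rather than from age alone.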
The aggregation approaches used in this study are not stacking (also called stacked generalization), 24 that is, learning a new model that takes the outputs of individual classifiers as inputs. Instead, they are inspired by bagging predictors 25 : aggregating results in a data-independent manner.

Validation and analytics cohorts
The first Wuhan cohort (Wuhan01) consisted of 2869 adults with COVID-19 confirmed by reverse transcriptase polymerase chain reaction, admitted to 1 of 2 hospitals in Wuhan, China (Wuhan Sixth Hospital and Taikang Tongji Hospital) between February 1 and 23, 2020, and who died or were discharged on or before March 29, 2020. The second Wuhan cohort (Wuhan02) consisted of 357 adults with COVID-19 from Tongji Hospital, whose data were collected between January 1 and March 4, 2020; although Wuhan02 came from the same hospital as the derivation cohort of Yan, 20 the 2 were followed up in different periods relative to the surge (Figure 1A.2). Table 1 gives baseline characteristics comparing the poor prognosis or died and the no poor prognosis and did not die subgroups of all 4 cohorts. All cohorts were retrospective and extracted from electronic health records for this study.

Prediction model selection and reimplementation
In May 2020, we conducted a literature search for COVID-19 poor prognosis models. The search and selection process are described in detail in Supplementary Figure S1. Briefly, for prediction models (Figure 1B), we selected COVID-19 prognosis (either death or severity) models that were (1) reproducible (implementable models with all parameters reported); (2) using predictors readily available at community triage at large scale (ie, demographics, underlying conditions, blood tests, and vital signs); and (3) accompanied by sufficient information describing the derivation cohort, including cohort size, interquartile range of age, country/region, follow-up period, and mortality and poor prognosis ratios. Table 2 describes the 7 models, including the outcomes, computational methods, and derivation cohort information (eg, size, region or country, mortality rate, follow-up period). We reimplemented these 7 prediction models by extracting all parameters from their published or preprint manuscripts or public-facing websites. Five different model types were implemented: decision tree, logistic regression, nomogram, scoring, and NOCOS (a customized transparent model). We also extracted derivation cohort sizes, follow-up periods, and distributions of numeric predictors (bloods and vitals). Supplementary Table S1 shows the predictors used by each prediction model and gives the numeric variable distributions of their derivation cohorts. Figure 1B illustrates the timeline of the follow-up periods of all models' derivation cohorts.

Competence assessment framework for model selection
The key to obtaining an effective ensemble model is a good aggregation mechanism that can choose the best-performing model(s) for individual patients so that an overall optimal classification can be achieved. Stacking methods (learning a model from individual classifiers) usually produce better ensembles than bagging (majority vote or weighted majority vote). 23 However, the former requires labeled data to learn the further model, which is not possible in our scenario (ie, using the ensemble model in clinical decision making for managing patients, where labeled data are not available). Therefore, a data-independent approach (like bagging) is required.
For risk prediction models, predictive capacity is underpinned by the patient characteristics of the derivation cohort. For example, given a new patient, models that were trained on (a sufficient number of) similar patients are likely to perform better than those that were not. Conventional bagging methods (majority vote and its variations) are unlikely to work well here, as they cannot capture such similarity and its association with model competence.
We propose a novel bagging mechanism using a competence assessment framework to assist model selection in the aggregation step. The framework is designed to quantify the competence of each model for a given patient data sample. Three factors are considered. The first is familiarity competence, which quantifies the previously mentioned similarity (ie, how familiar a model is with the new patient sample to be predicted). The second is general competence, which can be reflected by the derivation cohort size, as prediction models derived from large cohorts are usually superior to those from smaller ones. The final factor is the data completeness of a patient sample relative to a prediction model. The "absolute" data completeness of our validation cohorts is relatively good, meaning that if a clinical feature is collected at a hospital, most patients tend to have it. However, "relative" completeness (ie, given a prediction model, the percentage of its risk predictors available in the dataset) varies significantly. Model predictive powers are likely to be compromised by such relative incompleteness, which therefore needs to be considered in the framework.
We first specify the calculation of the familiarity competence. Let P = {p_1, ..., p_k} be the set of all numeric predictors, and let dist(m, p) = (m_p, q1_p, q3_p) be the distribution (median, first quartile, and third quartile, respectively) of p in model m's derivation cohort. Given a patient data sample s = {(p, v_p)}, where v_p is the numeric value of predictor p, the familiarity competence of m on p is defined as follows.

Notes for Table 2: Values are median (interquartile range) or n (%). For outcomes, poor prognosis is defined as severities including length of stay, intensive care unit stay, or categories of treatments. For model type, scoring refers to models that calculate a sum from scores predefined for individual predictor values; logistic regression and decision tree refer to models in which these computational models are used; nomogram refers to models represented as a 2-dimensional graphical calculating diagram. a: Customized model.
The final competence calculation is defined by the following formula. The first component divides the familiarity competence by the total number of numeric predictors of the model, incorporating the relative data completeness of s with respect to m. The second component is the general competence based on the size of the model's derivation cohort. Assuming the 2 components are equally important, we calculate the overall competence as their product:

c(m, s) = ( Σ_{p ∈ P_M} fc(m, s, p) / |P_M| ) × ( h(m) / Σ_{m' ∈ M} h(m') ),

where P_M is the set of all numeric predictors of m, fc denotes the familiarity competence, h(m) is the derivation cohort size, and M is the set of all models.
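As a rough sketch of this calculation in code: the inside-the-IQR rule below is an illustrative stand-in for the familiarity competence (not necessarily the exact published formula), the general competence is taken as the model's derivation cohort size as a share of all models' cohort sizes, and the predictor names and cohort sizes are invented.

```python
# Sketch of the competence score. familiarity() is a simple stand-in:
# 1 if the patient's value lies inside the derivation cohort's IQR, else 0.

def familiarity(value, q1, q3):
    return 1.0 if q1 <= value <= q3 else 0.0

def competence(model, sample, all_models):
    """model: {'dist': {predictor: (median, q1, q3)}, 'n': cohort_size}."""
    preds = model["dist"]
    # Dividing by ALL of the model's numeric predictors (not just those
    # available) folds relative data completeness into the score:
    # missing predictors simply contribute zero familiarity.
    fam = sum(familiarity(sample[p], q1, q3)
              for p, (_, q1, q3) in preds.items() if p in sample)
    relative = fam / len(preds)
    general = model["n"] / sum(m["n"] for m in all_models)
    return relative * general

models = [
    {"dist": {"lymphocyte": (1.0, 0.7, 1.4)}, "n": 300},
    {"dist": {"lymphocyte": (0.6, 0.4, 0.9)}, "n": 100},
]
print([competence(m, {"lymphocyte": 0.7}, models) for m in models])  # → [0.75, 0.25]
```

Here both models are familiar with the sample, so the larger derivation cohort dominates the competence score.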

Prediction fusion in ensemble model
Different methods have been proposed in multiple classifier systems 26 to combine individual classifiers to achieve more accurate classifications. Depending on whether further training is used, combination methods can be categorized as trainable vs nontrainable combiners. The former (eg, AdaBoost 27 ) require labeled data in the application domain (ie, where the ensemble model is going to be used). The latter (eg, the majority vote combiner) can be used in a data-independent manner (ie, applicable in new domains without further training). The motivation of this work is to use the ensemble or combined model to inform decision making in care pathways or policy making, where labeled data are not available; therefore, nontrainable combiners were used. A set of fusion methods was implemented. For competence-independent ones, we implemented voting (majority, 1 positive, and 1 negative) and scoring (maximum and average), which are common fusion strategies in ensemble learning. 26 When all models are assessed against the data of a given patient, the competence values can then be used to fuse predictions (probabilities of poor prognosis) from all models. We implemented the following: trust-the-most-competent mode (use the prediction of the model with the highest competence value); wisdom-of-the-crowd mode (use the competence-weighted average of all predictions); and highest-in-top-competent-ones mode (use the maximum probability among the top k competent models [k = 3, 5]). Supplementary Figure S2 gives an illustrative example of the 3 fusion strategies. Wisdom of the crowd performed best in our experiments and was used in this work.
The original model design is another factor that needs to be considered in the prediction fusion. Individual models were designed for predicting different severities: mortality or different definitions of severity. We manually defined a severity score for each model (death models: 1.0; poor prognosis models: 0.3) and combined those scores in the final fusion formula, which considers predictions from all individual models and combines them as a weighted average:

ŷ(s) = [ Σ_{m ∈ M} c(m, s) · S_m · ŷ_m(s) ] / [ Σ_{m ∈ M} c(m, s) · S_m ],

where S_m is the predefined severity score of m and ŷ_m(s) is the probability predicted by m for sample s.
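A minimal sketch of this severity-weighted wisdom-of-the-crowd fusion, assuming the weight of each model is the product of its competence and its severity score (the model names, competences, and probabilities below are illustrative):

```python
# Fuse individual predicted probabilities as a weighted average, with
# weights = competence * predefined severity score (death models 1.0,
# poor prognosis models 0.3, per the text above).

def fuse(predictions, competences, severity):
    """All arguments: {model_name: float}. Returns the fused probability."""
    weights = {m: competences[m] * severity[m] for m in predictions}
    total = sum(weights.values())
    if total == 0:
        # Fall back to a plain average when no model is competent.
        return sum(predictions.values()) / len(predictions)
    return sum(weights[m] * predictions[m] for m in predictions) / total

p = {"death_model": 0.8, "poor_prognosis_model": 0.4}
c = {"death_model": 0.6, "poor_prognosis_model": 0.3}
s = {"death_model": 1.0, "poor_prognosis_model": 0.3}
# weights: 0.6 and 0.09, so the death model dominates the average
print(round(fuse(p, c, s), 3))  # → 0.748
```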

RESULTS
The performances of the prediction models were evaluated on 3 aspects: discrimination (C-index), model calibration, and a number of parameters defining likely clinical utility. For discrimination (Figure 3A), individual models did not perform consistently well across cohorts, whereas the ensemble model did. For clinical usefulness, we focus on decision-making support for admission strategies (ie, who to admit and to where). It is not appropriate to use a fixed probability threshold to validate model performances, as (1) individual models were derived from cohorts with diverse severities and slightly different definitions of poor prognosis and (2) severity in the validation cohorts also varies significantly. Instead, for each validation cohort we compute an event rate (the number of poor prognosis or deceased patients divided by the total number of patients), and for each model we compute a prediction rate (predicted events divided by the total number of patients). We then validate the sensitivity and specificity of a model when its prediction rate is closest to 1.5 times the event rate or a minimum ratio of 0.15, whichever is larger. Figure 3B shows the performances of all models on the 4 cohorts using cohort-specific prediction rates. The ensemble model consistently outperformed individual models across all cohorts on positive predictive value, sensitivity, and specificity. We also observed that prediction rate-based cutoffs led to quite different performances on these metrics, as expected. For example, for Wuhan01, the mortality rate is 2.4%, which is close to the population level. We would therefore expect a good model to have high specificity (the ensemble model achieved 0.88) to correctly reject less severe patients, so that hospital capacity can be reserved for patients likely to deteriorate (without admitting too many mild patients).
Conversely, when the cohort is very severe (eg, Wuhan02), high sensitivity is preferred (ensemble model: 0.96), as we do not want to discharge those who would likely need intensive care.
To quantify how well the ensemble model reclassifies patients, we also calculated net reclassification improvements 28 by comparing the ensemble model with the best individual model on each validation cohort. Table 3 gives the details: the ensemble model achieved net improvements in all cases, with the biggest on Wuhan02 and the smallest on KCH.
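For reference, the two-category net reclassification improvement between binary predictions can be computed as in this sketch; this is the standard NRI definition, and the paper's exact variant may differ in detail.

```python
# Two-category NRI: among events, reward upward reclassification by the
# new model; among nonevents, reward downward reclassification.

def nri(y_true, pred_new, pred_ref):
    """y_true, pred_new, pred_ref: equal-length lists of 0/1 labels."""
    events = [i for i, y in enumerate(y_true) if y == 1]
    nonevents = [i for i, y in enumerate(y_true) if y == 0]
    up_events = sum(pred_new[i] > pred_ref[i] for i in events)
    down_events = sum(pred_new[i] < pred_ref[i] for i in events)
    up_non = sum(pred_new[i] > pred_ref[i] for i in nonevents)
    down_non = sum(pred_new[i] < pred_ref[i] for i in nonevents)
    return (up_events - down_events) / len(events) + \
           (down_non - up_non) / len(nonevents)

# Perfect new model vs a reference that misses one event and flags one nonevent:
print(nri([1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0]))  # → 1.0
```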
We also evaluated the calibration of all models on all 4 cohorts: Figure 3C shows the calibration slope and calibration-in-the-large, and Supplementary Figure S3 depicts the calibration plots. As with the C-index, individual models did not perform consistently well across cohorts. For example, Xie had very good calibration on Wuhan01 but performed poorly on UHB. Again, the ensemble model showed robust performances on all cohorts, with calibrations generally good to very good.

DISCUSSION
This work has shown that single prediction models did not perform consistently well. For example, Dong's C-index on Wuhan02 was the best among the individual models, but it achieved only the fourth-highest C-index on KCH (note that Yan and Shi were not evaluated on Wuhan02, as they were derived from the same hospital's data). Similar situations were observed for other top single models, including Xie and Levy. On the one hand, the challenge of achieving consistent performance in diverse cohorts resides in the fact that COVID-19 prognosis varies with underlying demography (the age and comorbidity of the populations) and with the severity of disease in different settings (because of different admission strategies). For models derived from single cohorts, predictive capacity is limited by the characteristics of the data samples they have seen; therefore, they are unlikely to achieve high performance in external cohorts containing many patients with novel characteristics. On the other hand, ensemble learning methods have the potential to make the best use of all available models. If these models were learned from complementary cohorts, the synergized model has the potential to achieve better performances than any single model by using the most competent ones for individual patients.
Comparing results between the UK cohorts (patients admitted with more severe disease) and the Chinese cohorts (more patients admitted with mild disease), all models consistently performed worse on the UK cohorts. Considering that the individual models used quite diverse predictors, adopted different computational algorithms, and were derived from different regions and countries, the observed poorer performances are likely associated with the United Kingdom's response to the first wave of the COVID-19 surge: the United Kingdom mainly admitted severe patients, aiming to preserve health service capacity. One possible explanation is therefore that blood parameters and physiological measurements have better predictive power when collected early in the disease course.
One limitation of this work was that we were unable to include prediction models learned from European cohorts, particularly from the United Kingdom. Including more local models would probably help the ensemble framework identify predictors that are more predictive in European cohorts, which would in turn improve overall performance on the UK cohorts. In future work, we will create a web platform to allow the community to share models, so that a wide range of diverse and complementary models can be synergized.

CONCLUSION
In this study, we selected and reimplemented 7 prediction models for COVID-19 with diverse derivation cohorts and different implementation techniques. A novel ensemble learning framework was proposed to synergize them for realizing personalized predictions for individual patients. Four international COVID-19 cohorts were used to validate both the individual and ensemble models. Validation results showed that ensemble methods can synergize a robust and well-performing model by choosing the most competent models for individual patients.

AUTHOR CONTRIBUTIONS
HW, HZ, ZI, RD, and BG conceived the study design and developed the study objectives. ZI, HZ, and TS contributed to the statistical analyses. KD provided overall clinical input to the study. HW performed the model reimplementation, ensemble learning, and software development. For King's College Hospital data, DB and JTT were responsible for the data extraction and preparation; JTT, KO, and RZ provided clinical input; and JTT performed data validation. For University Hospitals Birmingham data, AK, VRG, and TV were responsible for data extraction and preparation; FG-S, TW, TV, and GVG provided clinical input and validated the results on University Hospitals Birmingham data. For the Wuhan01 cohort, XW, XZ, XW, and JS extracted the data from the EHR system. HW and HZ preprocessed the raw data and conducted the prediction model validations; BG, HW, HZ, TS, and JS interpreted the data and results. For Wuhan02, YY and KL were responsible for data extraction and preparation; KL and HW conducted the prediction model validations; and YY interpreted the data and results. All authors contributed to the interpretation of the data and critical revision of the manuscript, and approved the final version of the manuscript.

ETHICS APPROVAL AND CONSENT TO PARTICIPATE
The King's College Hospital component of the project operated under London South East Research Ethics Committee (reference 18/LO/2048) approval granted to the King's Electronic Records Research Interface (KERRI); specific work on COVID-19 research was reviewed with expert patient input on a virtual committee with Caldicott Guardian oversight. The University Hospitals Birmingham validation was performed as part of a service evaluation agreed with approval from trust research leads and the Caldicott Guardian. The Wuhan validations were approved by the Research Ethics Committee of Shanghai Dongfang Hospital and Taikang Tongji Hospital.

SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.

ACKNOWLEDGMENTS
This work uses data provided by patients and collected by the National Health Service as part of their care and support. We thank Prof Ye Yan (from Huazhong University of Science and Technology, Wuhan, China) for his support in providing access to the Wuhan02 cohort.

DATA AVAILABILITY
Metadata of the individual prediction models, their reimplementations, the ensemble learning methods, and all validation scripts are available at https://github.com/Honghan/EnsemblePrediction. Details of the validation cohorts are described at https://covid.datahelps.life/.
The Wuhan01 and Wuhan02 datasets used in the study will not be available due to inability to fully anonymize in line with ethical requirements. Applications for research access should be sent to TS and details will be made available via https://covid.datahelps.life/prediction/.
A subset of the KCH dataset limited to anonymizable information (eg, only SNOMED codes and aggregated demographics) is available on request to researchers with suitable training in information governance and human confidentiality protocols subject to approval by the King's College Hospital Information Governance committee; applications for research access should be sent to kch-tr.cogstackrequests@nhs.net. This dataset cannot be released publicly due to the risk of re-identification of such granular individual level data, as determined by the King's College Hospital Caldicott Guardian.
A subset of the University Hospitals Birmingham dataset limited to aggregate anonymized information is available on request to researchers with suitable training in information governance and human confidentiality protocols, subject to approval and data sharing agreements by the University Hospitals Birmingham NHS Foundation Trust.