Systematic review and meta-analysis of mortality risk prediction models in adult cardiac surgery

Abstract

OBJECTIVES
The most used mortality risk prediction models in cardiac surgery are the European System for Cardiac Operative Risk Evaluation (ES) and Society of Thoracic Surgeons (STS) score. There is no agreement on which score should be considered more accurate nor which score should be utilized in each population subgroup. We sought to provide a thorough quantitative assessment of these 2 models.

METHODS
We performed a systematic literature review and captured information on discrimination, as quantified by the area under the receiver operating characteristic curve (AUC), and calibration, as quantified by the ratio of observed-to-expected mortality (O:E). We performed random effects meta-analysis of the performance of the individual models as well as pairwise comparisons and subgroup analysis by procedure type, time and continent.

RESULTS
The ES2 {AUC 0.783 [95% confidence interval (CI) 0.765–0.800]; O:E 1.102 (95% CI 0.943–1.289)} and STS [AUC 0.757 (95% CI 0.727–0.785); O:E 1.111 (95% CI 0.853–1.447)] showed good overall discrimination and calibration. There was no significant difference in the discrimination of the 2 models (difference in AUC −0.016; 95% CI −0.034 to −0.002; P = 0.09). However, the calibration of ES2 showed significant geographical variations (P < 0.001) and a trend towards miscalibration with time (P = 0.057). This was not seen with STS.

CONCLUSIONS
ES2 and STS are reliable predictors of short-term mortality following adult cardiac surgery in the populations from which they were derived. STS may have broader applications when comparing outcomes across continents as compared to ES2.

REGISTRATION
Prospero (https://www.crd.york.ac.uk/PROSPERO/) CRD42020220983.


INTRODUCTION
Cardiac surgery carries an inherent risk of perioperative mortality and morbidity. This varies considerably depending on the patients' characteristics, baseline pathology and planned surgical intervention. Prediction models have been created [1–6] to quantify this risk. These models are utilized when counselling patients, when discussing patients within the multi-disciplinary team, for benchmarking performance and, more recently, in guidelines for the management of aortic stenosis and for deciding between surgical and transcatheter treatments [7,8]. Present models predominantly quantify the risk of death in the short term. The most cited models are the European System for Cardiac Operative Risk Evaluation (ES) [1,2,9] and the Society of Thoracic Surgeons (STS) score [10,11].
There is no guidance at present on which is the optimum score to utilize in a given clinical or research setting, and concerns have arisen regarding the degree of applicability of a specific model to a localized population given the heterogeneous populations from which they were originally derived. This leaves clinicians with the difficult decision of choosing which model to utilize when reporting and comparing outcomes. The relative performance of these models is thus the focus of this systematic review. We aim to build on previous work by using dedicated statistical methods to evaluate the comparative discrimination and calibration of the ES2 and STS not only in the wider cardiac surgery spectrum but also as they are applied to specific subgroups of the population. We believe that this is the most thorough comparison of these models.

METHODS
The data and scripts that support the findings of this study are available from the corresponding author upon reasonable request.

Systematic review
We report on the original papers and subsequent external validations available and draw comparisons between the models' discriminatory power, as defined by the area under the receiver operating characteristic curve (AUC) or C-statistic, and their calibration, as defined by the ratio of the observed-to-expected mortality (O:E) within 30 days of the operation or the same hospital admission. Longer-term follow-up data were not included in the analysis to allow parity among studies and with the originally published papers on STS and ES2. A systematic literature review and meta-analysis of the above findings followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses [12] and Meta-analysis Of Observational Studies in Epidemiology principles [13].
Our librarian conducted a literature search, restricting articles to those translatable into English and referencing adults only, using the described search string (Supplementary Material, Table S1). We also hand-searched the reference lists of papers identified but did not contact the authors. Excluded papers and the rationale for exclusion have been noted (Fig. 1 and Supplementary Material, Table S2). If studies performed subgroup analysis such that the AUC or predicted mortality was not available for the whole dataset, then the subgroups were treated as independent populations. Institutes reporting on multiple occasions but utilizing different populations of patients were also treated as independent populations. The search was last updated on 29 October 2020. Papers were screened and data extracted independently by 3 reviewers (SS/AD/LD). Outliers and studies with a high risk of bias were included in the primary analysis following discussion between 2 authors (SS/UB). SS/UB had full access to all the data in the study and take responsibility for its integrity and the data analysis. The data extraction items were based on the CHARMS checklist [14] and the risk of bias was assessed using the PROBAST tool [15,16].
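As an illustration of the calibration measure defined above, the following minimal Python sketch computes an O:E ratio with an approximate 95% CI. It is not the authors' code. It assumes that the expected number of deaths is treated as fixed and that the observed count is binomial, so that the standard error of ln(O:E) is approximately sqrt((1 − O/N)/O); the function name `oe_ratio` and the cohort numbers are hypothetical.

```python
import math


def oe_ratio(observed_deaths, expected_deaths, n_patients, z=1.96):
    """Observed-to-expected mortality ratio with an approximate CI.

    Assumption: the expected count is fixed and the observed count is
    binomial, so SE(ln O:E) ~ sqrt((1 - O/N) / O) (delta method).
    """
    if observed_deaths <= 0 or expected_deaths <= 0:
        raise ValueError("observed and expected deaths must be positive")
    oe = observed_deaths / expected_deaths
    p_obs = observed_deaths / n_patients
    se_log_oe = math.sqrt((1.0 - p_obs) / observed_deaths)
    # The interval is built on the log scale and back-transformed.
    lower = math.exp(math.log(oe) - z * se_log_oe)
    upper = math.exp(math.log(oe) + z * se_log_oe)
    return oe, lower, upper


# Hypothetical cohort: 1000 patients, 30 observed deaths, 25 predicted deaths.
oe, lower, upper = oe_ratio(30, 25.0, 1000)
print(f"O:E = {oe:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An O:E above 1 indicates that the model under-predicts mortality in that cohort, and a value below 1 indicates over-prediction.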

Statistical analysis
Data were extracted as frequency and percentage for categorical variables and mean and standard deviation for continuous variables. The outcomes were AUC and O:E. Two separate analyses were conducted. First, we reviewed each score in turn and provided pooled estimates of AUC and O:E for comparison in accordance with previously published guidance [16–18]. It was assumed that variation in these parameters across studies was prone to between-study heterogeneity, due to the varied casemix of populations studied, and thus, a random effects model was utilized [17]. The standard error of the AUC was calculated using Newcombe Method 4 [19]:

$$\mathrm{SE}(\hat{c}) = \sqrt{\frac{\hat{c}\,(1-\hat{c})\left[1 + n^{*}\,\dfrac{1-\hat{c}}{2-\hat{c}} + m^{*}\,\dfrac{\hat{c}}{1+\hat{c}}\right]}{mn}}$$

where ĉ is the estimated AUC, n is the number of observed events, m is the number of non-events and m* = n* = [1/2 (m + n)] − 1. Analysis was conducted using R (version 4.0.3). Meta-analysis models were formed using R-package 'metamisc' [17] and 'metafor' [20] and results displayed as forest plots. We reported 95% prediction interval (PI), which takes into account the between-study heterogeneity [17].
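The calculation described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' R code. It assumes that AUCs are pooled on the logit scale (with the standard error transformed by the delta method), swaps in the DerSimonian-Laird estimator of between-study variance for simplicity, and uses a normal quantile for the prediction interval, which will be too narrow when few studies are pooled. All function names and study values are hypothetical.

```python
import math


def se_auc_newcombe(c_hat, n_events, m_nonevents):
    """Standard error of an AUC (Newcombe Method 4, as given in the text)."""
    n_star = m_star = 0.5 * (m_nonevents + n_events) - 1.0
    numerator = c_hat * (1.0 - c_hat) * (
        1.0
        + n_star * (1.0 - c_hat) / (2.0 - c_hat)
        + m_star * c_hat / (1.0 + c_hat)
    )
    return math.sqrt(numerator / (m_nonevents * n_events))


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def pool_random_effects(y, se):
    """DerSimonian-Laird random-effects pooling (assumed estimator)."""
    k = len(y)
    v = [s * s for s in se]
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_mu = math.sqrt(1.0 / sum(w_re))
    return mu, se_mu, tau2


# Hypothetical studies: (reported AUC, deaths, survivors).
studies = [(0.80, 50, 950), (0.76, 120, 2880), (0.72, 30, 470)]

y, se = [], []
for auc, n, m in studies:
    se_c = se_auc_newcombe(auc, n, m)
    y.append(logit(auc))
    se.append(se_c / (auc * (1.0 - auc)))  # delta method: SE on logit scale

mu, se_mu, tau2 = pool_random_effects(y, se)
z = 1.96
pooled_auc = inv_logit(mu)
ci = (inv_logit(mu - z * se_mu), inv_logit(mu + z * se_mu))
# Approximate prediction interval: adds between-study variance to the
# uncertainty of the mean (normal quantile used instead of a t quantile).
half_width = z * math.sqrt(tau2 + se_mu ** 2)
pi = (inv_logit(mu - half_width), inv_logit(mu + half_width))

print(f"Pooled AUC {pooled_auc:.3f}, 95% CI {ci[0]:.3f}-{ci[1]:.3f}, "
      f"approx. 95% PI {pi[0]:.3f}-{pi[1]:.3f}")
```

The prediction interval is at least as wide as the CI because it describes where the AUC of a new validation study is expected to fall, not only the uncertainty of the pooled mean.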
Second, for studies reporting ES2 and STS, we established pooled estimates of discrimination (AUC) and calibration (O:E) for each model and compared the confidence intervals (CIs). The lack of overlap in CIs indicated a marked difference in performance. The differences in AUCs and standard error of the difference in AUCs [6,21] were calculated per paper and utilized in a meta-analysis with the 'metafor' [20] package.
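The paired comparison can be illustrated with the following Python sketch. It is not the authors' implementation. It assumes that the two AUCs in a study come from the same patients, so the standard error of their difference includes a correlation term r; the value of r, the sign convention (ES2 minus STS), the DerSimonian-Laird pooling and all study values are hypothetical choices made for illustration.

```python
import math


def se_auc_difference(se_a, se_b, r):
    """SE of the difference of two correlated AUCs from the same patients."""
    return math.sqrt(se_a ** 2 + se_b ** 2 - 2.0 * r * se_a * se_b)


def pool_dl(y, se):
    """DerSimonian-Laird random-effects pooling (assumed estimator)."""
    v = [s * s for s in se]
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (len(y) - 1)) / c)
    w_re = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return mu, math.sqrt(1.0 / sum(w_re))


ASSUMED_R = 0.5  # hypothetical correlation between the two AUC estimates

# Hypothetical studies: (AUC ES2, SE ES2, AUC STS, SE STS).
studies = [
    (0.80, 0.030, 0.78, 0.032),
    (0.76, 0.020, 0.77, 0.021),
    (0.74, 0.040, 0.70, 0.041),
]

diffs, ses = [], []
for auc_es2, se_es2, auc_sts, se_sts in studies:
    diffs.append(auc_es2 - auc_sts)
    ses.append(se_auc_difference(se_es2, se_sts, ASSUMED_R))

pooled_diff, pooled_se = pool_dl(diffs, ses)
z = pooled_diff / pooled_se
p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P value
ci = (pooled_diff - 1.96 * pooled_se, pooled_diff + 1.96 * pooled_se)

print(f"Pooled difference {pooled_diff:.3f} "
      f"(95% CI {ci[0]:.3f} to {ci[1]:.3f}), P = {p_value:.2f}")
```

Ignoring the correlation (r = 0) inflates the standard error of each difference, which makes a paired comparison look less precise than it is.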
We also conducted stratified analysis by operation, continent and time. All ES2 papers were published after 2011; however, we separated the papers into studies solely reporting on patients operated on in or after 2010 ('post-2010') and those that contained data on patients operated on prior to 2010 ('pre-2010'), on whom the authors had retrospectively calculated the ES2. We repeated the main comparisons stratifying by risk of bias (Supplementary Material, Figs. S1-S4). The presence of small-study effects was assessed by visual inspection of the funnel plots (Supplementary Material, Figs. S5 and S6). Statistical heterogeneity was tested using Cochran's Q test, and the extent of statistical inconsistency was measured with I², which describes the percentage of the variability in effect estimates that is due to heterogeneity rather than sampling error (chance).
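The heterogeneity statistics named above can be sketched in Python as follows. This is a minimal illustration using inverse-variance weights, with I² = max(0, (Q − df)/Q) × 100; the function name and the input values are hypothetical, and the P value for Q (from a chi-square distribution with k − 1 degrees of freedom) is omitted.

```python
def cochran_q_i2(estimates, std_errors):
    """Return Cochran's Q, its degrees of freedom and I^2 (in percent)."""
    weights = [1.0 / (se * se) for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total_weight
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # I^2 is truncated at zero when Q is no larger than its expectation (df).
    i2 = 100.0 * (q - df) / q if q > df else 0.0
    return q, df, i2


# Hypothetical ln(O:E) estimates and standard errors from four studies.
log_oe = [0.10, -0.05, 0.40, 0.20]
se_log_oe = [0.10, 0.12, 0.15, 0.08]

q, df, i2 = cochran_q_i2(log_oe, se_log_oe)
print(f"Q = {q:.2f} on {df} df, I^2 = {i2:.1f}%")
```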

RESULTS
We found that ES2 calibration varied significantly between continents (P < 0.0001). [...] (Fig. S8). There was statistical evidence of an association between AUC and O:E and the type of operation (P < 0.0001), largely driven by 1 mitral study (Table 3). [...] (Table 2). There was a statistically significant association between AUC and the continent of the study (P = 0.03; Table 4). There were no significant differences in STS score between continents or over time.

Society of Thoracic
European System for Cardiac Operative Risk Evaluation 2 versus Society of Thoracic Surgeons in comparative studies. There was no difference in discrimination between ES2

DISCUSSION
We compared the performance of the 2 most used mortality prediction models in adult cardiac surgery, the ES2 and STS scores, using measures of discrimination (AUC) and calibration (O:E). Discrimination is a model's ability to differentiate between those likely and unlikely to experience an event in each population. Calibration describes how closely the predicted risk of an event agrees with its observed frequency. Both should be optimized to have a truly efficient model. Our results build on findings from 3 previous meta-analyses [6,22,23] by providing a dedicated statistical technique to quantitatively assess calibration in addition to discrimination and by performing extended subgroup analysis.
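The distinction between the two measures can be shown with a small Python sketch on made-up data. It computes the AUC as the proportion of concordant event/non-event pairs and the O:E as observed deaths divided by the sum of predicted risks. The patient values are hypothetical and chosen only to show that halving every predicted risk leaves discrimination unchanged while calibration deteriorates.

```python
def auc(predicted_risk, died):
    """AUC as the share of event/non-event pairs ranked correctly."""
    events = [p for p, y in zip(predicted_risk, died) if y == 1]
    nonevents = [p for p, y in zip(predicted_risk, died) if y == 0]
    concordant = 0.0
    for p_event in events:
        for p_nonevent in nonevents:
            if p_event > p_nonevent:
                concordant += 1.0
            elif p_event == p_nonevent:
                concordant += 0.5  # ties count as half
    return concordant / (len(events) * len(nonevents))


def observed_to_expected(predicted_risk, died):
    """O:E ratio: observed deaths over the sum of predicted risks."""
    return sum(died) / sum(predicted_risk)


# Hypothetical cohort of 8 patients (predicted risk, outcome).
risk = [0.05, 0.10, 0.10, 0.15, 0.30, 0.25, 0.35, 0.70]
died = [0, 0, 0, 0, 1, 0, 0, 1]

# A second hypothetical model that ranks patients identically
# but predicts half the risk for everyone.
halved_risk = [p / 2 for p in risk]

print("original:", round(auc(risk, died), 3),
      round(observed_to_expected(risk, died), 2))
print("halved:  ", round(auc(halved_risk, died), 3),
      round(observed_to_expected(halved_risk, died), 2))
```

Because the AUC depends only on the ranking of patients, it cannot detect systematic over- or under-prediction, which is why both measures are reported.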
The most notable finding of our study was that whilst the ES2 and STS performed well across the whole population, there was significant variation in the performance of ES2 between continents. It was shown to work well in the continent from which it was derived (i.e. Europe) but over-predicted risk in NA and NZ and under-predicted risk in SA. The availability of the coefficients for ES2 in the public domain may explain why this is more widely reported and there are substantially more papers from Europe. There was a tendency of ES2 to under-predict risk in papers with patients operated on solely after 2010.
However, the STS score showed good and stable performance in all continents and across both time periods studied. The STS score regression coefficients are not in the public domain and it utilizes far more variables to provide procedure-specific outcome calculations of morbidity and mortality. Consequently, the STS score performance was reported far less frequently. A key difference in the models is that STS is recalibrated annually to ensure the O:E ratio remains around 1 [10,11].
Analysis of papers providing direct comparisons of calibration of the 2 models suggested a non-significant difference between them. The same predominance of European papers was not seen here and this may account for the discrepancy in our findings. It would have been interesting to evaluate the calibration of these models using the calibration slope or calibration-in-the-large; however, these are often not reported. The Hosmer-Lemeshow statistic is one of the most widely reported statistics regarding model calibration but does not lend itself to statistical comparison between studies.
Over time, the risk profile of patients has increased but operative mortality has decreased, and ES has been shown to suffer from poor calibration, especially in those at highest risk [69–73]. The lack of availability of individual patient-level data limited our ability to analyse differential model performance in high- and low-risk populations. Further review of these population subgroups would be of clinical importance.
Clinicians need to balance the superior performance of the STS with the relative parsimony and ease of use of ES2. Our findings suggest that ES2 and STS can be used in the populations

Limitations
Bias may have been introduced into the study as we only reviewed articles in English. Abstracts and unpublished works could not be included, which may have resulted in publication bias. Small-study effects and significant heterogeneity could not be negated despite performing meta-regression, subgroup and sensitivity analyses. We were only able to compare studies from which the AUC and O:E ratios could be derived, and a large study [74] was excluded for this reason. Reclassification metrics have been shown to be a good estimate of model discrimination [75]; however, they were not reported in these studies and the lack of individual patient-level data made their derivation impossible.
The ES2 and STS calibration demonstrated statistically significant differences by type of operation, which was driven by a single study on mitral operations. Most studies evaluated either a mixed population, aortic valve replacements ± CABG or isolated CABG. There were few studies with dedicated performance measures on mitral valve, aortic or off-pump CABG and so the utility of these scoring systems in these subgroups could not be evaluated accurately. With the increasing number of 'prophylactic' aortic aneurysm operations being conducted and the emergence of transcatheter mitral interventions, the validation of existing risk prediction models in these populations will become increasingly relevant. Some interventional cardiologists have reported the use of these scoring systems in the prediction of risk in their patients and this is partially reflected in the latest guidelines [7]. We did not review the accuracy of these models in patients undergoing interventional procedures and so cannot comment on their applicability in this setting.

CONCLUSIONS
The results of this meta-analysis validate the use of either ES2 or STS in the prediction of mortality following adult cardiac surgery, especially in the continent from which they were derived. Both scores show good discrimination throughout the populations studied. The STS may be better calibrated when evaluating outcomes across European and North American centres. Future research should focus on analysis of large databases of individual patient-level data to corroborate these findings.

SUPPLEMENTARY MATERIAL
Supplementary material is available at ICVTS online.

ACKNOWLEDGEMENT
We would like to thank Ms. Joanna Hooper (librarian) for conducting the literature search.

Funding
This work was supported by the Bristol Biomedical Research Centre (NIHR Bristol BRC).