Fabio Barili, Davide Pacini, Antonio Capo, Olivera Rasovic, Claudio Grossi, Francesco Alamanni, Roberto Di Bartolomeo, Alessandro Parolari, Does EuroSCORE II perform better than its original versions? A multicentre validation study, European Heart Journal, Volume 34, Issue 1, 1 January 2013, Pages 22–29, https://doi.org/10.1093/eurheartj/ehs342
Abstract
The European System for Cardiac Operative Risk Evaluation (EuroSCORE) is widely used for predicting in-hospital mortality after cardiac surgery. A new score (EuroSCORE II) has recently been developed to update the previously released versions. This study was undertaken to validate EuroSCORE II, to compare its performance with the original EuroSCOREs, and to evaluate the effects of removing those factors that were included in the score even though they were statistically non-significant.
Data on 12 325 consecutive patients who underwent major cardiac surgery in a 6-year period were retrieved from three prospective institutional databases. Discriminatory power was assessed using the c-index, and the scores' performances were compared with the DeLong, bootstrap, and Venkatraman methods. Calibration was evaluated with calibration curves and associated statistics.
The in-hospital mortality rate was 2.2%. The discriminatory power was high and similar in all algorithms (area under the curve 0.82, 95% CI: 0.79–0.84 for additive EuroSCORE; 0.82, 95% CI: 0.79–0.84 for logistic EuroSCORE; 0.82, 95% CI: 0.80–0.85 for EuroSCORE II). EuroSCORE II showed fair calibration up to 30% predicted mortality and over-predicted beyond that value. The removal of non-significant factors from EuroSCORE II did not affect performance, as both calibration and discrimination remained comparable.
This validation study demonstrated that EuroSCORE II is a good predictor of perioperative mortality. It showed an optimal calibration until 30%-predicted mortality. Nonetheless, it does not seem to significantly improve the performance of older versions in the higher tertiles of risk. Moreover, it could be simplified, as the removal from the algorithm of non-significant factors does not alter its performance.
See page 10 for the editorial comment on this article (doi:10.1093/eurheartj/ehs343)
Introduction
The European System for Cardiac Operative Risk Evaluation II (EuroSCORE II) is a new tool for the estimation of in-hospital mortality after cardiac surgery, recently launched to update the older additive and logistic EuroSCOREs developed in 1999.1–3 These previous versions rapidly gained wide popularity and are used worldwide for cardiac surgery risk stratification and for the assessment of the quality of cardiac surgical services.4,5 They were also tested and validated for the prediction of perioperative complications and long-term outcomes.6–9 Moreover, the additive and logistic EuroSCOREs have been employed in recent years, together with the STS score, for the screening and selection of high-risk patients eligible for new surgical techniques, such as TAVI and MitraClip®, but the analysis of their performance within these specific surgical subpopulations first underlined a tendency to over-predict the risk of mortality and morbidity.10–12 The lack of calibration has been confirmed by several studies and is considered the result of the changing epidemiology of cardiac surgery and of improvements in surgical techniques and perioperative care.13–16 An algorithm's performance strictly depends on the homogeneity between the study group and the population on which the score was modelled, and a gap of 15 years can explain the poor calibration in contemporary cardiac surgery. Hence, the new EuroSCORE II was developed to improve the score's performance in traditional cardiac surgical procedures, and its internal validation has shown better calibration together with consistently optimal discrimination.1 Nonetheless, these data have not yet been validated in an external population.
The EuroSCORE II algorithm appears to be more complex than the previous versions, although the core of risk factors is almost the same.1 Some definitions are more precise. The symptomatic status has been defined by incorporating the NYHA class and CCS Class 4, while the outdated unstable angina was removed. Renal impairment has been classified according to creatinine clearance, while a wide categorization was used for the definition of ejection fraction, pulmonary artery systolic pressure, and urgency. Moreover, more attention was focused on the weight of surgical procedures, and four classes of procedures replaced the previous definition of ‘other than isolated coronary artery bypass grafting (CABG)’. Looking at the panel of final risk factors of EuroSCORE II, some factors were included in the algorithm although they were not independent predictors of mortality (P > 0.05) in the model development. As an example, the weight of the procedure category ‘one non-CABG’ has a P-value of 0.966 (close to 1). Its 95% confidence interval consequently ranges between −0.28 and 0.29, meaning that ‘one non-CABG’ is compatible, at the same time and with nearly the same weight, with being a risk factor and a protective factor.
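The link between the reported confidence interval and the reported P-value can be checked with a few lines of arithmetic. The sketch below assumes a symmetric Wald interval on the coefficient scale with a critical value of 1.96; the function name `wald_summary` is hypothetical, and because the interval limits are rounded to two decimals the result is only approximate (about 0.97, in line with the reported 0.966).

```python
import math

def wald_summary(ci_low, ci_high, z_crit=1.96):
    """Recover estimate, standard error, z and two-sided P from a 95% Wald CI.

    Assumes the interval is estimate +/- z_crit * SE on the coefficient scale.
    """
    estimate = (ci_low + ci_high) / 2.0
    std_err = (ci_high - ci_low) / (2.0 * z_crit)
    z = estimate / std_err
    # Two-sided P-value under the standard normal: P = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return estimate, std_err, z, p_value

# Interval quoted in the text for the 'one non-CABG' coefficient.
est, se, z, p = wald_summary(-0.28, 0.29)
# On the odds-ratio scale the interval straddles 1 (no effect).
odds_ratio_ci = (math.exp(-0.28), math.exp(0.29))

print(f"coefficient ~ {est:.3f}, SE ~ {se:.3f}, z ~ {z:.2f}, P ~ {p:.2f}")
print(f"odds ratio 95% CI ~ {odds_ratio_ci[0]:.2f} to {odds_ratio_ci[1]:.2f}")
```

An interval that is almost centred on zero, as here, is what a P-value close to 1 looks like on the coefficient scale.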
This study was designed with two main endpoints. It was undertaken to externally validate EuroSCORE II and compare its performance with those of the previous versions. Moreover, we sought to evaluate the discrimination and the calibration of the EuroSCORE II after removing the non-significant risk factors.
Methods
Study population and study design
The study population included all patients who underwent cardiac surgery in a 6-year period (from 2006 to 2011, 12 325 patients enrolled) within the departments of cardiac surgery of two university hospitals and one regional hospital. Transcatheter/percutaneous valve implant procedures were excluded from the study group, as they were not considered in the development of the EuroSCORE II algorithm.
Preoperative and demographic information, operative data and perioperative mortality, and complications for all patients were retrieved from the institutional databases that are prospectively collected. The Institutional Review Boards approved the data set's use for research. The Institutional Ethical Committees approved the study and the requirement for informed written consent was waived on the condition that subjects' identities were masked. Data from the three centres were matched and stored in a dedicated data set.
The EuroSCORE II was primarily developed to predict in-hospital mortality;1 hence, the external validation was performed on the prediction of in-hospital mortality. To evaluate the performance of the three scores, the additive EuroSCORE, logistic EuroSCORE, and the new EuroSCORE II were calculated for each patient in accordance with published guidelines using dedicated software.1–3 Moreover, we sought to evaluate the role of those factors that were included in the algorithm although they were not independent predictors of mortality in the model development. To this end, a modified EuroSCORE II was computed by omitting all these non-significant factors from the regression algorithm.
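As an illustration of what omitting factors from a regression-based score means in practice, the following minimal sketch assumes the standard logistic form (predicted risk = e^lp / (1 + e^lp), with lp the intercept plus the sum of coefficient times factor value) and assumes that the remaining coefficients are left unchanged, which the text does not state explicitly. All factor names and numeric values are placeholders, not the published EuroSCORE II coefficients.

```python
import math

def predicted_mortality(intercept, coefficients, patient):
    """Logistic risk: p = 1 / (1 + exp(-lp)), lp = intercept + sum(beta_i * x_i)."""
    lp = intercept + sum(beta * patient.get(name, 0)
                         for name, beta in coefficients.items())
    return 1.0 / (1.0 + math.exp(-lp))

def drop_factors(coefficients, removed):
    """Return a copy of the coefficient table without the removed factors."""
    return {name: beta for name, beta in coefficients.items()
            if name not in removed}

# Placeholder values for illustration only (hypothetical, not published values).
INTERCEPT = -5.0
COEFFS = {"factor_a": 0.9, "factor_b": 0.6, "nonsignificant_factor": 0.01}
REMOVED = {"nonsignificant_factor"}

patient = {"factor_a": 1, "factor_b": 0, "nonsignificant_factor": 1}
full = predicted_mortality(INTERCEPT, COEFFS, patient)
modified = predicted_mortality(INTERCEPT, drop_factors(COEFFS, REMOVED), patient)
print(f"full model: {full:.4f}, modified model: {modified:.4f}")
```

When the omitted coefficient is close to zero, the two predictions differ only marginally, which is the intuition behind testing a simplified score.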
Data analysis
The performance of the EuroSCORE models was analysed focusing on discrimination power and calibration.17,18 Discrimination indicates the extent to which the model distinguishes between patients who will die and those who will survive in the perioperative period. It was evaluated by constructing receiver operating characteristic (ROC) curves for each model and calculating the area under the curve (AUC) with 95% confidence intervals.19–21 Numerically, an area of 1.0 indicates perfect discrimination, whereas an area of 0.5 indicates no discrimination of the binary outcome. The curves were compared with the DeLong, bootstrap, and Venkatraman methods, the first two comparing the AUCs and the last the ROC curves themselves.21 Another index used to evaluate predictive ability was Somers' Dxy rank correlation between predicted probabilities and observed responses. When Dxy = 0, the model is making random predictions; when Dxy = 1, the predictions are perfectly discriminating.22
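The two discrimination indices can be illustrated on toy data. The sketch below computes the AUC as the proportion of death/survivor pairs in which the patient who died had the higher predicted risk (ties count one half), and derives Somers' Dxy through the identity Dxy = 2 × (AUC − 0.5), which holds for a binary outcome. The function names and the eight-patient data set are hypothetical and are not taken from the study.

```python
def auc_rank(predictions, outcomes):
    """AUC as the probability that a randomly chosen death has a higher
    predicted risk than a randomly chosen survivor (ties count one half)."""
    deaths = [p for p, y in zip(predictions, outcomes) if y == 1]
    survivors = [p for p, y in zip(predictions, outcomes) if y == 0]
    if not deaths or not survivors:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for d in deaths:
        for s in survivors:
            if d > s:
                concordant += 1.0
            elif d == s:
                concordant += 0.5
    return concordant / (len(deaths) * len(survivors))

def somers_dxy(auc):
    """Somers' Dxy for a binary outcome: 0 = random, 1 = perfect."""
    return 2.0 * (auc - 0.5)

# Toy data (hypothetical): predicted risks and observed in-hospital deaths.
preds = [0.01, 0.02, 0.03, 0.05, 0.10, 0.20, 0.40, 0.60]
died = [0, 0, 0, 1, 0, 0, 1, 1]

auc = auc_rank(preds, died)
print(f"AUC = {auc:.3f}, Somers' Dxy = {somers_dxy(auc):.3f}")
```

By the same identity, an AUC of 0.82 corresponds to a Dxy of about 0.64.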
Calibration refers to the agreement between observed outcomes and predictions. For example, 15 in-hospital deaths should be observed in a group of 100 patients with 15% predicted mortality. Calibration can be evaluated by generating calibration plots that visually compare the predicted with the observed probability.17,21–23 The calibration plot is characterized by an intercept, which indicates the extent to which predictions are systematically too low or too high and should be 0, and a calibration slope, which should be 1. Perfectly calibrated predictions lie on the 45-degree line, while a curve below or above the diagonal reflects overestimation and underestimation, respectively. For each model, the comparison of the actual slope and intercept with the ideal values of 1 and 0 was performed with the U statistic and tested against a χ2 distribution with 2 degrees of freedom. Moreover, calibration was tested with the Hosmer–Lemeshow goodness-of-fit test, which compares observed with predicted values by decile of predicted probability.
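One common way to obtain a calibration intercept and slope, assumed here because the text does not spell out the fitting procedure, is to regress the observed outcome on the log-odds of the predicted risk with a logistic model. The sketch below fits that two-parameter model by Newton-Raphson in plain Python; the function names and the grouped toy data are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def calibration_intercept_slope(predictions, outcomes, iterations=25):
    """Fit logit(P(death)) = a + b * logit(predicted risk) by Newton-Raphson.

    Ideal calibration gives a = 0 (intercept) and b = 1 (slope).
    """
    x = [logit(p) for p in predictions]
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, outcomes):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g_a += yi - mu          # gradient w.r.t. intercept
            g_b += (yi - mu) * xi   # gradient w.r.t. slope
            h_aa += w
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = gradient.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def expand(groups):
    """Turn (predicted risk, patients, deaths) triples into patient-level lists."""
    predictions, outcomes = [], []
    for risk, n, deaths in groups:
        predictions += [risk] * n
        outcomes += [1] * deaths + [0] * (n - deaths)
    return predictions, outcomes

# Toy cohorts (hypothetical): observed deaths match, or fall short of, predictions.
well_calibrated = expand([(0.1, 100, 10), (0.3, 100, 30), (0.5, 100, 50)])
over_predicting = expand([(0.2, 100, 10), (0.6, 100, 30)])

print("well calibrated:", calibration_intercept_slope(*well_calibrated))
print("over-predicting:", calibration_intercept_slope(*over_predicting))
```

In the second cohort the predicted risks are twice the observed death rates, and the fitted intercept is negative, which is the signature of systematic over-prediction.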
The accuracy of the models was also tested by calculating the Brier score (the mean squared difference between the predicted probability and the observed outcome across patients), an overall performance measure that is 0 when the prediction is perfect.21–23
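The Brier score is simple enough to write out directly. The sketch below uses a hypothetical cohort of 1000 patients with 22 deaths (a 2.2% event rate) and, as a reference point that is not part of the original analysis, shows the score obtained by assigning every patient the same predicted risk equal to the event rate.

```python
def brier_score(predictions, outcomes):
    """Mean squared difference between predicted risk and observed outcome (0/1)."""
    if not predictions or len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must be non-empty and equal in length")
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)

# Hypothetical cohort: 1000 patients, 22 deaths (2.2% event rate).
outcomes = [1] * 22 + [0] * 978

# Perfect predictions score exactly 0.
perfect = brier_score([float(y) for y in outcomes], outcomes)

# A constant prediction equal to the event rate scores rate * (1 - rate).
constant = brier_score([0.022] * len(outcomes), outcomes)

print(f"perfect: {perfect:.4f}, constant 2.2% prediction: {constant:.4f}")
```

Because deaths are rare, Brier scores in this setting are small by construction, so values are best compared between models evaluated on the same cohort.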
Missing values occurred for the variables ‘chronic pulmonary disease’ (0.23%), ‘extracardiac arteriopathy’ (0.14%), ‘neurological dysfunction disease’ (0.17%), ‘poor mobility’ (0.13%), ‘NYHA class’ (0.25%), ‘LVEF’ (0.07%), and ‘recent myocardial infarction’ (0.09%). Missing values were replaced by means of multiple imputation, as previously described, in order to reduce bias and increase statistical power.22,24
Two-sided statistics were performed with a significance level of 0.05. For all analyses, the R 2.14.0 software was used [R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/].
Results
Table 1 reports the EuroSCORE values of our patient cohort and of the patients used for the development study.1 The mean values of additive EuroSCORE, logistic EuroSCORE, and EuroSCORE II in our patient population were 5.8 ± 3.1, 7.6 ± 8.7, and 2.8 ± 3.9, respectively. It should be noted that there were some differences in in-hospital mortality rates between the validation study and the EuroSCORE II development study, which were consistent with the EuroSCORE II-predicted mortality rates in the two groups; this probably reflects slightly different demographic and clinical data in the two patient groups. Table 2 describes the preoperative and surgical data included in the old and new EuroSCORE models and gives a summary of the surgical procedures performed. The baseline characteristics described in Tables 1 and 2 were not significantly different among the three centres.
Table 1. Baseline statistics of the study population and of the EuroSCORE II group1
| | EuroSCORE II study (16 828 patients) | External validation (12 325 patients) |
|---|---|---|
| EuroSCORE II | 3.9%a | 2.8% ± 3.9% |
| First quartile (25%) | — | 1.04% |
| Median value | — | 1.73% |
| Third quartile (75%) | — | 3.08% |
| EuroSCORE II >10% | — | 426 pts (3.5%) |
| EuroSCORE II >30% | — | 68 pts (0.6%) |
| Additive EuroSCORE | 5.8a | 5.8 ± 3.1 |
| Logistic EuroSCORE | 7.6%a | 7.6% ± 8.7% |
| In-hospital mortality | 3.9% | 2.2% |
aCalculated in the validation data subset of the EuroSCORE II study.1
Table 2. Descriptive statistics of the EuroSCOREs' risk factors and surgical data in the validation study population
| Variable | Additive/logistic EuroSCORE | EuroSCORE II | Modified EuroSCORE II |
|---|---|---|---|
| Preoperative data and comorbidities (%) | |||
| Age (years) | 67.4 ± 11.8 | ||
| Gender (female) | 3885 (31.5) | ||
| Chronic pulmonary diseasea | 763 (6.2) | Removed | |
| Extracardiac arteriopathya | 1426 (11.6) | ||
| Neurological dysfunction diseasea | 81 (0.7) | NIM | |
| Poor mobilitya | NIM | 61 (0.5) | Removed |
| Previous cardiac surgerya | 719 (5.8) | ||
| Serum creatinine >200 μmol/La | 366 (3.0) | NIM | |
| Creatinine clearance 50–85 mL/min | NIM | 8389 (68.1) | |
| Creatinine clearance <50 mL/min | NIM | 1146 (9.3) | |
| Dialysis | NIM | 66 (0.5) | |
| Active endocarditisa | 169 (1.4) | ||
| Critical preoperative statea | 246 (2.0) | ||
| Unstable anginaa | 578 (4.7) | NIM | |
| Diabetes on insulin | NIM | 516 (4.2) | |
| NYHA Class II | NIM | 4060 (32.9) | Removed |
| NYHA Class III | NIM | 1194 (9.7) | |
| NYHA Class IV | NIM | 163 (1.3) | |
| CCS Class IV | NIM | 487 (3.9) | Removed |
| Left ventricular ejection fraction (LVEF) (%) | |||
| LVEF 30–50% | 2433 (19.7) | ||
| LVEF <30% | 502 (4.1) | NIM | |
| LVEF 20–30% | NIM | 478 (3.9) | |
| LVEF <20% | NIM | 24 (0.2) | |
| Recent myocardial infarcta | 1946 (15.8) | Removed | |
| Pulmonary hypertension (%) | |||
| Systolic PA pressure >60 mmHg | 330 (2.7) | NIM | |
| Systolic PA pressure >55 mmHg | NIM | 349 (2.8) | |
| Systolic PA pressure 31–55 mmHg | NIM | 922 (7.5) | Removed |
| Urgencya (%) | |||
| Urgent | NIM | 1033 (8.4) | |
| Emergency | 435 (3.5) | ||
| Salvage | NIM | 11 (0.1) | |
| Surgical data (%) | |||
| Other than isolated CABGa | 7379 (59.9) | NIM | |
| Surgery on thoracic aortaa | 1525 (12.4) | ||
| Post-infarct septal rupturea | 23 (0.2) | NIM | |
| Number of surgical proceduresa (%) | |||
| One non-CABG | NIM | 3892 (31.6) | Removed |
| Two | NIM | 2791 (22.6) | |
| Three or more | NIM | 696 (5.6) | |
| Coronary artery bypass grafting | 6457 (52.4) | ||
| Mitral valve surgery | 3056 (24.8) | ||
| Aortic valve surgery | 4146 (33.6) | ||
| Tricuspid valve surgery | 985 (7.9) | ||
| Surgery for left ventricular aneurysm | 258 (2.1) | ||
| Other major heart procedures | 503 (4.1) | ||
NIM, factor not included in the model; Removed, factor removed in the modified version of EuroSCORE II as not significant in the EuroSCORE II development study.
aAs defined by EuroSCORE algorithms.
Performance of EuroSCORE II
The ROC curves are plotted in Figure 1. The AUC was high in all algorithms, being 0.82 (95% CI: 0.79–0.84) for additive EuroSCORE, 0.82 (95% CI: 0.79–0.84) for logistic EuroSCORE, and 0.82 (95% CI: 0.80–0.85) for EuroSCORE II. The comparison among scores' performances did not show significant differences between EuroSCORE II and the previous original versions (P = 0.78 for EuroSCORE II vs. additive EuroSCORE; P = 0.93 for EuroSCORE II vs. logistic EuroSCORE; P = 0.86 for additive EuroSCORE vs. logistic EuroSCORE) (Table 3).
Table 3. Predictive performance of the EuroSCORE II and its previous versions
| | EuroSCORE II | Logistic EuroSCORE | Additive EuroSCORE |
|---|---|---|---|
| Overall performance | | | |
| Brier score | 0.021 | 0.026 | NA |
| Discrimination | | | |
| AUC (95% CI) | 0.82 (0.80–0.85) | 0.82 (0.79–0.84) | 0.82 (0.79–0.84) |
| DeLong's test P-value | 0.78 (vs. AddES) | 0.86 (vs. AddES) | — |
| | 0.93 (vs. LogES) | — | — |
| Bootstrap method P-value | 0.78 (vs. AddES) | 0.85 (vs. AddES) | — |
| | 0.93 (vs. LogES) | — | — |
| Venkatraman P-value | 0.84 (vs. AddES) | 0.99 (vs. AddES) | — |
| | 0.97 (vs. LogES) | — | — |
| Somers' Dxy | 0.658 | 0.638 | — |
| Calibration | | | |
| Slope | 1.236 | 1.122 | NA |
| Intercept | 0.431 | −1.187 | NA |
| U statistic P-value | 0.00 | 0.00 | NA |
| Hosmer–Lemeshow test P-value | 0.00 | 0.00 | NA |
AddES, additive EuroSCORE; LogES, logistic EuroSCORE; NA, not applicable, as additive EuroSCORE does not calculate predicted mortality.
Best performance for: Brier score = 0, AUC = 1, Somers' Dxy = 1, Slope = 1, Intercept = 0, non-significant P-values of the U statistic, and Hosmer–Lemeshow test.
Figure 1. Receiver operating characteristic curves for the three models. The diagonal line represents no discriminatory power (AUC: 0.50). In the three panels, the receiver operating characteristic curves with their 95% confidence intervals are reported.
The calibration curves of EuroSCORE II and logistic EuroSCORE are shown in Figure 2. A calibration plot and the related statistics cannot be calculated for additive EuroSCORE: it is a simple and user-friendly instrument with an additive property, but it does not itself predict in-hospital mortality. The predicted mortality of additive EuroSCORE is computed through logistic EuroSCORE, as the coefficients of the two scores coincide. The pattern of calibration differed between the two scores. The calibration of EuroSCORE II is close to the ideal diagonal up to 30% predicted probability and diverges significantly and markedly afterward, showing over-prediction. Logistic EuroSCORE shows a progressive trend to over-prediction even at low predicted risk but seems to be better calibrated at high predicted risk. Both scores have significant P-values (P < 0.05) for the related summary statistics (unreliability, Hosmer–Lemeshow test), indicating that they do not provide accurate probabilities (Table 3). The performance of the old and new scores was consistent in each of the three centres included in the study, and no differences were observed over the 6-year period.
Figure 2. Calibration plots of EuroSCORE II and logistic EuroSCORE. The diagonal line represents perfect calibration. For EuroSCORE II, the predictions are well calibrated for patients with a predicted mortality <30%, while the score progressively overpredicts afterward to a level similar to logistic EuroSCORE. Logistic EuroSCORE shows a pattern of constant overprediction (miscalibration in the large, negative intercept).
Evaluation of EuroSCORE II performance, removing non-significant factors
The non-significant factors of EuroSCORE II were NYHA II, CCS4, chronic pulmonary dysfunction, neurological or musculoskeletal dysfunction affecting mobility, recent myocardial infarction, PA systolic pressure between 31 and 55 mmHg, and one non-CABG operation.1 The modified EuroSCORE II (ESII), calculated for each patient excluding these factors, had a mean value of 2.6 ± 3.61, close to the EuroSCORE II mean value. Its discriminatory power was similar to that of EuroSCORE II (AUC: 0.82, 95% CI: 0.79–0.84) and not significantly different (P-value 0.87 with DeLong's test, confirmed by the bootstrap and Venkatraman methods). The calibration plot and the associated statistics were also equivalent, demonstrating that the performance of EuroSCORE II after removing the non-significant factors was very similar (Figure 3).
Figure 3. Comparison of performance between EuroSCORE II (black lines) and its modified version (red lines), obtained by removing non-significant factors. The discrimination was similar and not significantly different (P-values >0.05 for all tests). The calibration was also equivalent, as further shown by the associated statistics.
Discussion
Risk stratification and risk scoring systems in adult cardiac surgery are becoming increasingly important: they provide a reliable estimation of the risks associated with surgical procedures and permit, in some cases, comparison of outcomes among institutions and surgeons by adjusting for varying patient case-mix. In addition, they may ultimately provide a more accurate assessment of the indication for surgery in individual patients by facilitating a more precise balancing of potential risks and benefits.25–30 In recent years, the additive and logistic EuroSCORE models have been widely used as risk prediction tools in adult cardiac surgery, especially in Europe. These models, however, were derived from data collection and the analysis of >13 000 consecutive cases in >100 European centres in 1995, when isolated coronary bypass grafting was by far the most frequently performed adult cardiac surgical procedure.2,3 As both the additive and logistic EuroSCORE risk models were developed quite some time ago, it is to be expected that they may not accurately reflect current cardiac surgical practice, which has changed over the years owing to improved surgical techniques and advances in postoperative patient management. Several recent papers have suggested that these algorithms may no longer be adequate for risk estimation, because they overestimate adult cardiac surgical risk by two- to three-fold.13–16,31,32 Recently, the limitations of the original EuroSCORE models' performance were also highlighted by comparison with the STS score, which predicts the observed mortality more accurately, especially in the highest risk patients. In fact, the STS score is derived from a larger data set of patients operated on in a more recent era, and its risk models were developed separately for three surgical categories (CABG, valves, and CABG plus valves) and included more covariates.30,32
EuroSCORE II was mainly conceived to overcome the performance limitations of its previous versions by reducing the constant high-grade over-prediction that has been widely demonstrated in the literature.1–3 The internal validation performed by its authors after the score's development appeared to miss this main goal, as the very good discrimination was not accompanied by significantly better calibration. Our study confirmed the unsatisfactory calibration of both the old and new versions of EuroSCORE (U-statistic and Hosmer–Lemeshow goodness-of-fit test P-values <0.05), although a different pattern of miscalibration was highlighted (Figure 1). EuroSCORE II has optimal calibration up to 30% predicted probability, whereas it progressively overpredicts beyond that point, producing a calibration line that lies below the diagonal and moves progressively away from it. On the other hand, logistic EuroSCORE shows a roughly constant overprediction of mortality across all classes of risk. In other words, EuroSCORE II performs very well in the first tertile of risk, which represents the vast majority of our study group, but then its calibration dramatically deteriorates, whereas logistic EuroSCORE systematically overpredicts even in patients at low risk. To our knowledge, no other studies on EuroSCORE II are currently available that compare its performance with that of the previously released versions, and the only external validation has been performed in a subpopulation of isolated CABG.33 In that study, performance was reported to be optimal for both discrimination and calibration; however, this result can be explained in part by the composition of the original EuroSCORE II population, in which isolated CABG accounted for 46.7% of the group, so that the score could be more precise in that subgroup. Nonetheless, we also performed this analysis in isolated CABG, confirming in this case a very good calibration and a Brier score closer to 0, but again a U-statistic P-value <0.05, which indicates poor reliability (data not shown).
This suggests that the original limitations of EuroSCORE I have been reduced but not yet resolved in the new version. The poor calibration in high-risk profiles could be a major limitation for risk estimation in candidates for transcatheter valve procedures. Although these new procedures were not included in the EuroSCORE II development and EuroSCORE II has not been validated in these subgroups, it is expected that this new version will be used for the selection of high-risk patients, as has already happened with its previous versions, leading to potential risk over-estimation. An analysis of score performance in large transcatheter valve registries is necessary to validate EuroSCORE II, and the need for a new dedicated score has already been put forward to overcome the potential limitations of existing tools.
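The distinction drawn above between discrimination and calibration can be illustrated with a minimal sketch. It is not the analysis pipeline used in this study: the grouped toy cohort, the risk bands, and the function names (`c_index`, `brier_score`, `calibration_table`) are assumptions introduced only for illustration. The sketch shows that doubling every predicted risk (a monotone over-prediction) leaves the c-index unchanged while the Brier score and the predicted-versus-observed comparison worsen.

```python
from typing import List, Sequence, Tuple


def c_index(outcomes: Sequence[int], predicted: Sequence[float]) -> float:
    """Probability that a randomly chosen death has a higher predicted risk
    than a randomly chosen survivor (ties count one half)."""
    deaths = [p for y, p in zip(outcomes, predicted) if y == 1]
    survivors = [p for y, p in zip(outcomes, predicted) if y == 0]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    concordant = 0.0
    for d in deaths:
        for s in survivors:
            if d > s:
                concordant += 1.0
            elif d == s:
                concordant += 0.5
    return concordant / (len(deaths) * len(survivors))


def brier_score(outcomes: Sequence[int], predicted: Sequence[float]) -> float:
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, predicted)) / len(outcomes)


def calibration_table(
    outcomes: Sequence[int], predicted: Sequence[float], edges: Sequence[float]
) -> List[Tuple[float, float, int]]:
    """For each risk band [edges[i], edges[i+1]) return
    (mean predicted risk, observed mortality, number of patients)."""
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = [(y, p) for y, p in zip(outcomes, predicted) if lo <= p < hi]
        if band:
            n = len(band)
            rows.append((sum(p for _, p in band) / n, sum(y for y, _ in band) / n, n))
    return rows


# Hypothetical toy cohort (invented, not study data): three risk groups in
# which observed mortality equals the predicted risk.
groups = [(0.02, 100, 2), (0.10, 50, 5), (0.40, 20, 8)]  # (risk, patients, deaths)
outcomes: List[int] = []
predicted: List[float] = []
for risk, n_patients, n_deaths in groups:
    outcomes += [1] * n_deaths + [0] * (n_patients - n_deaths)
    predicted += [risk] * n_patients

# A score that ranks patients identically but systematically over-predicts.
overpredicted = [min(1.0, 2.0 * p) for p in predicted]

edges = [0.0, 0.05, 0.30, 1.01]
for label, probs in (("calibrated", predicted), ("over-predicting", overpredicted)):
    print(label, "c-index:", c_index(outcomes, probs), "Brier:", brier_score(outcomes, probs))
    for mean_pred, observed, n in calibration_table(outcomes, probs, edges):
        print("   predicted", mean_pred, "observed", observed, "n", n)
```

Because the c-index depends only on how patients are ranked, it cannot detect the over-prediction; only the calibration measures do, which is why the two properties must be assessed separately.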
A peculiar aspect of the EuroSCORE II model was the inclusion of a high proportion of risk factors that were not significantly associated with the outcome of interest (i.e. mortality) in multivariate regression. The methods for selecting a subset of covariates usually take the significance level into account for both the choice of initial variables and model reduction.22,34,35 Moreover, parsimony in selecting independent regressors is recommended in order to avoid model overfitting, and the analysis can be more consistent and effective if confined to the few variables or pre-selected combinations of variables that are the most powerful predictors, instead of including all the many variables that might be statistically significant.36 A simpler model that predicts an outcome with the same performance as a complex algorithm should always be preferred.37 Hence, forcing non-significant factors (24% of the total) into the final reduced algorithm needs some explanation or rationale, which, to our knowledge, was not provided by the ESII developers. Our results show that the removal of those factors has no impact on the performance of EuroSCORE II, as both calibration and discrimination remain very similar. Their inclusion leads to a more complicated and less user-friendly score without adding any predictive advantage and, in our opinion, should be reconsidered after a further step of external validation of the reduced EuroSCORE II model. On the other hand, one potential drawback in the practical use of a score is variable adherence to the definitions of risk factors, which can lead to very different performances in groups with similar demographic and clinical data, as shown in the literature. Hence, only correct use of the algorithm can bring both simple and complex scores to a higher predictive performance.
The lack of parsimony is not the only challenging issue of the new EuroSCORE. The removal of one major risk factor of logistic EuroSCORE (post-infarction ventricular septal rupture) because of its low incidence in the data set could reflect a limitation in the number of patients enrolled. High-risk conditions with a very low prevalence but substantial perioperative mortality can be detected and evaluated only in a patient population of adequate size, and hence a preliminary evaluation of the sample size needed to achieve sufficient study power would have been advisable.25,38 Considering the available data, we cannot determine whether the absence of post-infarction ventricular septal rupture from EuroSCORE II is due to insufficient study power. Moreover, the renewal of EuroSCORE did not analyse possible regressors that have emerged as highly related to perioperative mortality and that could carry major weight in an updated score.39,40 Despite these criticisms, EuroSCORE II was improved and modernized in some definitions, especially those of kidney function and the weight of surgery. Creatinine clearance has been demonstrated to be a more precise predictor of mortality than serum creatinine and hence was expected to improve risk prediction. Moreover, a more complex classification of operations based on the nature and number of procedures replaced the dichotomization into isolated CABG and all other surgeries, which was considered a major limitation of the original EuroSCORE.
Limitations
A potential limitation of the study is its retrospective nature. Data were derived from three institutional data sets that were prospectively collected. In all data sets, the original EuroSCORE factors had been recorded in dedicated columns together with the scores' values; the logistic and additive EuroSCOREs were therefore recomputed for each patient to check the correctness of the values. The new EuroSCORE II factors were derived from the data sets, creatinine clearance was computed as suggested, and categorizations were recalculated from continuous data.
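As an illustration of how a renal category can be recalculated from continuous data, the following minimal sketch assumes that creatinine clearance is estimated with the Cockcroft–Gault formula and categorized with cut-offs of 50 and 85 mL/min, as we understand the EuroSCORE II definitions. The formula, the cut-offs, and the function names are assumptions stated here for illustration and should be checked against the official EuroSCORE II definitions before any use.

```python
def cockcroft_gault(age_years: float, weight_kg: float,
                    serum_creatinine_mg_dl: float, female: bool) -> float:
    """Estimated creatinine clearance (mL/min) by the Cockcroft-Gault formula:
    (140 - age) * weight / (72 * serum creatinine), multiplied by 0.85 in women."""
    if serum_creatinine_mg_dl <= 0:
        raise ValueError("serum creatinine must be positive")
    clearance = (140.0 - age_years) * weight_kg / (72.0 * serum_creatinine_mg_dl)
    return clearance * 0.85 if female else clearance


def renal_category(clearance_ml_min: float, on_dialysis: bool = False) -> str:
    """Map creatinine clearance to a renal impairment category.
    Cut-offs (assumed, see lead-in): >85 normal, 50-85 moderate, <50 severe;
    dialysis is its own category regardless of clearance."""
    if on_dialysis:
        return "dialysis"
    if clearance_ml_min > 85.0:
        return "normal"
    if clearance_ml_min >= 50.0:
        return "moderate"
    return "severe"


# Hypothetical patient: 70-year-old man, 80 kg, serum creatinine 1.0 mg/dL.
example_clearance = cockcroft_gault(70, 80, 1.0, female=False)
example_category = renal_category(example_clearance)
print(round(example_clearance, 1), example_category)
```

Keeping the continuous clearance and deriving the category in a separate step makes it straightforward to re-categorize an existing data set if a definition changes.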
Conclusions
This large validation study demonstrated that the new EuroSCORE II predicts perioperative mortality with more than satisfactory performance. It showed optimal calibration up to 30% predicted mortality, a range that covers a large proportion of the patients. Nonetheless, it does not seem to significantly improve on the performance of the additive and logistic EuroSCOREs in the higher tertiles of risk. Moreover, the removal of non-significant factors did not alter the performance of EuroSCORE II, demonstrating that it can be further simplified. Taken together, these data suggest that ESII might benefit from a simplified variable list; moreover, it could be advisable to test some new predictors, which might improve its performance in the high-risk patient population.
Authors' contributions
F.B., D.P., and A.P. conceived the project and drafted the article. F.B. checked the data sets and performed the statistical analysis. A.C., O.R., C.G., R.D.B., and F.A. managed the three institutional data sets and calculated and checked the scores for the analysis. All authors participated in the conception of the manuscript, drafted and revised the article, and gave their final approval to the text.
Conflict of interest: none declared.


