Abstract

Aims

The European System for Cardiac Operative Risk Evaluation (EuroSCORE) is widely used for predicting in-hospital mortality after cardiac surgery. A new score (EuroSCORE II) has recently been developed to update the previously released versions. This study was undertaken to validate EuroSCORE II, to compare its performance with the original EuroSCOREs, and to evaluate the effects of removing those factors that were included in the score despite being statistically non-significant.

Methods and results

Data on 12 325 consecutive patients who underwent major cardiac surgery in a 6-year period were retrieved from three prospective institutional databases. Discriminatory power was assessed using the c-index, and the scores' performances were compared with the DeLong, bootstrap, and Venkatraman methods. Calibration was evaluated with calibration curves and associated statistics.

The in-hospital mortality rate was 2.2%. The discriminatory power was high and similar in all algorithms (area under the curve 0.82, 95% CI: 0.79–0.84 for additive EuroSCORE; 0.82, 95% CI: 0.79–0.84 for logistic EuroSCORE; 0.82, 95% CI: 0.80–0.85 for EuroSCORE II). EuroSCORE II had a fair calibration up to 30% predicted mortality and over-predicted beyond that value. The removal of non-significant factors from EuroSCORE II did not affect performance, as both calibration and discrimination were comparable.

Conclusion

This validation study demonstrated that EuroSCORE II is a good predictor of perioperative mortality. It showed an optimal calibration until 30%-predicted mortality. Nonetheless, it does not seem to significantly improve the performance of older versions in the higher tertiles of risk. Moreover, it could be simplified, as the removal from the algorithm of non-significant factors does not alter its performance.

See page 10 for the editorial comment on this article (doi:10.1093/eurheartj/ehs343)

Introduction

The European System for Cardiac Operative Risk Evaluation II (EuroSCORE II) is a new tool for estimating in-hospital mortality after cardiac surgery, recently launched to update the older additive and logistic EuroSCOREs developed in 1999.1–3 These previous versions rapidly gained wide popularity and are used worldwide for risk stratification in cardiac surgery and for assessing the quality of cardiac surgical services.4,5 They were also tested and validated for the prediction of perioperative complications and long-term outcomes.6–9 Moreover, the additive and logistic EuroSCOREs have been employed in recent years, together with the STS score, for the screening and selection of high-risk patients eligible for new surgical techniques, such as TAVI and MitraClip®, but the analysis of their performance within these specific surgical subpopulations first underlined a tendency to over-predict the risk of mortality and morbidity.10–12 This lack of calibration has been confirmed by several studies and is considered the result of the changing epidemiology of cardiac surgery and of improvements in surgical techniques and perioperative care.13–16 An algorithm's performance strictly depends on the homogeneity between the study group and the population from which the score was modelled, and a gap of 15 years can explain the poor calibration in contemporary cardiac surgery. Hence, the new EuroSCORE II was developed to improve the score's performance in traditional cardiac surgical procedures, and its internal validation has shown better calibration together with consistently optimal discrimination.1 Nonetheless, these data have not yet been validated in an external population.

The EuroSCORE II algorithm appears to be more complex than the previous versions, although the core of risk factors is almost the same.1 Some definitions are more precise. Symptomatic status is now defined by incorporating the NYHA class and CCS Class 4, while the outdated unstable angina was removed. Renal impairment is classified according to creatinine clearance, and a broader categorization is used for ejection fraction, pulmonary artery systolic pressure, and urgency. Moreover, more attention was paid to the weight of surgical procedures, and four classes of procedures replaced the previous definition of ‘other than isolated coronary artery bypass grafting (CABG)’. An analysis of the final panel of EuroSCORE II risk factors shows that some factors were included in the algorithm although they were not independent predictors of mortality (P > 0.05) in the model development. As an example, the weight of the procedure category ‘one non-CABG’ has a P-value of 0.966 (close to 1). Its 95% confidence interval consequently ranges between −0.28 and 0.29, meaning that the data are compatible with ‘one non-CABG’ being either a risk factor or a protective factor of nearly equal magnitude.
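The link between a confidence interval that straddles zero and a P-value close to 1 can be made concrete with a short sketch. It assumes, purely for illustration, that the reported interval is a symmetric Wald interval on the regression coefficient, so that the estimate is its midpoint; the function name `wald_from_ci` is hypothetical. Because the published limits are rounded to two decimals, the recovered P-value is only approximately equal to the reported 0.966.

```python
import math


def wald_from_ci(lower, upper, z_crit=1.959964):
    """Recover estimate, standard error, z and two-sided P-value
    from a symmetric 95% Wald confidence interval (an assumption)."""
    estimate = (lower + upper) / 2.0          # midpoint of the interval
    se = (upper - lower) / (2.0 * z_crit)     # half-width divided by 1.96
    z = estimate / se
    # Two-sided P-value under the standard normal: erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return estimate, se, z, p_value


# Rounded limits quoted in the text for the 'one non-CABG' coefficient
estimate, se, z, p_value = wald_from_ci(-0.28, 0.29)
print(f"estimate={estimate:.3f}  SE={se:.3f}  z={z:.3f}  P={p_value:.2f}")
```

An estimate this close to zero relative to its standard error gives a z statistic near 0, which is why the P-value approaches 1.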

This study was designed with two main endpoints. It was undertaken to externally validate EuroSCORE II and compare its performance with those of the previous versions. Moreover, we sought to evaluate the discrimination and the calibration of the EuroSCORE II after removing the non-significant risk factors.

Methods

Study population and study design

The study population included all patients who underwent cardiac surgery in a 6-year period (from 2006 to 2011, 12 325 patients enrolled) within the departments of cardiac surgery of two university hospitals and one regional hospital. Transcatheter/percutaneous valve implant procedures were excluded from the study group, as they were not considered in the development of the EuroSCORE II algorithm.

Preoperative and demographic information, operative data and perioperative mortality, and complications for all patients were retrieved from the institutional databases that are prospectively collected. The Institutional Review Boards approved the data set's use for research. The Institutional Ethical Committees approved the study and the requirement for informed written consent was waived on the condition that subjects' identities were masked. Data from the three centres were matched and stored in a dedicated data set.

EuroSCORE II was primarily developed to predict in-hospital mortality,1 hence the external validation was performed on the prediction of in-hospital mortality. To evaluate the performance of the three scores, additive EuroSCORE, logistic EuroSCORE, and the new EuroSCORE II were calculated for each patient in accordance with published guidelines using dedicated software.1–3 Moreover, we sought to evaluate the role of those factors that were included in the algorithm although they were not independent predictors of mortality in the model development; to this end, a modified EuroSCORE II was computed by omitting all these non-significant factors from the regression algorithm.
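How a logistic score and its modified version relate can be sketched as follows. The intercept, coefficient values, and variable names below are hypothetical placeholders, not the published EuroSCORE II coefficients, and the sketch assumes that the modified score simply drops the non-significant terms without re-estimating the remaining ones, which the text does not state explicitly.

```python
import math

# HYPOTHETICAL values for illustration only; NOT the published coefficients.
INTERCEPT = -5.0
COEFFICIENTS = {
    "nyha_class_2": 0.10,          # treated here as non-significant
    "nyha_class_3": 0.30,
    "one_non_cabg": 0.01,          # treated here as non-significant
    "two_procedures": 0.55,
    "critical_preop_state": 1.10,
}
NON_SIGNIFICANT = frozenset({"nyha_class_2", "one_non_cabg"})


def predicted_mortality(patient, dropped=frozenset()):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = INTERCEPT
    for name, beta in COEFFICIENTS.items():
        if name in dropped:
            continue                      # term omitted in the modified score
        linear += beta * patient.get(name, 0)
    return 1.0 / (1.0 + math.exp(-linear))


# A patient carrying only the two factors that are dropped
patient = {"nyha_class_2": 1, "one_non_cabg": 1}
full = predicted_mortality(patient)
modified = predicted_mortality(patient, dropped=NON_SIGNIFICANT)
print(f"full score: {full:.4f}   modified score: {modified:.4f}")
```

With small coefficients such as these, dropping the terms changes the predicted probability only marginally, which is the behaviour the modified score is designed to test.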

Data analysis

The performance of the EuroSCORE models was analysed focusing on discrimination power and calibration.17,18 Discrimination indicates the extent to which the model distinguishes between patients who will die and those who will survive in the perioperative period. It was evaluated by constructing receiver operating characteristic (ROC) curves for each model and calculating the area under the curve (AUC) with 95% confidence intervals.19–21 Numerically, an area of 1.0 indicates perfect discrimination, whereas an area of 0.5 indicates no discrimination of the binary outcome. The curves were compared with the DeLong, bootstrap, and Venkatraman methods, the first two comparing the AUCs and the last the ROC curves themselves.21 Another index used to evaluate predictive ability was Somers' Dxy rank correlation between predicted probabilities and observed responses. When Dxy = 0, the model is making random predictions; when Dxy = 1, the predictions are perfectly discriminating.22
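The two discrimination indices described above can be illustrated with a minimal sketch on toy data (the eight patients below are invented for the example). The AUC is computed as the proportion of death/survivor pairs in which the patient who died had the higher predicted risk, and Somers' Dxy for a binary outcome follows from the identity Dxy = 2 × AUC − 1.

```python
def auc_mann_whitney(predictions, outcomes):
    """AUC as the share of (death, survivor) pairs ranked correctly;
    tied predictions count as half a concordant pair."""
    deaths = [p for p, y in zip(predictions, outcomes) if y == 1]
    survivors = [p for p, y in zip(predictions, outcomes) if y == 0]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    concordant = 0.0
    for d in deaths:
        for s in survivors:
            if d > s:
                concordant += 1.0
            elif d == s:
                concordant += 0.5
    return concordant / (len(deaths) * len(survivors))


def somers_dxy(auc):
    """Somers' Dxy for a binary outcome: 0 = random, 1 = perfect."""
    return 2.0 * auc - 1.0


# Toy data: predicted mortality and observed in-hospital death (1 = died)
predictions = [0.01, 0.02, 0.03, 0.05, 0.10, 0.20, 0.40, 0.60]
outcomes = [0, 0, 0, 1, 0, 0, 1, 1]

auc = auc_mann_whitney(predictions, outcomes)
dxy = somers_dxy(auc)
print(f"AUC = {auc:.3f}, Dxy = {dxy:.3f}")
```

The pairwise loop is quadratic in the number of patients; it is used here for transparency, and a rank-based formula would be preferable for a cohort of several thousand patients.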

Calibration refers to the agreement between observed outcomes and predictions. For example, 15 in-hospital deaths should be observed in a group of 100 patients with 15% predicted mortality. Calibration can be evaluated by generating calibration plots that visually compare the predicted with the observed probability.17,21–23 The calibration plot is characterized by an intercept, which indicates the extent to which predictions are systematically too low or too high and should ideally be 0, and a calibration slope, which should ideally be 1. Perfectly calibrated predictions lie on the 45-degree line, while a curve below or above the diagonal reflects overestimation and underestimation, respectively. For each model, the actual slope and intercept were compared with the ideal values of 1 and 0 using the U statistic, tested against a χ2 distribution with 2 degrees of freedom. Moreover, calibration was tested with the Hosmer–Lemeshow goodness-of-fit test, which compares observed with predicted values by decile of predicted probability.
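The decile-based comparison underlying the Hosmer–Lemeshow test can be sketched as follows. This is an illustrative implementation under stated assumptions, not the routine used in the study: patients are split into ten equal-sized groups by predicted risk, the statistic is referred to a χ2 distribution with (groups − 2) degrees of freedom (the conventional choice; other choices are sometimes made for external validation), and the toy cohort is constructed so that observed deaths match expected deaths in every group.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable for even df,
    using the closed form exp(-x/2) * sum_{k<df/2} (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


def hosmer_lemeshow(predictions, outcomes, groups=10):
    """Hosmer-Lemeshow statistic over equal-sized groups of predicted risk."""
    pairs = sorted(zip(predictions, outcomes))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected deaths in the group
        observed = sum(y for _, y in chunk)   # observed deaths in the group
        mean_p = expected / size
        variance = size * mean_p * (1.0 - mean_p)
        if variance > 0:
            statistic += (observed - expected) ** 2 / variance
    df = groups - 2
    return statistic, df, chi2_sf_even_df(statistic, df)


# Toy cohort: 10 risk strata of 20 patients; stratum g has predicted risk
# 0.05 * (g + 1) and exactly g + 1 deaths, i.e. perfect calibration.
predictions, outcomes = [], []
for g in range(10):
    predictions += [0.05 * (g + 1)] * 20
    outcomes += [1] * (g + 1) + [0] * (20 - (g + 1))

statistic, df, p_value = hosmer_lemeshow(predictions, outcomes)
print(f"HL statistic = {statistic:.4f}, df = {df}, P = {p_value:.3f}")
```

A significant P-value indicates that observed and predicted deaths disagree in at least some groups; in very large cohorts even small departures can reach significance.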

The accuracy of the models was also tested by calculating the Brier score (the mean squared difference between the predicted probability and the observed outcome across patients), an overall performance measure that is 0 when the prediction is perfect.21–23
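A minimal sketch of the Brier score follows. The function name is illustrative, and the reference value at the end is an added calculation, not a result of the study: it shows the score obtained by a constant prediction equal to the event rate, using the 2.2% in-hospital mortality reported for this cohort.

```python
def brier_score(predictions, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    if not predictions or len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must be non-empty and equal in length")
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)


# Perfect predictions give a Brier score of 0
perfect = brier_score([0.0, 1.0, 0.0], [0, 1, 0])

# Predicting 0.5 for everyone gives 0.25 regardless of the outcome
coin_flip = brier_score([0.5, 0.5], [0, 1])

# Reference: a constant prediction equal to the event rate scores
# rate * (1 - rate); with a 2.2% event rate this is about 0.0215.
event_rate = 0.022
non_informative = event_rate * (1.0 - event_rate)

print(perfect, coin_flip, round(non_informative, 4))
```

Because the score depends on the event rate, Brier scores at low mortality are necessarily small and are best read against this kind of reference value.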

Missing values occurred for the variables ‘chronic pulmonary disease’ (0.23%), ‘extracardiac arteriopathy’ (0.14%), ‘neurological dysfunction disease’ (0.17%), ‘poor mobility’ (0.13%), ‘NYHA class’ (0.25%), ‘LVEF’ (0.07%), and ‘recent myocardial infarction’ (0.09%). Missing values were replaced by means of multiple imputation, as previously described, in order to reduce bias and increase statistical power.22,24

Two-sided statistics were performed with a significance level of 0.05. For all analyses, the R 2.14.0 software was used [R Development Core Team (2011). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/].
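The bootstrap comparison of two AUCs named in the data analysis can be sketched as below. This is one common formulation, assumed here for illustration rather than taken from the study: patients are resampled with replacement, both scores are evaluated on the same resample (the data are paired), and the observed AUC difference is divided by the standard deviation of the bootstrap differences and referred to the standard normal distribution. The toy scores and the function names are hypothetical.

```python
import math
import random


def auc(scores, outcomes):
    """Rank-based AUC: share of death/survivor pairs ranked correctly."""
    deaths = [s for s, y in zip(scores, outcomes) if y == 1]
    survivors = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum(1.0 if d > s else 0.5 if d == s else 0.0
               for d in deaths for s in survivors)
    return wins / (len(deaths) * len(survivors))


def bootstrap_auc_test(score_a, score_b, outcomes, n_boot=500, seed=1):
    """Two-sided P-value for AUC(a) - AUC(b) on paired data."""
    rng = random.Random(seed)
    n = len(outcomes)
    observed = auc(score_a, outcomes) - auc(score_b, outcomes)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [outcomes[i] for i in idx]
        if sum(y) in (0, n):
            continue  # resample lacks deaths or survivors; draw again
        diffs.append(auc([score_a[i] for i in idx], y)
                     - auc([score_b[i] for i in idx], y))
    mean = sum(diffs) / n_boot
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n_boot - 1))
    if sd == 0.0:
        # No variability across resamples (e.g. identical scores)
        return observed, (1.0 if observed == 0.0 else 0.0)
    z = observed / sd
    return observed, math.erfc(abs(z) / math.sqrt(2.0))


# Toy data: 30 survivors followed by 10 deaths
outcomes = [0] * 30 + [1] * 10
score_a = [i / 40 for i in range(40)]               # separates perfectly
score_b = [((i * 7) % 40) / 40 for i in range(40)]  # scrambled ordering

difference, p_value = bootstrap_auc_test(score_a, score_b, outcomes)
print(f"AUC difference = {difference:.3f}, P = {p_value:.4f}")
```

Fixing the random seed makes the resampling reproducible; in practice a larger number of resamples than the 500 used here is usually preferred.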

Results

Table 1 reports the EuroSCORE values of our patient cohort and of the patients used for the development study.1 The mean values of additive EuroSCORE, logistic EuroSCORE, and EuroSCORE II in our patient population were 5.8 ± 3.1, 7.6 ± 8.7, and 2.8 ± 3.9, respectively. It should be noted that there were some differences in in-hospital mortality rates between the validation study and the EuroSCORE II development study that were consistent with the EuroSCORE II-predicted mortality rates in the two groups; this probably reflects slightly different demographic and clinical data in the two patient groups. Table 2 describes the preoperative and surgical data included in the old and new EuroSCORE models and a summary of the surgical procedures performed. Baseline characteristics described in Tables 1 and 2 were not significantly different among the three centres.

Table 1

Baseline statistics of the study population and of the EuroSCORE II group1

| | EuroSCORE II study (16 828 patients) | External validation (12 325 patients) |
|---|---|---|
| EuroSCORE II | 3.9%ᵃ | 2.8% ± 3.9% |
| First quartile (25%) | | 1.04% |
| Median value | | 1.73% |
| Third quartile (75%) | | 3.08% |
| EuroSCORE II >10% | | 426 pts (3.5%) |
| EuroSCORE II >30% | | 68 pts (0.6%) |
| Additive EuroSCORE | 5.8ᵃ | 5.8 ± 3.1 |
| Logistic EuroSCORE | 7.6%ᵃ | 7.6% ± 8.7% |
| In-hospital mortality | 3.9% | 2.2% |

ᵃ Calculated in the validation data subset of the EuroSCORE II study.1

Table 2

Descriptive statistics of the EuroSCOREs' risk factors and surgical data in the validation study population

| Variable | Additive/logistic EuroSCORE | EuroSCORE II | Modified EuroSCORE II |
|---|---|---|---|
| Preoperative data and comorbidities (%) | | | |
| Age (years) | 67.4 ± 11.8 | 67.4 ± 11.8 | 67.4 ± 11.8 |
| Gender (female) | 3885 (31.5) | 3885 (31.5) | 3885 (31.5) |
| Chronic pulmonary diseaseᵃ | 763 (6.2) | 763 (6.2) | Removed |
| Extracardiac arteriopathyᵃ | 1426 (11.6) | 1426 (11.6) | 1426 (11.6) |
| Neurological dysfunction diseaseᵃ | 81 (0.7) | NIM | NIM |
| Poor mobilityᵃ | NIM | 61 (0.5) | Removed |
| Previous cardiac surgeryᵃ | 719 (5.8) | 719 (5.8) | 719 (5.8) |
| Serum creatinine >200 μmol/Lᵃ | 366 (3.0) | NIM | NIM |
| Creatinine clearance 50–85 mL/min | NIM | 8389 (68.1) | 8389 (68.1) |
| Creatinine clearance <50 mL/min | NIM | 1146 (9.3) | 1146 (9.3) |
| Dialysis | NIM | 66 (0.5) | 66 (0.5) |
| Active endocarditisᵃ | 169 (1.4) | 169 (1.4) | 169 (1.4) |
| Critical preoperative stateᵃ | 246 (2.0) | 246 (2.0) | 246 (2.0) |
| Unstable anginaᵃ | 578 (4.7) | NIM | NIM |
| Diabetes on insulin | NIM | 516 (4.2) | 516 (4.2) |
| NYHA Class II | NIM | 4060 (32.9) | Removed |
| NYHA Class III | NIM | 1194 (9.7) | 1194 (9.7) |
| NYHA Class IV | NIM | 163 (1.3) | 163 (1.3) |
| CCS Class IV | NIM | 487 (3.9) | Removed |
| Left ventricular ejection fraction (LVEF) (%) | | | |
| LVEF 30–50% | 2433 (19.7) | 2433 (19.7) | 2433 (19.7) |
| LVEF <30% | 502 (4.1) | NIM | NIM |
| LVEF 20–30% | NIM | 478 (3.9) | 478 (3.9) |
| LVEF <20% | NIM | 24 (0.2) | 24 (0.2) |
| Recent myocardial infarctᵃ | 1946 (15.8) | 1946 (15.8) | Removed |
| Pulmonary hypertension (%) | | | |
| Systolic PA pressure >60 mmHg | 330 (2.7) | NIM | NIM |
| Systolic PA pressure >55 mmHg | NIM | 349 (2.8) | 349 (2.8) |
| Systolic PA pressure 31–55 mmHg | NIM | 922 (7.5) | Removed |
| Urgencyᵃ (%) | | | |
| Urgent | NIM | 1033 (8.4) | 1033 (8.4) |
| Emergency | 435 (3.5) | 435 (3.5) | 435 (3.5) |
| Salvage | NIM | 11 (0.1) | 11 (0.1) |
| Surgical data (%) | | | |
| Other than isolated CABGᵃ | 7379 (59.9) | NIM | NIM |
| Surgery on thoracic aortaᵃ | 1525 (12.4) | 1525 (12.4) | 1525 (12.4) |
| Post-infarct septal ruptureᵃ | 23 (0.2) | NIM | NIM |
| Number of surgical proceduresᵃ (%) | | | |
| One non-CABG | NIM | 3892 (31.6) | Removed |
| Two | NIM | 2791 (22.6) | 2791 (22.6) |
| Three or more | NIM | 696 (5.6) | 696 (5.6) |

| Surgical procedures performed | n (%) |
|---|---|
| Coronary artery bypass grafting | 6457 (52.4) |
| Mitral valve surgery | 3056 (24.8) |
| Aortic valve surgery | 4146 (33.6) |
| Tricuspid valve surgery | 985 (7.9) |
| Surgery for left ventricular aneurysm | 258 (2.1) |
| Other major heart procedures | 503 (4.1) |

NIM, factor not included in the model; Removed, factor removed in the modified version of EuroSCORE II as not significant in the EuroSCORE II development study.

ᵃ As defined by EuroSCORE algorithms.

Performance of EuroSCORE II

The ROC curves are plotted in Figure 1. The AUC was high in all algorithms, being 0.82 (95% CI: 0.79–0.84) for additive EuroSCORE, 0.82 (95% CI: 0.79–0.84) for logistic EuroSCORE, and 0.82 (95% CI: 0.80–0.85) for EuroSCORE II. The comparison among scores' performances did not show significant differences between EuroSCORE II and the previous original versions (P = 0.78 for EuroSCORE II vs. additive EuroSCORE; P = 0.93 for EuroSCORE II vs. logistic EuroSCORE; P = 0.86 for additive EuroSCORE vs. logistic EuroSCORE) (Table 3).

Table 3

Predictive performance of the EuroSCORE II and its previous versions

| | EuroSCORE II | Logistic EuroSCORE | Additive EuroSCORE |
|---|---|---|---|
| Overall performance | | | |
| Brier score | 0.021 | 0.026 | NA |
| Discrimination | | | |
| AUC (95% CI) | 0.82 (0.80–0.85) | 0.82 (0.79–0.84) | 0.82 (0.79–0.84) |
| DeLong's test P-value | 0.78 (vs. AddES); 0.93 (vs. LogES) | 0.86 (vs. AddES) | |
| Bootstrap method P-value | 0.78 (vs. AddES); 0.93 (vs. LogES) | 0.85 (vs. AddES) | |
| Venkatraman P-value | 0.84 (vs. AddES); 0.97 (vs. LogES) | 0.99 (vs. AddES) | |
| Somers' Dxy | 0.658 | 0.638 | |
| Calibration | | | |
| Slope | 1.236 | 1.122 | NA |
| Intercept | 0.431 | −1.187 | NA |
| U statistic P-value | 0.00 | 0.00 | NA |
| Hosmer–Lemeshow test P-value | 0.00 | 0.00 | NA |

AddES, additive EuroSCORE; LogES, logistic EuroSCORE; NA, not applicable, as additive EuroSCORE does not calculate predicted mortality.

Best performance for: Brier score = 0, AUC = 1, Somers' Dxy = 1, Slope = 1, Intercept = 0, non-significant P-values of the U statistic and Hosmer–Lemeshow test.

Figure 1

Receiver operating characteristic curves for the three models. The diagonal line represents no discriminatory power (AUC: 0.50). In the three panels, the receiver operating characteristic curves with their 95% confidence intervals are reported.

The calibration curves of EuroSCORE II and logistic EuroSCORE are shown in Figure 2. A calibration plot and the related statistics cannot be calculated for additive EuroSCORE: it is a simple, user-friendly additive instrument, but it does not itself predict in-hospital mortality. The predicted mortality corresponding to additive EuroSCORE is computed through logistic EuroSCORE, with the coefficients of the two scores being coincident. The pattern of calibration differed between the two scores. The calibration of EuroSCORE II is close to the ideal diagonal up to a predicted probability of 30% and diverges significantly and markedly afterwards, showing over-prediction. Logistic EuroSCORE shows a progressive trend towards over-prediction even at low predicted risk but seems to be better calibrated at high predicted risk. Both scores have significant P-values (P < 0.05) for the related summary statistics (unreliability, Hosmer–Lemeshow test), indicating that they do not provide accurate probabilities (Table 3). The performance of the old and new scores was consistent in each of the three centres included in the study, and no differences were observed over the 6-year period.

Figure 2

Calibration plots of EuroSCORE II and logistic EuroSCORE. The diagonal line represents the perfect calibration. For EuroSCORE II, the predictions are well calibrated for patients with a predicted mortality <30%, while the score progressively overpredicts afterward to a level similar to logistic EuroSCORE. Logistic EuroSCORE shows a pattern of constant overprediction (miscalibration in the large, negative intercept).
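Calibration in the large can be illustrated with a simple observed-to-expected (O/E) ratio. The calculation below is an added worked example, not an analysis reported by the study: it uses the rounded cohort-level figures from Table 1 (observed in-hospital mortality 2.2%, mean EuroSCORE II 2.8%, mean logistic EuroSCORE 7.6%), so the ratios are approximate.

```python
def observed_expected_ratio(observed_rate, mean_predicted):
    """O/E ratio: values below 1 indicate overall over-prediction."""
    if mean_predicted <= 0:
        raise ValueError("mean predicted risk must be positive")
    return observed_rate / mean_predicted


observed = 0.022                      # in-hospital mortality (Table 1)
oe_euroscore2 = observed_expected_ratio(observed, 0.028)
oe_logistic = observed_expected_ratio(observed, 0.076)

# Approximate expected versus observed deaths among 12 325 patients
n_patients = 12325
expected_es2 = n_patients * 0.028
expected_log = n_patients * 0.076
observed_deaths = n_patients * observed

print(f"O/E EuroSCORE II = {oe_euroscore2:.2f}, O/E logistic = {oe_logistic:.2f}")
print(f"deaths: observed ~{observed_deaths:.0f}, "
      f"expected ~{expected_es2:.0f} (EuroSCORE II), ~{expected_log:.0f} (logistic)")
```

An O/E ratio summarizes the whole cohort and cannot show where along the risk range the miscalibration occurs, which is what the calibration plots add.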

Evaluation of EuroSCORE II performance, removing non-significant factors

The non-significant factors of EuroSCORE II were NYHA II, CCS4, chronic pulmonary dysfunction, neurological or musculoskeletal dysfunction affecting mobility, recent myocardial infarction, PA systolic pressure between 31 and 55 mmHg, and one non-CABG operation.1 The modified EuroSCORE II (ESII), calculated for each patient excluding these factors, had a mean value of 2.6 ± 3.61, close to the EuroSCORE II mean value. Its discriminatory power was similar to that of EuroSCORE II (AUC: 0.82, 95% CI: 0.79–0.84) and not significantly different (P-value 0.87 with DeLong's test, confirmed by the bootstrap and Venkatraman methods). The calibration plot and the associated statistics were also equivalent, demonstrating that the performance of EuroSCORE II after removing the non-significant factors was very similar (Figure 3).

Figure 3

Comparison of performance between EuroSCORE II (black lines) and its modified version (red lines), obtained by removing non-significant factors. The discrimination was similar and not significantly different (P-values >0.05 for all tests). The calibration was also equivalent, as further shown by the associated statistics.

Discussion

Risk stratification and risk scoring systems in adult cardiac surgery are becoming increasingly important as they provide reliable estimation of the risks associated with surgical procedures, and they permit, in some cases, comparison of outcomes among institutions and surgeons by adjusting for varying patient case-mix; in addition, they ultimately may provide a more accurate assessment of the indication for surgery in individual patients by facilitating a more precise balancing of potential risks and benefits.25–30 In recent years, the additive and logistic EuroSCORE models have been widely used as risk prediction tools in adult cardiac surgery, especially in Europe. These models, however, were derived from the collection and analysis of data on >13 000 consecutive cases in >100 European centres in 1995, when isolated coronary bypass grafting was by far the most frequently performed adult cardiac surgical procedure.2,3 As both the additive and logistic EuroSCORE risk models were developed quite some time ago, it is to be expected that they may not accurately reflect current cardiac surgical practice, which has changed over the years owing to improved surgical techniques and advances in postoperative patient management. Several recent papers have suggested that these algorithms may no longer be adequate for risk estimation, owing to an overestimation of adult cardiac surgical risk in the range of two- to three-fold.13–16,31,32 Recently, the limitations of the original EuroSCORE models' performance were also highlighted by the comparison with the STS score, which predicts the observed mortality more accurately, especially in the highest risk patients. In fact, the STS score is derived from a larger data set of patients operated on in a more recent era, and its risk models were developed separately for three different surgical categories (CABG, valves, and CABG plus valves) and included more covariates.30,32

EuroSCORE II was mainly conceived to overcome the performance limitations of its previous versions, reducing the constant high-grade over-prediction that was widely demonstrated in the literature.1–3 The internal validation performed by the authors after the score's development appeared to miss this main goal, as the very good discrimination was not associated with a significantly better calibration. Our study confirmed the unsatisfactory calibration of the old and new versions of EuroSCORE (U-statistic and Hosmer–Lemeshow goodness-of-fit test P-values <0.05), although a different pattern of miscalibration was highlighted (Figure 2). EuroSCORE II has an optimal calibration up to a predicted probability of 30%, whereas it progressively overpredicts afterwards, leading to a calibration line that lies below, and progressively further from, the diagonal. On the other hand, logistic EuroSCORE shows a somewhat constant overprediction of mortality for all classes of risk. In other words, EuroSCORE II performs very well in the first tertile of risk, which represents the vast majority of our study group, but then its calibration dramatically deteriorates, whereas logistic EuroSCORE systematically overpredicts even in patients at low risk. To our knowledge, no other studies on EuroSCORE II are currently available for the comparison of performances with the previously released versions, and the only external validation has been performed in a subpopulation of isolated CABG.33 The performance was reported to be optimal for both discrimination and calibration; however, this result can be explained in part by the composition of the original EuroSCORE II population, as isolated CABG accounted for 46.7% of the group and hence the score could be more precise in that subgroup. Nonetheless, we also performed this analysis in isolated CABG, confirming in this case a very good calibration and a Brier score closer to 0, but again a U statistic P-value <0.05, which indicates poor reliability (data not shown). This supports the view that the original limitations of EuroSCORE I have been mitigated but not yet resolved in the new version. The poor calibration in high-risk profiles could represent a major limitation for risk estimation in candidates for transcatheter valve procedures. Although these new procedures were not included in the EuroSCORE II development and EuroSCORE II has not been validated in these subgroups, it is expected that this new version will be employed for the selection of high-risk patients, as has already happened for its previous versions, leading to potential risk over-estimation. An analysis of score performance in large transcatheter valve registries is necessary to validate EuroSCORE II, and the need for a new dedicated score has already been claimed in order to overcome the potential limitations of existing tools.

A peculiar aspect of the EuroSCORE II model was the inclusion of a high proportion of risk factors that were not significantly associated with the outcome of interest (i.e. mortality) in multivariate regression. The methods for selecting a subset of covariates usually take into account the significance level for both the choice of initial variables and the model reduction.22,34,35 Moreover, parsimony in selecting independent regressors is recommended in order to avoid overfitting the model, and the analysis can be more consistent and effective if confined to the few variables or pre-selected combinations of variables that are the most powerful predictors, instead of including all the many variables that might be statistically significant.36 A simpler model that predicts an outcome with the same performance as complex algorithms should always be preferred.37 Hence, forcing non-significant factors (24% of the total) into the final reduced algorithm needs some explanation or rationale, which, to our knowledge, was not provided by the EuroSCORE II developers. Our results show that the removal of those factors does not have any impact on the performance of EuroSCORE II, as both calibration and discrimination are very similar. Their inclusion leads to a more complicated and less user-friendly score without adding any predictive advantage, and, in our opinion, it should be reconsidered after a further step of external validation of the reduced EuroSCORE II model. On the other hand, one potential drawback in the practical use of a score is variable adherence to the definitions of risk factors, which can lead to very different performances in groups with similar demographic and clinical data, as shown in the literature. Hence, only correct use of the algorithm can lead to high predictive performance, for both simple and complex scores.

The lack of parsimony is not the only challenging issue of the new EuroSCORE. The removal of one major risk factor of the logistic EuroSCORE (post-infarction ventricular septal rupture) because of its low incidence in the data set could reflect a limitation in the number of patients enrolled. High-risk conditions with a very low prevalence but substantial perioperative mortality can be detected and evaluated only in a patient population of adequate size; hence, a preliminary evaluation of the sample size needed to achieve sufficient study power would have been advisable.25,38 With the available data, we cannot determine whether the absence of post-infarction ventricular septal rupture from EuroSCORE II is due to insufficient study power. Moreover, the renewal of EuroSCORE did not analyse candidate regressors that have emerged as strongly related to perioperative mortality and that could carry major weight in an updated score.39,40 Despite these criticisms, EuroSCORE II improved and modernized some definitions, especially those of kidney function and weight of surgery. Creatinine clearance has been demonstrated to be a more precise predictor of mortality than serum creatinine and is therefore expected to improve risk prediction. Moreover, a more complex classification of operations based on the nature and number of procedures replaced the dichotomization between isolated CABG and all other surgeries, which was considered a major limitation of the original EuroSCORE.

Limitations

A potential limitation of the study is its retrospective nature. Data were derived from three institutional data sets that were prospectively collected. In all data sets, the original EuroSCORE factors had been recorded in dedicated columns together with the score values; the logistic and additive EuroSCOREs were therefore recomputed for each patient to check the correctness of the values. The new EuroSCORE II factors were derived from the data sets: creatinine clearance was computed as suggested, and categorizations were recalculated from continuous data.
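The derivation of the renal factor from continuous data can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this study: it assumes that "as suggested" refers to the Cockcroft-Gault formula, and that the renal categories are cut at 50 and 85 ml/min with dialysis as a separate class. The function names, the category labels, and the handling of values falling exactly on a boundary are hypothetical choices.

```python
def cockcroft_gault(age_years, weight_kg, serum_creatinine_mg_dl, female):
    """Estimate creatinine clearance (ml/min) with the Cockcroft-Gault formula."""
    if serum_creatinine_mg_dl <= 0:
        raise ValueError("serum creatinine must be positive")
    clearance = (140 - age_years) * weight_kg / (72 * serum_creatinine_mg_dl)
    # The formula applies a 0.85 correction factor for female patients.
    return clearance * 0.85 if female else clearance


def renal_category(clearance_ml_min, on_dialysis=False):
    """Map creatinine clearance to a renal impairment category.

    The cut-offs (50 and 85 ml/min) and the boundary handling are assumptions
    made for this sketch.
    """
    if on_dialysis:
        return "dialysis"
    if clearance_ml_min > 85:
        return "normal"
    if clearance_ml_min > 50:
        return "moderate"
    return "severe"
```

Keeping the continuous clearance value alongside the derived category makes it possible to recompute the categorization if a different boundary convention is adopted.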

Conclusions

This large validation study demonstrated that the new EuroSCORE II predicts perioperative mortality with more than satisfactory performance. It showed an optimal calibration until 30%-predicted mortality, a range that covers a large proportion of the patients. Nonetheless, it does not seem to significantly improve the performance of the additive and logistic EuroSCOREs in the higher tertiles of risk. Moreover, the removal of non-significant factors did not alter the performance of EuroSCORE II, demonstrating that it can be further simplified. Taken together, these data suggest that EuroSCORE II might benefit from a simplification of the list of included variables; moreover, it could be advisable to test some new predictors, which might improve its performance in the high-risk patient population.

Authors' contributions

F.B., D.P., and A.P. conceived the project and compiled/drafted the article. F.B. checked the data sets and performed the statistical analysis. A.C., O.R., C.G., R.D.B., and F.A. managed the three institutional data sets and calculated and checked the scores for the analysis. All authors participated in the conception of the manuscript, drafted and revised the article, and gave their final approval to the text.

Conflict of interest: none declared.

References

1. Nashef SA, Roques F, Sharples LD, Nilsson J, Smith C, Goldstone AR, Lockowandt U. EuroSCORE II. Eur J Cardiothorac Surg 2012, vol. 41 (pg. 734-744)
2. Nashef SA, Roques F, Michel P, Gauducheau E, Lemeshow S, Salamon R. European system for cardiac operative risk evaluation (EuroSCORE). Eur J Cardiothorac Surg 1999, vol. 16 (pg. 9-13)
3. Roques F, Michel P, Goldstone AR, Nashef SA. The logistic EuroSCORE. Eur Heart J 2003, vol. 24 (pg. 882-883)
4. Akar AR, Kurtcephe M, Sener E, Alhan C, Durdu S, Kunt AG, Güvenir HA; Group for the Turkish Society of Cardiovascular Surgery and Turkish Ministry of Health. Validation of the EuroSCORE risk models in Turkish adult cardiac surgical population. Eur J Cardiothorac Surg 2011, vol. 40 (pg. 730-735)
5. Yap CH, Reid C, Yii M, Rowland MA, Mohajeri M, Skillington PD, Seevanayagam S, Smith JA. Validation of the EuroSCORE model in Australia. Eur J Cardiothorac Surg 2006, vol. 29 (pg. 441-446)
6. Kobayashi KJ, Williams JA, Nwakanma LU, Weiss ES, Gott VL, Baumgartner WA, Conte JV. EuroSCORE predicts short- and mid-term mortality in combined aortic valve replacement and coronary artery bypass patients. J Card Surg 2009, vol. 24 (pg. 637-643)
7. Kasimir MT, Bialy J, Moidl R, Simon-Kupilik N, Mittlböck M, Hiesmayr M, Wolner E, Simon P. EuroSCORE predicts mid-term outcome after combined valve and coronary bypass surgery. Heart Valve Dis 2004, vol. 13 (pg. 439-443)
8. Ettema RG, Peelen LM, Schuurmans MJ, Nierich AP, Kalkman CJ, Moons KG. Prediction models for prolonged intensive care unit stay after cardiac surgery: systematic review and validation study. Circulation 2010, vol. 122 (pg. 682-689)
9. De Maria R, Mazzoni M, Parolini M, Gregori D, Bortone F, Arena V, Parodi O. Predictive value of EuroSCORE on long term outcome in cardiac surgery patients: a single institution study. Heart 2005, vol. 91 (pg. 779-784)
10. Mack MJ. Risk scores for predicting outcomes in valvular heart disease: how useful? Curr Cardiol Rep 2011, vol. 13 (pg. 107-112)
11. Paranskaya L, D'Ancona G, Bozdag-Turan I, Akin I, Kische S, Turan GR, Divchev D, Rehders TC, Schneider H, Ortak J, Nienaber CA, Ince H. Early and mid-term outcomes of percutaneous mitral valve repair with the MitraClip®: comparative analysis of different EuroSCORE strata. EuroIntervention 2012 Jun 13. pii: 20120517-02. (Epub ahead of print)
12. Tamburino C, Barbanti M, Capodanno D, Sarkar K, Cammalleri V, Scarabelli M, Mulè M, Immè S, Aruta P, Ussia GP. Early- and mid-term outcomes of transcatheter aortic valve implantation in patients with logistic EuroSCORE less than 20%: a comparative analysis between different risk strata. Catheter Cardiovasc Interv 2012, vol. 79 (pg. 132-140)
13. Parolari A, Pesce LL, Trezzi M, Loardi C, Kassem S, Brambillasca C, Miguel B, Tremoli E, Biglioli P, Alamanni F. Performance of EuroSCORE in CABG and off-pump coronary artery bypass grafting: single institution experience and meta-analysis. Eur Heart J 2009, vol. 30 (pg. 297-304)
14. Parolari A, Pesce LL, Trezzi M, Cavallotti L, Kassem S, Loardi C, Pacini D, Tremoli E, Alamanni F. EuroSCORE performance in valve surgery: a meta-analysis. Ann Thorac Surg 2010, vol. 89 (pg. 787-793)
15. Lebreton G, Merle S, Inamo J, Hennequin JL, Sanchez B, Rilos Z, Roques F. Limitations in the inter-observer reliability of EuroSCORE: what should change in EuroSCORE II? Eur J Cardiothorac Surg 2011, vol. 40 (pg. 1304-1308)
16. Siregar S, Groenwold RH, de Heer F, Bots ML, van der Graaf Y, van Herwerden LA. Performance of the original EuroSCORE. Eur J Cardiothorac Surg 2012, vol. 41 (pg. 746-754)
17. Bartfay E, Bartfay WJ. Accuracy assessment of prediction in patient outcomes. J Eval Clin Pract 2008, vol. 14 (pg. 1-10)
18. Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996, vol. 15 (pg. 361-387)
19. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation 2007, vol. 115 (pg. 928-935)
20. Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics 2005, vol. 21 (pg. 3940-3941)
21. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Müller M. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011, vol. 12 pg. 77
22. Harrell FE Jr. Regression Modelling Strategies. 2001. New York, NY: Springer
23. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology 2010, vol. 21 (pg. 128-138)
24. Janssen KJ, Donders AR, Harrell FE Jr, Vergouwe Y, Chen Q, Grobbee DE, Moons KG. Missing covariate data in medical research: to impute is better than to ignore. J Clin Epidemiol 2010, vol. 63 (pg. 721-727)
25. Ranucci M. Risk stratification in cardiac surgery. Semin Cardiothorac Vasc Anesth 2010, vol. 14 (pg. 66-67)
26. Warner CD, Weintraub WS, Craver JM, Jones EL, Gott JP, Guyton RA. Effect of cardiac surgery patient characteristics on patient outcomes from 1981 through 1995. Circulation 1997, vol. 96 (pg. 1575-1579)
27. Pintor PP, Colangelo S, Bobbio M. Evolution of case-mix in heart surgery: from mortality risk to complication risk. Eur J Cardiothorac Surg 2002, vol. 22 (pg. 927-933)
28. Dupuis JY, Wang F, Nathan H, Lam M, Grimes S, Bourke M. The cardiac anesthesia risk evaluation score: a clinically useful predictor of mortality and morbidity after cardiac surgery. Anesthesiology 2001, vol. 94 (pg. 194-204)
29. Ranucci M, Castelvecchio S, Conte M, Megliola G, Speziale G, Fiore F, Guarracino F, Scolletta S, Escobar RM, Falco M, Bignami E, Landoni G. The easier, the better: age, creatinine, ejection fraction score for operative mortality risk stratification in a series of 29 659 patients undergoing elective cardiac surgery. J Thorac Cardiovasc Surg 2011, vol. 142 (pg. 581-586)
30. Shahian DM, O'Brien SM, Filardo G, Ferraris VA, Haan CK, Rich JB, Normand SL, DeLong ER, Shewan CM, Dokholyan RS, Peterson ED, Edwards FH, Anderson RP; Society of Thoracic Surgeons Quality Measurement Task Force. The Society of Thoracic Surgeons 2008 cardiac surgery risk models: part 3–valve plus coronary artery bypass grafting surgery. Ann Thorac Surg 2009, vol. 88 1 Suppl. (pg. S43-S62)
31. Tran DT, Dupuis JY, Mesana T, Ruel M, Nathan HJ. Comparison of the EuroSCORE and Cardiac Anesthesia Risk Evaluation (CARE) score for risk-adjusted mortality analysis in cardiac surgery. Eur J Cardiothorac Surg 2012, vol. 41 (pg. 307-313)
32. Wendt D, Osswald BR, Kayser K, Thielmann M, Tossios P, Massoudy P, Kamler M, Jakob H. Society of Thoracic Surgeons score is superior to the EuroSCORE determining mortality in high risk patients undergoing isolated aortic valve replacement. Ann Thorac Surg 2009, vol. 88 (pg. 468-474)
33. Biancari F, Vasques F, Mikkola R, Martin M, Lahtinen J, Heikkinen J. Validation of EuroSCORE II in patients undergoing coronary artery bypass surgery. Ann Thorac Surg 2012, vol. 93 (pg. 1930-1935)
34. Sheather SJ. A Modern Approach to Regression with R. 2009. New York, NY: Springer
35. Hosmer DW, Lemeshow S. Applied Survival Analysis. Regression Modeling of Time to Event Data. 1999. New York, NY: John Wiley & Sons, Ltd
36. Wells CK, Feinstein AR, Walter SD. A comparison of multivariable mathematical methods for predicting survival, III: accuracy of predictions in generating and challenge sets. J Clin Epidemiol 1990, vol. 43 (pg. 361-372)
37. Ranucci M, Castelvecchio S, Menicanti L, Frigiola A, Pelissero G. Risk of assessing mortality risk in elective cardiac operations: age, creatinine, ejection fraction, and the law of parsimony. Circulation 2009, vol. 119 (pg. 3053-3061)
38. Harvey BJ, Lang TA. Hypothesis testing, study power, and sample size. Chest 2010, vol. 138 (pg. 734-737)
39. Karkouti K, Wijeysundera DN, Beattie WS; Reducing Bleeding in Cardiac Surgery (RBC) Investigators. Risk associated with preoperative anemia in cardiac surgery: a multicenter cohort study. Circulation 2008, vol. 117 (pg. 478-484)
40. Ranucci M, Dedda UD, Castelvecchio S, Menicanti L, Frigiola A, Pelissero G; Surgical and Clinical Outcome Research (SCORE) Group. Impact of preoperative anemia on outcome in adult cardiac surgery: a propensity-matched analysis. Ann Thorac Surg 2012 Jun 12 (Epub ahead of print)

Supplementary data