Rémi Kaboré, Maria C. Haller, Jérôme Harambat, Georg Heinze, Karen Leffondré, Risk prediction models for graft failure in kidney transplantation: a systematic review, Nephrology Dialysis Transplantation, Volume 32, Issue suppl_2, April 2017, Pages ii68–ii76, https://doi.org/10.1093/ndt/gfw405
Abstract
Risk prediction models are useful for identifying kidney recipients at high risk of graft failure, thus optimizing clinical care. Our objective was to systematically review the models that have been recently developed and validated to predict graft failure in kidney transplantation recipients. We used PubMed and Scopus to search for English, German and French language articles published in 2005–15. We selected studies that developed and validated a new risk prediction model for graft failure after kidney transplantation, or validated an existing model with or without updating the model. Data on recipient characteristics and predictors, as well as modelling and validation methods were extracted. In total, 39 articles met the inclusion criteria. Of these, 34 developed and validated a new risk prediction model and 5 validated an existing one with or without updating the model. The most frequently predicted outcome was graft failure, defined as dialysis, re-transplantation or death with functioning graft. Most studies used the Cox model. There was substantial variability in predictors used. In total, 25 studies used predictors measured at transplantation only, and 14 studies used predictors also measured after transplantation. Discrimination performance was reported in 87% of studies, while calibration was reported in 56%. Performance indicators were estimated using both internal and external validation in 13 studies, and using external validation only in 6 studies. Several prediction models for kidney graft failure in adults have been published. Our study highlights the need to better account for competing risks when applicable in such studies, and to adequately account for post-transplant measures of predictors in studies aiming at improving monitoring of kidney transplant recipients.
INTRODUCTION
Predicting outcomes to guide clinical care, decision-making and resource allocation is a challenging issue in kidney transplantation. Although the survival of kidney transplant recipients has improved during the past 20 years, their life expectancy remains far below that of the general population [1–5]. Anticipating therapeutic procedures in patients at high risk of losing their kidney graft is key to improving graft survival. Kidney biopsy is the ‘gold standard’ diagnostic tool to assess the graft rejection process and thus identify patients at risk. Unfortunately, this procedure remains invasive and expensive. As a result, the onset of graft rejection is often discovered too late to allow timely adjustment of care. Tools for predicting the long-term risk of graft loss appear to be suitable and cheap alternatives for addressing this difficulty. Several studies have proposed clinical prediction tools for end-stage renal disease, and a few systematic reviews of these articles have been conducted [6, 7]. However, studies on graft failure prediction seem to be less frequent and, to the best of our knowledge, no systematic review of these tools and related methodological issues has been conducted. Yet, kidney graft failure prediction raises several methodological issues including the presence of competing risks [8]. The use of available information on biomarkers after transplantation to improve graft failure prediction also raises some methodological challenges. The method used to validate the developed prediction tool is also important, since its results indicate the relevance and suitability of the prediction tool in clinical practice.
Our objective was therefore to review prognostic studies published in the past 10 years that developed and validated, or only validated, a risk prediction model for graft failure after transplantation in kidney recipients, and to discuss methodological approaches.
MATERIALS AND METHODS
Search strategy
A search syntax with Boolean combinations of terms for kidney transplantation and prognostic models was constructed and used to identify risk prediction models for graft failure in kidney transplant recipients in Scopus and PubMed (Table 1) [9]. Detailed search strategies are provided in Supplementary Appendix 1. The results of these searches were limited to studies published in English, German and French between 1 January 2005 and 31 December 2015. Reference lists of included articles were also reviewed for relevant citations.
Study selection
R.K. screened retrieved articles by title and abstract and two reviewers (R.K. and J.H.) assessed the full text of each potentially relevant study to determine eligibility for inclusion using pre-defined criteria. We included studies that developed and validated a risk prediction model to predict graft failure in kidney transplant recipients as well as studies that validated an existing model with or without model updating. We excluded diagnostic models, narrative reviews, commentaries, case reports and editorials that contained no original data. Prediction models that were not designed to predict graft failure following kidney transplantation or did not perform any validation were also excluded. In case of discordance between the two reviewers, the final decision about inclusion/exclusion of the study was based on a discussion with a third reviewer (K.L.).
Data extraction
We extracted the following data from each of the included prediction modelling studies, adapted from the CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS): (i) authors' names and publication year; (ii) population characteristics including donor type (living/deceased), recipients' age range and transplantation period; (iii) definition of graft failure (i.e. return to or initiation of dialysis, re-transplantation and/or death), and if applicable, any mention of competing risks in the paper (i.e. competing risk by death with functioning graft for death-censored graft failure, and competing risk by dialysis/re-transplantation for death with functioning graft), and time horizon of prediction; (iv) the data and the methods used to develop the model including centres, years of transplantation, sample size, predictors and their time of measurement with respect to transplantation, and the type of statistical model (e.g. Cox regression model, logistic regression model); and (v) the data and the methods used to validate the model including external and internal validation methods, and the prediction performance criteria including indicators of overall performance, discrimination, calibration and reclassification (all explained below). We also retrieved the number of citations of the article in Google Scholar, as well as whether the predictive performance of the proposed tool was compared to that of other existing tools. R.K. performed the data extraction, the result of which was evaluated and discussed with K.L. until a consensus was reached.
Overview of methodological approaches
Validation methods for prognostic models have been extensively discussed [10–12]. The overall performance of a model reflects the distance between predicted and observed outcomes [13, 14]. Discrimination reflects the ability of the model to distinguish patients who will experience the event from patients who will not. It is usually measured using the concordance (C) statistic, i.e. the area under the receiver operating characteristic (ROC) curve (AUC) [10, 11], which should account for censoring [15, 16] and competing risks [17, 18] if applicable. Calibration reflects the capacity of the model to correctly estimate the probability of the event at an individual level and can be assessed graphically [19]. Reclassification quantifies the improvement of a new risk prediction model compared with an existing one, in terms of classification of patients into those who will experience the event and those who will not [20, 21].
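To make the idea of a censoring-aware C statistic concrete, the following is a minimal sketch of Harrell's concordance index in plain Python. It is an illustration only, not the method of any reviewed study: the function name `harrell_c`, the toy data and the handling of ties are assumptions made for this example, and the sketch does not account for competing risks.

```python
def harrell_c(times, events, risks):
    """Harrell's concordance index for right-censored data (illustrative sketch).

    times  : observed follow-up time for each patient
    events : 1 if graft failure was observed, 0 if the patient was censored
    risks  : predicted risk score (higher means earlier failure is expected)
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            # A pair is usable only if patient i failed before patient j's
            # last observed time; pairs with tied times are skipped here.
            if times[i] < times[j]:
                usable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier failure had the higher predicted risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied predictions count as half
    if usable == 0:
        raise ValueError("no comparable pairs")
    return concordant / usable


# Hypothetical toy data: five recipients, follow-up in years.
times = [2.0, 5.0, 3.0, 8.0, 6.0]
events = [1, 0, 1, 1, 0]
risks = [0.9, 0.4, 0.3, 0.2, 0.5]
c_index = harrell_c(times, events, risks)  # 5 concordant pairs out of 7 usable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of failure times by predicted risk.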
All performance indicators can be calculated on the same data set as the one used to develop the model (internal validation), and/or a separate data set (external validation). Internal validation includes single random split-sampling (e.g. 50% development sample, 50% validation sample) and resampling methods (cross-validation or bootstrapping) based on many repeated splits of the data set [22]. External validation includes temporal (using patients transplanted in different years), spatial (using patients transplanted in different centres) and fully external validation (using patients from different countries) [23].
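The difference between the internal validation schemes described above lies mainly in how patients are allocated to development and validation sets. The sketch below shows one possible way to generate these allocations in plain Python; the function names, default fractions and seeds are assumptions chosen for illustration, and fitting and evaluating the prediction model on each split is deliberately left out.

```python
import random


def split_sample(n, dev_fraction=0.5, seed=0):
    """Single random split: returns (development, validation) patient indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * dev_fraction))
    return idx[:cut], idx[cut:]


def k_fold(n, k=10, seed=0):
    """k-fold cross-validation: each patient is held out exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        development = [p for f, fold in enumerate(folds) if f != held_out for p in fold]
        splits.append((development, folds[held_out]))
    return splits


def bootstrap_sample(n, seed=0):
    """Bootstrap: draw n patients with replacement from the full data set."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]


# Hypothetical cohort of 100 recipients.
dev, val = split_sample(100, dev_fraction=0.5)
folds = k_fold(100, k=10)
boot = bootstrap_sample(100)
```

With a single split, half of the cohort never contributes to model development, whereas cross-validation and bootstrapping reuse every patient across repeated splits.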
RESULTS
| Topic | Specific terms used | Position in article |
|---|---|---|
| Kidney transplanted population | ‘transplant OR transplantation OR graft OR grafting’ | TITLE-ABSTRACT |
| | AND | |
| | ‘kidney OR renal’ | TITLE-ABSTRACT |
| AND | | |
| Prediction | ‘prediction OR predict OR predictive OR probability OR prognosis OR prognostic OR prognostication OR score OR scores’ | TITLE-ABSTRACT |
| AND | | |
| Modelling | ‘model OR models OR regression OR equation OR equations OR modeling OR modelling’ | TITLE-ABSTRACT |
| AND | | |
| Outcome | ‘failure OR loss OR Death OR mortality OR survival’ | TITLE-ABSTRACT |
| AND | | |
| Prediction study design | ‘prediction OR predict OR predictive OR predicting OR validation OR validity OR validated OR cross-validation OR selection OR calibration OR discrimination OR discriminates OR ROC’ | TITLE-ABSTRACT-KEYWORDS (a) |
(a) Further search in keywords in Scopus database only.
Characteristics | n | % |
---|---|---|
Type of studies | ||
Development and validation of a new model | 34 | 87 |
Validating an existing model | ||
Without any model updating | 2 | 5 |
With model updating | 3 | 8 |
Publication years | ||
2005–07 | 6 | 15 |
2008–10 | 14 | 36 |
2011–13 | 14 | 36 |
2014–15 | 5 | 13 |
Number of citations in Google Scholar | ||
<6 | 9 | 23 |
6–15 | 9 | 23 |
16–29 | 12 | 31 |
≥30 | 9 | 23 |
Recipient age | ||
All ages | 1 | 3 |
≥16 years | 3 | 8 |
Adults | ||
≥18 years | 15 | 38 |
≥65 years | 1 | 3 |
Age range unspecified | 15 | 38 |
Not reported | 4 | 10 |
Recipient kidney transplant type | ||
Deceased or living donors | 9 | 23 |
Living donors only | 4 | 10 |
Deceased donor only | 14 | 36 |
Unspecified | 12 | 31 |
Data used to develop the model | ||
National registry | 17 | 43 |
Single centre | 14 | 36 |
Multicentre | 8 | 21 |
Studied populations
Most studies (n = 31) targeted adult recipients [24–26, 28–55], including one devoted to elderly recipients [35]. No study specifically targeted paediatric recipients, although four studies included all ages or recipients older than 16 years. In total, 4 studies (10%) did not indicate age range of recipients, 14 (36%) studies developed/validated models for recipients of deceased donors only and 12 (31%) did not report the donor type. In total, 17 (43%) studies used data from national registries to develop their model, 14 (36%) used data from a single centre and 8 (21%) from several centres (Table 2).
Predicted outcomes
Nine articles investigated several types of events including different definitions of graft failure (Table 3). Among the 30 studies predicting one event type only, dialysis or re-transplantation (whichever comes first) was investigated in 10 papers, death was included in the graft failure definition in 11 papers and 9 studies predicted only death. Among these nine studies, one indicated that it was death with functioning graft [61] and four that it was death before or after dialysis/re-transplantation [27, 35, 37, 62], i.e. with or without functioning graft. Four papers only mentioned ‘any cause of death’ [29, 38, 39, 58], without indicating whether it was death with functioning graft only, or death with or without functioning graft. Outcomes were mostly predicted at a long-term time horizon (i.e. at least 5 years post-transplant) (Table 3), and no study clearly reported if and how competing risks were accounted for when applicable (Supplementary Appendix 2).
Study characteristics | n | % |
---|---|---|
Predicted events | ||
Graft failure (dialysis/re-transplantation/death with functioning graft) | 11 | 28 |
Death censored graft failure (dialysis/re-transplantation) | 10 | 25 |
Death with or without functioning graft | 4 | 10 |
Death with functioning graft | 1 | 3 |
Death of ‘any cause’ (without specifying with or without functioning graft) | 4 | 10 |
Several predicted events | 9 | 23 |
Predicted time horizon | ||
Short term (1–4 years) only | 7 | 18 |
Long term only | ||
5–10 years | 15 | 38 |
Years unspecified | 3 | 8 |
Both short-term and long-term time horizon | 11 | 28 |
Not reported | 3 | 8 |
Timing of predictors measurement | ||
At transplantation only | 25 | 64 |
After transplantation | 14 | 36 |
Statistical modelling method | ||
Cox regression only | 23 | 59 |
Logistic regression only | 5 | 12 |
BBNs only | 1 | 3 |
Based tree model only | 3 | 8 |
Linear regression model only | 1 | 3 |
Several types of statistical model used | 5 | 12 |
Calculation of score index by formula | 1 | 3 |
Final form of the model used for validation and future use | ||
Original model | 17 | 44 |
Score | 19 | 49 |
Nomogram | 1 | 2 |
Several forms | 2 | 5 |
Predictors used
Statistical models used
A total of 23 studies (59%) used the Cox model only (Table 3). Logistic regression was used in five papers [27, 28, 30, 32, 33] for both short- and long-term prediction, although this regression model assumes that all patients are followed up over the entire period of prediction. Other less frequently used statistical approaches included decision tree methodology [36, 43, 63], Bayesian belief networks (BBNs) [57] and artificial neural networks [24].
Proposed prediction tool for clinical use
Prediction models, in their original form, allow the derivation of the probability for a given patient to lose the graft before a given time after transplantation. The derived probability should help decision-making, before or after transplantation of the patient. However, the derivation of the probability should be implemented via software or an online calculator that is easy to use in clinical practice. Moreover, a probability alone may be difficult to interpret, in particular if it is not compared to the probabilities of other patients. To facilitate future use of the proposed prediction model in clinical practice, 15 studies presented it as a score and 1 as a nomogram (Table 3). Among them, 11 proposed some thresholds to facilitate decision-making (Supplementary Appendix 2). The thresholds were obtained using the median [31], tertiles [37, 38], quartiles [42, 52] or quintiles [39, 45] of the score distribution in the development data set. Some thresholds were also based on cluster analysis [51], or on optimal cut-offs for sensitivity/specificity [30, 33, 49]. Some predictive tools, e.g. the Kidney Transplant Failure Score (KTFS), which predicts the risk of dialysis after the first year post-transplantation [33], have been implemented in an online calculator (Supplementary Appendix 2), and the use of the KTFS in clinical practice is currently being formally evaluated [64]. The Recipient Risk Score (RRS), which was originally proposed to improve deceased donor renal allocation [25], has recently been updated to incorporate creatinine at 1 year post-transplantation (1-year RRS) and has also been implemented in an online calculator [65].
Evaluation of the performance of the risk prediction model
The performance indicators were estimated using internal validation techniques only in 20 studies (Table 4). Among them, eight used single random split-sampling only [27, 34, 37–39, 43, 49, 52] and nine used resampling techniques only [26, 28, 29, 32, 36, 45, 48, 55, 62]. External validation alone was performed in six studies (two studies used spatial validation [53, 60], two temporal validation [30, 42] and two fully external validation [58, 59]). Both internal and external validation were performed in 13 studies [24, 31, 33, 35, 41, 44, 46, 47, 50, 51, 57, 61, 63].
| | n | % |
|---|---|---|
| Method used to test the performance of the model | | |
| External validation only | 6 | 15 |
| Spatial (a) | 2 | |
| Temporal (a) | 2 | |
| Fully external (a) | 2 | |
| Internal validation only | 20 | 51 |
| Split-sample only | | |
| 50% development, 50% validation | 5 | |
| 60% development, 40% validation | 1 | |
| 66% development, 33% validation | 2 | |
| Cross-validation only | | |
| 5-fold cross-validation | 1 | |
| 10-fold cross-validation | 4 | |
| Unspecified | 3 | |
| Bootstrapping only | 1 | |
| Several internal validation methods used | 1 | |
| Internal validation unspecified | 2 | |
| Both internal and external validation | 13 | 34 |
| Method used to quantify the performance of the model | | |
| Overall performance | 2 | 5 |
| R² (coefficient of determination) | 2 | |
| Discrimination | 34 | 87 |
| AUC (C statistic) only | 26 | |
| Classification (sensitivity and specificity) only | 3 | |
| Both criteria | 5 | |
| Calibration | 22 | 56 |
| Hosmer–Lemeshow test only | 5 | |
| Calibration slope only | 2 | |
| Agreement between predicted and observed graft failures only | 12 | |
| Several calibration methods | 3 | |
| Reclassification | 3 | 8 |
| NRI | 3 | |
| Other criteria | 1 | 3 |
| IPEC score | 1 | |
(a) ‘Spatial external validation’ means validated in other centre(s) in the same source population; ‘temporal external validation’ means validated in recent years in the same source population; ‘fully external validation’ means validated in another source population.
Among the 31 studies that reported AUC (Table 4), the highest AUC value (0.94) was obtained for an artificial neural network model predicting dialysis/re-transplantation/death at 5 years in patients transplanted from living donors and who survived beyond 3 months post-transplant [24] (Supplementary Appendix 2). The result of the Hosmer–Lemeshow test was reported in seven papers [24, 32, 35, 44, 53, 54, 58]. The agreement between predicted and observed outcomes was evaluated in 12 studies, but calibration plots were shown in 5 studies only [28, 43, 49–51]. The calibration slope was reported in two studies [29, 41]. Reclassification performance using the Net Reclassification Improvement (NRI) index was reported in three studies [44, 46, 53]. Overall performance using the coefficient of determination was reported in two studies [29, 51]. The integrated prediction error curve (IPEC) score, which evaluates the predictive performance in the context of high-dimensional survival data [66, 67], was used in one study [47].
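For readers unfamiliar with the NRI, the sketch below computes a categorical version for a binary outcome: the net proportion of patients with the event who move to a higher risk category under the new model, plus the net proportion of patients without the event who move to a lower one. The function name, the two-category risk grouping and the toy data are assumptions made for illustration, and this simple form ignores censoring, which a survival setting would need to handle.

```python
def net_reclassification_improvement(old_cat, new_cat, outcome):
    """Categorical NRI for a binary outcome (illustrative sketch).

    old_cat, new_cat : risk category per patient under the existing and the
                       new model (larger number = higher risk category)
    outcome          : 1 if the event occurred, 0 otherwise
    """
    up_e = down_e = n_e = 0       # patients with the event
    up_ne = down_ne = n_ne = 0    # patients without the event
    for old, new, y in zip(old_cat, new_cat, outcome):
        if y:
            n_e += 1
            up_e += new > old
            down_e += new < old
        else:
            n_ne += 1
            up_ne += new > old
            down_ne += new < old
    if n_e == 0 or n_ne == 0:
        raise ValueError("need at least one event and one non-event")
    # Upward moves are improvements for events; downward moves for non-events.
    return (up_e - down_e) / n_e + (down_ne - up_ne) / n_ne


# Hypothetical toy data: 0 = low-risk category, 1 = high-risk category.
old_cat = [0, 0, 1, 1, 0, 1, 1, 0]
new_cat = [1, 1, 1, 0, 0, 0, 1, 0]
outcome = [1, 1, 1, 1, 0, 0, 0, 0]
nri = net_reclassification_improvement(old_cat, new_cat, outcome)
```

In this toy example, the event group contributes (2 − 1)/4 and the non-event group (1 − 0)/4, giving an NRI of 0.5.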
DISCUSSION
A relatively large number of prognostic models have been proposed and validated in the last 10 years to predict graft failure in adult kidney transplant recipients. However, we found substantial variability in data collected and methods used for model development and validation. Notably, the definition of graft failure was particularly variable and included death in about half of the studies. A large number of different predictors have been used in the various models, including a number of predictors measured after transplantation such as serum creatinine [33, 34, 38, 40], eGFR [41, 44, 50], proteinuria [33], acute rejection [33, 39], acute tubular necrosis [38], carotid-femoral pulse wave velocity [29] and use of immunosuppressant [38, 39]. About half of the studies validated their model without external data, using various internal validation techniques including mostly single-split sampling. Most studies reported the discrimination performance of their model using AUC only.
The large variability in the quality of the methods used to develop and validate the risk prediction model, as well as in the quality of reporting, has been observed in many other systematic reviews of prognosis studies conducted in other contexts [68–70]. Such observations led to the development of clear recommendations for assessing the performance of prediction models [10] as well as for reporting the results [71, 72]. Recommendations indicate for example that both discrimination and calibration should be reported, internal validation should be corrected for optimism using resampling techniques, external validation could be useful to assess the generalizability to other similar populations, and performance indicators for survival outcomes should account for censoring and competing risks [10]. While internal validation correcting for over-optimism is absolutely necessary as a first step to evaluate the performance of the model in the population used to develop the model, external validation is essential for subsequent use in clinical practice. Indeed, the overall population of kidney recipients is large and diverse, and clinicians may not be confident enough in a tool that has not been validated in different kidney recipient populations. Our results indeed show that among studies that performed both internal and external validation, the AUC tended to be substantially lower when based on external validation. The AUC also varied considerably depending on the time horizon of prediction, even for a given prediction model. Some studies did not report this time horizon, which makes it impossible to compare their reported AUC with other published AUCs.
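The following minimal Python sketch illustrates one common resampling technique for optimism correction, the bootstrap: the model is refitted in each bootstrap sample, and the average difference between its AUC in that sample and its AUC in the original data is subtracted from the apparent AUC. It makes several simplifying assumptions that are not taken from the reviewed studies: the outcome is a binary status at a fixed horizon with no censoring or competing risks, the data are simulated noise, and the "model" (the hypothetical functions fit and score) merely selects the single predictor with the best apparent AUC.

```python
import random

def auc(scores, labels):
    """Probability that a random case scores higher than a random non-case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fit(rows):
    """Toy model (assumption): keep the predictor and sign with the best apparent AUC."""
    labels = [y for _, y in rows]
    best = None
    for j in range(len(rows[0][0])):
        a = auc([x[j] for x, _ in rows], labels)
        sign = 1 if a >= 0.5 else -1
        strength = abs(a - 0.5)
        if best is None or strength > best[0]:
            best = (strength, j, sign)
    return best[1], best[2]

def score(model, rows):
    """AUC of a fitted toy model evaluated on the given rows."""
    j, sign = model
    return auc([sign * x[j] for x, _ in rows], [y for _, y in rows])

def optimism_corrected_auc(rows, n_boot=100, seed=0):
    """Return (apparent AUC, mean bootstrap optimism, corrected AUC)."""
    rng = random.Random(seed)
    apparent = score(fit(rows), rows)
    diffs = []
    for _ in range(n_boot):
        boot = [rng.choice(rows) for _ in rows]   # resample patients with replacement
        if len({y for _, y in boot}) < 2:         # AUC needs both outcome classes
            continue
        model = fit(boot)                         # repeat the whole fitting procedure
        diffs.append(score(model, boot) - score(model, rows))
    optimism = sum(diffs) / len(diffs)
    return apparent, optimism, apparent - optimism

# Simulated data: 60 patients, 8 pure-noise predictors, balanced binary outcome.
rng = random.Random(42)
rows = [([rng.gauss(0, 1) for _ in range(8)], i % 2) for i in range(60)]
apparent, optimism, corrected = optimism_corrected_auc(rows)
print(f"apparent={apparent:.3f} optimism={optimism:.3f} corrected={corrected:.3f}")
```

Because the predictors carry no information, any apparent AUC above 0.5 in this example is due to the selection step alone, which is what the correction is meant to remove. For survival outcomes, the AUC function would need to be replaced by an estimator that accounts for censoring and competing risks.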
Methodological issues need to be considered when deciding whether to use a proposed predictive tool in clinical practice. For example, censoring and competing risks are of major importance in our setting since graft failure is a time-to-event (survival) outcome that is systematically subject to censoring because not all patients are usually followed up over the entire period of prediction. Graft failure may also be subject to competing risks depending on its exact definition. Indeed, death with functioning graft is a competing risk for death-censored graft failure, and dialysis or re-transplantation are competing risks for death with functioning graft [8]. It is also important to distinguish prediction of death with functioning graft from prediction of death after dialysis or re-transplantation, since both dialysis and re-transplantation modify the risk of death. Dialysis and re-transplantation should thus systematically be considered as competing events of death. Yet four studies combined death before and death after dialysis or re-transplantation, and four just specified ‘any cause of death’ without clarifying whether it was death with functioning graft only, or whether it included death after dialysis/re-transplantation. While the Cox model censoring at competing events can be used directly to estimate hazard ratios [8, 73], its use to derive predicted probabilities of events and corresponding model performance indicators in the presence of competing risks requires specific methods and software [17, 74–76]. Yet, none of the studies that we reviewed clearly reported whether and how competing risks were accounted for when applicable. Thus, it is difficult to assess the accuracy of the reported values of the performance indicators such as AUC, which may be biased if these issues were not correctly accounted for.
The bias may be negligible if only a few patients experienced the competing event in the population, but potentially important in populations where a non-negligible proportion of patients experience the competing event. A prediction tool that does not correctly account for important competing risks in the population of interest should thus be used with caution in clinical practice, since the individual probabilities derived from this tool may be biased.
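To illustrate the direction of this bias, the sketch below compares two non-parametric estimates of the probability of graft failure by a given horizon on a small invented data set: the Aalen-Johansen cumulative incidence, which treats death with functioning graft as a competing event, and the naive one-minus-Kaplan-Meier estimate obtained by censoring patients at the competing event. The data, the event coding (0 = censored, 1 = graft failure, 2 = death with functioning graft) and the function names are illustrative assumptions, not taken from any reviewed study.

```python
def cumulative_incidence(times, events, cause, horizon):
    """Aalen-Johansen estimate of P(event of type `cause` by `horizon`)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0   # probability of being free of any event just before the current time
    cif = 0.0
    i = 0
    while i < len(data) and data[i][0] <= horizon:
        t = data[i][0]
        n_here = d_any = d_cause = 0
        while i < len(data) and data[i][0] == t:   # group tied times
            n_here += 1
            d_any += data[i][1] != 0
            d_cause += data[i][1] == cause
            i += 1
        cif += surv * d_cause / n_at_risk
        surv *= 1 - d_any / n_at_risk
        n_at_risk -= n_here
    return cif

def naive_risk(times, events, cause, horizon):
    """One minus Kaplan-Meier, censoring patients at competing events."""
    recoded = [1 if e == cause else 0 for e in events]
    # With a single event type the Aalen-Johansen estimate equals 1 - KM.
    return cumulative_incidence(times, recoded, 1, horizon)

# Invented example: 10 patients, years to first event, no censoring before year 10.
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
events = [1, 2, 1, 2, 2, 1, 2, 1, 2, 2]

cif_failure = cumulative_incidence(times, events, cause=1, horizon=10)
cif_death = cumulative_incidence(times, events, cause=2, horizon=10)
naive = naive_risk(times, events, cause=1, horizon=10)
print(round(cif_failure, 2), round(cif_death, 2), round(naive, 2))
```

In this example 4 of the 10 patients lose their graft, and the Aalen-Johansen estimate recovers that proportion (0.40), whereas the naive estimate is about 0.58. The gap arises because censoring at death implicitly assumes that patients who died could still have lost their graft later.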
For the practical use of predictive tools, it is also important to distinguish two clinical aims: predicting the risk of graft failure from the time of transplantation for a new patient given their baseline characteristics, and predicting the risk of graft failure from a given follow-up visit after transplantation, given the evolution of the patient’s condition since transplantation. Our review found 14 studies that used predictors measured during follow-up after transplantation such as serum creatinine. However, some of these studies did not clearly explain how post-transplant predictors were accounted for, although doing so is not straightforward and is thus prone to methodological pitfalls. The most popular approach to account for such time-dependent predictors in risk prediction models is the landmark approach [77]. This consists of (i) choosing a landmark time (i.e. a time origin), for example, 1 year after transplantation if one wants to incorporate information on predictors within the first year, (ii) developing the model using all patients who are still followed up and at risk of graft failure at the landmark time and (iii) predicting the probability for each patient to lose the graft and/or die between the landmark time and a given time horizon. The risk prediction model developed this way can incorporate any measure of the predictors that has been assessed between transplantation and the landmark time. The advantage of this approach is that it allows the use of standard regression models for survival or competing risk data. The limitation is that, if several measurements of the predictors are available between transplantation and the landmark time, one should generally use summary statistics of these repeated post-transplant measures in the model, such as the last measured value or the percentage decrease of eGFR between transplantation and the landmark time [44].
This was probably the approach used in most of the 14 papers using post-transplant predictors, although only a few of them clearly reported that only patients who had not experienced the event before the landmark time were used to develop the prediction model [33, 40, 41, 44, 50, 53]. An alternative approach, which would not require summarizing repeated values of quantitative post-transplant predictors (e.g. serum creatinine), would be to use a joint modelling approach, which consists of simultaneously modelling the whole trajectory of the post-transplant predictor and the risk of graft failure [78, 79]. Joint models have recently been used to illustrate their advantages for investigating the association between eGFR trajectories and initiation of renal replacement therapy [80] or death in end-stage renal disease patients [81]. They have also been used to assess the association between serum creatinine trajectories and risk of kidney failure [82]. However, to the best of our knowledge, in spite of their potential for providing accurate dynamically updated predictions, joint models have still not been used to predict graft failure in kidney transplant recipients.
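The data preparation implied by steps (i) to (iii) of the landmark approach can be sketched as follows. This is a minimal illustration under stated assumptions: the Patient record, its field names, the choice of the last serum creatinine value as summary statistic and the example values are all hypothetical, and patients without any measurement before the landmark time are simply dropped. The resulting rows could then be passed to a standard survival or competing risks regression model.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    pid: str
    time: float        # years from transplantation to event or censoring
    status: int        # 0 = censored, 1 = graft failure
    creatinine: list   # (years since transplantation, value) pairs

def build_landmark_rows(patients, landmark=1.0, horizon=5.0):
    """Build the analysis data set for a given landmark time and horizon."""
    rows = []
    for p in patients:
        # Step (ii): keep only patients still followed up and at risk at the landmark.
        if p.time <= landmark:
            continue
        # Use only measurements taken up to the landmark, never later ones.
        before = [(t, v) for t, v in p.creatinine if t <= landmark]
        if not before:
            continue   # assumption: no imputation of missing predictors
        last_value = max(before, key=lambda tv: tv[0])[1]
        # Step (iii): reset the clock at the landmark and censor at the horizon.
        if p.time > horizon:
            follow_up, status = horizon - landmark, 0
        else:
            follow_up, status = p.time - landmark, p.status
        rows.append({"pid": p.pid, "last_creatinine": last_value,
                     "time": follow_up, "status": status})
    return rows

# Hypothetical patients; creatinine values are invented.
patients = [
    Patient("A", 0.5, 1, [(0.25, 200.0)]),                              # failed before landmark
    Patient("B", 3.0, 1, [(0.25, 150.0), (0.9, 180.0), (2.0, 300.0)]),
    Patient("C", 8.0, 1, [(0.5, 110.0)]),                               # event after horizon
    Patient("D", 4.0, 0, [(1.5, 120.0)]),                               # no value before landmark
]
rows = build_landmark_rows(patients)
for row in rows:
    print(row)
```

Patient B's measurement at 2 years is deliberately ignored: using information collected after the landmark time to predict risk from the landmark time is one of the pitfalls the approach is designed to avoid.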
The potential limitations of our study were the restriction to English, German and French languages and exclusion of grey literature (i.e. materials and research produced by organizations outside of the traditional academic publishing and distribution channels). However, we believe that this did not impact our main findings regarding the heterogeneity of methods used.
CONCLUSION
This review demonstrates that several prediction models for kidney graft failure have been developed and validated in the past 10 years, with substantial variability in the definition of graft failure, included predictors, and methods used for model development and validation. This makes it difficult to specifically recommend the use of a particular proposed predictive model in clinical practice, although some models have shown good predictive performance and could easily be used routinely, e.g. the KTFS for patients with functioning graft at 1 year [33]. However, we can make some strong recommendations for the development and validation of future risk prediction models in kidney transplantation, or for validation of existing ones. First, we specifically recommend clearly defining the types of events included in the definition of graft failure. If the definition implies competing risks, we strongly recommend adequately accounting for them in model development and in the estimation of performance indicators, and systematically reporting the method used to account for them. We also recommend clearly stating the time origin (e.g. transplantation or 1 year post-transplantation) and the time horizon (e.g. 5 or 10 years) at which prediction performance is evaluated. If post-transplant predictors are used, repeated measurements of these predictors should be adequately accounted for in model development and validation. We indeed believe that dynamic predictions are of great potential interest for improving the monitoring of patients, but the statistical model used to compute such dynamic predictions needs to be adequately developed and validated. We also recommend validating the predictive tool using resampling methods, and reporting prediction performance for different time horizons. External validation should also be strongly encouraged to assess the generalizability in a larger population of kidney recipients.
Finally, predictive models that are developed for routine clinical practice should be implemented via a tool that is easy to use in this setting.
SUPPLEMENTARY DATA
Supplementary data are available online at http://ndt.oxfordjournals.org.
FUNDING
RK was supported by a PhD research grant from the French Ministry of Higher Education and Research.
CONFLICT OF INTEREST STATEMENT
None declared.