Machine learning models in predicting graft survival in kidney transplantation: meta-analysis

Abstract

Background The variation in outcomes and the frequent occurrence of kidney allograft failure continue to pose important clinical and research challenges despite recent advances in kidney transplantation. The aim of this systematic review was to examine the current application of machine learning models in kidney transplantation and to perform a meta-analysis of these models in the prediction of graft survival.

Methods This review was registered with the PROSPERO database (CRD42021247469) and all peer-reviewed original articles that reported machine learning model-based prediction of graft survival were included. Quality assessment was performed against the criteria defined by Qiao and risk-of-bias assessment was performed using the PROBAST tool. The diagnostic performance of the models was assessed by meta-analysis of the area under the receiver operating characteristic curve and a hierarchical summary receiver operating characteristic plot.

Results A total of 31 studies met the inclusion criteria for the review and 27 studies were included in the meta-analysis. Twenty-nine different machine learning models were used to predict graft survival in the included studies. Nine studies compared the predictive performance of machine learning models with that of traditional regression methods. Five studies had a high risk of bias and three studies had an unclear risk of bias. The area under the hierarchical summary receiver operating characteristic curve was 0.82 and the summary sensitivity and specificity of machine learning-based models were 0.81 (95 per cent c.i. 0.76 to 0.86) and 0.81 (95 per cent c.i. 0.74 to 0.86) respectively for the overall model. The diagnostic odds ratio for the overall model was 18.24 (95 per cent c.i. 11.00 to 30.16) and 29.27 (95 per cent c.i. 13.22 to 44.46) based on the sensitivity analyses.
Conclusion Prediction models using machine learning methods may improve the prediction of outcomes after kidney transplantation by the integration of the vast amounts of non-linear data.


Introduction
Artificial intelligence (AI) consists of computerized algorithms designed to mimic and elaborate human thought patterns or actions. Machine learning (ML), one of the major branches of AI, is the study of algorithms that learn from sample data or past experience without being specifically programmed to perform a particular task 1 . ML techniques are progressively being applied in many disciplines to solve clinical and health-related problems 2 . The global market value of AI/ML has been predicted to grow from 4.3 billion Euros in 2020 to 42.4 billion Euros by 2026 3 . This is due to the ability of ML/AI to swiftly analyse large amounts of complex and non-linear data, act as a potential adjunct to clinical diagnosis, and predict outcomes more accurately than traditional statistical methods 4,5 .
The four commonly used methods in ML are supervised, unsupervised, semi-supervised, and reinforcement learning 6,7 . The most common supervised learning methods, such as neural networks and classification-based models, recognize patterns in the training data set and help make predictions by identifying similar patterns in future data sets. In contrast, unsupervised models aim to identify hidden patterns in a data set and are not trained on previously labelled data. Semi-supervised learning is a bridge between supervised and unsupervised learning: the model is trained using a small fraction of labelled data and a significantly larger set of unlabelled data. Reinforcement learning is a technique in which the algorithm automatically learns from feedback in a data set in a trial-and-error manner, thereby closely mimicking human learning.
Common ML models include neural network-based models, such as artificial neural networks (ANN) and convolutional neural networks (CNN), as well as decision trees (DT), random forests (RF), and support vector machines (SVM). Multiple hybrid models combining aspects of these basic models have been developed and used in healthcare and are discussed in this review. Neural network-based models such as ANN are inspired by biological neurons and contain 'nodes' that communicate with other nodes via connections based on their ability to perform a specific task. A CNN is a subtype of ANN that is predominantly used in image recognition algorithms as it preserves the spatial relationship between pixels in an image: a CNN relays parts of the data to specific nodes, preserving the spatial orientation of the extracted features 8,9 . A DT is a non-parametric supervised learning technique used for classification tasks. It is similar to a flow chart, starting from a root node and splitting into multiple branches and nodes. Each node represents a test on a particular attribute, each branch represents the outcome of that test, and each terminal node holds a class label. An RF is an extension of this in which an ensemble method produces multiple DTs and aggregates their predictions 10 .
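As an illustration of the DT and RF concepts described above, the toy sketch below builds single-split 'stump' trees (one root-node test, two leaf branches) and combines them into a forest by bootstrap sampling and majority voting. It is a deliberately minimal, assumption-laden illustration in plain Python, not an implementation used by any of the included studies.

```python
import random
from collections import Counter

def train_stump(X, y, feature):
    """One-node 'decision tree': a root node that tests a single
    attribute against a threshold, with the two branches as leaves."""
    threshold = sum(x[feature] for x in X) / len(X)  # split at the mean
    majority = lambda labels: Counter(labels).most_common(1)[0][0]
    left = [label for x, label in zip(X, y) if x[feature] <= threshold]
    right = [label for x, label in zip(X, y) if x[feature] > threshold]
    return {"feature": feature, "threshold": threshold,
            "left": majority(left) if left else majority(y),
            "right": majority(right) if right else majority(y)}

def predict_stump(stump, x):
    branch = "left" if x[stump["feature"]] <= stump["threshold"] else "right"
    return stump[branch]

def train_forest(X, y, n_trees=25, seed=0):
    """'Random forest' in miniature: an ensemble of stumps, each trained
    on a bootstrap sample using a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap resample
        feature = rng.randrange(len(X[0]))          # random feature choice
        forest.append(train_stump([X[i] for i in idx],
                                  [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, x):
    """Aggregate the individual trees' predictions by majority vote."""
    votes = Counter(predict_stump(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy, clearly separable data: two features, binary class label
X = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.3), (3.0, 3.1), (3.2, 2.9), (2.9, 3.3)]
y = [0, 0, 0, 1, 1, 1]
forest = train_forest(X, y)
print(predict_forest(forest, (3.1, 3.0)))  # classified by majority vote
```

A real RF additionally grows each tree to multiple levels, chooses split points to maximize an impurity reduction (e.g. Gini), and samples a feature subset at every node rather than once per tree.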
Despite the advances in kidney transplantation (KT), the accurate prediction of graft survival (GS) after transplantation using standard statistical modelling continues to be a challenge [11][12][13][14] . Existing risk prediction models such as donor-recipient pairing or the kidney donor risk index (KDRI) have a limited ability to predict outcomes for kidney transplant recipients with receiver operating characteristic (ROC) scores of 0.6-0.7 [14][15][16] . Although ML has been used to predict GS and various other outcomes after solid organ transplantation, there is significant inconsistency regarding the accuracy and effectiveness of these prediction models [17][18][19][20] .
The purpose of this paper is to systematically review and determine the current status of AI/ML models in the prediction of GS after KT.

Search strategy
The authors aimed to identify all published studies in which ML models were used to predict GS after KT. They searched the MEDLINE, Elton Bryson Stephens Company Information (EBSCO) and Embase databases from their earliest date until 14 November 2022 by using pre-specified key words (Supplementary material).
Article screening and extraction were performed using the Covidence online screening and data tool 21 . The reference lists of the retrieved articles and similar review articles in the field were also searched to identify additional papers. All studies written in English and focusing on the clinical prediction of GS after renal transplantation by ML-based models were included. Case reports, non-English papers, editorials/commentaries, conference abstracts, pre-print articles, reviews, letters, and papers with limited data on methodology were excluded. The study was registered in the PROSPERO database (CRD42021247469) and was performed according to PRISMA guidelines 22 .

Data extraction
The key details regarding the methods and results were recorded on a data extraction sheet. Data extraction was conducted by two independent reviewers (B.R. and N.S.). Discrepancies were resolved by discussion amongst the authors and a tie-breaking vote from the authors not involved in the screening process (K.K., U.H., and P.C.).
Data elements extracted included study name and year of publication, country, method of feature selection, ML method used, validation methods, study population, the type of input variables (pre-transplant, intraoperative, and/or post-transplant), size of the training and validation data sets, results, and follow-up interval.

Quality and risk-of-bias assessment
The methodological quality of the studies included in the review was assessed using the AI/ML-specific quality assessment tool introduced by Qiao 23 . This instrument proposes the following categories: unmet need or limits in the current non-ML approach, robustness, reproducibility, generalizability, and clinical significance. The risk-of-bias assessment of the studies was carried out using the PROBAST tool 24 . The risk-of-bias assessment and quality assessment figures were produced with the help of the interactive online web application 'robvis' 25 .

Meta-analysis
The suitability of pooled analyses was considered via interpretation of heterogeneity based on the I 2 statistic and the P value for the χ2 test. Given the significant heterogeneity in the included studies, ML models, methodology, and the index test used to evaluate ML model performance, the single point estimate for the overall model was calculated by meta-analysis of the area under the ROC (AUROC) curve, calculation of summary estimates of sensitivity and specificity, and subsequent construction of a hierarchical summary ROC (HSROC) curve. The performance of the ML-based models and regression-based models in the prediction of short-term (less than 1 year) GS and long-term (greater than 3 years) GS was analysed. The calculations were based on the random effects bivariate binomial model of Chu and Cole 26 and the HSROC curve parameters were calculated based on the equations drawn from Harbord et al. 27 . The HSROC curve was constructed using the online MetaDTA tool 28,29 and the meta-analysis of the AUROC curve was performed based on the method outlined by Zhou et al. 30 . Sensitivity analyses were performed that included studies with no significant methodological concerns and a low risk of bias, and studies that validated their model in a separate or external data set.
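As a simplified illustration of how per-study 2x2 tables yield sensitivity, specificity, and the diagnostic odds ratio, the sketch below pools hypothetical counts by averaging on the logit scale. The actual analysis in this review used the bivariate random-effects model of Chu and Cole via the MetaDTA tool, which additionally models between-study variance and the correlation between sensitivity and specificity; the counts below are invented for illustration only.

```python
import math

def study_metrics(tp, fp, fn, tn):
    """Per-study sensitivity, specificity, and diagnostic odds ratio
    from a 2x2 table (graft failure treated as the 'positive' class)."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    dor = (tp * tn) / (fp * fn)  # equals (sens/(1-sens)) / ((1-spec)/spec)
    return sens, spec, dor

def naive_summary(tables):
    """Crude summary estimates: average sensitivity and specificity on
    the logit scale, then transform back. A bivariate random-effects
    model (as used in the review) would weight studies and model
    between-study heterogeneity instead of this simple mean."""
    logit = lambda p: math.log(p / (1 - p))
    inv_logit = lambda z: 1 / (1 + math.exp(-z))
    sens_logits, spec_logits = [], []
    for table in tables:
        sens, spec, _ = study_metrics(*table)
        sens_logits.append(logit(sens))
        spec_logits.append(logit(spec))
    return (inv_logit(sum(sens_logits) / len(tables)),
            inv_logit(sum(spec_logits) / len(tables)))

# Hypothetical 2x2 tables (TP, FP, FN, TN) -- not data from the review
tables = [(80, 20, 20, 80), (45, 10, 15, 90), (60, 25, 10, 105)]
summary_sens, summary_spec = naive_summary(tables)
print(round(summary_sens, 2), round(summary_spec, 2))  # 0.81 0.84
```

Pooling on the logit scale keeps the estimates inside (0, 1) and makes the normality assumption of the bivariate model more plausible than averaging the raw proportions.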

Results
Out of the 1667 studies identified, 31 studies met the inclusion criteria for the systematic review. The exclusion of the remaining studies is outlined in Fig. 1 in accordance with the PRISMA reporting guidelines. The quality assessment of these studies against the criteria set by Qiao 23 revealed that 12 studies did not perform feature selection engineering (FSE), only five studies validated the model in an external data set, and eight studies reported instability of results (Fig. 2a). Eighteen studies, however, used a separate subset of data to validate their models. The risk-of-bias assessment using the PROBAST tool indicated that five studies had a high risk of bias and three studies had an unclear risk of bias (Fig. 2b).
Most studies included a large number of preoperative donor and recipient input variables. Nineteen studies used FSE to identify the most relevant clinical variables prior to modelling. The studies in the review used 29 different ML methods including ANN, recurrent neural networks (RNN), DT, SVM, Bayesian belief networks (BBN), gradient boosting (GB), adaptive boosting, and various hybrid models to develop their predictive models. Fourteen studies used more than one method to develop the models in their paper. Nine studies compared the performance of regression-based models versus ML-based models. A summary of all the included studies is provided in Table 1.
The data from primarily deceased donor transplantation were used in seven studies and the data from exclusively living donor transplantation were used in two studies. All the other papers included data from both living and deceased donor transplantation. Fifteen studies used data from national or international registries such as the United States Renal Data System (USRDS), the United Network for Organ Sharing (UNOS), the Australia and New Zealand Dialysis and Transplant Registry (ANZDATA), or Eurotransplant data.
There was substantial heterogeneity with respect to the study methodology, sample size, input variables, ML model performance, and the index test used to assess model performance. The main outcome measures used to assess the performance of ML models were the AUROC curve, sensitivity, specificity, accuracy, and the concordance index (C-index). Twelve studies evaluated their model performance mainly using sensitivity, specificity, and accuracy. The data from these studies were analysed to perform an HSROC curve analysis and the calculation of the summary estimates of sensitivity and specificity. Eighteen studies evaluated their model performance using ROC curve analysis. Seventeen studies used ML-based models to predict GS beyond 3 years, 11 studies used ML models to predict GS at 1 year or less, and three studies did not mention the time interval.
The area under the HSROC curve was 0.82 for all the studies and 0.85 based on the sensitivity analyses, which solely included studies with good methodological quality, a low risk of bias, and validation in a separate data set (Figs 3 and 4). The meta-analysis of the AUROC curve revealed that ML-based models performed marginally better than regression-based models.

Discussion
This systematic review aims to summarize the current evidence surrounding the predictive ability of ML models in graft outcomes after KT. A total of 31 studies were included in the review and meta-analysis, of which approximately one-sixth had a high risk of bias and 14 had some methodological concerns. The predictive ability of 29 different supervised ML models was evaluated in this review and significant heterogeneity was noted in the included studies with respect to the methodology and models used. Despite these limitations, ML-based models had a significantly higher area under the HSROC curve, a higher diagnostic odds ratio, and an AUROC of 0.82, which has thus far not been achieved by many traditional statistical models 37 .
Several attempts have been made to use ML-based algorithms to predict long-term GS. These attempts have included the use of DT, ANN, SVM, and BBN [56][57][58][59] . However, the best ML method with which to develop a suitable model to predict outcomes after KT continues to be a controversial and widely discussed topic [17][18][19]43,60,61 . Innovative approaches to data mining could potentially improve the accuracy of prediction of these organ transplantation outcomes by considering the non-linear associations between the various factors 54,55,62 . These predictive models are limited by various factors, including reliance on pre-transplant clinical parameters such as age, BMI, cold or warm ischaemia time, and type of dialysis; variable assessment of immunological factors; the small sample sizes used to build some of the models; the complex and non-linear interrelationships between variables; and failure to accurately use censored patient data.
In this review the authors noted significant heterogeneity in many domains of the ML models used, including the method of FSE, the limits of the current non-ML approaches, the platforms used to generate these models, and the index test used to assess model performance. There was significant divergence with respect to the clinical data used, including the number of patients in each study and the type of variables used to construct the ML models. The majority of studies used preoperative donor and recipient variables and very few studies incorporated intraoperative and postoperative predictors to construct the ML model. Many studies aimed to predict GS at 1, 3, or 5 years after transplantation and also reported a significant difference in the ability of these models to predict GS at various time points, as indicated by the prediction intervals 63 . Only one study explored the use of ML-based models in the prediction of 'time to event' 43 .
Although the best ML model to predict graft outcomes continues to be controversial, the results of this review suggest that hybrid ML models, variations of SVM, and RF-based models performed the best in the prediction of GS. There was a significant difference in the performance of ANN and tree-based models in the prediction of GS. Despite the reported heterogeneity in performance among the ML models, their overall and individual model performance is reportedly better than the prediction ability of the currently available gold-standard prediction tools such as the KDRI. The reported C-index of KDRI is 0.63 14,64 and most ML models in this review report a better performance in the prediction of GS.
There is limited evidence available regarding the minimum sample size required to develop a sound ML-based predictive model. Although it has been noted in this review that ML models developed using both single-centre small-scale data and large-volume database data have reported similar predictive abilities, larger sample sizes resulted in better model performance. It is also noteworthy that model performance is highly dependent on the volume of clinical data and their linear or non-linear relationship, the complexity of the model, and the ML method used. It has also been noted that a higher number of events per variable was associated with better model stability and higher predictive accuracy 65 .
The predictive accuracy of the ML model also depends on the integration of clinically significant variables into the model 66 . Hence, identification of the relevant clinical variables that need to be incorporated into the model is a key step in model development. FSE is a common and well recognized method to identify these variables and is critical to avoid overfitting. Overfitting occurs when too many clinical variables are incorporated into the model, which causes the model to adapt to irrelevant details and eventually leads to poor predictive performance 67 . Twenty-six studies used cross-validation or its variations to circumvent this problem. Cross-validation is a key tool for assessing the effectiveness of a model, particularly when overfitting must be mitigated. It also helps in tuning the hyperparameters of the ML model to achieve the lowest test error. To generalize that AI/ML has applicability in the field of KT and is reliable at predicting GS, external validation of these ML models is crucial, because ML models perform very well in the cohorts on which they have been trained 68 . However, many published prediction models are either not externally validated at all or poorly externally validated 69 .
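The k-fold cross-validation scheme described above can be sketched as follows. This is a minimal stdlib-only illustration on invented data, in which a trivial nearest-centroid classifier stands in for the ML models used by the included studies; the point is the fold structure, not the model.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Shuffle the n sample indices and partition them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_centroid_fit(X, y):
    """Trivial stand-in model: store the per-class feature means."""
    classes = sorted(set(y))
    return {c: [mean(x[j] for x, label in zip(X, y) if label == c)
                for j in range(len(X[0]))] for c in classes}

def nearest_centroid_predict(model, x):
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda c: dist(model[c], x))

def cross_validate(X, y, k=3):
    """Train on k-1 folds and test on the held-out fold, then average
    the accuracy: an estimate of out-of-sample performance that guards
    against judging a model on the same data it was fitted to."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        train = [i for i in range(len(X)) if i not in fold]
        model = nearest_centroid_fit([X[i] for i in train],
                                     [y[i] for i in train])
        hits = [nearest_centroid_predict(model, X[i]) == y[i] for i in fold]
        scores.append(sum(hits) / len(hits))
    return mean(scores)

# Toy, clearly separable data
X = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.9, 1.0), (1.1, 0.8), (1.0, 1.1)]
y = [0, 0, 0, 1, 1, 1]
print(cross_validate(X, y))  # mean held-out accuracy across the 3 folds
```

Running the same loop over a grid of candidate hyperparameter values and keeping the setting with the best mean held-out score is the standard way cross-validation is used for hyperparameter tuning; external validation then repeats the evaluation on a data set the model has never seen at all.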
Cox models and logistic regression are less suited to handling complex or non-linear relationships between predictors/covariates and outcomes as they assume that variables are independent of each other 53,70,71 . Thus, the accurate prediction of complex outcomes such as allograft survival using these statistical techniques continues to be a challenge 72 . The currently available NHS Blood and Transplant (NHSBT) risk prediction tool is based on traditional statistical models with limited preoperative donor and recipient variables. The ideal prediction model should include preoperative donor and recipient variables, intraoperative variables, and postoperative variables to accurately predict GS. ML models trained on data sets with a large amount of clinically irrelevant data can develop unintended biases 73 . Representative biases can occur in clinical or genetic databases and are a potential pitfall irrespective of the type of ML method used. ML models trained on retrospective/historical data sets may not necessarily reflect current practice; model testing on large data sets with clinically relevant variables is therefore vital for good predictive performance 74,75-79 .

Limitations of this review include the methodological shortcomings arising out of the substantial heterogeneity within the included studies and the difficulty of arriving at a single summary estimate of overall model performance. The evidence presented in this review also suggests that most model predictions are based on very large amounts of retrospective data with limited external validation. In an ideal ML setting, prospective data entry based on clinical experience would be combined with graft outcome data and analysed over an interval of time in conjunction with actual outcomes.
The most recent study comparing the predictive abilities of ML models versus conventional statistical models in a large database suggested that AI/ML models are not significantly superior to conventional regression-based models 37 . Although the authors agree that no substitute can replace human intelligence or clinical experience, the results of this review have demonstrated that the prediction of difficult outcomes such as GS, which depends on numerous preoperative, operative, and postoperative variables, is difficult with human intelligence alone. An informed and well guided decision is best taken by combining clinical experience with a well designed prediction model. Such prediction models can only be developed with ML tools, given the vast amounts of data required to build a reliable model. This review summarizes the currently available evidence, identifies the ML models best suited to these outcomes, and highlights the key challenges that need to be addressed to accurately guide future research. Whilst the use of AI/ML in KT is still in its infancy, such models have a significant future role, not only in the prediction of GS but, the authors believe, also in organ matching, diagnostics, and management pathways.

Funding
The authors have no funding to declare.