Multicenter Validation of the CamGFR Model for Estimated Glomerular Filtration Rate

Abstract
Important oncological management decisions rely on kidney function assessed by serum creatinine–based estimated glomerular filtration rate (eGFR). However, no large-scale multicenter comparisons of methods to determine eGFR in patients with cancer are available. To compare the performance of formulas for eGFR based on routine clinical parameters and serum creatinine not calibrated with isotope dilution mass spectrometry, we studied 3620 patients with cancer and 166 without cancer who had their glomerular filtration rate (GFR) measured with an exogenous nuclear tracer at one of seven clinical centers. The mean measured GFR was 86 mL/min. Accuracy of all models was center dependent, reflecting intercenter variability of isotope dilution mass spectrometry–creatinine measurements. CamGFR was the most accurate model for eGFR (root-mean-squared error 17.3 mL/min) followed by the Chronic Kidney Disease Epidemiology Collaboration model (root-mean-squared error 18.2 mL/min).


Data description and filtering
Raw data were received from seven different centres, which are identified by their location (Cambridge, Edinburgh, London-Barts, Manchester, Melbourne, Southampton and Wales). Data from each centre were individually processed and cleaned prior to being used for analyses. This step involved exploring the data for manual transcription errors, such as the height and weight variables being swapped, and removing unrealistic values for some variables. During this process all data were converted to the same units to enable data pooling and analyses across the different centres and subgroup categories.
Next, patients with data outside the inclusion criteria were removed. The inclusion criteria were:
• Serum creatinine value between 18 µmol/L (0.2 mg/dL) and 400 µmol/L (4.5 mg/dL)
• Age of 18 years or older
• Creatinine and GFR measurements performed within 30 days of each other
• If a patient had multiple GFR measurements, only the first one was included
The creatinine cut-off values were chosen because the lower value is the typical detection threshold of the measurement assay and the upper value is 3-4 times the upper limit of normal in most centres, thus corresponding to a value at which most clinicians would consider kidney function severely impaired. The 30-day window was selected to minimise errors caused by changing creatinine. We confirmed that a 30-day window before and after the measured GFR (mGFR) was reasonable for creatinine measurements by comparing accuracy in seven different windows relating to the difference in measurement dates. We detected no change in accuracy (Figure S11).
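The inclusion criteria above can be sketched as a simple record filter. This is an illustrative sketch only, not the authors' processing code: the `Record` fields and function names are hypothetical, the bounds are assumed to be inclusive, and "first" is assumed to mean the earliest GFR date per patient, selected before the other criteria are checked. The conversion factor of 88.4 µmol/L per mg/dL is the conventional one for creatinine.

```python
from dataclasses import dataclass
from datetime import date

# Conventional conversion factor for creatinine: 1 mg/dL = 88.4 µmol/L.
UMOL_L_PER_MG_DL = 88.4


@dataclass
class Record:
    """One GFR measurement with its paired creatinine (hypothetical fields)."""
    patient_id: str
    age_years: float
    creatinine_umol_l: float
    creatinine_date: date
    gfr_date: date


def umol_l_to_mg_dl(value_umol_l: float) -> float:
    """Convert a creatinine concentration from µmol/L to mg/dL."""
    return value_umol_l / UMOL_L_PER_MG_DL


def meets_criteria(rec: Record) -> bool:
    """Check the creatinine range, age and 30-day window for one record."""
    gap_days = abs((rec.creatinine_date - rec.gfr_date).days)
    return (
        18.0 <= rec.creatinine_umol_l <= 400.0  # bounds assumed inclusive
        and rec.age_years >= 18
        and gap_days <= 30  # "within 30 days", assumed inclusive
    )


def apply_inclusion_criteria(records):
    """Keep each patient's earliest GFR record, then apply the criteria.

    Assumption: "first" means the earliest GFR date per patient, chosen
    before the remaining criteria are evaluated.
    """
    first_by_patient = {}
    for rec in sorted(records, key=lambda r: r.gfr_date):
        first_by_patient.setdefault(rec.patient_id, rec)
    return [r for r in first_by_patient.values() if meets_criteria(r)]
```

Selecting the first measurement before filtering means a patient whose earliest record fails a criterion is excluded entirely; selecting the first eligible measurement instead would be an equally plausible reading of the text.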
All creatinine data were either measured using a non-IDMS-traceable procedure, or the measurement procedure was not known but was likely to be non-IDMS traceable given the dates of measurement. Table S1 summarises the information on the creatinine measurement for each centre.
Figure S1: Barplot of the time difference between GFR and serum creatinine measurements for each patient. A negative difference indicates that the serum creatinine was measured after the GFR. Some centres only reported time ranges for the creatinine, i.e. that it was measured within 30 days of GFR, and not specific dates. Data from these centres were not included in this barplot.
Figure S1 shows the difference in days between the GFR and creatinine measurements. Most patients had their creatinine measured on the day of, or in the week after, the nuclear medicine GFR, with this being the case for all patients from Manchester. Table 1 in the main manuscript provides the number of patients in different categorical groups and the summary statistics for the continuous variables (GFR, age, serum creatinine, body surface area (BSA), height and weight) for all patients. Here we provide the summary statistics for patients split into different centres (Table S2).
To facilitate graphical comparison of the data, Figure S2 displays the data from Table S2 as boxplots. Patients from Southampton and London-Barts have a higher GFR, height and weight and a lower age than patients from other centres. This is attributable to the fact that the data from these centres were from patients with seminoma only and hence were typically from young men. The panel of log(creatinine) illustrates the inter-centre variability. Figure S3 shows boxplots for log(creatinine) split by centre, for all patients and for patients with selected diagnoses.
As patients with similar diagnoses should be approximately matched for the other variables, the plots facilitate further comparison of serum creatinine differences between centres. Figure S4 takes the same format as Figure S2 but displays subgroups of patients based on diagnosis. Only subgroups with more than 50 patients were included. Figure S5 shows boxplots for age, BSA, mGFR and log(creatinine) for patients from the ethnic subgroups. As white patients were the largest subgroup, t-tests were performed to look for differences between this subgroup and each other subgroup. There were no significant differences in serum creatinine levels between patients of different ethnicities. However, there were significant differences in the other variables, and therefore the subgroups were not matched. Given the correlations between these variables, it is difficult to interpret the creatinine comparisons precisely. To determine whether the average creatinine values differ between white and black patients when the other variables are similar, we drew ten samples, each of 22 white patients matched to the set of black patients with regard to sex and age. We then performed t-tests on each of these ten samples for log(creatinine), measured GFR and BSA. No significant differences were found (Table S4).

Comparison of models
We compared the CamGFR 1 model with the following published models: CKD-EPI 2, MDRD (186 version) 3 where height is measured in metres and weight is measured in kg. Some of these models (Cockcroft-Gault and Jelliffe) were developed to estimate creatinine clearance and not GFR; creatinine clearance is itself a biased estimator of GFR. These models are included as they are still used to estimate GFR by some centres, clinical trial groups, and in other settings.
The equations were compared using three metrics: the root-mean-squared error (RMSE), the median residual, and the residual interquartile range (IQR). These three metrics are estimators of a model's accuracy, bias and precision, respectively. A 95% confidence interval was calculated for each metric using bootstrap resampling. Specifically, 2,000 resamples with replacement (each the same size as the original data) were taken from the data used to calculate the metric. The metric was calculated for each of these 2,000 resamples, and a confidence interval was constructed using the normal approximation 9. The same seed was used for different metrics.
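As an illustration of the confidence-interval procedure described above, the following sketch computes the RMSE and IQR of a residual vector and a bootstrap interval under the normal approximation. It is not the authors' code. The function names are hypothetical, and the interval is assumed to take the plain form "observed estimate ± z × bootstrap standard error"; the exact variant used in the cited reference (for example, one with a bias correction) is not stated in the text. The quartile interpolation rule is also an assumption.

```python
import math
import random
import statistics


def rmse(residuals):
    """Root-mean-squared error of a list of residuals (accuracy)."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def iqr(residuals):
    """Interquartile range of the residuals (precision).

    Assumption: linear interpolation between order statistics.
    """
    q1, _, q3 = statistics.quantiles(residuals, n=4, method="inclusive")
    return q3 - q1


def bootstrap_normal_ci(residuals, metric, n_resamples=2000, seed=0, level=0.95):
    """Bootstrap confidence interval using the normal approximation.

    Each resample has the same size as the data and is drawn with
    replacement. Passing the same seed for every metric mirrors the
    statement that one seed was reused across metrics.
    """
    rng = random.Random(seed)
    n = len(residuals)
    estimate = metric(residuals)
    replicates = [metric(rng.choices(residuals, k=n)) for _ in range(n_resamples)]
    std_error = statistics.stdev(replicates)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error
```

The median residual (bias) can be passed in the same way, for example `bootstrap_normal_ci(residuals, statistics.median)`.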
The main results of this analysis are reported in Figure 1 (results for patients split by centre) in the main manuscript along with Figures [...]. Additionally, we compared the performance of the CamGFR and CKD-EPI models with the aid of a Bland-Altman plot (or MA plot) and residual vs fitted plots (Figure S6). The Bland-Altman plot is used to contrast accuracy, whereas the residual vs fitted plot is used to assess the performance of a model. The most noticeable feature of Figure S6 is that for low eGFR (approximately less than 50 mL/min), CamGFR estimates larger values than CKD-EPI. The inverse is true for high eGFR, culminating in a slight overall bias for eGFR to be higher with CKD-EPI than with CamGFR. The variability between the two models increases as eGFR increases. This can be partly explained by the fact that CamGFR and CKD-EPI model GFR on a square-root and log scale respectively. The residual vs fitted plots are similar for the two models.
We analysed performance when patients were split into groups by their age, BSA or serum creatinine. Figures S8, S9, and S10 show the performance of each model for equally sized subgroups of age, BSA, and creatinine respectively. The accuracy of all models tends to increase with increasing age and to decrease with increasing BSA. These observations can be largely explained by the fact that GFR is modelled on the square-root scale in the CamGFR model and on the log scale in the CKD-EPI and MDRD models. Patients who are young or have a large BSA also tend to have a higher GFR, and patients with a higher GFR will have a less accurate estimated GFR due to modelling on a square-root or log scale. As discussed in the main manuscript, CamGFR is not the most accurate model for all subgroups split by serum creatinine. For the two groups with lower creatinine, CamGFR is the most accurate model. In particular, for the group of patients with the lowest serum creatinine values, all other models are biased to overestimate GFR.
Results for those diagnostic subgroups that had more than 50 patients are shown. First row: the residual (measured GFR - estimated GFR) median, which is a measure of a model's bias, is displayed. Second row: the residual interquartile range (IQR), which is a measure of a model's precision, is displayed. Third row: the root-mean-squared error (RMSE), which is a measure of a model's accuracy, is displayed. Accuracy is a combination metric of bias and precision. Fourth row: the proportion of patients who have an absolute percentage error of more than 20% (1-P20), which reflects clinical robustness by illustrating the proportion of patients with a clinically relevant error, is displayed. The best result would be closest to zero for the residual median, and the smallest value for IQR, RMSE, and 1-P20. All error bars are 95% confidence intervals calculated using bootstrap resampling with 2,000 repetitions and a normal distribution approximation.
P-values for the differences between a statistic for a pair of equations were calculated using a bootstrap procedure described by Efron 10. In this procedure we have two samples $z$ and $y$ of lengths $n$ and $m$ respectively, from possibly different probability distributions $F$ and $G$. In our case $F$ and $G$ were the probability distributions of the residuals for the two equations of interest. We tested the null hypothesis $H_0: F = G$ against the alternative hypothesis $H_1: F \neq G$. Let $x = [z, y]$, and choose a test statistic $t(x)$ of the form $t(x) = f(z) - f(y)$, where the function $f(w)$ was either the RMSE, median or IQR. An approximate p-value was calculated using the following procedure:
• For $r = 1, \ldots, R$, let $x^*_r$ be a sample with replacement of size $n + m$ from the pooled vector $x = [z, y]$. Let $z^*_r$ be the first $n$ observations of $x^*_r$ and $y^*_r$ the remaining $m$ observations.
• Let $t^*_r = f(z^*_r) - f(y^*_r)$.
• Calculate the approximate p-value for the hypothesis test as $p = \frac{1}{R} \sum_{r=1}^{R} \mathbb{1}_{|t^*_r| \geq |t_{\mathrm{obs}}|}$, where $t_{\mathrm{obs}} = t(x)$ is the observed value of the statistic and $\mathbb{1}_A$ is the indicator function, which returns 1 if $A$ is true and 0 otherwise. The absolute values of $t$ are considered so as to perform a two-sided test.
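The resampling test described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, and the p-value is assumed to be the fraction of resamples whose absolute statistic is at least as large as the absolute observed statistic, which follows the usual form of Efron's two-sample bootstrap test.

```python
import math
import random


def rmse(values):
    """Root-mean-squared error, one possible choice for f."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def bootstrap_p_value(z, y, f, n_resamples=10_000, seed=0):
    """Two-sided bootstrap test of H0: F = G with t(x) = f(z) - f(y).

    z and y hold the residuals of the two equations being compared.
    Resamples are drawn with replacement from the pooled vector, then
    split into the first n and the remaining m observations.
    """
    rng = random.Random(seed)
    n = len(z)
    pooled = list(z) + list(y)
    t_obs = f(z) - f(y)
    exceed = 0
    for _ in range(n_resamples):
        x_star = rng.choices(pooled, k=len(pooled))
        t_star = f(x_star[:n]) - f(x_star[n:])
        # Assumed comparison: count resamples at least as extreme as observed.
        if abs(t_star) >= abs(t_obs):
            exceed += 1
    return exceed / n_resamples
```

Because resampling is from the pooled data, the reference distribution is built under the assumption that both residual vectors share one distribution, which is why the test addresses equality of distributions rather than equality of the chosen statistic alone.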
It should be noted that the hypothesis test strictly tests the null hypothesis $H_0: F = G$ and not the null hypothesis $H_0: f(F) = f(G)$. When the approximate p-values are calculated with the statistic 1-P20 (the proportion of patients with an absolute percentage error of more than 20%), $z$ and $y$ are the vectors of percentage differences between the fitted and measured dose values. Approximate p-values (with $R$ = 10,000) are given in