## Abstract

**Motivation:** DNA microarrays allow the simultaneous measurement of thousands of gene expression levels in any given patient sample. Gene expression data have been shown to correlate with survival in several cancers; however, analysis of the data is difficult, since typically at most a few hundred patients are available, resulting in severely underdetermined regression or classification models.

Several approaches exist to classify patients into different risk classes; however, relatively little has been done with respect to the prediction of actual survival times. We introduce CASPAR, a novel method to predict true survival times for the individual patient based on microarray measurements. CASPAR is based on a multivariate Cox regression model that is embedded in a Bayesian framework. A hierarchical prior distribution on the regression parameters is specifically designed to deal with the high-dimensionality (large number of genes), low sample size settings that are typical for microarray measurements. This enables CASPAR to automatically select small, most informative subsets of genes for prediction.

**Results:** Validity of the method is demonstrated on two publicly available datasets on diffuse large B-cell lymphoma (DLBCL) and on adenocarcinoma of the lung. The method successfully identifies long and short survivors, with high sensitivity and specificity. We compare our method with two alternative methods from the literature, demonstrating superior results of our approach. In addition, we show that CASPAR can further refine predictions made using clinical scoring systems such as the International Prognostic Index (IPI) for DLBCL and clinical staging for lung cancer, thus providing an additional tool for the clinician. An analysis of the genes identified confirms previously published results, and furthermore, new candidate genes correlated with survival are identified.

**Availability:** The software is available upon request from the authors.

**Contact:** kaderali@zpr.uni-koeln.de

## INTRODUCTION

Expression profiles of tumors, measured by DNA microarrays, provide new insights into cancer etiology. One important goal is to develop new tools to diagnose cancer (Golub *et al*., 1999) and predict disease and treatment outcome more accurately based on gene expression profiles (van De Vijver *et al*., 2002; Rosenwald *et al*., 2002), aiding the clinician with the decision on whether and how to commence treatment of a given patient. Classification of tumor samples into certain risk classes has been demonstrated to be a first step (Rosenwald *et al*., 2002); however, predicting actual survival times and distributions thereof for the individual patient is a widely desired goal.

Furthermore, the identification of genes that correlate with survival may provide new information on pathogenesis and etiology, and may further aid in the search for new targets for drug design.

A serious problem in microarray data analysis comes from the fact that several thousand gene expression levels are measured simultaneously, while typically only several dozen to at most a few hundred patient samples are available (Reid *et al*., 2005). Mathematically, this yields models with thousands of unknowns, but only a very limited number of data points for parameter estimation. This main statistical problem of microarray experiments may explain in part the often diverging results in different studies on the same entity [e.g. (Rosenwald *et al*., 2002; Shipp *et al*., 2002) or (Bhattacharjee *et al*., 2001; Beer *et al*., 2002)]. This is why data reduction methods such as principal component analysis (PCA) or clustering are normally used to reduce data dimensionality considerably before further analysis is carried out.

A problem of such two-step analysis using unsupervised learning methods is that features identified by, for example, clustering or PCA may show large variability with clear subgroups that are, however, unrelated to survival. The approach presented in this paper combines dimension reduction and regression in one single step, thus selecting the most discriminatory genes for the regression model automatically and optimizing regression parameters simultaneously. Several important questions can then be answered: First, which genes are related to survival, and how relevant are the individual genes? Second, continuous survival times or distributions over survival times conditioned on the gene expression values can be predicted, instead of merely classifying patients into two or three different risk classes. Finally, instead of grouping patients based on their gene expression levels and then relating those groups to survival, the method presented in this paper uses survival time as the main predictive variable, and directly looks for genes related to survival.

We used two examples of studies with more than 100 cases to test whether our algorithm is able to predict survival time and to yield meaningful genes responsible for the differences in survival. We present results on a dataset derived from patients with adenocarcinoma of the lung (LC). The main example discussed in this paper concerns diffuse large B-cell lymphoma (DLBCL), a cancer of B lymphocytes.

## MATERIALS AND METHODS

Two publicly available datasets were analyzed. The dataset published by Bhattacharjee *et al*. (2001) on LC contains 125 samples with 12 600 expression values, measured on the U95A platform (Affymetrix). The second dataset, published by Rosenwald *et al*. (2002) on DLBCL, consists of 240 samples, with 7399 expression values measured on a cDNA array (lymphochip).

Prior to analysis by our method, data were subjected to preprocessing steps. On the LC dataset, first, microarray features with total variance <2500 were removed. The same step was carried out by Bhattacharjee *et al*. (2001) in their analysis. Data from the LC study are of much higher dimensionality than the DLBCL dataset, while at the same time only half as many samples are available. Therefore, we expected the analysis of the LC data to be much more difficult than the analysis of the DLBCL data. To address this issue, we reduced dimensionality by clustering features together using standard hierarchical clustering, with Pearson's correlation coefficient as similarity measure. Clusters were represented by their mean, and clustering was interrupted when similarity dropped below 0.5 for the first time, at which point the microarray features are grouped into 926 distinct clusters. Data were then normalized by subtracting the mean and scaling the variance to one. Owing to the low sample number, leave-one-out crossvalidation was used on this dataset during analysis by our method.
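The preprocessing steps above can be sketched as follows. The linkage method is not specified in the text, so average linkage is assumed here, and all function and parameter names are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def preprocess_lc(X, var_threshold=2500.0, corr_cutoff=0.5):
    """Sketch of the LC preprocessing: variance filter, correlation-based
    hierarchical clustering, cluster means, and normalization.
    X: (n_samples, n_features) expression matrix."""
    # 1. Remove features with total variance below the threshold.
    X = X[:, X.var(axis=0) > var_threshold]
    # 2. Hierarchical clustering with 1 - Pearson correlation as distance;
    #    cutting at distance 1 - corr_cutoff stops merging once the
    #    similarity between joined clusters drops below the cutoff.
    dist = 1.0 - np.corrcoef(X.T)
    iu = np.triu_indices_from(dist, k=1)        # condensed form for linkage
    Z = linkage(dist[iu], method='average')     # linkage method assumed
    labels = fcluster(Z, t=1.0 - corr_cutoff, criterion='distance')
    # 3. Represent each cluster by the mean of its member features.
    reduced = np.column_stack([X[:, labels == c].mean(axis=1)
                               for c in np.unique(labels)])
    # 4. Normalize each reduced feature to zero mean, unit variance.
    return (reduced - reduced.mean(axis=0)) / reduced.std(axis=0)
```

On real data, the variance threshold of 2500 and the correlation cutoff of 0.5 reproduce the filtering described above; the number of resulting clusters depends on the data.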

The DLBCL data were split in training and validation subsets. To be able to compare results with previous work, the same split as in the original publication was used. A total of 160 patients were used for training, the remaining 80 for validation. Genes with >40% missing values or variance <0.45, determined on the training dataset only, were filtered out, leaving ∼3000 features in the computation. The data were then normalized as described for LC, without any further dimension reduction.

### Survival time regression

Regression models over survival times are complicated because of the presence of censoring in the data. In a typical clinical study, performed over a predetermined time, patients enter the study as they are diagnosed, and leave the study either when the end point occurs (e.g. recurrence of disease, patient's death), when the study ends before the end point occurs or for reasons such as relocation or commencing a different treatment. In the latter two cases, survival times are right-censored, and this needs to be dealt with explicitly when carrying out regression analysis.

Let *T* be a random variable, representing the (potentially unobserved) survival time of a patient of interest. A popular regression model over survival times is the Cox regression model (Cox, 1972), postulating that the hazard, i.e. the immediate risk of failure at time *t*, given survival until *t*, is given by

$$\lambda(t \mid \mathbf{x}, \theta) = \lambda_0(t)\,\exp(\theta^{T}\mathbf{x}), \tag{1}$$

where **x** is a vector of covariates (e.g. the measured gene expression levels), θ is the vector of regression parameters and λ_{0}(*t*) is an unspecified baseline hazard function, describing the hazard for some standard conditions **x** = 0. Note that, given **x** and θ, the hazard λ(*t*) completely specifies the survivor function *F*(*t*) = *P*{*T* ≥ *t*} and the probability density *f*(*t*) = −d*F*(*t*)/d*t*.
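For a baseline hazard that is constant in time (the assumption adopted later in the paper), the survivor function, density and expected survival time follow in closed form. A minimal sketch in Python (function and parameter names are illustrative):

```python
import numpy as np

def cox_quantities(t, x, theta, lam0=0.1):
    """Quantities of the Cox model lambda(t | x) = lam0 * exp(theta . x),
    assuming a time-constant baseline hazard lam0 (illustrative value)."""
    hazard = lam0 * np.exp(theta @ x)   # lambda(t | x, theta), constant in t
    survivor = np.exp(-hazard * t)      # F(t) = P{T >= t}
    density = hazard * survivor         # f(t) = -dF(t)/dt
    expected = 1.0 / hazard             # E[T | x, theta] for constant hazard
    return hazard, survivor, density, expected
```

With a constant hazard, T is exponentially distributed given **x** and θ, which is what makes the expected survival time simply the reciprocal of the hazard.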

The regression objective then is to estimate the regression parameters θ from given data $D$ = {(**x**^{(1)}, *t*^{(1)}, δ^{(1)}), …, (**x**^{(L)}, *t*^{(L)}, δ^{(L)})}, where **x**^{(i)} and *t*^{(i)} are the vector of gene expression values and the censored survival time for patient *i*, respectively, and δ^{(i)} is an indicator variable that is 1 when censoring acts on the survival time for patient *i*, so that it is only known that the true survival time is greater than *t*^{(i)}, and 0 if *t*^{(i)} is an actual survival time.

Under the condition of random censoring over time and under mild independence requirements (independence of the distinct censoring times and independence of the censoring process from the survival times and from θ), the maximum likelihood solution to the regression problem is the parameter vector θ maximizing

$$L(\theta \mid D) = \prod_{i=1}^{L} f\bigl(t^{(i)} \mid \mathbf{x}^{(i)}, \theta\bigr)^{1-\delta^{(i)}}\, F\bigl(t^{(i)} \mid \mathbf{x}^{(i)}, \theta\bigr)^{\delta^{(i)}}. \tag{2}$$
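In log form, the censored likelihood is straightforward to evaluate: uncensored patients contribute the log density, censored patients the log survivor function. A sketch assuming a constant baseline hazard (names are illustrative):

```python
import numpy as np

def neg_log_likelihood(theta, X, t, delta, lam0=0.1):
    """Negative log likelihood of the censored Cox model with constant
    baseline hazard.  delta = 1 marks a censored observation (the paper's
    convention), contributing the survivor function F(t_i); delta = 0
    marks an observed event, contributing the density f(t_i)."""
    hazard = lam0 * np.exp(X @ theta)          # per-patient hazard
    log_surv = -hazard * t                     # log F(t_i)
    log_dens = np.log(hazard) + log_surv       # log f(t_i)
    return -np.sum(np.where(delta == 1, log_surv, log_dens))
```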

### Automatic relevance determination

The Bayesian approach to classification and regression problems assumes that we are given a hypothesis space of functions *h* mapping from the covariates **x** to survival time *t* and containing the true hypothesis *h*^{true} underlying the data. For the Cox regression model, this hypothesis space is parameterized by θ, and one could use, for predictions, the expected value

$$\hat{h}_{\theta}(\mathbf{x}) = \mathrm{E}[T \mid \mathbf{x}, \theta] = \int_{0}^{\infty} t\, f(t \mid \mathbf{x}, \theta)\, \mathrm{d}t, \tag{3}$$

where *f*(*t* ∣ **x**, θ) is the probability density function of the Cox regression model, and is easily derived from Equation (1). The probability of any given hypothesis *h*_{θ}, given the data $D$, can then be calculated according to Bayes' theorem as

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}. \tag{4}$$

As prior distribution *p*(θ) on the regression parameters θ, we chose independent, mean-zero normal distributions with variances $\sigma_i^2$ on the components θ_{i}, implying the assumption that the parameters θ_{i} should not become too large. Choosing the mean zero corresponds to the assumption that most genes will not be related to survival, and thus their weights θ_{i} should be close to zero. Only if the data warrant it, will the weights be driven away from zero. The parameters σ_{i} control the width of the distributions over the θ_{i}. Since we expect very few of the microarray features to be linked to survival, for most features, the distribution *p*(θ_{i}) should be strongly peaked around zero.

The proper Bayesian approach to deal with unknown parameters such as σ_{i} is to specify a prior distribution over them, and then integrate out the parameters. That way, one can avoid having to specify one fixed value for the hyperparameters σ_{i}, which will normally be difficult to estimate, but instead just has to provide a general distribution, which is often much easier to do.

The expectation of a ‘sparse hypothesis’ *h*_{θ}, with only a small subset of the genes significantly affecting survival, can be modelled using independent gamma distributions on the σ_{i},

$$p(\sigma_i) = \frac{\sigma_i^{\,r-1}}{\Gamma(r)\, a^{r}}\, \exp\!\left(-\frac{\sigma_i}{a}\right), \tag{5}$$

where *a* and *r* are positive parameters controlling the shape of the distribution. Note that, to avoid the case σ_{i} = 0, only values of *r* > 1 should be used.

Figure 1 shows the resulting prior

$$p(\theta_i) = \int_{0}^{\infty} p(\theta_i \mid \sigma_i)\, p(\sigma_i)\, \mathrm{d}\sigma_i, \tag{6}$$

obtained by integrating out the hyperparameters σ_{i}.

**Fig. 1**

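The marginal prior in Equation (6) has no convenient closed form, but it can be examined numerically. The sketch below integrates the normal–gamma hierarchy with `scipy`, interpreting *r* as the shape and *a* as the scale of the gamma distribution (an assumption, since the text does not fix the parameterization); *a* = 5 and *r* = 1.001 are the values used in the paper's training runs:

```python
import numpy as np
from scipy import integrate, stats

def marginal_prior(theta_i, a=5.0, r=1.001):
    """Numerically evaluate p(theta_i) = integral over sigma of
    N(theta_i; 0, sigma^2) * Gamma(sigma; shape=r, scale=a),
    i.e. the hierarchical prior of Equation (6)."""
    def integrand(s):
        return stats.norm.pdf(theta_i, scale=s) * stats.gamma.pdf(s, r, scale=a)
    val, _ = integrate.quad(integrand, 0.0, np.inf)
    return val
```

Evaluating this density on a grid of θ_{i} values reproduces the strongly zero-peaked, heavy-tailed shape that drives most weights to zero while leaving a few large.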

Models with similar hierarchical prior distributions have been introduced by Neal (1999) and Mackay (1992) in the context of neural networks, and have become popular as ‘Automatic Relevance Determination’ models.

After inserting Equations (2) and (6) into (4), taking the negative logarithm and dropping terms independent of θ, maximization of *p*(θ∣$D$) becomes equivalent to minimizing

$$E(\theta) = -\sum_{i} \log p(\theta_i) \;-\; \log L(\theta \mid D). \tag{7}$$

Here, the baseline hazard λ_{0}(*t*) ≕ λ_{0} is assumed to be constant with respect to time *t*. The first term in Equation (7) comes from the prior distribution, integrating over the normal distribution over the weights θ and the gamma distribution over the standard deviations σ; the second term comes from Equation (2) by inserting the survivor and density functions of the Cox regression model. Equation (7) can be minimized with respect to θ using conjugate gradient descent; see Supplementary Material for details on the algorithm and running time. The obtained parameter estimate $\hat{\theta}$ can then be used for predictions on new patients with Equation (3), and the magnitudes of the weights $\hat{\theta}$ provide information on the relevance of individual genes. We have termed this approach CASPAR, for CAncer Survival Prediction using Automatic Relevance determination.

### Time-dependent ROC analysis

Owing to censoring, prediction errors cannot be calculated directly. Therefore, for evaluation of our method, we will use time-dependent receiver operator characteristic (ROC) analysis, as suggested by Heagerty *et al*. (2000). Let ψ_{i}(*t*) be a {0, 1}-variable, indicating whether patient *i* has had an event prior to or at time *t* (ψ_{i}(*t*) = 1) or not (ψ_{i}(*t*) = 0). Furthermore, let *h*_{θ}(**x**) be the predicted survival time of a patient with gene expression values **x**. One then defines the time dependent sensitivity and specificity as

$$\mathrm{sensitivity}(c, t) = P\{h_{\theta}(\mathbf{x}) \le c \mid \psi(t) = 1\}, \qquad \mathrm{specificity}(c, t) = P\{h_{\theta}(\mathbf{x}) > c \mid \psi(t) = 0\},$$

where *c* is a parameter that can be used to control the trade-off between sensitivity and specificity. Using these definitions, time-dependent ROC curves can be drawn by varying *c* and plotting sensitivity against specificity. ROC curves can be summarized by computing the area under the ROC curve (AUC), and the AUC can then be plotted with respect to time.
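These definitions translate directly into code. The sketch below computes an empirical time-dependent ROC curve and its AUC by sweeping the cutoff *c*; for simplicity it ignores censoring, which the estimator of Heagerty *et al*. (2000) handles properly:

```python
import numpy as np

def time_dependent_roc(pred_survival, event_time, t):
    """Empirical time-dependent ROC at horizon t (censoring ignored).
    psi_i(t) = 1 iff patient i had an event by time t; a patient is
    called 'positive' when the predicted survival time is <= cutoff c."""
    psi = (event_time <= t).astype(int)
    sens, fpr = [], []
    for c in np.concatenate(([-np.inf], np.sort(np.unique(pred_survival)))):
        pos = pred_survival <= c
        tp, fn = np.sum(pos & (psi == 1)), np.sum(~pos & (psi == 1))
        fp, tn = np.sum(pos & (psi == 0)), np.sum(~pos & (psi == 0))
        sens.append(tp / max(tp + fn, 1))
        fpr.append(fp / max(fp + tn, 1))
    sens, fpr = np.array(sens), np.array(fpr)
    # area under the curve by the trapezoid rule over (fpr, sens)
    auc = np.sum(np.diff(fpr) * (sens[1:] + sens[:-1]) / 2.0)
    return sens, fpr, auc
```

A perfect predictor yields an AUC of 1, a random one about 0.5, matching the interpretation of the AUC values reported in Results.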

## RESULTS

### CASPAR correctly predicts overall survival in lung cancer

We applied CASPAR to the publicly available dataset from Bhattacharjee *et al*. (2001) on LC. Details on the data and on data preprocessing are given in Materials and Methods. Since the dataset contains only 125 patient samples, we used leave-one-out crossvalidation to test our method.

We assessed CASPAR's performance with respect to three aspects: First, the accuracy of predictions of survival on the overall dataset was analyzed. Then, we compared results with surgical staging. Finally, CASPAR was used to subdivide stage I patients, for which such a subdivision is particularly desirable.

Training was carried out using leave-one-out crossvalidation. Parameters used in training were *a* = 5 and *r* = 1.001, and a constant baseline hazard λ_{0} = 0.19. Prediction results were collected for the 125 crossvalidation runs; the data were then split into short- (≤3.5 years) and long-term (>3.5 years) survivors according to the predictions made. The median survival time of patients in the study is 3.5 years. Figure 2 (left plot) shows the resulting survivor functions for the two subgroups. A logrank test of equality yields a *P*-value of 0.057. Figure 3 shows the ROC curve for 2 years (left), and the AUC for the time interval from 0 to 10 years.
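The survivor functions compared here are Kaplan–Meier estimates. For reference, a minimal Kaplan–Meier estimator, using the paper's convention that δ = 1 marks a censored observation:

```python
import numpy as np

def kaplan_meier(time, delta):
    """Kaplan-Meier estimate of the survivor function.  delta = 1 marks
    a censored observation, delta = 0 an observed event.  Returns the
    distinct event times and S(t) just after each of them."""
    order = np.argsort(time)
    time, delta = time[order], delta[order]
    event_times = np.unique(time[delta == 0])
    surv, s = [], 1.0
    for et in event_times:
        at_risk = np.sum(time >= et)                   # n_i: still under observation
        events = np.sum((time == et) & (delta == 0))   # d_i: events at et
        s *= 1.0 - events / at_risk
        surv.append(s)
    return event_times, np.array(surv)
```

Computing this estimate separately for the predicted short- and long-term survivor groups gives curves of the kind shown in Figure 2, which a logrank test then compares.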

**Fig. 2**


**Fig. 3**


We compared predictions of CASPAR with surgical pathological staging. Staging information is provided for 111 of the 125 patients in this study. The median survival times of stage IA and stage IB patients in the study are 7.5 and 6 years, respectively. This distinction is reflected by our results, which predict average survival times of 8.7 and 8 years for patients in these stages, respectively. The true median survival time for stage II patients in the study is 2.2 years; our method predicts an average overall survival of 4.9 years for stage II patients, which is significantly lower than the predicted 8.3 years for stage I patients. Only 13 patients in the study fall into stages III and IV, with a median survival of just 1.3 years. The number of cases here is too small for training; accordingly, patients from this group are not recognized well by the method.

For the physician, it is of great interest to have a predictor of survival for stage I adenocarcinoma patients. Stage I patients are normally treated by surgical resection of their tumor, and do not receive any additional chemotherapy or radiation therapy. Although patients diagnosed with stage I adenocarcinoma have a 5 year survival rate of 63%, nearly 35% will relapse after surgical resection, thus portending a poor prognosis (Chen *et al*., 2003). An identification of these high-risk patients with resectable early-stage disease would allow therapeutic intervention, possibly leading to increased survival among this group.

Figure 2 (right plot) shows the survival curves for stage I patients predicted to survive >5 years versus stage I patients predicted to survive ≤5 years (*P* = 0.13, *n* = 75). As can clearly be seen, CASPAR can subdivide stage I patients further, providing a meaningful classification into low- and high-risk stage I patients.

Summarizing the above findings, CASPAR clearly distinguishes stage I and stage II patients based on their gene expression levels. These patients benefit most from a prognosis of their disease, since several treatment options exist and could be chosen from based on an assessment of disease aggressiveness. Overall, however, a larger number of patients in the study would clearly be desirable and could yield improved results.

### Validation of survival predictions of CASPAR in DLBCL

As a second test, we applied CASPAR to the publicly available dataset on DLBCL published by Rosenwald *et al*. (2002). This dataset contains almost twice the number of samples available in the LC study. The data were preprocessed as described in Materials and Methods. CASPAR was trained on the training data only, with *a* = 5 and *r* = 1.001 and a constant baseline hazard λ_{0} = 0.1, and then used to predict the expected survival time according to Equation (3) for each patient in the validation set. Predictions made for survival using CASPAR were used to split patients into short- (≤5 years) and long-term (>5 years) survivors, and the survivor functions shown in Figure 4 were calculated based on the true survival times for the two subgroups separately. The threshold of 5 years was chosen since this is the median overall survival time in DLBCL. A logrank test of the null hypothesis *H*_{0} of equality of the two survival curves was performed; it yields a *P*-value of 0.0000342, which is highly significant for rejecting *H*_{0}, indicating that CASPAR successfully separated short- from long-term survivors.

**Fig. 4**


#### ROC analysis reveals high performance of CASPAR

Figure 5 shows the time-dependent ROC curve for time *t* = 5 years for CASPAR predictions on DLBCL, and the area under the curve with respect to changing time. The plot shows that the AUC is ∼0.772 for all *t* (recall that higher AUC values are indicative of better performance). As a comparison, Li and Gui (2004) published AUC plots for partial Cox regression models, involving repeated least squares fitting of residuals and Cox regression fitting, on the same dataset we used in our work and with the same split of the data in training and validation subsets. For the best model reported in their manuscript, the AUC is between 0.62 and 0.67, depending on time *t*. Lossos *et al*. (2004) derive the 6-gene predictor score (−0.0273 × *LMO2*) + (−0.2103 × *BCL6*) + (−0.1878 × *FN1*) + (0.0346 × *CCND2*) + (0.1888 × *SCYA3*) + (0.5517 × *BCL2*). Using this predictor on the Rosenwald data, the AUC is ∼0.6. The dashed lines in Figure 5 correspond to results obtained using the Lossos *et al*. (2004) predictor.
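For comparison purposes, the quoted Lossos *et al*. (2004) six-gene score is straightforward to apply to a new expression profile; the mapping from gene symbol to normalized expression value is assumed here:

```python
def lossos_score(expr):
    """Six-gene predictor score of Lossos et al. (2004), with the
    coefficients exactly as quoted in the text.  expr: dict mapping
    gene symbol to (normalized) expression value."""
    weights = {'LMO2': -0.0273, 'BCL6': -0.2103, 'FN1': -0.1878,
               'CCND2': 0.0346, 'SCYA3': 0.1888, 'BCL2': 0.5517}
    return sum(w * expr[gene] for gene, w in weights.items())
```

Ranking validation-set patients by this score and sweeping a cutoff yields the dashed ROC curves in Figure 5 used for comparison with CASPAR.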

**Fig. 5**


#### CASPAR enables subdivision of groups within the IPI in DLBCL

The IPI is a clinical scoring system widely used for predicting outcome in DLBCL. It is based on a variety of factors, including patient age, tumor stage, performance status, number of extranodal sites of involvement and serum LDH level (Gascoyne, 2002). The IPI scores patients on a scale from 0 to 5, with those having a low IPI score (0,1) demonstrating 5 year overall survival rates of 73% versus patients with high IPI scores (4, 5) with only 26% 5 year overall survival. Importantly, in most studies, almost half of the patients fall into the intermediate-risk categories (IPI scores of 2–3), whose survival characteristics mimic those of DLBCL overall, with ∼50% of the patients surviving beyond 5 years (Gascoyne, 2002). It is thus of high interest to obtain an independent predictor that is able to further subdivide patients in the medium risk segment.

Figure 6 shows separate survival curves for the patients in the low (left) and medium risk (right) groups as determined by CASPAR on the validation dataset. A logrank test of the null hypothesis of equality of the two curves in each plot was made, giving *P*-values of 0.0135 for low IPI patients and 0.0000747 for medium IPI patients. As clearly demonstrated, CASPAR further subdivides patients from the same IPI group, thereby providing additional information, which could be used clinically as a complementary staging strategy to the IPI.

**Fig. 6**


We also performed the converse analysis. When calculating survival time predictions using CASPAR, patients are confirmed to belong to different risk groups, in accordance with the IPI. Patients in the low IPI score group (0, 1) are predicted to have an average survival time of 13.2 years, patients in the medium risk groups (IPI 2, 3) show an average predicted survival time of 9.2 years, whereas patients in the high risk groups (IPI 4, 5) are predicted to have an average survival time of 6.7 years. Note that the values reported here are average survival times, not median survival, and hence cannot be related directly to the Kaplan–Meier plots in Figures 4 and 6.

#### Performance of CASPAR is independent of sample splitting

To further validate the results, training was repeated 50 times, with the data split differently and randomly in training and validation subsets in each run. A total of 45 runs show good separation of the Kaplan–Meier curves for the predicted short (≤5 years) and long-term (>5 years) survivors on the validation dataset, with an achieved significance level of 10%. A total of 43 runs are significant at the 5% significance level, and 27 runs even at the 1% significance level. Only one of the 50 runs fails at the 25% significance level, with a *P*-value of 0.453. This run is characterized (randomly) by an unequal split in training and validation data, the validation data containing a high proportion of short survivors and no uncensored long-term survivors, and the training data containing few short- and many long-term survivors.

Overall, the average AUC value over all 50 runs and all timepoints *t* is 0.64.

#### Direct association of biologically relevant genes to survival

Interestingly, different genes receive high weights in the 50 runs. This shows a typical problem in microarray data analysis, where frequently different genes are identified in different studies (compare Table 1). One possible explanation for this observation might be the high correlation between genes measured. O'Neill and Song (2003) pointed out that the level of information redundancy in large gene expression assays is unknown, and that different, mutually exclusive gene sets may be equally useful predictors. This point is confirmed by Alon *et al*. (1999), who discarded the 1500 genes indicated by cluster analysis as most discriminatory in their study of colon cancer, and, upon reclustering, found their diagnosis unimpaired. Hence, if several features are highly correlated, each one of them could in principle function as part of a predictor. The prior distribution *p*(θ) used on the weights in our approach, however, will ensure that redundant information is removed from the predictor. Which feature is chosen in the predictor will depend on effects such as the randomness involved in splitting the data in training and validation subsets.

**Table 1**

‘None’ indicates gene was not assigned to any cluster at the *d* = 0.5 cutoff level.

Correlated genes in the data can be identified by clustering. For this purpose, genes were clustered using standard hierarchical clustering, with Pearson correlation as similarity measure. Clustering was continued until the similarity between joined clusters dropped below 0.5. At this point, the 7399 microarray features are clustered in 2366 different clusters. Table 1 shows genes that were reported in several studies [compare the summary in (Lossos *et al*., 2004)], and the respective clusters they were assigned to by the hierarchical clustering algorithm.

A more detailed analysis of the genes identified in the 50 repetitive runs reveals that many of those genes are highly correlated to genes listed in Table 1. High weights are given to genes from cluster 1525 (containing CD38), 4743 (containing HLA-DQA1 and HLA-DRA), 4939 (BCL2, CCND2, CD44, CFLAR, SLA, IRF4, PDE4B), cluster 4990 (containing FN1, SCYA3, ICAM1/CD54, PRDM1, SLAM, PLAU), cluster 5000 (containing genes BCL6, PAX5, Ki-67, BIRC5, WASPIP, PMS1, NPM3, MYC, EEF1A1L4) and cluster 5029 (CD10, HGAL, MYBL1/A-MYB, PIK3CG).

Several additional genes are reported by CASPAR. WASF3 receives high weights in 35 runs; elevated expression of this gene is predicted to correlate negatively with survival. A total of 29 runs show an association with BMP6. BMP6 is one of the main prognostic genes identified by Rosenwald *et al*. (2002) in their study; however, Lossos *et al*. (2004) later excluded BMP6 from their meta-analysis, arguing that the gene was not associated with survival in independent data analyses. Our results indicate that expression of BMP6 correlates negatively with survival. TP63 receives high weights in 21 runs. Expression of this gene is predicted to correlate positively with prolonged survival. Even though TP63 has been reported to be associated with DLBCL before (Akahoshi *et al*., 2003), it is not as prominent a candidate as other genes, and may deserve further attention. RAFTLIN receives high weights in 29 runs, MNDA in 18 runs, TCF7 in 14 runs and LC_31372 in 12 runs. Elevated expression of all of these genes is predicted to correlate with prolonged survival.

Taken together, CASPAR yields good predictions of overall survival on DLBCL, with high rates of sensitivity and specificity. It complements the IPI, and can refine predictions made using the IPI. Furthermore, CASPAR identifies biologically relevant genes and points to several new candidates related to survival.

## DISCUSSION

In this paper, we have presented CASPAR, a new method to correlate gene expression data with survival, based on a combination of the Cox regression model with a Bayesian automatic relevance determination (ARD) approach. The method automatically weights genes with respect to their relevance concerning patient outcome, while penalizing models with too many genes, thus encompassing dimension reduction and regression in one single step. The method uses a hierarchical prior distribution on the regression parameters, specifically designed to deal with the high dimensionality, low sample number setting characteristic of microarray studies. This allows the application of a multivariate Cox regression model on datasets with several hundred to several thousand genes, with only a small number of patients compared with the number of genes available, and avoids the problem of losing relevant information in a separate dimension reduction step carried out before the actual regression.

The price paid for the ability to process high-dimensional data is the set of assumptions made through the prior distribution on the regression parameters θ. While *p*(θ) enables the method to control the curse of dimensionality, it biases the predictor towards solutions where only few of the genes have large weights, and where the θ_{i} do not become too large. Through the choice of parameters for the prior, one trades variance of the predictions made against bias of the predictor. The strength of the prior required depends on the number of genes and on the number of samples available for training, and needs to be chosen carefully. If the prior is too strong, it will dominate the results. On the other hand, if the prior is too weak, results will be random owing to the high dimensionality of the data. The higher the dimensionality of the data, the stronger the prior assumptions required for the parameters θ. Careful engineering is required for the choice of the parameters *a* and *r* controlling this trade-off.

Many studies of gene expression data in cancer cluster patient samples together based on their genetic profiles, and associate the clusters identified with survival differences in retrospect. Characteristic genes for each cluster are then sought, and used to classify new patients into one of the cluster groups. While this clearly is a valid approach, its disadvantage lies in the fact that the clustering process itself may be dominated by genes with large variability that are actually unrelated to survival. Our approach directly tries to identify genes and combinations of genes correlated with survival on the full dataset, which are then used to assign a new patient to one of several risk classes directly based on predicted survival time distributions.

We demonstrated the utility of our method on two publicly available datasets. On both datasets, CASPAR is able to separate long from short survivors, even though separation is significantly better in the larger dataset from the DLBCL study. One possible explanation is the lower sample number, especially in those stage groups (stage III, *n* = 10) in the LC study that did not perform well. Alternatively, with increasing stage of disease, one might also consider that gene expression profiles in LC might not be as informative for survival as previously suggested. This might also be explained by the biology of the disease. While stage III lung cancer is characterized by extrapulmonary extension of the tumor without evidence of distant metastasis, stage IV tumors are defined by distant metastasis. This change in biology is in fact correlated with differences in survival (Etzioni *et al*., 2003) and, as previously described, also with changes in gene expression (Ramaswamy *et al*., 2003). Our correct prediction of short survival for stage IV patients is therefore in line with earlier observations. Certainly, these important issues need to be addressed in future clinical studies in lung cancer.

The finding that many of the genes in the DLBCL data are strongly correlated suggests clustering the data first and then carrying out the gradient descent training. This brings the additional advantage that the gradient descent can be carried out in a lower-dimensional space. However, our results indicate that prediction results are actually worse when the data are clustered before training (data not shown). Apparently, useful information is lost in the clustering process. At the same time, the analysis of the 50 separate runs with the data split differently into training and validation subsets indicates that there is a high degree of correlation between different genes, and that clustering may be useful to avoid correlated genes. This may appear contradictory at first sight. The problem is that it is not clear when to stop clustering. If clustering is interrupted too early, there will still be a high degree of redundancy in the data. On the other hand, if the data are clustered for too long, much useful information may be lost. This is a delicate trade-off. We therefore suggest using CASPAR on the full data, and carrying out clustering in retrospect to find correlations.

In both the lymphoma and the lung cancer studies, our approach is able to refine predictions made using widely used clinical scores, particularly on low- and medium-risk patients in DLBCL and on stage I patients in lung adenocarcinoma. These groups benefit particularly from such refined predictions, since they are relatively heterogeneous with respect to survival. In DLBCL, survival characteristics of patients with medium IPI score (2, 3) mimic those of DLBCL overall, with ∼50% of patients dying in the first 5 years after diagnosis. In LC, ∼35% of stage I patients relapse after surgical resection. Using predictions made by the approach presented, these patients may be identified and could benefit from adjuvant therapies.

*Conflict of Interest*: none declared.
