Improved large-scale prediction of growth inhibition patterns using the NCI60 cancer cell line panel

Motivation: Recent large-scale omics initiatives have catalogued the somatic alterations of cancer cell line panels along with their pharmacological response to hundreds of compounds. In this study, we have explored these data to advance computational approaches that enable more effective and targeted use of current and future anticancer therapeutics.

Results: We modelled the 50% growth inhibition bioassay end-point (GI50) of 17 142 compounds screened against 59 cancer cell lines from the NCI60 panel (941 831 data-points, matrix 93.08% complete) by integrating the chemical and biological (cell line) information. We determine that the protein, gene transcript and miRNA abundance provide the highest predictive signal when modelling the GI50 endpoint, which significantly outperformed the DNA copy-number variation or exome sequencing data (Tukey’s Honestly Significant Difference, P < 0.05). We demonstrate that, within the limits of the data, our approach exhibits the ability to both interpolate and extrapolate compound bioactivities to new cell lines and tissues and, although to a lesser extent, to dissimilar compounds. Moreover, our approach outperforms previous models generated on the GDSC dataset. Finally, we determine that in the cases investigated in more detail, the predicted drug-pathway associations and growth inhibition patterns are mostly consistent with the experimental data, which also suggests the possibility of identifying genomic markers of drug sensitivity for novel compounds on novel cell lines.

Contact: terez@pasteur.fr; ab454@ac.cam.uk

Supplementary information: Supplementary data are available at Bioinformatics online.


Distribution of the respective maximum and minimum RMSE test (a,b) and R₀² test (c,d) values for the complete data set. With the simulated data, average maximum/minimum values of 1.42/0.35 and 0.96/-0.96 were obtained for RMSE test and R₀² test, respectively. The performance of the 10-fold CV PGM models on the test set was in agreement with the uncertainty of the experimental measurements: mean RMSE test and R₀² test values of 0.40 ± 0.00 pGI50 units and 0.83 ± 0.00 (n = 10 models) were obtained. These values lie between the theoretical maximum and minimum RMSE test and R₀² test values.

Distribution of the respective maximum and minimum RMSE test (a,b) and R₀² test (c,d) values for the uncorrelated bioactivities 0.5 data set. With the simulated data, average maximum/minimum values of 1.90/0.54 and 0.94/-0.90 were obtained for RMSE test and R₀² test, respectively. The performance of the 10-fold CV PGM models was in agreement with the uncertainty of the experimental measurements: mean RMSE test and R₀² test values of 0.58 pGI50 units and 0.79 were obtained. These values lie between the theoretical maximum and minimum RMSE test and R₀² test values.
Interpolating compound bioactivities to novel cell lines, tissues, and chemical clusters. (a) Cell

Supplementary Figure S7.
Learning curves. Mean (± std) RMSE test (a) and R₀² test (b) values for the observed against the predicted bioactivities on the test set, calculated with n = 3 models trained on sets covering an increasingly large fraction of the complete data set. Models trained on 5% of the data set exhibited a mean RMSE test value of 0.52 pGI50 units, which decreased to 0.39 pGI50 units when 95% of the data-points were included in the training set.
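The learning-curve procedure described in this caption can be sketched in a few lines. This is a minimal illustration only: the data are synthetic, a one-descriptor least-squares line stands in for the PGM models, and the function names (`fit_line`, `learning_curve`) and the choice to test on all data-points not used for training are assumptions, not details taken from the study.

```python
import random
import statistics


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a * x + b."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def rmse(obs, pred):
    """Root mean squared error between observed and predicted values."""
    return (sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)) ** 0.5


def learning_curve(data, fractions, n_repeats=3, seed=0):
    """Return {fraction: (mean RMSE_test, std RMSE_test)} over n_repeats models.

    Assumption: for each repeat the data are reshuffled, the first
    `fraction` of the points is used for training and the rest for testing.
    """
    curve = {}
    for frac in fractions:
        scores = []
        for rep in range(n_repeats):
            shuffled = data[:]
            random.Random(seed + rep).shuffle(shuffled)
            n_train = max(2, int(len(shuffled) * frac))
            train, test = shuffled[:n_train], shuffled[n_train:]
            a, b = fit_line([x for x, _ in train], [y for _, y in train])
            preds = [a * x + b for x, _ in test]
            scores.append(rmse([y for _, y in test], preds))
        curve[frac] = (statistics.fmean(scores), statistics.stdev(scores))
    return curve


# Synthetic stand-in for (descriptor, pGI50) pairs; NOT the NCI60 data.
rng = random.Random(42)
data = []
for _ in range(2000):
    x = rng.uniform(-2, 2)
    data.append((x, 5.0 + 0.8 * x + rng.gauss(0, 0.4)))

curve = learning_curve(data, fractions=[0.05, 0.25, 0.5, 0.95])
for frac, (mean, std) in curve.items():
    print(f"train fraction {frac:.2f}: RMSE_test = {mean:.3f} +/- {std:.3f}")
```

With such a simple model the curve flattens almost immediately near the simulated noise level; a more flexible model on real data would be expected to show a more gradual decrease, as reported in the caption.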

Supplementary Figure S10
Validation of conformal prediction. For each confidence level (ε), shown on the x-axis, the number of data-points in the test set whose true value lay within the predicted interval is shown on the y-axis. The high Spearman's rₛ is likely due to the large size of the test set (188,366 data-points) and to the fact that the confidence intervals (CI) produced by conformal prediction are always valid (Norinder et al., 2014). These data indicate that the modeling framework combining PGM models and conformal prediction is more information-rich than point-prediction algorithms alone.

Figure S11.
Consistency between the pathway-drug associations calculated with the experimental and the

Correlation between in vitro drug sensitivity data from the NCI60 and CCLE. The subset of the NCI60 data used in our study and the CCLE share 44 cell lines and 8 drugs, namely: Erlotinib, Lapatinib, Nilotinib, Sorafenib, Paclitaxel, Irinotecan and Topotecan. We could retrieve bioactivity data from both data sets for a total of 208 compound-cell line pairs. (a) The RMSE value for (i) the pGI50 values from the NCI60 data set against (ii) the pIC50 values from the CCLE is 0.87 log10 units. This low concordance was expected given the different assays used to screen the NCI60 and CCLE panels, namely sulforhodamine B (SRB) and the CellTiter-Glo® Luminescent Cell Viability Assay from Promega, respectively. Because the former reports on cell proliferation and the latter on cellular metabolic activity, three cases are possible when comparing data from the NCI60 and the CCLE data sets. In the first case, a low compound concentration is sufficient to stop cell proliferation whereas a high compound concentration is required to decrease cellular metabolic activity: this case is labeled with number 1 in red in the Figure. In the second case, cell proliferation and cellular metabolic activity are correlated and similar IC50 values are observed using both assays: this case is labeled with numbers 2 and 3 in red in the Figure. In the third case, a low compound concentration is sufficient to decrease cellular metabolic activity whereas a high compound concentration is required to stop cell proliferation: this case is labeled with number 4 in red in the Figure.

novel chemical structures was assessed by randomly dividing the compounds into 8 sets. A model was trained on all data-points comprising compounds from 7 sets. The trained model was then used to predict the bioactivities for the held-out data. This process was repeated 8 times, each time holding out the data from a different compound set.
In this setting, which is similar to LOCCO except that compounds are not grouped by similarity, the distribution of IC50 values for a given compound set is likely to span a wide range, which makes it possible to obtain high R² values for the observed against the predicted bioactivities (Supplementary Figure 14). By contrast, the range of IC50 values is likely to be much narrower for individual compounds across the cell line panel. Therefore, for the same prediction accuracy, quantified by the RMSE for the observed against the predicted bioactivities, the R² values obtained with Leave-One-Compound-Out validation of PGM models are likely to be smaller than those obtained with LOCCO. Hence, it is important to note that although the

We note in particular that we did not apply the same validation as Ammad-ud-din et al. (2014), namely partitioning the data set into 8 compound sets, as the composition of the 8 sets was not reported by the authors.
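The argument above, that R² depends on the spread of the observed values whereas RMSE does not, can be illustrated numerically. The sketch below is a toy example on simulated values, not on the CCLE or NCI60 data. It assumes R² is the coefficient of determination, 1 - SS_res/SS_tot, and the value ranges and error size are arbitrary choices made for illustration.

```python
import random


def rmse(obs, pred):
    """Root mean squared error between observed and predicted values."""
    return (sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)) ** 0.5


def r_squared(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)
n = 500

# The same prediction errors are applied to both scenarios,
# so both have the same RMSE (up to floating-point rounding).
errors = [rng.gauss(0, 0.2) for _ in range(n)]

# Wide range: stands in for a set of many different compounds.
wide_obs = [rng.uniform(4.0, 9.0) for _ in range(n)]
# Narrow range: stands in for one compound across a cell line panel.
narrow_obs = [rng.uniform(5.5, 6.5) for _ in range(n)]

results = {}
for name, obs in (("wide", wide_obs), ("narrow", narrow_obs)):
    pred = [o + e for o, e in zip(obs, errors)]
    results[name] = (rmse(obs, pred), r_squared(obs, pred))
    print(f"{name:>6}: RMSE = {results[name][0]:.3f}, R2 = {results[name][1]:.3f}")
```

Both scenarios share the same RMSE by construction, yet the narrow-range scenario yields a markedly lower R², because SS_tot shrinks with the range of the observed values while SS_res stays the same.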

Preparation of the CCLE data set
Gene transcript levels (Affymetrix U133+2 arrays), RMA-processed and normalized using quantile normalization, and compound IC50 values (µM) were downloaded from the CCLE

and R² test = 0.75 ± 0.12) and Leave-One-Compound-Out (RMSE test = 1.62 ± 1.32 and R² test = 0.18 ± 0.15). All models were trained using: (i) 256-bit hashed Morgan fingerprints in count format with a maximum substructure radius of 2 bonds, and (ii) the transcript levels of the 1,000 genes displaying the highest variance across the cell line panel. The results for these models and for previous studies are given in Supplementary Table S15.
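The selection of the highest-variance genes mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the gene names and expression values are hypothetical, the function name `top_variance_genes` is introduced here for illustration, and the fingerprint calculation is not covered. Population variance is used; because every gene is measured on the same number of cell lines, using sample variance would give the same ranking.

```python
import statistics


def top_variance_genes(expression, k):
    """Return the k gene names with the highest variance across cell lines.

    `expression` maps a gene name to its transcript levels, one value
    per cell line (all genes measured on the same cell lines).
    """
    variances = {gene: statistics.pvariance(levels)
                 for gene, levels in expression.items()}
    # Sort gene names by decreasing variance and keep the first k.
    return sorted(variances, key=variances.get, reverse=True)[:k]


# Hypothetical toy matrix: 4 genes measured on 4 cell lines.
expression = {
    "GENE_A": [5.0, 7.0, 5.0, 7.0],    # variance 1.0
    "GENE_B": [6.0, 6.0, 6.0, 6.0],    # variance 0.0
    "GENE_C": [2.0, 10.0, 2.0, 10.0],  # variance 16.0
    "GENE_D": [4.0, 4.5, 4.0, 4.5],    # variance 0.0625
}

selected = top_variance_genes(expression, k=2)
print(selected)
```

In the setting described above, k would be 1,000 and the selected transcript levels would form the cell line descriptors.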