ChemProt-3.0: a global chemical biology diseases mapping

ChemProt is a publicly available compilation of chemical-protein-disease annotation resources that enables the study of systems pharmacology for a small molecule across multiple layers of complexity, from the molecular to the clinical level. In this third version, ChemProt has been updated to more than 1.7 million compounds with 7.8 million bioactivity measurements for 19 504 proteins. Here, we report the implementation of a global pharmacological heatmap, supporting user-friendly navigation of the chemogenomics space. This facilitates the visualization and selection of chemicals that share similar structural properties. In addition, the user can search by compound, target, pathway, disease and clinical effect. Genetic variations associated with target proteins were integrated, making it possible to plan pharmacogenetic studies and to suggest human response variability to drugs. Finally, Quantitative Structure–Activity Relationship (QSAR) models were implemented for 850 proteins with sufficient data, enabling secondary pharmacological profiling predictions from molecular structure. Database URL: http://potentia.cbs.dtu.dk/ChemProt/

Figure S1 outlines the procedure used for training the QSAR models. One QSAR model is trained for each combination of classification algorithm, chemical descriptor and activity cutoff. In total, 15 different QSAR models are produced for each protein in the dataset (5 descriptor types * 1 algorithm * 3 cutoffs for splitting the data). The performance of each model was estimated in a 5-fold cross-validation scheme, as outlined in Figure S1, and used for weighting the prediction of each model when calculating the "wisdom of the crowd" score. Each dataset was balanced, i.e. the same number of positive and negative compounds (binders/non-binders) was included, by sampling from the negative dataset a number of negative data points corresponding to the number of positive data points present in the dataset. If not enough negative data were available, random chemicals from ChemProt-3.0 were included as negative data. Note that the final models used for prediction are trained on all available data.

Figure S1. Outline of the described method. First, a 5-fold cross-validation scheme applied to the training data is used to determine the unbiased performance of each QSAR model. Next, all training data are used to generate the QSAR models used for prediction of the potency of new chemicals. The prediction score of each QSAR model is weighted according to the model performance estimated in the 5-fold cross-validation scheme.
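The balancing step described above can be sketched as follows (a minimal plain-Python illustration; the function and variable names are ours, not from the ChemProt code):

```python
import random

def balance_dataset(positives, negatives, background, seed=0):
    """Balance a binding dataset: sample as many negatives as there are
    positives; if too few measured non-binders exist, pad with random
    background chemicals (standing in for random ChemProt-3.0 compounds,
    assumed to be non-binders)."""
    rng = random.Random(seed)
    n = len(positives)
    if len(negatives) >= n:
        sampled = rng.sample(negatives, n)
    else:
        # Not enough negative data: top up with random chemicals.
        pad = rng.sample(background, n - len(negatives))
        sampled = list(negatives) + pad
    return list(positives), sampled

pos = ["c%d" % i for i in range(10)]   # 10 binders
neg = ["n%d" % i for i in range(4)]    # only 4 measured non-binders
bg = ["b%d" % i for i in range(100)]   # background chemicals

p, q = balance_dataset(pos, neg, bg)
print(len(p), len(q))  # 10 10
```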
As the feature space used to describe chemical structures is large compared to the available data, a feature selection algorithm was employed to select a subset of features for model generation. A random forest approach, using the scikit-learn ExtraTreesClassifier with 100 trees, provided a consistent selection of features in a 5-fold cross-validation scheme on the hERG dataset. 100 trees were chosen to reduce running time, as 15*6=90 feature selections have to be completed for each protein.
Features were selected based on their average Gini importance, using the mean average Gini importance as cutoff. Feature selection is applied only to the training dataset in the cross-validation scheme; thus, no bias towards descriptors generally applicable to the whole dataset is introduced.
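A minimal sketch of this selection step with scikit-learn, using toy data in place of the actual fingerprint matrices (array shapes and labels here are illustrative only):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 50)).astype(float)  # toy fingerprint bits
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)   # toy binder labels

# Fit 100 extremely randomized trees and rank features by Gini importance.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
importances = forest.feature_importances_

# Keep only features whose importance exceeds the mean importance;
# this is run on the training fold only, so nothing leaks from the test fold.
mask = importances > importances.mean()
X_reduced = X_train[:, mask]
print(X_reduced.shape[1], "of", X_train.shape[1], "features kept")
```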

Performance measure
The accuracy score (ACC) was used as the measure of model performance for each of the 15 models generated for each dataset of interest (Figure S1). It converges to the Jaccard similarity score when the output is binary (classification) and gives the ratio between correctly classified instances and the total number of data points (correct + incorrect classifications). Hence, a completely random model will take the value ACC = 0.5. As the datasets used are fully balanced, the accuracy gives a reasonable estimation of the model performance and does not suffer from over-optimistic results biased towards either negative or positive instances, as when applied to non-balanced datasets.

Predicting the potency of novel chemicals - "Wisdom of the crowd" framework
The scores were weighted based on model performance relative to the performance of a dummy model always outputting the average of the training activity scores. As the datasets were balanced, the performance of the dummy model is always ACC = 0.5, giving the weight in equation 1:

w_m = ACC_m - 0.5 (1)

The overall score was then calculated by weighting the predicted scores by the cross-validated performance, as described in equation 2:

S = sum_m(w_m * S_m) / sum_m(w_m) (2)

where w_m is the cross-validated performance relative to the dummy model (see Figure S1) and S_m is the predicted score (0 or 1) of model m.
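The weighting scheme can be sketched in plain Python, assuming each model's weight is its cross-validated accuracy minus the ACC = 0.5 dummy baseline (function and variable names are illustrative):

```python
def crowd_score(predictions, accuracies):
    """Combine binary model predictions (0/1) into one ensemble score,
    weighting each model by how much its cross-validated accuracy
    exceeds the ACC = 0.5 dummy baseline."""
    weights = [acc - 0.5 for acc in accuracies]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, predictions)) / total

# Three models predict binder (1) or non-binder (0) for one chemical;
# the better-performing models dominate the consensus.
preds = [1, 1, 0]
accs = [0.9, 0.8, 0.6]  # cross-validated accuracies
score = crowd_score(preds, accs)
print(round(score, 3))  # 0.875
```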

Chemical Descriptors
As the chemical space is multi-dimensional and practically infinite, directly using chemical structures to build predictive models is not feasible. Instead, descriptors are used that capture different features of the molecules, thereby transforming the structure into a feature space. Multiple types of descriptors exist, describing different molecular features. Here, the focus is on topological fingerprints (Daylight-like fingerprints) [1] and Morgan fingerprints (also called circular fingerprints) [2]. The topological fingerprint will be called "daylight". The Morgan fingerprints will be called ECFP and FCFP for the atom- and feature-based versions, respectively, to emphasize that they utilize atom-invariant connectivity information similar to that used in the well-known ECFP family of fingerprints, and feature-based invariants similar to those used in the FCFP fingerprints. It was chosen not to include pharmacophore fingerprints or 1D and 2D physical/chemical descriptors, to keep the number of generated models at a reasonable level. Furthermore, the feature-based Morgan fingerprint is somewhat related to 2D pharmacophore features, as it describes the pharmacophore features around each atom in the chemical. All fingerprints were calculated using the Python implementation of RDKit (www.rdkit.org). Figure S2 gives an overview of the chemical descriptors used in the presented work.
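To illustrate how a variable set of substructure features is turned into a fixed-length bit vector, here is a toy folding scheme in plain Python. RDKit performs the actual fingerprinting in ChemProt; this only sketches the hashing-and-folding idea, and the substructure keys are made up:

```python
def fold_fingerprint(substructures, n_bits=2048):
    """Hash each substructure key and fold it into a fixed-length
    bit vector, analogous to how topological and Morgan fingerprints
    map enumerated substructures onto 512/1024/2048 bits."""
    bits = [0] * n_bits
    for sub in substructures:
        # Stable toy hash; real fingerprints use their own hashing schemes.
        h = 0
        for ch in sub:
            h = (h * 31 + ord(ch)) % n_bits
        bits[h] = 1  # collisions are possible and expected when folding
    return bits

# Made-up substructure keys for a small molecule.
subs = ["C-C", "C=O", "C-O-H", "aromatic-C", "C-N"]
fp = fold_fingerprint(subs, n_bits=512)
print(sum(fp), "bits set out of", len(fp))
```

Shorter bit vectors increase the chance that two different substructures collide on the same bit, which is one reason fingerprint length (512, 1024 or 2048 bits) can affect model performance.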

Prediction of hERG binders
To evaluate different settings, such as the choice of prediction algorithm, the number of bits in the fingerprints and the added value of an ensemble approach, the hERG dataset obtained from [3] was used. A 5-fold cross-validation scheme was used to estimate the performance, splitting the dataset randomly into 5 partitions and iteratively using 4 partitions for training and 1 for evaluation until all data points had been evaluated once (not to be confused with the cross-validation performed when training the ensemble models). The described ensemble approach was applied to each training partition, i.e. training and evaluation datasets were kept strictly separated during training and evaluation. Thus, no bias towards descriptors, classifier algorithms or training data points was introduced, allowing selection of the best settings (classifier and number of bits) based on the cross-validated performance. The IC50 values (in uM) used in the study were multiplied by -1 to reverse the scale, and the cutoffs -100, -10 and -1 uM were considered the low, medium and high binder thresholds, respectively, except for single models, where -40 uM was used in agreement with the original study. Chemicals with an IC50 value below 40 uM were considered binders, whereas chemicals with an affinity value above were considered non-binders. Some of the included performance measures require a binary classification; thus, predicted values above 0.6 were regarded as binders when calculating these measures.
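The scale reversal and thresholding described above can be written out as a small helper (plain Python; the cutoff values come from the text, while the function name is ours):

```python
def label_binders(ic50_um, cutoff_um):
    """Multiply IC50 values (uM) by -1 so that stronger binders score
    higher, then label chemicals at a cutoff on the reversed scale.
    cutoff_um is given on the reversed scale, e.g. -10 for 10 uM."""
    reversed_scale = [-v for v in ic50_um]
    return [1 if v > cutoff_um else 0 for v in reversed_scale]

ic50 = [0.5, 5.0, 50.0, 500.0]    # uM, lower = stronger binder
print(label_binders(ic50, -1))    # high-binder threshold (1 uM)
print(label_binders(ic50, -10))   # medium threshold (10 uM)
print(label_binders(ic50, -100))  # low threshold (100 uM)
```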
Several combinations of fingerprint types and lengths (number of bits) were tested, as described in Table S2. Using only a single fingerprint type reduces the performance of the model; however, the FCFP and daylight fingerprints using 2048 bits still show reasonable performance. The length of the fingerprints (512, 1024 or 2048 bits) seems to influence performance only slightly, whereas using a single fingerprint type consistently reduces performance. Inclusion of models trained using different classification algorithms (separately) boosts performance to a minor degree, with the non-linear algorithms performing best (settings 16-19 in Table S2). Thus, choosing a single algorithm might be sufficient as long as it captures higher-order correlations. Comparing the single models (settings 20-39) to the ensemble reveals that the ensemble approach significantly enhances prediction power. All ensemble models have improved prediction statistics compared to models containing only a single classification model (one descriptor, one threshold and one classifier algorithm), even though the threshold used for splitting the training dataset into binders/non-binders for the single models is the same as that used for the evaluation.

Table S2. Cross-validated performance on the hERG binders. "daylight" denotes the topological fingerprint implemented in RDKit (essentially the same as Daylight fingerprints) and "_bXXXX" the number of bits used in the fingerprint. ECFP and FCFP are the Morgan circular atom- and feature-based fingerprints, with "_bXXXX" the number of bits used and "rX" the radius used in the circular fingerprint. Note that the performance values reported here are from the external cross-validation and not the cross-validation performed when training the ensemble of predictors described in Figure S1.

Comparison to the Similarity Ensemble Approach
The other prediction method implemented in ChemProt-3.0 is the similarity ensemble approach (SEA) [4]. To benchmark the new QSAR implementation, a dataset of 179 proteins of particular interest when investigating off-target effects was compiled (see Table S2). 143 of these had sufficient data available in ChemProt-3.0 to train QSAR models and were used as the basis for comparing the performance of the QSAR models to the SEA implementation. The dataset for the 143 proteins was split randomly into 5 partitions, and a 5-fold cross-validation scheme was used to assess performance, using 4 partitions for training with the ensemble approach explained above and 1 partition for validation at a time. As both the ensemble QSAR model and SEA output continuous values, the Spearman correlation coefficient (SCC) was used to compare performance. The SCC is a parameter-free coefficient (essentially the PCC of ranked values), which ensures a reasonable comparison even though the two methods produce outputs on different scales. An SCC = 1 reflects perfect rank correlation between predicted and true values, 0 is random and -1 reflects an inverse correlation. Table S2 lists the performances for the 143 proteins for both the ensemble QSAR predictive models and the SEA. Using a one-sided paired t-test with the null hypothesis SCC_QSAR = SCC_SEA and the alternative hypothesis SCC_QSAR > SCC_SEA, the null hypothesis could be rejected with a p-value of 2.2e-16.
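The SCC can be computed exactly as described, as the Pearson correlation of rank-transformed values. A self-contained plain-Python sketch (ties are not averaged here, so it assumes distinct values):

```python
def ranks(values):
    """Rank of each value (1 = smallest); assumes distinct values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # SCC = Pearson correlation coefficient of the ranked values.
    return pearson(ranks(x), ranks(y))

true_scores = [0.1, 0.4, 0.35, 0.8]
pred_a = [1.0, 2.0, 1.5, 3.0]  # same ranking as true_scores -> SCC = 1
pred_b = [3.0, 1.5, 2.0, 1.0]  # inverse ranking -> SCC = -1
print(spearman(true_scores, pred_a), spearman(true_scores, pred_b))
```

Because only the ranks enter the calculation, the raw QSAR and SEA outputs never need to be mapped onto a common scale before comparison.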