Ying-Ying Xu, Fan Yang, Yang Zhang, Hong-Bin Shen, Bioimaging-based detection of mislocalized proteins in human cancers by semi-supervised learning, Bioinformatics, Volume 31, Issue 7, April 2015, Pages 1111–1119, https://doi.org/10.1093/bioinformatics/btu772
Abstract
Motivation: There is a long-term interest in the challenging task of finding translocated and mislocated cancer biomarker proteins. Bioimages of subcellular protein distribution are new data sources which have attracted much attention in recent years because of their intuitive and detailed descriptions of protein distribution. However, automated methods in large-scale biomarker screening suffer significantly from the lack of subcellular location annotations for bioimages from cancer tissues. The transfer prediction idea of applying models trained on normal tissue proteins to predict the subcellular locations of cancerous ones is arbitrary because the protein distribution patterns may differ in normal and cancerous states.
Results: We developed a new semi-supervised protocol that can use unlabeled cancer protein data in model construction by an iterative and incremental training strategy. Our approach enables us to selectively use the low-quality images in normal states to expand the training sample space and provides a general way for dealing with the small size of annotated images used together with large unannotated ones. Experiments demonstrate that the new semi-supervised protocol can result in improved accuracy and sensitivity of subcellular location difference detection.
Availability and implementation: The data and code are available at: www.csbio.sjtu.edu.cn/bioinf/SemiBiomarker/.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Knowing the subcellular locations of proteins in human cancer tissues can improve the understanding of protein functions and cancer pathogenesis (Chou and Shen, 2008; Pierleoni et al., 2006). It has been demonstrated that the translocation of protein might be a signal of cancer (Hanash et al., 2008; Hung and Link, 2011). The cyclin D1 protein is an example: it shuttles between the nucleus and cytoplasm in a healthy cell and the reduction of exportation from the nucleus can lead to overexpression in the nucleus and the inactivation of the tumor-suppressing protein retinoblastoma (Benzeno et al., 2006; Gladden and Diehl, 2005). Accurately detecting protein translocations in human cancer tissues can thus be of important help for clinical diagnosis and treatment. Because traditional wet lab experiments are expensive in time and costs (Eliceiri et al., 2012; Winski et al., 2002), automated methods are highly desired for handling the increasing amounts of biomedical data.
Despite its importance, only a few studies have so far reported automated methods to detect translocation details in cancerous tissues. One reason is that sequence-based analysis by itself is not sensitive enough to detect protein translocation, because translocation can be strongly affected by mutations outside the target sequence. For example, mutations in nucleoporin complexes can have dramatic effects on the nuclear localization of multiple other proteins (Hung and Link, 2011). With recent advances in microscopic imaging, image-based pattern analysis methods have gained popularity because of the intuitive and detailed information the images contain. For example, the Murphy group discussed the potential applications of their models based on automated analysis of fluorescence microscopy images to the analysis and classification of skin cancers (Glory and Murphy, 2007; Murphy, 2004). Rizzardi et al. (2012) compared the abilities of automated image analysis and pathologist visual examination in quantifying protein expression in ovarian cancer. Recently, our group developed a multilabel subcellular location predictor, iLocator, and identified several translocated proteins as potential cancer biomarkers (Xu et al., 2013).
To compare the localization difference of a protein in normal and cancerous tissues, we first have to know its subcellular locations in both normal and cancerous states. This can be achieved either through wet-lab experiments or computational predictions. Since image data with experimentally annotated subcellular locations in cancerous states are rare, prediction models have been used instead, especially in large-scale screening. Due to the lack of location labels for proteins in cancerous states, however, most of the existing methods employed an approach named transfer learning, where models are first trained on proteins in normal tissues and then used to predict the localization of proteins in cancerous tissues (Eliceiri et al., 2012; Xu et al., 2013). The performance of these approaches is poor. One reason is that subcellular location patterns differ subtly between cancer and normal states, because they are influenced by cell mutations and morphological changes.
In fact, a large number of images of proteins in cancerous tissues are available. The Human Protein Atlas (HPA, version 11, http://www.proteinatlas.org/) (Uhlen et al., 2010) database, for example, currently contains more than 1 million immunohistochemistry (IHC) microscopy images of proteins in cancerous tissues. But due to the lack of explicit subcellular annotations, no attempt has been made to use these images from cancerous tissues to construct supervised models for cancer localization prediction.
To address the issues, we present a heuristic semi-supervised learning framework for subcellular location prediction by taking advantage of the unannotated cancer samples in developing predictors. The key advantage of the proposed semi-supervised method, in comparison to the traditional supervised learning algorithms, is that it can train prediction models with only a few labeled image samples and a large pool of unlabeled samples (Hady and Schwenker, 2013). An iterative and incremental strategy was designed to select unlabeled samples into the training set. To choose the most discriminative samples, we developed three different training modes: a single-training model consisting of only one classifier (McLachlan, 1975), a co-training model consisting of two classifiers (Cohen, 2002) and a tri-training model consisting of three classifiers (Zhou and Li, 2005). Also, as the incorporation of prior knowledge can improve the performance of semi-supervised methods (Liston and Stone, 2008), we took the location information from the corresponding normal tissues as prior knowledge to guide the selection process.
Another advantage of the proposed semi-supervised framework is that the training set is typically much richer than in traditional supervised learning. First, it selects useful lower-quality images from normal tissues for training. In general, researchers prefer using well-stained images in the training set (Newberg and Murphy, 2008; Xu et al., 2013). But selecting only high-quality images may introduce bias into modeling because the number of high expression level images in the HPA is relatively small (Fig. 1A). Therefore, instead of being discarded, some images of normal tissues with weak expression levels were selected for use in training by the semi-supervised strategy used in this study. Second, the large cancer dataset was also used for model construction through the same semi-supervised strategy, which results in a much larger dataset usable for model construction. The final predictor obtained by semi-supervised training can be used for images from both normal and cancer tissues. We have tested the method on an independent cancer biomarker dataset composed of translocated or mislocated proteins, which have been confirmed by biological experiments. Comparing the prediction results from models trained with and without data from cancerous tissues shows that using the cancer data improves the sensitivity of detecting protein translocations or mislocations in human cancer tissues.

Fig. 1. Data collection. (A) Process of collecting normal datasets. The pie chart (left) shows the percentages of normal protein images with different levels of expression reliability in HPA version 11. Protein images with high and medium reliability corresponding to six subcellular locations in 11 tissues were collected (Supplementary Table S1). The overlapping part of the two circles represents overlap on the protein level because some proteins have different reliability levels in different tissues. For example, ornithine carbamoyltransferase is one such protein because its reliability of expression in liver is high while in the colon it is medium. The IDN is randomly selected from the non-overlapping proteins and avoids protein overlap with the training set. The ADN and BDN are composed of the remaining images with high and medium reliability levels, respectively. Note that IDN intersects with neither ADN nor BDN at the protein level. (B) Some examples of protein images with different reliability levels and subcellular locations. (C) Summary of all the datasets used in this study. The CDC is built from images of 348 proteins in cancerous tissues, where the 348 proteins are those whose images in the corresponding normal tissues have high reliability of protein expression. The IBD contains 10 proteins that were reported in the literature to be translocated in human cancers (Supplementary Table S2). In the column of expression reliability, H means high and M means medium.
2 Methods
2.1 Datasets
Our image data were extracted from the HPA database, where the reliability of the annotated protein expression data is scored as high, medium, low and very low quality, depending on the consistency of the expression profile with the available literature (Uhlen et al., 2010). To compromise between image quality and model generality, we used the top two categories of IHC images, i.e. high and medium reliability levels (Fig. 1). Three normal datasets with high and medium reliability levels were used, where the datasets ADN and BDN are for training and the independent dataset (IDN) is for testing. In the experiments, we evaluated different supervised and semi-supervised algorithms on the IDN, which is not contained in the training set for all the training stages. It should be noted that not all of the medium quality images in the BDN dataset were used. Only those that are capable of improving model performance were selected according to our semi-supervised strategy.
The cancer dataset (CDC) contains 21 920 images, which were selectively added into the training set to improve prediction performance for proteins in cancerous tissues. One hundred and forty-seven images corresponding to 10 biomarker proteins in normal and cancerous tissues were retrieved from the HPA database to form the independent biomarker dataset (IBD). This dataset was used to validate whether the sensitivity of detecting the subcellular location difference between normal and cancer states is improved by incorporating the cancer data into training.
All these datasets are from 11 human tissues, i.e. breast, colon, liver, lung, lymph node, ovary, pancreas, prostate, kidney, thyroid gland and urinary bladder. They involve six major cellular organelles: cytoplasm, endoplasmic reticulum, Golgi apparatus, mitochondria, nucleus and vesicles. Among all the proteins in our datasets, 26% are multilabel proteins that belong to two or three organelles simultaneously. It should be noted that the label of each protein was obtained from the annotation of its immunofluorescence (IF) images with the same antibodies.
2.2 Image preprocessing and feature extraction
Because each original HPA image is the fusion of DNA and protein, the linear spectral separation method was used to separate DNA and protein channels (Xu et al., 2013). Then we extracted the Haralick texture features, DNA distribution features and local binary patterns (LBP) features from these two channels (Nanni et al., 2010; Tahir et al., 2012; Xu et al., 2013). Each of 10 Daubechies filters can generate 836 Haralick features. They are used to create separate feature sets referred to as db1 through db10. The dimensions of DNA distribution and LBP features are 4 and 256, respectively. A feature vector of 1096 components is used to represent the image in each Daubechies filter space. Many previous studies have demonstrated that feature selection from the high-dimensional vector is useful, so we used stepwise discriminant analysis as it has been demonstrated to work well in this field (Newberg and Murphy, 2008; Xu et al., 2013).
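As an illustration of where the 256-dimensional LBP descriptor comes from, the following is a minimal sketch of a basic 8-neighbour local binary pattern histogram. It is an assumption that the basic (non-uniform, radius-1) variant is the one meant here; the function name `lbp_histogram` and the list-of-rows image representation are hypothetical choices for this sketch, not the authors' implementation.

```python
def lbp_histogram(image):
    """Return a 256-bin histogram of basic 8-neighbour LBP codes.

    `image` is a list of equal-length rows of grey-level numbers.
    Border pixels are skipped because they lack a full neighbourhood.
    """
    # Clockwise neighbour offsets (row, col), starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the centre sets its bit.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist


# Dimensions reported in the text: Haralick + DNA distribution + LBP.
FEATURE_DIM = 836 + 4 + 256
```

Eight binary comparisons give 2^8 = 256 possible codes, which matches the LBP dimension quoted above, and the three feature groups sum to the 1096 components of each Daubechies feature space.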
2.3 Incremental semi-supervised learning
We prepared three datasets, i.e. ADN, BDN and CDC, to construct classifiers. Among them, ADN and BDN are normal datasets with different levels of reliability of protein expression, and CDC is a cancer dataset. The entire ADN dataset was used in our experiments because this dataset has the best quality. Then the samples in the BDN and CDC datasets were selectively added to the training set by semi-supervised learning. A flow chart of the proposed method is shown in Figure 2A.

Fig. 2. Incremental process of iteratively adding candidate samples into the training set. (A) Flow chart of the iterative process. The initial training set is either the entire ADN dataset or the entire ADN dataset plus a selected subset of the BDN dataset, while the candidate samples to be included are images in either the BDN or CDC datasets, respectively. As the iterations proceed, the training set grows and the number of candidate samples decreases. (B–D) Results of iteratively adding BDN into the training set (initially ADN) using the proposed protocol with db7 features. (E–G) Results of iteratively adding CDC into the training set (initially ADN and a selected subset of BDN) using the proposed protocol with db7 features. (D, G) Effects caused by updating the training set in each round. The effect (effj) defined by Equation (1) is used for determining the stop condition of iterations. The model is considered stable when effj is smaller than a threshold value (<0.01 in this study).
2.3.1 Incorporating new samples
A sample is added to the training set only if its predicted label set agrees with the annotation in HPA and, where several classifiers are involved, with the predictions of the other classifier(s). This is because such images have more obvious discriminative features for a certain class and they can therefore help to enhance the classification boundary of the current model. Since there is no subcellular location annotation of proteins in cancerous tissues in the HPA, we compare the prediction output to the annotation of the corresponding proteins in normal tissues to judge whether a sample in the CDC dataset should be selected or not. This is reasonable given that more than 95% of proteins are actually not cancer biomarkers (Glory et al., 2008). Note that when adding the samples from the CDC set, the initial classifier(s) are the resulting classifier(s) after adding the BDN set. This ensures the generality of the final predictor for both normal and cancer proteins. To test different strategies, we have implemented three training modes, i.e. single-classifier mode, two-classifier mode and three-classifier mode. Details of their screening criteria to judge which samples need to be added are presented as follows.
The single-classifier mode constructs just one classifier, which is iteratively updated until the stop condition is reached. Before the iteration process, an initial classifier is trained using the entire ADN dataset. In each iteration round, the classifier is used to predict the subcellular locations of the images in the candidate sample set, and those images whose predicted subcellular locations are the same as the annotations in HPA are selected and put into the training set. The classifier is then updated based on the new training set, which is ready for the next iteration.
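The single-classifier loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `self_train`, `train_fn` and `predict_fn` are hypothetical names, samples are `(features, label_set)` pairs, and the loop stops when no candidate is selected or after `max_rounds`, which is only a stand-in for the effect-based stop condition used in the paper.

```python
def self_train(train_fn, predict_fn, labeled, candidates, max_rounds=20):
    """Iteratively move candidates whose prediction matches their annotation.

    labeled, candidates: lists of (features, label_set) pairs, where
    label_set is a frozenset of location names (the HPA annotation).
    train_fn(pairs) -> model; predict_fn(model, features) -> frozenset.
    """
    training = list(labeled)
    pool = list(candidates)
    model = train_fn(training)
    for _ in range(max_rounds):
        agreed, remaining = [], []
        for features, labels in pool:
            # Keep only candidates whose predicted label set equals the annotation.
            if predict_fn(model, features) == labels:
                agreed.append((features, labels))
            else:
                remaining.append((features, labels))
        if not agreed:
            break  # stand-in stop rule (assumption), not the paper's criterion
        training.extend(agreed)
        pool = remaining
        model = train_fn(training)  # update the classifier on the enlarged set
    return model, training, pool
```

Any classifier can be plugged in through the two callables; in the setting of this article they would wrap the multilabel SVM predictor, with the entire ADN as `labeled` and BDN or CDC as `candidates`.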
In the two-classifier mode, a predictor is composed of two classifiers, i.e. C1 and C2, whose initial models are trained on A1 and A2, which are generated from the ADN dataset via the bootstrap sampling method (Efron and Tibshirani, 1994). This sampling method randomly draws n independent samples with replacement from the original pooled set, where n is the number of samples in the pooled set. In this study, we sampled 4224 times with replacement from the ADN space and obtained approximately 63.2% of ADN images after discarding repeated images. This step makes A1 and A2 different subsets of ADN, which promotes the diversity of the initial models of C1 and C2. The candidate sample set was duplicated to two sets, B1 and B2, which were used for updating C1 and C2, respectively. In each iterative round of training, C1 is first employed to predict the subcellular locations of the images in B2, then those images whose predicted subcellular locations are exactly the same as the annotations in HPA are removed from B2 and added to A2 for updating the C2 model. Analogously, A1, the training set of C1, is extended by predicting B1 with C2. As the iterations proceed, the size of B1 and B2 decreases while that of A1 and A2 increases.
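The 63.2% figure follows from the bootstrap itself: the chance that a given image is drawn at least once in n draws with replacement is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632 for large n. The sketch below checks this numerically for n = 4224; the function names and the fixed seed are illustrative assumptions.

```python
import random


def bootstrap_sample(n, rng):
    """Draw n indices uniformly with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]


def unique_fraction(indices, n):
    """Fraction of the n original items that appear at least once."""
    return len(set(indices)) / n


n = 4224                 # number of draws reported in the text
rng = random.Random(0)   # fixed seed, used here only for repeatability
a1 = bootstrap_sample(n, rng)
a2 = bootstrap_sample(n, rng)

# Probability that a given image is drawn at least once.
expected = 1 - (1 - 1 / n) ** n
print(round(expected, 4))
print(unique_fraction(a1, n), unique_fraction(a2, n))
```

Two independent draws cover different subsets of the images, which is what gives the two initial classifiers different starting points.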
The three-classifier mode trains three classifiers, i.e. C1, C2 and C3, by three different training sets, i.e. A1, A2 and A3, which are also initially constructed by using bootstrap sampling. Then the candidate sample set was duplicated to three sets, B1, B2 and B3, for updating the three classifiers, respectively. In each round, C1 and C2 are used to predict the subcellular locations of the images in B3. Those images whose label sets outputted from C1 and C2 are both the same as the HPA annotation were removed from B3 and added to A3 for updating C3. A1 and A2 were updated in an analogous way based on the output from the other two classifiers.
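The selection rule shared by the two- and three-classifier modes can be written as one small function: a candidate is handed to the classifier being updated only when every peer classifier reproduces its HPA annotation exactly. This is a sketch with hypothetical names (`select_agreeing`, and plain Python sets as label sets), not the authors' implementation.

```python
def select_agreeing(peer_predictions, annotations):
    """Return indices of candidates that every peer classifier labels correctly.

    peer_predictions: one list of predicted label sets per peer classifier
        (one list in two-classifier mode, two lists in three-classifier mode).
    annotations: list of annotated label sets, aligned with the predictions.
    """
    selected = []
    for i, truth in enumerate(annotations):
        # Exact label-set equality is required for every peer classifier.
        if all(preds[i] == truth for preds in peer_predictions):
            selected.append(i)
    return selected


annotations = [{"nucleus"}, {"cytoplasm", "nucleus"}, {"vesicles"}]
c1_preds = [{"nucleus"}, {"cytoplasm", "nucleus"}, {"Golgi apparatus"}]
c2_preds = [{"nucleus"}, {"cytoplasm"}, {"vesicles"}]

# Three-classifier mode: C1 and C2 jointly choose samples for C3.
for_c3 = select_agreeing([c1_preds, c2_preds], annotations)
# Two-classifier mode: C1 alone chooses samples for C2.
for_c2 = select_agreeing([c1_preds], annotations)
```

Requiring agreement from two peers is stricter than requiring it from one, so the three-classifier mode tends to admit fewer candidates per round.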
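2.3.2 Stopping condition
The iterations stop once the effect of updating the training set, effj, falls below a threshold (0.01 in this study), at which point the model is considered stable (Fig. 2D and G).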
2.4 Dynamic threshold criterion
Here, we used the support vector machine (SVM) as the classification model, implemented with the LIBSVM-3.17 package (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). The radial basis function was used as the kernel and its optimal width parameter was calculated by the data-driven calculator GFO (Lei et al., 2012). To handle multilabel proteins that can coexist in multiple subcellular locations, the binary relevance (BR) multilabel algorithm was used (Boutell et al., 2004). According to BR, one binary SVM model is trained to predict the relevance of test images to each class, so each BR classifier contains six SVM models (Xu et al., 2013). A six-dimensional (6D) score vector [s1, s2, … , s6] is obtained per test image, where each score component represents the confidence of the input belonging to the corresponding class (six subcellular locations). Given this real-valued confidence score vector, it remains to decide which class or classes should be assigned to a sample.
In a previous work, we investigated the top criterion (T-criterion) and the threshold criterion (S-criterion) to decide the label sets in multilabel classifications (Xu et al., 2013). The T-criterion considers that the label set consists of the labels with positive scores, and if all the scores are negative, the label with the maximum score is considered as the unique label. The assumption of the S-criterion is that the score values corresponding to the real labels are the largest, and, in the case of a multiplex sample, its multiple labels will have similar scores. So in the S-criterion, a threshold is determined to measure whether a score is close enough to the largest one. However, it is a static threshold that is applied to all the images to be classified. A static unified threshold may not fit for all images because the scales of score vectors for different images can be variable, especially for the images in different classes.
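The two static criteria can be sketched as follows. The T-criterion follows the description above directly. For the S-criterion, the text does not state how closeness to the largest score is measured, so the sketch assumes an absolute difference compared against a fixed threshold; the function names and the ordering of `LOCATIONS` are also assumptions made for illustration.

```python
LOCATIONS = ["cytoplasm", "endoplasmic reticulum", "Golgi apparatus",
             "mitochondria", "nucleus", "vesicles"]


def t_criterion(scores):
    """Labels with positive scores; if none, the single top-scoring label."""
    positive = {LOCATIONS[i] for i, s in enumerate(scores) if s > 0}
    if positive:
        return positive
    best = max(range(len(scores)), key=lambda i: scores[i])
    return {LOCATIONS[best]}


def s_criterion(scores, threshold):
    """Labels whose score lies within `threshold` of the largest score.

    Assumption: closeness is the absolute difference to the top score.
    """
    top = max(scores)
    return {LOCATIONS[i] for i, s in enumerate(scores) if top - s <= threshold}


scores = [2.0, -1.0, 1.8, -0.9, 0.1, -1.3]
print(t_criterion(scores))
print(s_criterion(scores, 0.3))
```

Under this reading, the weakness of a static threshold is easy to see: multiplying every score by a constant leaves the ranking unchanged but changes which labels fall within the fixed margin of the top score.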



Fig. 3. Illustration of the process of determining parameters for the D-criterion. Two constant parameters, t and θ, are needed in this criterion (Equation 2). Suppose the ith score of a sample outputted from the classifier is si. When deciding whether the label i should be assigned to the predicted label set, we defined H1 to denote yes and H2 to denote no. t is set to distinguish H1 and H2, while θ is set to ensure that the labels with high scores are not missed. Both parameters are determined by the maximum a posteriori principle from the score vectors of the training set obtained by 5-fold cross validation. (A) The histogram of tdif1. (B) The histogram of tdif2. tdif1 and tdif2 are tdif values corresponding to H1 and H2, respectively (Equation 3). (C) The fitting curves. The parameter t is obtained as the intersection point. (D) The histogram of si when H1 happens. (E) The histogram of si. (F) The fitting curves. θ is set to ensure the ratio between the two regions of integration is 0.95. This figure is based on the model trained by ADN with db7 features.

The statistics needed to determine t and θ are based on the score vectors obtained by using 5-fold cross validation on the training set. The calculation process is given in Figure 3.
2.5 Evaluation metrics
Because our datasets contain multilabel proteins, five multilabel classification metrics, i.e. subset accuracy, accuracy, recall, precision and average label accuracy, were employed to evaluate the performance of the predictors (see Supplementary text for details). Among them, we mainly use the subset accuracy, which is the most stringent one since it requires the predicted label set to be exactly the same as the true label set. In addition, we also measured the sensitivity and AUC of each binary classifier in the models (see Supplementary text for details).
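The exact definitions are given in the Supplementary text, which is not reproduced here. As a rough guide, the sketch below uses the commonly cited example-based definitions of these five metrics; treating them as the ones used in this article is an assumption, as are the function and key names. It also assumes every true and predicted label set is non-empty.

```python
def multilabel_metrics(true_sets, pred_sets, all_labels):
    """Example-based multilabel metrics averaged over samples.

    true_sets, pred_sets: aligned lists of non-empty label sets.
    all_labels: every possible label, used for per-label accuracy.
    """
    n = len(true_sets)
    subset = accuracy = precision = recall = 0.0
    label_hits = 0
    for y, z in zip(true_sets, pred_sets):
        inter = len(y & z)
        subset += (y == z)               # exact match of the whole label set
        accuracy += inter / len(y | z)   # overlap relative to the union
        precision += inter / len(z)      # share of predicted labels that are true
        recall += inter / len(y)         # share of true labels that are predicted
        # Per-label agreement: presence or absence must match.
        label_hits += sum((lab in y) == (lab in z) for lab in all_labels)
    return {
        "subset_accuracy": subset / n,
        "accuracy": accuracy / n,
        "precision": precision / n,
        "recall": recall / n,
        "average_label_accuracy": label_hits / (n * len(all_labels)),
    }
```

Subset accuracy can never exceed the other four values on the same predictions, which is why a system can report a subset accuracy near 50% alongside a much higher average label accuracy.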
3 Results
3.1 Baseline supervised model results
As a baseline, the most straightforward supervised method was used to train classifiers for comparison. We took the entire ADN, entire BDN and a combination of them (ADN + BDN), respectively, as training sets to construct classifiers. Then these classifiers were tested on the independent IDN dataset, and generated the results of simple supervised learning for comparison using the T-criterion (Fig. 4A) and the D-criterion (Fig. 4B), respectively.

Fig. 4. Results of supervised learning and semi-supervised learning tested on the independent IDN dataset. (A) Results of baseline supervised classifiers trained on ADN, BDN and ADN + BDN datasets, respectively, using the T-criterion. (B) Results of supervised classifiers using the D-criterion. (C) Comparison of our classifiers, trained by adding BDN to ADN using the semi-supervised strategy in three modes, with two other semi-supervised classifiers from the literature. (D) Ensemble by fusing the classifiers after adding BDN. (E) Results of classifiers trained by subsequently adding CDC to the training set using the semi-supervised strategy in three modes. AsemiB1 and AsemiBC1 mean using one-classifier mode, AsemiB2 and AsemiBC2 mean using two-classifier co-training mode, and AsemiB3 and AsemiBC3 mean using three-classifier tri-training mode. (F) Ensemble by fusing the classifiers after adding CDC. The ensemble counterparts of AsemiB1, AsemiB2, AsemiB3, AsemiBC1, AsemiBC2 and AsemiBC3 are each constructed by fusing the 10 single classifiers of db1–db10. AsemiBE is the ensemble of the three ensemble classifiers built without CDC, and AsemiBCE is the ensemble of the three built after adding CDC. (G) Comparison of subset accuracies between ensemble classifiers and single classifiers.
It can be seen from Figure 4A and B that: (1) The D-criterion outperforms the T-criterion, demonstrating the effectiveness of the D-criterion; (2) Overall, the subset accuracies of classifiers trained on ADN are better than those on BDN, indicating that the image quality can affect the model performance; (3) Interestingly, in some cases, the results of ADN + BDN are not better than those only using ADN, indicating that not all of the medium quality images in BDN have a positive effect on performance.
The first observation suggests that a dynamic threshold is better because it adapts to each test sample, so we use the D-criterion in the following experiments. The second and third observations suggest that adding all the BDN samples to ADN to train a supervised model sometimes does not improve performance. The reason could be that not all of the samples in the BDN are complementary to the ADN; furthermore, some low-quality samples in the BDN will degrade the model. This motivated us to explore a better way to take advantage of the candidate image samples rather than simply employing all of them.
3.2 Improvements by selectively adding medium-reliability data
The entire ADN was used as the initial training set, and then according to the semi-supervised iteration framework, not all of the BDN images, but only those which improve model performance were iteratively selected into the training set. The final results are three semi-supervised predictors, which are denoted as AsemiB1 (one-classifier mode), AsemiB2 (two-classifier mode) and AsemiB3 (three-classifier mode), corresponding to the three training modes, respectively.
The classifier of each round is tested on IDN, and the changes of subset accuracies are shown in Figure 2B. The changes in the number of added images and the effects in each iterative round are illustrated in Figure 2C and D. It can be seen that as the rounds proceed, the subset accuracy tends to increase in all modes. All the final subset accuracies when these iterations terminate, i.e. 51, 49 and 51%, are higher than the result of directly adding the entire BDN, which is 46% as shown in Figure 4B. In addition, both the number of added images and the effect value decrease sharply over the iterations. This indicates that the influence of the added images on classification decreases as the rounds proceed. At the end of iterations of the db7 model, 56.75, 61.37 and 52.86% of the images in BDN were chosen and added to the training sets of AsemiB1, AsemiB2 and AsemiB3, respectively. Comparing Figure 4C and B, we can see that all the subset accuracies of the three semi-supervised modes are higher than those of supervised learning. Selectively adding medium-reliability data to the training set thus expands the training sample space, and the resulting gains support the effectiveness of the proposed semi-supervised idea.
Considering that different semi-supervised learning methods have been widely used in recent years (Lee and Madabhushi, 2010; Luo et al., 2013), we also compared our methods with two state-of-the-art semi-supervised algorithms, i.e. low-density separation (LDS) and cost-sensitive semi-supervised SVM (CS4VM). LDS is a graph-based method, which represents each labeled and unlabeled sample as a node and tries to place decision boundaries in regions where there are few data nodes (Chapelle and Zien, 2005). CS4VM incorporates the unlabeled data into the SVM by estimating their label means of misclassification costs (Li et al., 2010). Figure 4C shows the results of LDS and CS4VM when taking ADN as labeled data, BDN as unlabeled data and IDN as testing set. Our proposed methods perform better than LDS and CS4VM on the multilabel dataset of this article. One reason may be that multilabel classification is much more complicated than the single-label case addressed by the two algorithms. For instance, LDS might be unable to accurately find the boundaries in a graph built from multilabel data, because some multilabel samples lie near the low-density areas and confuse the decisions.
3.3 Incorporating images from cancer tissues to the model
To enhance the performance of predicting subcellular locations of proteins in cancerous tissues, we consider adding some images from cancerous tissues into the training set to eliminate the transfer prediction error caused by the difference between the normal and cancer data. We conducted an experiment to quantify the differences of patterns between the two states. Based on the proteins in the CDC set, we used the correlation coefficient (CC) to measure the difference between normal and cancer images, where we assumed that proteins in the CDC set did not change their locations in cancer states. This is reasonable when considering that more than 95% of the protein images in the current HPA database are actually not cancer biomarkers (Glory et al., 2008). Each image was represented by its feature vector, and three CC matrices were calculated: the first is the intra-CC in the normal images group, the second is the intra-CC in the cancer images group and the third is the inter-CC between normal and cancer sets. Figure 5 shows the averaged CC values based on six subcellular locations. It can be seen that the inter-CC values between normal and cancer images are lower than the intra-CC values in all cases. In addition, we also calculated the P-values with Student's t-test between the normal and cancer datasets, and the P-values of all the subcellular locations are <0.05. These results demonstrate that even for the same organelle, there is a difference between the normal and cancer data. This suggests that the transfer method of using normal data as the training set to predict the cancer data may miss some specific features of proteins in the cancer state.
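The intra- and inter-group comparison can be illustrated with a small sketch. It assumes the CC is the Pearson correlation between two feature vectors and that each group value is the mean over all image pairs; both points, and the function names, are assumptions rather than details given in the text.

```python
from itertools import combinations, product
from statistics import mean


def pearson(u, v):
    """Pearson correlation between two equal-length feature vectors.

    Vectors with zero variance are not handled (division by zero).
    """
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


def intra_cc(group):
    """Mean CC over all distinct pairs of images within one group."""
    return mean(pearson(u, v) for u, v in combinations(group, 2))


def inter_cc(group_a, group_b):
    """Mean CC over all pairs with one image taken from each group."""
    return mean(pearson(u, v) for u, v in product(group_a, group_b))
```

For one subcellular location, `intra_cc(normal)`, `intra_cc(cancer)` and `inter_cc(normal, cancer)` would correspond to the three bars compared in Figure 5, with each image given as its selected feature vector.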

Fig. 5. Comparison of intra-normal CC, intra-cancer CC and inter normal-cancer CC values. In the statistics, the high expression level dataset and the CDC dataset are used as the normal and cancer datasets, respectively. The db7 features are used and the feature dimension is 80.
After adding BDN as described in the section above, we obtained three classifiers, i.e. AsemiB1, AsemiB2 and AsemiB3, by semi-supervised learning. Following the incremental selective learning protocol, images from CDC were subsequently added to these classifiers, and we obtained AsemiBC1, AsemiBC2 and AsemiBC3 (Figs. 2E–G and 4E). It can be seen that the subset accuracies of the classifiers on the independent IDN set fluctuate and decline slightly, because the added cancer data affect the prediction performance on normal data. This also highlights the difference between normal and cancer data. Nevertheless, the decline in performance is not significant, and the subset accuracies still outperform the baseline results from supervised models.
3.4 Performance of ensemble classifiers
Since an ensemble of multiple classifiers generally achieves better performance, we constructed ensemble classifiers by combining the 10 classifiers with db1–db10 features. The fusion method averages all the score vectors from the 10 single classifiers to get a final six-dimensional (6D) vector for each query image. These ensemble classifiers are tested on IDN to show their effectiveness on the normal dataset (Fig. 4D and F). By comparing the results between the ensemble classifier and the single classifiers, we find that the ensemble classifier outperforms the single classifier on the IDN dataset. For example, a 2% improvement of the subset accuracy was observed for the corresponding ensemble classifier compared with the single classifier AsemiBC2 on db7. The other merit of the ensemble strategy is that it can significantly reduce the negative bias introduced by adding the cancer data to the training set. For instance, the subset accuracy of the single AsemiBC1 classifier on IDN with the db7 feature is 47.5% (Fig. 4D), which is 4.25% lower than that of its ensemble counterpart.
One final ensemble predictor without CDC and one final ensemble predictor with CDC were created. All the classifiers trained without CDC were fused to create AsemiBE, and all the classifiers trained with CDC were fused to create AsemiBCE. Both achieve good performance on the IDN testing set (Fig. 4G). It is worth pointing out that besides the most stringent metric in multilabel classification, subset accuracy (Fig. 4), we also used other indices to evaluate AsemiBE and AsemiBCE, and their results can be seen in Supplementary Tables S3 and Supplementary Data. For example, the average label accuracy, which indicates the reliability of prediction for single locations, reaches 87.04% for the final system (Supplementary Table S3), which implies reliable detection of translocation from or to a specific location.
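The score-averaging fusion described above amounts to an element-wise mean over the score vectors of the individual classifiers. A minimal sketch, with a hypothetical function name and made-up example scores:

```python
def fuse_scores(score_vectors):
    """Average aligned score vectors element-wise.

    score_vectors: one score vector per single classifier (for example
    ten vectors of six scores each, one per db1-db10 feature space).
    """
    n = len(score_vectors)
    # zip(*...) groups the scores of the same location across classifiers.
    return [sum(column) / n for column in zip(*score_vectors)]


# Illustrative values only: two classifiers, two locations.
fused = fuse_scores([[1.0, -1.0], [0.0, 0.5]])
print(fused)
```

The fused vector is then passed to the label-decision criterion in the same way as the output of a single classifier.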
3.5 Detecting protein translocations of cancer biomarkers
The IBD set, containing 10 reported biomarker proteins, was used to validate whether the sensitivity of translocation detection can be enhanced by utilizing cancer data in the training phase. We compared the prediction results on the IBD set from AsemiBE and AsemiBCE, where the former did not incorporate the CDC cancer data into training, whereas the latter did. To quantify the sensitivity of detecting the subcellular location changes, in addition to comparing the predicted and reported location labels in the normal and cancer conditions, we also conducted independent sample t tests on the predicted score vectors to evaluate the significance of the location changes (Supplementary Fig. S3). The comparison results and P-values of the changes are shown in Table 1, from which we can see that:
The proteins Bax and cyclin D1 show that adding the CDC dataset makes the classifiers more sensitive to the location changes occurring in cancer. Specifically, Bax partly translocates from the cytoplasm to the mitochondrion when lymphoma occurs (Nechushtan et al., 1999). This translocation is missed by the predictor trained only on normal data, but is picked up by AsemiBCE, which was trained on both normal and cancer data. The protein cyclin D1 normally shuttles between the cytoplasm and the nucleus. However, in ovarian cancer cyclin D1 is found only in the nucleus (Gladden and Diehl, 2005). AsemiBE predicts the locations of cyclin D1 in cancer as both the nucleus and mitochondria, whereas AsemiBCE correctly predicts its cancer location as the nucleus only.
The loss of nuclear localization of PTEN in pancreatic cancer is correctly predicted by both AsemiBE and AsemiBCE (Perren et al., 2000), demonstrating that the machine-learning systems are effective for the detection of protein mislocalization.
AsemiBCE predicts the IBD proteins in their normal states better than AsemiBE. For example, the protein BAG-1 is reported to reside in the nucleus in normal conditions and to translocate to the mitochondria during colorectal cancer (Takayama et al., 1998). AsemiBE predicted that BAG-1 localizes in both the cytoplasm and nucleus in the normal state, whereas AsemiBCE predicted only a nuclear location, which is experimentally correct. Other examples include NQO1 and GOLGA5.
The P-values also reveal the improved sensitivity of AsemiBCE for detecting protein translocations. The lower the P-value, the more significant the change. There are a total of 16 experimentally known changed locations for the 10 proteins. Twelve of them have lower P-values with AsemiBCE, whose P-values range over 0.0003–0.6167, compared with 0.0010–0.8560 for AsemiBE. These results suggest that the sensitivity of detecting protein subcellular location changes is enhanced by incorporating the cancer data into the model construction.
Although some improvements can be observed (with lower P-values) by incorporating the cancer images into the construction of the classification system, there is still considerable room for improvement. For instance, there are still cases where neither of the two predictors gives a completely correct prediction. This suggests that substantial future efforts are needed.
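The significance test applied to the predicted scores can be sketched as follows. This is an illustration only: it assumes the classical equal-variance (pooled) form of the independent sample t test, which the text does not specify, and the function name and score values are hypothetical. It uses only the Python standard library and therefore returns the t statistic and degrees of freedom; the P-value is then read from the Student t distribution with that many degrees of freedom (for instance, `scipy.stats.ttest_ind` returns both the statistic and the P-value).

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a two-sample t test with pooled variance."""
    n1, n2 = len(a), len(b)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two scores")
    df = n1 + n2 - 2
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


# Hypothetical scores for one location of one protein (not data from the study):
# one score per normal-tissue image and one per cancer-tissue image.
normal_scores = [0.62, 0.58, 0.65, 0.60]
cancer_scores = [0.41, 0.45, 0.38, 0.44]

t, df = pooled_t_statistic(normal_scores, cancer_scores)
print(f"t = {t:.3f}, df = {df}")
```

A large absolute t value (small P-value) for a location indicates that the scores for that location differ between normal and cancer images, which is how a changed location is flagged as significant.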
Table 1. Comparison between literature descriptions and the results of predicting IBD by ensemble classifiers
Protein | Tissue | Reported by literature (normal → cancer) | Prediction by AsemiBE (normal → cancer; P-values of changed locations) a,b | Prediction by AsemiBCE (normal → cancer; P-values of changed locations) a,b |
---|---|---|---|---|
Bax | Lymph node | Cyto. → Cyto. & Mito. | Cyto. → Cyto. (Mito. 0.6336) | Cyto. → Cyto. & Mito. (Mito. 0.4402) |
cyclin D1 | Ovary | Cyto. & Nucl. → Nucl. | Cyto. → Nucl. & Mito. (Cyto. 0.0430) | Cyto. → Nucl. (Cyto. 0.0319) |
PTEN | Pancreas | Cyto. & Nucl. → Cyto. | Cyto. & Nucl. → Cyto. (Nucl. 0.3853) | Cyto. & Nucl. → Cyto. (Nucl. 0.5570) |
BAG-1 | Colon | Nucl. → Mito. | Nucl. & Cyto. → Nucl. & Cyto. (Nucl. 0.5001, Mito. 0.6513) | Nucl. → Nucl. & Cyto. (Nucl. 0.5944, Mito. 0.3463) |
GOLGA5 | Thyroid gland | Gol. → Mito. | Gol. & Mito. & Nucl. → Gol. (Gol. 0.8560, Nucl. 0.0403) | Gol. → Cyto. (Gol. 0.2699, Nucl. 0.5522) |
NQO1 | Lung | Cyto. → Nucl. | Nucl. → Cyto. (Cyto. 0.0010, Nucl. 0.0798) | Cyto. → Cyto. (Cyto. 0.0003, Nucl. 0.0441) |
SOX9 | Breast | Nucl. → Cyto. | Nucl. → Nucl. (Cyto. 0.2628, Nucl. 0.5170) | Nucl. → Nucl. (Cyto. 0.0741, Nucl. 0.1143) |
p53 | Breast | Nucl. → Nucl. & Cyto. | Nucl. → Nucl. (Cyto. 0.1315) | Nucl. → Nucl. (Cyto. 0.0741) |
TOP2A | Lung | Nucl. → Cyto. | Nucl. → Nucl. (Cyto. 0.2130, Nucl. 0.7945) | Nucl. → Nucl. (Cyto. 0.1286, Nucl. 0.5853) |
IGFBP | Breast | Nucl. → Cyto. | Cyto. → Cyto. (Cyto. 0.4517, Nucl. 0.7419) | Cyto. → Cyto. (Cyto. 0.6124, Nucl. 0.6167) |
aEach prediction has two parts: the first is the predicted subcellular location labels in normal and cancer conditions, respectively, by the classifier; the second (in parentheses) is the P-values measuring the subcellular location changes when cancer occurs (column 3), which are calculated by the independent sample t test on the predicted scores for normal and cancer images.
bThose translocations that have lower P-values are bold.
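The counts quoted in the text can be checked directly from the P-values in Table 1. The sketch below copies the 16 per-location P-values from the table and counts how many are lower for AsemiBCE than for AsemiBE; the variable names are illustrative.

```python
# (protein, location) -> (P-value with AsemiBE, P-value with AsemiBCE), from Table 1.
P_VALUES = {
    ("Bax", "Mito."): (0.6336, 0.4402),
    ("cyclin D1", "Cyto."): (0.0430, 0.0319),
    ("PTEN", "Nucl."): (0.3853, 0.5570),
    ("BAG-1", "Nucl."): (0.5001, 0.5944),
    ("BAG-1", "Mito."): (0.6513, 0.3463),
    ("GOLGA5", "Gol."): (0.8560, 0.2699),
    ("GOLGA5", "Nucl."): (0.0403, 0.5522),
    ("NQO1", "Cyto."): (0.0010, 0.0003),
    ("NQO1", "Nucl."): (0.0798, 0.0441),
    ("SOX9", "Cyto."): (0.2628, 0.0741),
    ("SOX9", "Nucl."): (0.5170, 0.1143),
    ("p53", "Cyto."): (0.1315, 0.0741),
    ("TOP2A", "Cyto."): (0.2130, 0.1286),
    ("TOP2A", "Nucl."): (0.7945, 0.5853),
    ("IGFBP", "Cyto."): (0.4517, 0.6124),
    ("IGFBP", "Nucl."): (0.7419, 0.6167),
}

# Locations where the model trained with cancer data gives the smaller P-value.
lower_in_bce = [key for key, (be, bce) in P_VALUES.items() if bce < be]

be_values = [be for be, _ in P_VALUES.values()]
bce_values = [bce for _, bce in P_VALUES.values()]

print(len(P_VALUES), len(lower_in_bce))    # 16 12
print(min(bce_values), max(bce_values))    # 0.0003 0.6167
print(min(be_values), max(be_values))      # 0.001 0.856
```

The four locations where AsemiBE has the lower P-value are PTEN (Nucl.), BAG-1 (Nucl.), GOLGA5 (Nucl.) and IGFBP (Cyto.).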
4 Discussion and conclusions
In this article, we present a new automated bioimage analysis system for sensitively detecting translocated or mislocated proteins in human cancers. The new system features a semi-supervised learning engine, which helps to enlarge the training space by incorporating lower-quality or unlabeled data, which is key to the performance of a statistical model. The other merit of the new system is its capability of predicting proteins that shuttle among multiple subcellular locations; a new dynamic D-criterion is proposed to deal with the multilabel set determination problem by considering the specificity of each protein. The newly developed system opens a new avenue for bioimage-based automated biomarker detection, which suits large-scale data analysis and complements research from biological experiments.
We have shown that selectively incorporating medium-staining normal images with the developed semi-supervised framework helps to improve the classification accuracy on normal images, as demonstrated on the independent test dataset. Some improvements were also observed when the semi-supervised algorithm was used to add selected cancer images to the training set, but considerable room for improvement remains. For instance, some translocated or mislocated cancer biomarkers cannot be completely predicted, especially multilabel proteins.
To further improve the performance of our system, several efforts will be made in future studies. First, we will aim to improve the multilabel classification algorithm by taking the label correlations into account. Multiplex proteins that may shuttle among more than one subcellular location indicate a complex subcellular protein organization in the cell. The benchmark dataset of this study contains 26% multilabel proteins. This ratio may be much higher, reaching approximately 60%, according to a recent study applying IF and fluorescent-protein tagging techniques to mammalian cells (Stadler, 2013). In this article, we transformed the multilabel problem into six binary classification problems, ignoring the correlation among different subcellular locations. It is expected that incorporating correlations, such as proteins coexisting at different locations due to spatial proximity or functional reasons, will be useful for further improving the performance.
Second, our imaging-based studies can be integrated with analysis of non-imaging data, such as proteomics and genomics analyses (Murphy, 2014). Amino acid sequences have been used for predicting protein subcellular locations for many years, and we have developed an efficient sequence-based subcellular location predictor called Cell-PLoc in previous studies (Chou and Shen, 2008; Shen and Chou, 2009). Cell-PLoc can also deal with multilabel proteins and has wide coverage of subcellular components. Merging prediction results from different resources is a potentially effective way to further enhance the sensitivity of translocated protein detection. The multiclassifier mode of this study also provides a feasible combination solution, which enables us to cotrain our image-based and sequence-based software to generate a better protein subcellular location prediction system.
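The transformation into independent binary problems, and the label correlation it discards, can be illustrated with a short sketch. The helper names and the toy annotations are hypothetical; the label list uses the four location abbreviations that appear in Table 1 plus two placeholders (`Other1`, `Other2`) standing in for the remaining two classes, which are not named in this section.

```python
from collections import Counter
from itertools import combinations

# Four abbreviations from Table 1 plus two placeholder names (assumptions).
LABELS = ["Cyto.", "Nucl.", "Mito.", "Gol.", "Other1", "Other2"]


def to_binary_targets(label_sets, labels):
    """Return {label: [0/1 per sample]}, i.e. one independent binary task per label."""
    return {lab: [1 if lab in s else 0 for s in label_sets] for lab in labels}


def cooccurrence(label_sets):
    """Count how often each pair of labels is annotated on the same sample."""
    pairs = Counter()
    for s in label_sets:
        pairs.update(combinations(sorted(s), 2))
    return pairs


# Toy multilabel annotations (hypothetical, not from the benchmark dataset).
samples = [{"Cyto.", "Nucl."}, {"Nucl."}, {"Cyto.", "Nucl."}, {"Mito."}]

targets = to_binary_targets(samples, LABELS)
print(targets["Cyto."])   # [1, 0, 1, 0]
print(targets["Nucl."])   # [1, 1, 1, 0]
print(cooccurrence(samples)[("Cyto.", "Nucl.")])  # 2
```

Each binary task sees only its own 0/1 column, so the fact that Cyto. and Nucl. frequently occur together is invisible to the six classifiers; a correlation-aware method would use co-occurrence information of this kind.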
Acknowledgements
We are grateful to Dr. Jeffrey Brender and Dr. Richard Jang for reading the manuscript.
Funding
This work was supported in part by the National Natural Science Foundation of China [Nos. 61222306, 91130033, 61175024], Shanghai Science and Technology Commission [No. 11JC1404800], a Foundation for the Author of National Excellent Doctoral Dissertation of People’s Republic of China [No. 201048] and the National Institute of General Medical Sciences [GM083107].
Conflict of interest: none declared.
Author notes
Associate Editor: Robert F. Murphy