Human cell structure-driven model construction for predicting protein subcellular location from biological images

Shao, Wei; Liu, Mingxia; Zhang, Daoqiang

doi:10.1093/bioinformatics/btv521

Abstract

Motivation: The systematic study of subcellular location pattern is very important for fully characterizing the human proteome. Nowadays, with the great advances in automated microscopic imaging, accurate bioimage-based classification methods to predict protein subcellular locations are highly desired. All existing models were constructed on the independent parallel hypothesis, where the cellular component classes are positioned independently in a multi-class classification engine. The important structural information of cellular compartments is missed. To deal with this problem for developing more accurate models, we proposed a novel cell structure-driven classifier construction approach (SC-PSorter) by employing the prior biological structural information in the learning model. Specifically, the structural relationship among the cellular components is reflected by a new codeword matrix under the error correcting output coding framework. Then, we construct multiple SC-PSorter-based classifiers corresponding to the columns of the error correcting output coding codeword matrix using a multi-kernel support vector machine classification approach. Finally, we perform the classifier ensemble by combining those multiple SC-PSorter-based classifiers via majority voting.

Results: We evaluate our method on a collection of 1636 immunohistochemistry images from the Human Protein Atlas database. The experimental results show that our method achieves an overall accuracy of 89.0%, which is 6.4% higher than the state-of-the-art method.

Availability and implementation: The dataset and code can be downloaded from https://github.com/shaoweinuaa/.

Contact: dqzhang@nuaa.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

One important task in proteomics research is to explore the natural function of proteins in performing and regulating the activities of an organism at the cell level (Breker and Schuldiner, 2014). It is widely recognized that the function of a protein is closely associated with the cellular compartment in which it resides (Chebira et al., 2007). Proteins can only find their correct interacting molecules at the right place. Thus, subcellular location can provide important clues for understanding the function of a protein. With the breakthroughs in genome sequencing and bioimaging techniques, traditional time-consuming and expensive wet-laboratory approaches cannot keep pace with the rate at which new proteins are identified (Zhang et al., 2006). Hence, finding an automatic computational way to determine the subcellular locations of proteins has become a focus in computational biology (Glory and Murphy, 2007). From the perspective of machine learning, this task can be cast as a multi-class or multi-label classification problem. It follows a two-step framework: we first need a proper feature representation for encoding the protein data, and the resulting features are then fed into a trained machine learning model for label decision. There are two major research categories depending on how the protein data are represented, i.e. one-dimensional amino acid sequences and two-dimensional images (Xu et al., 2013).

On one hand, if a protein is represented by its amino acid sequence, PseAAC (Chou, 2001), PSSM (Jeong et al., 2011; Pierleoni et al., 2011) and gene ontology (Chi, 2010) are among the commonly applied sequence-based features. In the second step, various machine learning algorithms have been proposed. For instance, Yoon and Lee (2012) adopted a boosting framework to accomplish the classification task, and Wang and Li (2013) presented a random label selection method that learns the label correlations from the training dataset to guide the classification of multi-label proteins.

On the other hand, along with the explosive growth of genomic data, we have witnessed great advances in automated microscopic imaging in recent years (Peng et al., 2012). Because images are more intuitive than amino acid sequences, bioimage-based analysis of protein subcellular distribution patterns has attracted much attention. For example, image-based analysis has been used successfully to detect protein biomarkers that dynamically change their subcellular locations in cancerous tissues (Kumar et al., 2014; Xu et al., 2013, 2015).

If proteins are represented with two-dimensional images, e.g. through fluorescent or immunohistochemistry microscopy, the most widely used image features can be grouped into two categories, i.e. global and local features. Among global features, the DNA feature (Boland and Murphy, 2001) is designed to characterize the DNA distribution in a cell image. Since protein and DNA frequently co-occur in a protein image, we can infer the relative position of the protein from the DNA distribution. In addition, the Haralick feature based on the db wavelet is another global feature that describes image texture properties such as inertia and isotropy, and it has been demonstrated to be robust to cell rotations and translations (Murphy et al., 2003; Newberg and Murphy, 2008). Among local features, the LBP feature (Ojala et al., 1996) is the most frequently applied descriptor for characterizing the spatial structure of images involving flat areas, edges and spots. Some extensions have also been reported. Yang et al. (2014) constructed a mixed local feature set by adding two extended forms of LBP, i.e. LTP (Tan and Triggs, 2010) and LQP (Nanni et al., 2010). Coelho et al. (2013) applied the SURF feature to handle the classification problem in cell images.

Since different features have their own advantages, a common strategy is to fuse multiple types of features. For instance, different features are concatenated into one long vector to perform the subsequent classification task (Newberg and Murphy, 2008; Xu et al., 2013; Yang et al., 2014). Intuitively, since a single type of feature cannot reflect all the information in a protein image, fusing multiple types of features is expected to be more promising.

For the design of learning algorithms, Boland et al. (1998) applied neural networks to classify four protein types, and Yang et al. (2014) proposed a probability-based support vector machine (SVM) to predict the subcellular location of proteins in the human reproductive system. Considering that a high proportion of human proteins co-exist at different locations, Xu et al. (2013) designed a multi-label classifier. Other efforts include the multi-resolution approach of Chebira et al. (2007) and the logistic regression algorithm with latent variables proposed by Li et al. (2012).

Although much progress has been achieved in designing different statistical classifiers, to the best of our knowledge, none of the existing image-based classifiers takes the biological cell structure information into consideration, even though such information has already been demonstrated to be effective in solving biological sorting problems (Lin et al., 2011). The basic hypothesis of existing predictors is to treat every cellular component class in parallel, regardless of how the components are organized in the cell. Better performance can be expected when the natural organization structure of the cell components is incorporated into the model construction.

To enable the learned model to incorporate the subcellular component organization structure, we propose a new classifier learning approach by utilizing the error correcting output coding (ECOC) framework (Dietterich and Bakiri, 1995). This new approach can decompose the multi-class problem into several binary classification problems according to the prior human cell structural information. The final decision will then be made by combining the results of these binary classifiers. In the new structure-driven learning approach, we first construct a codeword matrix to reflect the biological structure of cellular compartments with ECOC. Then, for each binary classifier corresponding to the columns of the ECOC codeword matrix, we use kernel combination method to fuse different types of features rather than the direct combination strategy. Finally, we perform the classifier ensemble by combining multiple classifiers via majority voting. The experimental results show that our method performs much better than several state-of-the-art methods, because the proposed approach has incorporated the cell structure prior knowledge into model generation.

2 Materials and methods

2.1 Dataset

Since 2005, researchers have used antibody-based techniques for the functional study of the human proteome and have built the well-known Human Protein Atlas (HPA) database (Uhlen et al., 2005). The recent 13th version of the HPA database involves 86% of the human genome. Specifically, 16 975 genes with 24 028 antibodies have been covered across 46 different normal human tissues and 20 different cancer types.

In this study, we have generated a collection of 1636 immunohistochemistry images with high validation and objective scores (Ponten et al., 2008) from HPA as our benchmark dataset. It contains 21 proteins related to 46 normal human tissues. Each image belongs to one of the seven most frequent subcellular locations, namely, cytoplasm, Golgi apparatus, mitochondrion, vesicles, nucleus, endoplasmic reticulum and lysosome. Table 1 summarizes the distribution of our dataset.

Table 1. The distribution of the benchmark dataset

Category	Size
Cytoplasm	391
Golgi apparatus	228
Mitochondrion	319
Vesicles	139
Nucleus	183
Endoplasmic reticulum	216
Lysosome	160
Total	1636


2.2 Overview of our method

Figure 1 shows the flowchart of our method, which consists of four major steps. First, we extract and select features from the given protein images. Then, we use the ECOC method to transform the multi-class classification problem into a series of binary classification sub-problems according to a pre-defined codeword matrix. Here, the codeword matrix comprises 14 bits. The first six bits are designed according to the biological structure of cellular compartments, which brings more prior information to the learning process. The other eight checking bits are used to strengthen the error-correcting ability of this ECOC-based model. Next, since the db wavelet is employed to obtain the multi-resolution global feature (i.e. the Haralick feature), we construct 10 different SC-PSorter models based on the feature sets extracted with the 10 vanishing moments of the db wavelets. For each SC-PSorter model, we construct 14 multi-kernel-based SVM classifiers corresponding to the columns of the ECOC codeword matrix. Finally, we perform the classifier ensemble by combining those 10 SC-PSorter-based classifiers via majority voting.

Fig. 1. The flowchart of our proposed method

2.3 Feature extraction and selection

For protein images, different types of features (i.e. global and local features) are expected to provide complementary information (Yang et al., 2014). Therefore, we use both types of features to describe protein images. Specifically, as the global feature, we select the Haralick feature with 10 different vanishing moments from 1 to 10; for each vanishing moment, an 836-dimensional feature is obtained. In addition, a four-dimensional global DNA feature is incorporated because of its value in inferring the relative position of the protein. As the local feature, we choose the most widely used LBP feature, which is a 256-dimensional vector. Hence, for every protein image, we obtain a 1096-dimensional descriptor (836 + 4 + 256) for each db wavelet if we directly concatenate them. After that, to reduce the computational cost and avoid overfitting, we select the most discriminative features by applying the stepwise discriminant analysis method (Huang et al., 2003).

2.4 Error correcting output coding

In this article, our goal is to determine the subcellular location that a protein image belongs to. Since there are seven subcellular locations, this problem can be regarded as a multi-class classification problem. Nowadays, multi-class classification is an important issue in many machine learning domains, such as text classification (Nigam et al., 2000), and medical analysis (Lu et al., 2005). There are two main lines to deal with such multi-class learning problems, including ‘direct multi-class representation’ and ‘(indirect) decomposition design’. The first line aims to design multi-class classifiers directly, such as neural network (Boland et al., 1998) and multi-class SVMs (Yang et al., 2014). In contrast, the second line endeavors to first transform the original multi-class problem into several binary classification problems and then to combine the results of these binary classifiers for making final decision. As a typical indirect decomposition way to deal with multi-class problems, ECOC (Dietterich and Bakiri, 1995; Liu et al., 2015, b) is one of the representative methods. Specifically, there are three main steps in ECOC-based classification system, including (i) encoding, which decomposes the original problem into several binary classification problems; (ii) binary classifier learning and (iii) decoding, which makes a final decision based on the outputs of those binary classifiers. In the following, we will introduce each step in detail.

2.4.1 Encoding

In the encoding procedure, a codeword matrix $M_{k \times l}$ is employed to decompose the original multi-class problem into several binary sub-problems. Here, the rth ($r = 1, 2, \dots, k$) row of $M$ (i.e. $M_{r}$) represents the codeword of the rth class, while each column of $M$ assigns a new class label to each of the original classes. The elements in each column of the codeword matrix can be set as −1, 0 and 1 in ternary ECOC encoding methods, and as −1 and 1 in binary ECOC encoding methods (Pujol et al., 2006). Below, we briefly introduce two typical ECOC encoding strategies, namely one-versus-all coding and forest coding (Escalera et al., 2007).

1) One-versus-all coding

In this approach, $k$ different binary classifiers are built, each of which learns to distinguish one class versus the others. In the codeword matrix, all of the diagonal elements are set as 1, while the others are set as −1. In Equation (1), we show a codeword matrix $M_{k \times k}$ using the widely used one-versus-all coding strategy, which transforms the k-class classification problem into k one-versus-all binary classification problems:

$$M = \begin{pmatrix} 1 & -1 & \cdots & -1 \\ -1 & 1 & \cdots & -1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & 1 \end{pmatrix}_{k \times k} \qquad (1)$$
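As an illustration of Equation (1), the following minimal Python sketch builds the one-versus-all codeword matrix for k classes. The function name `one_versus_all_matrix` is hypothetical and is not taken from the released code.

```python
def one_versus_all_matrix(k):
    """Return the k x k one-versus-all codeword matrix as a list of rows.

    Row r is the codeword of class r: +1 on the diagonal, -1 elsewhere.
    """
    return [[1 if r == c else -1 for c in range(k)] for r in range(k)]


# Seven subcellular locations give a 7 x 7 matrix, i.e. seven dichotomies.
M = one_versus_all_matrix(7)
for row in M:
    print(row)
```

Each column of the printed matrix defines one binary problem in which a single class is positive and the remaining six are negative.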

2) Forest coding

In this data-dependent coding strategy, the codeword matrix is completely determined by the partition of the dataset by using the decision tree algorithm. Here, when building decision tree, each node corresponds to the best bi-partition of the set of classes by maximizing the mutual information between different types of samples. The process is recursively applied until sets of single classes corresponding to the tree leaves are obtained.

2.4.2 Binary classifier learning

The second step is to train multiple binary classifiers based on the codeword matrix $M$. Specifically, each binary classifier corresponds to a specific column of the codeword matrix, where the samples labeled as 1 are used as positive instances, and the samples labeled as −1 are regarded as negative instances. It is worth noting that instances labeled as 0 are not used for training the classifier in ternary encoding methods. Given a codeword matrix $M$ that contains l columns, we learn a total of l binary classifiers. In the literature, these binary classifiers are usually taken directly from existing classifiers (e.g. SVM) (Escalera et al., 2007).
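The selection of training instances for one dichotomy can be sketched as follows. This is a minimal illustration under the assumption that the codeword matrix is stored as a dictionary from class name to codeword; the helper name `column_training_set` and the toy feature vectors are hypothetical.

```python
def column_training_set(samples, labels, codeword, column):
    """Select the training data for one ECOC dichotomy.

    samples  : list of feature vectors
    labels   : list of class names, one per sample
    codeword : dict mapping class name -> list of codes in {-1, 0, 1}
    column   : index of the dichotomy (column of the codeword matrix)
    """
    X, y = [], []
    for x, label in zip(samples, labels):
        code = codeword[label][column]
        if code == 0:
            continue  # classes coded 0 are ignored by this dichotomy
        X.append(x)
        y.append(code)  # +1 -> positive instance, -1 -> negative instance
    return X, y


# Toy example with three classes and a ternary codeword matrix.
codeword = {
    "Cytoplasm": [-1, 1, 1, 0, 0, 0],
    "Nuclear": [-1, -1, 0, 0, 0, 0],
    "ER": [1, 0, 0, -1, 0, 1],
}
samples = [[0.1], [0.2], [0.3]]
labels = ["Cytoplasm", "ER", "Nuclear"]
X, y = column_training_set(samples, labels, codeword, column=1)
print(X, y)
```

In the toy example, the ER sample is dropped for the second column because its code there is 0.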

2.4.3 Decoding

In the decoding stage, given a new instance $z$, we first calculate the output vector from the multiple binary classifiers, i.e. $H(z) = (h_{1}(z), h_{2}(z), \dots, h_{l}(z))$. Then, we need a decoding method that transforms the output vector into a specific target class label, through which the original multi-class classification problem is ultimately solved. Currently, there are many decoding methods, such as Hamming distance (HD) decoding, Euclidean distance decoding and linear-loss weighted decoding (Escalera et al., 2010). Among these strategies, HD is a simple and effective decoding method. Accordingly, we use HD to perform ECOC decoding in this article. The predicted class label $y$ of a testing instance $z$ is then estimated as

$$y = \arg\min_{r \in \{1, \dots, k\}} \sum_{i = 1}^{l} | h_{i}(z) - M_{r, i} | \qquad (2)$$

where the testing sample $z$ is assigned to label $r$ if the output vector $H(z)$ is closer to the rth row of the codeword matrix $M$ in terms of HD than to any other row of $M$.
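The decoding rule of Equation (2) can be sketched in a few lines of Python. The function name `hamming_decode` is hypothetical, and resolving ties in favour of the lowest row index is an assumption, since the text does not specify a tie-breaking rule.

```python
def hamming_decode(outputs, M):
    """Return the index r of the row of M closest to the output vector (Eq. 2).

    outputs : list of l classifier outputs h_i(z) in {-1, 1}
    M       : codeword matrix as a list of k rows, each of length l
    Ties are resolved in favour of the lowest row index (an assumption).
    """
    distances = [sum(abs(h - m) for h, m in zip(outputs, row)) for row in M]
    return min(range(len(M)), key=distances.__getitem__)


# One-versus-all codewords for four classes.
M = [
    [1, -1, -1, -1],
    [-1, 1, -1, -1],
    [-1, -1, 1, -1],
    [-1, -1, -1, 1],
]
# Only the second classifier fires, so the second class (index 1) is chosen.
print(hamming_decode([-1, 1, -1, -1], M))
```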

2.5 ECOC coding with biological structural information

As mentioned in Section 2.4, ECOC-based methods transform the multi-class classification problem into a series of binary classification sub-problems according to a pre-defined codeword matrix. Different designs of codeword matrix may lead to different partitions of original classes, which will affect the classification performance. Hence, the design of the codeword matrix is important for ECOC-based methods. On the other hand, it is highly recognized that the biological structural information (Lin et al., 2011) plays a crucial role in determining protein subcellular location. Intuitively, such structural information can be used to guide the codeword matrix design, which can bring more prior information to the learning process and boost the learning performance. Accordingly, we design a codeword matrix according to the hierarchical structure of cellular compartments. In Table 2, we illustrate the proposed codeword matrix by taking advantage of the cellular compartments structure shown in Figure 2.

Table 2. Corresponding coding matrix to the biological structure of cellular compartments

Location	h_root	h_intra	h_cyto	h_sp	h_meta	h_modi
Cytoplasm	−1	1	1	0	0	0
ER	1	0	0	−1	0	1
Golgi	1	0	0	−1	0	−1
Lysosome	1	0	0	1	1	0
Mitochondrion	−1	1	−1	0	0	0
Nuclear	−1	−1	0	0	0	0
Vesicle	1	0	0	1	−1	0


Fig. 2. Biological structure of cellular compartments

As listed in Table 2, we derive six binary classifiers from this codeword matrix. Starting from the root, we use classifier h_root to distinguish between the three intra-cellular compartments and the other four secreted pathway-based compartments. Then, for the splits under the intra-cellular compartments, we apply h_intra to discriminate the nucleus from the cytoplasm internal node, a union of proteins in cytoplasm and mitochondrion. Similar to h_intra, h_cyto is another classifier that characterizes the differences between cytoplasm and mitochondrion. Turning to the right sub-tree of the root node, since the main function of the vesicle is the uptake and transport of materials within the cytoplasm, and the lysosome is capable of breaking down all kinds of biomolecules, they can be categorized as metabolic functional compartments. Moreover, the main function of the Golgi apparatus and ER is modifying proteins for cell secretion, so classifier h_sp is applied to distinguish the cellular compartments having either metabolic or modified functions under the node of secreted pathway. Finally, h_meta and h_modi are constructed to classify the compartments within the nodes of metabolic function and modified function shown in Figure 2. In this work, we mainly use this coding pattern to predict protein subcellular location under the ECOC framework. Lin et al. (2011) also utilize the biological structure information, building a tree-based classifier to predict protein subcellular location from sequence. We compare our proposed ECOC-based method with the method of Lin et al. in Supplementary Section S5 in the Supplementary Material.
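The correspondence between the hierarchy and the codeword matrix can be made explicit with a short sketch. Each internal node of the tree yields one column: the classes under one child are coded 1, those under the other child are coded −1, and all remaining classes are coded 0. The positive and negative sets below are read off Table 2; the names `DICHOTOMIES` and `codeword` are hypothetical.

```python
LOCATIONS = ["Cytoplasm", "ER", "Golgi", "Lysosome",
             "Mitochondrion", "Nuclear", "Vesicle"]

# (classifier name, classes coded +1, classes coded -1), following Table 2.
DICHOTOMIES = [
    ("h_root", {"ER", "Golgi", "Lysosome", "Vesicle"},
               {"Cytoplasm", "Mitochondrion", "Nuclear"}),
    ("h_intra", {"Cytoplasm", "Mitochondrion"}, {"Nuclear"}),
    ("h_cyto", {"Cytoplasm"}, {"Mitochondrion"}),
    ("h_sp", {"Lysosome", "Vesicle"}, {"ER", "Golgi"}),
    ("h_meta", {"Lysosome"}, {"Vesicle"}),
    ("h_modi", {"ER"}, {"Golgi"}),
]


def codeword(location):
    """Build the six-bit structural codeword of one location."""
    code = []
    for _name, positive, negative in DICHOTOMIES:
        if location in positive:
            code.append(1)
        elif location in negative:
            code.append(-1)
        else:
            code.append(0)  # location is not involved in this split
    return code


for loc in LOCATIONS:
    print(f"{loc:14s}", codeword(loc))
```

The printed rows reproduce the entries of Table 2, which shows that a class is coded 0 exactly for the splits that lie outside its path from the root.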

2.6 ECOC coding by adding checking bits

From the detailed analysis in Supplementary Section S4, it is worth noting that, although the codeword matrix shown in Table 2 follows the biological structure shown in Figure 2, it has no error-correcting ability because the HDs between pairs of codewords are too short (e.g. there is only a 1-bit difference between the codewords under the nodes of cytoplasm, metabolic function and modified function). So, to strengthen the error-correcting ability of the codeword matrix shown in Table 2, we add eight checking bits to the codeword of each cellular compartment (shown in Table 3) to enlarge the HDs between pairs of codewords.

Table 3. The added eight checking bits to enlarge the HDs between pairs of codewords

Location	c₁	c₂	c₃	c₄	c₅	c₆	c₇	c₈
Cytoplasm	1	1	0	0	1	0	0	0
ER	−1	0	1	0	0	0	1	0
Golgi	−1	0	1	0	0	0	0	1
Lysosome	0	−1	0	1	0	0	−1	0
Mitochondrion	1	1	0	0	0	1	0	0
Nuclear	0	0	−1	−1	−1	−1	0	0
Vesicle	0	−1	0	1	0	0	0	−1


As listed in Table 3, the newly added eight checking bits are used for distinguishing different nodes or cellular compartments (e.g. c₁ is used to distinguish between modified function and cytoplasm-based proteins); they do not reflect the hierarchical structure of cellular compartments in Figure 2. After adding these eight checking bits, each cellular compartment is represented by 14 bits and the HDs between pairs of codewords are accordingly enlarged (e.g. the HD between the codewords under the node of cytoplasm is enlarged from 2 to 4 once the eight checking bits are added).
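The effect of the checking bits can be checked numerically. The sketch below concatenates the six structural bits of Table 2 with the eight checking bits of Table 3 and compares codeword distances before and after. It assumes that the distance is the sum of absolute differences used in Equation (2), so that a 1 versus −1 disagreement counts as 2; the variable and function names are hypothetical.

```python
# Six structural bits (Table 2) and eight checking bits (Table 3).
STRUCT = {
    "Cytoplasm":     [-1, 1, 1, 0, 0, 0],
    "ER":            [1, 0, 0, -1, 0, 1],
    "Golgi":         [1, 0, 0, -1, 0, -1],
    "Lysosome":      [1, 0, 0, 1, 1, 0],
    "Mitochondrion": [-1, 1, -1, 0, 0, 0],
    "Nuclear":       [-1, -1, 0, 0, 0, 0],
    "Vesicle":       [1, 0, 0, 1, -1, 0],
}
CHECK = {
    "Cytoplasm":     [1, 1, 0, 0, 1, 0, 0, 0],
    "ER":            [-1, 0, 1, 0, 0, 0, 1, 0],
    "Golgi":         [-1, 0, 1, 0, 0, 0, 0, 1],
    "Lysosome":      [0, -1, 0, 1, 0, 0, -1, 0],
    "Mitochondrion": [1, 1, 0, 0, 0, 1, 0, 0],
    "Nuclear":       [0, 0, -1, -1, -1, -1, 0, 0],
    "Vesicle":       [0, -1, 0, 1, 0, 0, 0, -1],
}
# 14-bit codewords: structural bits followed by checking bits.
FULL = {name: STRUCT[name] + CHECK[name] for name in STRUCT}


def distance(a, b):
    """Sum of absolute differences between two codewords (as in Eq. 2)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Sibling pairs under the cytoplasm, modified and metabolic function nodes.
for a, b in [("Cytoplasm", "Mitochondrion"), ("ER", "Golgi"),
             ("Lysosome", "Vesicle")]:
    before = distance(STRUCT[a], STRUCT[b])
    after = distance(FULL[a], FULL[b])
    print(f"{a} vs {b}: {before} -> {after}")
```

For the three sibling pairs, the distance grows from 2 to 4, in line with the cytoplasm example given in the text.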

2.7 Kernel combination

As the second procedure in ECOC framework, binary classifier learning is also important for multi-class classification. To better make use of different kinds of features, we adopt a kernel combination method (Wang et al., 2008; Zhang et al., 2011) to design each of multiple binary classifiers in ECOC. Specifically, for each dichotomy, we fuse both the global features (i.e. Haralick feature and DNA feature) and the local features (i.e. LBP) by a multi-kernel-based SVM classifier (Wang et al., 2008; Zhang et al., 2011).

Suppose that we are given $n$ protein images. Let $x_{i}^{1}$ and $x_{i}^{2}$ denote the global and the local feature of the ith sample, respectively, with corresponding label $y_{i} \in \{-1, 1\}$. The multi-kernel-based SVM solves the following primal problem:

$$\begin{array}{l} \min_{w^{(m)}, \beta_{m}, \varepsilon_{i}} \; \frac{1}{2} \sum_{m = 1}^{2} \beta_{m} \| w^{(m)} \|^{2} + C \sum_{i = 1}^{n} \varepsilon_{i} \\ \text{s.t.} \;\; y_{i} \left( \sum_{m = 1}^{2} \beta_{m} \left( (w^{(m)})^{T} \phi^{m}(x_{i}^{m}) + b \right) \right) \geq 1 - \varepsilon_{i}, \\ \varepsilon_{i} \geq 0 \quad (i = 1, 2, \dots, n) \end{array} \qquad (3)$$

where $w^{(m)}$, $\beta_{m}$ and $\phi^{m}$ represent the normal vector to the hyperplane, the weight value and the kernel-induced mapping function of the mth type of feature, respectively, $\varepsilon_{i}$ are the slack variables and $C$ is the penalty parameter. For a testing sample $x$, its label is obtained by

$$f(x) = \operatorname{sign} \left( \sum_{i = 1}^{n} y_{i} \alpha_{i} \sum_{m = 1}^{2} \beta_{m} k^{m}(x_{i}^{m}, x^{m}) + b \right) \qquad (4)$$

Here, $\alpha_{i}$ is the dual coefficient (Lagrange multiplier) of the ith training sample, and $k^{m}$ is the kernel function of the mth type of feature induced by the mapping function $\phi^{m}$, with $k^{m}(x_{i}^{m}, x_{j}^{m}) = \phi^{m}(x_{i}^{m})^{T} \phi^{m}(x_{j}^{m})$. From Equations (3) and (4), we can see that the multi-kernel-based SVM is an extension of the single-kernel-based SVM, where the kernel matrix of the multi-kernel-based SVM is a linear combination of the single kernel matrices on the different types of features. The weight values $\beta_{m}$ ($m = 1, 2$; $\beta_{1} + \beta_{2} = 1$) balance the importance of the global and the local features and are determined by a coarse-grid search method (Zhang et al., 2011). Specifically, we first equally split the training samples into 10 subsets and use nine of them to train a series of models in order to determine the $\beta_{m}$ ($m = 1, 2$) that achieves the highest classification accuracy on the remaining subset. Then, with the optimal $\beta_{m}$, we train a model on all of the training samples, and the final effectiveness of this model is evaluated by its classification accuracy on the testing samples.
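The kernel combination step can be illustrated with a small standard-library sketch that builds the combined Gram matrix $\beta_{1} K^{1} + \beta_{2} K^{2}$ for a given weight. Several details are assumptions made for illustration: the RBF kernel is written as $\exp(-\|a - b\|^{2} / (2\sigma^{2}))$, the same $\sigma$ is used for both feature types, the grid of candidate weights has step 0.1, and all function names are hypothetical.

```python
import math


def rbf_kernel(a, b, sigma):
    """RBF kernel exp(-||a - b||^2 / (2 sigma^2)); this form is an assumption."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))


def gram_matrix(global_feats, local_feats, beta1, sigma=1.0):
    """Combined kernel matrix beta1 * K_global + (1 - beta1) * K_local."""
    beta2 = 1.0 - beta1  # enforces beta1 + beta2 = 1
    n = len(global_feats)
    return [
        [
            beta1 * rbf_kernel(global_feats[i], global_feats[j], sigma)
            + beta2 * rbf_kernel(local_feats[i], local_feats[j], sigma)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Candidate weights for a coarse grid search (step 0.1 is an assumption).
beta_grid = [i / 10 for i in range(11)]

# Toy data: two samples, one global and one local feature dimension each.
toy_global = [[0.0], [1.0]]
toy_local = [[0.0], [0.0]]
K = gram_matrix(toy_global, toy_local, beta1=0.5)
print(K)
```

A matrix built this way could be passed to any SVM solver that accepts precomputed kernels (LIBSVM offers such a mode with `-t 4`), and the weight with the best held-out accuracy over `beta_grid` would then be retained.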

2.8 Ensemble classification method

As can be observed from Figure 1, the db wavelet has 10 vanishing moments from db1 to db10. Accordingly, we construct 10 SC-PSorter-based classification models, each corresponding to a specific vanishing moment. Inspired by Liu et al. (2015a), Xu et al. (2013) and Yang et al. (2014), we adopt a majority voting strategy to combine those SC-PSorter-based models. Specifically, for a testing protein image $z$, if the ith ($i = 1, 2, \dots, 10$) SC-PSorter model predicts that it belongs to location $c$ ($1 \leq c \leq 7$), the vote for the cth compartment is increased by one. Then, $z$ is assigned to the location with the largest number of votes over all 10 SC-PSorter models.
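The voting rule can be sketched as follows. The function name `majority_vote` is hypothetical, and because the text does not state how ties are resolved, the sketch simply returns the smallest label among the tied ones (an assumption).

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions into one label by majority voting.

    predictions : list of predicted location indices, one per model
    Ties are broken by returning the smallest tied label (an assumption).
    """
    votes = Counter(predictions)
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)


# Predictions of 10 hypothetical models for one test image.
preds = [1, 1, 3, 1, 5, 1, 3, 1, 1, 7]
print(majority_vote(preds))  # location 1 receives 6 of the 10 votes
```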

3 Experimental results

3.1 Experimental settings

In previous works (i.e. Xu et al., 2013; Yang et al., 2014), researchers use images from the same protein for training and testing via cross-validation. Following these works, we use the same cross-validation strategy in our experiments. Specifically, we equally divide the images of each protein into five disjoint subsets, with four subsets used for training and the remaining subset used for testing. For all proposed and compared methods in this article, the SVM classifier is implemented with the LIBSVM toolbox (Chang and Lin, 2011) using an RBF kernel, and the parameter $σ$ is tuned from 0.9 to 2.1 at a step size of 0.1 by grid search on the training data. For each feature $f_{i}$ in the training set, a common feature normalization scheme is adopted, i.e. the normalized feature is $f_{i}^{'} = f_{i} / f_{i}^{\max}$, where $f_{i}^{\max}$ is the maximum value of the ith feature in the training set. The same $f_{i}^{\max}$ value is also used to normalize the corresponding feature in the test set.
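The normalization scheme can be sketched as below. The key point is that the per-feature maxima are computed on the training set only and then reused for the test set. The function names are hypothetical, and mapping a feature whose training maximum is zero to 0.0 is an assumption added to avoid division by zero.

```python
def fit_feature_max(train):
    """Return the per-feature maximum computed on the training set only."""
    return [max(column) for column in zip(*train)]


def normalize(rows, feature_max):
    """Divide every feature by its training-set maximum.

    Features whose training maximum is 0 are mapped to 0.0 (an assumption).
    """
    return [
        [v / m if m != 0 else 0.0 for v, m in zip(row, feature_max)]
        for row in rows
    ]


train = [[2.0, 10.0], [4.0, 5.0]]
test = [[8.0, 5.0]]

f_max = fit_feature_max(train)       # maxima come from the training data
train_norm = normalize(train, f_max)
test_norm = normalize(test, f_max)   # test data reuse the training maxima
print(f_max, train_norm, test_norm)
```

Note that normalized test values may exceed 1 when a test feature is larger than anything seen during training.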

3.2 Results for combining different types of features

To evaluate the efficacy of using different types of features, we first perform experiments that use only one type of feature (i.e. the global feature or the local feature) and experiments that combine different types of features (i.e. direct combination and kernel combination) to predict the subcellular locations of proteins. Here, we choose the one-versus-all coding strategy, with experimental results shown in Figure 3. As can be seen from Figure 3, on one hand, direct combination of different types of features does not improve the prediction accuracy. In most cases, the classification accuracy of this combination strategy lies between those obtained with the global or the local feature alone. On the other hand, using the kernel combination method to fuse different types of features is much more effective, with classification accuracies that consistently exceed those of the methods based on a single type of feature (i.e. global feature or local feature). However, even with the kernel combination strategy, the classification accuracy does not reach 70% for any of the 10 db wavelets, which motivates us to replace the simple one-versus-all coding strategy with other coding strategies for further improvement.

Fig. 3. Classification accuracies by using single type of features and combining different types of features together

3.3 Results for different coding strategies

In the second group of experiments, we test the classification performance of three different coding strategies, namely, one-versus-all, forest and biological structure-based coding (i.e. following the codeword matrix shown in Table 2), which are denoted as OVA-PSorter, F-PSorter and S-PSorter, respectively. Here, we again use the kernel combination method to fuse the global features (including both Haralick features and DNA features) and the local features (i.e. LBP features) because of its superior classification performance in the first group of experiments. Figure 4 shows the classification accuracies for all of the 10 db wavelets.

Fig. 4. Performance comparison among different coding strategies

As can be seen from Figure 4, our proposed S-PSorter method consistently outperforms the other two methods (i.e. OVA-PSorter and F-PSorter), which shows the advantage of using the biological structure of cellular compartments to design the codeword matrix. On the other hand, Figure 4 indicates that F-PSorter achieves consistently better classification accuracies than OVA-PSorter. This is because F-PSorter constructs codeword matrix by maximizing the mutual information between different classes rather than the simple one-versus-all coding strategy used in OVA-PSorter. Moreover, we also compare the computational efficiency of different coding strategies in Supplementary Section S8.

3.4 Further improvement by adding checking bits

As discussed in Section 2.6, we also add eight checking bits (shown in Table 3) to the codeword matrix of S-PSorter model to strengthen its error-correcting ability. Here, we denote two ECOC-based methods, whose codeword matrices are derived by adding these eight checking bits to the S-PSorter-based codeword matrix and only using these eight checking bits, as SC-PSorter and C-PSorter, respectively. Figure 5 presents the individual classification accuracies of the above two methods when comparing with S-PSorter method for all of the 10 db models.

Fig. 5. Performance comparisons among S-PSorter, C-PSorter and SC-PSorter methods

As can be seen from Figure 5, on one hand, SC-PSorter consistently outperforms S-PSorter on all of the 10 db wavelets. This is because we enlarge the HDs between pairs of codewords, and thus a few mistakes in some bits can be corrected by the decoding procedure. On the other hand, Figure 5 also shows the classification accuracies of C-PSorter method are consistently inferior to the other methods, which is because these eight checking bits are just designed to distinguish different nodes or cellular compartments, and they do not reflect the hierarchical structure of cellular compartments shown in Figure 2. (Detailed classification results are reported in Supplementary Section S4.)

3.5 Ensemble results with different coding strategies

As shown in Figures 4 and 5, for each method (i.e. OVA-PSorter, F-PSorter, S-PSorter and SC-PSorter), the classification accuracies of its 10 individual classifiers differ, which motivates us to use an ensemble strategy to fuse the complementary individual decisions. Here, we use the majority voting strategy introduced in Section 2.8 to combine the different classifiers. Table 4 compares the classification accuracies of the best individual classifier and the ensemble model for these four methods.

Table 4. Comparison between individual and ensemble classification for four different coding strategies

Method	Best independent classifier	Ensemble prediction
OVA-PSorter	0.675	0.679
F-PSorter	0.819	0.853
S-PSorter	0.848	0.874
SC-PSorter	0.860	0.890

As listed in Table 4, the ensemble obtained via majority voting achieves a higher classification accuracy than the best individual classifier for all four methods. Table 4 also indicates that the accuracy of our proposed SC-PSorter method improves from 0.860 (best individual classifier) to 0.890 (ensemble), which is the best overall classification accuracy among the four methods.
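The voting step can be sketched as follows. This is a minimal illustration, not the implementation described in Section 2.8: the tie-breaking rule (the first label among the tied ones wins) and the example labels are assumptions made here.

```python
from collections import Counter


def majority_vote(labels):
    """Combine the labels that several classifiers assign to one sample.

    Ties are broken in favour of the tied label that occurs first in
    `labels` (an assumption of this sketch).
    """
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label


def ensemble_predict(per_classifier_labels):
    """per_classifier_labels[i][j] is the label classifier i gives sample j."""
    return [majority_vote(list(column)) for column in zip(*per_classifier_labels)]


# Hypothetical outputs of three individual classifiers on four samples.
clf_1 = ["nucleus", "ER", "Golgi", "cytoplasm"]
clf_2 = ["nucleus", "ER", "ER", "vesicle"]
clf_3 = ["ER", "ER", "Golgi", "vesicle"]

combined = ensemble_predict([clf_1, clf_2, clf_3])
```

Each sample receives the label that most classifiers agree on, so an error made by one classifier is outvoted whenever the others are correct on that sample.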

4 Discussion

4.1 Comparisons with existing works

We also compare our SC-PSorter method with several existing approaches for image-based prediction of protein subcellular location. For example, in Xu et al. (2013), the authors construct k SVM models, where k is the number of classes. In training the ith SVM model, examples belonging to the ith class are treated as positive samples and all other examples as negative samples. For a query protein image $z$, the output vector consists of k probabilities corresponding to the k different subcellular locations, and $z$ is assigned to the class with the largest probability in the output vector. Different from Xu et al. (2013), the authors in Yang et al. (2014) construct k(k − 1)/2 SVM classifiers, each of which is trained on two different classes. The testing sample $z$ is then fed into these k(k − 1)/2 SVMs, each of which outputs a probability indicating which class $z$ belongs to. Here, the SVM classifiers are implemented with an RBF kernel, and the parameter $σ$ is again tuned from 0.9 to 2.1 in steps of 0.1 by grid search on the training data.
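The bookkeeping behind these two baselines can be sketched in a few lines. The value k = 7, the class names and the probabilities are illustrative assumptions; how the pairwise outputs in Yang et al. (2014) are combined into a final label is not specified here, so only the classifier count is shown for that scheme.

```python
def n_one_vs_all(k):
    """One binary SVM per class."""
    return k


def n_one_vs_one(k):
    """One binary SVM per unordered pair of classes: k(k - 1)/2."""
    return k * (k - 1) // 2


def one_vs_all_decision(probabilities):
    """Assign the query to the class with the largest output probability."""
    return max(probabilities, key=probabilities.get)


# Candidate RBF kernel widths: 0.9, 1.0, ..., 2.1 (step 0.1).
sigma_grid = [round(0.9 + 0.1 * i, 1) for i in range(13)]

k = 7  # illustrative number of classes
counts = (n_one_vs_all(k), n_one_vs_one(k))

# Hypothetical one-versus-all output vector for a query image z.
output_vector = {"nucleus": 0.12, "ER": 0.71, "Golgi": 0.17}
predicted = one_vs_all_decision(output_vector)
```

The one-versus-one scheme therefore trains many more, but smaller, binary problems than the one-versus-all scheme, and each candidate value in `sigma_grid` would be evaluated on the training data.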

Figure 6 presents the individual classification accuracies of the above two methods when comparing with our proposed SC-PSorter method. As can be seen from Figure 6, our proposed SC-PSorter method consistently achieves better classification accuracies than the other two approaches. Here, the better classification accuracies of our method are mainly owing to the following two aspects: (i) since the prior biological information is crucial in computational biology, we use the biological structural information to guide the learning procedure and (ii) we apply the multiple kernel combination strategy to fuse different types of features, which is regarded as a much more effective and flexible way than the direct combination strategy applied in the other two methods.

Fig. 6. Classification accuracies achieved by SC-PSorter and the other two methods

We also compare the classification accuracies of the best individual classifier and the ensemble model for the above three methods. As listed in Table 5, for all three methods the ensemble obtained via majority voting achieves a higher classification accuracy than the best individual classifier. Table 5 also indicates that, with the ensemble strategy, the SC-PSorter method achieves the best classification accuracy among the three methods. This result again validates the advantage of our proposed SC-PSorter method for prediction of protein subcellular location.

Table 5. Comparison between individual and ensemble classification for three different methods

Method	Best independent classifier	Ensemble prediction
Xu et al. (2013)	0.817	0.826
Yang et al. (2014)	0.768	0.772
SC-PSorter	0.860	0.890

4.2 Diversity analysis

To understand how our proposed ensemble SC-PSorter works, we apply the kappa measure (Rodriguez and Kuncheva, 2006), which evaluates the level of agreement between the outputs of two individual classifiers, to plot the diversity-error diagram. In Figure 7, we show the diversity-error diagrams of ensemble SC-PSorter and the methods in Xu et al. (2013) and Yang et al. (2014) for the task of predicting protein subcellular locations. In our experiment, each ensemble contains 10 individual classifiers, each built on a different set of global db features. The value on the x-axis of a diversity-error diagram denotes the kappa diversity of a pair of classifiers in the ensemble, while the value on the y-axis is the averaged individual error of that pair. Since a small kappa value indicates better diversity and a small averaged individual error indicates better accuracy, the most desirable pairs of classifiers lie close to the bottom left corner of the graph. As shown in Figure 7, our proposed ensemble SC-PSorter achieves much lower kappa values and much lower classification errors than the method in Xu et al. (2013). At the same time, our proposed SC-PSorter method is not as diverse as the method in Yang et al. (2014), but it has clearly more accurate base classifiers. Our proposed method thus appears to achieve a better trade-off between accuracy and diversity than the two compared methods. That is, it builds a classifier ensemble from reasonably diverse but markedly accurate individual components.
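The coordinates of one point in such a diagram can be computed as in the sketch below. It assumes the usual pairwise kappa statistic, (observed agreement − chance agreement) / (1 − chance agreement), computed on the predicted labels of two classifiers. The label vectors are toy data, not experimental outputs.

```python
from collections import Counter
from itertools import combinations


def kappa(pred_a, pred_b):
    """Pairwise kappa agreement between two classifiers' predicted labels."""
    n = len(pred_a)
    observed = sum(x == y for x, y in zip(pred_a, pred_b)) / n
    count_a, count_b = Counter(pred_a), Counter(pred_b)
    # Agreement expected by chance, from the two label distributions.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:  # both classifiers always output the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def error_rate(pred, truth):
    """Fraction of samples that are misclassified."""
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)


def kappa_error_points(predictions, truth):
    """One (kappa, averaged individual error) point per pair of classifiers."""
    points = []
    for pred_a, pred_b in combinations(predictions, 2):
        mean_error = (error_rate(pred_a, truth) + error_rate(pred_b, truth)) / 2
        points.append((kappa(pred_a, pred_b), mean_error))
    return points


# Toy example with two classifiers and four samples.
truth = [0, 0, 1, 1]
clf_a = [0, 0, 1, 1]
clf_b = [0, 1, 1, 1]
points = kappa_error_points([clf_a, clf_b], truth)
```

An ensemble of 10 classifiers yields 45 such points; pairs that disagree often (low kappa) while both remaining accurate (low averaged error) fall towards the bottom left corner.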

Fig. 7. The diversity-error diagrams of classifiers in the task of determining protein subcellular locations

4.3 Slight variations on tree structure

We design another two variants of the hierarchy of cellular compartments (i.e. T1 and T2 are shown in Supplementary Figs. S1 and Supplementary Data in Supplementary Section S1, respectively), which are constructed by making slight variations on the tree structure in Figure 2. Specifically, on one hand, for the tree representation T1, we neglect the hierarchical structure of four cellular compartments (i.e. lysosome, vesicle, Golgi and ER) under the node of secreted pathway. On the other hand, for the tree representation T2, we misuse cytoplasm (which is originally under the node of intra cellular) as a metabolic functional compartment (under the node of secreted pathway).

The classification results in Supplementary Table S3 in Supplementary Section S1 show that slight variations on the tree structure in Figure 2 lead to decreases in classification accuracy for all of the 10 db models when compared with the S-PSorter method. Also, as can be seen from Supplementary Table S4 in Supplementary Section S1, the ensemble classification results for these two tree representations (i.e. T1 and T2) are 0.866 and 0.842, respectively, which are still higher than those of previously published methods (i.e. Xu et al., 2013; Yang et al., 2014), although a bit lower than that of our original S-PSorter method. These results suggest that the proposed tree representation in Figure 2 reflects the true hierarchy of subcellular compartments.

4.4 Prediction on unseen proteins

As mentioned in Section 3.1, we use images from the same protein to evaluate the performance of the SC-PSorter method. However, a stricter test, i.e. recognizing subcellular patterns in new proteins, is also very important. We therefore add one more experiment to compare the SC-PSorter method with the other two methods (i.e. Xu et al., 2013; Yang et al., 2014) for predicting proteins that are not included in the training set (detailed in Supplementary Section S3).

As can be seen in Supplementary Table S6, our proposed SC-PSorter method consistently achieves better classification accuracies than the other two approaches for all of the 10 db wavelets. Moreover, when performing the classifier ensemble via majority voting, the classification accuracies for SC-PSorter, Xu et al. (2013) and Yang et al. (2014) methods are 0.809, 0.684 and 0.603, respectively. This result again validates the advantage of our proposed SC-PSorter method for the prediction of protein subcellular location.
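A protein-level split of this kind can be sketched as follows. The exact protocol is the one given in Supplementary Section S3; the function below is only an assumed illustration in which all images of a protein go to the same side of the split, and the protein identifiers, test fraction and seed are hypothetical.

```python
import random


def split_by_protein(protein_ids, test_fraction=0.2, seed=0):
    """Split image indices so that no protein appears in both sets.

    protein_ids[i] is the protein that image i was stained for.
    """
    proteins = sorted(set(protein_ids))
    rng = random.Random(seed)
    rng.shuffle(proteins)
    n_test = max(1, round(len(proteins) * test_fraction))
    test_proteins = set(proteins[:n_test])
    train_idx = [i for i, p in enumerate(protein_ids) if p not in test_proteins]
    test_idx = [i for i, p in enumerate(protein_ids) if p in test_proteins]
    return train_idx, test_idx


# Hypothetical protein identifiers for eight images.
ids = ["P1", "P1", "P2", "P3", "P3", "P3", "P4", "P5"]
train_idx, test_idx = split_by_protein(ids, test_fraction=0.4, seed=1)
```

Splitting by protein rather than by image prevents images of the same protein from appearing in both the training and test sets, which is what makes the test images genuinely unseen.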

5 Conclusion

In this article, we develop and test a novel prediction model, SC-PSorter, for determining image-based protein subcellular locations. Specifically, we first devise a novel codeword matrix by considering the biological structural information under the ECOC framework, and then, for each binary classifier corresponding to a column of the ECOC codeword matrix, we adopt a kernel combination method to fuse different types of features. Finally, we develop a classifier ensemble by combining multiple SC-PSorter-based classifiers via majority voting.

In this study, our method has been shown to be effective when each protein corresponds to only one location. However, nearly 20% of human proteins co-exist at more than two locations (Zhu et al., 2009), and thus we will design a new method to solve this multi-label protein classification problem. Also, since different biomarkers may provide complementary information for the prediction of protein subcellular location (Breker and Schuldiner, 2014), we will add non-image data (e.g. amino acid sequence) to our image-based predictor for further performance improvement.

Acknowledgements

We thank Prof. Hongbin Shen for his helpful suggestions and the anonymous reviewers for valuable comments.

Funding

This work was supported in part by the National Natural Science Foundation of China (61422204; 61473149); Jiangsu Natural Science Foundation for Distinguished Young Scholar (BK20130034) and NUAA Fundamental Research Funds (NE2013105).

Conflict of Interest: none declared.

References

Boland, M.V. et al. (1998) Automated recognition of patterns characteristic of subcellular structures in fluorescence microscopy images. Cytometry, 33, 366–375.
Boland, M.V. and Murphy, R.F. (2001) A neural network classifier capable of recognizing the patterns of all major subcellular structures in fluorescence microscope images of HeLa cells. Bioinformatics, 17, 1213–1223.
Breker, M. and Schuldiner, M. (2014) The emergence of proteome-wide technologies: systematic analysis of proteins comes of age. Nat. Rev. Mol. Cell Biol., 15, 453–464.
Chang, C.C. and Lin, C.J. (2011) LIBSVM: a library for support vector machines. ACM Trans. Intel. Syst. Technol., 2, 1889–1918.
Chebira, A. et al. (2007) A multiresolution approach to automated classification of protein subcellular location images. BMC Bioinformatics, 8, 5.
Chi, S.M. (2010) Prediction of protein subcellular localization by weighted gene ontology terms. Biochem. Biophys. Res. Commun., 399, 402–405.
Chou, K.C. (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins, 43, 246–255.
Coelho, L.P. et al. (2013) Determining the subcellular location of new proteins from microscope images using local features. Bioinformatics, 29, 2343–2349.
Dietterich, T.G. and Bakiri, G. (1995) Solving multiclass learning problems via error-correcting output codes. Artif. Intell., 2, 24.
Escalera, S. et al. (2007) Boosted landmarks of contextual descriptors and forest-ECOC: a novel framework to detect and classify objects in cluttered scenes. Pattern Recognit. Lett., 28, 1759–1768.
Escalera, S. et al. (2010) On the decoding process in ternary error-correcting output codes. IEEE Trans. Pattern Anal. Mach. Intell., 32, 120–134.
Glory, E. and Murphy, R.F. (2007) Automated subcellular location determination and high-throughput microscopy. Dev. Cell., 12, 10.
Huang, K. et al. (2003) Feature reduction for improved recognition of subcellular location patterns in fluorescence microscope images. Proc. SPIE, 4962, 307–318.
Jeong, J.C. et al. (2011) On position-specific scoring matrix for protein function prediction. IEEE ACM Trans. Comput. Bi., 8, 308–315.
Kumar, A. et al. (2014) Automated analysis of immunohistochemistry images identifies candidate location biomarkers for cancers. Proc. Natl. Acad. Sci. USA, 111, 18249–18254.
Li, J.Y. et al. (2012) Protein subcellular location pattern classification in cellular images using latent discriminative models. Bioinformatics, 28, I32–I39.
Lin, T.H. et al. (2011) Discriminative motif finding for predicting protein subcellular localization. IEEE ACM Trans. Comput. Bi., 8, 441–451.
Liu, M. et al. (2015a) View-centralized multi-atlas classification for Alzheimer's disease diagnosis. Hum. Brain Mapp., 36, 1847–1865.
Liu, M. et al. (2015b) Joint binary classifier learning for ECOC-based multi-class classification. IEEE Trans. Pattern Anal. Mach. Intell., 99. doi: 10.1109/TPAMI.2015.2430325.
Lu, J. et al. (2005) MicroRNA expression profiles classify human cancers. Nature, 435, 834–838.
Murphy, R.F. et al. (2003) Robust numerical features for description and classification of subcellular location patterns in fluorescence microscope images. J. Vlsi. Sig. Proc. Syst., 35, 311–321.
Nanni, L. et al. (2010) Local binary patterns variants as texture descriptors for medical image analysis. Artif. Intell. Med., 49, 117–125.
Newberg, J. and Murphy, R.F. (2008) A framework for the automated analysis of subcellular patterns in human protein atlas images. J. Proteome Res., 7, 2300–2308.
Nigam, K. et al. (2000) Text classification from labeled and unlabeled documents using EM. Mach. Learn., 39, 103–134.
Ojala, T. et al. (1996) A comparative study of texture measures with classification based on featured distributions. Pattern Recognit., 29, 9.
Peng, H. et al. (2012) Bioimage informatics: a new category in Bioinformatics. Bioinformatics, 28, 1057.
Pierleoni, A. et al. (2011) MemLoci: predicting subcellular localization of membrane proteins in eukaryotes. Bioinformatics, 27, 1224–1230.
Ponten, F. et al. (2008) The human protein atlas—a tool for pathology. J. Pathol., 216, 387–393.
Pujol, O. et al. (2006) Discriminant ECOC: a heuristic method for application dependent design of error correcting output codes. IEEE Trans. Pattern Anal. Mach. Intell., 28, 1007–1012.
Rodriguez, J.J. and Kuncheva, L.I. (2006) Rotation forest: a new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell., 28, 1619–1630.
Tan, X.Y. and Triggs, B. (2010) Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Trans. Image Process, 19, 1635–1650.
Uhlen, M. et al. (2005) A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol. Cell. Proteomics, 4, 1920–1932.
Wang, X. and Li, G.Z. (2013) Multi-label learning via random label selection for protein subcellular multi-locations prediction. IEEE ACM Trans. Comput. Biol, 10, 436–446.
Wang, Z. et al. (2008) MultiK-MHKS: a novel multiple kernel learning algorithm. IEEE Trans. Pattern Anal. Mach. Intell., 30, 348–353.
Xu, Y.Y. et al. (2013) An image-based multi-label human protein subcellular localization predictor (iLocator) reveals protein mislocalizations in cancer tissues. Bioinformatics, 29, 2032–2040.
Xu, Y.Y. et al. (2015) Bioimaging-based detection of mislocalized proteins in human cancers by semi-supervised learning. Bioinformatics, 31, 1111–1119.
Yang, F. et al. (2014) Image-based classification of protein subcellular location patterns in human reproductive tissue by ensemble learning global and local features. Neurocomputing, 131, 113–123.
Yoon, Y. and Lee, G.G. (2012) Subcellular localization prediction through boosting association rules. IEEE ACM Trans. Comput. Biol., 9, 609–618.
Zhang, D. et al. (2011) Multimodal classification of Alzheimer's disease and mild cognitive impairment. Neuroimage, 55, 856–867.
Zhang, T.L. et al. (2006) Prediction of protein subcellular location using hydrophobic patterns of amino acid sequence. Comput. Biol. Chem., 30, 367–371.
Zhu, L. et al. (2009) Multi label learning for prediction of human protein subcellular localizations. Protein J., 28, 384–390.

Author notes

Associate Editor: Robert Murphy
