Abstract

Assembling peptides identified from tandem mass spectra into a list of proteins, referred to as protein inference, is a critical step in proteomics research. Due to the existence of degenerate peptides and ‘one-hit wonders’, it is very difficult to determine which proteins are present in the sample. In this paper, we review existing protein inference methods and classify them according to the source of peptide identifications and the principle of the algorithms. We hope that readers will gain a good understanding of current developments in this field and be inspired to devise new protein inference algorithms.

INTRODUCTION

Mass spectrometry-based proteomics is the large-scale study of proteins expressed in an organism. It can provide information that is not readily available from genomic sequence or RNA expression data [1]. An explicit goal of proteomics is the identification of all proteins expressed in a cell or tissue.

Shotgun proteomics is a method of identifying complex protein mixtures using a combination of high-performance liquid chromatography and mass spectrometry. As shown in Figure 1, proteins are enzymatically digested and optionally fractionated from their biological source. The resulting peptide mixtures are then ionized and scanned by tandem mass spectrometry (MS/MS) to obtain a set of MS/MS spectra. Finally, peptides and proteins are identified by computational analysis of the acquired MS/MS spectra [2, 3]. The data generated by shotgun proteomics experiments are highly redundant in the sense that a subset of the peptides is repeatedly and preferentially selected for fragmentation and tandem MS scanning. Moreover, peptides derived from low-abundance proteins are more difficult to detect. As a consequence, a large number of fragment ion spectra have to be acquired to increase the identification coverage rate [4–6].

Figure 1:

A general pipeline for the identification of proteins in shotgun proteomics.

There are three key computational problems in the protein identification process: peptide identification, protein inference and result evaluation. In peptide identification, fragment ion spectra are assigned to peptide sequences to generate a set of Peptide-Spectrum Matches (PSM) using database search engines, such as Mascot [7], SEQUEST [8] and X!Tandem [9]. In protein inference, identified peptide sequences are assembled into a set of confident proteins [10, 11]. Finally, we need to assess the reliability of these identifications.

Over the past decade, researchers have paid much attention to peptide identification. However, it is not straightforward to generate a reliable list of proteins from the identified peptides, mainly because of degenerate peptides and ‘one-hit wonders’. Degenerate peptides are identified peptides whose sequences are shared by multiple proteins in the database; the resulting ambiguity cannot be easily resolved. ‘One-hit wonders’ are proteins supported by only a single identified peptide. In principle, it is difficult to infer a protein from a single peptide identification because that identification may be a false positive. As a result, the problem of determining which proteins are indeed present in the sample often has multiple solutions and can be computationally intractable. ProteinProphet [12], a probabilistic model, was the first method proposed to address this challenge. Since then, we have witnessed a steady stream of new problem formulations and new solutions (e.g. [13, 14]), but an up-to-date and comprehensive review is still lacking.

In this article, we aim at providing a comprehensive review of existing protein inference approaches. To achieve this objective, we present two frameworks that can categorize all existing solutions to the protein inference problem. More precisely, we use the dependence on the peptide search engine and the underlying algorithmic technique as the classification criteria.

The remainder of the article is organized as follows. In ‘The Protein Inference Problem’ section, we discuss the key difficulties in solving the protein inference problem. In ‘Peptide Identification Post-processing’ section, we list common methods for peptide identification post-processing. In ‘Peptide Assembly’ section, we summarize available protein inference methods and describe the different classes of methods in detail. We discuss how to assess the reliability of protein identifications in ‘Result Validation for Protein Identification’ section. We conduct a comprehensive evaluation and comparison of five typical protein inference methods in ‘Experiment’ section. Based on the experimental results, we discuss the future development of protein inference in ‘Future Perspective’ section. Finally, we conclude this article in ‘Conclusion’ section.

THE PROTEIN INFERENCE PROBLEM

In shotgun proteomics, the peptide-centric strategy is widely adopted as a means of identifying proteins from biological samples. Since researchers are more interested in which proteins are present in the sample, peptide identification is only an intermediate step for protein identification. After gathering all peptide identifications, we need to infer the existence of proteins in the sample. However, protein inference is a difficult problem due to the existence of ‘one-hit wonders’ and degenerate peptides.

One-hit wonders refer to those proteins that have only one identified peptide. Formally, we use ui to denote the number of identified peptides that can be generated by digestion of protein i. Protein i is a one-hit wonder if ui = 1. Since current peptide identification algorithms are still far from perfect, an identified peptide may be a false positive regardless of its uniqueness. That is, the only piece of evidence confirming the presence of a one-hit wonder is itself unreliable, even if the peptide is unique throughout the protein sequence database. Thus, compared with proteins having two or more identified peptides, it is harder to determine the existence of one-hit wonders.

Degenerate peptides are peptides that are shared by multiple proteins. Mathematically, let vj denote the number of proteins that can generate identified peptide j; peptide j is called degenerate if vj > 1. Compared with unique peptides, which occur in only one protein, degenerate peptides pose a computational challenge because it is hard to know which protein or protein group actually generated them. Degeneracy is often caused by homologous proteins in the database, and it is then difficult to distinguish between two possibilities: (i) all of the related proteins are present in the sample; or (ii) only some of them are truly present.
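The two quantities ui and vj defined above can be computed directly from the peptide-to-protein mapping. A minimal sketch in Python; the peptide and protein names are hypothetical toy values:

```python
# Toy bipartite mapping: identified peptide -> set of parent proteins.
peptide_to_proteins = {
    "PEP1": {"ProtA", "ProtB"},   # degenerate: shared by two proteins (v = 2)
    "PEP2": {"ProtA"},
    "PEP3": {"ProtC"},            # ProtC's only evidence
}

def degenerate_peptides(p2p):
    """Peptides j with v_j > 1, i.e. shared by multiple proteins."""
    return {pep for pep, prots in p2p.items() if len(prots) > 1}

def one_hit_wonders(p2p):
    """Proteins i with u_i = 1, i.e. a single identified peptide."""
    counts = {}
    for prots in p2p.values():
        for prot in prots:
            counts[prot] = counts.get(prot, 0) + 1
    return {prot for prot, u in counts.items() if u == 1}
```

Here PEP1 is the only degenerate peptide, while ProtB and ProtC are one-hit wonders (ProtA has two identified peptides).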

We can divide the analysis efforts of protein inference into different stages to reduce its complexity. As shown in Figure 2, protein inference consists of three phases: peptide identification post-processing, peptide assembly and result validation. In the forthcoming sections, we will review existing solutions to these three sub-problems listed in Figure 2. Note that the step of peptide assembly is of central importance to the success of the whole procedure. We provide two general categorization frameworks to organize existing assembly approaches. The first one is application-oriented, which categorizes assembly algorithms according to their dependence on the output of some specific peptide search engines. The second one mainly uses the underlying algorithmic technique as the classification criterion. To understand the rationale behind our second framework, we need to first investigate the computational nature of such assembly process.

Figure 2:

The composition of protein inference problem. It consists of three steps: peptide identification post-processing, peptide assembly and result validation.

One can model the relationship between the identified peptides and the proteins in the database as a bipartite graph, as shown in Figure 3. This is the standard input setting for most assembly algorithms. The general problem of inferring correct proteins from this bipartite graph is difficult to solve due to the existence of ‘one-hit wonders’ and degenerate peptides. For example, the first protein and the second protein in Figure 3 have the same set of identified peptides: Peptide 1 and Peptide 2, which are degenerate peptides. If there is no other supporting information, we cannot determine which protein is indeed present in the sample. Furthermore, Protein 3 is a one-hit wonder that has Peptide 3 as its single peptide identification. Compared with Protein 4 having two identified peptides, it is not easy to determine the existence of Protein 3. In addition, the quality of peptide identification also influences the final protein identification results. But if we make some reasonable assumptions, it is possible to reveal the computational nature of such assembly problem.

Figure 3:

The bipartite graph represents the input of a standard peptide assembly problem. The top vertices represent peptide identifications. The bottom vertices are candidate proteins in the sample: proteins with at least one matching peptide.

First, we can assume that all m peptides are true positives. Under this assumption, we can derive the upper bound and lower bound on the number of possible proteins in the sample. Since there are at most n proteins that are associated with all identified peptides, it is easy to see that the upper bound of the number of proteins is n. In other words, it is impossible for us to identify extra proteins under the current problem setting. To date, some existing protein inference methods adopt the strategy of returning all potential proteins without filtering [15]. Obviously, the underlying assumption behind such a strategy is that the sample contains a large portion of homologous proteins.

It is not straightforward to calculate the lower bound on the number of proteins. Recall that all identified peptides are assumed to be correct, so the set of proteins reported by any assembly algorithm must account for every identified peptide; no identified peptide may be missing from the final protein list. Based on these observations, we can formulate the assembly task as a set covering problem [16, 17]: finding a minimum subset of proteins that covers all identified peptides. Here a peptide is said to be covered by a protein if there is an edge between them in the bipartite graph shown in Figure 3. The lower bound is therefore the number of proteins in the optimal solution of the corresponding set covering problem. It is well known that the set covering problem is NP-complete, making it infeasible to obtain the optimal solution in practice. Therefore, algorithms in this category [18, 19] adopt a parsimony philosophy and assume that a peptide is seldom shared by multiple proteins.
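In practice, the minimum set cover that defines the lower bound is usually approximated greedily, since finding the true optimum is intractable. A minimal sketch with toy data; the protein and peptide names are hypothetical:

```python
def greedy_set_cover(protein_to_peptides, all_peptides):
    """Greedy approximation to minimum set cover: repeatedly pick the
    protein that covers the most not-yet-covered peptides."""
    uncovered = set(all_peptides)
    chosen = []
    while uncovered:
        best = max(protein_to_peptides,
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        if not protein_to_peptides[best] & uncovered:
            break  # no protein covers the remaining peptides
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen

proteins = {
    "ProtA": {"p1", "p2", "p3"},
    "ProtB": {"p2", "p3"},      # evidence already explained by ProtA
    "ProtC": {"p4"},
}
minimal = greedy_set_cover(proteins, {"p1", "p2", "p3", "p4"})
```

Here the greedy cover is {ProtA, ProtC}: ProtB is redundant because its peptides are already explained by ProtA, mirroring the parsimony philosophy.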

Overall, the analysis of the upper and lower bounds hints at the nature of the assembly problem: any algorithm under this problem setting seeks a trade-off between two extreme cases. On the one hand, reporting all n proteins risks including many false positives. The reason is that we search the protein database for all proteins associated with the identified peptides, but some of these may be statistical artifacts that match the identified peptides only by chance. To clarify this point, consider an experiment on a synthetic mixture of known proteins, e.g. 18 proteins. There may be a protein T among these 18 proteins and another protein F in the database that share the same peptide sequence, even though F is not present in the sample. Such cases occur more frequently in real data as the number of proteins in the target sample increases. When a candidate protein is associated only with a peptide that is unique throughout the database, the probability of that protein being present is very high if the identification confidence of the unique peptide is high. In fact, in many protein inference tools such as ProteinProphet, the protein identification probability reduces to the peptide identification probability if the protein has only one unique identified peptide [see Equation (1) in the article]. That means the status of the protein (present or absent) is determined by whether its unique peptide is a statistical artifact. Even if such a unique peptide is confirmed by more than one MS/MS spectrum, this only indicates that its probability of being absent from the sample is extremely low; high identification confidence cannot be interpreted as ‘there are no statistical errors’. Hence, without any filtering it is very likely that many false positives will be included.
On the other hand, retaining only a minimum subset of proteins may lose many true positives. Therefore, we have to face this dilemma in the design of effective assembly algorithms. Figure 4 shows that most current peptide assembly algorithms find a subset of proteins whose size is between min (the lower bound) and max (the upper bound).

Figure 4:

The number of inferred proteins under five circumstances. In this figure, ‘Max’ denotes the upper bound (the number of all associated proteins) and ‘Min’ denotes the lower bound (the number of proteins in the optimal solution of a set covering problem) when it is assumed that all identified peptides are true positives. Under this assumption, most existing peptide assembly algorithms can be regarded as the trade-off between two extreme solutions. If we relax the assumption that all identified peptides are correct, a new yet smaller lower bound ‘P’ could be achieved, which corresponds to the number of proteins in the optimal solution of a partial set covering problem. Similarly, we can obtain a new upper bound ‘E’ by introducing additional information from other data sources into the assembly procedure.

The above analysis is based on two assumptions. The first is that all identified peptides are correct. If we relax this strong assumption and suppose that some peptide identifications are false positives, the number of proteins may be less than the current lower bound. In Figure 4, we use P to denote the new lower bound when not all m peptide identifications are correct. If we knew the exact fraction of incorrect peptide identifications, this new lower bound would be the number of proteins in the optimal solution of a partial set covering model [20]. In addition, it is impossible to raise the upper bound using only the information in the bipartite graph. To improve the identification coverage, we have to borrow additional information from other data sources; that is, we change the problem setting by introducing extra information as input to the assembly procedure. To date, researchers have used raw MS/MS spectra, single-stage MS data, peptide expression profiles, mRNA expression data, protein interaction networks and gene models as additional input to improve coverage. As a result, it is possible to obtain a new upper bound E on the number of possible proteins, as shown in the right part of Figure 4. More importantly, we can categorize available peptide assembly methods according to the kinds of extra information they exploit.

The second assumption is that all peptides have the same likelihood of being detected. However, a protein's peptide fragments are not observed with equal probability: for any given protein, only a few ‘proteotypic’ peptides are repeatedly and consistently identified on a particular proteomic platform [6, 21, 22]. Proteotypic peptides are widely used in quantitative proteomics, and peptide detectability, a concept based on proteotypic peptides, has been introduced into protein inference. Peptide detectability is defined as the probability of observing a peptide in a standard sample using a standard proteomics routine [23, 24]. In other words, peptide detectability can be considered an intrinsic property of a peptide, determined mainly by its primary sequence and its parent protein sequence. A degenerate peptide can then be assigned to each of its parent proteins with a corresponding probability, assuming it comes from that protein. This concept explains the assignment of degenerate peptides in a principled way, compared with the use of weights [12], which presumes that a degenerate peptide can come from only one protein, or the grouping of peptides [25], which puts peptide sequences with indistinguishable predicted spectra into the same group.
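Under the detectability model, a degenerate peptide's evidence can be split among its parent proteins in proportion to the peptide's predicted detectability in each parent. A minimal sketch; the detectability values below are made-up illustrative numbers, not predictions from a real model:

```python
def split_by_detectability(detectability):
    """Assign a shared (degenerate) peptide to each parent protein with
    probability proportional to the peptide's predicted detectability
    in that parent protein."""
    total = sum(detectability.values())
    return {prot: d / total for prot, d in detectability.items()}

# Predicted detectability of one degenerate peptide in its two parents.
shares = split_by_detectability({"ProtA": 0.8, "ProtB": 0.2})
```

The resulting shares sum to one, so the peptide contributes a total of one unit of evidence, distributed according to how likely each parent is to have produced it.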

In summary, the protein inference problem can be divided into three steps: peptide identification post-processing, peptide assembly and result validation. Since peptide assembly is of central importance to the success of the whole procedure, we further investigate its computational nature. Through the analysis, researchers may know the limitations of existing methods and develop new solutions.

PEPTIDE IDENTIFICATION POST-PROCESSING

Before peptide assembly, it is necessary to validate the peptide identification results. One simple approach is to use decoy (reversed, randomized or shuffled) sequence database [26]. An alternative approach is the probabilistic model (such as PeptideProphet [27]).
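For the decoy-database approach, the false discovery rate at a given score threshold is commonly estimated as the ratio of decoy hits to target hits above that threshold. A minimal sketch; the scores and labels are toy values:

```python
def decoy_fdr(psms, threshold):
    """Estimate FDR at a score threshold as (#decoy hits) / (#target hits).
    `psms` is a list of (score, is_decoy) pairs from a target-decoy search."""
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

psms = [(52.0, False), (47.0, False), (41.0, True), (38.0, False), (30.0, True)]
```

At a threshold of 40, two target hits and one decoy hit survive, giving an estimated FDR of 0.5; raising the threshold removes the decoy hit and the estimate drops to zero.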

Zhang et al. [28] use a multivariate non-linear discriminant function to control the quality of database search results. A Bayesian non-parametric (BNP) model [29] incorporates a probability framework into the randomized database searching method. Both methods are based on the target–decoy database search strategy. The BNP model can incorporate more features into a linear discriminant function if needed. Another robust, non-parametric method [30] uses a novel indicator, the probability ratio, to exploit the statistical information provided by the first and second best scores. The probability ratio is a conditional probability indicator that uses the score of the second match (or subsequent matches) to determine spectral quality. Error rates associated with peptide identification are calculated by a fully automated, non-parametric algorithm.

PeptideProphet [27] automatically validates peptide assignments to MS/MS spectra made by database search programs such as SEQUEST. PeptideProphet first combines multiple scores from a standard search engine (SEQUEST, Mascot, etc.) into a single score, which is called ‘discriminant score’. Such a discriminant score is usually generated from the linear combination of raw scores, in which the weight coefficients are empirically tuned parameters for different search engines. After PeptideProphet calculates the discriminant scores for all the spectra in a sample, it learns distributions of the discriminant scores among correct and incorrect peptides from each dataset, and uses those distributions to compute for each result a probability that it is correct.
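The mixture-modeling idea behind PeptideProphet can be illustrated with a simplified two-Gaussian EM fit over discriminant scores. PeptideProphet itself uses a Gaussian for correct matches and a Gamma distribution for incorrect ones, so this is only a sketch of the principle with toy scores:

```python
import math

def mixture_posteriors(scores, iters=100):
    """Fit a two-component Gaussian mixture to discriminant scores with EM
    and return P(correct | score) for each PSM (simplified PeptideProphet)."""
    mu1, mu0 = max(scores), min(scores)          # crude initialization
    sd1 = sd0 = (mu1 - mu0) / 4 or 1.0
    pi = 0.5                                     # prior P(correct)

    def pdf(x, mu, sd):
        return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

    post = [0.5] * len(scores)
    for _ in range(iters):
        # E-step: posterior that each score came from the 'correct' component
        post = [pi * pdf(x, mu1, sd1) /
                (pi * pdf(x, mu1, sd1) + (1 - pi) * pdf(x, mu0, sd0))
                for x in scores]
        # M-step: update mixing weight, means and standard deviations
        w1, w0 = sum(post), sum(1 - p for p in post)
        pi = w1 / len(scores)
        mu1 = sum(p * x for p, x in zip(post, scores)) / w1
        mu0 = sum((1 - p) * x for p, x in zip(post, scores)) / w0
        sd1 = math.sqrt(sum(p * (x - mu1) ** 2 for p, x in zip(post, scores)) / w1) or 1e-3
        sd0 = math.sqrt(sum((1 - p) * (x - mu0) ** 2 for p, x in zip(post, scores)) / w0) or 1e-3
    return post
```

On a bimodal toy dataset, scores in the high cluster receive posteriors near 1 and scores in the low cluster receive posteriors near 0, which is exactly the per-PSM probability PeptideProphet reports.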

Recently, Choi et al. [31] presented two flexible methods, the variable component mixture model and the semiparametric mixture model, to remove the restrictive parametric assumptions in the mixture modeling approach of PeptideProphet. The most recent version of PeptideProphet provides an option to use non-parametric modeling with target–decoy database searches.

At the same time, Choi et al. [32] propose a semi-supervised framework that combines probabilistic modeling approach of PeptideProphet with the decoy strategy. It presents a semi-supervised expectation–maximization (EM) algorithm for constructing a Bayes classifier for peptide identification using the probability mixture model, extending PeptideProphet to incorporate decoy peptide matches.

PEPTIDE ASSEMBLY

To date, many methods for peptide assembly have been proposed. These methods can be classified into different categories according to different criteria, and it is important to have a systematic framework that is expandable to cover all available methods. With these considerations in mind, we propose two hierarchical frameworks in Figures 5 and 6, respectively. In Figure 5, peptide assembly methods are divided into two classes according to their dependence on the peptide search engines. In Figure 6, peptide assembly methods are categorized according to the underlying algorithmic techniques. In the following subsections, we discuss each category in detail. Furthermore, we provide a comprehensive list of all available methods from these two angles in Table 1.

Figure 5:

The first framework for categorizing peptide assembly methods. The classification criterion is the dependence of assembly algorithms on the peptide search engines. Search engine-dependent approaches require that the identification results be obtained from a specific peptide search engine. In contrast, search engine-independent approaches can perform protein inference using output from any peptide search engine.

Figure 6:

The second framework for categorizing peptide assembly methods. The classification criterion is based on the underlying algorithmic techniques. The bipartite graph model utilizes only the information in the standard bipartite graph representation. To improve identification coverage, the supplementary information model borrows additional information from various data sources. Currently, the extra information used in peptide assembly can be either MS-generated data, such as peptide expression profiles, or data from other sources, such as protein interaction networks.

Table 1:

An overall list of available tools for peptide assembly

Assembly method | Algorithm | Search engines | URL
DTASelect [15] | Optimistic Model | SEQUEST | http://fields.scripps.edu/downloads.php
DBParser [18] | Parsimonious Model | Mascot | NA
IDPicker [19, 45] | Parsimonious Model | Search engine-independent | http://fenchurch.mc.vanderbilt.edu/
MassSieve [48] | Parsimonious Model | Search engine-independent | http://www.ncbi.nlm.nih.gov/staff/slottad/MassSieve/
LDFA [24] | Parsimonious Model | Search engine-independent | http://darwin.informatics.indiana.edu/applications/ProteinInference/
ProteinProphet [12] | Non-parametric Model | Search engine-independent | http://tools.proteomecenter.org/wiki/index.php?title=Software:TPP
EBP [46] | Non-parametric Model | Search engine-independent | http://bioinf.itmat.upenn.edu/ebp/
PANORAMICS [25] | Non-parametric Model | Mascot | NA
PROVALT [43] | Non-parametric Model | Mascot | http://kiwi.rcr.uga.edu/tcprot
MSBayesPro [49] | Non-parametric Model | Search engine-independent | http://darwin.informatics.indiana.edu/yonli/proteininfer/
Fido [50] | Non-parametric Model | Search engine-independent | http://noble.gs.washington.edu/proj/fido
Qscore [42] | Parametric Model | SEQUEST | NA
PROT_PROBE [51] | Parametric Model | Search engine-independent | NA
ComByne [1] | Parametric Model | Search engine-independent | http://bio.parc.com/mass_spec_analysis
A nested model [56] | Raw MS/MS Data | Search engine-independent | NA
HSM [13] | Raw MS/MS Data | Search engine-independent | NA
Barista [57] | Raw MS/MS Data | Search engine-independent | http://noble.gs.washington.edu/proj/crux
Scaffold [61] | Raw MS/MS Data | Search engine-independent | http://www.proteomesoftware.com/
PSC [20] | Single-stage MS data | Search engine-independent | http://bioinformatics.ust.hk/PSCMixture.rar
MS and MS/MS combined method [60] | Single-stage MS data | Search engine-independent | NA
PIPER [52] | Peptide Expression Profile | Search engine-independent | NA
MSNet [54] | Protein Interaction Network | Search engine-independent | http://aug.csres.utexas.edu/msnet
CEA [53] | Protein Interaction Network | Search engine-independent | http://bioinfo.vanderbilt.edu/cea
MSpresso [55] | mRNA Expression Data | Search engine-independent | http://www.marcottelab.org/MSpresso/
PeptideClassifier [58, 59] | Gene Model | Search engine-independent | http://www.mop.unizh.ch/software.html
MIPGEM [14] | Gene Model | Search engine-independent | NA

Note that our main objective in this article is to summarize available peptide assembly methods from an algorithmic perspective. Although some systems that automate data management and processing in MS-driven proteomics analyses do provide protein inference functionality, they generally re-use existing methods rather than develop new algorithms. For example, ms_lims [33] reports all the matching proteins for each identified peptide and selects a representative protein as the primary protein hit according to the algorithm published in [34]. Therefore, we omit such papers from our review.

Search engine-based classification

There are many well-known programs for peptide identification, including database search programs (such as SEQUEST, Mascot, X!Tandem and OMSSA [35]), de novo sequencing programs (such as PEAKS [36] and PepNovo [37]) and hybrid tools (such as GutenTag [38] and InsPecT [39]). Among these methods, SEQUEST and Mascot are the most widely used peptide identification approaches.

SEQUEST [8] uses two scoring functions. The first computes a preliminary score (Sp) by simply counting the number of common peaks between the experimental and theoretical spectra; it is used to rapidly determine a small subset of peptide candidates for each spectrum. The second computes a normalized correlation score (XCorr), which is the scalar dot product between the acquired and theoretical spectra with a correction factor.

Mascot [7] incorporates a probability-based implementation of the Mowse algorithm [40]. By calculating the distribution of tryptic peptide lengths across the entire search database, one can compute the probability P that the observed match between the experimental data and the database sequence is a random event [41]. The ion score for an MS/MS match is then defined as −10log10(P), which is used as the primary score in Mascot.
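The ion score transformation itself is a one-liner; for example, a match whose probability of being a random event is P = 0.001 receives an ion score of 30:

```python
import math

def ion_score(p_random):
    """Mascot-style ion score: -10 * log10(P), where P is the probability
    that the observed peptide-spectrum match is a random event."""
    return -10.0 * math.log10(p_random)
```

The logarithmic scale means every 10-fold decrease in the random-match probability adds 10 points to the score.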

Due to the popularity of SEQUEST and Mascot, some peptide assembly methods can only handle identification results from one of these two engines. Therefore, we divide existing assembly algorithms into two groups: search engine-dependent approaches and search engine-independent approaches. As shown in Figure 5, current search engine-dependent approaches are specific to either SEQUEST or Mascot.

Search engine-dependent approach

SEQUEST-based approach

Such approaches mainly use XCorr and Sp scores as input to perform protein inference. Typical examples include DTASelect [15] and Qscore [42].

  • DTASelect collates the peptide identifications for each protein, and then applies user-defined criteria (such as thresholds for XCorr scores) to select identification results. DTASelect groups together proteins with identical sets of identified peptides and uses a similarity score to describe the relationship between proteins with overlapped peptide identifications. It removes proteins for which the observed evidence is a subset of the peptides observed for another protein.

  • Qscore was designed specifically for database search results from SEQUEST, but it can be extended to results from other database search programs. SEQUEST returns a best-match peptide as long as at least one peptide from the database falls within the peptide mass tolerance, so the quality of that match is not guaranteed. Therefore, Moore et al. [42] developed a statistically based algorithm to determine the goodness of a protein match. It resembles an approximation to a binomial distribution, with protein size, peptide match quality, number of peptides matching a protein and size of the spectral dataset as parameters.
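DTASelect's removal of proteins whose observed evidence is a subset of another protein's evidence (described above) can be sketched as follows; the protein and peptide names are toy values:

```python
def remove_subsumed(protein_to_peptides):
    """Keep only proteins whose identified-peptide set is NOT a proper
    subset of another protein's set (DTASelect-style subsumption removal).
    Proteins with identical peptide sets are all kept, since DTASelect
    groups such indistinguishable proteins together instead."""
    return {
        prot: peps
        for prot, peps in protein_to_peptides.items()
        if not any(peps < other
                   for o, other in protein_to_peptides.items() if o != prot)
    }

candidates = {
    "ProtA": {"p1", "p2"},
    "ProtB": {"p1"},        # proper subset of ProtA's evidence -> removed
    "ProtC": {"p3"},
}
kept = remove_subsumed(candidates)
```

Python's `<` on sets tests for a proper subset, which is exactly the subsumption relation: ProtB adds no explanatory power beyond ProtA, so it is dropped.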

Mascot-based approach

Since Mascot reports a probability-based score for each identified peptide, Mascot-based assembly algorithms mainly concentrate on how to manipulate these scores to obtain the desired output. PROVALT [43], DBParser [18] and PANORAMICS [25] fall into this category.

  • PROVALT chooses a sequence of Mascot score thresholds for proteins that match different numbers of peptides. For example, a protein could be accepted with three scores of at least 30, two scores of at least 35 or one score of at least 50. The thresholds are chosen to meet a user-specified false discovery rate (FDR) by using deliberate false positives, such as reverse sequences [44].

  • DBParser is a web-based application. It is designed to parse Mascot search results. DBParser reports proteins in six hierarchical categories: distinct, differentiable, subsumable, superset, subset and equivalent. Parsimony analysis is employed to derive a sufficient protein list to account for the observed peptide identifications.

  • PANORAMICS is a probability model for peptide assembly that handles peptide identifications provided by Mascot. The algorithm groups proteins having the same sets of matched peptides and calculates reasonable probabilities for grouped proteins with respect to the search database. The probabilities are in line with false positive estimates that can be approximated by reverse database searching.

Search engine-independent approach

When a peptide assembly method, such as ProteinProphet [12], IDPicker [19, 45] and MIPGEM [14], can handle the results from any search engine, we consider it search engine-independent and include it in this category. Such independence is very important since we often need to perform protein inference using peptide identification results from different search engines. Note that some of the methods use probabilities rather than raw scores as input, such as ProteinProphet [12] and EBP [46]. Such requirements can be accomplished by the peptide identification post-processing step, as we have discussed in the ‘Peptide Identification Post-processing’ section. Since most assembly methods are independent of peptide search engines, we will discuss them in detail in the following subsection.

Algorithm-based classification

Different from the search engine-based classification, the algorithm-based classification focuses on the underlying algorithmic techniques used in peptide assembly tools. As we have discussed in ‘The Protein Inference Problem’ section, the standard bipartite graph model suffers from certain drawbacks: given a set of identified peptides, it is impossible to push the protein identification coverage beyond its upper bound if no extra information is included. Therefore, researchers have begun to investigate the possibility of utilizing different kinds of supplementary information in protein inference. Figure 6 summarizes these methods.

Since each category generally contains more than one method, we choose one typical method per category to describe its mathematical characteristics.

Bipartite graph model

Parsimonious model

In the bipartite graph model, one simple solution for peptide assembly is to employ the principle of parsimony. It applies Occam's razor [47] to deal with homologous proteins and degenerate peptides by assuming that only a small subset of proteins should be sufficient to explain all identified peptides.

Here, the objective is to find a minimum subset of proteins that covers all identified peptides. Due to the NP-hardness of this problem (an instance of minimum set cover), a fast greedy algorithm is often used to find a subset of protein vertices that covers all of the peptide vertices (see the top panel of Figure 3). The greedy algorithm iteratively chooses the protein vertex that connects to the largest number of uncovered peptide vertices. When all peptides have been covered, the remaining proteins are eliminated. The proteins selected as the set cover constitute the parsimonious protein list.
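As a concrete illustration, the greedy heuristic described above can be sketched in a few lines of Python. This is a minimal sketch, not any tool's actual implementation; the protein and peptide names are hypothetical.

```python
def greedy_protein_cover(protein_to_peptides):
    """Greedy set cover: return a small protein list explaining all peptides."""
    uncovered = set()
    for peps in protein_to_peptides.values():
        uncovered |= set(peps)
    selected = []
    while uncovered:
        # Pick the protein connected to the most still-uncovered peptides.
        best = max(protein_to_peptides,
                   key=lambda p: len(set(protein_to_peptides[p]) & uncovered))
        gain = set(protein_to_peptides[best]) & uncovered
        if not gain:
            break  # no protein explains the remaining peptides
        selected.append(best)
        uncovered -= gain
    return selected

proteins = {
    "P1": ["pepA", "pepB", "pepC"],
    "P2": ["pepB"],              # evidence is a subset of P1, never selected
    "P3": ["pepC", "pepD"],
}
print(greedy_protein_cover(proteins))  # -> ['P1', 'P3']
```

Note that proteins whose peptide evidence is a subset of another protein's (such as P2 above) are never chosen, matching the subset-removal behavior described for DBParser and DTASelect.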

DBParser [18], IDPicker [19, 45], MassSieve [48] and LDFA [24] use the parsimony principle to solve the peptide assembly problem.

IDPicker first filters peptide identifications using a user-chosen false-positive rate, which can be estimated from the percentage of decoy matches at every potential cut point. Then, it selects candidate proteins using the greedy algorithm according to the number of peptides matched to each protein. Finally, IDPicker reports the protein and peptide information via HTML or other formats. IDPicker combines multiple scores on a per-sample basis and groups together duplicate peptide identifications. If the user is compiling a report from multiple runs, IDPicker can apply an appropriate hierarchy to organize the data.
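The decoy-based filtering step can be sketched as follows. This is a hedged illustration rather than IDPicker's actual code: the scores and decoy labels are hypothetical, and the FDR at each cut point is estimated as the ratio of decoy to target matches.

```python
def fdr_threshold(scored_ids, target_fdr):
    """scored_ids: (score, is_decoy) pairs. Return the loosest score cutoff
    whose estimated FDR (decoys/targets) stays within target_fdr, or None."""
    ordered = sorted(scored_ids, key=lambda t: t[0], reverse=True)
    best = None
    decoys = targets = 0
    for score, is_decoy in ordered:
        decoys += is_decoy
        targets += not is_decoy
        fdr = decoys / max(targets, 1)
        if fdr <= target_fdr:
            best = score  # loosest cut point still meeting the target
    return best

ids = [(90, False), (85, False), (80, True), (75, False), (70, True)]
print(fdr_threshold(ids, target_fdr=0.4))  # -> 75
```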

DBParser first classifies proteins into six hierarchical categories: distinct, differentiable, subsumable, superset, subset and equivalent. Then parsimony analysis is employed to derive a protein list sufficient to account for all identified peptides. That is, DBParser reports all of the distinct proteins first (distinct and differentiable proteins), and then nondistinct proteins (non-redundant superset and non-redundant equivalent proteins). The same method is also applied to MassSieve [48] for redundancy removal of protein identifications.

LDFA [24] incorporates peptide detectability into the set cover problem formulation. LDFA employs a simple greedy algorithm to assign the peptide with the lowest detectability to the corresponding protein first in order to solve the minimum missed peptide problem, which can be reduced to the set cover problem.

Statistical model

Although deterministic approaches that use the parsimony principle are easy to understand, they are not as informative as statistical methods. Statistical methods can assemble large quantities of weak evidence to estimate the probability that a given protein is present. The basic idea is to perform statistical analysis using peptide scores and other related information. These statistical tools can be divided into two categories: non-parametric models and parametric models.

Non-parametric model

Non-parametric (or distribution-free) methods make no or few assumptions about the probability distributions of the variables being assessed. Therefore, they are easier to use and more robust than parametric methods.

ProteinProphet [12] is the most widely used method for solving the peptide assembly problem. It employs an iterative procedure to estimate the protein probabilities. More precisely, it computes protein probabilities with the assumption that distinct peptide matches are independent, and then recomputes peptide probabilities conditioned on the protein probabilities. The above iteration process continues until convergence.

Mathematically, the protein probability is calculated as:

(1)
P_n = 1 - \prod_{i} \left(1 - w_{i,n}\, p_i\right)

where Pn is the probability that protein n is present in the sample, w_{i,n} is the weight of peptide i actually corresponding to protein n, and p_i is the probability that peptide i is correctly identified. If peptide i has Ns parent proteins, then

(2)
w_{i,n} = \frac{P_n}{\sum_{s=1}^{N_s} P_s}

We iteratively compute the protein probabilities Pn and the weights w_{i,n} using Equations (1) and (2).

In addition,

(3)
p(+ \mid D_i, \mathrm{NSP}_i) = \frac{p(\mathrm{NSP}_i \mid +)\, p(+ \mid D_i)}{p(\mathrm{NSP}_i \mid +)\, p(+ \mid D_i) + p(\mathrm{NSP}_i \mid -)\, p(- \mid D_i)}

denotes the probability of correct identification of peptide i given its assignment information Di and the number of its sibling peptides NSPi in protein n. The number of sibling peptides (NSPs) of peptide i is defined as

(4)
\mathrm{NSP}_i = \sum_{m \neq i} w_{m,n}\, p(+ \mid D_m)

where p(+|Dm) is the probability that peptide m, from the same protein as peptide i, is correct given its information Dm. The information D is provided by search engines and may include useful quantities such as matching scores and the number of missed cleavages.

Then, the NSP distributions among correct and incorrect peptide assignments, p(NSP ∈ B | +) and p(NSP ∈ B | −), are calculated for each bin B in a similar way:

(5)
p(\mathrm{NSP} \in B \mid +) = \frac{\sum_{i:\, \mathrm{NSP}_i \in B} p(+ \mid D_i)}{N\, p(+)}

where N is the total number of peptide assignments and p(+) is the prior probability of a correct peptide assignment.
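To make the iteration concrete, the following Python sketch implements the loop of Equations (1) and (2) only; the NSP adjustment of Equations (3)–(5) is omitted for brevity. The peptide probabilities and parent lists are hypothetical inputs, assumed to come from a post-processing tool such as PeptideProphet.

```python
def infer_protein_probs(peptide_probs, peptide_parents, n_iter=50):
    """Iterate Equations (1) and (2): apportion degenerate peptides by
    protein probability, then recompute the protein probabilities."""
    all_prots = {prot for parents in peptide_parents.values() for prot in parents}
    P = {prot: 0.5 for prot in all_prots}  # initial protein probabilities
    for _ in range(n_iter):
        # Equation (2): weight of peptide i for each of its parent proteins.
        w = {}
        for pep, parents in peptide_parents.items():
            total = sum(P[prot] for prot in parents) or 1e-12
            for prot in parents:
                w[(pep, prot)] = P[prot] / total
        # Equation (1): P_n = 1 - prod_i (1 - w_{i,n} * p_i)
        for prot in all_prots:
            prod = 1.0
            for pep, parents in peptide_parents.items():
                if prot in parents:
                    prod *= 1.0 - w[(pep, prot)] * peptide_probs[pep]
            P[prot] = 1.0 - prod
    return P

peps = {"pepA": 0.95, "pepB": 0.9, "pepC": 0.2}
parents = {"pepA": ["P1"], "pepB": ["P1", "P2"], "pepC": ["P2"]}
probs = infer_protein_probs(peps, parents)
print(probs)  # P1, backed by the unique pepA, ends up far more probable than P2
```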

The EBP [46] method is an empirical Bayesian approach for handling the assembly problem when input peptide identifications are obtained from multiple search algorithms and replicate experimental runs. It also uses an iterative process to perform inference.

PANORAMICS [25] is a probabilistic approach that is similar to ProteinProphet. It integrates peptide probabilities with protein probabilities and jointly computes both in an iterative manner. Specifically, the algorithm groups proteins with the same sets of matched peptides and calculates reasonable probabilities for grouped proteins with respect to the search database. The probabilities are in line with false-positive estimates that can be approximated by reverse database searching. In PANORAMICS, even peptides with low probabilities are considered in the assembly process. This is in direct contrast to popular assembly methods, in which peptides with probabilities below a threshold are excluded from a protein assembly. Thus, PANORAMICS provides a more accurate measure for assessing the confidence of protein identifications. Moreover, compared to popular assembly methods, PANORAMICS has a shorter computing time.

MSBayesPro [49] is proposed for estimating protein probabilities and places particular emphasis on the peptide degeneracy problem.

All the shared peptides and unique peptides assigned to a group of proteins form a peptide configuration (y1,···,yi,···,yn), where yi = 1 when peptide i is identified and yi = 0 otherwise. Then, the protein inference problem is reduced to finding the maximum a posteriori (MAP) protein configuration (x1,···, xi,···, xm), which maximizes the conditional probability P(x1,···, xm | y1,···, yn):

(6)
(x_1^*, \ldots, x_m^*) = \arg\max_{x_1, \ldots, x_m} P(x_1, \ldots, x_m \mid y_1, \ldots, y_n)

where xi = 1 if protein i is present and xi = 0 otherwise.

The conditional probability is

(7)
P(x_1, \ldots, x_m \mid y_1, \ldots, y_n) = \frac{P(y_1, \ldots, y_n \mid x_1, \ldots, x_m)\, P(x_1, \ldots, x_m)}{P(y_1, \ldots, y_n)}

in which P(x1,···, xm) is the prior probability of the protein configuration. Suppose proteins are independent of each other; then

(8)
P(x_1, \ldots, x_m) = \prod_{i=1}^{m} P(x_i)

and

(9)
P(y_1, \ldots, y_n \mid x_1, \ldots, x_m) = \prod_{j=1}^{n} P(y_j \mid x_1, \ldots, x_m), \qquad P(y_j = 0 \mid x_1, \ldots, x_m) = \prod_{i=1}^{m} (1 - d_{ij})^{x_i}

where P(yj = 1|xi = 1, xk = 0, k ≠ i, 1 ≤ k ≤ m) is the probability of peptide j being identified if only protein i is present in the sample. This probability is the detectability of peptide j when it comes from protein i, referred to as the standard peptide detectability dij. Substituting Equations (8) and (9) into Equation (7) leads to

(10)
P(x_1, \ldots, x_m \mid y_1, \ldots, y_n) \propto \prod_{i=1}^{m} P(x_i) \prod_{j=1}^{n} P(y_j \mid x_1, \ldots, x_m)

In addition to this basic model, the authors also propose an advanced model which incorporates the peptide identification scores into the Bayesian model.
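For small connected components, the MAP configuration of Equations (6)–(10) can be found by brute-force enumeration, as in the sketch below. The detectability values, the uniform protein prior and the miss-probability form P(yj = 0 | x) = ∏i (1 − dij)^xi are illustrative assumptions, not MSBayesPro's exact implementation.

```python
from itertools import product

def map_protein_config(detect, observed, prior=0.5):
    """detect[prot][pep] = d_ij; observed[pep] in {0, 1}.
    Return the protein configuration maximizing the unnormalized posterior."""
    prots = sorted(detect)
    best, best_score = None, -1.0
    for x in product([0, 1], repeat=len(prots)):
        present = [p for p, xi in zip(prots, x) if xi]
        score = 1.0
        for xi in x:                      # Equation (8): independent priors
            score *= prior if xi else (1 - prior)
        for pep, y in observed.items():   # Equation (9): likelihood per peptide
            # A peptide is missed only if every present parent fails to yield it.
            p_miss = 1.0
            for prot in present:
                p_miss *= 1.0 - detect[prot].get(pep, 0.0)
            score *= (1.0 - p_miss) if y else p_miss
        if score > best_score:
            best, best_score = dict(zip(prots, x)), score
    return best, best_score

detect = {"P1": {"pepA": 0.9, "pepB": 0.8}, "P2": {"pepB": 0.7, "pepC": 0.9}}
observed = {"pepA": 1, "pepB": 1, "pepC": 0}
config, _ = map_protein_config(detect, observed)
print(config)  # -> {'P1': 1, 'P2': 0}: P1 alone explains the observations best
```

Because the unobserved pepC is highly detectable from P2, including P2 is penalized, and the shared pepB is explained by P1 alone.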

Recently, Serang et al. [50] introduce a novel Bayesian method named Fido for computing discriminative and interpretable posterior protein probabilities. The authors build a model using a few relatively simple assumptions and develop fast algorithms. The model uses three parameters, α, β, γ, where α is the probability of generating associated peptides from present proteins, β is the probability of creating peptides from a noise model, and γ is the prior probability of each protein. With respect to the peptide degeneracy problem, this approach automatically apportions information from degenerate peptides during the marginalization procedure, rather than requiring an ad hoc adjustment. They also describe a series of heuristics including partitioning, clustering and pruning, which substantially increase the efficiency of computing posterior probabilities. In contrast to sampling, marginalizing yields an exact, closed-form solution in a finite period of time.

Parametric model

A parametric model assumes that the data come from a probability distribution and makes inferences about the parameters of that distribution. Since parametric methods make more assumptions than non-parametric methods, they produce more accurate protein probability estimates when those extra assumptions are correct.

Qscore [42] applies a statistical algorithm resembling an approximation to a binomial distribution and uses parameters such as protein size, peptide match quality, number of peptide matches to a protein and size of spectral data set.

PROT_PROBE [51] uses a binomial distribution when appropriate probabilities for database matches are available and develops a multinomial model that can be used with any database search results in the general case. In the following, we mainly introduce the binomial model.

First, PROT_PROBE assumes that protein identification follows a binomial model and that each database search result is a random Bernoulli event. The trial has two outcomes: a protein is either identified or not. Two binomial distributions are determined for protein assignment. The probability of protein identification at each Bernoulli event is determined either from the relative length of the protein in the database (the null hypothesis, H0) or from the hypergeometric probabilities of the peptides (the alternative hypothesis, H1).

H1 states that the peptide match distribution is governed by the hypergeometric probabilities determined by PEP_PROBE results. If there are N spectra in the data set and K matches (each with probability pi) to a protein A, then the probability of all combinations leading to K matches (with specified probabilities pi) is estimated as

(11)
P_1 = \binom{N}{K} P(A)^K \left(1 - P(A)\right)^{N-K}

where P(A) is

(12)
P(A) = \left( \prod_{i=1}^{K} p_i \right)^{1/K}

The model can be interpreted as a Bernoulli trial where the probability of matching protein A is P(A), and the probability of missing this protein is (1 − P(A)).

The null hypothesis H0 states that the distribution of peptide matches to a protein is governed by the binomial distribution determined from the database. The binomial probability of K matches in N trials is

(13)
P_0 = \binom{N}{K} \Pr(A)^K \left(1 - \Pr(A)\right)^{N-K}

where Pr(A) is the ratio of the number of amino acids of protein A to the total number of amino acids in the database.

The maximum likelihood ratio, LR, is the ratio of P1 to P0. Then, PROT_PROBE assigns −log(LR) as a score to every protein.
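As a numerical illustration of the null model, Equation (13) can be evaluated directly. The counts below are hypothetical, and this is only the H0 side of the likelihood ratio; computing P1 additionally requires the peptide match probabilities.

```python
import math

def binomial_null_prob(K, N, protein_len, database_len):
    """Equation (13): binomial probability of K matches to protein A in N
    trials, with per-trial probability Pr(A) = protein's share of residues."""
    pA = protein_len / database_len
    return math.comb(N, K) * pA**K * (1 - pA)**(N - K)

# 5 matches among 10,000 spectra to a 500-residue protein in a 5e6-residue db:
P0 = binomial_null_prob(5, 10_000, 500, 5_000_000)
print(P0)  # a small probability: 5 matches are unlikely under the null model
```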

ComByne [1] takes a p-value approach, modeling the probability that the peptide identifications would arise by chance alone. It makes use of information such as protein lengths, retention times and spectrum-to-spectrum correlation. ComByne considers the peptide assembly problem from the viewpoint of multiple hypothesis testing, in which the possibility of each protein being present in the sample is assessed. This is different from those methods whose aim is to optimally divide the proteins into two groups: present and absent.

Optimistic model

In contrast to the parsimonious model, the optimistic model adopts the strategy of returning all potential proteins that meet some simple criterion, without further filtering. Obviously, the underlying assumption behind such a strategy is that the sample contains a large proportion of homologous proteins.

As a typical example of optimistic model, DTASelect [15] reports proteins that have a sufficient number of different peptides or have at least one peptide showing up several times. It groups together proteins with identical sets of identified peptides and removes proteins for which the observed evidence is a subset of the peptides observed for another protein.

Supplementary information model

In the bipartite graph model, the output cannot be further improved no matter how good the algorithm is, since the input information is limited. In order to improve identification coverage and accuracy, extra information is employed. Borrowing extra information changes the input of the protein inference problem: in addition to the standard input, supplementary information such as raw MS/MS data and single-stage MS data is used as part of the input. These algorithms use supplementary information to identify proteins that may not be identifiable with high confidence by MS/MS evidence alone, but are nevertheless highly likely to be present, as demonstrated by combining MS/MS evidence with supplementary information. Based on the type of extra information, the peptide assembly methods can be divided into two categories: those using MS-generated data (peptide expression profiles, raw MS/MS data, single-stage MS data) and those using data from other sources (protein interaction networks, mRNA expression data, gene models).

Different supplementary information models have their own characteristics. Here, we compare them from two aspects: availability and reliability. Methods that incorporate MS-generated data can be applied to the analysis of any sample since such data are always available. In contrast, approaches that borrow data from other sources can only work when the required supplementary information exists. With respect to reliability, the gene model is superior to the other kinds of supplementary information exploited in protein inference.

We also have another classification criterion: the data fusion strategy. Standard input and supplementary information are combined based on two different fusion strategies: sequential data fusion and parallel data fusion (either at the data level or at the decision level). As shown in Figure 7, the sequential fusion strategy first obtains a list of proteins identified from the standard input and then uses the supplementary data to adjust the result. For instance, PIPER [52], CEA [53], MSNet [54] and MSpresso [55] adopt the sequential fusion strategy. As shown in Figure 8, the parallel data fusion strategy handles standard input and supplementary information simultaneously. This strategy can be further categorized as low-level or high-level fusion, depending on the processing stage at which fusion takes place. Low-level fusion (fusion at the data level) combines standard input and supplementary information to produce new raw data as input. Existing methods such as the nested model [56], HSM [13], Barista [57], PSC [20] and PeptideClassifier [58, 59] fall into this category. High-level fusion (fusion at the decision level) combines identification results obtained from the standard input and from the supplementary information to produce a consensus protein list. One typical example is the peptide mass fingerprinting method using MudPIT-based MS data [60].

Figure 7:

The sequential fusion strategy. Some methods such as CEA, MSNet, PIPER introduce an additional stage to reprocess the protein list identified from the standard input.


Figure 8:

The parallel data fusion strategy: either at the data level or at the decision level. Data-level fusion, as in HSM, Barista and the nested model, directly combines standard input and supplementary data to generate the final result. Decision-level fusion uses the two kinds of information to generate two protein lists P1 and P2, respectively, and fuses P1 and P2 to create a high-confidence protein list.


MS-generated data

Raw MS/MS data

The MS/MS model mainly takes advantage of the raw MS/MS spectral information. Many algorithms designed for inferring proteins from a collection of PSMs divide the problem into two stages: assessing the quality of the PSMs and then inferring the protein set. Subdividing the protein identification problem in this fashion may result in a significant loss of information during the second stage of the analysis. For example, only a subset of spectra are assigned to a peptide during the peptide identification stage, so information about the unassigned spectra is not available to the peptide assembly algorithms. Furthermore, at most one peptide is typically assigned to each spectrum; if for a particular spectrum that assignment happens to be incorrect, then information about the second-ranked, possibly correct peptide is not available during the protein identification stage. Also, if the quality of the match between a peptide and a spectrum is summarized by a single score, such as the probability assigned by PeptideProphet, then detailed information on how the peptide matches the spectrum is lost. In contrast, the raw MS/MS models described below directly optimize the number of identified proteins, taking into account all available information to obtain the best possible result.

HSM [13] is a typical example of using raw MS/MS data. It constructs a hierarchical statistical model to assess the confidence of peptides and proteins inferred from tandem MS data. The model is composed of four connected layers (marginal/conditional distributions), as shown in Figure 9. In this framework, HSM starts with the identified proteins and peptides and considers three types of unobserved binary variables: the presence/absence status of proteins and peptides, and the correctness of each peptide assignment. HSM derives the probability distribution by incorporating multiple scores of the same peptide and the correlations among the uncertainties of peptides and proteins. An EM algorithm is then used to infer the model parameters over all connected components.

Figure 9:

Four layers of the HSM. Layer 1 is a marginal distribution and the other three layers are conditional distributions.


More precisely, suppose N is the number of proteins with at least one peptide hit and M is the number of peptides assigned to at least one spectrum. Yi is a binary indicator such that Yi = 1 if protein i is in the sample and 0 otherwise. Vi is a binary variable such that Vi = 1 if the number of peptides identified for protein i is beyond a threshold and 0 otherwise. Zj is a binary indicator such that Zj = 1 if peptide j is present in the digested sample and 0 otherwise. Wjk is a binary indicator such that Wjk = 1 if the k-th assignment of peptide j to a spectrum is correct and 0 otherwise, with corresponding matching score Sjk (k = 1, 2, …, Tj). Cj is the set of proteins that could potentially generate peptide j.

The model is:  

(14)
formula

Then, Y, Z and W are treated as the unobserved variables and the model parameters are estimated via the EM algorithm. Finally, the confidence of peptides and proteins can be calculated conditioned on S and V.

A nested model is proposed in Ref. [56], which can also estimate protein and peptide probabilities simultaneously. It allows for appropriate evidence feedback between proteins and their constituent peptides. The nested model is built on several reasonable assumptions but ignores the peptide degeneracy problem entirely. Similar to HSM, Barista [57] formulates protein identification as an integrated optimization problem. In Barista, the protein identification problem is represented as a tripartite graph, with layers corresponding to spectra, peptides and proteins (Figure 10). The input to the problem is the tripartite graph, with a fixed set of features assigned to each peptide-spectrum match. In this method, each PSM is represented using 17 features that collectively describe properties of the spectrum and the peptide, as well as the quality of the match between the observed and theoretical spectra. The model uses a machine learning method to directly optimize the total number of proteins identified by the experiment. Furthermore, it does not filter any PSMs at any stage of the analysis, based on the observation that low-scoring PSMs can imply a protein's presence when other PSMs from the same protein are available.

Figure 10:

Barista. The tripartite graph represents the protein identification problem, with layers corresponding to spectra (bottom), peptides (middle) and proteins (top). Barista computes a parameterized non-linear function on each PSM feature vector. The score assigned to a peptide is the maximum PSM score associated with it. The score assigned to a protein is a normalized sum of its peptide scores.


Scaffold [61] uses a peptide-spectra-protein graph structure for peptide assembly. Under this scheme, the original mass spectral evidence can be assigned to multiple possible peptides in a ‘peptide group’. That peptide group is linked to multiple proteins by their associated peptide sequences. Then Scaffold uses a greedy algorithm to assign the peptide group to the most confidently identified proteins.

Single-stage MS data

While MS-based methods provide wider coverage than MS/MS-based methods, their identification accuracy is lower since MS data have less information than MS/MS data. It is natural to consider the combination of MS data and MS/MS data in a unified model such that the identification performance can be further improved.

Peptide mass fingerprinting (PMF) is a technique used to identify proteins by matching observed peptide masses to theoretical peptide masses [40]. The presumption of PMF is that every protein has a set of unique peptides, and thus the masses of these peptides form its fingerprint [62–64]. Initially, PMF was proposed to identify single purified proteins separated by 2D gel electrophoresis [65] and was only applicable to data generated from high-accuracy instruments. Recently, PMF has been extended to identify protein mixtures from shotgun proteomics data [20, 66].

Some methods [20, 60] have been proposed to combine MS information and MS/MS data to identify the proteins in the sample, further improving identification performance, especially for ‘one-hit wonders’.

Peptide expression profile

Proteomics discovery platforms generate both peptide expression information and peptide identification information. In label-free quantitative proteomics studies, peptide expression information, such as peptide intensity information, has been widely used [67]. Recently, researchers have also introduced it into protein identification.

PIPER [52] assumes that peptides derived from the same protein have correlated expression profiles. The peptide-to-protein assignments, along with the peptide expression profiles, are submitted to PIPER for expression correlation filtering. The output is a list of differentially expressed proteins along with an estimate of the false-positive error rate.

Spectral counting [68, 69] is also a good method for label-free quantification. It measures the abundance of a given protein based on the number of tandem mass spectral observations for all its constituent peptides. Thus, it offers a practical alternative to peak intensity measurements, which rely heavily on computational efforts for chromatogram alignment and peak detection [70]. Using spectral counting to estimate peptide expression and thereby enhance protein identification would be an interesting problem for future research.
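As a minimal illustration, spectral counts can be aggregated per protein from a list of peptide-spectrum matches. The data and the peptide-to-protein mapping below are hypothetical, and shared peptides are ignored for simplicity.

```python
from collections import Counter

def spectral_counts(psms, peptide_to_protein):
    """psms: one peptide sequence per identified spectrum.
    Returns the spectral count of each protein."""
    counts = Counter()
    for pep in psms:
        counts[peptide_to_protein[pep]] += 1
    return dict(counts)

psms = ["pepA", "pepA", "pepB", "pepC", "pepA"]
mapping = {"pepA": "P1", "pepB": "P1", "pepC": "P2"}
print(spectral_counts(psms, mapping))  # -> {'P1': 4, 'P2': 1}
```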

Data from other sources

Besides MS-generated data, we can also use data from other sources to facilitate protein identification. For example, protein interaction network information has been used to enhance protein identification.

Protein interaction network

Protein–protein interactions occur when two or more proteins bind together to carry out some biological function. In most peptide assembly methods, proteins are considered independent entities. Nevertheless, accumulating evidence suggests that most biological functions arise from interactions among proteins, and a discrete biological function can rarely be attributed to an individual protein. Thus, some methods have started to take advantage of protein–protein interaction networks.

CEA [53] is a clique enrichment approach to rescue eliminated proteins by incorporating the relationship among proteins as embedded in a protein interaction network. The model is based on the general concept that proteins involved in the same biological process or pathway tend to be close to one another in the protein interaction network [71]. After peptide identification and peptide assembly, proteins in the maximal protein list are grouped into confident proteins and non-confident proteins, and then mapped to the protein interaction network. All maximal cliques are enumerated from the network and evaluated for the enrichment of confident proteins.

MSNet [54] improves protein identification in shotgun proteomics experiments by utilizing protein–protein interaction information from a gene functional network. In MSNet, each protein receives a revised identification score with contributions from both direct MS-based evidence and the MS evidence of neighboring proteins in the gene functional network. In this way, MSNet can rescue proteins that are present in the sample but were eliminated in the MS/MS-based protein identification.

mRNA expression data

At the transcription stage, genes are differentially transcribed depending on the chromatin arrangement and the activity of transcription factors. The stability and distribution of the different transcripts are then regulated; the regulatory processes include capping, splicing, addition of the poly(A) tail and RNA editing. The regulated mRNA is then decoded by the ribosome to produce a specific amino acid chain, or polypeptide, which later folds into an active protein.

This mRNA information can be applied to the peptide assembly process. For example, most peptide assembly methods assume that all proteins are equally likely to be present. In reality, other information may be readily available to help infer the probability of a protein's presence when evidence from the MS/MS experiment is weak. For example, the raw MS/MS identification score of protein A may fall below a given confidence threshold while its mRNA abundance is high. In this case, we have reason to believe that protein A is present in the sample.

MSpresso [55] re-examines protein identification scores with respect to their mRNA abundance. It boosts the protein identification scores given sufficient mRNA concentration. Proteins are then labeled present if their MSpresso probability is larger than a newly determined cutoff.

More formally, K is a Bernoulli variable where K = 1 is the event that the protein is present in the sample, and P(K = 1) is the probability of that event. The MSpresso score, P(K|S = s, M = m), is the posterior probability that a protein is present in the sample given its associated mRNA abundance M = m and its raw MS/MS protein identification score S = s. Using Bayes' law and assuming that S and M are conditionally independent given K,

(15)
P(K \mid S = s, M = m) \propto \frac{P(K \mid S = s)\, P(K \mid M = m)}{P(K)}

where P(K|M) and P(K|S) are the posterior probabilities of a protein existing in the sample given only its mRNA abundance M and only its primary identification score S, respectively, and P(K) is the prior probability of the protein being present. The formula can be rewritten as follows:

(16)
P(K = 1 \mid s, m) = \frac{P(K = 1 \mid s)\, P(K = 1 \mid m) / P(K = 1)}{\sum_{k \in \{0,1\}} P(K = k \mid s)\, P(K = k \mid m) / P(K = k)}
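Assuming S and M are conditionally independent given K (the modeling assumption behind Equation (15)), the fused posterior of Equation (16) reduces to a short computation. The probabilities below are hypothetical.

```python
def mspresso_posterior(p_k_s, p_k_m, prior):
    """P(K=1 | s, m) from P(K=1|s), P(K=1|m) and the prior P(K=1),
    assuming S and M are conditionally independent given K."""
    num = p_k_s * p_k_m / prior
    den = num + (1 - p_k_s) * (1 - p_k_m) / (1 - prior)
    return num / den

# Weak MS/MS evidence (0.4) boosted by strong mRNA-based evidence (0.9):
print(round(mspresso_posterior(0.4, 0.9, prior=0.5), 3))  # -> 0.857
```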

Gene model

Compared to methods using protein interaction networks and mRNA expression data, it is more appealing to employ the gene model, because gene modeling is a relatively mature field and the information is easier to obtain accurately and quickly. During transcription and mRNA processing, alternative transcriptional start sites and alternative splicing sites cause the mRNA sequences produced from the same DNA to differ, which further leads to different proteins. In other words, a DNA segment can generate multiple mRNAs and proteins (only in eukaryotes), which increases the number and variety of proteins. However, the resulting proteins are homologous and difficult to distinguish, and the degenerate peptides generated by homologous proteins pose a great challenge for protein inference.

According to the gene model–protein sequence–protein identifier (identified peptides) relationships, Grobei et al. [58] classify each peptide sequence. This allows shared peptides to be further distinguished depending on whether the implied proteins could be encoded by the same or by distinct gene models. They distinguish five peptide evidence classes for Arabidopsis pollen. The same scheme is applied in PeptideClassifier [59], which reports three peptide evidence classes for prokaryotes and, to capture potential alternative splice isoforms in eukaryotes, considers three additional evidence classes. Both methods facilitate seamless integration with transcriptomics data.

A recent application of the gene model is Markovian Inference of Proteins and Gene Models (MIPGEM) [14]. It is a statistical model that addresses the problem of protein and gene model inference from shotgun proteomics data. In particular, MIPGEM deals with dependencies among peptides and proteins using a Markovian assumption on k-partite graphs (Figure 11), which states that only the neighboring proteins matter in the conditional distribution of the peptides.

Figure 11:

Selected example of connected components of the tripartite graph for gene model imported by MIPGEM.


MIPGEM first divides the bipartite graph into connected components and assumes that different components are independent. For instance, in Figure 3, the first component has two peptides (1, 2) and two proteins (1, 2). The probability distribution of the peptide scores can then be modeled as  

(17) P({pi; i ∈ ξ} | {zj; j ∈ η}) = ∏r P({pi; i ∈ ξr} | {zj; j ∈ R(ξr)})

where the product runs over r = 1, 2, …, R, ξr is the set of peptides of the r-th connected component of the bipartite graph and R(ξr) is the set of proteins of that component. It denotes with zj = 1 or 0 whether protein j is present or absent in the sample of interest, respectively, and with pi the peptide probability or score for the presence of peptide i. Furthermore, ξ is the index set of all peptides and {j; j ∈ η} is the list of candidate proteins.

The factors in the product in Equation (17) can be rewritten, using the Markovian assumption, as  

(18) P({pi; i ∈ ξr} | {zj; j ∈ R(ξr)}) = ∏i∈ξr P(pi | {zj; j ∈ Ne(i)})

where R(ξr) = {j; j ∈ η and there exists an edge between i and j for at least one i ∈ ξr} is the range of ξr, and Ne(i) are the neighbors of peptide i, that is, the set of all proteins j having an edge to peptide i.

In addition, P(pi|{zj; j ∈ Ne(i)}) is defined as:  

(19) P(pi | {zj; j ∈ Ne(i)}) = f0(pi) if zj = 0 for all j ∈ Ne(i), and f1(pi) if zj = 1 for at least one j ∈ Ne(i)

with f1 modeled as a piecewise-constant density  
(20) f1(x) = b2 for l ≤ x ≤ m, and f1(x) = b1 for m < x ≤ u

where b1 > 0, b2 ≥ 0 are unknown parameters, l = mini(pi), m = mediani(pi), u = maxi(pi), and f0 is the corresponding density when no neighboring protein is present. The density function f1(x) must fulfill  
(21) ∫ f1(x) dx = b2(m − l) + b1(u − m) = 1

One of the parameters b1 or b2 has to be estimated. The second one can then be computed with the constraint on the integral.

The probability that a protein j is present given the peptide scores is computed as:  

(22) P(zj = 1 | {pi; i ∈ ξ}) = P(zj = 1 | {pi; i ∈ ξdj})

with, by Bayes' rule,  
(23) P(zj = 1 | {pi; i ∈ ξdj}) = Σ{z: zj = 1} P({pi; i ∈ ξdj} | z) P(z) / Σz P({pi; i ∈ ξdj} | z) P(z)

where the sums run over the configurations z of {zk; k ∈ R(ξdj)} and dj is the index of the connected component holding the protein j.
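The decomposition into connected components that underlies Equation (17) can be computed with a standard breadth-first traversal of the bipartite graph. The following sketch is illustrative (not MIPGEM's implementation) and uses a hypothetical peptide–protein edge list:

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Split a bipartite peptide-protein graph into connected components.
    `edges` is a list of (peptide, protein) pairs; peptides and proteins
    live in separate namespaces, so we prefix them to avoid collisions."""
    adj = defaultdict(set)
    for pep, prot in edges:
        adj[("pep", pep)].add(("prot", prot))
        adj[("prot", prot)].add(("pep", pep))
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        # Breadth-first search from an unvisited node collects one component
        queue, comp = deque([node]), set()
        seen.add(node)
        while queue:
            u = queue.popleft()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(comp)
    return components

# Hypothetical graph: peptides 1,2 -> proteins 1,2 form one component,
# peptide 3 -> protein 3 forms another.
edges = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3)]
print(len(connected_components(edges)))  # 2 independent components
```

Each returned component can then be scored independently, which is exactly what the factorization over r in Equation (17) exploits.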

In addition, MIPGEM scores the encoding gene models to address the problems of shared peptides and ambiguous proteins. The principle of the gene model score is the following:

P[gene model occurs] = 1−P[none of its proteins occur]
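Under the additional simplifying assumption that the protein presence probabilities are independent (MIPGEM itself models their dependencies), this principle reduces to a one-line computation:

```python
def gene_model_probability(protein_probs):
    """P[gene model occurs] = 1 - P[none of its proteins occur], treating
    the protein posteriors as independent (a simplifying assumption)."""
    p_none = 1.0
    for p in protein_probs:
        p_none *= 1.0 - p  # probability that this protein is absent
    return 1.0 - p_none

# Hypothetical gene model encoding two isoforms with posteriors 0.9 and 0.5
print(gene_model_probability([0.9, 0.5]))  # ≈ 0.95
```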

RESULT VALIDATION FOR PROTEIN IDENTIFICATION

After a protein inference model is developed, how to assess its performance remains a problem. The most straightforward strategy for assessing the performance of different protein identification methods is to generate synthetic data sets in which the ground-truth proteins are known in advance. To date, there are already some standard data sets, e.g. the standard mix database of 18 proteins [72]. However, such benchmark data sets usually contain no more than 100 proteins, providing nothing like the complexity of most real data problems. As a result, these standard mixtures inevitably provide only a very limited comparison of the performance of different methods. Thus, it is necessary to build standard data sets with many more ground-truth proteins for performance testing in future research.

An alternative strategy is to use simulation data that is reasonably close to reality and provides a fair testing ground for different methods. To date, some simulation software packages for LC-MS and LC-MS/MS experiments already exist (e.g. [73]). However, the characteristics of such synthetic data depend heavily on the underlying assumptions of the simulation model, making it difficult to provide an unbiased comparison.

Another solution is to create an expertly curated large-scale reference data set as a substitute when real benchmark data (with thousands of proteins known in advance) are not available. We call this 'the reference set approach'. However, building such a reference set is not an easy task, since it requires conducting many experiments, with human intervention, to obtain a confident protein list.

Thus, in most cases, we can only estimate the reliability of protein identifications. At the peptide level, current estimation methods can be classified into FDR approaches and p-value-based approaches.

FDR approaches

FDR [74], the expected fraction of false-positive assignments, has become a widely used measure for assessing PSMs. The FDR for PSMs can be estimated by means of decoy database search strategies, in which the acquired tandem mass spectra are searched against a target–decoy protein database. As shown in Figure 12, such a target–decoy database contains all (target) protein sequences possibly present in the sample and an equal number of decoy sequences obtained by reversing or reshuffling the target protein sequences. The FDR can be estimated as the ratio of the number of decoy matches to that of target matches. Researchers usually filter peptide identifications using a chosen FDR threshold (say, an FDR of 5%), which can be estimated from the percentage of decoy matches at every potential cut point of the search algorithm. FDR approaches are particularly appealing since they constitute a generic and independent approach to validate PSMs generated by any type of identification strategy.
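The estimation itself is straightforward to implement. The sketch below, with hypothetical scores and decoy flags, counts decoy and target matches above a score cutoff and sweeps for the loosest cutoff that meets a desired FDR:

```python
def psm_fdr(psms, threshold):
    """Estimate the PSM-level FDR at a score threshold as the ratio of
    decoy matches to target matches (target-decoy strategy).
    `psms` is a list of (score, is_decoy) pairs."""
    targets = sum(1 for s, decoy in psms if s >= threshold and not decoy)
    decoys = sum(1 for s, decoy in psms if s >= threshold and decoy)
    return decoys / targets if targets else 0.0

def threshold_at_fdr(psms, max_fdr=0.05):
    """Find the lowest score cutoff whose estimated FDR stays below max_fdr."""
    for s in sorted({score for score, _ in psms}):
        if psm_fdr(psms, s) <= max_fdr:
            return s
    return None

# Hypothetical search results: (score, is_decoy)
psms = [(10, False), (9, False), (8, True), (7, False), (6, True), (5, False)]
print(psm_fdr(psms, 5))  # 2 decoys / 4 targets = 0.5
print(threshold_at_fdr(psms, 0.05))  # 9: the loosest cutoff with FDR <= 5%
```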

Figure 12:

The target–decoy strategy. The FDR can be estimated as the ratio of the number of decoy matches to that of target matches.


Unfortunately, most available FDR methods focus on quality control at the level of PSMs. Deriving an FDR for protein identifications is more difficult than determining the FDR for PSMs. Because protein identifications are derived from the peptide assembly procedure, errors determined at the PSM level may propagate to the protein identification level; controlling quality at the level of PSMs does not ensure quality at the level of protein identifications. This issue has so far not been appropriately appreciated, since the distinction between PSMs and protein identifications is frequently ambiguous in the literature [75]. The error rate estimation of protein identifications has to account for the fact that false and true PSMs distribute differently across the protein database: while false PSMs distribute comparably over all entries in the database [76], true PSMs map exclusively to the smaller subset of proteins present in the biological sample. As a result, the protein identification FDR in practice is larger than the PSM FDR [77]. Therefore, optimizing FDRs for peptides and proteins must be considered as different problems that are best addressed by different approaches. Gupta et al. [78] addressed this issue using the single-peptide rule. Recently, Hather et al. [79] proposed a method for estimating the FDR and local FDRs (LFDRs) for both peptide and protein identifications based on randomized database matches and isotonic regression.

However, none of these approaches reliably quantifies the confidence of protein identifications in very large, integrated data sets.

To fill this void, a new approach called MAYU [75] was developed to quantify the uncertainty of protein identifications in the context of large-scale data sets. MAYU extends the well-established target–decoy strategy from the PSM level to the protein level (Figure 13). It assembles a user-defined set of PSMs, e.g. the set of PSMs at FDR = 0.01, into a list of protein identifications and estimates the expected number of false-positive identifications. The estimation process is described as follows.

Figure 13:

The process of MAYU. MAYU extends the well-established target–decoy strategy from PSM level to protein level. It assembles a user-defined set of PSMs to a list of protein identifications and estimates the expected number of false-positive identifications. Then protein identification FDR is computed as the ratio of expected number of false-positive protein identifications to the total amount of protein identifications mapping to the target database.


MAYU refers to the set of all protein identifications as H, the subset mapping to the target database Pt as Ht and its complement as Hd. MAYU distinguishes three types of protein identifications: (i) true-positive (TP) identifications, denoted by Htp. A protein identification is considered TP if it contains at least one TP PSM; similarly, a protein identification is considered a false positive (FP) if all its PSMs are FPs. (ii) The set Hfp of FP protein identifications mapping to Pt; the complementary set, with its identifications projecting to the decoy database Pd, equals Hd. (iii) The set Hcf, composed of all protein identifications in Pt containing at least one FP PSM. Note that elements of Hcf can be TP as well as FP. Then, E[hfp|ht, hcf, θexp] is the expected number of FP protein identifications given the proteomics experiment characterized by parameters θexp and its outcome ht, hcf. θexp particularly includes parameters related to the target protein database, such as the number of protein entries N. By application of Bayes' formula, and by assuming P(htp|hcf, θexp) and P(ht|hcf, θexp) to be uniform and hd = hcf, E[hfp|ht, hd, θexp] evaluates as  

(24) E[hfp | ht, hd, θexp] = hd · (N − htp)/N

where htp = ht − hfp.

Then protein identification FDR is computed as the ratio of expected number of false-positive protein identifications to the total amount of protein identifications mapping to the target database (ht).

In Ref. [75], MAYU is compared with two other FDR estimation approaches: ProteinProphet and a naïve target–decoy strategy. The authors find that estimates derived with ProteinProphet are too 'optimistic', while estimates from the naïve target–decoy approach are too 'pessimistic'. The naïve target–decoy strategy estimates the protein identification FDR analogously to the PSM FDR, i.e. by approximating the expected number of false-positive (FP) protein identifications by the number of decoy protein identifications.
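The naïve strategy amounts to a single ratio at the protein level; the sketch below uses a hypothetical inferred protein list:

```python
def naive_protein_fdr(protein_ids):
    """Naive target-decoy protein FDR: the number of decoy protein
    identifications approximates the number of false-positive target
    identifications. `protein_ids` is a list of (accession, is_decoy)."""
    decoys = sum(1 for _, is_decoy in protein_ids if is_decoy)
    targets = sum(1 for _, is_decoy in protein_ids if not is_decoy)
    return decoys / targets if targets else 0.0

# Hypothetical inferred protein list: three target hits, one decoy hit
proteins = [("P1", False), ("P2", False), ("P3", False), ("rev_P9", True)]
print(naive_protein_fdr(proteins))  # 1 decoy / 3 targets ≈ 0.33
```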

P-value-based approaches

In contrast to FDR approaches, which apply a single threshold to all spectra from the data set, p-value-based approaches report p-values specific to individual peptide-spectrum matches. These methods are primarily based on fitting the distribution of scores with a specific parametric model and use information such as spectrum properties and peptide length to increase accuracy. OMSSA [35] approximates the scores of each peptide-spectrum match with a Poisson distribution. Klammer et al. [80] fit the distribution of SEQUEST Xcorr scores with a Weibull distribution. However, approaches based on fitting specific parametric models normally cannot be generalized to other platforms, and they assess p-values at the peptide level rather than at the protein level. To address this issue, Spirin et al. [81] developed a method independent of the instrumentation or software scoring method. It assigns a spectrum-specific p-value to each peptide-spectrum match and combines these p-values to assess the statistical significance of protein identifications.

First, the probability P that the database search with spectrum i will result in a false identification with score s is given by:  

(25) P = 1 − F(s)^αi, with F(s) = exp(−e^−(s − μi)/βi)

where F(s) is the cumulative extreme value distribution function. The parameter αi depends on the relative database size. The parameters μi and βi are specific to the spectrum and depend only weakly on the relative database size in the asymptotic regime.

Then the p-values of individual peptides matching the same protein are combined into a cumulative protein p-value using Stouffer's method [82]. For the peptide match with the k-th lowest p-value among all peptides from the same protein, the method converts each p-value Pi into a standard normal deviate Zi using the Z-test. The sum of these deviates, divided by the square root of the number of tests k,  

(26) Z = (Z1 + Z2 + … + Zk)/√k

can be shown to follow the standard normal distribution. The minimal value of this probability over all k peptides is the p-value for the protein identification.

The protein p-value strategy outlined by Equation (26) is based on the assumption that the p-values resulting from different peptide identifications are independent. Taking peptide correlations into account when evaluating the protein p-value remains an important open problem.

While the proteomics community often takes great care in evaluating peptide-level error rates, protein-level result validation is often ignored. It is very important to report the protein-level error rate along with the protein identification results. The lack of this checkpoint raises concerns about the validity of studies, such as biomarker discovery, where the number of identified proteins as well as the reliability of each individual identification is of critical importance.

EXPERIMENT

We have summarized existing peptide assembly methods using two systematic and expandable frameworks. One may then ask: Do we still need to develop new peptide assembly algorithms? To answer this question, we have to conduct a comprehensive evaluation and comparison to assess the strengths and weaknesses of currently available algorithms. More precisely, we examine the overlap among the results of five typical methods to show that the set of proteins identified from the same benchmark data set can vary significantly. This indicates that the performance of current peptide assembly algorithms is still far from satisfactory. Meanwhile, we use two different evaluation methods in the performance comparison: one protein-level FDR estimation method (MAYU) and the reference set approach. In addition, we also describe how protein inference is performed from start to finish using the Trans-Proteomic Pipeline (TPP) software [83] as an example. It shows that the final identification performance can change dramatically if one uses different parameter combinations in the analysis pipeline.

In fact, there are already some research efforts that compare the performance of different peptide assembly algorithms. For instance, Xue et al. [84] evaluated the performance of four algorithms: ProteinProphet [12], PROT_PROBE [51], a Poisson model [85] and two-peptide hits. They also validated that the probabilities of identified proteins were strongly correlated with several experimental factors, including the number of spectra, database size and protein abundance distribution. However, they only evaluated peptide assembly algorithms from the bipartite graph model, leaving the supplementary information model uncovered. Here, we compare five typical methods from both the bipartite graph model and the supplementary information model: IDPicker (Parsimonious Model) [45], ProteinProphet (Non-parametric Model) [12], two-peptide hits (Optimistic Model), MSNet (Protein Interaction Network) [54] and PeptideClassifier (Gene Model) [59]. All five methods are search engine-independent. We do not choose DTASelect as the representative of the optimistic model since it can only handle peptide identification results generated by SEQUEST. Instead, we use two-peptide hits in the performance evaluation, since it returns all potential proteins matching at least two peptides without any further filtering.

Data sets and evaluation methods

We use two data sets that have been used in MSNet [54] in our experiments. The raw data are available at http://www.marcottelab.org/users/MSdata/Data_02/ and http://www.marcottelab.org/MSdata/Data_04/, respectively. Since both data sets are generated from samples that consist of yeast proteins, we use Yeast_D1 and Yeast_D2 to denote them in the remaining parts of this section.

The motivation behind choosing these two data sets is that both of them have a reference set, enabling us to conduct the performance comparison with multiple evaluation criteria. An identified protein is labeled as a true positive if it is present in the corresponding reference set. In Yeast_D1, the protein reference set is prepared from four MS-based proteomics data sets and three non-MS-based data sets. We use a list of 4265 proteins observed in either two or more MS data sets or any of the non-MS data sets as the reference set, which can be obtained from http://www.marcottelab.org/MSdata/gold_yeast.html. In Yeast_D2, a list of 593 proteins composed of known ribosomal, translation and ribosome biogenesis proteins is used as the reference set.

In the experiment, we use both MAYU [75] and the number of identified proteins from the reference set to compare the different methods. MAYU depends strongly on the number of decoy peptides in its FDR estimation procedure. Decoy peptides/proteins are not truly present entities, so they have no corresponding biological counterparts. Thus, some existing methods that utilize supplementary information can eliminate almost all decoy entries in the peptide assembly stage. As a result, target–decoy based FDR estimation methods like MAYU may over-estimate the performance of MSNet and PeptideClassifier in the comparison. Based on this observation, we only use MAYU to evaluate the results from IDPicker, ProteinProphet and two-peptide hits.

Database search and post-processing

We use a sequence file downloaded from http://aug.csres.utexas.edu/msnet/ as the target database, which contains 7248 proteins and 33 contaminants. All tandem mass spectra are searched against the protein sequence database and its randomized version (forward and reverse) with the X!Tandem search engine (v2009.10.01.1). We use default search parameters wherever possible, assuming that the parameters have already been optimized. Some important parameter specifications are listed below:

  • Fragment monoisotopic mass error: 0.4 Da

  • Parent monoisotopic mass error: 100 ppm

  • Minimum peaks: 15

  • Minimum fragment m/z: 150.

We use a tool included in the TPP (v4.4) software to convert the X!Tandem result files to pepXML files. If necessary, PeptideProphet is used as a second step to analyze the pepXML files. Peptides with a PeptideProphet probability >0.05 are considered candidate peptides, and proteins containing at least one candidate peptide are considered candidate proteins. The probability threshold is set to 0.8 and 0.6 in ProteinProphet and MSNet, respectively. The input of each peptide assembly method is listed as follows:

  • IDPicker: the pepXML files

  • ProteinProphet: the output of PeptideProphet

  • Two-peptide hits: the output of PeptideProphet

  • MSNet: the proteins and their probabilities reported by ProteinProphet

  • PeptideClassifier: the output of PeptideProphet.

Comparison of peptide assembly algorithms

The protein identification results on two data sets are summarized in Table 2. The number of reported proteins by each algorithm is indicated in bold text. The figures in brackets are the number of correctly identified proteins in the reference set.

Table 2:

The performance of five methods

Data IDPicker ProteinProphet MSNet PeptideClassifier Two-peptide hits 
Yeast_D1 1293 (1152) 1530 (1254) 1376 (1319) 1008 (977) 1112 (978) 
Yeast_D2 154 (116) 322 (162) 291 (166) 232 (145) 317 (133) 

Two protein samples undergo MS/MS analysis to generate ten lists of proteins identified by five peptide assembly algorithms. The figures in brackets indicate the number of proteins present in the reference set.

First of all, the results in Table 2 show that identification performance can vary significantly from one method to another with respect to the number of (correctly) identified proteins. To further investigate this variation, we use Venn diagrams to describe the overlap among the five methods in terms of the number of identified proteins in Figures 14 and 15. For Yeast_D1 (Figure 14), out of a possible 1888 proteins from the five algorithms (union), 766 proteins are identified by all five algorithms (intersection), while 489 proteins are identified by a single algorithm. For Yeast_D2 (Figure 15), 117 proteins are identified based on a consensus of all five algorithms (intersection), while 489 proteins are identified by one or more algorithms (union). Clearly, this poor concordance indicates that more research effort is still needed to develop effective peptide assembly algorithms. Compared to the methods from the bipartite graph model, MSNet and PeptideClassifier consistently exhibit good performance. This indicates that the use of supplementary information can boost identification performance.
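The union, consensus and single-method counts behind such Venn diagrams are plain set operations; the sketch below mimics the computation with made-up protein lists for three of the methods:

```python
def overlap_summary(results):
    """Summarize agreement among several protein lists: the union, the
    consensus (intersection) and the proteins unique to one method.
    `results` maps method name -> set of identified protein accessions."""
    sets = list(results.values())
    union = set().union(*sets)
    consensus = set.intersection(*sets)
    unique = {p for p in union if sum(p in s for s in sets) == 1}
    return len(union), len(consensus), len(unique)

# Toy protein lists standing in for the algorithms' outputs
results = {
    "IDPicker": {"P1", "P2", "P3"},
    "ProteinProphet": {"P1", "P2", "P4"},
    "TwoPeptide": {"P1", "P2"},
}
print(overlap_summary(results))  # (4, 2, 2): union, consensus, single-method
```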

Figure 14:

Five-way Venn diagram showing the overlap among five peptide assembly algorithms for Yeast_D1. The number of identified proteins by one or more algorithms is indicated, e.g. 766 proteins are identified based on a consensus of all five algorithms (intersection), whereas 1888 proteins are identified by one or more algorithms (union).


Figure 15:

Five-way Venn diagram showing the overlap among the five peptide assembly algorithms for Yeast_D2.


Another interesting observation is that IDPicker (Parsimonious Model) reports more proteins than other methods such as two-peptide hits (Optimistic Model) for Yeast_D1. This is because there are a large number of one-hit wonders (proteins that have only one identified peptide) in Yeast_D1. For instance, there are 724 one-hit proteins in the output of PeptideProphet for Yeast_D1, ∼40% of the total number of candidate proteins. In addition, IDPicker uses the output from X!Tandem rather than the output of PeptideProphet as its input, which means the input of IDPicker contains more identified peptides than the input used for two-peptide hits.

We also use the FDR values estimated with MAYU to compare the three methods of the bipartite graph model: IDPicker, ProteinProphet and two-peptide hits. Table 3 indicates that for Yeast_D2 IDPicker achieves the lowest FDR while two-peptide hits gives the highest FDR, whereas for Yeast_D1 the opposite is true. We can derive similar results for Yeast_D2 from Table 2. This is because Yeast_D2 has more proteins identified by more than two peptides; reporting every protein that has more than two identified peptides without further filtering greatly increases the number of potential false positives. On the other hand, for Yeast_D1, Tables 2 and 3 show different results. This indicates that current performance evaluation methods are not perfect and need further improvement. Meanwhile, it also shows that statistical methods perform most consistently with respect to FDR.

Table 3:

The FDR values estimated with MAYU

Data IDPicker (%) Protein Prophet (%) Two-peptide hits (%) 
Yeast_D1 6.1 5.4 3.2 
Yeast_D2 3.3 5.75 

Protein inference analysis pipeline

In this review, we have described the three steps of the protein inference problem. Here, we conduct some experiments to give the reader a sense of how protein inference is performed step by step. More importantly, they show that the use of different parameters in each step may change the performance significantly; such uncertainty will propagate to subsequent analysis steps. The experiments are conducted with the TPP software (v4.4), a collection of integrated tools for MS/MS proteomics.

We use PeptideProphet as the tool for post-processing peptide identification results from X!Tandem, and ProteinProphet as the peptide assembly method. The overall numbers of identified peptide hits and protein identifications from Yeast_D1 under some typical PeptideProphet parameter combinations are listed in Table 4. Since there are more than 10 input parameters for PeptideProphet, we only choose four typical parameters in the experiment. To test the influence of PeptideProphet on the final results, we fix the parameters of ProteinProphet at their default values. The results in Table 4 show that the use of different parameter settings in PeptideProphet has a great impact on the identification results. We also adjust two parameters for ProteinProphet when the output of PeptideProphet is obtained from the 5th experiment in Table 4. As shown in Table 5, the parameter adjustment for ProteinProphet can also affect the number of identified proteins.

Table 4:

Performance of protein inference under different parameter combinations of PeptideProphet

Parameter                                                Exp. 1  Exp. 2  Exp. 3  Exp. 4  Exp. 5  Exp. 6
Only use Expect Score as the discriminant                Yes     Yes     Yes     No      No      Yes
Use gamma distribution to model the negatives            Yes     Yes     Yes     No      No      No
Use decoy hits to pin down the negative distribution     No      Yes     No      Yes     No      Yes
Force the fitting of the mixture model                   No      No      Yes     No      No      No
The number of identified peptides                        4641    4684    4663    7306    7589    6956
The number of identified proteins                        529     551     529     947     1530    849
Table 5:

Performance of protein inference under different parameter combinations of ProteinProphet

Parameter                               Exp. 1  Exp. 2  Exp. 3
Normalize NSP for Protein Length        Yes     Yes     No
Use Expected Number of Ion Instances    No      Yes     Yes
The number of identified proteins       1530    1419    1419

FUTURE PERSPECTIVE

In shotgun proteomics, one important problem is to identify all proteins present in the sample accurately. Unfortunately, this protein inference problem is only partially solved, since three technical challenges remain: the identification coverage problem, the identification ambiguity problem and the identification validation problem.

  • Identification coverage problem: the protein identification task is subdivided into two stages: first identifying a collection of peptides with a low FDR, and then inferring the protein set from the resulting collection of peptides. However, due to the complexity of MS data and the limitations of current peptide identification algorithms, up to 80–90% of MS/MS spectra in a typical liquid chromatography tandem mass spectrometry experiment cannot be interpreted with high confidence. As a result, only a subset of the peptides present in the sample can be detected, and proteins with no identified constituent peptides will be missed in protein inference.

  • Identification ambiguity problem: with the advance of MS instruments and peptide identification algorithms, it is reasonable to expect that in the near future we will be capable of identifying at least one constituent peptide for each protein present in the sample. Even if we achieve this objective, we face another difficult challenge in protein inference: the identification ambiguity problem. As discussed in previous sections, degenerate peptides and one-hit wonders are the main sources of this ambiguity. It is generally very difficult to determine which proteins are truly present in the sample if they share the same set of peptides or have only one constituent peptide detected.

  • Identification validation problem: assessing the correctness of reported proteins is critical to the success of proteomics applications. For instance, only truly present proteins that are differentially expressed between case and control are of practical interest in biomarker discovery. If we are unable to accurately judge the correctness of reported proteins, random artifacts with different expression levels may be returned to the user. Though researchers have realized the importance of identification validation, there is still no consensus on the best evaluation measure.

As summarized in this survey, researchers have proposed many solutions from different angles to tackle the above technical challenges. However, both our own performance comparison and similar studies from other groups have shown that the identification results of different methods can disagree with each other significantly. This indicates that the performance of currently available protein inference methods is still far from satisfactory in practice. Therefore, more research effort is needed in this direction. Here, we discuss some future perspectives along with on-going specific efforts and present our view on potentially promising areas and the development of future tools.

First of all, researchers have proposed many strategies to increase the protein identification coverage. We can divide on-going efforts from the peptide identification perspective into three classes.

  • Developing more accurate peptide identification algorithms. A set of confident and complete peptide identifications is of primary importance and fundamental to any protein inference method. New peptide identification algorithms continue to be developed every year, since current software generally fails to interpret most tandem MS spectra generated in a single experiment.

  • Combining results from different peptide identification algorithms. Each peptide identification method uses a different algorithm that proceeds from a different view of how the spectrum should be interpreted. Thus, the peptide identification results for one spectrum can differ significantly across algorithms. Generally, it is feasible to identify more confident peptides if we find complementary identification algorithms and combine their results in an effective way [86].

  • Combining results from multiple replicates. The protein and peptide mixtures are typically still too complex to allow the mass spectrometer to acquire tandem mass spectra for all peptides in a single LC-MS/MS experiment. Consequently, LC-MS/MS experiments are usually repeated extensively, in order to increase the number of peptides for which tandem mass spectra are acquired [87]. Hence, we may increase the peptide identification coverage by merging results from multiple replicate experiments.

The above strategies try to identify as many peptides as possible in order to increase the protein identification coverage. However, we should be aware of the distinction between peptide identification coverage and protein identification coverage. Peptide identification is only one intermediate step of protein identification. Even though some strategies effectively improve peptide identification coverage, we still need to check carefully whether they really help to increase the protein identification coverage. That is, maximizing the number of peptide identifications does not necessarily maximize the number of protein identifications. For example, when the newly identified peptides hit only a few proteins that are already known, they improve the identification confidence but cannot increase the identification coverage. Nevertheless, these methods can potentially improve the protein identification coverage by increasing the peptide identification coverage.
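
A toy example makes this distinction concrete. The peptides and proteins below are invented for illustration; the point is that a newly identified peptide mapping only to proteins already in the list leaves protein coverage unchanged:

```python
# Toy peptide-to-protein map (hypothetical identifiers) showing that
# adding an identified peptide does not always add an identified protein.

PEPTIDE_TO_PROTEINS = {
    "PEPTIDEA": {"P1"},
    "PEPTIDEB": {"P1", "P2"},   # degenerate peptide shared by P1 and P2
    "PEPTIDEC": {"P2"},
}

def protein_coverage(identified_peptides):
    """Set of proteins hit by at least one identified peptide."""
    proteins = set()
    for pep in identified_peptides:
        proteins |= PEPTIDE_TO_PROTEINS[pep]
    return proteins

before = protein_coverage({"PEPTIDEA", "PEPTIDEC"})
after = protein_coverage({"PEPTIDEA", "PEPTIDEC", "PEPTIDEB"})
print(len(before), len(after))
# peptide coverage rose from 2 to 3, protein coverage stayed at 2
```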

Recently, the supplementary information model, in which extra information from other data sources is used to facilitate the identification of more proteins, has begun to receive much more attention. For instance, the use of single-stage MS data makes it possible to find some extra proteins whose peptide digestion products are not covered by MS/MS data [66]. We believe the supplementary information model is the most promising strategy for improving the identification coverage of protein inference. Though it has been demonstrated that very encouraging results can be obtained by incorporating different kinds of supplementary information, the following questions remain open and need further investigation in future research.

  • To date, researchers have investigated the feasibility of using raw MS/MS data, single-stage MS data, peptide expression profiles, mRNA expression, protein interaction networks and gene models. A natural question is: are there other kinds of supplementary information that can be utilized in protein inference? If the answer is yes, we need to further study how to incorporate them into the inference process.

  • Different supplementary information models have different characteristics. Therefore, one may be interested in the following questions: Which supplementary information is more reliable and easy to use? Which supplementary information is more effective in improving the protein identification performance?

  • It is also possible to further extend the identification coverage if we can use all kinds of supplementary information simultaneously. Then, the question becomes: Can we provide an integrated inference model that incorporates all extra information seamlessly? Furthermore, we need to build a unified database and provide corresponding software for storing and manipulating different kinds of supplementary information to facilitate protein inference.
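
As one sketch of what such an integrated model might look like (this is our own illustration, not a published method), supplementary sources could be combined with the peptide-level evidence as a weighted average, so that new information sources plug in uniformly; the source names and weights below are purely hypothetical:

```python
# A hedged sketch of an integrated inference score: the MS/MS-based
# peptide evidence (weight 1) is averaged with any number of
# supplementary scores, each with its own weight. How to choose the
# weights is exactly one of the open questions discussed above.

def integrated_score(peptide_evidence, supplementary, weights):
    """peptide_evidence: base score in [0, 1] from assembled peptides.
    supplementary: {source_name: score in [0, 1]} for one protein.
    weights: {source_name: weight}; unknown sources get weight 0.
    """
    total = peptide_evidence
    norm = 1.0
    for source, score in supplementary.items():
        w = weights.get(source, 0.0)
        total += w * score
        norm += w
    return total / norm

score = integrated_score(
    peptide_evidence=0.6,
    supplementary={"mrna_expression": 0.9, "network_support": 0.8},
    weights={"mrna_expression": 0.5, "network_support": 0.25},
)
print(round(score, 3))  # weighted average of evidence and supplements
```

How to learn the weights from data, and how to handle proteins for which some sources are missing, are further instances of the open questions listed above.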

To handle the identification ambiguity problem, researchers have developed various methods according to different principles. On the one hand, we can borrow supplementary information such as gene models, protein interaction networks and peptide detectability to distinguish truly present proteins from false positives even if they share a common set of peptides. On the other hand, we may use sophisticated statistical models built on the peptide–protein relationship network to perform inference. In our opinion, the identification ambiguity problem is even harder than the identification coverage problem since some protein sequences are almost identical. To solve the protein identification ambiguity problem, some potentially promising strategies are listed below.

  • Different protein inference approaches use different algorithms to solve the identification ambiguity problem from different viewpoints. As a result, the protein inference results from the same set of identified peptides may vary significantly. Therefore, a natural idea is to combine the identification results from multiple inference algorithms into a consensus protein list.
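
A minimal consensus scheme along these lines (our illustration, not an existing tool) is simple majority voting over the protein lists reported by the individual algorithms; the protein identifiers below are hypothetical:

```python
# Consensus protein list by voting: keep a protein if it is reported
# by at least min_votes of the inference algorithms run on the same
# set of identified peptides.

from collections import Counter

def consensus_proteins(result_lists, min_votes=2):
    """result_lists: one list of reported proteins per algorithm."""
    votes = Counter(p for result in result_lists for p in set(result))
    return {protein for protein, n in votes.items() if n >= min_votes}

algo_a = ["P1", "P2", "P3"]
algo_b = ["P2", "P3", "P4"]
algo_c = ["P3", "P5"]
print(sorted(consensus_proteins([algo_a, algo_b, algo_c])))  # ['P2', 'P3']
```

More sophisticated combination schemes could weight each algorithm by its estimated reliability instead of counting equal votes.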

Researchers have developed various validation measures to evaluate the correctness of identified proteins. Most of them are based on empirical, database-dependent estimates of error rates using the target–decoy search strategy (e.g. MAYU). However, the reliance on a decoy database has certain drawbacks, as we have discussed in the performance comparison section. Some analytically derived, database-independent error rate estimation methods are also available. Unfortunately, there is still no consensus on the best measure so far, which urges the development of new validation methods in future research.
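
For concreteness, the basic target–decoy estimate works as follows: after searching a concatenated target–decoy database, the number of decoy hits above a score threshold approximates the number of false target hits, giving FDR ≈ #decoys/#targets at that threshold. A minimal sketch (the scores below are hypothetical):

```python
# Basic target-decoy FDR estimate: decoy hits above the threshold
# stand in for the unknown number of false target hits.

def target_decoy_fdr(identifications, threshold):
    """identifications: list of (score, is_decoy) pairs."""
    targets = sum(1 for s, d in identifications if s >= threshold and not d)
    decoys = sum(1 for s, d in identifications if s >= threshold and d)
    return decoys / targets if targets else 0.0

ids = [(3.1, False), (2.9, False), (2.8, True), (2.5, False), (1.9, True)]
print(target_decoy_fdr(ids, threshold=2.4))  # 1 decoy / 3 targets ≈ 0.333
```

Variants differ in the details; for example, some estimators for concatenated searches use 2 × #decoys/(#targets + #decoys) instead.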

Overall, the issue of validating protein identification results is only partially solved. We have the following suggestions regarding the standard validation protocol:

  • Using different data. To obtain an unbiased and comprehensive performance evaluation, it is necessary to include test data with different characteristics: standard synthetic mixtures of known proteins, simulated data and practical real data.

  • Using multiple validation measures if possible. Inevitably, each validation method has some bias. Using multiple evaluation measures in the test may provide more convincing comparison results.

CONCLUSIONS

This review categorizes protein inference methods using two general frameworks. Protein inference is very important from both theoretical and practical perspectives, and more research effort should be devoted to this challenging topic. We hope this article can provide a starting point for those who are interested in this area. In addition to the references cited, readers may find the review articles [88–90] and thesis [91] very useful.

Key points

  • Protein inference is an important step in proteomics research. More attention should be paid to this problem.

  • There are at least two sources of difficulty in solving the protein inference problem: degenerate peptides and one-hit wonders.

  • We divide the analysis procedure of protein inference into different stages: peptide identification post-processing, peptide assembly and result validation.

  • For existing peptide assembly methods, we propose two hierarchical classification frameworks. According to the dependence on the peptide search engine, peptide assembly methods are categorized into two classes: search engine-dependent approaches and search engine-independent approaches. According to the underlying algorithmic techniques, peptide assembly methods are categorized into two classes: bipartite graph model and supplementary information model.

FUNDING

This work was partially supported by the Natural Science Foundation of China under Grant No. 61003176, the Fundamental Research Funds for the Central Universities of China (DUT10JR05 and DUT10ZD110), and a research competition award (RPC10EG04) from the Hong Kong University of Science and Technology.

Acknowledgements

The comments and suggestions from the anonymous reviewers greatly improved the article. We thank Dr Smriti R. Ramakrishnan for providing us the protein reference set used in the experiments.

References

1. Bern M, Goldberg D. Improved ranking functions for protein and modification-site identifications. J Comput Biol 2008;15:705–19.
2. Fenyo D. Identifying the proteome: software tools. Curr Opin Biotechnol 2000;11:391–5.
3. Chakravarti DN, Chakravarti B, Moutsatsos I. Informatic tools for proteome profiling. Biotechniques 2002;32(Suppl):4–15.
4. Brunner E, Ahrens CH. A high-quality catalog of the Drosophila melanogaster proteome. Nat Biotechnol 2007;25(5):576–83.
5. Eriksson J, Fenyo D. Improving the success rate of proteome analysis by modeling protein-abundance distributions and experimental designs. Nat Biotechnol 2007;25(6):651–5.
6. Mallick P, Schirle M, Chen SS, et al. Computational prediction of proteotypic peptides for quantitative proteomics. Nat Biotechnol 2007;25(1):125–31.
7. Eng JK, McCormack AL, Yates JR. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 1994;5(11):976–89.
8. Perkins DN, Pappin DJ, Creasy DM, et al. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999;20(18):3551–67.
9. David CM, Cottrell JS. Unimod: protein modifications for mass spectrometry. Proteomics 2004;4(6):1534–6.
10. Rappsilber J, Mann M. What does it mean to identify a protein in proteomics? Trends Biochem Sci 2002;27(2):74–8.
11. Nesvizhskii AI, Vitek O, Aebersold R. Interpretation of shotgun proteomic data: the protein inference problem. Mol Cell Proteomics 2005;4(10):1419–40.
12. Nesvizhskii AI, Keller A, Kolker E, et al. A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem 2003;75(17):4646–58.
13. Shen C, Wang Z, Shankar G, et al. A hierarchical statistical model to assess the confidence of peptides and proteins inferred from tandem mass spectrometry. Bioinformatics 2008;24(2):202–8.
14. Gerster S, Qeli E, Ahrens CH, et al. Protein and gene model inference based on statistical modeling in k-partite graphs. Proc Natl Acad Sci USA 2010;107(27):12101–6.
15. Tabb DL, McDonald WH, Yates JR. DTASelect and Contrast: tools for assembling and comparing protein identifications from shotgun proteomics. J Proteome Res 2002;1(1):21–6.
16. Cormen TH, Leiserson CE, Rivest RL, et al. Introduction to Algorithms. 3rd edn. Cambridge, MA: MIT Press, 2009.
17. Hochbaum DS. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. In: Approximation Algorithms for NP-Hard Problems, 1997:94–143.
18. Yang X, Dondeti V, Dezube R, et al. DBParser: web-based software for shotgun proteomic data analyses. J Proteome Res 2004;3(5):1002–8.
19. Zhang B, Chambers MC, Tabb DL. Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. J Proteome Res 2007;6(9):3549–57.
20. He Z, Yang C, Yu W. A partial set covering model for protein mixture identification using mass spectrometry data. IEEE/ACM Trans Comput Biol Bioinform 2011;8(2):368–80.
21. Bergeron JJ, Hallett M. Peptides you can count on. Nat Biotechnol 2007;25(1):61–2.
22. Fusaro VA, Mani DR, Mesirov JP, et al. Prediction of high-responding peptides for targeted protein assays by mass spectrometry. Nat Biotechnol 2009;27(2):190–8.
23. Tang H, Arnold RJ, Alves P, et al. A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics 2006;22(14):481–8.
24. Alves P, Arnold RJ, Novotny MV, et al. Advancement in protein inference from shotgun proteomics using peptide detectability. Pac Symp Biocomput 2007:409–20.
25. Feng J, Naiman DQ, Cooper B. Probability model for assessing proteins assembled from peptide sequences inferred from tandem mass spectrometry data. Anal Chem 2007;79(10):3901–11.
26. Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 2007;4(3):207–14.
27. Keller A, Nesvizhskii AI, Kolker E, et al. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem 2002;74(20):5383–92.
28. Zhang J, Li J, Liu X, et al. A nonparametric model for quality control of database search results in shotgun proteomics. BMC Bioinformatics 2008;9:29.
29. Zhang J, Ma J, Dou L, et al. Bayesian nonparametric model for the validation of peptide identification in shotgun proteomics. Mol Cell Proteomics 2009;8(3):547–57.
30. Martínez-Bartolomé S, Navarro P, Martín-Maroto F, et al. Properties of average score distributions of SEQUEST: the probability ratio method. Mol Cell Proteomics 2008;7(6):1135–45.
31. Choi H, Ghosh D, Nesvizhskii AI. Statistical validation of peptide identifications in large-scale proteomics using the target-decoy database search strategy and flexible mixture modeling. J Proteome Res 2008;7(1):286–92.
32. Choi H, Nesvizhskii AI. Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. J Proteome Res 2008;7(1):254–65.
33. Helsens K, Colaert N, Barsnes H, et al. ms_lims, a simple yet powerful open source laboratory information management system for MS-driven proteomics. Proteomics 2010;10(6):1261–4.
34. Mueller M, Martens L, Reidegeld KA, et al. Functional annotation of proteins identified in human brain during the HUPO Brain Proteome Project pilot study. Proteomics 2006;6(18):5059–75.
35. Geer LY, Markey SP, Kowalak JA, et al. Open mass spectrometry search algorithm. J Proteome Res 2004;3(5):958–64.
36. Ma B, Zhang K, Hendrie C, et al. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom 2003;17(20):2337–42.
37. Frank A, Pevzner P. PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal Chem 2005;77(4):964–73.
38. Tabb DL, Saraf A, Yates JR. GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. Anal Chem 2003;75(23):6415–21.
39. Tanner S, Shu H, Frank A, et al. InsPecT: fast and accurate identification of post-translationally modified peptides from tandem mass spectra. Anal Chem 2005;77(14):4626–39.
40. Pappin DJ, Hojrup P, Bleasby AJ. Rapid identification of proteins by peptide-mass fingerprinting. Curr Biol 1993;3(6):327–32.
41. McHugh L, Arthur JW. Computational methods for protein identification from mass spectrometry data. PLoS Comput Biol 2008;4(2):e12.
42. Moore RE, Young MK, Lee TD. Qscore: an algorithm for evaluating SEQUEST database search results. J Am Soc Mass Spectrom 2002;13(4):378–86.
43. Weatherly DB, Atwood JA, Minning TA, et al. A heuristic method for assigning a false-discovery rate for protein identifications from Mascot database search results. Mol Cell Proteomics 2005;4(6):762–72.
44. Peng J, Elias JE, Thoreen CC, et al. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome Res 2003;2(1):43–50.
45. Ma Z-Q, Dasari S, Chambers MC, et al. IDPicker 2.0: improved protein assembly with high discrimination peptide identification filtering. J Proteome Res 2009;8(8):3872–81.
46. Price TS, Lucitt MB, Wu W, et al. EBP: protein identification using multiple tandem mass spectrometry datasets. Mol Cell Proteomics 2007;6(3):527–36.
47. Dowe DL, Gardner S, Oppy G. Bayes not bust! Why simplicity is no problem for Bayesians. Brit J Phil Sci 2007;58(4):709–54.
48. Slotta DJ, McFarland MA, Markey SP. MassSieve: panning MS/MS peptide data for proteins. Proteomics 2010;10(16):3035–9.
49. Li YF, Arnold RJ, Li Y, et al. A Bayesian approach to protein inference problem in shotgun proteomics. J Comput Biol 2009;16(8):1–11.
50. Serang O, MacCoss MJ, Noble WS. Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data. J Proteome Res 2010;9(10):5346–57.
51. Sadygov RG, Liu H, Yates JR. Statistical models for protein validation using tandem mass spectral data and protein amino acid sequence databases. Anal Chem 2004;76(6):1664–71.
52. Kearney P, Butler H, Eng K, et al. Protein identification and peptide expression resolver: harmonizing protein identification with protein expression data. J Proteome Res 2008;7(1):234–44.
53. Li J, Zimmerman LJ, Park B-H, et al. Network-assisted protein identification and data interpretation in shotgun proteomics. Mol Syst Biol 2009;5:303.
54. Ramakrishnan SR, Vogel C, Kwon T, et al. Mining gene functional networks to improve mass-spectrometry-based protein identification. Bioinformatics 2009;25(22):2955–61.
55. Ramakrishnan SR, Vogel C, Prince JT, et al. Integrating shotgun proteomics and mRNA expression data to improve protein identification. Bioinformatics 2009;25(11):1397–403.
56. Li Q, MacCoss M, Stephens M. A nested mixture model for protein identification using mass spectrometry. Ann Appl Stat 2010;4(2):962–87.
57. Spivak M, Weston J, MacCoss MJ, et al. Direct maximization of protein identifications from tandem mass spectra. Mol Cell Proteomics 2012;11(2):M111.012161.
58. Grobei MA, Qeli E, Brunner E, et al. Deterministic protein inference for shotgun proteomics data provides new insights into Arabidopsis pollen development and function. Genome Res 2009;19(10):1786–800.
59. Qeli E, Ahrens CH. PeptideClassifier for protein inference and targeted quantitative proteomics. Nat Biotechnol 2010;28(7):647–50.
60. Lu B, Motoyama A, Ruse C, et al. Improving protein identification sensitivity by combining MS and MS/MS information for shotgun proteomics using LTQ-Orbitrap high mass accuracy data. Anal Chem 2008;80(6):2018–25.
61. Searle BC. Scaffold: a bioinformatic tool for validating MS/MS-based proteomic studies. Proteomics 2010;10(6):1265–9.
62. Magnin J, Masselot A, Menzel C, et al. OLAV-PMF: a novel scoring scheme for high-throughput peptide mass fingerprinting. J Proteome Res 2004;3(1):55–60.
63. Zhang W, Chait BT. ProFound: an expert system for protein identification using mass spectrometric peptide mapping information. Anal Chem 2000;72(11):2482–9.
64. Colinge J, Bennett KL. Introduction to computational proteomics. PLoS Comput Biol 2007;3(7):e114.
65. Gygi SP, Corthals GL, Zhang Y, et al. Evaluation of two-dimensional gel electrophoresis-based proteome analysis technology. Proc Natl Acad Sci USA 2000;97(17):9390–5.
66. He Z, Yang C, Yang C, et al. Optimization-based peptide mass fingerprinting for protein mixture identification. J Comput Biol 2010;17(3):221–35.
67. Neilson KA, Ali NA, Muralidharan S, et al. Less label, more free: approaches in label-free quantitative mass spectrometry. Proteomics 2011;11(4):535–53.
68. Lundgren DH, Hwang S-I, Wu L, et al. Role of spectral counting in quantitative proteomics. Expert Rev Proteomics 2010;7(1):39–53.
69. Choi H, Fermin D, Nesvizhskii AI. Significance analysis of spectral count data in label-free shotgun proteomics. Mol Cell Proteomics 2008;7(12):2373–85.
70. Booth JG, Eilertson KE, Olinares PDB, et al. A Bayesian mixture model for comparative spectral count data in shotgun proteomics. Mol Cell Proteomics 2011;10(8):M110.007203.
71. Sharan R, Ulitsky I, Shamir R. Network-based prediction of protein function. Mol Syst Biol 2007;3:88.
72. Klimek J, Eddes JS, Hohmann L, et al. The standard protein mix database: a diverse dataset to assist in the production of improved peptide and protein identification software tools. J Proteome Res 2008;7(1):96–103.
73. Schulz-Trieglaff O, Pfeifer N, Gröpl C, et al. LC-MSsim – a simulation software for liquid chromatography mass spectrometry data. BMC Bioinformatics 2008;9:423.
74. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc Ser B 1995;57(1):289–300.
75. Reiter L, Claassen M, Schrimpf SP, et al. Protein identification false discovery rates for very large proteomics datasets generated by tandem mass spectrometry. Mol Cell Proteomics 2009;8(11):2405–17.
76. Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 2007;4(3):207–14.
77. Adamski M, Blackwell T, Menon R, et al. Data management and preliminary data analysis in the pilot phase of the HUPO Plasma Proteome Project. Proteomics 2005;5(13):3246–61.
78. Gupta N, Pevzner PA. False discovery rates of protein identifications: a strike against the two-peptide rule. J Proteome Res 2009;8(9):4173–81.
79. Hather G, Higdon R, Bauman A, et al. Estimating false discovery rates for peptide and protein identification using randomized databases. Proteomics 2010;10(12):2369–76.
80. Klammer AA, Park CY, Noble WS. Statistical calibration of the SEQUEST XCorr function. J Proteome Res 2009;8(4):2106–13.
81. Spirin V, Shpunt A, Seebacher J, et al. Assigning spectrum-specific p-values to protein identifications by mass spectrometry. Bioinformatics 2011;27(8):1128–34.
82. Mosteller F, Fisher RA. Questions and answers. Am Stat 1948;2(5):30–1.
83. Keller A, Eng J, Zhang N, et al. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol Syst Biol 2005;1:2005.0017.
84. Xue X, Wu S, Wang Z, et al. Protein probabilities in shotgun proteomics: evaluating different estimation methods using a semi-random sampling model. Proteomics 2006;6(23):6134–45.
85. Li X, Gong Y, Wang Y, et al. Comparison of alternative analytical techniques for the characterisation of the human serum proteome in HUPO Plasma Proteome Project. Proteomics 2005;5(13):3423–41.
86. Alves G, Wu WW, Wang G, et al. Enhancing peptide identification confidence by combining search methods. J Proteome Res 2008;7(8):3102–13.
87. Claassen M, Aebersold R, Buhmann JM. Proteome coverage prediction with infinite Markov models. Bioinformatics 2009;25(12):i154–60.
88. Ma B. Challenges in computational analysis of mass spectrometry data for proteomics. J Comput Sci Technol 2010;25(1):107–23.
89. Shi J, Wu F-X. Protein inference by assembling peptides identified from tandem mass spectra. Curr Bioinform 2009;4(3):226–33.
90. Nesvizhskii AI. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics 2010;73(11):2092–123.
91. Claassen M. Design and Validation of Proteome Measurements. PhD Thesis, ETH Zurich, 2010.