## Abstract

Motivation: The key to MS -based proteomics is peptide sequencing. The major challenge in peptide sequencing, whether library search or de novo, is to better infer statistical significance and better attain noise reduction. Since the noise in a spectrum depends on experimental conditions, the instrument used and many other factors, it cannot be predicted even if the peptide sequence is known. The characteristics of the noise can only be uncovered once a spectrum is given. We wish to overcome such issues.

Results: We designed RAId to identify peptides from their associated tandem mass spectrometry data. RAId performs a novel de novo sequencing followed by a search in a peptide library that we created. Through de novo sequencing, we establish the spectrum-specific background score statistics for the library search. When the database search fails to return significant hits, the top-ranking de novo sequences become potential candidates for new peptides that are not yet in the database. The use of spectrum-specific background statistics seems to enable RAId to perform well even when the spectral quality is marginal. Other important features of RAId include its potential in de novo sequencing alone and the ease of incorporating post-translational modifications.

Availability: Programs implementing the methods described are available from the authors on request.

Contact:yyu@ncbi.nlm.nih.gov

Supplementary information:ftp://ftp.ncbi.nih.gov/pub/yyu/Proteomics/MSMS/RAId/MSMS_bioinfo_supp.pdf

## 1 INTRODUCTION

Mass spectrometry (MS) became an important tool in biomolecular study only recently, although the first documented use of MS dates back to 1899 (Thomson, 1899). This is because mass spectrometers require gaseous ions for analysis; transferring biomolecules, which are generally large and polar, into the ionized gas phase is not a simple task. Developed in the late 1980s, soft ionization techniques, such as matrix-assisted laser desorption ionization (MALDI) (Karas and Hillenkamp, 1988) and electrospray (ES) (Fenn et al., 1989), have made MS, and especially tandem MS (MS2), the method of choice in peptide identification, an essential component of protein identification in proteomic studies (Ho et al., 2002; Mann et al., 2001; Shevchenko et al., 2001; Yates, 1998).

In MS2, a parent peptide, with its mass/charge (m/z) ratio selected by the first analyzer, is fragmented by collisions with an inert gas, and the resulting fragments are analyzed by a second analyzer. This technique promises accurate peptide identifications provided that (1) fragments' m/z can be attained accurately, and (2) a robust analysis tool with correct statistical interpretations is available. In terms of the m/z accuracy, advanced techniques, such as the Fourier transform ion cyclotron resonance (FT-ICR) (Comisarow and Marshall, 1974; Henry et al., 1989; Martin et al., 2000) can achieve mass resolution as fine as one part per million (10−6). As for the analysis tools, the numerous methods developed for peptide identification can be categorized into two classes: library search methods and de novo sequencing methods.

Peptide identification tools based on library searches seem most popular. Examples of such methods include, but are not limited to, Sequest (Eng et al., 1994), MS-Tag (Clauser et al., 1996), Mascot (Perkins et al., 1999), Scope (Bafna and Edwards, 2001), Popitam (Hernandez et al., 2003) and Tandem (Craig and Beavis, 2004). This class of methods in general is quite accurate in identifying the correct peptide when spectral qualities are good. For spectra of marginal quality, the lack of appropriate quantification of statistical significance inevitably hinders the performance of library search methods. The other class, de novo sequencing methods, does not use peptide library. For example, Lutefisk (Taylor and Johnson, 1997, 2001), Sherenga (Dančík et al., 1999), and Peaks (Ma et al., 2003) are software designed to derive complete peptide sequences directly from the MS2 spectra. There are also other examples (Chen et al., 2001) along this line of effort. Despite the tremendous amount of effort invested, de novo methods often provide only short motifs correctly but fail to derive the complete sequence.

Judging from the current state of de novo methods and library search methods, it seems sensible to combine the strengths of both. In fact, the sequence-tag method (Mann and Wilm, 1994), employing de novo sequencing to identify peptide tags followed by sequence matching, can be regarded as the first effort along this line. The combination of de novo and library search was also suggested by Taylor and Johnson in their earlier work (Taylor and Johnson, 1997). By using the top

$$\mathcal{K}$$
de novo candidates as queries, each sequence i in a user-specified database obtains a score
$${S}_{i}\equiv \sum _{j=1}^{\mathcal{K}}{s}_{\mathrm{ij}}$$
, with sij being the similarity score between database sequence i and query peptide j. The highest-scoring library sequence, exhibiting strongest similarity to the top-ranking candidates from de novo, is deemed most likely to be the true peptide. This approach is powerful when the top de novo candidates are similar enough to the true peptide. False negatives, however, will occur when the top candidates do not share enough similarities with the true peptide.

To avoid the false negative problem and to attempt a better statistical interpretation, we present in this paper a new method, ‘robust accurate identification of peptides (RAId), that starts with a carefully crafted de novo sequencing followed by a thorough string match (as opposed to homology search) in the database. When we have reasonable performance in the de novo part, this approach can reduce both false positives and false negatives. By retaining a large number of potential candidates, we avoid the occurrence of false negatives; by using an exact string-match, we reduce the occurrence of false positives.

## 2 STATISTICAL SIGNIFICANCE

The key to making MS-based proteomics successful is protein identification, which remains a challenging problem partly owing to the difficulty encountered in peptide identification. Various MS analysis programs, used with their default search parameters, can yield a large percentage of misidentified proteins (Keller et al., 2002; Peng et al., 2003), and their false positive error rates are decreased only in tandem with decreased rates of correct peptide identification. Progress has been made in estimating the false positive and negative rates for a given dataset (Anderson et al., 2003; MacCoss et al., 2002; Nesvizhskii et al., 2003), but this alone does not increase the confidence that can be placed in a particular identification (Olsen and Mann, 2004).

The ability to identify a particular peptide from an MS2 spectrum is clearly dependent upon factors such as sample preparation, collision gas, collision energy, temperature and instruments used. A well designed sample preparation results in less contamination/noise, which is known to be important in the MALDI–MS analysis of peptides and proteins (Kussmann et al., 1997; Yates, 1998). As another example, low collision energy may result in incomplete fragmentation, whereas higher collision energy tends to produce small peptide fragments resulting from overfragmentation. These issues together with other factors mentioned above determine the MS spectrum's quality and specific features, which are hard to predict a priori.

Therefore, any statistical assessment should explicitly take into account these spectrum-specific features and data quality. For example, two different spectra, generated under different experimental conditions by the same peptide, πsig = EETLMEYLENPK, are shown in Figure 1; in this case, the peptide is more easily identified from the second spectrum. The primary hindrances to correct peptide identification come from noise peaks and incomplete peptide fragmentations. For each spectrum in Figure 1, let us separate the peaks into the signal group (peaks that can be explained by πsig) and the noise group (peaks that can not be explained by πsig). Of course, one can do this only if the peptide πsig is already known. Although the two signal groups do overlap, they are far from identical; the overlap between the noise groups is worse. The incomplete overlap between signal groups indicates that the same peptide EETLMEYLENPK will receive different scores from these spectra even when searching the same library; this also shows that the spectrum-independent, prebuilt library statistics for a database search via fitting some training sets can be inappropriate. Furthermore, the substantially smaller overlaps between the noise groups indicates that the noise is spectrum specific.

Taking into account these issues, our approach to improving the accuracy of peptide identification is to combine de novo peptide construction with a protein database search. We capture the quality of a given de novo peptide by a score, calculated in a spectrum-specific, but database-independent manner. Also, we are able to estimate the spectrum-specific score threshold Sth, which only a small fraction of all peptides (with appropriate molecular masses) could attain or surpass. Although the true peptide is often the best scoring one, this is not always the case, and we use database search to select, from among the highest scoring peptides, the correct one.

To elaborate, given a spectrum σ with machine specified parent ion mass, we first calculate, as described in the method section below, the corrected ion mass ℳ, the mass uncertainty δ, and thus range of molecular masses, [ℳ − δ, ℳ + δ], of the peptide to which the spectrum could correspond. We let Π(σ, δ) be the set of all ‘feasible’ peptides, with masses in this range. Given a peptide π from Π(σ, δ), we define a quality score S(π, σ), as described in the method section below. Other than by the generally impractical procedure of examining all members of Π(σ, δ), it is not known how to find the peptide π0 that maximizes S(π, σ). However, as described in the Methods section, we have constructed a heuristic algorithm for generating a large set (typically of size ∼1000) of peptides with high scores, usually including most of the highest scoring peptides. Frequently, the true peptide is the highest scoring one, but we cannot be sure of this with any great confidence. Accordingly, we search a comprehensive protein sequence database

$$\mathcal{D}$$
for the top 1000 peptides returned by our algorithm.

Let Δ(σ, δ) be a random collection of peptides with the correct mass range, i.e. Δ(σ, δ) is a subset of Π(σ, δ). Assume that the presence of each peptide in Δ(σ, δ) is independent of that of the others. We may then estimate the number of peptide hits in Δ(σ, δ) with score at least S to be

(1)
${E}_{h}(S,\sigma )=N(S,\sigma )\frac{\left|\Delta \right(\sigma ,\delta \left)\right|}{\left|\Pi \right(\sigma ,\delta \left)\right|}=\frac{N(S,\sigma )}{\left|\Pi \right(\sigma ,\delta \left)\right|}\left|\Delta \right(\sigma ,\delta \left)\right|,$
where N(S, σ) is the number of peptides from Π(σ, δ) with score at least S, and |Π| represents the number of elements in set Π. When identifying Δ(σ, δ) with peptides within database
$$\mathcal{D}$$
and with feasible molecular mass, Eh(S, σ) is interpreted as the expected number of peptide hits within
$$\mathcal{D}$$
. Consequently, for a spectrum σ a library hit with score S becomes not statistically significant when Eh(S, σ) becomes large.

The statistics are more accurate when we have a reasonably accurate estimate of N(S, σ) (Doerr et al., 2005); however, we find it sufficient to employ a simple heuristic which we now turn to. For molecular mass ranges (e.g. 1100 ± 0.1 to 2300 ± 0.1 Da) implied by typical spectra, we find empirically from our training dataset that N(Sth, σ)/|Π(σ, δ)| is in the range 10−9–10−18 provided that Sth(σ) is chosen to be one half of the highest de novo score for spectrum σ. Since |Δ(σ, δ)| usually decreases with molecular mass ℳ(σ) and is typically of order 107 or less, there is essentially no chance1 of a peptide (with score larger than Sth) other than the correct one being returned by our algorithm as well as being found in the database. In fact, only for low molecular mass peptides (e.g. <935 Da), for which |Δ(σ)| can be >106 and N(Sth, σ)/|Π(σ, δ)| may become >10−7, have we ever found more than one peptide returned by our algorithm to pass the database screen. In these few instances, the highest scoring peptide was invariably the correct one.

## 3 METHODS

In an MS2 process, collisions with low-energy inert gas molecules often break a parent peptide into two or more segments2. When there are more than two segments formed, the ones that do not include either the N-terminal or C-terminal of the parent peptide are called internal fragments. As shown in Figure 2, the non-internal fragments are classified into many named series (Biemann, 1990): a segment containing the N-terminal (C-terminal) of the parent peptide is called a b-ion (y-ion) if it is from the breaking of a peptide bond; similarly, other break points result in a-(x-)ions and c-(z-)ions. As expected from peptide chemistry, the most common break occurs at the peptide bond. Consequently, the dominant and consistent signal peaks correspond to those of the b-ions and y-ions. Since it is also quite common for a peptide to lose an ammonia (NH3) or water (H2O), the m/z peaks associated with b(y) − NH3 and b(y) − H2O are also among the dominant ones. In RAId, we consider nine main series of signal peaks: a, b, c, x, y, b − NH3 (b−17), b − H2O (b − 18), y − NH3 (y − 17) and y −H2O (y − 18). The z-series peaks are excluded because they are usually considerably weaker than the c-series (perhaps because of charge rearrangement preference upon bond breaking) and their inclusion tends to introduce noise rather than provide signal.

RAId uses a conservative de novo method to retain a large number of candidates followed by an exact string match in the peptide library. To succeed in our proposal, we need a peptide library with broad coverage, a decent de novo method, and a scheme that scores de novo peptides and library peptides the same way to allow for correct statistical inference. Below we sketch the major components of RAId; a complete description will be provided elsewhere.

### 3.1 Peptide library construction

The NCBI nr database is theoretically digested using trypsin. Unique peptides with molecular mass up to 3000 Da and with <30 amino acids are collected to form the database. The reason for this limit is that when the parent ion contains >30 amino acids or the ion mass is >3000 Da, it becomes difficult to sequence the peptide using MS2 spectrometry. In our database, each unique peptide has links to all the proteins containing this peptide and the peptide's corresponding locations in each of the target proteins. This information is particularly useful when one wishes to identify proteins based on the peptides identified.

If we collect only properly digested peptides and limit the number of miscleavages, our database will be entirely similar to that of Mascot (Perkins et al., 1999) when the NCBI nr is selected for library search. However, to ensure broadest peptide coverage, we do not limit the number of miscleavages and even allow for incorrect N-terminal cleavages (see Fig. 3 for an example). This evidently causes a significant expansion in the number of peptides within a specified mass range. Nevertheless, it does not slow down our library search much, because RAId matches exact strings only.

We outline the construction of our library here. For every given protein in the NCBI nr database, we first mark every lysine (K) and arginine (R). For every given K or R, we read toward the N terminal for 29 amino acids (provided the accumulated mass is <3000 Da and the N terminal amino acid is not yet reached) and record this peptide. This way, we cover all the possible peptides ≤30 amino acids and under 3000 Da, allow incorrect N-terminal cleavages and do not limit the number of miscleavages.

### 3.2 De novo sequencing

Based on the concept that ‘every candidate peptide is presumed correct until the evidence indicates otherwise,’ RAId's de novo method consists of four steps, each of which will be sketched below. It is this conservative approach, which we term multilayer noise filtering, that allows RAId to retain as many potential peptides as possible for further library identifications.

#### 3.2.1 Preprocessing and raw data filtering

Since the raw spectrum from MS2 contains more information than data collected under centroid mode, RAId prefers to start with raw spectrum and process from there. For every input spectrum, RAId first coarse-grains the masses to ensure that no peaks are within 10−2 Da of each other3. RAId then learns, for the purpose of scoring, important parameters such as the two calibration intensities (Iup and Idown), the spectrum-specific amino acid gap threshold

$$\mathcal{G}$$
, the scale λ and mass standard error R.

RAId estimates, based on the spectrum and the parent ion mass, the lowest intensity among signal peaks and calls it Iup. Similarly, RAId also estimates the noise cutoff and calls it Idown. Mass peaks with intensities less than Idown are removed from scoring, whereas mass peaks with intensities between Iup and Idown are kept but with a weaker contribution to peptide score. The gap threshold

$$\mathcal{G}$$
, computed based on how fast the mass peak intensity decays in the spectrum, decides how severely should RAId penalize the missing peaks when scoring candidate peptides. When the number of peaks with significant intensity is small, RAId scales up the spectrum to obtain more information while preserving relative peak strengths. Through an iterative process, RAId finds the best scale λ ≥ 1 and multiplies each peak intensity by λ. If the resulting intensity exceeds 100, RAId uses the value 100.

Almost all the algorithms/software, de novo or library search, are sensitive to the user input parameters, such as parent ion mass uncertainty and/or fragment ion mass uncertainty. To avoid this complication, RAId learns directly the mass standard error R from the data and then uses 3R as the fragment mass error tolerance and 4R as the parent ion mass tolerance. We now describe in detail how we learn the mass standard error.

One obtains the expected number of amino acids ℓ in the peptide by dividing the parent mass by 110 Da, i.e the average amino acid mass. Considering the nine main series and the two most common types of internal fragments, we find that there are

(2)
$\mathcal{A}=9(\ell -2)+(\ell -3)(\ell -2)$
peak pairs whose mass difference can give rise to one amino acid mass. RAId then extracts the top 3[9(ℓ − 1) + (ℓ − 1)(ℓ − 2)] (three times the expected number of mass peaks from nine main series together with two most common types of internal fragments) intensity peaks and computes pairwise mass differences. Whenever the mass difference D between a pair of peaks is within 0.2 Da of some amino acid X, i.e. |Dm(X)| < 0.2 Da, the mass error |m(X) − D| is binned in a histogram. Starting from smallest error bin, when the accumulated mass-error histogram reaches a count of
$$\mathcal{A}$$
, the corresponding mass error is used as our learned mass standard error.

A similar procedure is then used to learn the parent ion mass. As shown in Table 4 of the Supplementary information, RAId tends to give more accurate parent ion masses than the direct record from the mass spectrometer. After this preprocessing, RAId extracts from the input data a primary list ℒp containing the mass peaks that are most likely to form (b, y) pairs and a secondary list ℒs containing strong but probably unpaired peaks.

#### 3.2.2 Growing de novo candidates

Combinatoric proliferation results in the growth of the number of possible peptides even faster than exponentially with accumulated molecular mass. To be memory efficient, RAId grows the partial candidate peptides to nearly half the molecular mass from the C- and N-termini separately, followed by an assembly procedure to produce the first list of candidate peptides.

In step j, RAId adds an amino acid to a partial peptide only if the resulting partial peptide has molecular mass below 110 × j Da; the addition of the amino acid is postponed otherwise. In every step of the growth, the 10 000 highest-scoring partial peptides will be expanded to roughly 190 000 partial peptides4 and the highest-scoring 10 000 will be retained for growth in the next step. This method avoids the usual difficulties of missing peaks faced by graph-based de novo algorithms (Chen et al., 2001; Dančík et al., 1999; Taylor and Johnson, 1997). As an example, we sketch how RAId scores a C-terminal peptide.

Let the singly charged parent ion mass be ℳ. When a peptide is grown to i amino acids long, RAId searches for a peak list

$${\mathcal{E}}_{i}$$
containing masses that correspond to the y-ion (my), b-ion (mb = ℳ + 1 − my), y − NH3-ion (my−17 = my − 17), y − H2O-ion (my − 18 = my − 18), a-ion (ma = mb − 28), x-ion (mx = my + 26), y2+-ion (
$${m}_{y}^{2+}=({m}_{y}+1)/2$$
), and b2+-ion (
$${m}_{b}^{2+}=({m}_{b}+1)/2$$
). Let Hy denote the first intensity found with the following screening steps: (1) If my ∈ ℒp, then Hy = Iy(my); (2) otherwise, if (my or mb) ∈ ℒs, then Hy = max(Iy, Ib); (3) otherwise, if my−18 ∈ ℒs, then Hy = Iy−18; (4) otherwise, if
$$\left({m}_{y-17}\hbox{ or }{m}_{a}\hbox{ or }{m}_{x}\hbox{ or }{m}_{{b}^{2+}}\hbox{ or }{m}_{{y}^{2+}}\right)\in {\mathcal{L}}_{s}$$
,
$${H}_{y}=max({I}_{y-17},{I}_{a},{I}_{x},{I}_{{b}^{2+}},{I}_{{y}^{2+}})$$
; (5) otherwise, Hy = (10IdownIup)/9. At the i-th amino acid, if my(i) ∈ ℒp or both my(i) and mb(i) are in ℒs, the partial peptide has no gap there. Otherwise, the peptide will acquire a gap number 0 < gi < 1 that increases with the index of the screening steps above. A C-terminal partial peptide containing η amino acids will receive score
(3)
${\displaystyle \sum _{i=1}^{\eta }}\{ln[1+A\left(\delta m\right(i),{H}_{y}(i\left)\right)]+G[\mathcal{G},{T}_{g}\left(i\right),{g}_{i}\left({\mathcal{E}}_{i}\right)\left]\right\},$
where δm(i) is the mass difference between the theoretical value corresponding to Hy(i) and observed value,
(4)
$A(\delta m,{H}_{y})=\Lambda \left(\right|\delta m\left|\right)(0.1+0.9\frac{{H}_{y}-{I}_{\hbox{ down }}}{{I}_{\hbox{ up }}-{I}_{\hbox{ down }}})$
with Λ(x) a monotonic decreasing function that takes value 1 at x = R (the mass standard error learned in data preprocessing) and value 0 for x ≥ 3R, and G is the gap penalty that depends on the gap number gi, the accumulated gap number
$${T}_{g}\left(i\right)=\sum _{j=1}^{i-1}{g}_{j}$$
, and the gap threshold
$$\mathcal{G}$$

(5)
$G[\mathcal{G},{T}_{g}(i),{g}_{i}]=\{\begin{array}{ll}\mathrm{ln}\left(1-{g}_{i}\right)& \hbox{ if }{T}_{g} < \mathcal{G}\\ \mathrm{ln}\left(1-{g}_{i}\right)-\gamma & \hbox{ otherwise }\end{array}$
with γ a positive constant.

Partial peptides that are close to half the molecular mass are then sent to a candidate queue that holds only the best 50 000 partial peptides. In the sequencing procedure, it is possible for the noise peaks and internal fragment peaks to compete with the first few a, b, c peaks of the true peptide and expel the true peptide out of the queue. To reduce the impact of this potential noise, RAId grows the N-terminal partial peptides 19 times5—each with a different, fixed N-terminal amino acid. There are, therefore, 19 candidate queues from the N-terminal sequencing. The N-terminal partial peptides are scored similarly to C-terminal partial peptides. Resulting from trypsin digestion, the C-terminal peptides can only start with K or R. Therefore, the growth of C-terminal peptides is done on a single processor.

#### 3.2.3 Sequence assembly and the heuristic score

The scores in the C-terminal queue are normalized to have maximum 1 by dividing each score by the largest score. This normalization procedure is also carried out for each of the 19 N-terminal queues. This normalization procedure prevents the scenario where partial peptides from one terminus completely dominate peptides from the other terminus. After score normalization in each queue, each of the 19 candidate queues from N-terminal will then be combined with the C-terminal candidate queue to form the 19 assembly pools.

When assembling partial peptides in each assembly pool, we take one partial peptide out of each terminal queue. If the molecular mass of these two peptides plus that of some amino acid is close enough (within allowed mass error) to the parent ion mass, one sums the scores from both partial peptides. RAId then checks if there is any dominant peak (in either ℒp or ℒs) that is used twice, one time from the N-terminal and the other time from the C-terminal. If so, the score of the candidate peptide will be reduced. The score obtained after this check is used as the approximate score of the candidate peptide produced.

RAId then calculates how many of the 50 strongest mass peaks can be explained by the candidate peptide. The number of strong peaks explained and the intensities of those peaks are then combined to form a score, Stop, to quantify how well the candidate peptide explains the apparent feature of the spectrum. Multiplying the approximate score by Stop gives us the heuristic score. The assembled candidate peptides in each assembly pool are then sorted according to the heuristic score.

#### 3.2.4 Final refined scoring

After the heuristic scoring, the top 1000 peptide candidates in each assembly pool are sent to another routine to perform refined scoring. In this stage, the (b, y) sequence integrity, internal fragments and various possible peaks, such as doubly charged b and y peaks, are carefully scored. In particular, a carefully crafted regression scheme is employed to correct for systematic errors. This scheme, implemented in a routine, is needed because it is common that the calibration of a high-resolving power machine drifts from its standard, and the measured m/z values deviate from the theoretical m/z values. The strategy employed in this routine is similar to that in the study of Taylor and Johnson (2001).

### 3.2 Library search

As explained in Section 2, one half of the best de novo score provides a good cutoff for significant database hits, and the true peptide is usually among the highest-scoring de novo candidates. Therefore, the top candidates (1000 of them currently, but this number is adjustable) are checked for a string-match against the library. When no significant library hit is found, the top-ranking candidates from the de novo sequencing results become potential candidates for new peptides that are not yet in the database. However, to avoid contamination of the database, independent experimental verification should be achieved if one wishes to include those top-ranking candidates in the database.

While performing the library search, RAId matches each candidate peptide against every library peptide starting from the C-terminal. In other words, we anchor the C-terminal amino acid of the candidate peptide with the C-terminal amino acid of every library peptide searched. Under such anchoring, whenever a candidate peptide turns out to be a substring of some library peptides, RAId reports such exact matches and no others.

## 4 MATERIALS AND RESULTS

### 4.1 Training dataset and test sets

The internal parameters needed in RAId are trained by using a set of MS2 spectra (numbered 55–67, see Supplementary information), generously provided by J.A. Kowalak, LNT, NIMH. Spectra, of peptides from a trypsin digest of a mixture of chicken lysozyme, bovine serum albumin and horse myoglobin, were obtained on an ABI-QSTAR (LC–MS–QQTOF).

The test datasets were produced by a Q-TOF2 and a Q-TOF-Global mass spectrometer (Micromass, UK) and were downloaded from the weblink provided in the works of Ma et al. (2003). These are the test datasets used by PEAKS (Ma et al., 2003). The peptides are from a trypsin-digested protein mixture of yeast alcohol dehydrogenase, horse myoglobin, bovine serum albumin and horse cytochrome C. The test datasets are ordered as follows: peptides corresponding to spectra 1–17 are from yeast alcohol dehydrogenase; peptides corresponding to spectra 18–40 are from bovine serum albumin; peptides corresponding to spectra 41–50 are from horse myoglobin; peptides corresponding to spectra 51–54 are from horse cytochrome C.

### 4.2 Test results

In terms of library searches, RAId correctly identifies with confidence 92.6% (94.0%) of the peptides excluding (including) the training dataset. For the datasets that RAId fails to report the correct peptides, instead of false positives RAId reports no hits at all. In terms of the de novo part, 61.1% (65.7%) of the time the true peptide ranks among best two de novo candidates when excluding (including) our training set results.

To objectively assess the performance of RAId, we compare the library search performance of RAId with Mascot (Perkins et al., 1999) and Tandem (Craig and Beavis, 2004), and compare the de novo component of RAId with PEAKS6 (Ma et al., 2003) and Lutefisk (Taylor and Johnson, 1997). Each program tested received the same raw data. When the correct peptide is not identified with significance, it can mean either a false negative or a weak positive. Weak positives occur when there is no significant peptide reported and when the true peptide is among the insignificant peptides reported.

The most remarkable aspect of RAId's library search is its ability to correctly and confidently identify the peptide with minimum false/weak positives (Table 1) even when the spectral quality is marginal (in the twilight zone). Basically, we define a spectrum to be in the clear zone if the corresponding peptide is the top-ranking significant hit from all programs considered, i.e. RAId, Mascot and Tandem. A spectrum that is not in the clear zone is defined to be in the twilight zone. According to this classification, there are 37 (39) twilight zone spectra when excluding (including) RAId's training set. However, note that the inclusion or exclusion of these training spectra makes very little difference.

It is worth noting that Mascot has the correct peptide as the number one candidate in the insignificant hit list for each of the 12 weak positive spectra (numbered 2, 6, 14, 16, 18, 20, 24, 25, 26, 41, 42 and 43). Improved score statistics might help move the majority of these spectra into the ‘CIdC’ row. Tandem has the apparent strength that it gives virtually no false positives. The large number of false negatives yielded, however, seems to agree with the general phenomenon when using a higher scoring threshold (Olsen and Mann, 2004).

As for the de novo sequencing, RAId's multilayer noise filtering seems to work very well. Table 2 summarizes the de novo performance comparison of RAId, PEAKS and Lutefisk. Although Lutefisk seems to get the least top-ranking sequences right, the highest-scoring sequence motif reported for each spectrum often shares significant similarity with the true peptide.

In the Supplementary information we document the complete performance test of library searches in Table 3, and the comparison of de novo performance is detailed in Tables 5 and 6.

## 5 DISCUSSION AND CONCLUSION

An outstanding de novo result is not a prerequisite for RAId to perform well in the library searches. When the peptide is in the library, the main purpose of RAId's de novo is to provide a reasonable estimate of the score statistics and a long list of candidate peptides to search against the database. In RAId, one may specify up to three post-translational modifications (PTM) in the de novo growth stage and consequently, the background statistics is limited to have at most three PTM at the current stage.

The other potential limitation of RAId comes from its requirement to have the true peptide correctly sequenced de novo prior to the library search. Although the purpose of this is to eliminate potential false positives, sometimes it will lead to a false negative because the correct peptide is not present in the candidate pool owing to incomplete fragment information. However, methods based on scoring library sequences against the query spectrum will be less influenced by such information loss. As an example, spectrum #54 in Table 3 of the Supplementary information is correctly identified by Tandem, but not by RAId. This limitation, however, can potentially be overcome by a different string comparison strategy when performing the database search. Upon testing a different strategy, where only a partial string match is required, one recovers the corresponding sequence correctly. We are currently investigating the most effective way to combine this new search approach with the exact string match that is currently employed by RAId. Another possibility is to use programs like MS Blast (Shevchenko et al., 2001) to perform the library search. Interestingly, by simply concatenating the top three de novo candidates from spectrum #54, we also recovered the correct peptide using MS Blast.

RAId is designed to analyze spectra obtained by fine/high–resolution instruments, although it might also work with coarse/low–resolution spectra. To allow PTM detection, it is important to enable mass distinctions among nearly degenerate (PTM) amino acids. This requirement of fine resolution is also essential if one wishes to incorporate PTMs into the background statistics. As a test example, we use RAId to provide peptide candidates for spectrum #50, in Table 3 of the Supplementary information, with the correct methionine sulfoxidation PTM included together with two other arbitrary PTMs selected. RAId was then able to identify the correct sequence corresponding to spectrum #50 in both de novo part and library search part.

Owing to mass degeneracy, however, introduction of PTMs in analyzing spectra of coarse resolution is most likely to contaminate the de novo sequencing and degrade the performance of RAId. A similar degradation can occur when one uses, instead of the raw spectrum, the coarse-grained spectrum provided by the manufacturer.

Written in C++, RAId is designed for accuracy and robustness. By default it runs both the de novo part and the library search part upon one command. Depending on the parent ion mass, on a 2.8 GHz processor the runtime per spectrum using RAId is ∼2–4 min. This runtime is about four to seven times longer than that of mascot, which, however, runs on a dedicated 6-CPU cluster. For a large laboratory with 20 dedicated processors, RAId will be able to process ∼10 000 spectra a day. When computational resources are limited, RAId may not seem ideal, at first sight, for a high-throughput setup.

The runtime of RAId can be significantly reduced, however, when one is dealing with peptides from only a few organisms and is willing to use a smaller subset of possible sequences in simulating the score statistics. The idea is to restrict the candidate sequences to be within the protein ensemble of the organisms of interest. At the same time, the background statistics can be simulated separately with many fewer candidates held in each step of the growth. We will describe this heuristic approach in more detail elsewhere (Alves and Yu, 2005). Under this new heuristic, the true peptide (if present in the organismal database) may be lost in the de novo growth, but will be recovered in the restricted growth within the database. Based on the background score statistics, we can again determine whether the peptide hit found within the organismal database is statistically significant (Doerr et al., 2005). Evidently, the statistics obtained this way can be less accurate as a trade-off for the speed. For a spectrum that does not have any significant peptide hit within the organismal database, one can run the full version of RAId to determine whether it results from a peptide of another organism or from a novel peptide that is potentially identifiable by RAId.

1However, peptides with (nearly) identical masses resulting from high identity in primary structure can also trigger multiple hits. For example, in Table 3 of the Supplementary information spectra 52 and 59 both correspond to the same true peptide HGTVVLTALGGILK and both have an additional database hit HGVTVLTALGGILK with score lower than the true one. Note that the only difference between these two hits is the underlined transposition. Nevertheless, when looking into the sequence database, we find that all the protein sequences containing the second peptide hit HGVTVLTALGGILK are not yet validated. Therefore, it is not impossible that database proteins containing the second hit share a mistake in their primary sequences.

2In each individual breakage event, the electric charges—associated with the broken covalent bond—get stochastically redistributed, and the segment(s) with net positive charges will get its (their) m/z value(s) recorded by the mass spectrometer.

3As a simple example, let two masses m1 and m2 be within 10−2 Da of each other and have peak intensities I1 and I2, respectively. These two peaks will then be merged into one with a new mass (m1I1 + m2I2)/(I1 + I2). Generalization to merging multiple mass peaks is straightforward.

4As with most de novo methods, RAId's de novo part does not distinguish leucine (L) from isoleucine (I) and labels them both by L.

5Depending on the hardware configuration, it is also possible to use 19 processors in parallel to achieve this task more effectively.

6The 2003 trial version of the PEAKS commercial software was used in this test. It gives results slightly different from what is reported in the original paper (Ma et al., 2003).

Fig. 1

The two spectra associated with the same peptide EETLMEYLENPK. Both parts (a) and (b) (respectively spectrum 43 and 49 in Table 3 of Supplementary information) are Q-TOF data; the parent ion in (a) is triply charged, whereas the parent ion in (b) is doubly charged. When using the default parameters, both Mascot and Tandem find no significant hit for spectrum (a) and find the correct peptide for spectrum (b).

Fig. 1

The two spectra associated with the same peptide EETLMEYLENPK. Both parts (a) and (b) (respectively spectrum 43 and 49 in Table 3 of Supplementary information) are Q-TOF data; the parent ion in (a) is triply charged, whereas the parent ion in (b) is doubly charged. When using the default parameters, both Mascot and Tandem find no significant hit for spectrum (a) and find the correct peptide for spectrum (b).

Fig. 2

Bond breaking points along a peptide chain. (a) In the example peptide (of four amino acids), a break can occur at any of the sites labeled by the dashed vertical lines. The resulting fragment ions with the N-terminus are the a-, b- and c-ions, and the ions with the C-terminus are the x-, y- and z-ions (Biemann, 1990); (b) The segment(s) with its(their) left cut at 1 or 2 or 3 and right cut at 1′ or 2′ or 3′ correspond to internal fragments. The most common pair of cutting points is (3, 3′). Note that two other pairs of cutting points, (1, 1′) and (2, 2′), also yield the same internal fragment masses, i.e.

$${\mathcal{M}}_{1,{1}^{\prime }}={\mathcal{M}}_{2,{2}^{\prime }}={\mathcal{M}}_{3,{3}^{\prime }}$$
. The other frequent pair of cutting points, (3, 2′), yields a molecular mass
$${\mathcal{M}}_{3,{2}^{\prime }}={\mathcal{M}}_{3,{3}^{\prime }}-28$$
.

Fig. 2

Bond breaking points along a peptide chain. (a) In the example peptide (of four amino acids), a break can occur at any of the sites labeled by the dashed vertical lines. The resulting fragment ions with the N-terminus are the a-, b- and c-ions, and the ions with the C-terminus are the x-, y- and z-ions (Biemann, 1990); (b) The segment(s) with its(their) left cut at 1 or 2 or 3 and right cut at 1′ or 2′ or 3′ correspond to internal fragments. The most common pair of cutting points is (3, 3′). Note that two other pairs of cutting points, (1, 1′) and (2, 2′), also yield the same internal fragment masses, i.e.

$${\mathcal{M}}_{1,{1}^{\prime }}={\mathcal{M}}_{2,{2}^{\prime }}={\mathcal{M}}_{3,{3}^{\prime }}$$
. The other frequent pair of cutting points, (3, 2′), yields a molecular mass
$${\mathcal{M}}_{3,{2}^{\prime }}={\mathcal{M}}_{3,{3}^{\prime }}-28$$
.

Fig. 3

Demonstration of miscleavage and incorrect cleavage. (a) Depicts a segment of a protein sequence, with all theoretical trypsin cleavage points marked by ‘V’. The peptide in (b) corresponds to the case of two miscleavages. In addition to one miscleavage, the peptide in (c) has an incorrect N-terminal cleavage because the peptide bond between

$$\mathsf{\hbox{ H }}$$
and
$$\mathsf{\hbox{ T }}$$
is not one of the theoretical cleavage points.

Fig. 3

Demonstration of miscleavage and incorrect cleavage. (a) Depicts a segment of a protein sequence, with all theoretical trypsin cleavage points marked by ‘V’. The peptide in (b) corresponds to the case of two miscleavages. In addition to one miscleavage, the peptide in (c) has an incorrect N-terminal cleavage because the peptide bond between

$$\mathsf{\hbox{ H }}$$
and
$$\mathsf{\hbox{ T }}$$
is not one of the theoretical cleavage points.

Table 1

Library search performance comparison in twilight zone

RAId Tandem Mascot
CIdC 33 (35) 9 (9) 6 (8)
FPO 0 (0) 1 (1) 19 (19)
FN/WP 4 (4) 27 (29) 12 (12)
RAId Tandem Mascot
CIdC 33 (35) 9 (9) 6 (8)
FPO 0 (0) 1 (1) 19 (19)
FN/WP 4 (4) 27 (29) 12 (12)

The row labeled by ‘CIdC’ records the number of spectra whose corresponding peptides are correctly identified with confidence; the row labeled by ‘FPO’ lists the number of spectra whose candidate lists contain false peptides only; the row labeled by ‘FN/WP’ lists the number of spectra each of whose correct peptide is missing from the list of significant hits. Numbers obtained when including spectra numbered from #55 to #67 are shown in parentheses.

Table 2

De novo performance summary

RAId PEAKS Lutefisk
Top 2 33 (44) 23 (33) 10 (12)
3–20 12 (14) 14 (15) N/A
Below 20 9 (9) 17 (19) N/A
RAId PEAKS Lutefisk
Top 2 33 (44) 23 (33) 10 (12)
3–20 12 (14) 14 (15) N/A
Below 20 9 (9) 17 (19) N/A

The first row documents the number of times the true sequence ranks in the top two in the de novo sequencing. The second row records the number of times the true sequence ranks between 3 and 20. The third row records the number of times the true sequence ranks below 20. Numbers obtained when including spectra numbered from #55 to #67 are shown in parentheses.

We thank Richard Johnson for sending us the Lutefisk program. We also thank the administrative group of the NIH biowulf clusters, where all the computational tasks were carried out. We thank Hank Fales, David Landsman, David Lipman, Lewis Pannel and John Wootton for their helpful comments. The authors are particularly indebted to Stephen Altschul for his helpful comments and very valuable suggestions in presentation. This work was supported by the Intramural Research Program of the National Library of Medicine at National Institutes of Health/DHHS. Funding to pay the Open Access publication charges for this article was provided by the NIH

Conflict of Interest: none declared.

## REFERENCES

2003
A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptite.
J. Proteome Res.

2
137
–146
Alves, G. and Yu, Y.-K. Unpublished
Bafna, V. and Edwards, N.
2001
SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database.
Bioinformatics

17
13
–21
Biemann, K.
1990
Nomenclature for peptide fragment ions (positive ions), in Mass Spectrometry. In McCloskey, J.A. (Ed.).
Methods in Enzymology
, San Diego Academic Press Vol.
193
, pp.
886
–887
Chen, T., et al.
2001
A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry.
J. Comp. Biol.

8
325
–337
Clauser, K.R., Baker, P.R., Burlingame, A.L.
1996
Peptide fragment-ion tags from maldi/psd for error tolerant seaching of genomic databases. Proceedings of the 44th ASMS Conference on Mass Spectrometry and Allied TopicsMay 12–16Portland, Oregan , pp.
365
Craig, R. and Beavis, R.C.
2004
TANDEM: matching proteins with tandem mass spectra.
Bioinformatics

20
1466
–1467
Comisarow, M.B. and Marshall, A.G.
1974
Fourier transform ion cyclotron resonance [FT-ICR] spectroscopy.
Chem. Phys. Lett.

25
282
–283
Dančík, V., et al.
1999
De novo peptide sequencing via tandem mass spectrometry.
J. Comp. Biol.

6
327
–342
Doerr, T.P., et al.
2005
Ranked Solutions to a class of combinatorial optimizations-with applications in mass spectrometry based peptide sequencing and a variant of directed paths in random media.
Phys. A.

354
558
–570
Eng, J.K., et al.
1994
An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database.
J. Amer. Soc. Mass Spectrom.

5
976
–989
Fenn, J.B., et al.
1989
Electrospray ionization for mass spectrometry of large biomolecules.
Science

246
64
–71
Henry, K.D., et al.
1989
Fourier-transform mass spectrometry of large molecules by electrospray ionization.

86
9075
–9078
Hernandez, P., et al.
2003
Popitam: towards new heuristic strategies to improve protein identification from tandem mass spectrometry data.
Proteomics

3
870
–878
Ho, Y., et al.
2002
Systematic identification of protein complexes in Saccharomyces Cerevisiae by mass spectrometry.
Nature

415
180
–183
Karas, M. and Hillenkamp, F.
1988
Laser desorption ionization of proteins with molecular masses exceeding 10 000 daltons.
Anal. Chem.

60
2299
–2301
Keller, A., et al.
2002
Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search.
Anal. Chem.

74
5383
–5392
Kussmann, M., et al.
1997
Matrix-assisted laser desorption/ionization mass spectrometry sample preparation techniques designed for various peptide and protein analytes.
J. Mass. Spect.

32
593
–601
Ma, B., et al.
2003
PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry.
Rapid Commu. Mass Spect.

17
2337
–2342
MacCoss, M.J., et al.
2002
Probability-based validation of protein identification using a modified SEQUEST algorithm.
Anal. Chem.

74
5593
–5599
Mann, M. and Wilm, M.
1994
Error-tolerant identification of peptides in sequence database by peptide sequence tags.
Anal. Chem.

66
4390
–4399
Mann, M., et al.
2001
Analysis of proteins and proteomes by mass spectrometry.
Annu. Rev. Biochem.

70
437
–473
Martin, S.E., et al.
2000
Subfemtomole MS and MS/MS peptide sequence analysis using Nano-HPLC Micro-ESI Fourier transform ion cyclotron resonanace mass spectrometry.
Anal. Chem.

72
4266
–4274
Nesvizhskii, A.I., et al.
2003
A statistical model for identifying proteins by tandem mass spectrometry.
Anal. Chem.

75
4646
–4658
Olsen, J.V. and Mann, M.
2004
Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation.

101
13417
–13422
Peng, J., et al.
2003
Evaluation of multidimension chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome.
J. Proteome Res.

2
43
–50
Perkins, D.N., et al.
1999
Probability-based peotein identification by searching sequence database using mass spectrometry data.
Electrophoresis

20
3551
–3567
Shevchenko, A., et al.
2001
Charting the proteomes of organisms with unsequenced genomes by MALDI-Quadrupole time-of-flight mass spectrometry and BLAST homology searching.
Anal. Chem.

73
1917
–1926
Taylor, J.A. and Johnson, R.S.
1997
Sequence database searches via de novo peptide sequencing by tandem mass spectrometry.
Raid Commu. Mass Spect.

11
1067
–1075
Taylor, J.A. and Johnson, R.S.
2001
Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry.
Anal. Chem.

73
2594
–2604
Thomson, J.J.
1899
On the masses of the ions in gases at low pressures.
Philosophical Magazine

48
547
–567
Yates, J.R.III.
1998
Mass spectrometry and the age of the proteome.
J. Mass. Spect.

33
1
–19