Fast peak error correction algorithms for proteoform identification using top-down tandem mass spectra

Abstract
Motivation: Proteoform identification is an important problem in proteomics. The main task is to find a modified protein that best fits the input spectrum. To overcome the combinatorial explosion of possible proteoforms, the proteoform mass graph and spectrum mass graph are used to represent the protein database and the spectrum, respectively. The problem becomes finding an optimal alignment between the proteoform mass graph and the spectrum mass graph. Peak error correction is an important issue for computing an optimal alignment between the two input mass graphs.
Results: We propose a faster algorithm for the error correction alignment of spectrum mass graph and proteoform mass graph problem and produce a program package TopMGFast. The newly designed algorithms require less space and running time so that we are able to compute global optimal alignments for the two input mass graphs in a reasonable time. For the local alignment version, experiments show that the running time of the new algorithm is reduced by 2.5 times. For the global alignment version, experiments show that the maximum mass errors between any pair of matched nodes in the alignments obtained by our method are within a small range as designed, while the alignments produced by the state-of-the-art method, TopMG, have very large maximum mass errors for many cases. The obtained alignment sizes are roughly the same for both TopMG and TopMGFast. Of course, TopMGFast needs more running time than TopMG. Therefore, our new algorithm can obtain more reliable global alignments within a reasonable time. This is the first time that global optimal error correction alignments can be obtained using real datasets.
Availability and implementation: The source code of the algorithm is available at https://github.com/Zeirdo/TopMGFast.


Introduction
Recent studies show that changes in protein isoform expression and post-translational modifications (PTMs) have gained recognition for their roles in underlying disease mechanisms (Brown et al. 2020). Besides, all protein products that arise from a single gene due to genetic variations, alternative splicing, and PTMs can significantly diversify the proteome for basic and clinical research (Melby et al. 2021). Proteoform identification is a challenging problem that has attracted considerable attention.
Traditional 'bottom-up' mass spectrometry (MS)-based proteomics has become less practical for high-throughput proteoform identification due to the 'peptide-to-protein' inference problem, with its low protein sequence coverage and loss of proteoform information (Toby et al. 2016, Schaffer et al. 2019). For this reason, 'top-down' MS-based proteomics was coined by McLafferty and co-researchers in 1999. 'Top-down' MS-based proteomics analyzes intact proteins instead of digesting them into peptides, which enables accurate proteoform identification, PTM localization, and relative quantification for a better understanding of essential biological functions, unraveling disease mechanisms and discovering new biomarkers (Shaw et al. 2013, Riley et al. 2017, Schaffer et al. 2018, Dai et al. 2019, Arauz-Garofalo et al. 2021, Melby et al. 2021). Besides, when multiple PTMs coexist on single protein molecules, 'top-down' MS-based proteomics becomes the only feasible method for characterization (Holt et al. 2019).
For proteoform identification, we need to align a top-down spectrum against a protein database. To handle the combinatorial explosion of possible proteoforms, the proteoform mass graph was proposed (Kou et al. 2017), and a program package TopMG was produced. Moreover, a top-down spectrum can be represented as a spectrum mass graph. The problem of searching a spectrum against a database is formulated as finding an optimal alignment between the two mass graphs. Here we design new algorithms for aligning the two mass graphs with special treatment of spectrum peak error handling.
When aligning the peaks of the spectrum with proteoform mass graphs, the theoretical mass values on the proteoform mass graphs are not necessarily identical to the mass values of the corresponding peaks, since the mass values of peaks have errors. In previous methods, an error tolerance value $\delta$ is used to handle this issue: for any two consecutive nodes in the obtained alignment, the mass between the two consecutive nodes in the spectrum mass graph is within the range $[x - \delta, x + \delta]$, where $x$ is the theoretical mass between the two consecutive nodes in the proteoform mass graph. This method has a serious problem: taking a global view of the whole obtained alignment, one may find that the mass between two nodes in one graph is significantly different from that of the two corresponding nodes in the other graph, due to the accumulation of consecutive positive (or negative) errors. However, any big difference in masses on the two input mass graphs between any pair of matched nodes is clear evidence that the obtained alignment is not reliable. Thus, it is necessary to have some special ways to handle this issue. Some heuristic methods have been used in Kou et al. (2017).
A new model was proposed in Zhan and Wang (2022) to handle spectrum peak errors, and the corresponding program package is referred to as TopMGRefine. They require that each matched node $y_i$ (peak) with mass value $m_i$ in the spectrum mass graph has an error correction value $k$ so that $m_i + k$ is the 'true mass' after error correction, which should be the same as the theoretical mass in the proteoform mass graph. The problem is referred to as the error correction alignment of spectrum mass graph and proteoform mass graph problem. The dynamic programming algorithm given in Zhan and Wang (2022) needs an extra index $k$, so the running time of the algorithm is increased by a factor equal to the number of candidate values of $k$, and it is extremely slow in practice. Thus, the authors can only provide a program that computes a local optimal alignment of two input mass graphs, with the alignment starting nodes in the two graphs given as additional input (Zhan and Wang 2022). They propose to use existing methods to provide a few candidate starting nodes and use their program to get better quality local alignments.
In this article, we propose a faster algorithm for the error correction alignment of spectrum mass graph and proteoform mass graph problem and produce a program package TopMGFast in C++. The newly designed algorithms require less space and running time so that we are able to compute global optimal alignments for the two input mass graphs in a reasonable time.
For the local alignment version, we used the 2817 protein and spectrum pairs obtained after filtering for experiments. Both TopMGRefine and TopMGFast obtain identical local alignment results. The total running time of the 2817 protein and spectrum pairs for TopMGRefine and TopMGFast is 4760 and 1715 min, respectively. That is, the new algorithm is much faster.
For the global alignment version, the same dataset is used. Since TopMGRefine does not support a global alignment version due to large memory and running time requirements, we compare our program with TopMG. Experiments show that the maximum mass errors between any pair of matched nodes in the alignments obtained by our method are within a small range as designed, while TopMG-generated alignments have very large maximum mass errors for many cases. The obtained alignment sizes are roughly the same for both TopMG and TopMGFast. Of course, TopMGFast needs more running time. In fact, the running time of TopMGFast is 3 times that of TopMG. Therefore, our new algorithm can obtain more reliable global alignments within a reasonable time. This is the first time that global optimal error correction alignments can be obtained using real datasets.

Materials and methods
To deal with all possible proteoforms of a protein, Kou et al. formulate a protein and all its possible proteoforms as a proteoform mass graph (PMG for short) (Kou et al. 2017). Each amino acid in a protein has a left node and a right node corresponding to the bonds left and right of the amino acid. An edge connecting the two nodes is assigned the mass of the amino acid. Each edge corresponding to an original residue has a black color. For each modification of an amino acid, there is a new edge with the modified mass connecting the two nodes of the original amino acid. The edges corresponding to modifications of amino acids have a red color. The PMG has a unique starting node and an ending node.
A spectrum is also formulated as a spectrum mass graph (SMG for short), which contains a special node $y_0$; each peak $p_i$ with mass $m_i$ in the spectrum corresponds to a node $y_i$ for $i > 0$ in the SMG, and there is a directed edge connecting $y_i$ and $y_{i+1}$ with mass $m_{i+1} - m_i$ such that the length of the path from $y_0$ to $y_i$ is $m_i$. For a pair of peaks $p_i$ and $p_j$ with $i < j$ in the spectrum, if $m_j - m_i$ is the same as the mass of an amino acid or of a modification of an amino acid in a protein, we can match $y_i$ and $y_j$ to the two nodes corresponding to the amino acid. Besides, following Kou et al. (2017), we convert each mass value $m$ in both PMG and SMG into an integer by using the formula $\lfloor m \times 274.335215 \rfloor$.
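The SMG construction and the integer conversion can be sketched in a few lines. This is a minimal illustration, not the TopMGFast implementation: the function names `to_int_mass` and `build_smg` are hypothetical, the special node $y_0$ is assumed to carry mass 0, and the three input masses are arbitrary example values (57.02146 Da is the glycine residue mass).

```python
import math

SCALE = 274.335215  # scaling factor quoted in the text


def to_int_mass(mass_da: float) -> int:
    """Convert a mass to the integer scale: floor(m * 274.335215)."""
    return math.floor(mass_da * SCALE)


def build_smg(peak_masses_da):
    """Return (nodes, edges) for a spectrum mass graph.

    nodes[0] is the special node y_0 (assumed mass 0); nodes[i] for i > 0 is
    the integer mass m_i of peak p_i. edges[i] is the mass m_{i+1} - m_i on
    the directed edge y_i -> y_{i+1}.
    """
    nodes = [0] + sorted(to_int_mass(m) for m in peak_masses_da)
    edges = [b - a for a, b in zip(nodes, nodes[1:])]
    return nodes, edges


nodes, edges = build_smg([57.02146, 128.05858, 241.14264])
# The path length from y_0 to y_i equals m_i by construction.
assert all(sum(edges[:i]) == nodes[i] for i in range(len(nodes)))
```

Storing only the consecutive edges is enough here, because the mass between any two nodes $y_i$ and $y_j$ is the difference of their node masses.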
However, since the masses of peaks have errors, we cannot assume that $m_j - m_i$ is identical to the theoretical mass of an amino acid. Some kind of peak error handling is required. Usually, people use a value $\delta$ for error tolerance.
Two ways to set error tolerance value $\delta$: There are two ways to set the error tolerance value $\delta$ (Kou et al. 2016, 2017, Zhan and Wang 2022). One way is to simply set $\delta = 27$ for every peak. The second method sets a value $\delta_i$ to be the error tolerance for $m_i$ for each peak $y_i$, and we call this kind of error tolerance peak-dependent error tolerance. We can calculate the peak-dependent error tolerance for each peak $y_i$ as follows.
1) For an original peak $y_i$ with mass $m_i$, $\delta_i = 27 + \frac{15}{1\,000\,000} m_i$.
2) For a complementary peak $y_i$ with mass $m_i$, […], where $M$ is the mass of the whole protein.
3) For any peak $y_i$ with mass $m_i$ larger than 5000, the corresponding $\delta_i$ should be further enlarged. An alternative way (Kou et al. 2016, 2017) to handle this kind of peak is to add another two peaks with masses $m - 1.00235$ and $m + 1.00235$ in the pre-processing stage. After multiplying by the factor 274.335215, there are three peaks that are about 274.335215 apart in the spectrum.
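Items 1 and 3 can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, $m_i$ in item 1 is taken to be on the integer scale, the threshold 5000 in item 3 is taken to be in unscaled mass units, and the tolerance is returned unrounded because the text does not say how it is rounded. Item 2 is omitted because its formula is not available here.

```python
def peak_tolerance(m_i: float, base: float = 27.0, ppm: float = 15.0) -> float:
    """Item 1: delta_i = 27 + (15 / 1 000 000) * m_i for an original peak."""
    return base + ppm / 1_000_000 * m_i


def add_neighbor_peaks(masses, threshold=5000.0, shift=1.00235):
    """Item 3 (alternative): for each peak heavier than the threshold, also
    add the two peaks m - 1.00235 and m + 1.00235 instead of enlarging delta_i."""
    out = []
    for m in masses:
        out.append(m)
        if m > threshold:
            out.extend([m - shift, m + shift])
    return sorted(out)


print(peak_tolerance(0))                      # 27.0
print(add_neighbor_peaks([100.0, 6000.0]))    # one light peak, three heavy peaks
```

The second function trades a larger spectrum for a smaller per-peak tolerance, which is the trade-off the text discusses for item 3.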
For item 3, there is a trade-off between increasing the number of peaks in the spectrum and increasing the value of $\delta_i$. We observe that increasing the number of peaks is a better choice in terms of the speed of the algorithms, since the value of $\delta_i$ plays an important role in the algorithm with error correction. We denote $\delta_{\max} = \max_{i=0}^{n} \delta_i$.
The traditional method (Kou et al. 2017) computes an alignment between the two graphs (PMG and SMG). The alignment contains a list of nodes $x_{j_1}, x_{j_2}, \ldots, x_{j_k}$ from the proteoform mass graph and a list of nodes $y_{i_1}, y_{i_2}, \ldots, y_{i_k}$ from the spectrum mass graph such that for any two consecutive nodes $y_{i_q}$ and $y_{i_{q+1}}$ in the alignment, $m_{i_{q+1}} - m_{i_q}$ is in the range $[m_q - \delta, m_q + \delta]$, where $m_q$ is the theoretical mass for a path between $x_{j_q}$ and $x_{j_{q+1}}$ in PMG. We refer to this kind of error tolerance method as a local edge tolerance method. Local edge tolerance methods may suffer from error accumulation if positive (or negative) errors occur for many edges in the alignment. For example, if $m_{i_{q+1}} - m_{i_q} = m_q + \delta$ for $q = 1, 2, \ldots, k$, then the mass between $y_{i_1}$ and $y_{i_k}$ is equal to the theoretical mass $\sum_{q=1}^{k} m_q$ plus $k\delta$. Thus, in the alignment, there exists a pair of matched nodes, say, $y_{i_1}$ and $y_{i_k}$, such that the mass between the two nodes has an error $k\delta$ compared to the theoretical mass in the PMG. Such a big mass error $k\delta$ shows that the obtained alignment is not reliable. The problem is so severe that it is necessary to use some kind of heuristic method to refine the obtained alignment (Kou et al. 2017).
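The accumulation effect is easy to reproduce numerically. The sketch below (with a hypothetical helper name) takes the signed errors on consecutive matched edges and returns the largest absolute error between any two matched nodes; since the error between two nodes is the sum of the edge errors between them, it is a difference of prefix sums.

```python
from itertools import accumulate


def max_pairwise_error(edge_errors):
    """Largest |error| between any two matched nodes, given the signed errors
    on the edges between consecutive matched nodes."""
    prefix = [0] + list(accumulate(edge_errors))
    # max over a < b of |prefix[b] - prefix[a]| equals max(prefix) - min(prefix)
    return max(prefix) - min(prefix)


delta = 27
print(max_pairwise_error([delta] * 5))           # 135: five edges, all at +delta
print(max_pairwise_error([delta, -delta] * 3))   # 27: alternating signs cancel
```

Both inputs satisfy the local edge tolerance on every edge, yet only the second one keeps the error between every pair of matched nodes within $\delta$.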

Error correction alignment and the dynamic programming algorithm outline
To deal with the peak errors that may occur in the spectrum in a more accurate way, Zhan and Wang proposed a new model for error correction of peaks (Zhan and Wang 2022). They still use the proteoform mass graph (PMG) and the spectrum mass graph (SMG) to represent the database and the spectrum. They propose to have an error correction for each matched peak in the alignment so that, after error correction, the mass between any two matched peaks in SMG is identical to the corresponding theoretical mass in PMG.
Let $x_0, x_1, \ldots, x_n$ be the nodes in a PMG $G$ and $y_0, y_1, \ldots, y_m$ be the nodes in an SMG $H$. An error correction alignment of $G$ and $H$ with size $r$ is a sequence of $r$ triples $(x_{j_1}, y_{i_1}, k_{i_1}), (x_{j_2}, y_{i_2}, k_{i_2}), \ldots, (x_{j_r}, y_{i_r}, k_{i_r})$ such that $(m_{i_q} + k_{i_q}) - (m_{i_{q-1}} + k_{i_{q-1}}) = M_{j_{q-1}, j_q}$ for $1 < q \le r$, where $k_{i_q} \in (-\delta_{i_q}, \delta_{i_q})$ is the error correction value for peak $y_{i_q}$, $m_{i_q}$ is the mass of the peak $y_{i_q}$, and $M_{j_{q-1}, j_q}$ is the mass of a path between node $x_{j_{q-1}}$ and node $x_{j_q}$. Note that $\delta$ is the error tolerance value for the mass of the peak. The error correction alignment problem is to compute an error correction alignment between $G$ and $H$ with the maximum size. Let $T(i, j, k)$ be the maximum size of the alignments between the first $i$ peaks in $H$ and the first $j$ nodes in $G$ such that the corrected peak of $y_i$ has the mass value $m_i + k$, and $y_i$ in $H$ matches $x_j$ in $G$. Note that the initial value of $T(i, j, k)$ is set to be 1 for all $0 \le i \le m$, $0 \le j \le n$ and $-\delta_i \le k \le \delta_i$.
When computing $T(i, j, k)$ for every $0 \le i \le m$, $0 \le j \le n$ and $-\delta_i \le k \le \delta_i$, a dynamic programming algorithm can be used to simplify the process. Let $d(s, j)$ be the set of distinct masses for paths from $x_s$ to $x_j$ in $G$. For each mass $m \in d(s, j)$, there is a list of nodes corresponding to peaks in $H$ with mass values in the range […]. Let $\mathrm{list}(i, j, k, m)$ be such a sorted list. The peaks in this list can be matched to the node $x_s$ in $G$ under the condition that the corrected peak of $y_i$ with mass $m_i + k$ matches the node $x_j$ in $G$. Therefore, the following dynamic programming equation can be used for computing $T(i, j, k)$: $T(i, j, k) = \max$ […] (1), where condition (2) is as follows: […] (2). When computing $\mathrm{list}(i, j, k, m)$, let $C(i, j)$ be the set of all $\mathrm{list}(i, j, k, m)$, which can be formulated as $C(i, j) = \{\mathrm{list}(i, j, k, m) \mid m \in \bigcup_{s<j} d(s, j)\}$. Let $D = \bigcup_{s<j} d(s, j)$ be the set of masses of subpaths in $G$ that we consider for the computation of alignments. Let $M = \{(m, i', i) \mid i' < i;\ y_i, y_{i'} \in H;\ m = m_i - m_{i'}\}$ be the set of triples for mass differences $m$ between any pair of nodes $(y_{i'}, y_i)$ in $H$. Then, $D$ and $M$ are both sorted in nondecreasing order. Going through the elements in sorted $D$ and $M$ once, we can create $C(i, j)$ with all the lists $\mathrm{list}(i, j, k, m)$ sorted. Moreover, for a specific peak $y_i$, a node $x_j$ and a path mass $m \in \bigcup_{j'=0}^{j-1} d(j', j)$ in $G$, one element in $\mathrm{list}(i, j, k, m)$ is enough to compute $T(i, j, k)$ for all $k \in (-\delta_i, \delta_i)$ instead of using Equation (1) for all $k \in (-\delta_i, \delta_i)$ (see Theorem 1 in Zhan and Wang 2022).
Thus, the total running time of computing all $T(i, j, k)$ and finding the largest one is $O(nm\delta_{\max} + L)$, where $L$ is the total size of $\bigcup_{j'=0}^{j-1} d(j', j)$. The above algorithm is still too slow in practice. Thus, Zhan and Wang designed a local algorithm that can only compute an optimal alignment when the starting positions in both $G$ and $H$ are given. In this case, a sub-matrix of $T(i, j, k)$ around the diagonal is computed so that the running time and the memory usage are further reduced. The detailed algorithm can be found in Supplementary Section S1A.
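To make the table $T(i, j, k)$ concrete, here is a brute-force reference sketch on a toy instance. It is not the algorithm of Zhan and Wang (2022) and does not reach the $O(nm\delta_{\max} + L)$ bound; it simply applies one reading of the definitions: $T(i, j, k)$ is 1 plus the best $T(i', s, k')$ over $s < j$, $i' < i$ and path masses $m \in d(s, j)$ with $(m_i + k) - (m_{i'} + k') = m$. The function names, the representation of the PMG as a list of tuples of allowed residue masses, and the closed range $-\delta_i \le k \le \delta_i$ are assumptions made for illustration.

```python
def path_masses(residues):
    """d[s][j]: set of distinct masses of paths x_s -> x_j.

    residues[j-1] is a tuple of allowed integer masses for the edge
    x_{j-1} -> x_j (the original residue plus any modified masses).
    """
    n = len(residues)
    d = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for s in range(n + 1):
        d[s][s] = {0}
        for j in range(s + 1, n + 1):
            d[s][j] = {a + b for a in d[s][j - 1] for b in residues[j - 1]}
    return d


def error_correction_alignment(peaks, deltas, residues):
    """Maximum size of an error correction alignment (brute force).

    peaks: sorted integer masses of y_0..y_m; deltas: integer tolerance per peak.
    """
    d = path_masses(residues)
    n = len(residues)
    T = {}  # (i, j, k) -> best size of an alignment ending with (x_j, y_i, k)
    best = 0
    for i, (m_i, d_i) in enumerate(zip(peaks, deltas)):
        for j in range(n + 1):
            for k in range(-d_i, d_i + 1):
                size = 1  # the alignment consisting of this triple only
                for s in range(j):
                    for mass in d[s][j]:
                        target = m_i + k - mass  # corrected mass of the previous peak
                        for ip in range(i):
                            kp = target - peaks[ip]
                            if abs(kp) <= deltas[ip]:
                                size = max(size, T[(ip, s, kp)] + 1)
                T[(i, j, k)] = size
                best = max(best, size)
    return best


protein = [(100,), (200,), (150,)]  # toy residue masses, no modifications
print(error_correction_alignment([0, 103, 297, 452], [5] * 4, protein))  # 4
print(error_correction_alignment([0, 104, 308, 462], [5] * 4, protein))  # 3
```

In the second spectrum every consecutive edge is off by only 4, so a local edge tolerance of 5 would accept all four peaks, but the accumulated error of 12 between the first and last peak cannot be removed by corrections of at most 5 per peak, and the best error correction alignment has size 3.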

New fast algorithms
When computing $T(i, j, k)$, for a peak $y_i$, the corrected position $m_i + k$ of $y_i$ is an integer in the range $[m_i - \delta_i, m_i + \delta_i]$. When the next peak $y_{i+1}$ in the spectrum is close to $y_i$, the two ranges $[m_i - \delta_i, m_i + \delta_i]$ and $[m_{i+1} - \delta_{i+1}, m_{i+1} + \delta_{i+1}]$ may overlap. In this case, $T(i, j, k)$ and $T(i+1, j, k')$, where $m_i + k = m_{i+1} + k'$, mean the same thing, i.e. a peak in the spectrum with mass $m_i + k = m_{i+1} + k'$ matches $x_j$ in $G$. It does not matter whether $y_i$ or $y_{i+1}$ is corrected to the position $m_i + k = m_{i+1} + k'$, since Theorem 1 in Zhan and Wang (2022) ensures that $T(i, j, k_i)$ is always equal to $T(i+1, j, k_{i+1})$ under the condition $m_{\mathrm{small}} > 4\delta_{\max}$, where $m_{\mathrm{small}}$ is the smallest mass of all the (modified or unmodified) residues. This condition is true in practice for different error tolerance settings (Zhan and Wang 2022).
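A quick sanity check of the condition $m_{\mathrm{small}} > 4\delta_{\max}$ for the constant tolerance 27 is sketched below. It assumes that glycine (monoisotopic residue mass 57.02146 Da) is the lightest residue in the graph, which need not hold if a modification produces a lighter edge; the helper name is hypothetical.

```python
import math

SCALE = 274.335215      # scaling factor quoted in the text
GLYCINE_DA = 57.02146   # glycine residue mass, assumed here to be the lightest edge


def condition_holds(m_small: int, delta_max: float) -> bool:
    """Check m_small > 4 * delta_max on the integer mass scale."""
    return m_small > 4 * delta_max


m_small = math.floor(GLYCINE_DA * SCALE)
print(m_small)                       # 15642
print(condition_holds(m_small, 27))  # True, since 4 * 27 = 108
```

Under this assumption the condition leaves a wide margin, which is consistent with the statement that it holds in practice.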
Thus, each integer in the overlapped range $[m_i - \delta_i, m_i + \delta_i] \cap [m_{i+1} - \delta_{i+1}, m_{i+1} + \delta_{i+1}]$ is used twice in the computation of the $T(i, j, k)$ values. See Fig. 1. Therefore, we can make the algorithm faster if every integer in the overlapped range is used only once during the whole process of computing the $T(i, j, k)$ values.
To reduce such redundant computing, we propose an algorithm that deletes overlaps when computing $T(i, j, k)$ and ensures that every available position for peaks is computed only once. In the new algorithm, for each peak, we define two variables to indicate the lower and upper bounds of the range in which the corrected peak position should lie. Let $\delta^-_i$ denote the largest negative error tolerance for peak $y_i$. Similarly, let $\delta^+_i$ denote the largest positive error tolerance for peak $y_i$. The initial values of $\delta^-_i$ and $\delta^+_i$ for peak $y_i$ are both $\delta_i$ before deleting overlaps. For any consecutive peaks $y_i$ and $y_j$, if the range […], then $y_i$ can be deleted from the spectrum at the beginning of this algorithm.
We then re-calculate the ranges for peaks $y_i$ and $y_{i+1}$ according to the principles illustrated in Supplementary Section S1B. Now, when we compute $T(i, j, k)$, the range of the element $k$ is reduced and the total number of $T(i, j, k)$ values we need to compute is also decreased. We still use Equation (1) to compute $T(i, j, k)$. Thus, the total running time is $O(nmq + L)$, where $q$ is the largest value of $\delta^-_i + \delta^+_i$ over all peaks. Besides, in the process of creating $C(i, j)$, we also change the formulation of $C(i, j)$ according to the updated error tolerance for every peak, as $C(i, j) = \{\mathrm{list}(i, j, k, m) \mid m \in \bigcup$ […]$\}$. This updated formulation can also decrease the size of $C(i, j)$ and further reduce the running time for creating $C(i, j)$.
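The saving from deleting overlaps can be illustrated with a small sketch. The actual re-calculation rule is given in Supplementary Section S1B and is not reproduced here; the code below instead assumes a simple midpoint split of each overlap between adjacent peaks, and it assumes that both range endpoints increase strictly with the peak index (for example, distinct sorted masses with equal tolerances). Function names are hypothetical.

```python
def positions_with_overlap(deltas):
    """Number of (peak, k) pairs when every peak keeps its full range."""
    return sum(2 * d + 1 for d in deltas)


def split_overlaps(masses, deltas):
    """Return disjoint integer ranges (lo_i, hi_i), one per peak.

    Assumed rule (not the one from the paper's supplement): when the ranges of
    adjacent peaks overlap, cut them at the midpoint of the overlap so that each
    integer position is assigned to exactly one peak.
    """
    lo = [m - d for m, d in zip(masses, deltas)]
    hi = [m + d for m, d in zip(masses, deltas)]
    for i in range(len(masses) - 1):
        if hi[i] >= lo[i + 1]:
            cut = (hi[i] + lo[i + 1]) // 2
            hi[i] = cut
            lo[i + 1] = cut + 1
    return list(zip(lo, hi))


masses, deltas = [100, 110, 120], [27, 27, 27]
ranges = split_overlaps(masses, deltas)
print(ranges)                                     # [(73, 105), (106, 115), (116, 147)]
print(positions_with_overlap(deltas))             # 165
print(sum(hi - lo + 1 for lo, hi in ranges))      # 75
```

In terms of the text, $\delta^-_i = m_i - \mathrm{lo}_i$ and $\delta^+_i = \mathrm{hi}_i - m_i$ after the split, so the number of $k$ values per peak drops from $2\delta_i + 1$ to $\delta^-_i + \delta^+_i + 1$.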

Datasets
Here, we use a dataset generated from Escherichia coli (EC) K-12 MG1655 cells. The protein database was downloaded from UniProt (Proteome ID: UP000000625) and included 4438 protein entries associated with this proteome with the F plasmid removed. For MS and MS/MS spectra, we use the raw features downloaded from Kou et al. (2017) and further process the data following the methods described in Kou et al. (2017). After processing, we obtained 4054 top-down MS/MS spectra, where 2027 are collision-induced dissociation (CID) MS/MS spectra and the other 2027 are electron-transfer dissociation (ETD) MS/MS spectra. For this EC dataset (database), three mutations were used as variable PTMs: lysine (K) to cysteine (C) (UNIMOD Accession number: 1132), threonine (T) to alanine (A) (UNIMOD Accession number: 659) and valine (V) to glycine (G) (UNIMOD Accession number: 672). A txt format file was generated based on these three pre-defined mutations as part of the input. Users can provide their own predefined mutations to replace the txt format file. The original EC database was modified based on the txt format file to form the final proteoform database for database search. For the EC protein database, the protein size (the total number of amino acids) ranges from 31 to 2001, while the spectrum size (the total number of peaks) ranges from 22 to 4604 among the 4054 top-down MS/MS spectra.

Speed and memory of (local) diagonal alignment version
The best-known method with error correction is TopMGRefine (Zhan and Wang 2022), which provides only a diagonal alignment version reporting an optimal local alignment, due to the high time/space complexity of the algorithm. The diagonal alignment version requires users to input the starting positions of the alignment and has a constraint that the two masses from the two starting positions in the alignment to any pair of aligned nodes are roughly the same. To illustrate the time/space complexity of the new algorithm, we start with the diagonal alignment version and follow the experimental process in Zhan and Wang (2022).
Similar to TopMGRefine, since diagonal alignment needs the starting positions as part of the input, we use TopMG (Kou et al. 2017) to align all the 4054 spectra with all the proteins for the whole EC dataset. Note that TopMG first uses a filtering method to obtain a few candidate proteins with high scores for each spectrum and then uses an alignment algorithm to further align the spectrum with each of the selected candidates and report the protein and the corresponding alignment with the best score. We refer to this version as TopMG with filtering.
Among the 4054 spectra, TopMG reported 2817 spectra that can be successfully aligned to some proteins in the database. Then we choose the protein with the largest alignment size generated by TopMG for each of the 2817 spectra to form protein and spectrum pairs for further refined alignments using TopMGRefine and TopMGFast.
The experiments were performed on a cluster with 200 GB of memory. For the 2817 protein and spectrum pairs, both TopMGRefine and TopMGFast obtain identical local alignment results. The total running time of the 2817 protein and spectrum pairs for TopMGRefine and TopMGFast is 4760 and 1715 min, respectively.
The largest input instance is the protein sp|P76347|YEEJ_ECOLI with 2001 residues and the spectrum (ID: 3260) with 4604 peaks. For this largest input instance, the starting position for the protein and the starting position for the spectrum given by TopMG are 813 and 0, respectively. In this case, the space required by TopMGRefine and TopMGFast is 102 GB and 80 GB, respectively. The alignment size obtained by both TopMGRefine and TopMGFast is 83. The running time for TopMGFast and TopMGRefine is 21.4 and 56.9 min, respectively. Again, the running time of TopMGFast is about 38% of that for TopMGRefine. Therefore, we can see that TopMGFast and TopMGRefine obtain the same results, while TopMGFast needs much less running time and memory space.

Preliminary performance of TopMGFast for global alignment
Since TopMGFast needs less memory space, it is possible to do global alignment even for the largest input instance, which TopMGRefine cannot do. For global alignment, we do not need to give the starting positions of the alignment for a pair of protein and spectrum. Therefore, we can try to compute the final accurate alignment result without using the roughly estimated starting positions of the alignment.
In the rest of this subsection, we compare the alignment algorithms of TopMG and TopMGFast using the 4054 top-down MS/MS spectra and the 4438 proteins in the whole EC database. Since both the protein and spectrum databases are huge compared to the slow speeds of the alignment algorithms of both TopMG and TopMGFast, we first run the filtering algorithm in TopMG to obtain a few candidates for each spectrum, and then, for the same set of candidate proteins, run the alignment algorithms of both TopMG and TopMGFast with the same spectrum to get the best alignment results.
Among the 4054 spectra in the whole EC dataset, TopMG reported 2817 spectra with at least one corresponding protein candidate, and the total number of reported corresponding protein candidates is 31 481. Each reported spectrum corresponds to a few candidate proteins with large alignment scores. We then align them with each other and report the final alignments by using TopMGFast directly. This time, no starting points are required. Besides, we used two kinds of error tolerances for peaks when computing alignment results for both TopMG and TopMGFast: one is constantly equal to 27 and the other is the peak-dependent error tolerance described in the Materials and methods section. The experiments were also performed on a cluster with 200 GB of memory.

Results for error tolerance 27
The results with error tolerance 27 are shown in Tables 1 and 2. As shown in Table 1, the running time of TopMGFast is 8702 min, which is about 6.6 times that of TopMG when the alignment error tolerance is set to 27. Among the 2817 selected spectra, TopMG and TopMGFast report the same proteins for 1396 spectra and different proteins for the other 1421 spectra. Furthermore, among the 1396 spectra reporting the same proteins from TopMG and TopMGFast, 1082 spectra report roughly the same alignment locations (resulting in alignments with overlaps), while 314 spectra report completely different alignment locations.
For both the 1396 (reporting identical proteins) and 1421 (reporting different proteins) spectra groups, the numbers of matches obtained from TopMGFast are slightly larger than those of TopMG (24.61 versus 23.19 and 17.75 versus 13.34, respectively). Consequently, the average numbers of one-residue and two-residue matches for reported alignments obtained from TopMGFast are also slightly larger than those obtained from TopMG. Since mass matches of one or two residues are more reliable than mass matches corresponding to a sum of a large number of residues, this might be evidence that the alignments reported by TopMGFast are more reliable than those of TopMG. It is interesting to observe that the number of one-residue matches is slightly less than 50% of the total matches.
The reason that TopMGFast can obtain more matches than TopMG is that TopMGFast requires that the difference between the locations of a peak and the corrected peak is bounded by 27, whereas TopMG requires that the mass between two consecutive matched peaks is at most 27 away from the corresponding mass in the protein. Thus, it is possible that the tolerated error of the mass between two consecutive matched peaks for TopMGFast is larger than that of TopMG. However, TopMGFast can ensure that there is no error for a mass between any pair of matched peaks after error correction.
To further compare the quality of the resulting alignments, we use two measures: the maximum mass error (MME) between two matched peaks in an alignment and the average mass error (AME) between all pairs of matched peaks. The MME is the largest error among all the $\binom{n}{2}$ pairs of $n$ matched peaks in an alignment. The AME is the average error (compared to the theoretical masses in the protein database) among all the $\binom{n}{2}$ pairs of $n$ matched peaks in an alignment.
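The two measures can be computed directly from the matched peaks. The sketch below assumes that the error of a pair is taken as an absolute value and that each matched peak comes with the theoretical mass of its matched PMG node measured from a common reference node; the function name and the toy numbers are illustrative only.

```python
from itertools import combinations


def mass_errors(spectrum_masses, theoretical_masses):
    """Return (MME, AME) over all C(n, 2) pairs of n >= 2 matched peaks.

    spectrum_masses[q]: observed mass of the q-th matched peak.
    theoretical_masses[q]: theoretical mass of the matched PMG node.
    The error of a pair (a, b) is |(s_b - s_a) - (t_b - t_a)|.
    """
    offsets = [s - t for s, t in zip(spectrum_masses, theoretical_masses)]
    errors = [abs(a - b) for a, b in combinations(offsets, 2)]
    return max(errors), sum(errors) / len(errors)


mme, ame = mass_errors([0, 104, 308, 462], [0, 100, 300, 450])
print(mme)  # 12
print(ame)  # about 6.67
```

Here every consecutive edge has an error of only 4, yet the MME is 12 because the errors accumulate between the first and last matched peaks.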
Here, we plot the differences between TopMG's and TopMGFast's MME and AME values (MME/AME diff) for those spectra where both methods report the same proteins in Fig. 2. MME/AME diff is computed by the formula MME/AME diff $=$ TopMG's MME/AME $-$ TopMGFast's MME/AME. As shown in Fig. 2, the 1396 spectra that report the same proteins have been sorted by MME/AME diff in nondecreasing order. For these 1396 spectra, there are 347 spectra for which $-35 \le \text{MME diff} < 0$ in the left figure. This means that TopMGFast's alignment results have worse MME values than TopMG's alignment results for these 347 spectra. The worst case is the spectrum with $\text{MME diff} = -35$. Similarly, the numbers of spectra for the conditions $\text{MME diff} = 0$, $0 < \text{MME diff} \le 35$ and […]
a For the reported proteins from TopMG and TopMGFast among 2817 spectra, the same reported protein is obtained for 1396 spectra and different reported proteins are obtained for 1421 spectra.
b Among the 1396 spectra reporting the same proteins from TopMG and TopMGFast, 1082 spectra report overlapping alignment results while 314 spectra report completely different alignment results.
c Each pair of adjacent peaks in the alignment corresponds to a sub-path in $G$. '$i$ residue matches' means that such a sub-path contains $i$ residues for $i = 1, 2, 3$.
[…] Accession number: 7), Phosphorylation (UNIMOD Accession number: 21), and Carbamidomethylation (UNIMOD Accession number: 4). Besides, each protein is set to have five modifications in the generated spectrum. Now, we have 100 simulated spectra with their known real corresponding proteins and a protein database with 700 proteins. The original database has more than 4000 proteins. The reason that we do not use the whole database is that both TopMG and our method are too slow to search the whole database directly.
We use both TopMG and our method to directly align the 100 simulated spectra against the 700 proteins in the database and report the proteins with the best alignment score as the search results. According to the alignment results, TopMG reported 94 real corresponding proteins for the 100 cases. Thus, the prediction accuracy for TopMG is 94%. Our method, TopMGFast, reported 100 real corresponding proteins for the 100 cases, and its prediction accuracy is 100%.
a For the reported proteins from TopMG and TopMGFast among 2817 spectra, the same reported protein is obtained for 1580 spectra and different reported proteins are obtained for 1237 spectra.
b Among the 1580 spectra reporting the same proteins from TopMG and TopMGFast, 1150 spectra report overlapping alignment results while 430 spectra report completely different alignment results.
Case study: TopMG failed on the case where the real corresponding protein is sp|P0A235|RFC_SALTY. For this case, TopMG reported protein sp|P26465|FLII_SALTY with 53 matched peaks in the alignment. We checked the errors between every two adjacent matched peaks in this alignment for both TopMG and TopMGFast. We found that there are many consecutive negative errors (the first few errors are: [4, −1, −20, −9, −16, −24, …]) in the alignment reported by TopMG. These consecutive negative errors make the maximum mass error for TopMG reach 121, which is much larger than the user-defined error tolerance 27. However, the error between any two matched peaks in the alignment reported by TopMGFast is strictly no more than $2 \times 27 = 54$. This may be the reason why TopMG reported the wrong protein, and it is evidence that error correction methods such as TopMGFast are more reliable than TopMG.
To test the accuracy for local alignment, we randomly select a peptide from each of the above 100 selected proteins. The lengths of the selected peptides range from 23 to 154. Each peptide contains three modifications. We then generate 100 simulated spectra for the 100 peptides and use both TopMG and TopMGFast to search the database containing 700 proteins. TopMG reported 98 real corresponding proteins with correct locations for the 100 cases. The prediction accuracy for TopMG is 98%. TopMGFast was correct for all 100 cases, so its prediction accuracy is 100%.

Figure 1. An illustrative example of error tolerance overlaps. The range between the positions $\mathrm{mass}_{i+1} - \delta_{i+1}$ and $\mathrm{mass}_i + \delta_i$ is the overlap, and positions in it, such as the point A, will be computed twice.

Figure 3. The differences between TopMG's and TopMGFast's MME and AME values for those spectra reporting the same proteins when using the peak-dependent error tolerance. Here, MME/AME diff $=$ TopMG's MME/AME $-$ TopMGFast's MME/AME. The 1580 spectra have been sorted by MME/AME diff in nondecreasing order.

Table 1. The comparisons between TopMG and TopMGFast using an error tolerance of 27 without diagonal optimization on the filtered EC dataset.

Table 3. The comparisons between TopMG and TopMGFast using the same peak-dependent error tolerances without diagonal optimization on the filtered EC dataset.

Table 4. The error comparisons between TopMG and TopMGFast for 2817 alignments when using the peak-dependent error tolerance with overlaps deleted.

Table 5. The comparisons of reported proteins between two kinds of error tolerances for TopMG and TopMGFast.