GRASP: Guided Reference-based Assembly of Short Peptides

Protein sequences predicted from metagenomic datasets are annotated by identifying their homologs via sequence comparisons with reference or curated proteins. However, a majority of metagenomic protein sequences are partial-length, arising as a result of identifying genes on sequencing reads or on assembled nucleotide contigs, which themselves are often very fragmented. The fragmented nature of metagenomic protein predictions adversely impacts homology detection and, therefore, the quality of the overall annotation of the dataset. Here we present a novel algorithm called GRASP that accurately identifies the homologs of a given reference protein sequence from a database consisting of partial-length metagenomic proteins. Our homology detection strategy is guided by the reference sequence, and involves the simultaneous search and assembly of overlapping database sequences. GRASP was compared to three commonly used protein sequence search programs (BLASTP, PSI-BLAST and FASTM). Our evaluations using several simulated and real datasets show that GRASP has a significantly higher sensitivity than these programs while maintaining a very high specificity. GRASP can be a very useful program for detecting and quantifying taxonomic and protein family abundances in metagenomic datasets. GRASP is implemented in GNU C++, and is freely available at http://sourceforge.net/projects/grasp-release.

The GRASP default parameters were set as the follows. For seeding, the GBMR10 alphabet and a seed length of 6 amino acids were used. These seeding parameters were chosen based on their overall performance in terms of selectivity and sensitivity [see Figure 2 in reference (4)]. The actual alignment score of the seed pair was computed under the original amino acid alphabet. And such an alignment score was further used to filter low-similarity seed pairs (default alignment score cutoff a * 6 . 0 , where a is the average of the matching scores for the same type of amino acids, i.e. scores in the diagonals, in the given scoring matrix). For the assembly module, 10 amino acids ( 10 = l ) were required as the minimum overlap to define the supporting reads r of a given path p . Also, at least 3 supporting reads were required to extend a path. The score drop-off threshold used to terminate the extension is 25 (bit-score). For the alignment module, the BLOSUM62 matrix and a gap open and gap extension penalty of -11 and -1, respectively, were used for all experiments. The same set of Karlin-Altschul statistics (5) parameters in the BLAST suite was used in GRASP to compute E-values. The band size for the alignment is 40.

B. GRASP pseudo-code
The GRASP algorithm contains two major components. The first component (GRASP_MAIN) identifies the initial seeds and issues the extension calls in both left and right directions. The second component (EXTEND) extends the path with the best existing maximal extension sequence in the priority queue, and reinserts the newly identified maximal extension sequences, along with their corresponding assemblies, back to the priority queue based on their coverages (the number of supporting reads). The pseudo-codes for both components are detailed in this supplement. For notations please refer to the main text. ← ∑ * user specified reduced alphabet build suffix array SA on R ; build suffix array rSA on ) (R rev build hash table H using all n where n is a k -long substring of some R r ∈ // the key of H is * ∑ n (a k -mer in alphabet * ∑ ), and the value is the set N of k -mers in alphabet ∑ , where for lb rb rb rb else: // an invalid range means the suffix is not in the database, search proceeds with shorter suffix

C. Evaluation of impact of different length cutoffs
BLASTP results were interpreted as three different sets to evaluate the impact of using different length constraints for filtering the local alignment-based search results. The set "BLASTP (full)" was used to represents those reads whose full-length sequences were aligned by BLASTP, "BLASTP (partial 50%)" for those who had >50% of their full-length sequences aligned by BLASTP, and "BLASTP (partial)" for all reads that had any part of their sequences aligned by BLASTP. The performances of BLASTP for these three interpretations are shown in Supplementary Figure S1. It is observed that "BLASTP (full)" shows slightly improved specificity but significantly lower sensitivity compared to the other two, and "BLASTP (partial)" and "BLASTP (partial 50%)" show no significant difference from each other in terms of both specificity and sensitivity.

D. Auxiliary read mapping step at the end of GRASP
The path extension module of GRASP does not consider mismatches or gaps in the sequence. Even though it is possible to take mismatches into account during extension by issuing multiple suffix array searches (4), the current GRASP setting is conceptually straightforward and computationally efficient, and is capable of reducing the redundancy of the output contigs. GRASP only outputs those reads that perfectly match some substring of the assembled contigs. Here we show that the performance of GRASP can be further improved by incorporating an auxiliary read mapping step at the end, using the simulated marine data set (DS3) as an example.
The auxiliary read mapping step aims at recruiting the reads that are not detected by the main GRASP algorithm due to mismatches. The read mapping step works as follows. For each query protein sequence, we mapped remaining reads in the database against the output contigs. We mapped a read if >60% of its full-length sequence can be aligned to one of the contigs with at most 3 mismatches (only substitution is considered). Note that we allowed up to 3 mismatches because DS3 was generated with 1% error rate, which is higher than the empirical error rate for the Illumina technology (>85% of the reads have error rate less than 0.1%, see http://res.illumina.com/documents/products/technotes/technote_q-scores.pdf). Supplementary Figure S2A shows the performances of GRASP+mapping and the other three programs in searching 16 glycolysis pathway-related genes in Dehalococcoides sp. CBDB1 against DS3. As compared to the performances shown in Figure 3C of the main text, incorporating the read mapping step further improves GRASP's sensitivity by >10% with a slight decrease in specificity (GRASP+mapping achieves 62.62% sensitivity and 86.57% specificity with an E-value cutoff of 10, and GRASP without mapping achieves 50.76% sensitivity and 87.15% specificity with the same E-value cutoff). Supplementary Figure  S2B shows the performances of GRASP+mapping and the other three programs in searching the 198 Amphroa2 (6) marker genes in Dehalococcoides sp. CBDB1. As compared to the results shown in Figure  3D in the main text, GRASP+mapping achieves ~20% higher sensitivity but ~1% lower specificity (GRASP+mapping achieves 67.82% sensitivity and 87.86% specificity with an E-value cutoff of 10, and GRASP without mapping achieves 48.02% sensitivity and 88.99% specificity with the same E-value cutoff). Both results show that incorporating the auxiliary read mapping step will significantly improve GRASP's sensitivity at the same specificity level. The running time for the auxiliary read mapping step is minimal compared to the main program, as we can assume near-perfect matches between the contigs and the reads.
Supplementary Figure S3 shows the performances of the four programs (GRASP, FASTM, PSI-BLAST, and BLASTP) in searching Prevotella (organism code: PIT, estimated abundance 12.03%), Fusobacterium (organism code: FUS, estimated abundance 6.39%), and Aggregatibacter (organism code: AAP, estimated abundance 1.42%) against the real HMP (Human Microbiome Project) saliva data set (i.e. DS4). For definitions of specificity and true homologous reads please refer to the main text. The results shown in Supplementary Figure S3 further confirm the observation that GRASP is capable of recruiting more true homologous reads than the other search programs with a high specificity.

F. GRASP run-time
GRASP runs slower than the NCBI BLAST suite and FASTM in most of the experiments. This is expected since GRASP also performs de novo assembly along with the alignment. Supplementary Figure  S4 shows the run-time for GRASP on different queries and targets (databases). The run-time variation in Supplementary Figure S4A (i.e. different queries against same database) is more significant than that in Supplementary Figure S4B (same query against different databases), because the actual amount of assemblies/alignments to be performed is query-specific (e.g. the number of seeds that can be identified from the query to initiate the assemblies/alignments). Both results show linear correlations between the run-time and query/database size. It implies a linear run-time growth with respect to the number of proteins being searched (in most real applications, the database size is fixed for a given metagenomic dataset). Running GRASP in multi-threaded mode is capable of improving the actual run-time (i.e. 2.12 fold speedup was observed with 4 threads, see Supplementary Figure S4). Potential overhead was observed due to the inter-thread communications that are required to avoid redundant extension of reads that have been consumed by other extensions.