COMER2: GPU-accelerated sensitive and specific homology searches

Abstract

Summary: Searching for homology in the vast amount of sequence data places a particular emphasis on speed. We present a completely rewritten version of the sensitive homology search method COMER, based on the alignment of protein sequence profiles, which is capable of searching big databases even on a lightweight laptop. By harnessing the power of CUDA-enabled graphics processing units, it is up to 20 times faster than HHsearch, a state-of-the-art method using vectorized instructions on modern CPUs.

Availability and implementation: COMER2 is cross-platform open-source software available at https://sourceforge.net/projects/comer2 and https://github.com/minmarg/comer2. It can be easily installed from source code or using stand-alone installers.

Contact: mindaugas.margelevicius@bti.vu.lt

Supplementary information: Supplementary data are available at Bioinformatics online.


S1 Methods
This section provides an overview of algorithms implemented in COMER v2.2 for accelerating profile-profile comparison and alignment on the GPU.

S1.1 COMER2 process
The schematic diagram of the COMER2 process using multiple GPUs is shown in Figure S1. To utilize the thousand-core architecture of the GPU, data (profiles) are processed in large chunks read continuously by the Data reader thread. The size of the data chunks depends on the GPU memory allocated for computation and the length of the longest query profile. The Job dispatcher thread on the CPU iteratively instructs the Commutator threads to distribute data chunks across one or more GPUs and initiate computation.

Figure S1. Schematic diagram of the COMER2 process.
A GPU performs complete alignment computation on an assigned chunk of data for each query profile in parallel. First, a batch score matrix between the query profiles and the multiple database profiles of the data chunk is constructed. The batch score matrix is then used to calculate dynamic programming (DP) matrices in batch. A challenging aspect is the parallelization of DP, which introduces dependencies between the cells of the matrix. Next, calculated alignment scores and the estimated statistical significance of the alignments (Margelevičius, 2019) determine which database profiles progress to the next phase. Optionally, DP matrices for selected profiles are recalculated using a maximum a posteriori (MAP) DP algorithm (Durbin et al., 1998, p. 94). Final alignments are obtained by backtracking through the (MAP) DP matrices. The alignments computed on the GPU are then transferred to an Alignment formatter thread on the CPU, which formats them for output and passes the data to the Alignment writer responsible for merging and writing the final alignments to disk once all database profiles have been processed.
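As an illustration of the chunk-dispatch scheme, the following is a minimal host-side sketch in CUDA C++ (hypothetical names; not COMER2's actual code): each GPU owns one stream and one working buffer, and chunks are assigned round-robin. Because operations issued to the same stream execute in order, the copy of the next chunk cannot overtake the computation on the previous one, and each device buffer can be safely reused.

    #include <cuda_runtime.h>
    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Hypothetical kernel standing in for the per-chunk alignment computation.
    __global__ void alignChunk(const char* chunk, size_t size) { /* ... */ }

    // Round-robin dispatch of profile data chunks across available GPUs.
    void dispatchChunks(const std::vector<std::pair<const char*, size_t>>& chunks,
                        int ngpus)
    {
        std::vector<cudaStream_t> streams(ngpus);
        std::vector<char*> dbuf(ngpus);
        size_t bufsz = 0;  // size of the largest chunk
        for (const auto& c : chunks) bufsz = std::max(bufsz, c.second);

        for (int g = 0; g < ngpus; g++) {
            cudaSetDevice(g);
            cudaStreamCreate(&streams[g]);
            cudaMalloc(&dbuf[g], bufsz);  // one working buffer per device
        }
        for (size_t i = 0; i < chunks.size(); i++) {
            int g = (int)(i % ngpus);  // Commutator role: select a device
            cudaSetDevice(g);
            cudaMemcpyAsync(dbuf[g], chunks[i].first, chunks[i].second,
                            cudaMemcpyHostToDevice, streams[g]);
            alignChunk<<<128, 256, 0, streams[g]>>>(dbuf[g], chunks[i].second);
        }
        for (int g = 0; g < ngpus; g++) {  // wait for completion and clean up
            cudaSetDevice(g);
            cudaStreamSynchronize(streams[g]);
            cudaFree(dbuf[g]);
            cudaStreamDestroy(streams[g]);
        }
    }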
The following subsections focus on the aspects that affect COMER2 computing performance the most: data representation, batch computation of the score matrix, and batch computation of DP matrices, the latter being common to both DP and MAP DP matrices.

S1.2 Data representation in memory
Organization of data in memory is highly important when targeting performance. The highest memory throughput is achieved when adjacent memory locations are accessed. Therefore, profiles are organized in memory as a collection of large contiguous arrays of attribute values, which represent amino acid distributions, transition probabilities, context and secondary structure data, and others (Figure S2). One array (cells along the horizontal axis) represents one attribute (e.g., the probability of a certain amino acid), and each array is physically independent and can map to a disjoint memory region. Such a data structure ensures coalesced and aligned memory access for each profile attribute and maximizes memory throughput.
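A minimal CUDA C++ sketch of such a structure-of-arrays layout is given below (attribute names and counts are hypothetical, not COMER2's actual data structures): because each attribute occupies its own contiguous array, the 32 threads of a warp reading 32 consecutive positions of one attribute touch adjacent memory addresses, which the hardware coalesces into a few wide transactions.

    // Hypothetical structure-of-arrays profile batch: one contiguous array
    // per attribute, indexed by global profile position.
    struct ProfileBatchSoA {
        int    nposits;       // total number of profile positions in the batch
        float* aaprobs[20];   // aaprobs[a][p]: probability of amino acid a at position p
        float* trnprobs[9];   // transition probabilities (M->M, M->I, M->D, ...)
        float* ssprobs[3];    // secondary structure state probabilities (H, E, C)
    };

    // Coalesced access pattern: consecutive threads read and write
    // consecutive positions of the same attribute array.
    __global__ void scaleAttribute(float* __restrict__ attr, float f, int nposits)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p < nposits)
            attr[p] *= f;  // adjacent threads access adjacent addresses
    }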

S1.3 Computation of scores
Scores between query profiles and database profiles are calculated by exploiting massively parallel computing. Thread blocks consisting of 1024 threads process a batch score matrix in parallel (Figure S3), where the threads of each block also run in parallel. A block dimension of 32x32 allows for highly efficient memory accesses, as the calculation of 1024 cells of the score matrix requires reading only 32 query and 32 database A-dimensional profile vectors (attribute values at a certain position).
The threads of a block first cache A × (32 + 32) profile data values, calculate the scores for the locations corresponding to the query and database profile positions being processed, and write the 1024 calculated scores to memory. To achieve maximum memory throughput, the threads coalesce reads and writes using shared memory (cache).
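The tiled scheme can be sketched as follows (a simplified CUDA kernel, not COMER2's actual implementation; the score is reduced to a dot product over A hypothetical attributes, with each attribute stored contiguously as in Section S1.2):

    #define TILE  32
    #define NATTR 16  // hypothetical number of attributes A per position

    __global__ void scoreTile(const float* __restrict__ qry,   // [NATTR][qlen]
                              const float* __restrict__ dbp,   // [NATTR][dlen]
                              float* __restrict__ scores,      // [qlen][dlen]
                              int qlen, int dlen)
    {
        __shared__ float qcache[NATTR][TILE];  // 32 cached query profile vectors
        __shared__ float dcache[NATTR][TILE];  // 32 cached database profile vectors

        int qpos = blockIdx.y * TILE + threadIdx.y;
        int dpos = blockIdx.x * TILE + threadIdx.x;

        // Cooperative, coalesced load of A x (32 + 32) values: each row of
        // threads loads one attribute for 32 consecutive positions.
        for (int a = threadIdx.y; a < NATTR; a += TILE) {
            int qp = blockIdx.y * TILE + threadIdx.x;
            int dp = blockIdx.x * TILE + threadIdx.x;
            qcache[a][threadIdx.x] = (qp < qlen) ? qry[a * qlen + qp] : 0.0f;
            dcache[a][threadIdx.x] = (dp < dlen) ? dbp[a * dlen + dp] : 0.0f;
        }
        __syncthreads();

        // Each of the 1024 threads computes one score of the tile;
        // writes along threadIdx.x are coalesced.
        if (qpos < qlen && dpos < dlen) {
            float s = 0.0f;
            for (int a = 0; a < NATTR; a++)
                s += qcache[a][threadIdx.y] * dcache[a][threadIdx.x];
            scores[qpos * dlen + dpos] = s;
        }
    }

    // Launch configuration (host side):
    //   dim3 block(TILE, TILE);
    //   dim3 grid((dlen + TILE - 1) / TILE, (qlen + TILE - 1) / TILE);
    //   scoreTile<<<grid, block>>>(qry, dbp, scores, qlen, dlen);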

Figure S3. Illustration of score matrix computation. Green squares represent 32x32 thread blocks running in parallel. Each thread block calculates 1024 scores in parallel for a small part of the batch score matrix. The smallest cells in the batch score matrix represent single matrix cells.

S1.4 Computation of DP matrices
The computation of DP matrices is much more complicated. Occupancy of running threads is limited by data dependencies: each DP matrix cell depends on its upper, left, and diagonal neighbors. Therefore, the calculation of a DP matrix cell requires the values of the whole submatrix for which that cell is the bottom-right corner (the cell itself excluded) to be known.

The solution for keeping thread occupancy high is to arrange the DP matrix in diagonal blocks and process these blocks iteratively in parallel (Figure S4). Similarly to the approach discussed in Section S1.3, two-level parallelism emerges. A block of 32 threads processes a diagonal block of 1024 matrix cells in parallel, where the processing also includes updating backtracking information. Concurrently, multiple thread blocks process multiple diagonal cell blocks. Due to data dependencies, however, the processing is arranged in iterations.
Dependencies between data imply that only the diagonal cell blocks along one anti-diagonal (highlighted in Figure S4) of a DP matrix, corresponding to one pair of a query and a database profile, can be processed at a time. Yet, multiple anti-diagonals from multiple DP matrices can be processed in parallel. Accordingly, thread blocks iteratively process the whole batch DP matrix, i.e., multiple DP matrices, at the same time. High occupancy is achieved by processing at each iteration anti-diagonals from all DP matrices being calculated.

Figure S4. Illustration of DP matrix computation. DP matrices corresponding to different pairs of query (along the y-axis) and database profiles (along the x-axis) are delineated by thick grey lines. Multiple DP matrices are processed at the same time. The processing is arranged in iterations. At each iteration, 32-thread blocks process one anti-diagonal of diagonal cell blocks from each DP matrix in parallel. Diagonal cell blocks processed in the same iteration are numbered with the same iteration number. For illustration, diagonal cell blocks of iteration 5 are highlighted in orange. The smallest cells in each diagonal cell block represent single cells of DP matrices.
The processing of anti-diagonals at each iteration requires only a small fraction of the calculated values of the diagonal blocks from the previous iteration. These values correspond to the bottom row and the right-most column of each diagonal cell block. Hence, the memory requirements to calculate the whole batch DP matrix are O(∑ᵢ lᵢ), where lᵢ is the length of database profile i and the sum runs over all profiles in the current chunk of the profile database.
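To make the dependency structure concrete, below is a deliberately simplified CUDA sketch of anti-diagonal (wavefront) DP for a single matrix, using a Smith-Waterman-like recurrence with a hypothetical linear gap penalty in place of COMER2's profile-profile recurrence. For clarity it stores the full matrix; COMER2 instead keeps only the bottom row and right-most column of each diagonal cell block between iterations, and it operates on 32x32 diagonal blocks of many matrices per iteration rather than on single cells.

    #define GAPCOST 1.0f  // hypothetical linear gap penalty

    // Process anti-diagonal d: all cells (i, j) with i + j == d + 1 depend
    // only on the two preceding anti-diagonals and can run in parallel.
    __global__ void dpAntiDiagonal(const float* __restrict__ scores,  // [m][n]
                                   float* __restrict__ H,  // [m+1][n+1], zero-initialized
                                   int m, int n, int d)
    {
        int i = 1 + blockIdx.x * blockDim.x + threadIdx.x;  // row (1-based)
        int j = d + 1 - i;                                  // column on diagonal d
        if (i <= m && 1 <= j && j <= n) {
            int w = n + 1;  // row stride of H
            float diag = H[(i - 1) * w + (j - 1)] + scores[(i - 1) * n + (j - 1)];
            float up   = H[(i - 1) * w + j] - GAPCOST;
            float left = H[i * w + (j - 1)] - GAPCOST;
            H[i * w + j] = fmaxf(fmaxf(diag, up), fmaxf(left, 0.0f));
        }
    }

    // Host-side wavefront loop: one launch per anti-diagonal.
    //   for (int d = 1; d <= m + n - 1; d++)
    //       dpAntiDiagonal<<<(m + 255) / 256, 256>>>(scores, H, m, n, d);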
The maximum memory throughput for this configuration of computation is achieved by coalesced reads and writes. The 32 threads of each thread block read and write the data of the corresponding diagonal cell block (including the scores described in Section S1.3) in parallel using shared memory. Instruction throughput is maximized by having the 32 threads of each thread block process all 32 rows of the corresponding diagonal cell block independently (but synchronously) and in parallel (Figure S4).

S1.5 Impact on sensitivity and alignment quality

COMER2 introduced an entirely different software architecture compared to the previous version. These differences include changes in data types, new data structures, parallel algorithms, and instruction set architecture. They alone are sufficient to affect sensitivity and final alignments (and hence their quality). For example, the 7 significant digits of the 32-bit floating-point number, the main data type used in COMER2, imply rapid error growth when performing millions of fast but less accurate mathematical operations and functions on these numbers.
The calculation of batch matrices was most affected by the introduction of Schraudolph's approximation to the exponential and logarithmic functions (Schraudolph, 1999). These approximations take only one instruction (multiply-add) but cause relative errors of up to 6%. To counterbalance the accumulation of errors, we eliminated the down-rounding of the similarity score between profile context vectors (Margelevičius, 2018), which had been used to produce slightly shorter local alignments. These changes led to an increase in the sensitivity score on the validation dataset (Section S2.1.1) from ROC_S4K = 0.8250 to 0.8360, where ROC_S4K is the normalized area under the ROC curve up to 4000 false positives (Section S2.4).
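A minimal single-precision sketch of a Schraudolph-style exponential is shown below (the constants are the commonly used ones, not necessarily COMER2's): the identity e^x = 2^(x/ln 2) is exploited by scaling x and writing the result directly into the exponent (and upper mantissa) bits of an IEEE-754 float, so the whole approximation costs a multiply(-add) plus a bit reinterpretation, at the price of a relative error of a few percent.

    // Schraudolph-style fast exponential (single precision), a sketch:
    // 12102203.0f ~= 2^23 / ln(2) scales x into exponent-bit units;
    // 1064866805 = 127 * 2^23 - 486411, where 486411 is a bias commonly
    // chosen to reduce the approximation error.
    __device__ __forceinline__ float fast_expf(float x)
    {
        int i = __float2int_rn(12102203.0f * x) + 1064866805;
        return __int_as_float(i);  // reinterpret bits; no numeric conversion
    }

    __global__ void expAll(const float* __restrict__ in,
                           float* __restrict__ out, int n)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < n)
            out[t] = fast_expf(in[t]);
    }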
The extent of local alignments produced by COMER is controlled by the posterior probability threshold option MINPP in the options file options.txt (please see the README file in the software package). To take into account the impact of the discussed changes on the resulting alignments, the default value of this option was adjusted based on ROC analysis of high-quality alignments on the validation dataset in the local and global evaluation modes (Section S2.5). A value of 0.28 was found to be the best compromise (Table S1) and was set as the default. Although it does not differ much from the default used in the previous version (0.30), the user may wish to experiment with different values of MINPP.

S2.1 Datasets
S2.1.1 SCOPe dataset

The SCOPe dataset was compiled from SCOPe database v2.03 (Fox et al., 2013) of protein domains filtered to 20% sequence identity. It included 4900 domains from 547 folds (Margelevičius, 2018). Of these, 1722 domains, one representing each superfamily but including every fourth domain of larger superfamilies, were used as queries. 1112 domains from the remaining folds were used as the validation dataset to adjust the default value of the user-configurable option MINPP (Section S1.5).
Two categories of multiple sequence alignments (MSAs) were used to construct profiles for the domain sequences. The first category of MSAs was obtained by running PSI-BLAST (Altschul et al., 1997) (v2.2.28+) for each sequence for six iterations, using a sequence inclusion threshold of 10⁻⁵ and soft masking of low-complexity regions, against the UniRef50 sequence database (Suzek et al., 2015). The second category of MSAs was obtained by running HHblits (Remmert et al., 2012) for three iterations using default settings against the UniProt20 database of profile HMMs. Only statistically significant matches were included in the MSAs (Margelevičius, 2018).
Secondary structure predictions were calculated from the MSAs using PSIPRED (Jones, 1999).

S2.1.2 Simulated dataset
The simulated dataset was created by concatenating two sets of profiles. First, one million random profiles were generated using the tools published and the procedure described previously (Margelevičius, 2019). Following the procedure, profiles of various (effective) numbers of observations (sequences) and lengths were generated by adding random noise (r = 0.03) to 1012 (seed) profiles representing unrelated UniRef50 sequences, randomly sampling fragments of length s = 9, and splicing the fragments together.
The initial effective number of observations (ENO) as a parameter for generating profiles was sampled from a normal distribution with a mean of 6 and a standard deviation of 1.5 and then floored to the nearest multiple of 2 with a minimum of 2 and a maximum of 12. The final ENO was not strictly controlled, and profiles were allowed to take on different values in the vicinity of sampled initial values.
The profile length was sampled from a gamma distribution with a shape of 2 and a scale of 125 and then rounded to the nearest integer. The minimum profile length was set to 20. This setting implies a mean profile length of approximately 250. Figure S5 shows the distribution of the number of generated profiles.
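The parameter sampling just described can be summarized by the following sketch (standard C++ host code; illustrative only, not the published tools):

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <random>

    int main()
    {
        std::mt19937 rng(42);
        // Initial ENO: normal with mean 6 and standard deviation 1.5,
        // floored to the nearest multiple of 2 and clamped to [2, 12].
        std::normal_distribution<double> enodist(6.0, 1.5);
        // Length: gamma with shape 2 and scale 125 (mean 2 * 125 = 250),
        // rounded to the nearest integer, with a minimum of 20.
        std::gamma_distribution<double> lendist(2.0, 125.0);

        for (int i = 0; i < 10; i++) {
            int eno = std::min(12, std::max(2, 2 * (int)std::floor(enodist(rng) / 2.0)));
            int len = (int)std::max(20L, std::lround(lendist(rng)));
            printf("profile %d: initial ENO = %d, length = %d\n", i, eno, len);
        }
        return 0;
    }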
The random MSAs from which the profiles were constructed are available at https://sourceforge.net/projects/comer2/files/comer2-pub-data-2.02.

The other part of the simulated dataset comprised 1284 profiles constructed for the sequences of randomly selected PDB (Berman et al., 2000) structures sharing less than 20% sequence identity at the domain level. The COMER profiles were constructed from MSAs obtained from the HHpred PDB70 database 190918 (Zimmermann et al., 2018). The MSAs were converted from a3m to FASTA format using the reformat.pl program from the HH-suite3 software package (Steinegger et al., 2019). The HHsearch/HHblits profiles were taken directly from the HHpred PDB70 database 190918.
The simulated dataset contained 1 001 284 profiles in total. The size of the COMER profile database created for this dataset was 53 GB; the corresponding size of the HHsearch/HHblits profile database was 29 GB. Hence, COMER2 was required to read nearly twice as much data during its searches. The 128 query profiles for searching the simulated dataset were also obtained from HHpred PDB70 database 190918 entries chosen at random.
A COMER search can be configured using the options file options.txt. The tests were performed with default values set for the options, including an E-value threshold of 10 (EVAL=10). The exceptions were NOHITS=7000 and NOALNS=7000, which give the maximum numbers of hits and alignments to display (never reached), and SSEMODEL=1, which specifies the use of statistical model 1 for the SCOPe dataset when producing local alignments (with the default MINPP value). The number of GPUs to be used and the GPU memory limit were specified using the command-line options --dev-N and --dev-mem, respectively. The expected proportion of significant alignments was specified by setting --dev-pass2memp=40 for the SCOPe dataset and --dev-pass2memp=1 for the simulated dataset.
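Collected together, the key settings mentioned above correspond to the following lines in options.txt (a sketch restricted to the option names quoted in this section; see the README file in the software package for the full file format):

    EVAL=10
    NOHITS=7000
    NOALNS=7000
    SSEMODEL=1
    MINPP=0.28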
HHsearch and HHblits searches were configured similarly:

    -mact m -e 10 -p 30 -E 10 -Z 7000 -z 1 -B 7000 -b 1 -v 0 -alt 1 -cpu c

where m = 0 when producing maximally extended alignments and m = 0.3 otherwise, which yielded better alignment quality than the default value of 0.35 without affecting sensitivity, and c is the number of CPU cores. HHblits was additionally configured to run one iteration: -n 1.
Default values for options were used for FFAS. COMPASS was configured with the options -e 10 -v 3000 specifying an E-value threshold of 10 and the (allowed) maximum number of hits in output.
Profiles for each of these methods were constructed using default settings. Secondary structure predictions were incorporated into COMER and HHsearch/HHblits profiles using the makepro.sh and addss.pl programs from the respective software packages. HHsearch/HHblits profiles from HHpred PDB70 database 190918 contained DSSP secondary structure assignment (Kabsch and Sander, 1983) and/or PSIPRED predictions and were used as-is.

S2.3 Running time
The running time of profile-profile alignment methods was measured by the Linux time command. The mean and standard deviation were estimated by repeating the measurement three times.

S2.4 Sensitivity
Sensitivity was measured by the number of identified true positives (TPs). A pair of aligned domains that belonged to the same SCOPe superfamily or shared statistically significant structural similarity (DALI (Holm et al., 2008) Z-score ≥ 2) was considered a correct match (TP). For the simulated dataset, the analysis was performed at the domain level.
Aligned pairs that did not meet the above criteria but belonged to the same SCOPe fold were considered to have an unknown relationship and were ignored. Other aligned pairs or alignments involving a random profile were false positives (FPs).
S2.5 Alignment quality

An alignment between two domains was considered to be of high quality if the most accurate structural model generated for one of the two domains, using the other domain as a template, was similar to the native structure. Structural models were generated using MODELLER v9.15 (Šali and Blundell, 1993). TM-score (Zhang and Skolnick, 2004) was used to evaluate the structural similarity between a model and the native structure. An alignment was considered a high-quality alignment (HQA) if it resulted in a statistically significant similarity, TM-score ≥ 0.4. A low-quality alignment (LQA) was an alignment for which TM-score < 0.2.
Alignments were evaluated in the local and global modes. The global mode evaluates an alignment with respect to the whole protein domain, whereas the local mode evaluates it only along the alignment extent.

S3.1 Additional results for the SCOPe dataset
This section presents additional results for the SCOPe dataset. Figure S6 supplements the first part of Figure 1 in the main text. Figure S6D and H show alignment quality evaluated in the global mode when COMER, HHsearch, and HHblits were configured to produce local alignments. Alignment quality evaluated in the local mode is shown in Figure S6E and I. Here, COMER, HHsearch, and HHblits were configured to extend alignments maximally. The results of COMER v2.2 in Figure S6 show consistent improvement in alignment quality.
The quality of alignments was also measured by counting the number of HQAs that resulted in structural models with a TM-score ≥ 0.5. A TM-score ≥ 0.5 usually indicates a close homology between proteins identified by homology search and is an indicator of a high-quality structural model sharing the same fold with a template structure. This threshold is used for evaluating full-length protein target models. Figure S7 shows that using this threshold, COMER v2.2 produces a greater number of HQAs than the other tested methods irrespective of the evaluation mode and the category of alignments (local or maximally extended).
The running time of COMER v2.2, HHsearch v3.2, and HHblits v3.2 is shown in Figure S8. Similarly to the results observed for the simulated dataset, COMER v2.2 achieves near-perfect scaling even on the relatively small SCOPe dataset, and its running time depends only weakly on the amount of GPU memory used.
The running-time results for HHsearch v3.2 show that distributing work across a larger number of (physical) cores is not very efficient for a relatively small dataset.

S3.2 Running time for the simulated dataset
The running time of COMER v2.2, HHsearch v3.2, and HHblits v3.2 for the simulated dataset is shown in Figure S9. Figure S9 supplements the second part of Figure 1 of the main text and shows running times for different settings.

S3.3 Benchmark tests on multi-domain proteins
Practical applications typically involve searching with protein chain sequences. The simulated dataset, therefore, included profiles constructed for full-chain sequences from the PDB. This section reports the results of additional tests performed on multi-domain proteins.
First, profiles and their database were constructed for all 190 multi-domain protein chain sequences of class e from SCOPe v2.03 filtered to 20% sequence identity. Then, all of the profiles were searched against their database. Figure S10 shows the sensitivity and alignment quality results for the multi-domain proteins. The results of the alignment quality measured by the number of HQAs with a TM-score ≥ 0.5 are shown in Figure S11.
The results show that, overall, the COMER method retains the advantage, but the quality of COMER v2.2 alignments evaluated in the local mode (Figure S10C,G and Figure S11A,E) is inferior to that of COMER v1.5. On the other hand, the opposite is observed in the global evaluation mode (Figure S10D,H and Figure S11B,F). Longer alignments produced by COMER v2.2 for TPs (Figure S10A,B) lead to a greater number of HQAs in the global mode at the expense of a lower number in the local mode. Closer inspection also revealed that the differences between COMER v2.2 and v1.5 in the local mode manifest themselves in statistically insignificant alignments.
The running time analysis (Figure S12) shows the efficiency of COMER v2.2: a speedup greater than a factor of 3000 over COMER v1.5 is observed.

S3.4 Running time of profile and database construction
Fast computation is particularly important for homology search applications. Importantly, data preparation for homology search based on profile-profile comparison and alignment includes the construction of profiles and the creation of a profile database. Figure S13 shows the dependence of the running time of profile construction on the size of the MSAs of native sequences obtained using HHblits (Section S2.1.1). It also shows the total running time required for profile construction and the running time of database construction for the SCOPe dataset. The running times of database construction for the simulated dataset are shown in Figure S14. These results show that database construction, including the construction of profiles, can be a computationally demanding procedure. Approaches for accelerating profile construction would therefore greatly facilitate the data preparation phase. Parallel algorithms for processing multiple MSAs on the GPU will be considered in future work.

Figure S14. Running time of database construction for the simulated dataset on the Server using 20 CPU cores. The running time for COMER v2.2 is that of makedb + db2bin. Measurements were performed once.