Scrooge: a fast and memory-frugal genomic sequence aligner for CPUs, GPUs, and ASICs

Abstract

Motivation: Pairwise sequence alignment is a very time-consuming step in common bioinformatics pipelines. Speeding up this step requires heuristics, efficient implementations, and/or hardware acceleration. A promising candidate for all of the above is the recently proposed GenASM algorithm. We identify and address three inefficiencies in the GenASM algorithm: it has a high amount of data movement, a large memory footprint, and does some unnecessary work.

Results: We propose Scrooge, a fast and memory-frugal genomic sequence aligner. Scrooge includes three novel algorithmic improvements which reduce the data movement, memory footprint, and the number of operations in the GenASM algorithm. We provide efficient open-source implementations of the Scrooge algorithm for CPUs and GPUs, which demonstrate the significant benefits of our algorithmic improvements. For long reads, the CPU version of Scrooge achieves a 20.1×, 1.7×, and 2.1× speedup over KSW2, Edlib, and a CPU implementation of GenASM, respectively. The GPU version of Scrooge achieves a 4.0×, 80.4×, 6.8×, 12.6×, and 5.9× speedup over the CPU version of Scrooge, KSW2, Edlib, Darwin-GPU, and a GPU implementation of GenASM, respectively. We estimate an ASIC implementation of Scrooge to use 3.6× less chip area and 2.1× less power than a GenASM ASIC while maintaining the same throughput. Further, we systematically analyze the throughput and accuracy behavior of GenASM and Scrooge under various configurations. As the best configuration of Scrooge depends on the computing platform, we make several observations that can help guide future implementations of Scrooge.

Availability and implementation: https://github.com/CMU-SAFARI/Scrooge.


Dataset Details
Supplementary Figures 2-7 show the edit and sequence length distributions for each dataset we used. The histograms were generated as follows:
1. Extract the sequence region pairs specified in the .maf or .paf file.
2. Take the length of the longer sequence in each pair as the pair's length.
3. Align each pair with Edlib to obtain the edit distance, and divide it by the pair's length.
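The per-pair statistics above can be sketched in a few lines. The snippet below uses a plain Levenshtein DP as a stand-in for Edlib, omits the file parsing of step 1, and uses illustrative function names:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein DP; a simple stand-in for Edlib's optimized implementation
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution or match
        prev = cur
    return prev[-1]

def pair_stats(ref_region: str, read_region: str):
    # Step 2: the pair's length is the length of the longer sequence
    length = max(len(ref_region), len(read_region))
    # Step 3: the edit rate is the edit distance divided by the pair's length
    return length, edit_distance(ref_region, read_region) / length
```

Feeding every extracted pair through `pair_stats` yields the values binned into the length and edit rate histograms.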

Throughput Distributions
To understand the quality of our throughput results, we plot the throughput distributions of each evaluated tool as box plots. Supplementary Figure 8 shows the results. We observe that all throughput values are highly consistent.
Suppl. Fig. 9: Throughput distributions of each evaluated tool. Each boxplot consists of the median as a red line, and the first and third quartiles as the lower and upper end of the box, respectively. The whiskers are 1.5 inter-quartile ranges down and up from the first and third quartile, respectively.

Throughput Sensitivity to Window Size (W) on GPU
We explore the sensitivity of Scrooge's throughput to the window size parameter W (Section 2.2.3) on GPUs. We vary W and set O=W//2+1. Note that larger values of W improve accuracy (Section 2.2.3). From the GPU results in Supplementary Figure 8, we make five observations. First, as on the CPU, performance generally decreases as W increases. Second, as on the CPU, there are sudden performance dropoffs whenever the window (and thus bitvector) size surpasses a multiple of the machine word size (32 bits on the tested GPU (NVIDIA, 2023)). Third, some configurations can run only with small W when the DP table R is placed in shared memory: as W increases, the memory footprint of R grows cubically, and these configurations run out of shared memory. With global memory, the DP tables can occupy as much of the GPU's off-chip memory as needed, so Scrooge can run with even very large W, although performance eventually becomes impractically low. Fourth, for very small W, shared memory can offer significantly better performance than global memory. This is because a smaller W reduces the memory footprint per DP table, so more instances fit in shared memory, easily achieving high occupancy. In contrast, the global memory version remains bound by memory bandwidth, because the operational intensity is unchanged. Fifth, for small window sizes and shared memory, it is beneficial to run without the memory improvements SENE and DENT. This corroborates the observation made for the CPU implementation: if memory capacity and bandwidth are not limiting factors, the computational overheads of SENE and DENT overshadow their benefits. Conversely, when W is large, the memory improvements become increasingly important.
The three key takeaways from these experiments are that (1) the ideal combination of improvements depends on the window size (W), (2) the memory improvements SENE and DENT become increasingly beneficial as the window size increases, and (3) for optimal throughput, the window size W should be exactly the machine word size or a small multiple of it.
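The word-size takeaway follows directly from how bitvectors are packed into machine words: a single bit past a multiple of the word size costs a whole extra word per bitvector, hence the dropoffs. A minimal sketch, assuming 32-bit machine words as on the tested GPU (the function name is illustrative):

```python
def words_per_bitvector(w: int, word_bits: int = 32) -> int:
    # Machine words needed to hold one w-bit bitvector (ceiling division)
    return -(-w // word_bits)

# W = 32 and W = 64 use their words fully; W = 33 and W = 65
# each pay for an extra, almost-empty word per bitvector.
for w in (32, 33, 64, 65):
    print(w, words_per_bitvector(w))
```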

Throughput Sensitivity to Window Overlap (O) on GPU
We explore the sensitivity of Scrooge's throughput to the window overlap parameter O (Section 2.2.3) on GPUs. We vary O and set W=64. Note that larger values of O improve accuracy (Section 2.2.3). From the GPU results in Supplementary Figure 9, we make two observations. First, when DENT is disabled, performance decreases strictly and smoothly as O increases. This is because the algorithm makes W-O characters of progress per window, i.e., progress per window decreases linearly with increasing O. Second, with DENT enabled, performance can sometimes increase as O increases, up to some limit. This is because DENT reduces the memory footprint of the DP table more as O increases. We observe a large and sudden increase in performance at O=33: with DENT, each bitvector stored for traceback has size W-O+1 bits, thus for O≥33 each stored bitvector fits into a single 32-bit machine word (NVIDIA, 2020). The key takeaway from this experiment is that a larger window overlap (O) can improve performance up to some limit. This is convenient because larger values of O also improve accuracy. In particular, we find that for W=64 the optimal performance is attained at O=33, increasing both throughput and accuracy over the default operating point of W=64 and O=24 chosen by Senol Cali et al. (2020) for their accelerator.
Suppl. Fig. 10: Performance of our GPU implementation with W=64 and varying O.
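The O≥33 threshold can be sketched as a small helper; `dent_words` is an illustrative name, assuming DENT stores W-O+1 bits per traceback bitvector and 32-bit machine words:

```python
def dent_words(w: int, o: int, word_bits: int = 32) -> int:
    # Machine words needed per traceback bitvector under DENT,
    # which stores only w - o + 1 bits of each bitvector
    stored_bits = w - o + 1
    return -(-stored_bits // word_bits)

# For W = 64: the GenASM default O = 24 stores 41 bits (2 words),
# while O = 33 stores exactly 32 bits (1 word).
for o in (24, 32, 33):
    print(o, dent_words(64, o))
```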

Tool Capabilities
Supplementary

Tool Function Calls and Parameters
Supplementary

Roofline Model Calculations
We analyze a single GenASM window of size W=64 (see Section 2.2.3). Since GenASM simply repeats the calculation of a window in a loop, this analysis is representative of the entire program.

GenASM Data Movement
Per window, GenASM generates W=64 intermediate bitvectors to initialize the topmost row of R (Line 11 of Algorithm 1). It also generates 4×64×64 intermediate bitvectors for the inner entries of R (Lines 13-17 of Algorithm 1), out of which 3×64×64 are stored for traceback (Section 2.2.2). Each bitvector has size W=64 bits. We assume that the processing elements communicate neighboring entry values without memory accesses. This is a realistic assumption; for example, both the hardware accelerator of Senol Cali et al. (2020) and our GPU implementation work this way. Thus, each of these bitvectors is written to memory exactly once per window (during the construction of R) and not read again. We ignore the memory accesses for reading the input sequences and for traceback, as their consumed bandwidth is negligible relative to DP table construction. This yields data_movement = (64 + 3×64×64) × 64b = 98,816B = 96.5KiB.
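The data movement arithmetic above can be checked directly (the variable names are ours, not from the GenASM code):

```python
# Data movement for one GenASM window of size W = 64
W = 64
bitvectors = W + 3 * W * W        # topmost-row bitvectors + stored inner bitvectors
bytes_per_bitvector = W // 8      # each bitvector is W = 64 bits = 8 bytes
data_movement = bitvectors * bytes_per_bitvector
print(data_movement)              # 98816 bytes
print(data_movement / 1024)       # 96.5 KiB
```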

GenASM Operational Intensity
The operational intensity is defined as the ratio of work done to data movement (Ofenbeck et al., 2014). We define work as the number of arithmetic and logic 64-bit integer operations (op), and data_movement as the number of bytes moved between registers and the memory hierarchy, including scratchpad memory. In each window, GenASM first computes 65 entries to initialize the rightmost column, using 1 op each (Line 5). Second, it computes 64 entries to initialize the topmost row, using 2 op each (Line 11). Third, it computes 64×64 inner entries using 7 op each (Lines 13-17 of Algorithm 1). Thus, work = 65op + 64×2op + 64×64×7op = 28,865op. The GenASM algorithm's operational intensity is then I = work / data_movement ≈ 0.29 op/B.
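The operation counts and the resulting intensity can be verified with the same constants (data_movement is recomputed as in the data movement analysis):

```python
# Work and operational intensity for one GenASM window of size W = 64
W = 64
work = (W + 1) * 1 + W * 2 + W * W * 7       # rightmost column + topmost row + inner entries
data_movement = (W + 3 * W * W) * (W // 8)   # bytes, as in the data movement analysis
intensity = work / data_movement
print(work)       # 28865 op
print(intensity)  # ~0.29 op/B
```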

Expected Edit Distance for Random String Pairs
Lemma S1. The expected edit distance between an uncorrelated random string pair of length W over an alphabet of size |Σ| is at most W*(|Σ|-1)/|Σ|.
Proof. We prove Lemma S1 in two steps. First, we calculate the expected Hamming distance between an uncorrelated random string pair (Lemma S2). Second, we prove that the edit distance of a string pair of equal length is at most its Hamming distance (Lemma S3).
From Lemma S1, it follows in particular that for a 4-character DNA alphabet, the expected edit distance between an uncorrelated random string pair of length W is at most W*(4-1)/4 = 3*W/4.
Lemma S2. The expected Hamming distance between an uncorrelated random string pair of length W over an alphabet of size |Σ| is W*(|Σ|-1)/|Σ|.
Proof. The Hamming distance between two strings is defined as the number of locations where they differ (Hamming, 1950). For two uncorrelated random strings over an alphabet of size |Σ|, the probability that they differ at a given location is P_differ = (|Σ|-1)/|Σ|, and the expected number of differences at a single location is E[differ] = P_differ = (|Σ|-1)/|Σ|. Over the W locations in two strings of length W, the expected number of differences is then E[differences] = W*E[differ] = W*(|Σ|-1)/|Σ| = E[Hamming].
Lemma S3. The Hamming distance of a pair of strings of equal length is an upper bound on its edit distance.
Proof. The edit distance of a string pair is defined as the smallest number of single-character insertions, deletions, and substitutions needed to convert one string into the other (Levenshtein, 1966). Suppose the Hamming distance of an equal-length string pair were strictly smaller than its edit distance. Then the edit distance could not have been the smallest number of such operations, as the series of substitutions underlying the Hamming distance would convert one string into the other at an even smaller cost. Thus, the Hamming distance of an equal-length string pair cannot be strictly smaller than its edit distance.
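Lemmas S2 and S3 can also be checked empirically. The sketch below samples uncorrelated random DNA string pairs and verifies that the edit distance never exceeds the Hamming distance and that the mean Hamming distance is close to W*(|Σ|-1)/|Σ| (a Monte Carlo sanity check, not part of the proof):

```python
import random

def hamming(a: str, b: str) -> int:
    # Number of positions where two equal-length strings differ
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein DP
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

random.seed(0)
W, sigma = 64, "ACGT"
pairs = [("".join(random.choices(sigma, k=W)),
          "".join(random.choices(sigma, k=W))) for _ in range(200)]

# Lemma S3: the edit distance never exceeds the Hamming distance
assert all(edit_distance(a, b) <= hamming(a, b) for a, b in pairs)

# Lemma S2: E[Hamming] = W*(|Σ|-1)/|Σ| = 64*3/4 = 48 for DNA
mean_hamming = sum(hamming(a, b) for a, b in pairs) / len(pairs)
print(mean_hamming)  # close to 48
```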

Comprehensive CPU Scaling and Sensitivity to W Results
We explore the scaling of Scrooge's throughput with the number of CPU threads for W=64 and O=33, and the sensitivity of throughput to the window size (W). Supplementary Figure 12 is the extended version of Figure 9, showing all possible combinations of improvements; in particular, it shows combinations both including and excluding DENT. From Supplementary Figure 12, we observe that DENT yields small benefits at best (e.g., SENE+DENT vs. SENE), and can even lead to a slowdown (e.g., GenASM vs. DENT, SENE+ET vs. Full). In particular, for W=64 and O=33, the fastest configuration, SENE+ET, does not use DENT. In contrast, enabling SENE and ET yields consistent throughput increases. Thus, we omit DENT from Figure 9 for readability.
Suppl. Fig. 12: (a) Scaling and (b) sensitivity to window size of our CPU implementation.

Example of a "Difficult" Sequence Pair
We show in Section 3.6 that larger windowing parameters W and O generally improve alignment accuracy and that most sequence pairs are aligned optimally or near optimally. By manually inspecting a few worst-case (i.e., poorly aligned) sequence pairs, we observe that their apparent "difficulty" comes from small regions that (1) are relatively noisy and/or (2) contain repetitive subsequences. We observe that if the windowing parameter W is increased to be larger than the noisy and/or repetitive region, these difficult sequence pairs are aligned correctly.
To further illustrate this observation, we give an example of an apparently "difficult" sequence pair from the long read groundtruth dataset in Supplementary Figure 13. Supplementary Figure 13 shows the alignment path produced by Edlib (i.e., a globally optimal edit distance alignment) and alignments produced by Scrooge for the windowing parameters (W=32, O=17) and (W=16, O=9). We observe that with (W=32, O=17), Scrooge aligns the sequence pair near optimally (the light green alignment by Scrooge almost completely covers the red alignment by Edlib). In contrast, Scrooge with (W=16, O=9) fails to find part of the optimal alignment early in the sequence pair (in the top left of Supplementary Figure 13) and never recovers for the remainder of the sequence pair.
Supplementary Figure 14 shows a zoomed-in view of where (W=16, O=9) first diverges from the optimal alignment. We make two key observations about this region of the sequence pair. First, we observe that it is particularly noisy, i.e., the optimal alignment contains 5 edits between reference characters 26 to 40 (an error rate of 33%). Second, we observe that it contains the highly repetitive strings "AGAGA" and "ACACACA" between reference characters 25 and 35. Scrooge with (W=16, O=9) fails to find the globally optimal alignment of these repetitive strings in favor of a more locally optimal alignment (i.e., it incurs only 2 edits until read character 31 instead of 3, which would be globally optimal). Scrooge with (W=32, O=17) also aligns "AGAGA" incorrectly between reference characters 25 and 29, but has a sufficiently global view to recover to the optimal alignment afterward. We conclude that the larger W parameter gives Scrooge with (W=32, O=17) a more global view of the noisy and repetitive region and hence lets it produce a significantly better alignment.