WFA-GPU: gap-affine pairwise read-alignment using GPUs

Abstract

Motivation: Advances in genomics and sequencing technologies demand faster and more scalable analysis methods that can process longer sequences with higher accuracy. However, classical pairwise alignment methods, based on dynamic programming (DP), impose impractical computational requirements when aligning long and noisy sequences like those produced by PacBio and Nanopore technologies. The recently proposed wavefront alignment (WFA) algorithm paves the way for more efficient alignment tools, improving time and memory complexity over previous methods. However, high-performance computing (HPC) platforms require efficient parallel algorithms and tools to exploit the computing resources available on modern accelerator-based architectures.

Results: This paper presents WFA-GPU, a GPU (graphics processing unit)-accelerated tool to compute exact gap-affine alignments based on the WFA algorithm. We present the algorithmic adaptations and performance optimizations that exploit the massively parallel capabilities of modern GPU devices to accelerate alignment computations. In particular, we propose a CPU–GPU co-design capable of performing inter-sequence and intra-sequence parallel alignment, combining a succinct WFA data representation with an efficient GPU implementation. As a result, we demonstrate that our implementation outperforms the original multi-threaded WFA implementation by up to 4.3×, and by up to 18.2× when using heuristic methods on long and noisy sequences. Compared to other state-of-the-art tools and libraries, WFA-GPU is up to 29× faster than other GPU implementations and up to four orders of magnitude faster than other CPU implementations. Furthermore, WFA-GPU is the only GPU solution capable of correctly aligning long reads using a commodity GPU.

Availability and implementation: WFA-GPU code and documentation are publicly available at https://github.com/quim0/WFA-GPU.

Table S3: Average power usage (in Watts) to compute alignments using the CPU and GPU aligners. CPU consumption is obtained using Linux perf, and GPU consumption is obtained using nvidia-smi.

S3 Power consumption analysis
In this section, we analyze the power-consumption improvements obtained by using WFA-GPU instead of WFA (CPU). It is worth noting that an entirely equitable comparison is challenging, as our analysis is tailored to our specific CPU-GPU configuration. We have tried to make the comparison as fair as possible, using a balanced CPU-GPU combination; nevertheless, we acknowledge that comparing diverse hardware devices introduces complexities into the evaluation process.
The thermal design power (TDP) of the NVIDIA GeForce RTX 3080 is 320 W, while the TDP of the Intel Xeon W-2155 is 140 W. However, many hardware components, such as CPU memory and the motherboard, are not accounted for in the CPU TDP. Moreover, hardware devices do not always operate at their maximum TDP. As a reference, Table S3 presents the average power drawn by the CPU and the GPU when executing WFA (CPU) and WFA-GPU, respectively.
Table S3 shows that the GPU has a higher power draw during the alignment process. However, WFA-GPU significantly reduces execution time, which makes WFA-GPU executions more efficient in terms of total energy consumption (measured in Watt-hours, Wh) than CPU executions. Table S4 provides a comparison of energy consumption (in Wh) between CPU and GPU executions. Remarkably, our results demonstrate that WFA-GPU experiments consume less overall energy, even though GPU devices draw more power than typical CPUs.
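The energy comparison above follows from a simple relation: energy (Wh) is average power (W) times runtime (h). The following sketch illustrates the arithmetic; the power and runtime figures are illustrative placeholders, not the measurements reported in Tables S3 and S4.

```python
def energy_wh(avg_power_w: float, runtime_s: float) -> float:
    """Energy consumed in Watt-hours, from average power and runtime."""
    return avg_power_w * runtime_s / 3600.0

# Hypothetical example: a GPU drawing more power but finishing much
# faster still consumes less total energy than the CPU run.
cpu_wh = energy_wh(avg_power_w=120.0, runtime_s=3600.0)  # 120.0 Wh
gpu_wh = energy_wh(avg_power_w=300.0, runtime_s=600.0)   #  50.0 Wh
assert gpu_wh < cpu_wh
```

This is why a device with a higher instantaneous power draw can still be the more energy-efficient choice: the runtime reduction dominates the power increase.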

S4 Performance scalability
Our method is not constrained by the length of the input sequences, but rather by the sequence error rate (i.e., the maximum alignment score). Based on the maximum alignment score, WFA-GPU allocates the necessary memory, maximizing the total number of GPU workers spawned on the device. Ultimately, this allocation depends on the computational resources available on the device (i.e., memory and streaming multiprocessors, SMs). Note that the memory taken by the input sequences is negligible compared to the memory required by the internal WFA structures. Figure S1 shows the maximum number of GPU workers that can be accommodated on various GPU models as a function of the maximum alignment score allowed: the number of parallel GPU workers decreases as the maximum alignment score increases, and for large alignment scores only one GPU worker can be allocated per SM. The labeled points at the right of Figure S1 show the maximum alignment score supported by each GPU model. For instance, an RTX 4080 GPU, equipped with 16 GiB of memory, could align a pair of sequences with a maximum alignment score of 10,000.
Note that the RTX 4090, despite having more multiprocessors, may not be able to handle alignments with scores as high as the RTX 3080 can. Since both have the same amount of memory available but the former can execute more alignments in parallel, the RTX 4090 requires a larger memory footprint per batch, leaving less memory for each individual alignment.
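The trade-off above can be sketched as a capacity estimate with two limits: device memory and SM occupancy. This is our own illustration, not WFA-GPU's actual allocator; the per-worker memory model is an assumption (three gap-affine wavefront components, M, I, and D, each holding up to 2k+1 offsets for every score k from 0 to s, at `cell_bytes` per offset).

```python
def workers_that_fit(max_score: int,
                     device_mem_bytes: int,
                     num_sms: int,
                     workers_per_sm: int = 32,
                     cell_bytes: int = 4) -> int:
    """Estimate how many GPU workers fit, limited by memory and SMs.

    Assumed memory model (not WFA-GPU's exact layout): each worker
    stores 3 wavefront components, each with sum_{k=0..s}(2k+1)
    = (s+1)^2 offsets of `cell_bytes` bytes.
    """
    s = max_score
    cells_per_component = (s + 1) ** 2
    per_worker_bytes = 3 * cells_per_component * cell_bytes
    by_memory = device_mem_bytes // per_worker_bytes
    by_compute = num_sms * workers_per_sm
    return min(by_memory, by_compute)
```

Under this model, small maximum scores leave the SM limit as the binding constraint (many cheap workers), while large maximum scores make memory the bottleneck, matching the trend in Figure S1.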

S5 Approximated WFA-GPU parameter exploration
The heuristic parameters affect execution time and recall (i.e., the percentage of exact alignments). By changing the width of the band (β) and the number of steps between band re-centerings (λ), we can find the balance point between time and recall.
It is essential to emphasize that using a narrower band (β) can significantly influence memory usage. Since employing more threads than the band size is impractical, opting for a smaller β leads to GPU workers with fewer threads. Consequently, more workers can be accommodated within each streaming multiprocessor (SM). This increased parallelism, however, comes at the cost of using more GPU memory, as each worker requires its own dedicated memory space. It may be the case that using small bands for alignments with high error rates generates a number of GPU workers that exceeds the available memory capacity.
We present the Nanopore dataset as a case study, as it is the dataset that benefits the most from the adaptive strategy due to its long sequences and high error rates. Figure S2 shows how time and recall are affected by each (β, λ) combination. There is nearly no performance benefit from increasing λ beyond 100. When using a large band width (β), λ = 100 is sufficient; when using a smaller β, setting λ around 50 is recommended. In any case, the accuracy drop is usually small (< 3%). Therefore, unless optimal accuracy is paramount, the default (β, λ) parameters yield a good balance between execution time and minimal loss in recall.
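The adaptive-band heuristic discussed above can be sketched as follows. This is our own minimal illustration of the idea, not WFA-GPU's kernel code: a band of β diagonals is maintained, and every λ steps it is re-centered on the currently most advanced diagonal.

```python
def recenter_band(best_diagonal: int, beta: int) -> tuple[int, int]:
    """Return new (lo, hi) band limits, beta diagonals wide,
    centered on the most advanced diagonal."""
    half = beta // 2
    return best_diagonal - half, best_diagonal + (beta - half)

def should_recenter(step: int, lam: int) -> bool:
    """Re-center only every lam steps, keeping the heuristic cheap;
    larger lam means fewer re-centerings and faster execution."""
    return step > 0 and step % lam == 0
```

In this sketch, a large λ amortizes the re-centering cost over many steps (matching the observation that λ beyond 100 brings little benefit), while a small β narrows the band and risks drifting away from the optimal alignment path between re-centerings, which is why smaller β values pair better with λ around 50.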

Figure S1: Maximum number of GPU workers that can be executed in parallel depending on the maximum alignment score for different GPU device models (i.e., different number of SMs and Memory).
Figure S2: Recall (as a percentage of exact alignments) and time (in seconds) when aligning the Nanopore dataset using different β and λ parameters.

Table S1: Description of the real datasets used in the experimental evaluation.

Table S2: Time (T, in seconds) and recall (R, as a percentage of exact alignments) for simulated datasets with different error rates (e). All CPU executions use 10 threads. † Implementations can only produce edit-distance alignments.