dadi.CUDA: Accelerating population genetic inference with Graphics Processing Units

Extracting insight from population genetic data often demands computationally intensive modeling. dadi is a popular program for fitting models of demographic history and natural selection to such data. Here, I show that running dadi on a Graphics Processing Unit (GPU) can speed computation by orders of magnitude compared to the CPU implementation, with minimal user burden. This speed increase enables the analysis of more complex models, which motivated the extension of dadi to four- and five-population models. Remarkably, dadi performs almost as well on inexpensive consumer-grade GPUs as on expensive server-grade GPUs. GPU computing thus offers large and accessible benefits to the community of dadi users. This functionality is available in dadi version 2.1.0, https://bitbucket.org/gutenkunstlab/dadi/.

Population genetic data contain much information about the history of the sampled populations, but computationally intensive modeling is often necessary to extract that information. dadi is widely used for inferring models of demographic history (Gutenkunst et al., 2009) and natural selection (Kim et al., 2017) from population genetic data summarized in the form of an allele frequency spectrum. In a typical dadi analysis, the user specifies a model with parameters that represent population sizes, migration rates, divergence times, and/or selection coefficients. For a given set of parameters, dadi computes the expected allele frequency spectrum, from which the composite likelihood of the sample data can be calculated. Nonlinear optimization is then used to adjust the parameters to maximize that likelihood, and the maximum-likelihood values are then interpreted to gain insight into past population genetic processes. During optimization, the model will be evaluated hundreds or thousands of times, leading to substantial computational expense. Here, I show that computing on Graphics Processing Units (GPUs) can massively speed dadi model computation and thus inference.
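The likelihood computation at the heart of this optimization loop is simple: under the composite-likelihood approximation, each entry of the frequency spectrum is treated as an independent Poisson variable. A minimal NumPy sketch of that calculation follows; the function name and toy spectra are illustrative, not dadi's actual API (dadi additionally masks invariant entries and handles normalization).

```python
import numpy as np

def poisson_composite_ll(data_fs, model_fs):
    """Poisson composite log-likelihood of an observed allele frequency
    spectrum given a model expectation, dropping the constant ln(data!)
    terms. Entries are treated as independent Poisson variables."""
    data = np.asarray(data_fs, dtype=float)
    model = np.asarray(model_fs, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = data * np.log(model) - model
    # Drop non-finite terms arising from model entries equal to zero
    # (dadi instead masks such entries explicitly).
    return np.sum(terms[np.isfinite(terms)])

# Toy 1D spectra (masked corner entries omitted for simplicity):
observed = np.array([10.0, 5.0, 2.0, 1.0])
expected = np.array([9.5, 5.5, 2.1, 0.9])
print(poisson_composite_ll(observed, expected))
```

Because each entry's term is maximized when the model matches the data, optimizing parameters drives the expected spectrum toward the observed one.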
Modern graphics processing units (GPUs) provide enormous computing power for data-parallel tasks, in which the same operations are applied to many entries in memory (Owens et al., 2008). But exploiting this power often demands new algorithms. For example, in computational biology there has been extensive research into GPU algorithms for sequence alignment (Manavski & Valle, 2008) and search (Vouzis & Sahinidis, 2011). In genomics, GPU algorithms have been developed for variant calling (Luo et al., 2013) and secondary analysis (Luo et al., 2014) of short-read sequencing data. But GPU computing has rarely been applied to population genetic simulation or inference. Recently, Lawrie (2017) developed a GPU implementation of the single-locus Wright-Fisher model, finding speedups of over 250 times compared to a CPU implementation. Previously, Zhou et al. (2015) implemented a subset of the IM program for inferring isolation-with-migration demographic models (Hey & Nielsen, 2004) on a GPU, demonstrating speedups of around 50 times. Here, I show that GPU computing can dramatically speed up dadi analyses.
For dadi, the limiting computation is the numerical solution of a partial differential equation (PDE) to model the dynamics of the population distribution of allele frequencies φ (Kimura, 1964). In dadi, this PDE is solved using an alternating direction implicit scheme based on the Crank-Nicolson method (Fig. 1A; Press et al., 2007). Evolving φ forward in time then reduces to solving a large number of tridiagonal linear systems. Model parameters, including population sizes, migration rates, and selection coefficients, affect the a, b, and c diagonal vectors of these systems (Fig. 1A). Unlike general matrix equations, the tridiagonal structure enables these systems to be solved in linear time, using the serial Thomas algorithm (Press et al., 2007).

[Figure 1: A) Illustration of the dadi numerical integration pipeline. During each time step, the population allele density φ is propagated forward separately for each population axis. Each row in the non-uniform grid over which φ is approximated yields a tridiagonal linear system that must be solved. In the GPU implementation, the coefficients of all the systems are computed into dense matrices, and the systems are solved in parallel. B) Performance of CPU and GPU dadi implementations (smaller times are better). For coarse simulations with few grid points, the CPU implementation is faster. But for simulations with many grid points, the GPU implementation can be orders of magnitude faster. In each panel, the dashed line indicates the approximate scaling of memory usage and total floating-point operations. GPUs tested: Nvidia Tesla P100, Tesla K20, Titan RTX, and GeForce GTX 1650 Super.]
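Concretely, the Thomas algorithm is a forward elimination followed by a back substitution. A pure-Python/NumPy sketch (not dadi's internal code), valid for the diagonally dominant systems that Crank-Nicolson discretizations produce:

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve one tridiagonal system in O(n) with the serial Thomas algorithm.
    a: sub-diagonal (a[0] unused), b: main diagonal, c: super-diagonal
    (c[-1] unused), d: right-hand side. Assumes no pivoting is needed,
    as for diagonally dominant systems."""
    n = len(b)
    cp = np.empty(n)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward sweep: eliminate the sub-diagonal.
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution.
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Each system of size n costs O(n) operations, versus O(n³) for a general dense solve, which is why the tridiagonal structure matters so much at large grid sizes.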
Because diffusion PDEs similar to those in dadi are encountered in many fields, substantial research has been done on GPU algorithms for tridiagonal systems. Early work by Pixar employed the parallel cyclic-reduction algorithm (Hockney, 1965) to solve tridiagonal systems that arise in computer graphics on the GPU (Kass et al., 2006). Later work demonstrated that optimizing such algorithms is complex and that some parallel algorithms are numerically unstable (Zhang et al., 2010). Recently, Valero-Lara et al. (2018) showed that when the number of linear systems to be solved is large, the serial Thomas algorithm can be more efficient on the GPU than these parallel algorithms. This is because the Thomas algorithm enables more efficient memory access if the diagonal elements a, b, and c are stored in dense matrices (Fig. 1A).
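The batched layout can be illustrated on the CPU with NumPy: stacking the diagonals of many systems into dense matrices lets every step of the Thomas recurrence operate on one column across all systems at once. On the GPU, each system is handled by one thread, and neighboring threads read neighboring matrix entries, giving the coalesced memory access that makes this approach efficient. This is a sketch of the idea, not the cuSPARSE implementation dadi uses:

```python
import numpy as np

def thomas_solve_batch(a, b, c, d):
    """Solve many independent tridiagonal systems at once.
    Each argument is an (n_systems, n) dense matrix, one row per system,
    mirroring the layout of Fig. 1A. The recurrence is still serial in n,
    but every step operates on a whole column of systems."""
    n_sys, n = b.shape
    cp = np.empty_like(b)
    dp = np.empty_like(b)
    cp[:, 0] = c[:, 0] / b[:, 0]
    dp[:, 0] = d[:, 0] / b[:, 0]
    # Forward sweep, vectorized across all systems.
    for i in range(1, n):
        denom = b[:, i] - a[:, i] * cp[:, i - 1]
        cp[:, i] = c[:, i] / denom
        dp[:, i] = (d[:, i] - a[:, i] * dp[:, i - 1]) / denom
    # Back substitution, also vectorized across systems.
    x = np.empty_like(b)
    x[:, -1] = dp[:, -1]
    for i in range(n - 2, -1, -1):
        x[:, i] = dp[:, i] - cp[:, i] * x[:, i + 1]
    return x
```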
Programming GPUs demands special-purpose frameworks and libraries. Two frameworks dominate the field. The Open Computing Language (OpenCL) is a standard for writing parallel code that is portable across CPUs, GPUs, field-programmable gate arrays, and other specialized hardware (Stone et al., 2010). The Compute Unified Device Architecture (CUDA) is developed by the Nvidia Corporation specifically for use on its GPUs (Nickolls et al., 2008). In general, these frameworks offer similar performance and programming convenience (Holm et al., 2020). OpenCL is available on more platforms, particularly because CUDA is no longer supported on OS X, but more libraries are available for CUDA than for OpenCL. In particular, the Valero-Lara et al. (2018) algorithm for efficiently solving batches of tridiagonal systems was recently integrated into the CUDA standard library. Thus, the GPU implementation of dadi requires a CUDA-compatible GPU from Nvidia. dadi is primarily written in Python; to interface with CUDA, I used the PyCUDA (Klöckner et al., 2012) and scikit-cuda (Givon et al., 2019) libraries, which can be easily installed from the Python Package Index using pip.
For the end user, dadi GPU usage is transparent, requiring only a single call to dadi.cuda_enabled(True).
To evaluate the performance of the dadi GPU implementation, I compared times to compute the population distribution of allele frequencies. The key parameter governing computation time is the number of grid points, pts, used to approximate the numerical solution of the PDE. In practice, the number of grid points must be larger than the largest data sample size, and many more points are needed when simulating strong selection (Kim et al., 2017). The φ matrix has dimensionality P, the number of populations modeled, so the number of elements in φ scales as pts^P, and the expected computation time scales the same way. I tested models with differing numbers of populations, drawing on the stdpopsim resource where possible (Adrion et al., 2020), on multiple CPUs and GPUs. The benchmarking code is available in the dadi source repository: https://bitbucket.org/gutenkunstlab/dadi/src/master/examples/CUDA.
dadi is most often used with two- or three-population models, and in both cases the GPU implementation can be substantially faster than the CPU implementation (Fig. 1B). My two-population test model was the three-epoch model estimated for African and European Drosophila melanogaster by Li & Stephan (2006). My three-population test model was the Out-of-Africa model for humans estimated by Gutenkunst et al. (2009). The server-grade Tesla P100 GPU is over 400 times faster than the corresponding system CPU for two populations and many grid points. Even the mid-range consumer-grade GeForce GTX 1650 Super GPU is over 10 times faster than the system CPU for large 2D and 3D systems. These speed differences are illustrated more directly in Fig. S1, which shows the ratios of the CPU and GPU times on the same systems. For all systems tested, the GPU began to outperform the CPU at around 150 grid points for two populations and around 50 grid points for three populations, values regularly used in typical data analyses.
Given the dramatic speedup provided by GPU computing, I extended dadi to four- and five-population models. Tests with the four-population New World model from Gutenkunst et al. (2009) and the five-population archaic admixture model from Ragsdale & Gravel (2019) again showed that GPUs can substantially outperform CPUs.
Keys to efficient GPU computing include minimizing data transfer and managing memory usage. For dadi, the φ matrix is copied to the GPU at the outset of each integration function. All computations then remain on the GPU through potentially many time steps, until the φ matrix is copied back when the integration function exits. One subtlety is that during each time step the φ matrix must be transposed and reshaped between integration directions, to maintain the expected alignment of the a, b, c, and φ matrices for the batch tridiagonal solver algorithm. For example, integrating a three-population scenario with 100 grid points involves simultaneously solving 10,000 tridiagonal systems of size 100, so the natural 100 × 100 × 100 φ array must be reshaped to 100 × 10,000, then reshaped again for the next direction of integration, and so forth. Memory is typically more limited on GPUs than on the host systems. For dadi, the full dense a, b, c, and φ matrices are stored in double precision. For P populations and pts grid points, memory usage is thus roughly 4 × 8 × pts^P / 1024^3 gigabytes (GB). For 3 populations, a modern mid-range GPU with 4 GB of RAM can thus scale to pts = 400, while a high-end GPU with 24 GB of RAM can scale to pts = 900. A future implementation of dadi on the GPU could potentially compute only portions of the a, b, and c matrices at a time, reducing memory usage but substantially increasing code complexity.
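Both the reshaping and the memory budget above are easy to sketch in NumPy. The helper name and the exact matrix orientation here are illustrative; dadi's CUDA code differs in detail:

```python
import numpy as np

def gpu_memory_gb(P, pts):
    """Rough GPU footprint from the text: four dense double-precision
    matrices (a, b, c, and phi) of pts**P elements each."""
    return 4 * 8 * pts**P / 1024**3

# Reshaping phi between integration directions: with 3 populations and
# 100 grid points, each direction yields 100*100 = 10,000 systems of
# size 100.
pts = 100
phi = np.zeros((pts, pts, pts))
for axis in range(phi.ndim):
    # Move the axis being integrated innermost, then flatten the other
    # axes into the batch dimension of a batched tridiagonal solver.
    batch = np.moveaxis(phi, axis, -1).reshape(-1, pts)
    assert batch.shape == (pts * pts, pts)
```

With these numbers, gpu_memory_gb(3, 400) ≈ 1.9 GB and gpu_memory_gb(3, 900) ≈ 21.7 GB, consistent with the 4 GB and 24 GB limits quoted above.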
GPU programming can involve unusual performance decisions. For example, if parameter values are constant during an integration, then the a, b, and c matrices do not change between time steps and could be cached. Remarkably, for systems large enough to favor the GPU over the CPU, it is faster to recalculate those matrices for each time step than to cache them (Fig. S2). In CUDA, computational threads are organized into blocks that can share local memory, and optimizing block size can be important for maximal performance (Ryoo et al., 2008). For dadi, no communication between threads is needed, so the optimal block size is large (Fig. S3).
For many users, the ultimate benefit of GPU computing is high performance at low cost. In this respect, the performance of the mid-range consumer-grade GeForce GTX 1650 Super GPU is remarkable. As of writing, the GeForce costs roughly $200, compared to roughly $2,500 for the high-end consumer-grade Titan RTX and roughly $6,000 for the server-grade Tesla P100. Yet the performance of the GeForce is within a factor of three of the high-end GPUs (Fig. S4), even though the P100 theoretically has a 30-fold advantage in double-precision operations per second. This suggests that the dadi workload is bound by memory bandwidth rather than arithmetic. It also shows that the massive benefits of GPU computing for dadi are easily accessible to end users.
GPU computing offers substantial performance benefits for dadi, with minimal user burden. These performance improvements increase dadi's competitiveness with alternative methods for calculating the allele frequency spectrum, such as moments (Jouganous et al., 2017). They also make analysis of more complex models, including those with four and five populations, computationally feasible. The advantage of the more expensive GPUs grows with problem size, but in these tests it never exceeded a factor of three. Lastly, the large benefits of even consumer-grade GPUs for dadi suggest that GPU computing may also be worth considering for other software in computational population genetics.

Figure S1: GPU vs. CPU speedups. The results from Fig. 1B are replotted, now showing the ratio between CPU and GPU times on the same systems. The Ocelote and El Gato clusters are University of Arizona centralized High Performance Computing resources.

Figure S2: Caching vs. recalculating a, b, and c. For the two-population test model and a modified three-population model that uses piecewise-constant parameters, plotted is the ratio of times for implementations that cache the a, b, and c matrices or recalculate them each time step. Recalculating each time step is faster for larger grid-point settings.