FFPopSim: an efficient forward simulation package for the evolution of large populations

Motivation: The analysis of the evolutionary dynamics of a population with many polymorphic loci is challenging, as a large number of possible genotypes needs to be tracked. In the absence of analytical solutions, forward computer simulations are an important tool in multi-locus population genetics. The run time of standard algorithms to simulate sexual populations increases as 8L with the number of loci L, or with the square of the population size N. Results: We have developed algorithms to simulate large populations with arbitrary genetic maps, including multiple crossovers, with a run time that scales as 3L. If the number of crossovers is restricted to at most one, the run time is reduced to L2L. The algorithm is based on an analogue of the Fast Fourier Transform (FFT) and allows for arbitrary fitness functions (i.e. any epistasis). In addition, we include a streamlined individual-based framework. The library is implemented as a collection of C++ classes and a Python interface. Availability: http://code.google.com/p/ffpopsim/. Contact: richard.neher@tuebingen.mpg.de Supplementary information: Supplementary data are available at Bioinformatics online.

To compile the library from the C++ source, you need the GNU scientific library (http://www.gnu.org/software/gsl/) and the Boost C++ library (http://www.boost.org/). If all of these are installed and the appropriate path are set, FFPopSim can be compiled using Make. Make will compile the C++ code and call setup.py, which uses python distutils to make sure the correct python path are used. Installation instructions are provided in the INSTALL file. The building process creates files inside the folder pkg; C++ headers are created in pkg/include, the static C++ library in pkg/lib, and the Python module files in pkg/python.
The python interface to the C++ library is produced using SWIG. The output files of SWIG are platform independent and included in the source. Hence SWIG is required only if the source is changed and the python binding need to be rebuilt. For this reason, make clean will not delete SWIG output. Similarly, the documentation is produced using Doxygen (the C++ part) and Sphinx (the python part) and shipped with the source. make clean-all deletes documentation and SWIG output.

II. CONVENTIONS AND DEFINITIONS
FFPopSim assumes haploid individuals. This imposes no restriction as long as dominance effects are not important. An extension to diploids is envisioned. Any population size we refer to is therefore the number of haploid genomes in the populations.
FFPopSim is strictly biallelic: Each locus can be in one of two states, conveniently stored as one bit 0/1. A population in FFPopSim is endowed with a genotype to fitness map. There are many ways in which such a map can be specified. Most straightforwardly, one explicitly assignes a fitness value to each genotype. However, there are prohibitively many genotypes to store and specify if the number of loci is large and this option is only supported in haploid lowd. With biallelic loci, there are 2 L possible genotypes and the most general function would need to be specified for each of these.
Alternatively, a genotype-phenotype map can be specified relative to a reference genotype, commonly referred to as wild type. If the wild type genome is g W T = {0, 0, . . . , 0}, the fitness or phenotype of any other genome g = {s 1 , . . . , s L } is then where c (1) i is the effect of a substitution at locus i, while c (2) i j is the epistatic effect of a double substitution at loci i and j. For computational reasons that will become clear below, it is useful to reformulate this map in a basis where a locus takes the states t i ∈ {−1, +1} rather than s i ∈ {0, 1}. The {−1, +1} basis is much more symmetric and this symmetry can be used to speed up calculations. For this reason FFPopSim uses the {−1, +1} basis. Since most population geneticists are more familiar with the {0, 1} basis, we discuss their relation here.
The two bases are related to each other by a simple linear transformation t i = 2s i − 1 and its inverse s i = 1 2 (t i + 1) In this symmetric basis, a genotype phenotype map is specified as The coefficients of this expansion f Hence, for most practical purposes the single locus selection coefficients in the {−1, +1} basis are simply half of the single locus selection coefficients in {0, 1} basis. Similarly, the two locus interaction coefficients f i j are simply a quarter of their counter part c (2) i j . Note, however, that interactions in the {0, 1} bases contribute to the single locus coefficents in the {−1, +1} basis. The constant term f (0) is rarely important since all fitness values are relative to a mean fitness.
FFPopSim stores all fitness values as log-fitness. Conventional fitness is then obtained by exponentiation: W (g) = exp(F(g)). Specification of an additive fitness landscape in terms of F(g) results in multiplicative fitness in terms of W (g).

III. DESCRIPTION OF FFPOPSIM
FFPopSim contains two packages of C++ classes and Python wrappers for forward population genetics simulations. One for large populations and relatively few loci (L < 20), another one for longer genomes. The former is called hapoid lowd and Figure 1 Calculating the Fourier transform in L2 L operations. Arrow going up indicate addition, going down substraction. For the general L dimensional hypercubes L cycles are necessary where terms differing at different bits are combined. tracks the abundance of all possible genotypes. The other one is called haploid highd, tracks only genotypes present in the population, and is essentially individual-based.
haploid highd only allows for limited complextity of fitness functions and crossover patterns, while hapoid lowd supports arbitrarily complex fitness functions and genetic maps. The two parts of the library have very similar syntax but work quite differently under the hood. We will therefore describe them separately below.
A complete html documentation of the C++ functions is generated automatically from the source using Doxygen and can be found in pkg/doc/cpp/html/index.html. All python functions are documented using Sphinx and can be found in pkg/doc/python/html/index.html. The following is a rather technical description of the algorithms used in the library and how they are implemented as C++ functions. To simply get an impression of what can be done with FFPopSim, the reader can fast-forward to the examples in section V.

A. Fast-Fourier Transform of functions of the genotype
Above, we introduced a specification of genotype to phenotype functions by building them from single locus effects, pair-wise interaction terms, and so forth. In total there are 2 L such coefficients -enough to specify the most general function. Any function F(g) on the hypercube can be expressed as There are L k coefficients f (k) i 1 ...i k for every subset of k loci out of L loci. There are 2 L coefficients in total (Weinberger, 1991).
These nominally 4 L operations (2 L for each coefficient) can be done in L 2 L steps via the Fast Fourier Transform (FFT) illustrated in Fig. 1. Both the forward and reverse transform are implemented in hypercube lowd. This FFT is central to the efficient simulation of large populations and used extensively across haploid lowd.

B. FFPopSim for few loci: haploid lowd
A complete list of all function with descriptions, call and caller graphs, is generated automatically from the source in html by Doxygen. Here, we describe the most important functions and describe the recombination algorithm.

Specification of the population and the fitness function
An instance of hypercube lowd can be initialized with the function values of the hypercube or with its Fourier coefficients. The population class, haploid lowd, holds instances of hypercube lowd for the population, the mutant genotypes, the recombinant genotypes and the fitness function.
From a practical point of view, an instance of a low-dimensional population is initialized in three steps, where L is the number of loci and rng seed is a seed for the random number generator (a random seed by default); (b) the initial population can be specified by one of the functions i n t h a p l o i d l o w d : : s e t a l l e l e f r e q u e n c i e s ( d o u b l e * f r e q , u n s i g n e d l o n g N) i n t h a p l o i d l o w d : : s e t g e n o t y p e s ( v e c t o r < i n d e x v a l u e p a i r t > g t ) i n t h a p l o i d l o w d : : s e t w i l d t y p e ( u n s i g n e d l o n g N) that set the population size and composition. The first function initializes random genotypes in linkage equilibrium with the specifiec allele frequencies freq, the second explicitely sets a number of individuals for each genotype using the new type index value pair, and the last one sets a wild type-only (genotype 000 . . . 00) population of size N; (c) the fitness function is stored in haploid lowd in an instance of hypercube lowd which is a public member haploid lowd.fitness. To inialize, the following functions are available: The hypercube lowd::additive and hypercube lowd::init coeff list allow to set either all additive coefficients, or a vector of index value pair where the index specifies the interaction term (00101 would be a interaction between locus 0 and 2). hypercube lowd::init list specifies fitness of genotypes directly, rather than coefficients. The index in index value pair then specifies the genotype again in binary format.

Evolution
In traditional Wright-Fisher type models, in each generation, the expected frequencies of gametes with a particular genotype after mutation, selection, and recombination are calculated and then the population resampled from this gamete distribution. We will now outline the steps required to update the population. A more detailed discussion can be found in (Neher and Shraiman, 2011). All the following steps are called by either one of the following functions: is another alternative that skips the recombination step (asexual evolution).

Mutation
Let P(g) be the genotype distribution at the beginning of a generation. Denoting the mutation rate towards the 1 or 0 state at locus i by u 1/0 i , the expected P(g) of genotype g = {s 1 , . . . s L } after mutation will be where Π i g denotes genotype g with locus i flipped from 0 to 1 or vice versa. The first term is the loss due to mutation, while the second term is the gain due to mutation from neighbouring genotypes (in terms of Hamming distance). The mutation rates can be specified by the folowing set of overloaded functions: either a single double rate (same for every position and in both forward and backward sense), two double rates (forward and backward rates), a L dimensional double array (site-specific, identical forward and backward rates), or a 2 × L dimensional double array, takes two rates and sets rate forward as the forward rate (0 → 1) and rate backward as the backward rate (1 → 0) takes the pointer to an array of length L and sets site-specific rates, the same for forward and backward mutations; • i n t h a p l o i d l o w d : : s e t m u t a t i o n r a t e ( d o u b l e * * r a t e s ) takes a pointer to a pair of (pointers to) arrays, each of length L, which contain the site-specific rates for forward (rates[0]) and backward (rates[1]) mutations.

Selection
Selection reweighs different genotypes according to their fitness as follows where e F is the population average of e F(g) , which is required to keep the population size constant. The corresponding function is i n t h a p l o i d l o w d : : s e l e c t g a m e t e s ( )

Resampling and genetic drift
For deterministic modeling, one generation would be completed at this point and one would repeat the cycle, starting with mutation again. For stochastic population genetics, we still need to resample the population in a way that mimics the randomness of reproduction. The easiest and most generic way to do this is to resample a population of size N using a multinomial distribution with the current P(g) as sampling probabilities of different genotypes. Alternatively, one can sample individuals according to a Poisson distribution with mean NP(g) for each genotype, which will result in a population of size N ± O( √ N). For large populations, the two ways of resampling are equivalent and we chose the latter (much faster) alternative. The function i n t h a p l o i d l o w d : : r e s a m p l e ( ) samples the next generation according to the expected genotype frequencies. The expected population size used in the resampling is the carrying capacity, which can be set by the user at will.

Mating and recombination
The computationally expensive part of the dynamics is recombination, which needs to consider all possible pairs of pairs of parents and all different ways in which their genetic information can be combined. In a facultatively sexual population, a fraction r of the individuals undergo mating and recombination. In obligate sexual populations, r = 1. The genotype distribution is updated according to the following rule: The distribution R(g) of recombinant gametes would naively be computed as follows: where ξ specifies the particular way the parental genomes are combined: ξ i = 0 (resp. 1) if locus i is derived from the mother (resp. father). The genotype g is summed over; it represents the part of the maternal (g m ) and paternal (g p ) genotypes that is not passed on to the offspring. We can decompose each parent into successful loci that made it into the offspring and wasted loci, as follows: g p = ξ ∧ g + ξ ∧ g and g m = ξ ∧ g + ξ ∧ g , where ∧ and a bar over a variable indicate respectively the elementwise AND and NOT operators (i.e., ξ i := 1 − ξ i ). The function C assigns a probability to each inheritance pattern. Depending on whether the entire population undergoes sexual reproduction or only a fraction r of it, the entire population or a fraction r is replaced with R(g). The central ingredient for the efficient computation of R(g) is the Fourier decomposition introduced above. The generic Fourier coefficient of R(g) is given by Just as g p and g m can be expressed as a combination of g and g , we can invert the relation and express the generic t i as a function of g p and g m , as follows: t i = ξ i t m i + ξ i t p i . Using this new basis and exchanging the order of summations, we obtain Notice that C(ξ) can be pulled out of the two inner sums, because the odds of inheriting a certain locus by the mother/father is independent of what their genetic makeup looks like. Next we expand the product and introduce new labels for compactness, where l is the number of loci inherited from the mother among the k in (i 1 , . . . , i k ). l runs from 0 (everything is contributed by the father) to k (everything from the mother). { j i } and {h i } are all (unordered) partitions of i into sets of size l and k − l, respectively. Now we can group all ξ i in the inner sum with C(ξ), all t m i with P(g m ), and all t p i with P(g p ). The three sums (over ξ, g m , and g p ) are now completely decoupled. Moreover, the two sums over the parental genotypes happen to be the Fourier decomposition of P(g). Hence, we have The quantity can be calculated efficiently, for each pair of partitions ({ j i }, {h i }), by realizing that (a) for k = L, there is exactly one term in the sum on the right that is non-zero and (b) all lower-order terms can be calculated by successive marginalizations over unobserved loci. For instance, let us assume that k = L − 1 and that the only missing locus is the m-th one. We can compute There are L k ways of choosing k loci out of L, which can be inherited in 2 k different ways (the partitions in j and h in Eq. (15)) such that the total number of coefficients is 3 L . Note that these coefficients are only calculated when the recombination rates change. Furthermore, this can be done for completely arbitrary recombination patterns, not necessarily only those with independent recombination events at different loci.
The recombination process is greatly simplified if only one crossover is allowed. In this case, the majority of the coefficients C (L) j 1 ... j l m,h 1 ...h L−1−l are zero. In fact, there are only L allowed recombination patters and the sum over the different partitions in Eq. (15) has only k + 1 allowed terms. In this case, the entire recombinant distribution can be calculated in O(L2 L ) steps and is therefore of the same complexity as the mutation step.
haploid lowd provides a function to calculate C(ξ) from recombination rates between loci assuming a circular or linear chromosome. The probability of a particular crossover pattern is calculated assuming independent crossovers. The function assumes a double array of length L − 1 for a linear chromosome and of length L for a circular chromosome. For a linear (resp. circular) chromosome, the i-th element of the array is the probability of recombining after (resp. before) the i-th locus. The recombination model is specified as an optional argument and can take values CROSSOVERS, SINGLE CROSSOVER and FREE RECOMBINATION specifed in the header file. If the recombination model is CROSSOVERS, independent crossovers between loci are assumed and multiple crossovers allowed. SINGLE CROSSOVER restricts to at most one crossover (not available for circular genomes), and the recombination rates are interpreted as probabilities of observing a crossover between two loci (their sum needs to be less than one). haploid lowd offers the simpler alternative for free recombination which reassorts all loci at random. If the user does not set the recombination rates via set recombination rate, free recombination is the default behaviour.
Other more complicated recombination patterns can be specified manually through which takes a vector of patterns and the associated rates. All other rates are assumed 0. Remember to specify the rate at which no recombination happens (pattern 000000 or 111111). Furthermore, the mating probability r can be specified explicitely via the attribute h a p l o i d l o w d : : o u t c r o s s i n g r a t e the default is obligate sexual reproduction, i.e., outcrossing rate=1. Note that, in a circular chromosome, there is effectively one more inter-locus segment (between the last and the first locus) in which crossovers can occur, and the total number of crossovers has to be even. Assuming independent crossovers, the global recombination rate of circular chromosomes is lower than a linear chromosome of the same length by a factor of (1 − e −r 0 ), where r 0 is the recombination rate between the first and last loci.
The recombination process itself is initiated by

C. FFPopSim for many loci: haploid highd
For more than 20 loci, storing then entire genotype space and all possible recombination patterns becomes prohibitive. For this reason, we also include a streamlined individual-based simulation package that can simulate arbitrarily large numbers of loci and has an overall run-time and memory requirements O(NL) in the worst case scenario. The many-loci package uses the same interface as the few-loci one. This makes it easy, for example, to first test an evolutionary scenario using many (all) loci and to focus on the few crucial ones afterwards.
The one notable difference is that fitness in haploid highd is a function of several traits, all of which are functions on the hypercube. This is useful, for example, to model changing treatment with antiviral drugs: fitness is then a function of replicative capacity and drug resistance. The mapping of traits to fitness can be changed by defining the function which is a virtual member of haploid highd.
To speed up the program in many cases of interest, identical genotypes are grouped into clones. The way the population is initialized and evolution is controlled is largely identical to hypercube lowd above. We only highlight the differences here. A full html documentation and many examples are provided with the source.

Specification of the population and the fitness function
haploid highd stores individual genotypes as bitsets. Each genotype, together with the number of individuals that carry it, as well as traits and fitness associated with it is stored for as long as it is present in the population. All relevant quantities are aggregated in the structure clone t. The population is a vector of clones. Each generation clone sizes are updated, new clones that arise through mutation and recombination are added, and clones of size 0 marked for future reuse.
Traits are functions on the hypercube. The latter is implemented as hypercube highd. Instead of storing all possible trait values, hypercube highd stores non-zero Fourier coefficients. Whenever a new genotype is produced, its fitness is calculated by summing the appropriate coefficients. No explicit specification of trait values for particular genotypes is possible.
Fitness is a function of the traits of individuals. By default, fitness is identical to the value of the first trait. The mapping from traits to fitness is defined as a virtual function and can be overwritten by the inheriting classes.

Mutation
To implement mutation, we determine the individuals that mutate (each indiviudal has probability of 1 − exp(−Lµ) to mutate). In each of these individuals, at least one, and possible several mutations are introduced in random locations in the genome. Mutations are bit-flip operations in the bitset. As of now, mutation rates are the same at every locus in the genome.

Selection
Prior to selection, the population average W := e F and a growth rate adjustment exp(1 − N/N 0 ) are computed. The latter is used to keep the population size close to the carrying capacity N 0 . The size of each clone is then updated with a Poisson distributed number with mean (1 − r)W −1 exp(F(g) + 1 − N/N 0 ), where r is the probability of outcrossing (only non-outcrossing individuals remain part of the same clones). The number of individuals that outcross is again Poisson distributed number with mean rW −1 exp(F(g) + 1 − N/N 0 ). Those individuals are saved for recombination in the next step.

Recombination
The individuals marked for sexual reproduction during the selection step are shuffled and paired. For each pair, a bitset representing the crossover pattern is produced and two new clones are produced from the two parental genomes. Crossovers are assumed to occur independently between any two loci with the crossover rate. Alternatively, all loci of the two genomes can be reassorted at random. The two allowed values for recombination model are FREE RECOMBINATION and CROSSOVERS.

D. Python wrapper
The C++ library includes Python bindings that greatly simplify interactive use and testing. The wrapping itself is done by SWIG (Beazley, 2003). Most notably, the C++ classes haploid lowd, haploid highd and the HIV-specific subclass hiv population are fully exposed to Python, including all their public members. The performance speed for evolving a population is unchanged, since the evolve function iterates all steps internally for an arbitary number of generations.
The bindings are not completely faithful to the C++ interface, to ensure a more intuitive user experience. For instance, some C++ attribute set/get members are translated into Python properties via the builtin property construct. Furthermore, since direct access to the hypercube lowd instances from Python is not straightforward, a few utility functions have been written to do common tasks. The fitness hypercube can be set easily by either one of The former function is used to set specific points on the hypercube: it takes a sequence of genotypes genotypes (integers or binary literals using 0b notation) and a sequence of fitness values fitnesses, corresponding to those genotypes. Fitness values are interpreted as growth rates, also known as log-fitnesses. Any missing points on the fitness hypercube will be set to F(g) = 0. The second function creates an additive fitness landscape with selection coefficients specified by the L-dimensional input sequence selection coefficients, see Eq. (3). After installation, the FFPopSim library can be used in Python as a module, e.g.

from FFPopSim i m p o r t h a p l o i d l o w d
The bindings make heavy use of the NumPy library and its SWIG fragments and typemaps (Oliphant, 2006). We therefore recommend to import NumPy before FFPopSim, although this is not strictly necessary. Moreover, the Python bindings include a few functions for plotting features of the population, such as genetic diversity. The Python module Matplotlib is required for this purpose (Hunter, 2007).
The HIV-specific part of the code has been expanded further in Python to enable quick simulations of viral evolution under commonly studied conditions. In particular, random genotype-phenotype maps for viral replication capacity and drug resistance can be generated automatically from a few parameters, via the functions h i v p o p u l a t i o n . s e t r e p l i c a t i o n l a n d s c a p e h i v p o p u l a t i o n . s e t r e s i s t a n c e l a n d s c a p e Both functions have many input parameters that allow to specify typical properties of HIV populations, such as the fraction of sites carrying potentially adaptive mutations. See the Python documentation or the tables below for further details on these functions. Genomes of large population samples can be written to a file in a compressed format using the function where filename is the name of the file, in which the data are to be stored, and number of individuals if the size of the random sample. The data can be accessed again by the standard Numpy load function. sets the mutation rate(s). rates specifies the forward mutation rate(s) and is a single value or a list. rates back speficies the backward rates get_mutation_rates(locus=None, direction=None)

A. haploid lowd
returns the mutation rate(s). direction is 0 for forward, 1 for backward rates. Defaults to all sites and both directions

Continued on next page
Definition Description get_divergence_statistics(n_sample=1000) returns the mean and variance of divergence from wildtype in the population get_divergence_histogram(bins=10, chunks=None, every=1, n_sample=1000, **kwargs) returns a histogram of divergence in the population. chunks is a list of pairs, each for a genomic region of interest; every enables comparisons of subsets, e.g. only at third-codon positions (every=3) plot_divergence_histogram(axis=None, n_sample=1000, **kwargs) returns a plot handle of the histogram of divergence in the population. axis enables plotting in an extant axis get_diversity_statistics(n_sample=1000) returns the mean and variance of diversity in the population get_diversity_histogram(bins=10, chunks=None, every=1, n_sample=1000, **kwargs) returns a histogram of diversity in the population.
Arguments as in get_divergence_histogram plot_diversity_histogram(axis=None, n_sample=1000, **kwargs) returns a plot handle of the histogram of diversity in the population. axis enables plotting in an extant axis

C. hivpopulation
As a subclass of haploid_highd, this class inherits its methods. Only different or new methods are listed.
Definition Description __init__(N=0, rng_seed=0, mutation_rate=3e-5, coinfection_rate=0.01, crossover_rate=0.001) constructor. rng seed a seed for the random number generator. A wildtype population of size N is generated. If N is zero, you have to initialize the population explicitely before performing any simulations treatment treatment weight (between 0 and 1). See also trait_weights of haploid_highd.

V. EXAMPLES
In the following, we list examples of the usage of FFPopSim for a small number of loci (haploid lowd) and many loci (haploid highd and hivpopulation). Additional examples are provided in the online documentation.

A. haploid lowd: Valley crossing
One of the most striking effects of genetic epistasis is the slowdown of evolution when a combination of mutations is beneficial, but intermediate mutants are deleterious compared to wild type. Such scenario is relevant in applications, for instance for the emergence of bacterial or viral resistance to drugs (Weinreich et al., 2005). Not surprisingly, recombination plays a central role in this process. On the one hand, it enhances the production rate of multiple mutants, on the other it depletes the class of complete mutants by back-recombination with deleterious backgrounds.
FFPopSim makes the efficient simulation of such processes as easy as the following script: During an HIV infection, the host immune system targets several viral epitopes simultaneously via a diverse arsenal of cytotoxic T-cells (CTLs). Mutations at several loci are thus selected for and start to rise in frequency at the same time. Because of the limited amount of recombination, different escape mutations compete against each other. This interference delays CTL escape.
The theoretical description of genetic interference is involved and often limited to two-loci models, but FFPopSim makes the simulation of this process straightforward. The following script evolves a four-loci population under parallel positive selection and tracks its genotype frequencies: The resulting genotype frequencies are shown in Fig. 3. The full script producing the figure is provided as separate file in the examples folder or online.

C. haploid highd: Genetic draft and drift
Neutral polymorphisms are strongly affected by linked selection. To demonstrate this effect, we simulate a large population of N = 50000 individuals and a genome of L = 256 loci. The majority of these loci are designated to be neutral and segregate at frequency 0.5. Mutations at every tenth locus have strongly deleterious effects and they hence are at a frequency close to zero. If once in a while one of these deleterious loci becomes beneficial due to an environmental change, new mutations will now rapidly sweep to fixation and perturb linked neutral diversity. This effect of genetic draft can be easily studied using haploid highd. The following script produces the desired output.

D. hivpopulation: HIV chronic infection
HIV evolution during chronic infection is determined by a number of parallel processes, such as mutation, recombination, and selection imposed by the immune system. In combination, these processes give rise to a complicated dynamics and don't understand how population features such as population diversity depend on model parameters. Hence simulations are an important source of insight.
FFPopSim offers a specific double C++/Python interface to this problem via its class hivpopulation. The following script simulates an HIV population for one thousand generations, under a random static fitness landscape, and stores a hundred random genomes every 250 generations in a compressed file: NumPy can be used subsequently to analyze the genome sequences. Alternatively, the internal Python functions can be used, e.g. for calculating the fitness distribution directly using hivpopulation.get fitness histogram, as shown in Fig. 5. The full script producing the figure is provided as separate file in the examples folder or online.