forqs: Forward-in-time Simulation of Recombination, Quantitative Traits, and Selection

forqs is a forward-in-time simulation of recombination, quantitative traits, and selection. It was designed to investigate haplotype patterns resulting from scenarios where substantial evolutionary change has taken place in a small number of generations due to recombination and/or selection on polygenic quantitative traits. forqs is implemented as a command- line C++ program. Source code and binary executables for Linux, OSX, and Windows are freely available under a permissive BSD license.


About the image
The image on the first page was created based on haplotype frequency information reported during a forqs simulation. The image shows local haplotype frequencies (equivalently, local ancestry proportions) along a single chromosome. A single population was simulated with selection at two loci (one near each end of the chromosome). 25% of individuals (blue) in the initial generation carried the selected allele at the first locus, and a different 25% (green) carried the selected allele at the second locus. After 100 generations, the two selected variants have fixed in the population. All individuals carry one of the blue haplotypes in the region containing the first locus, as well as one of the green haplotypes in the region containing the second locus.
1 Getting Started forqs binaries (Linux, OSX, and Windows) are distributed in zip file packages containing the executables, documentation, and example configuration files. The latest forqs packages are available here: https://bitbucket.org/dkessner/forqs/downloads forqs is a command line program that takes a single argument specifying the configuration file for the simulation: forqs puts all output files in the output directory specified in the configuration file. After running this command, you will find a new directory output_ example_1_locus_selection with the output from this simulation. For convenience, you can put the forqs executable somewhere in your PATH, e.g.~/bin, so that you don't have to type the path when running the program.
In addition to the document you are now reading, forqs packages include the forqs Module Reference (docs/forqs_module_reference.html) which contains details about the configurable modules available to the user, including parameter names, usage, and links to examples. forqs begins with a set of founding haplotypes representing the individuals in the initial generation. By assigning a unique identifier to each founding haplotype, individual haplotype chunks are tracked as they recombine over subsequent generations. For the purposes of simulation, any existing neutral variation on the haplotype chunks can be ignored. When simulating selection on standing variation, only those loci with fitness effects need to be tracked.
Internally, forqs represents an individual chromosome as a list of haplotype chunks ( Figure 1). Individuals are diploid, and carry a user-specified number of chromosome pairs. To represent genetic variation at particular loci, forqs queries functions called VariantIndicators to obtain the variant values carried by the founding haplotypes at those loci.
forqs uses configurable modules (Section 4) that are plugged into the main simulator to change the behavior of the simulation (Figure 2). forqs performs the following actions during a single cycle of the simulation: (1-3) Generation of new populations: The PopulationConfigGenerator module (1) provides the simulator with a population configuration that specifies how to create individuals in the next generation from parents in the previous generation, based on user-specified population size and migration rate trajectories. Recombination and mutation are handled by a user-specified RecombinationPositionGenerator module (2) and/or MutationGenerator module (3), respectively.
(4) Genotyping: individuals are "genotyped" at a set of loci, using the VariantIndicator module to obtain variant (SNP) values for each individual. The list of loci to genotype is determined from the quantitative traits and reporters that are specified by the user.
(5) Quantitative trait evaluation: for each quantitative trait, a trait value is calculated for each individual, based on the individual's genotype at loci  (7) Recombination (2) Mutation (3) Figure 2: forqs modular design. The orange boxes represent places where the user can plug in configurable modules.
affecting the trait. Each quantitative trait is specified by a Quantitative-Trait module.
(6) Fitness evaluation: the FitnessFunction module calculates each individual's fitness based on the individual's quantitative trait values.
(7) Reporting: each Reporter module updates its output files using current information from the populations (which includes genotypes, trait values, and fitnesses).
Each configurable module in Figure 2 is actually an interface with multiple implementations. The user specifies which module implementations to instantiate in a forqs configuration file. More details on the primary module interfaces and their interaction with the simulator can be found in Section 4.4. The forqs Module Reference contains detailed information on the primary module implementations available to the user.
In addition to the primary modules shown in the diagram, there are several building block modules (Section 4.5) that provide basic functionality to the primary modules. For example, Trajectory modules provide a unified method for specifying values that change over time, such as population sizes or migration rates. Similarly, Distribution modules can be used to specify how to draw particular random values, for example, QTL positions or allele frequencies.

Tutorial Introduction
In this section we introduce forqs configuration files, starting with a minimal example and gradually adding modules to add complexity to the simulation. All example configuration files can be found in the examples directory of the forqs package.

A minimal example
We start with a minimal example configuration file: There are two modules specified in this configuration file. The first module is PopulationConfigGenerator ConstantSize, which we have assigned the object id pcg. The object id, which can be any text string, allows other objects to reference this object. We have set two parameters for this module, which specify the population size (population size = 10) and the number of generations to simulate (generation count = 3).
The second module is SimulatorConfig, the top-level module. This module must be specified last in the configuration file, because it has references to the primary modules (the general rule is: objects must be specified before they can be referenced by other objects). In this case, we are telling SimulatorConfig to use object pcg as our PopulationConfigGenerator, which is the only primary module that is required to be specified. The other primary modules have trivial defaults, and we leave them unspecified. In addition to the primary module references, SimulatorConfig also contains global simulation parameters. In this case, we have specified the output directory to be output_tutorial_0_minimal.
The simulator queries the PopulationConfigGenerator each generation to obtain a population configuration, which contains the information necessary to create the next generation from the previous one. In this case, the population configuration is the same each generation -it tells the simulator to create a population of 10 individuals with parents drawn from the same population in the previous generation.
When you run forqs on this example (tutorial_0_minimal.txt), you will see screen output showing that the simulation ran for 3 generations. You will also see the new output directory output_tutorial_0_minimal. If you look in this directory, you'll see a single file: forqs.simconfig.txt. This file contains the configuration used for the simulation, including unspecified default parameters. However, there are no other output files in the directory, so we don't actually know what happened during the simulation. To get more information, we need to specify Reporters to report output -we show how to do this in the next section.

Recombination and reporting output
In this example, we add modules to include recombination and report output: When creating offspring chromosomes from parental chromosomes, the simulator uses the RecombinationPositionGenerator module to generate lists of recombination positions.
In this case, we are using RecombinationPositionGenerator Uniform, which chooses the positions uniformly at random along the chromosome. Note the use of the compound parameter chromosome length rate that specifies both the length of the chromosome and the recombination rate.
Reporter modules are used by forqs to output information about the simulation. The user can specify an arbitrary number of Reporters, depending on what output is desired. Reporter Population outputs forqs population files, where each individual is represented as a list of chromosomes (one chromosome per line) and each chromosome is represented as a list of haplotype chunks. By default, population files will be produced only for the final generation -setting update step = n will tell the Reporter Population to output population files every n generations. You can see this by uncommenting (removing the '#' character) the update step line in the configuration file and re-running the simulation.
The population files contain the raw information about the simulated individuals, and are not meant to be analyzed directly. Instead, they can be used to propagate existing neutral variation from the founding haplotypes of the initial generation to the mosaic haplotypes of the final generation (see Section 5).
However, we will use the population files to illustrate forqs' internal representation of chromosomes. The first chromosome in the population file from the final generation is: This means that the first chromosome of the first individual is made up of two haplotype chunks: the first chunk (0,12) starts at position 0 and comes from founder individual 12; the second chunk (704958,7) starts at position 704958 and comes from founder individual 7. Now comment out the following line in SimulatorConfig: recombination_position_generator = rpg which 'un-plugs' the RecombinationPositionGenerator from the simulator. Now re-run the simulation, specifying a different output directory on the command line: forqs tutorial_1_recombination_reporter.txt output_directory=out2 When you look at the resulting final population file, you will see that all chromosomes consist of a single chunk, as you would expect with no recombination. In this example, the population size is 100, the initial allele frequency is .5, and the simulation runs for 200 generations.

Wright-Fisher simulation
The Locus module defines a single site -in this case it is position 100000 on chromosome 1. The Locus has id my locus, and this id is used by two other modules to refer to this locus.
As we saw in the previous example, forqs represents chromosomes as lists of haplotype chunks, with no information about nucleotides or variant (SNP) values at any position. (To be concrete, we think of a variant value as being 0 or 1, but forqs variant values can be anything in the range 0-255, so we are not restricted to bi-allelic SNPs). However, if we give forqs the variant values for the founding haplotypes, it can calculate variant values for any mosaic chromosome. This information is represented by a VariantIndicator, which is a function that tells forqs which founding chromosomes contain which variants.
In our case, we specify a VariantIndicator SingleLocusHardyWeinberg, which assigns variants to individuals in Hardy-Weinberg proportions. Because we have specified an initial allele frequency of .5, this means that of the founding individuals, 25% will be homozygote 0, 25% will be homozygote 1, and 50% will be heterozygotes.
We also want to track the allele frequency at this locus, so we specify a Reporter AlleleFrequencies, which reports the allele frequency at each generation. You will find the output in the file allele_frequencies_chr1_ pos100000.txt.
You may have noticed the commented line with the parameter population count = 10. By uncommenting this line, you will tell forqs to simulate 10 populations. PopulationConfigGenerator ConstantSize does not allow migration between populations, so this effectively gives 10 independent Wright-Fisher simulations. The resulting allele frequency trajectories can all be found in the same output file, with one column per population. As expected, you will see that the alternate allele (1) has been fixed or lost in some populations, and is still polymorphic in others.

Selection
In this example, we add selection to our Wright-Fisher simulation. For this simulation, we add a quantitative trait that is determined by the genotype at a single locus. In fact, in our case, the quantitative trait is fitness (QuantitativeTrait SingleLocusFitness), and we use the identity function for our fitness function (FitnessFunction Identity). The parameters w0, w1, w2 are used to specify the fitness for individuals with genotype 0, 1, and 2, respectively. In general, a quantitative trait can depend on multiple loci on multiple chromosomes, and fitness can depend on multiple quantitative traits.
We also add some useful Reporters that output the mean fitnesses of the populations as well as the deterministic trajectories expected in the limit of infinite population size. These output files can be easily read into a plotting application (e.g. R) to compare the random trajectories with the deterministic trajectories ( Figure 3).

Trajectories
This section shows how to use Trajectory modules to specify population sizes that change over time. In general, Trajectory modules can be used to specify any numeric value that varies by population and generation -other examples include mutation rates and migration rates.
In the first example (tutorial_4a_trajectories.txt), there are two populations following the same population size trajectory: constant size 100 until generation 2, followed by a linear increase in population size until it reaches 400 at generation 5, after which the population size stays at 400. This is accomplished by specifying each piece (Trajectory Constant, Trajectory Linear, Trajectory Constant) and composing the pieces together with Trajectory GenerationComposite. Note that PopulationConfigGenerator LinearSteppingStone has a population size parameter whose value must be the id of a Trajectory module (rather than a numeric value). Also, unless otherwise specified, each Trajectory gives the same numeric value for each population in a given generation. You can see the raw population configuration generated by including the parameter write popconfig = 1, which tells forqs to write out the population configs in a file forqs.popconfig.txt. After running this example, you can examine forqs.popconfig.txt to see the population sizes.
In the next example (tutorial_4b_trajectories.txt), population 1 follows the same population size trajectory as in the first example, while population 2 experiences exponential growth, doubling each generation (with initial population size 1000). In order to specify this scenario, each population size trajectory is specified separately (popsize pop1 and popsize pop2 in this example), and the two are composed with Trajectory PopulationComposite. When Trajectory PopulationComposite is queried for a value for a population, it returns the value obtained from the appropriate sub-trajectory.

Quantitative traits
In this section, we present an example of specifying a quantitative trait by specifying quantitative trait loci (QTLs) and their effect sizes. LocusList loci loci = locus1 locus2 locus3 VariantIndicator_Random vi locus_list:population:frequencies = loci * .5 .5 .5 QuantitativeTrait_IndependentLoci qt qtl = locus1 0 . In this example, we specify a single population (size 100) where individuals have 3 chromosomes. We specify 3 loci, each on a different chromosome.
VariantIndicator Random is used to assign variants to the initial haplotypes randomly, according to the specified allele frequencies -in this case, the variant allele frequency is .5 for each locus. The '*' indicates that all populations have the same allele frequencies -because there is only one population, we could have used '1' in place of '*'. The parameter write vi = 1 in SimulatorConfig tells forqs to write the file forqs.vi.txt containing lists of the initial haplotypes carrying the variant allele at each locus.
QuantitativeTrait IndependentLoci specifies a quantitative trait where the effects of variants at different loci are independent of each other, i.e. there are no epistatic effects. Each QTL is listed with its per-genotype effects -in this case, individuals with genotype 0, 1, or 2 at locus1 will have 0, .1, or .2 added to their trait value, respectively. Alternatively, QTL effect sizes and dominance values can be chosen randomly according to specified distributions. For an example of this, see example_qtl_random.txt which also demonstrates the use of LocusList Random to pick the loci randomly as well.
In this example, FitnessFunction TruncationSelection assigns fitness values of 1 to individuals in the upper half of the distribution, and 0 to individuals in the lower half (proportion selected = .5). Reporter AlleleFrequencies uses the parameter quantitative trait = qt, which tells it to report allele frequencies for all QTLs associated with the trait. Reporter TraitValues reports mean trait values for the population each generation; the parameter write full = 1 tells it to also report individual trait values for each generation.

Reporting haplotype frequencies
This example shows the use of Reporter HaplotypeFrequencies to report local haplotype frequencies in the simulated populations.
In this example, there is a single locus that confers an additive fitness advantage to the individuals carrying the selected variant. The VariantIndicator specifies which haplotype ids actually carry the variant: in this case, VariantIndicator IDRange assigns the value 1 to the first 1000 haplotype ids, i.e. the first 500 individuals are homozygous for the selected variant.
Reporter HaplotypeFrequencies uses a HaplotypeGrouping module that specifies how haplotype ids should be grouped. The grouping can be as finegrained as a single haplotype id per group, but for simplicity we have specified just two groups: those individuals who carry the selected variant, and those who don't. Reporter HaplotypeFrequencies then reports local proportions of the two groups in the population.
We have specified a relatively high recombination rate (3 crossovers per meiosis on average) and strong selection (.1 selection coefficient) in order to illustrate the change in local haplotype frequencies in just a few generations. After running the example, the output file haplotype_frequencies_chr1_final_pop1. txt shows that haplotypes carrying the selected variant have risen to high fre-quency near the selected locus, but not as much in regions farther away, due to recombination. The parameter update step = k will result in reporting of local haplotype frequencies every k generations (default is to report only the final generation).

Module types and interfaces
forqs has a single top-level module called SimulatorConfig. There are seven primary modules, corresponding to the orange boxes Figure 2, which represent the main places where forqs can be customized. There are also several building block modules, which are used by the primary modules.
Multiple modules may share a common interface, as indicated by their names. For example, Reporter Timer and Reporter AlleleFrequencies both implement the Reporter interface. Modules with the same interface can be used interchangeably by the simulator.
In addition, specific modules can be instantiated multiple times (with different object ids and parameters) -for example, Reporter AlleleFrequencies can be used to report allele frequencies at several different loci.
By mixing and matching different modules, the user has a great deal of flexibility in creating customized simulation scenarios. In addition, forqs can easily be extended with new functionality by adding new modules that implement existing module interfaces.

Module specification and instantiation
Each forqs module specified in the configuration file corresponds to an object that is instantiated by the simulator before the simulation starts. The first line of a module specification contains the module name and object id. Each subsequent line specifies a parameter of the module, formatted as a name-value pair: name = value. The value is read as a string, but may be interpreted as an integer, floating point, string, or list, depending on the parameter. Each module determines how its parameter values are interpreted, and interpretation errors are reported to the user. In some modules, multiple parameter values may be specified with the same name, in which case each value specified is appended to a list. A blank line ends the specification for that module.
As an example, here is the specification of a Locus module, which represents a single nucleotide position: Locus my_locus chromosome = 1 position = 500000 The first line specifies a Locus object with id my locus. The two parameters specify which chromosome (1) and the position on the chromosome (500000).
After parsing this module specification, the simulator instantiates a Locus object, and puts the object in a registry under the name my locus. Subsequently instantiated objects may need a reference to this object, which can be obtained by looking up the object by name in the registry. Because modules are instantiated in the order they are specified in the configuration file, it is important to specify an object before it is referenced by other objects (otherwise forqs will produce an error message that it couldn't find the referenced object in the registry).
Object references are specified as string-valued parameters, where the value contains the object id of the referenced object. For example, the following is the specification of a Reporter module that holds a reference to our example Locus object my locus:

Reporter_AlleleFrequencies reporter_allele_frequencies locus = my_locus
One final note on a parameter naming convention used in forqs -for parameters where the value is a list of sub-parameters, the parameter name should be a concatenation of the sub-parameters. This convention allows easier reading and editing of configuration files without the need to consult documentation. For example, Trajectory GenerationComposite represents a list of Trajectory modules, to be followed piecewise at specified generations. This module is parametrized by a list of pairs (generation, trajectory), using the parameter name generation trajectory: Trajectory_GenerationComposite id_migration_rate generation_trajectory = 0 id_migration_rate_0 generation_trajectory = 5 id_migration_rate_1 In this case, the module configuration can be interpreted as follows: at generation 0 use trajectory id migration rate 0, and then at generation 5 start using trajectory id migration rate 1.

Top-level module: SimulatorConfig
SimulatorConfig is the top-level module containing: • global simulation parameters • references to the primary modules Because SimulatorConfig has references to the primary modules, it must be specified last in the configuration file.

Global simulation parameters
The main global simulation parameters are: • output directory: forqs will create this directory and place all output files here • seed: seed for the random number generator Command line parameters can also be specified on the command line as name=value with no whitespace, as shown in this example: forqs config.txt output_directory=mydir seed=12345 If a parameter is specified on both the command line and in the configuration file, the command line takes precedence.
If no random seed is specified, forqs will first look in the current working directory for the file forqs.seed containing a previously generated seed. If no seed is found, forqs will generate a new seed from the system time. At the end of each run, forqs generates a random seed and writes it to forqs.seed. To summarize, forqs looks for the seed in the following places, in order of preference: 1. command line 2. configuration file 3. forqs.seed

References to primary modules
Of the primary modules, only PopulationConfigGenerator is required to be specified, because it contains necessary information about population sizes and the number of generations. The rest of the modules default to trivial implementations. The primary modules specified in SimulatorConfig are: For the reproduction/transmission step, in addition to information about the current populations, the simulator needs to know the population configuration for the next generation, which includes: • number and sizes of the populations in the next generation • mating distribution, which describes how parents are chosen from populations in the current generation to create offspring in the next generation The PopulationConfigGenerator module provides an interface for the simulator to obtain a population configuration for each generation. To create each new individual in the next generation, the simulator chooses parent populations according to the mating distribution, and then individual parents with probability proportional to their fitnesses. For the initial generation, individuals are assigned haplotype ids sequentially starting at 0, with two ids assigned per individual (one for the maternal chromosomes, and one for the paternal chromosomes). Id offsets can be used to make it easier to distinguish between populations -for example, individuals from population 1 may be assigned ids starting at 0, and individuals from population 2 may be assigned ids starting at 1000000.
PopulationConfigGenerator File allows the user to specify a population configuration for each generation, giving the user precise control over the demographic histories of the simulated populations. Alternatively, forqs provides higher level PopulationConfigGenerators that allow the user to specify time-dependent population size and migration rate trajectories (e.g. PopulationConfigGenerator LinearSteppingStone). When a migration rate is specified from a source population to a destination population, it is interpreted to be the probability that a new child in the destination population has parents in the source population. Equivalently, it is the expected proportion of individuals in the next generation of the destination population whose parents were in the source population.
For debugging purposes, forqs optionally outputs the file forqs.popconfig. txt with the population configuration that was used for each generation during the simulation. This option can be selected by setting the write popconfig parameter in SimulatorConfig: write_popconfig = 1

RecombinationPositionGenerator
The RecombinationPositionGenerator module provides an interface for the simulator to obtain a list of recombination positions. After querying the RecombinationPositionGenerator, the simulator uses the list of recombination positions to create an offspring chromosome from a pair of parental chromosomes.
There are multiple implementations of the RecombinationPosition-Generator module interface, each of which uses a different method to generate the recombination position list.
RecombinationPositionGenerator SingleCrossover assumes there is a single crossover event at a position chosen uniformly at random over that chromosome. This results in the offspring receiving one of the two parental or one of the two recombined chromosomes with equal 25% probabilities.
RecombinationPositionGenerator Uniform generates recombination positions according to a uniform Poisson process along the chromosome, with rate specified by the user.
RecombinationPositionGenerator RecombinationMap generates recombination positions according to previously estimated recombination maps, in the format used by the HapMap project.

MutationGenerator
The user may specify a MutationGenerator module to generate random mutations during the simulation. MutationGenerator provides a single interface through which the simulator obtains a list of new mutations for a particular generation. However, MutationGenerator implementations may differ in how they generate this list of mutations. For example, MutationGenerator SingleLocus generates mutations only at a single site, while MutationGenerator Regions generates mutations in multiple regions with different (possibly time-dependent) mutation rates. In any case, the user will most likely want to also specify Reporter Regions to report the final mutated sequences at the end of the simulation.
forqs current mutation implementation creates a new haplotype id for each new mutation, storing it in a haplotype ancestry tree (so that information about the ancestral haplotype can be preserved). This results in memory usage that increases with each generation, which may cause performance degradation in simulations involving a high total mutation rate for a large number of generations. (Here total mutation rate means θL, where θ = 4N µ is the populationscaled per-site mutation rate and L is the total length of the genomic region where new mutations are being generated).

VariantIndicator
Internally, forqs represents each individual chromosome as a list of haplotype chunk ids, with no information about particular variants carried on that chromosome. In order to obtain an individual's genotype at a locus, the simulator must use a VariantIndicator, which is essentially a function that maps: (locus, haplotype id) → variant value Typically these variant values will be 0/1 for SNPs; however, any value in the range 0-255 can be used.
In general, a VariantIndicator needs to specify variant values only at selected loci, because neutral loci have no effect on the simulation. Ancestral neutral variation that is present on the haplotypes of the founders (individuals in the initial generation) can be propagated to individuals at the end of the simulation using the forqs map ms tool included in the forqs package.
Several VariantIndicator implementations are available, allowing the user a variety of options for specifying the variants carried by the founders -these can be found in the forqs Module Reference.

QuantitativeTrait
A QuantitativeTrait module represents a single quantitative trait. The module encapsulates the information needed to calculate the trait value for each individual based on the individual's genotypes at a list of user-specified loci. For example, QuantitativeTrait IndependentLoci allows the user to specify loci and effect sizes, and the trait value for each individual is calculated by combining that individual's locus-specific effects additively (with user-specified environmental variance). Full documentation on the available QuantitativeTraits can be found in the forqs Module Reference.

FitnessFunction
A FitnessFunction module represents the function used to calculate an individual's fitness based on the individual's trait values. In simple cases, the quantitative trait may actually be fitness, in which case the user may simply use FitnessFunction Identity. In more complicated cases, fitness may depend non-trivially on one or more quantitative traits. Full documentation on the available FitnessFunctions can be found in the forqs Module Reference.

Reporter
Reporter modules are used to report information during the simulation. For example, Reporter AlleleFrequencies reports the allele frequency of a userspecified locus at each generation. The full list of available Reporters can be found in the forqs Module Reference.

Locus and LocusList
The Locus module represents a single site on a chromosome, specified by the chromosome pair index and position on the chromosome: Alternatively, the loci can be references to previously specified Locus objects: Also, a random list of loci can be generated with LocusList Random: LocusList_Random locus_list_random locus_count = 10

Trajectory
In simple situations, constant parameter values can be used to specify an aspect of the simulation. For example, it is common in population genetics to simulate models where population sizes are constant. In more complicated scenarios, parameter values (e.g. population sizes, migration rates, mutation rates, optimal quantitative trait values) may vary in space (population) or in time (generation). forqs uses Trajectory modules to describe such varying quantities. In essence, a Trajectory represents a function: (population index, generation index) → value Trajectories can be built by the user by starting with simple building blocks (e.g. constant, linear, polynomial, exponential functions) and then composed either by population or by generation. Once the Trajectory has been defined, it can be referenced by name to be used as the parameter of another module. Note that if a module expects a particular parameter to be a Trajectory, it is an error to pass a constant integer or floating point number as the parameter value. In this case, if the user really wants a constant parameter value, Trajectory Constant should be specified.

Distribution
Distribution modules represent probability distributions. These modules are used to model, for example, effect sizes for QTLs or environmental variance for a quantitative trait.
Distribution Constant can be used in situations where a module requires a reference to a Distribution but a constant value is desired.

Simulating background variation
Forward-in-time simulators often use a burn-in period to allow neutral variation to reach mutation-drift equilibrium. An alternative strategy, adopted by forqs, is to use an existing program (such as Dick Hudson's ms) to generate neutral variation for the founders in the initial population. Chromosomes in the final populations are mosaics of the founder chromosomes, so neutral variation can be propagated to the mosaic chromosomes to generate sequence (or SNP) data. The tool forqs map ms is included in forqs packages for this purpose. In this section we discuss the propagation procedure using two examples.

Example 1: No new mutations
The first example is a simple case where there are no new mutations introduced during the forqs simulation. This case includes many scenarios of interest where the variation in the population has been generated over a small number of generations, primarily by recombination and recent admixture between historically isolated (or inbred) populations.
In this case, the propagation of neutral variation is straight-forward: sequence variants on founder haplotypes are mapped onto the mosaic chromosomes of the final population. However, there are many details about the mapping that must be specified by the user in a mapping configuration file. forqs produces final populations of individuals that may carry multiple pairs of chromosomes, with absolute positions specified by integers. ms outputs single chromosomes, with relative positions specified by floating point numbers. The mapping configuration thus needs to specify an integer range of positions on a particular chromosome that correspond to the ms relative positions in the range [0,1]. In addition, the user must specify how forqs haplotype ids correspond to the ms-format variant sequences.
In the examples directory, the following 4 files can be found: • population example.txt (forqs population data) • ms test data 1.txt, ms test data 2.txt (ms-format files) • ms map config.txt (mapping configuration file) Perform the mapping using the following command: forqs_map_ms population_example.txt ms_map_config.txt > output.ms The resulting sequences in output.ms are mosaics of the sequences in the input ms-format files. For example, the first chromosome of the first individual is a mosaic of haplotypes 0, 1, 2, and 3, corresponding to the sequences 'aaaaaaaaa', 'bbbbbbbbb', 'ccccccccc', and 'ddddddddd', respectively. The output sequence for this chromosome is 'aabccccdd'. Note that alphabetic letters are used in these example sequences for illustration only -real ms output consists of 0's and 1's.

Example 2: Including new mutations
In scenarios where it is important to include new mutations in addition to ancestral neutral variation, the mapping procedure has one additional complication. When simulating forward in time, the neutral variants carried by individuals have no effect on the dynamics of the simulation. forqs takes advantage of this by ignoring (i.e. not storing in memory) any ancestral neutral variation on the founding haplotypes. Each new mutation generated in forqs results in a new haplotype id. Because of this, it is necessary to use an id ancestry map to translate the new ids back to the ancestral ids before performing the mapping. After the neutral variation has been mapped, the new mutations can then be merged with the ancestral variants. This procedure is detailed in the following shell script (and data files referenced in the script), which runs a complete example of a forqs simulation followed by mapping ms sequences (with new mutations merged): examples/ms_map_example_new_mutations.sh

Validation
forqs has an extensive set of unit tests that verify the correctness of individual code modules. In addition, in order to validate the larger-scale behavior of the simulations, we have compared forqs simulation results to theoretical predictions from population genetics and quantitative genetics.

Single locus selection
We first considered simple scenarios where an individual's fitness is determined by that individual's genotype at a single locus, under a wide range of fitness effect sizes and dominance values. We compared the simulated allele frequency trajectories to the deterministic trajectories predicted in the limit of infinite population size.
In all cases, the simulated trajectories closely followed the deterministic trajectories, with better agreement in simulations with larger population sizes, as expected. Figure 4 shows the case of additive positive selection, for population sizes of 100 and 1000.

Decay of linkage disequilibrium
In the limit of infinite population size, linkage disequilibrium (measured by D, the correlation between two variants at different sites) decays geometrically at rate 1 − r, where r is the recombination rate (see Figure 5 for an example run).

Mutation-drift equilibrium
To validate our mutation implementation, we approximated the infinite-sites model by simulating mutations generated randomly in large regions (1mb). We set the per-site mutation rate to make the population-scaled mutation rate θ = 10, and simulated for enough generations to reach mutation-drift equilibrium.
We then compared the resulting site-frequency spectra {ξ i } to theoretical predictions, where ξ 1 is the number of singletons, ξ 2 is the number of doubletons, etc. in the sample. Coalescent theory predicts that ξ i = θ/i in expectation (Fu 1995), under the assumption that the sample size is small compared to the population size. When the sample size is close to the population size, the Wright-Fisher model is expected to have approximately 12% more singletons and 2% less doubletons compared to the coalescent expected values (Wakeley and Takahashi 2003).
Site frequency spectra generated from the full population agree with the Wright-Fisher large sample expected values, and site frequency spectra from smaller samples agree with the coalescent expected values, as expected ( Figure  6).

Response to selection
To validate our implementation of selection on quantitative traits, we simulated a quantitative trait with 10 QTLs with identical additive effect sizes and initial allele frequencies, which gives an approximately normal distribution of trait values in the population. We also used the forqs fitness function FitnessFunction TruncationSelection, which selects a specified proportion of individuals at the upper tail of the trait value distribution to produce offspring for the next generation. The Breeder's Equation from quantitative genetics (Gillespie 2004;Falconer and Mackay 1996) predicts the response to selection R from the heritability of the trait h 2 and the selection differential S: where the S and R are the selected parent mean and offspring mean, respectively, measured as deviations from the population mean.
Simulations run with various values for the heritability and proportion of individuals selected agree with the response values predicted from the Breeder's Equation (Figure 7). 6.5 Performance The representation of chromosomes as haplotype chunks in forqs makes very efficient use of memory, independent of the size of the chromosomes. This allows, for example, simulation of entire genomes of individuals. One drawback to this approach is that memory usage grows linearly with the number of generations, due to recombination. On a typical laptop computer, for a population size of 1 million, simulations take ∼1.5 seconds per generation for neutral scenarios and ∼3 seconds per generation with quantitative traits and selection. Decreasing the population size allows the simulation of a greater number of generations in a reasonable amount of time: for a population size of 10,000, it takes ∼3 seconds per 100 generations (without selection, with a slight increase with selection). Configuration files for these benchmark scenarios are included in the forqs packages (examples/benchmark_*.txt).
To illustrate the simulation of entire genomes, the configuration file examples/ example_qtl_random_full_human_genome.txt specifies a scenario where two populations (each size 10,000) are selected for different optimum values of a quantitative trait (Figure 8). In this simulation, individuals have human-sized genomes (23 chromosome pairs, chromosome length 100 million base pairs), with 50 QTLs placed randomly across the genome; the simulation takes ∼2 seconds per generation (averaged over 50 generations).
Such a simulation is not feasible for forward-in-time simulators that store individual mutations in an array. For example, suppose that each locus for each individual chromosome is encoded in a single bit. Then to simulate 20,000 individuals as in the above scenario, 16GB of memory (available on high-end computers) will permit the simulation of only 3.2 million base pairs of sequence. This is an order of magnitude smaller than the smallest human chromosome (or roughly 0.1% of the human genome).  The unit test for each code unit is run during the build; the build is successful only if all unit tests pass.
In addition to the unit tests, there are also regression tests, consisting of scripts in the regression test directory. These tests prevent unintended behavior changes in the software during new development. Most of the tests are based on the example configuration files in the examples directory, and consist of running forqs and performing a diff comparison with known output. The regression tests are controlled by a Makefile, and run by make.
The codebase includes all documentation, including the Latex source code for this document. The forqs Module Reference is generated by Doxygen from the source code -each module is documented using Doxygen markup in the module's header file, immediately preceding the module's class declaration.

Software architecture
This section describes some of the design details of forqs that may be of interest to programmers who would like to add new features.

Low-level data structures
These are the basic low-level data structures used by forqs: • HaplotypeChunk: a pair of integers (position, id ) • Chromosome: an array of HaplotypeChunks In this setup, Population can be seen as an array of arrays, which requires a significant number of memory allocations. To avoid these memory allocations, Population was re-implemented as a two dimensional array of Chromosome pairs, resulting in a roughly 30% speedup.
The implementations produce identical results, and can be used interchangeably.
To accomplish this, Population is actually an interface class, with two concrete implementations: Population Organisms and Population ChromosomePairs. While the latter is faster, the former can be more convenient for testing purposes.
In order to handle both cases in a unified manner, the class ChromosomePairRange represents the begin/end of chromosome pairs belonging to a single individual, and ChromosomePairRangeIterator does the appropriate iteration through individuals, depending on the memory layout of the population.

Configurable modules
Configurable modules in forqs are represented by the Configurable abstract base class, which defines the common interface through which objects are instantiated and initialized based on parameters specified by the user. The top-level module (SimulatorConfig), all primary modules (e.g. PopulationConfigGenerator, Reporter, etc.), and all building block modules (e.g. Trajectory, Locus) implement the Configurable interface.
The Configurable interface describes the functionality required to serialize an object's configuration, which must be representable as a list of name-value pairs. (This is not actually very restrictive, since the value can be anything representable as a string, including an array of numbers).
Configurable objects are instantiated by SimulationBuilder Generic, which handles the mapping from the module name to the actual C++ class. Each object in the user-specified configuration file is instantiated and registered by name so that other objects can obtain references to it if necessary. SimulationBuilder Generic produces the final SimulatorConfig used by forqs for the simulation. (Note on the name: there were other SimulationBuilder classes that constructed SimulatorConfigs based on a more limited set of parameters -these have since been deprecated in favor of the more generic specification via configuration files.) After instantiation, Configurable objects are configured/initialized in two steps. First, the configure() method sets any parameters specific to this module that are specified by the user. Second, after all objects have been configured, the initialize() method allows the modules to communicate with each other. For example, objects that need information about the lengths of the chromosomes (e.g. for recombination) can query the PopulationConfigGenerator, which has this information.
Configurable objects are instantiated, configured, and initialized in the order they are specified in the configuration file. It is important to note that while objects can obtain references to previously instantiated objects during the configure() step, they should not try to communicate via these references until the initialize() step, since the objects they refer to may not be fully initialized until then.
The main Simulator object interacts with Configurable objects through intermediate interfaces. This allows the main simulation logic to be separated from the specific behavior implemented by the various Configurable modules. For example, Reporter AlleleFrequencies and Reporter TraitValues both implement the Reporter interface. Simulator will give each Reporter information about the current populations, but doesn't need to know what they do with this information. As another example, the Simulator gives a QuantitativeTrait object information about individuals' genotypes and expects to receive trait values for each individual, but doesn't need to know details about how this is accomplished.
The Configurable interface also facilitates some features that make forqs more user-friendly, because it provides a translation layer between the user and program internals. For example, while 0-based indexing is used internally, from the user's perspective chromosomes and populations are numbered starting with 1. Also, Configurable modules can support multiple alternate parametrizations. For example, linear trajectories may be parametrized by slope-intercept or by two endpoints, and exponential distributions may be specified by the mean or the rate, depending on which is more natural for a particular scenario.

Mutation handling
Because forqs tracks haplotype chunks rather than sequences of variants, mutation is necessarily more complicated than recombination. This is because a new point mutation essentially creates a new haplotype that is identical to the original except at the mutated site. In order to accomplish this, the VariantIndicator must be able to be updated with the new haplotype and variant value. In addition, the haplotype's ancestry must be stored, since that haplotype chunk may contain other variants known by the VariantIndicator. This is implemented with the special VariantIndicator Mutable that is not specified by the user. Instead, it is a wrapper class that is automatically instantiated when the user specifies a MutationGenerator. This wrapper class provides the functionality for updating variants, but passes calls through to the user-specified VariantIndicator for non-mutated loci.