Abstract

Summary: MODELER4SIMCOAL2 (M4S2) is an extensible graphical tool to model linked loci and population demographies. M4S2 is easy to use yet allows complicated scenarios to be modeled, making coalescent simulation modeling accessible to biologists with limited computer skills. The software includes an extension system that allows new models to be created, published and downloaded from the Internet.

Availability: M4S2 is available from http://popgen.eu/soft/m4s2 under a GPL license. The web site also contains guides, screen shots and tutorials.

Contact: tra@fc.up.pt

1 INTRODUCTION

The coalescent is now widely used in population genetics (e.g. Akey et al., 2004; Voight et al., 2006). Its popularity is largely due to its computational efficiency, which allows for relatively rapid simulation of complex or large data sets (e.g. many loci and individuals). Simulations are mainly used to estimate parameters and determine confidence intervals for statistics (e.g. Fst, Tajima's D) and also to quantify the power and reliability of statistical methods (e.g. Tallmon et al., 2004). For example, one procedure to detect candidate adaptive loci is to simulate thousands of neutral loci, compute summary statistics and identify those genotyped loci that fall outside the simulated confidence intervals for neutral loci.
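
The outlier procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the "simulated" neutral values are placeholder random draws rather than SIMCOAL2 output, and a simple empirical quantile interval stands in for whatever confidence envelope a real study would use.

```python
import random


def empirical_interval(values, level=0.95):
    """Return the (low, high) empirical quantiles enclosing the central `level` mass."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    low_index = int(tail * (n - 1))
    high_index = int(round((1.0 - tail) * (n - 1)))
    return ordered[low_index], ordered[high_index]


def flag_outliers(observed, simulated, level=0.95):
    """List the observed loci whose statistic falls outside the simulated interval."""
    low, high = empirical_interval(simulated, level)
    return [name for name, stat in observed.items() if stat < low or stat > high]


# Placeholder neutral distribution (assumption): in practice these values would be
# summary statistics computed on data sets simulated under a neutral model.
rng = random.Random(42)
simulated_fst = [rng.betavariate(2, 20) for _ in range(1000)]

# Hypothetical observed per-locus statistics.
observed_fst = {"locus_A": 0.08, "locus_B": 0.65, "locus_C": 0.10}
candidates = flag_outliers(observed_fst, simulated_fst)
print(candidates)
```

Loci returned by `flag_outliers` are only candidates: how they are interpreted depends on how well the simulated demography matches the real one, which is the robustness question raised in the next paragraph.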

The models used in most publications are simple, with at most two demographic events in a single model (e.g. a bottleneck followed by a split into two populations). Although the models used are gross simplifications of reality, little is known about the behavior of relevant statistics under more complex models. More importantly, few studies have examined the robustness of model simplification, i.e. whether simple models can replace more complicated ones and still give similar confidence bands for important estimates.

Most of the existing coalescent simulation programs either provide a flexible but complex domain-specific language, which requires programming skills beyond the abilities of much of the target audience (e.g. Laval and Excoffier, 2004; Mailund et al., 2005), or impose quite strict limitations on what can be simulated, both for the demography and for marker modeling (Beaumont and Nichols, 1996; Hudson, 2002; Spencer and Coop, 2004). Even the applications that impose limitations are often not easy to use. Importantly, it is also easy to make errors in the modeling part (e.g. by wrongly calculating growth rates or migration matrices). Furthermore, in some simulators, creating scenarios with many populations by hand is impossible in practice, as it requires writing very large matrices.
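
To make the two error-prone calculations mentioned above concrete, the following Python sketch computes an exponential growth rate from two population sizes and builds a migration matrix for a linear stepping-stone arrangement of demes. The function names are hypothetical and the conventions are assumptions: the rate is expressed forward in time, and entry [i][j] is taken to be the migration rate from deme i to deme j. A given simulator may use the opposite sign or orientation, so its documentation should be checked before reusing such values.

```python
import math


def exponential_growth_rate(n_start, n_end, generations):
    """Per-generation rate r such that n_end = n_start * exp(r * generations).

    Forward-in-time convention (assumption); a backward-time simulator
    may expect the opposite sign.
    """
    if n_start <= 0 or n_end <= 0 or generations <= 0:
        raise ValueError("sizes and generations must be positive")
    return math.log(n_end / n_start) / generations


def stepping_stone_matrix(n_demes, m):
    """Migration matrix for a linear stepping-stone model.

    Each deme exchanges migrants at rate m with its immediate neighbours only;
    all other entries, including the diagonal, are zero.
    """
    matrix = [[0.0] * n_demes for _ in range(n_demes)]
    for i in range(n_demes):
        for j in (i - 1, i + 1):
            if 0 <= j < n_demes:
                matrix[i][j] = m
    return matrix


rate = exponential_growth_rate(100, 1000, 50)
matrix = stepping_stone_matrix(20, 0.001)
print(rate, len(matrix), len(matrix[0]))
```

Even this modest 20-deme example yields a 20 x 20 matrix of 400 entries, which illustrates why writing such matrices by hand quickly becomes impractical.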

As the existing programs are generally difficult to use, only very simple scenarios tend to be modeled. This is probably one reason why few or no studies have evaluated the potential problems (e.g. biases) from using simplified models or why most research articles avoid complex models that represent reality more closely.

There are applications that facilitate the modeling of demographies and marker relationships. COASIM has an interface that allows modeling of marker relationships, but only one demography is possible, namely a population of constant size. EASYPOP (Balloux, 2001) is a forward-time simulator with an easy-to-use command-line interface that allows the simulation of several reference demographies. EASYPOP has the following limitations: a single recombination rate among physically adjacent loci; some more specialized demographies are either impossible or difficult to model; and, being a forward-time simulator, it has all the advantages and disadvantages of this type of simulator, notably that forward-time simulators are computationally much more intensive than coalescent ones. In any case, EASYPOP's easy-to-use command-line modeling facilities make it the most direct comparison with our graphical modeler.

2 IMPLEMENTATION

In order to make a more user-friendly system for modeling coalescent processes, we designed M4S2, an extensible graphical modeler for demography and linked loci. M4S2 is a Java Web Start application, allowing direct execution from the web. Currently, the program generates input files to be run on SIMCOAL2 (Laval and Excoffier, 2004), a coalescent simulation program. The application is divided into two parts: one to model markers that are unlinked or physically linked in chromosome blocks, and another to model different population demographies.

The first issue that the application addresses is modeling chromosome blocks, i.e. blocks of the genome that are independent from a linkage disequilibrium point of view. Each chromosome block has a set of markers that are physically linked. The modeler supports both markers with only frequency information (SNPs and RFLPs) and markers with genealogical information (microsatellites and DNA sequences). The full expressive power of SIMCOAL2 is captured by M4S2's chromosome modeler (an image is available in the screen shot section of the web site); e.g. the generalized stepwise mutation (GSM) model for microsatellites, which allows single or multistep mutations, is fully supported.
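
As an illustration of what single or multistep mutation means under a GSM, the sketch below mutates a microsatellite repeat count in Python. It is not SIMCOAL2's implementation: the function name, the parameter `p_multi` (the probability of each additional step, giving a geometrically distributed step count) and the lower bound on the repeat count are assumptions made for the example. With `p_multi` set to 0 every mutation changes the allele by exactly one repeat, which corresponds to a strict stepwise model.

```python
import random


def gsm_mutate(repeats, p_multi, rng, min_repeats=1):
    """Apply one generalized-stepwise mutation to a repeat count.

    The number of steps is 1 plus a geometric number of extra steps,
    each added with probability p_multi; the direction is symmetric.
    """
    if not 0.0 <= p_multi < 1.0:
        raise ValueError("p_multi must be in [0, 1)")
    steps = 1
    while rng.random() < p_multi:
        steps += 1
    direction = rng.choice((-1, 1))
    # Keep the allele at or above a minimum repeat count (assumption).
    return max(min_repeats, repeats + direction * steps)


rng = random.Random(7)
single_step = [gsm_mutate(20, 0.0, rng) for _ in range(5)]
multi_step = [gsm_mutate(20, 0.3, rng) for _ in range(5)]
print(single_step, multi_step)
```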

The second part of M4S2 deals with demography modeling. A set of predefined demographies is built in (Fig. 1 shows nine different demographies), including many demographies that we found in the published literature using coalescent simulations. A screen capture of the modeling of a bottleneck can be seen on the web site.

Fig. 1.

Demography modeling in M4S2.

The demography modeler is fully extensible. An extension language was developed that can be embedded in SIMCOAL2 template files (which are converted to the final SIMCOAL2 format before simulation). In addition, a small parameter language allows the specification of parameters and constraints (e.g. if two parameters T1 and T2 refer to time events, then T1 has to be greater than T2). M4S2 reads the SIMCOAL2 template file, the parameter list and a supplied image illustrating the demography, and automatically creates a user-friendly interface for the new model. The system does not allow the user to violate the constraints imposed by the model designer.
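
The idea of a parameterized template guarded by constraints can be sketched as follows. This is not M4S2's extension or parameter language, whose syntax is not described here: the `$NAME` placeholder style (Python's `string.Template`), the tuple representation of constraints and the function names are all assumptions chosen for illustration, and the template text is a made-up fragment rather than a valid SIMCOAL2 file.

```python
import operator
from string import Template

# Comparison operators a hypothetical constraint may use.
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def violated_constraints(values, constraints):
    """Return a readable description of every constraint the values break."""
    violations = []
    for left, op, right in constraints:
        if not OPERATORS[op](values[left], values[right]):
            violations.append(f"{left} {op} {right}")
    return violations


def render_template(template_text, values, constraints=()):
    """Fill the placeholders only if all constraints hold."""
    violations = violated_constraints(values, constraints)
    if violations:
        raise ValueError("constraint(s) violated: " + ", ".join(violations))
    return Template(template_text).substitute(values)


# Hypothetical template fragment and constraint: T1 must be older than T2.
template_text = "event at $T1 generations\nevent at $T2 generations"
constraints = [("T1", ">", "T2")]
print(render_template(template_text, {"T1": 500, "T2": 100}, constraints))
```

Rejecting invalid values before any file is written mirrors the behavior described above, where the user cannot violate the constraints set by the model designer.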

If the template system is not enough to express the desired model, the template can be extended in Jython (Python on Java), giving the model creator full expressive power to create new demographies. This extension system is targeted at users with more computer science knowledge. M4S2 is capable of importing such externally created models directly from the web. We envision a scenario where a bioinformatician creates and publishes a set of new demographies and a population geneticist simply imports the new models directly from the web. This permits the separation of tasks between members of a team, allowing each one to concentrate on their core specialty. The application removes as much of the unnecessary computational complexity as possible for the population geneticist. In addition, we provide a set of supplementary models, which were designed to study the spread of domesticated species and humans from the Fertile Crescent across Europe and Asia.

After the modeling part is done, M4S2 creates SIMCOAL2 input files. M4S2 can then optionally be used to call SIMCOAL2 (to generate thousands of simulated data sets) and, after that, ARLEQUIN3 (Excoffier et al., 2005) for analysis of the simulation results (e.g. computation of Fst). M4S2 can also be used independently of SIMCOAL2 and ARLEQUIN3.

3 CONCLUSION

M4S2 will facilitate the study of more complex models, help increase our knowledge about the reliability of simpler models, and increase the usage of coalescent simulation by lowering the barriers to the use of this type of application.

In the future, we intend to support more simulators (both coalescent and forward-time), giving the user a choice of which application to use.

In an era where non-user-friendly software abounds, it is important to spare users unnecessary complexity. M4S2 does not intend to hide important modeling issues. On the contrary, by easing the burden of using a coalescent simulator, it exposes the user only to the fundamental modeling issues, thus emphasizing the importance of understanding the underlying theory. The tool also builds on existing programs, giving users not only easy-to-use interfaces for those programs but also new functionality built on top of them, instead of trying to recreate functionality already available elsewhere. It is in this philosophy that M4S2 was built.

ACKNOWLEDGEMENTS

T.A. was supported by research grant SFRH/BD/30834/2006 and A.B.-P. by research grants SFRH/BPD/17822/2004 and POCI/CVT/567558/2004, all from Fundacao para a Ciencia e Tecnologia, Portugal. G.L. received salary and research funding support from FLAD (Luso-American Foundation), UP and CIBIO, and received office space and secretarial and informatics support from UM.

Conflict of Interest: none declared.

REFERENCES

Akey JM, et al. Population history and natural selection shape patterns of genetic variation in 132 genes. PLoS Biol. 2004, vol. 2, pg. e286.
Balloux F. EASYPOP (version 1.7): a computer program for population genetics simulations. J. Hered. 2001, vol. 92 (pg. 301-302).
Beaumont M, Nichols R. Evaluating loci for use in the genetic analysis of population structure. Proc. R. Soc. B 1996, vol. 363 (pg. 1619-1626).
Excoffier L, et al. Arlequin ver. 3.0: An integrated software package for population genetics data analysis. Evol. Bioinform. Online 2005, vol. 1 (pg. 47-50).
Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 2002, vol. 18 (pg. 337-338).
Laval G, Excoffier L. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics 2004, vol. 20 (pg. 2485-2487).
Mailund T, et al. CoaSim: a flexible environment for simulating genetic data under coalescent models. BMC Bioinformatics 2005, vol. 6, pg. 252.
Spencer CCA, Coop G. SelSim: a program to simulate population genetic data with natural selection and recombination. Bioinformatics 2004, vol. 20 (pg. 3673-3675).
Tallmon D, et al. Comparative evaluation of a new effective population size estimator based on approximate Bayesian computation. Genetics 2004, vol. 167 (pg. 977-988).
Voight BF, et al. A map of recent positive selection in the human genome. PLoS Biol. 2006, vol. 4, pg. e72.

Author notes

Associate Editor: Martin Bishop
