Zombi: A phylogenetic simulator of trees, genomes and sequences that accounts for dead lineages

Summary Here we present Zombi, a tool to simulate the evolution of species, genomes and sequences in silico, that considers for the first time the evolution of genomes in extinct lineages. It also incorporates various features that have not to date been combined in a single simulator, such as the possibility of generating species trees with a pre-defined variation of speciation and extinction rates through time, simulating explicitly intergenic sequences of variable length and outputting gene tree - species tree reconciliations. Availability and implementation Source code and manual are freely available in https://github.com/AADavin/ZOMBI/ Contact aaredav@gmail.com


Introduction
Reconstructing the pattern of horizontal gene transfers between species can help us date the origin of different taxa (Davín et al. 2018;Wolfe and Fournier 2018) , understand the spread of genes of clinical importance (Lerminiaux and Cameron 2019) and resolve difficult phylogenetic questions, such inferring the rooting point of prokaryotic trees Szöll ő si et al. 2012;Williams et al. 2017) or the evolutionary position of certain lineages of unclear origin (Boussau, Guéguen, and Gouy 2008) . In the last decades, a large number of simulators have been developed to model a wide range of evolutionary scenarios Carvajal-Rodríguez 2008; but none so far have considered the existence of extinct lineages and the horizontal transmission of genes (by lateral gene transfers) involving species that are not represented in the phylogeny (see: (Szöllősi et al. 2013;Fournier, Huang, and Gogarten 2009;Zhaxybayeva and Peter Gogarten 2004) ). Zombi simulates explicitly the genome evolution taking place in these extinct lineages, which is expected to have an impact in extant lineages by means of Lateral Gene Transfers (Szöllősi et al. 2013) . By not considering extinct lineages, other simulators make the implicit assumption that the transfer donor always leaves a surviving descendant among sampled species, while we know that this is most often not true (Szöllősi et al. 2013) . Making this assumption may potentially hamper our ability to simulate realistic scenarios of evolution. In addition to considering evolution along extinct lineages, Zombi includes several features hitherto not found together in any other simulator (Table S1). Fig 1. Overview of the three steps of the Zombi simulator. A: In T mode, Zombi simulates a species tree using a birth-death process and outputs the pruned version of it by removing extinct lineages. In this example, lineages n3 and n8 go extinct before the simulation ends. B : in G mode, a circular genome evolves along the branches of the complete species tree obtained with the T mode by Duplications (D), Originations (O), Inversions (I), Transpositions (P), Losses (L) and Transfers (T) of genes. The simulation starts with the initial genome (IG) containing a number of genes determined by the user. Each gene has an orientation (+ or -) that is determined randomly and represents the direction of the gene in the coding strand. Here, the IG is composed of five genes (small coloured circles). Various events affecting different genes and their impact on the genome structure are indicated along the branches. The inversion events not only modify the positions of the genes but also change their orientation. C: in S mode, Zombi can be used to simulate codon, nucleotides and amino acids along the branches of the gene family trees. Here, the gene tree of the grey coloured gene family from B has been depicted.
The G mode simulates the evolution of genomes along the branches of the complete species tree ( Figure 1B), using also the Gillespie algorithm ( Figure S2) to account for six possible genome-level events: duplications, losses, inversions, transpositions, transfers and originations. Each of the first five events are characterized by two parameters: the first one is the effective rate, that controls the frequency and fixation probability; the second one controls the extension, i.e., the number of contiguous genes simultaneously affected by the event.
Originations of new genes occurs one by one and therefore only a single effective rate parameter is needed. Once the simulation reaches the end, Zombi outputs a list containing each event that has occured in the simulation for every gene family (all genes that share a common origin). Besides, the gene trees of each family are reconstructed by combining both species-level events (Speciations and Extinctions) and genome-level events (Duplications, Transfers and Losses). When a Transfer event occurs, the recipient lineage is randomly chosen from all the lineages alive at that time. The user can make the frequence of transfers to be higher between closely related lineages (Ochman, Lawrence, and Groisman 2000) ( Figure S3). Inversions and transpositions do not modify the topology of the tree but add an extra layer of complexity by changing the neighborhood of genes, which is especially relevant when genome-level events affect more than one gene at a time ( Figure S4). The gene family trees are also pruned to present the user the trees that can be expected to recover from most real-data analyses, removing all extinct lineages and gene branches that do not arrive until the present time.
The S mode, finally, simulates gene sequences (at either the codon, nucleotide or protein level) along the gene family trees ( Figure 1C). The user can modify the scaling of the tree to better control the number of substitutions that take place per unit of time, and thus simulate fast or slow-evolving genes.

Advanced features
In addition to the basic features presented above, "advanced" modes of Zombi (listed in Table S2) can be used to obtain richer and more realistic evolutionary scenarios. For example, it is possible to use a species tree input by the user, to generate species trees with variable extinction and speciation rates, or to control the number of living lineages at each unit of time ( Figure S5). At the genome level, Zombi can simulate genomes using branch-specific rates (Gu mode, allowing the user to simulate very specific scenarios such as one in which a certain lineage experiences a massive loss of genes), gene-family specific rates (Gm mode, which makes easier the process of using rates estimated from real datasets) and genomes accounting for intergenic regions (Gf mode) of variable length (drawn from a flat Dirichlet distribution (Biller et al. 2016) . At the sequence level, finally, the user can fine-tune the substitution rates to make them branch specific.
Zombi provides the user with a clear and detailed output of the complete evolutionary process simulated, including, the reconciled gene trees with the species tree in the RecPhyloXML reconciliation standard (Duchemin et al. 2018) .

Performance and validation
Simulations with Zombi are fast: with a starting genome of 500 genes and a species tree of 2000 taxa (extinct + extant), it takes around 1 minute on a 3.4Ghz laptop to simulate all the genomes ( Figure S6). We validated that the distribution of waiting times between successive events was following an exponential distribution ( Figure S7 and S8), that the distribution of intergene sizes at equilibrium was following a flat Dirichlet distribution, as expected from Biller et al. 2016 (Figure S9), that the number of events and their extension occurs with a frequency according to their respective rates ( Figure S10) and that the gene family size distribution followed a power-law when duplication rates are higher than loss rates and stretched-exponential in the opposite case   (Figure S11). We also checked by hand the validity of many simple scenarios to detect possible inconsistencies in the algorithm.

Implementation
Zombi is implemented in Python 3.6. It relies on the ETE 3 toolkit (Huerta-Cepas, Serra, and Bork 2016) and the Pyvolve package (Spielman and Wilke 2015) . It is freely available at https://github.com/AADavin/ZOMBI along with detailed documentation and two tutorials in a wiki page.      Extinction event inactivates the genome evolving within that branch. Figure S3. Assortative transfers . By default, when a transfer event occurs, it takes place between two randomly sampled lineages. The user can activate the function assortative transfer, which makes the transfer between two lineages to occur with a probability = e −αδ (being a parameter to control for the strength of the effect and the normalized α δ phylogenetic distance). The between two nodes is defined as the distance (in time units) δ between each of the nodes to their common ancestor. In this example, we simulate two datasets in the same Species Tree (30 species). The parameter was set to 100. We can see α how the assortative model of transfer makes transfers between closely related lineages more frequent than the uniform model, a phenomenon that has been observed in real data (Ochman et al. 2000) . Zombi a single event can affect more than one gene simultaneously. In A we have three snapshots of the genome evolving in the Species Tree on the right, along the blue branches.

Supplementary Materials
At time 0, none of the genes has undergone any event. At time 1, the green genes have undergone a duplication. At time 2, some red genes have been transposed within the duplicated genes. In B, the resulting gene trees from this scenario. We can see that in the resulting genomes the genes that, due to the transposition event, share the same duplication event, are shuffled with those that do not. Although inversions and transposition do not alter the topology of the gene trees directly, they can have a big impact when the different events affect more than one gene at a time.