Mirage: estimation of ancestral gene-copy numbers by considering different evolutionary patterns among gene families

Abstract Motivation Reconstruction of gene copy number evolution is an essential approach for understanding how complex biological systems have been organized. Although various models have been proposed for gene copy number evolution, existing evolutionary models have not appropriately addressed the fact that different gene families can have very different gene gain/loss rates. Results In this study, we developed Mirage (MIxtuRe model for Ancestral Genome Estimation), which allows different gene families to have flexible gene gain/loss rates. Mirage can use three models for formulating heterogeneous evolution among gene families: the discretized Γ model, probability distribution-free model and pattern mixture (PM) model. Simulation analysis showed that Mirage can accurately estimate heterogeneous gene gain/loss rates and reconstruct gene-content evolutionary history. Application to empirical datasets demonstrated that the PM model fits genome data from various taxonomic groups better than the other heterogeneous models. Using Mirage, we revealed that metabolic function-related gene families displayed frequent gene gains and losses in all taxa investigated. Availability and implementation The source code of Mirage is freely available at https://github.com/fukunagatsu/Mirage. Supplementary information Supplementary data are available at Bioinformatics Advances online.


Introduction
Gene gain and loss events in genomes have played essential roles in the evolutionary history of life. Complex biological systems that function through the coordination of numerous genes, e.g. metabolic pathways and signal transduction systems, have been constructed through the accumulation of such events. To answer the fundamental biological question of how such complex systems have been organized, gene-content evolutionary history has been studied with established bioinformatics methods (Ferná ndez and Gabaldó n, 2020; Hahn et al., 2007;Iwasaki and Takagi, 2009;Montague et al., 2014). The gene count method, which utilizes a species tree and an ortholog table, is one of the effective methods for reconstructing gene-content evolutionary history (Ames et al., 2012;Cohen and Pupko, 2010;Csurö s and Mikló s, 2009;Hahn et al., 2005;Han et al., 2013;Iwasaki and Takagi, 2007;Kim and Hao, 2014;Li et al., 2014Li et al., , 2019Librado et al., 2012;Liu et al., 2011;Rabier et al., 2014;Snel et al., 2002;Zamani-Dahaj et al., 2016;Zwaenepoel and Van de Peer, 2020). These algorithms estimate gene content of ancestral species based on the maximum parsimony or maximum likelihood (ML) method, where the ML method is known to show better performance (Ames et al., 2012;Cohen and Pupko, 2011).
In the ML method, it is important to specify which gene-content evolutionary model is adopted. The ML method first estimates evolutionary model parameters, such as gene gain and loss rates, and then the ML evolutionary history of gene content is reconstructed based on the estimated parameters. Some ML methods adopt a twostate evolutionary model and require a two-state ortholog table, which contains the presence/absence information of each ortholog group in each genome. These methods estimate whether each ortholog group existed or not at each ancestral node of the given phylogenetic tree (Cohen and Pupko, 2010;Li et al., 2014). Although the two-state evolutionary model is mathematically simple, it is apparently unable to deal with gene copy number variations, which play important roles in the evolution of biological systems (Saitou and Gokcumen, 2020). The other ML methods estimate a copy number of each ortholog group at each ancestral node from an ortholog table that contains copy number information of each ortholog group in each genome.
To date, various gene gain/loss models have been proposed for the gene copy number evolution. For example, the birth and death (BD) model is a two-parameter model, which considers only gene gain and loss parameters (Han et al., 2013;Iwasaki and Takagi, 2009;Fig. 1A). Other models [the Csurö s and Mikló s (C&M) model and the birth, death and innovation (BDI) model] decompose the gene gain parameter into gene birth (innovation) and duplication parameters, resulting in three parameters (Ames et al., 2012;Csurö s and Mikló s, 2009;Karev et al., 2002;Fig 1B and C). A richer parameter model is the all rates different birth and death (BDARD) model, which allows all gene gain and loss parameters to be varied freely (Kim and Hao, 2014;Fig. 1D).
Another important aspect of gene-content evolution that should be considered is that different gene families have different gene gain/ loss patterns (Krylov et al., 2003). For example, housekeeping genes are seldom lost from genomes and thus the gene loss rates to zero copies are small, whereas antibiotic resistance genes are easily lost from genomes. Another example is olfactory receptor genes, which are prone to increase copy numbers and have exceptionally large gene gain rates. The most popular model considering the heterogeneity among gene families is the discretized C model, which assumes that the distribution of the evolutionary rate multipliers follows the discrete C distribution (Yang, 1994). Another rate multiplier heterogeneity model is the probability distribution-free (PDF) model, which directly learns evolutionary rate multipliers from an input dataset without making assumptions about the rate multiplier distribution (Kalyaanamoorthy et al., 2017;Yang, 1995). These rate multiplier heterogeneity models can represent heterogeneous evolutionary rate multipliers among gene families (e.g. Kim and Hao, 2014;Librado et al., 2012), but cannot represent the heterogeneity of rate patterns among gene families. For example, in the BD model with a rate multiplier heterogeneity model, the ratio between the gain and loss parameters becomes always constant among gene families. To deal with such heterogeneity, the pattern mixture (PM) model is used in molecular evolutionary analyses (Dang and Kishino, 2019;Lartillot and Philippe, 2004;Pagel and Meade, 2004;Quang et al., 2008). In gene-content evolutionary analyses, the heterogeneity model has been adopted in the two-state evolutionary model (Cohen and Pupko, 2010;Li et al., 2014;Spencer and Sangaralingam, 2009;Zamani-Dahaj et al., 2016). Additionally, in some methods, the discretized C model has been used in the gene copy number evolution (Csurö s and Mikló s, 2009;Librado et al., 2012;Mendes et al., 2020). However, there have been no studies using the PDF or PM model for modeling the gene copy number evolution; in other words, existing models cannot reflect diverse gene copy number evolution patterns that depend on gene families.
In this study, we developed Mirage (MIxtuRe model for Ancestral Genome Estimation), which reconstructs a gene-content evolutionary history based on various gene gain/loss models by unsupervised classification of evolutionary patterns among gene families. We verified that Mirage can estimate both model parameters and gene-content evolutionary history with high accuracy using simulated datasets. In addition, we demonstrated that the combination of the BDARD and PM models fitted empirical datasets better than the other models. Finally, we reconstructed gene-content evolutionary histories of several taxonomic groups using Mirage and revealed that gene families involved in metabolic functions frequently experienced gene gain/loss events in all taxonomic groups investigated.

Input data for our method
The input data for our method are an ortholog table D and a phylogenetic tree T. D is a data matrix that consists of N species (genomes) and L gene families (ortholog groups). D i;j , which is an element of the species i and the gene family j in the matrix, represents the gene copy number of j in i. The phylogenetic tree T is a binary rooted tree whose branches have branch lengths greater than 0. The tree has N leaves (external nodes), which correspond to the N species in the ortholog table D. T also has N-1 internal nodes, which correspond to the ancestral species. The reconstruction problem of gene-content evolutionary history is defined as an estimation problem of gene copy numbers (X) in the ancestral species for each gene family.

Gene-content evolutionary models
Gene-content evolution is formulated as a continuous-time Markov model, where gene copy numbers and gene gain/loss events are represented as states and state transitions, respectively. Gene gain/loss events in each gene family are assumed to have occurred independently of those in other gene families. In an infinitesimal time Dt, a gene gain/loss event of one gene is assumed to have occurred at most once. In the BD model, where the model parameters are a gene gain rate a and a gene loss rate b, transitions from a gene copy number n to n þ 1 and n-1 occur at probabilities of aDt and bDt respectively, in Dt (Han et al., 2013;Iwasaki and Takagi, 2009;Fig. 1A). Csurö s and Mikló s developed a model with three parameters: a gene acquisition rate a, a gene loss rate b and a gene duplication rate c (Csurö s and Mikló s, 2009; Fig. 1B). By assuming that a horizontal gene transfer (HGT) is a main mechanism of gene acquisition, the C&M model defines transition rates from n to n þ 1 and n-1 as a þ nc and nb, respectively (note that the gene gain/loss rates change linearly with gene copy numbers). The other three-parameter model, the BDI model, utilizes a novel gene family acquisition rate d in addition to a general gene gain rate a, and a gene loss rate b (Ames et al., 2012;Karev et al., 2002;Fig. 1C). This model is basically the same as the BD model, except that the transition rate from 0 to 1 is d. The most flexible model is the BDARD model, which allows all state transition rates to be different (Kim and Hao, 2014; Fig. 1D).
For improved stability and ease of the computation, we set the maximum gene copy number as a parameter l max (i.e. gene families having copy numbers larger than l max are considered to have l max gene copies). Note that the limitation of maximum size was utilized in some previous researches (Ames et al., 2012;Iwasaki and Takagi, 2007;Spencer et al., 2007) and results in a finite number of parameters in the BDARD model. A large l max value introduces more parameters in the BDARD model and may cause instability of parameter estimation. Here, the number of states is l max þ 1 (0 to l max ). Because l max is a user-input parameter, the user can freely set it to a reasonable value. Let R be a (l max þ 1) Â (l max þ 1) transition rate matrix. ½R i;j , which is an (i, j)-th element of R, represents the state transition rate from the state i to the state j in Dt. ½R i;j ¼ 0 when ji À jj > 1. We define Pðyjx; R; tÞ as a transition probability from state x to y in time t. If t ¼ 0, the gene copy number does not change and Pðyjx; R; 0Þ ¼ ½I x;y , where I is the identity matrix. If t ¼ Dt; Pðyjx; R; DtÞ ¼ ½I þ RDt x;y , where ½R i;i ¼ À P j;i6 ¼j ½R i;j . Then, under the Markov process assumption, we obtain Pðyjx; R; tÞ ¼ ½lim n!1 ðI þ t n RÞ n x;y ¼ ½expðtRÞ x;y . Furthermore, we allowed different gene families to have different gene gain/loss parameters. Instead of assuming that all of the L gene families evolve under the same transition rate matrices, these models utilize K transition rate matrices. Here, K is a user-input parameter. Each of the gene families is probabilistically assigned to K clusters in the framework of the mixture model. When the discretized C model is used, the C distribution is divided into K categories so that each category has the same probability, and calculates r 1 ,. . .,r K as the mean rates of each category. Here, the C distribution f(x) is a a CðaÞ expðÀaxÞx aÀ1 , and the distribution is parameterized by a. Then, the transition parameter matrix for each cluster k is defined as r k R. In addition, / k , which is the probability that a gene family belongs to the category k, is set to 1 K . When the PDF model is used, we do not assume the discrete C distribution for the rate multiplier distribution and directly learn r k and / k for each cluster i from the input data. Note that both the discretized C model and the PDF model use only single transition rate matrix R and thus cannot represent the heterogeneity of evolutionary patterns among gene families. Finally, when the PM model is used as the most flexible heterogeneous model, we directly introduce K transition rate matrices (i.e. R 1 ,. . .,R K .) and learns R k and / k for each cluster k from the input dataset.

Parameter estimation and gene-content evolutionary history reconstruction algorithm
The evolutionary model parameters to be estimated for the discretized C model, the PDF model and the PM model are h ¼ fa; R; pg; f/; r 1 ; . . . ; r K ; R; pg and f/; R 1 ; . . . ; R K ; p 1 ; . . . ; p K g, respectively.
Here, / is a K-length vector that is the mixing probability of each gene-content cluster and p k is a (l max þ 1)-length vector that is the state occurrence probability of the k-th gene-content cluster at the root node in the phylogenetic tree. For the discretized C and PDF models, we assumed that all gene families follow the same distribution p. We modeled p and R as independent parameters whereas p is generally modeled as the stationary distribution of the Markov process formulated by the parameter matrix R in the DNA evolution models. This is because it is difficult to assume stationarity in the gene-content evolution (Wolf and Koonin, 2013).
The model parameters are estimated by the EM algorithm (Dempster et al., 1977). The EM algorithm is an ML method for estimating parameters from observed data in statistical models that assume unobserved hidden states. In our model, the observed data are the ortholog table D, while the unobserved hidden states are the gene-content evolutionary history X and assignments of each gene family to each gene-content cluster Z. The EM algorithm consists of the following four steps. (1)  Here, we describe the EM algorithm for the PM model in detail (see Supplementary Material for those of the other models). The Q function of our EM algorithm is described as follows: where D l is the column l of the ortholog table D, XðD l Þ is the set of all possible gene-content evolutionary histories on D l and Z lk is an indicator variable representing whether the gene family l belongs to the genecontent cluster k. Here, for the formula of conditional probabilities, lnpðX; Z lk ; D l jhÞ ¼ lnpðXjZ lk ; hÞ þ lnpðZ lk jhÞ ¼ lnpðXjR k ; p k Þ þ ln/ k :: Therefore, Here, pðZ lk jD l ; hÞ / / k pðD l jR k ; p k Þ; [pðZ lk jD l ; hÞ ¼ / k pðD l jR k ; p k Þ P K j¼1 / j pðD l jR j ; p j Þ

:
We describe pðZ lk jD l ; hÞ as cðZ lk Þ for the simplicity of the notation. Based on discussion of the sufficient statistics for the phylogenetic tree model (Kiryu, 2011), We assigned a distinct index to each node and t m is a branch length between the node m and the parent node. F ðmÞ ði; XÞ and N ðmÞ ði; j; XÞ are the fractional duration of the state i and the number of state changes from the state i to the state j on the history X at the branch between the node m and the parent node, respectively. n root ði; XÞ is an indicator variable representing whether the root node takes the state i on the history X. By substituting these formulae for the Q function, we obtained the following equation: In the step 2 of our EM algorithm, we calculated the values of cðZ lk Þ; F ðmÞ ði; D l ; R old k ; p old k Þ; N ðmÞ ði; j; D l ; R old k ; p old k Þ and n root ði; D l ; R old k ; p old k Þ for each k and l. These expected values can be efficiently calculated using eigenvalue decompositions of the state transition probability matrices and a dynamic programming method for the phylogenetic tree T (Holmes and Rubin, 2002;Kiryu, 2011;Quang et al., 2008). Subsequently, we found the parameter h that maximized the Q function in the step 3. The details of the step 3 are described in the Supplementary Material.
We obtained the computational time complexity per iteration as follows. The most computationally expensive part of the algorithm is calculating the expected number of transitions for each branch, which is a part of dynamic programming in step 2. We need to calculate ðl max þ 1Þ 4 expected values because the number of states is (l max þ 1). However, since we assumed ½R i;j ¼ 0 when ji À jj > 1, the number of transitions from the state i to j is always 0 when ji À jj > 1. Therefore, the expected values that need to be counted are Oðl 3 max Þ. We have to calculate the expected values for each branch, species and cluster, thus the total computational time complexity per iteration is OðNLKl 3 max Þ. After the parameter estimation, the ML evolutionary history (X) is reconstructed by a dynamic programming method using the estimated parameters. The reconstruction method is similar to the Viterbi algorithm, which obtains the ML path of hidden states in the hidden Markov model, and also resembles an algorithm for the reconstruction of ancestral protein sequences (Pupko et al., 2000). The details of the algorithms are described in the Supplementary Materials. We implemented the algorithms in Cþþ, and the source code is freely available at https://github.com/fukunagatsu/Mirage.

Preparation of simulated datasets
We evaluated the performance of Mirage using simulated datasets. We simulated gene-content evolution for all combinations of four gain/loss models (the BD, C&M, BDI and BDARD models) and three heterogeneity models (the discretized C, PDF and PM models). We used a perfect binary tree with 128 leaves as the input phylogenetic tree topology and determined the branch lengths by the Yule process with a birth rate k of 5.0. For the number of gene-content clusters K and the maximum gene family size l max , we used two sets of parameters, (K ¼ 4 and l max ¼ 3) and (K ¼ 6 and l max ¼ 5). The parameter h was different for each evolutionary model, and these were described in the Supplementary Materials.
We simulated the evolution of 10 000 gene families along the input phylogenetic tree for a simulated dataset, and we constructed an ortholog table from the gene copy numbers at the leaf nodes. We prepared 10 simulated datasets, each of which consisted of an ortholog table and a phylogenetic tree.

Preparation of the empirical datasets
We created three empirical datasets including Archaea (domain), Micrococcales (order) and Fungi (kingdom). We used ortholog tables provided in the STRING database (Szklarczyk et al., 2019) and NCBI Taxonomy for taxonomic annotation. Next, we retrieved species in the phylogenetic trees provided by the Genome Taxonomy Database release 89 (Parks et al., 2018) for Archaea and Micrococcales, and those provided by the SILVA database release 111 (Yarza et al., 2017;Yilmaz et al., 2014) for Fungi. Then, we removed species data that were only included in either the ortholog tables or the phylogenetic trees from those datasets. In the Fungi phylogenetic tree, some species contained multiple strains. For those species, we randomly selected one strain and removed the others. Then, we reshaped the phylogenetic trees to satisfy the following three conditions: (i) a leaf of the phylogenetic trees always corresponds to a species, (ii) tree topology is binary and (iii) the distances and the phylogenetic relationships between species are the same as in the original tree. Because there were branches with branch lengths of 0 in the Fungi phylogenetic tree, we added a pseudo length 0.0001 to all tree branches. Note that the minimum branch length excluding 0 in the tree was 0.00059, which was larger than 0.0001. Finally, the Archaea, Micrococcales and Fungi datasets comprised 151 species and 11 650 gene families, 111 species and 9523 gene families, and 123 species and 34 454 gene families, respectively. The constructed datasets are freely available at https://github.com/fuku nagatsu/Mirage.

Evaluation
The EM algorithm is guaranteed to converge to a local optimum but not to a global optimum, and thus the estimation results can depend on the initial values of the model parameters. Therefore, we estimated parameters 100 times using the EM algorithm for each dataset and each evolutionary model, and we adopted the estimation results with the largest data likelihood.
In the simulated dataset analysis, to evaluate the effect of the heterogeneity model on the performance, we investigated the performance when we changed K. Additionally, to assess the accuracy of presence/absence state reconstruction of the two-state model, we examined the performance when we set l max as 1. Furthermore, we evaluated the difference in the performance among various gene gain/loss models and heterogeneity models by applying these models to the datasets generated by the BDARD model with the PM model. As the evaluation criteria for the reconstructed evolutionary history, in the experiments to evaluate the effect of K, gene gain/loss models and heterogeneity models, we used the proportion of gene families whose gene copy numbers were correctly estimated in ancestral nodes. We also investigated the correlation coefficients between the number of gene gain/loss events for gene families in the reconstructed history and those in the true history. In the experiments to assess the performance of the two-state model, we evaluated the estimation accuracy of the presence or absence of gene families.
To evaluate the computational time, we applied Mirage to the simulated datasets under various conditions. Six factors can affect computational time: L (size of gene families), N (size of species), K, l max , a gain/loss model and a heterogeneity model. To estimate the influence of each factor on the computational time, we first defined the base condition and then measured the computation time by changing only one factor from the base condition. The base condition was defined as a condition that L ¼ 5000, N ¼ 64, K ¼ 6, l max ¼ 3, the gain/loss model is the BDARD model, and the heterogeneity model is the PM model. We measured the computational time 100 times for each condition. The computation was conducted on an Intel Xeon Gold 6130 2.1 GHz CPU with 16GB of memory.
In the empirical dataset analysis, we tested K values from 1 to 10 and l max values from 2 to 4. We first divided each dataset into gene families of training and test datasets, and estimated the model parameters using the training dataset only. We then calculated loglikelihoods of the test datasets based on the estimated parameters. We divided the datasets in the following three ways. For experiment 1, we randomly divided the gene families into training and test datasets at a 4:1 ratio for each dataset. Here, the species sets were common between the two datasets. For experiment 2, for each dataset, we randomly divided the species into training and test datasets at a 1:1 ratio. In this method, some gene families were shared between the training and test datasets. For experiment 3, we further processed the datasets obtained by the second method. In particular, we randomly assigned gene families shared between the training and test datasets to either of the datasets, so that no gene families were shared between the two datasets. Therefore, in this division, both the species and gene families were different between the two datasets. The numbers of gene families for each dataset in the experiments 2 and 3 are listed in Supplementary Table S1.

Performance evaluation of Mirage based on simulated datasets
For the evaluation of the performance of Mirage, we first applied Mirage to the simulated datasets. Supplementary Figures S1 and S2 show the relative errors of the estimated model parameters. We defined the relative error as 100 Âĥ Àh h , whereĥ and h are the estimated and true parameters, respectively. Additionally, in order to evaluate the variability of the estimation accuracy while ignoring outliers, we used interquartile range (IQR), which is defined as a difference between the 75 and 25 percentiles of relative errors. When K ¼ 4 and l max ¼ 3, the largest IQRs among the various model settings were 2.30 for R, 12.49 for p, 7.19 for /, 4.18 for r and 3.50 for a. Additionally, when K ¼ 6 and l max ¼ 5, the largest IQRs among the various model settings were 5.98 for R, 94.12 for p, 15.31 for /, 7.44 for r and 3.44 for a. These results show that Mirage can estimate parameters with high accuracy although the accuracy decreased when K and l max were large. On the other hand, in some model settings, the maximum values of the relative errors were very large. For example, in the p estimation in the PM model, the relative errors sometimes exceed 100.0, i.e. the estimated p may substantially differ from the true parameter. Such difficulty in root parameter estimation is well-known and may stem from the fact that the root node is the topologically furthest away from the observable leaf nodes (i.e. extant genomes).
When we used the same model as the one that generated the dataset for the estimation, the median of the accuracy of the reconstructed ancestral states (gene copy numbers) was more than 75% in all model settings ( Supplementary Figs S3 and S4). Additionally, the median of the correlation coefficients for evaluating the estimation accuracy of numbers of gene gain/loss events was more than 97.5 in all model settings ( Supplementary Fig. S3 and S4). These results show that Mirage can reconstruct the evolutionary history with high accuracy. Next, to investigate the effect of the model misspecification, we evaluated the different model from the one that generated the dataset for the estimation. We investigated whether there is a difference in accuracy between the two methods using paired t-tests. We used 0.05 as the original significance level, and we adjusted the value using the Bonferroni's multiple correction, which divides the original significance level by the number of tests. When the phylogenetic mixture model was not used (i.e. K ¼ 1), the reconstruction accuracy and the correlation coefficients significantly decreased in almost all cases, likely because the heterogeneity of gene-content evolution was ignored ( Supplementary Fig. S3 and S4). On the other hand, when we set K to a larger value than the true value, we could not observe the significant increase in the reconstruction accuracy and the correlation coefficients in any cases (Supplementary Fig. S3 and S4). On the contrary, in some cases, such as the C model, the larger K value shows better performance than the true K value. These results mean that increasing the value of K does not significantly impact the quantitative results.
If the copy number states were ignored and only presence/absence information was considered (i.e. if incorrect estimation among the copy numbers 1, 2 and 3 was ignored), the median of the accuracy of presence/absence state reconstruction of the ancestral nodes was more than 90.0% in all model settings (Supplementary Figs S5 and S6). When the two-state model (with the phylogenetic mixture model) was applied to those cases (i.e. l max ¼ 1), the accuracy significantly decreased depending on the model setting in many cases ( Supplementary Figs S5 and S6). However, only when we used the C&M model with the C model for the dataset with K ¼ 4 and l max ¼ 3, the accuracy significantly increased. This result indicates that the estimation of gene copy number evolution can become effective even when only presence/absence information is reconstructed. Mirage cannot estimate parameters when users set l max to a value larger than any value contained in the dataset. As Mirage is based on the ML method, the parameters are estimated so that the probability of an event not occurring in the dataset is zero.
We also investigated the performance among various gene gain/loss models and heterogeneity models when the datasets were generated by the BDARD model with the PM model ( Supplementary Fig. S7). Although we could not observe a significant difference probably because of outliers, we found that the median of the reconstruction accuracy and the correlation coefficients of the BDARD model with the PM model were larger than those of the other models.
We finally evaluated the computational time of Mirage under various conditions on the simulated datasets. We confirmed that the computational time was linearly proportional to L, N and K and more than linearly proportional to l max as indicated by the computational complexity analysis ( Fig. 2A-D). The result about l max suggests that it is impractical to model the evolution of many copy gene families, such as olfactory receptor genes (Niimura, 2009), in the current Mirage implementation. When we changed the gain/loss models, the BDI and BDARD models were faster than the BD and C&M models (Fig. 2E). Additionally, when we changed the heterogeneity models, the PM model and the C model were the slowest and fastest, respectively (Fig. 2F). This result shows that the computation time increases as the complexity of heterogeneity increases.

Comparison of models and parameters by holdout validation based on empirical datasets
Next, we compared the effects of models and parameters by evaluating holdout performance of Mirage using empirical datasets. For the appropriate setting of l max , we first investigated the largest gene copy number among all gene families. Supplementary Fig. S8 shows cumulative relative frequency curves of the largest gene copy number in each gene family. In all datasets, the majority (80-90%) of the gene families had a maximum value of 2-4. Because large l max values require huge computation time (Fig. 2D), we tested l max values from 2 to 4 as a range of values that can be used in a case of large-scale data analysis. We learned the model parameters from the training datasets only under various model settings and subsequently calculated the loglikelihoods of the test datasets using the estimated parameters. Regardless of l max or the dataset used, the log-likelihood increased with the increasing number of gene-content cluster K, except for limited cases, likely because of convergence to local optima by the EM algorithm. In addition, the BDARD and BD models showed the best and the worst log-likelihood under the same heterogeneity model, respectively (Fig. 3). When we changed the heterogeneity model while using the BDARD model, the PM model achieved superior performances to the other models, and the PDF model showed slightly higher likelihood than the discretized C model. When we divided the training and test datasets in different ways (i.e. by species or by species and gene families), we obtained similar results ( Supplementary Figs S9 and S10). In conclusion, the combination with the BDARD and PM models yielded gene-content evolutionary models with the largest log-likelihood values among the models we investigated.
Interestingly, although the C&M and BDI models had the same numbers of parameters, their log-likelihood values were slightly different (Fig. 3). When the Archaea or Micrococcales dataset was used and l max was 4, the C&M model exhibited a larger log-likelihood. On the other hand, when the Fungi dataset was used, the BDI model exhibited a larger log-likelihood. When l max ! 3, the C&M model naturally assumes that gene duplication and loss rates change linearly with gene copies, whereas the BDI model assumes that gene duplication and loss occur at a constant rate regardless of gene copy numbers (Fig. 1). Thus, the difference likely reflects the nature of gene duplications and losses in prokaryotic and eukaryotic genomes. Specifically, the BDI model may be more suitable for eukaryotic evolutionary processes in which meiotic recombination introduces tandem gene duplications and losses, which are basically independent of gene copy numbers.

Analysis of estimated evolutionary model parameters
Next, we applied Mirage to each of the complete Archaea, Micrococcales and Fungi datasets. Based on the holdout validation results, we used the BDARD model with the PM model and l max ¼ 3 for the evolutionary model. Additionally, we set K to 5 in order to achieve both large likelihood in the holdout validation and high interpretability thanks to the small number of K. Estimated model parameters are presented in Table 1, Supplementary Figure S11 and the Supplementary Data. In all datasets, the gene gain rates (½R k i;iþ1 ) tended to be smaller than the gene loss rates ([½R k i;iÀ1 ), being consistent to a previous study (Cohen and Pupko, 2010).
We next examined the evolutionary model parameters estimated for each gene-content cluster and each dataset. To quantify the Mirage frequency of gene gain/loss events occur in each cluster k, we calculated a normalized cluster evolutionary rate, which was P lmax i¼0 p i ½R k i;i divided by the minimum of these values among each dataset ( Supplementary Fig. S12). The maximum normalized cluster evolutionary rates were 87.4, 18.4, 18.7 for the Archaea, Micrococcales and Fungi datasets, respectively, indicating that different genecontent clusters have largely different evolutionary rates. The Archaea dataset exhibited the largest difference, where the gene-content clusters 1, 2 and 3 exhibited large, moderate and small normalized cluster evolutionary rates, respectively (Table 1). We also investigated whether specific gene functions were enriched in specific gene-content clusters. We used EGGNOG database version 4.0 for gene annotation to COG, arCOG and NOG gene families and version 3.0 for gene annotation to KOG category (Powell et al., 2012(Powell et al., , 2014. After removing 'poorly characterized' supercategories, we observed differences in the enriched COG supercategories among gene-content clusters (Supplementary Tables S2-S4).

Reconstruction of the gene-content evolutionary history
We then reconstructed the gene-content evolutionary history for each dataset using Mirage. We first counted gene gain/loss events in each gene family from the reconstructed evolutionary history ( Supplementary Fig. S13). In all datasets, gene gain/loss events were rare in most gene families, whereas some gene families exceptionally frequently experienced gene/gain loss events. Supplementary Tables  S5-S7 list the 20 gene families with the most frequent gene gain/loss events for each dataset. Many transposase genes were commonly found in all three datasets, whereas one gene family, COG0286 (HsdM), commonly appeared in the lists of the Archaea and Micrococcales datasets. COG0286 is annotated as a DNA methylase subunit of the type I restriction-modification system. It is reasonable that a restriction-modification system has been spread by HGT, as is well-known for the type II system (Jeltsch and Pingoud, 1996).
Finally, we investigated whether specific gene functions were enriched in the gene families with frequent gene gain/loss events. First, we examined differences in the distributions of the COG supercategories between the top 10% of gene families with frequent gene gain/loss events and entire gene families (Table 2). Based on the v 2 test with Bonferroni's multiple correction, we found that the 'metabolism' supercategory was significantly enriched in the gene families with frequent gene gain/loss events in all datasets. Then, we analyzed which categories in the 'metabolism' supercategory were enriched in the different datasets (Table 3). In the Archaea dataset, gene families in categories C, 'Energy production and conversion', and P, 'Inorganic ion transport and metabolism', were the most enriched, probably reflecting the diverse ways in which Archaea obtain energy. In the Micrococcales and Fungi datasets, gene families in categories E, 'Amino acid transport and metabolism', G, 'Carbohydrate transport and metabolism', I, 'Lipid transport and metabolism' and Q, 'Secondary metabolites biosynthesis, transport, and catabolism', were highly enriched, probably reflecting rich secondary metabolism functions of those taxonomic groups.

Discussion
In this study, we developed Mirage, which adopts heterogeneous evolutionary model among gene families for accurate ML reconstruction of gene-content evolutionary history.
We demonstrated that the combination with the BDARD and PM models achieved good performance based on empirical datasets. While the rate multiplier models, the PDF and discretized C models, are very frequently used in the molecular evolutionary analysis, our results show that these models may be not suitable for modeling gene copy number evolution. Molecular evolution would follow similar patterns because of physicochemical characteristics of substitutions and/or constraints due to the genetic code, whereas gene copy number evolution does not have such universal constraints and therefore may show diverse evolutionary patterns.
Whether the proposed probabilistic model is identifiable is a critical theoretical problem in statistics. Here, 'Identifiable' means that two different parameters always produce different probability distributions. Rhodes and Sullivant (2012) proved a theorem about a condition for identifiability in general heterogeneous evolutionary rate models. The theorem insists that heterogeneous evolutionary rate models with large K can be identified when the number of species is sufficiently large. However, we could not apply this theorem to our model because the assumption of the stationarity of the Markov process does not hold true. It is essential to discuss the identifiability of gene-content evolution models in the future.
Although we assumed that the input phylogenetic tree was correct, the tree topology or branch lengths may contain estimation errors. Incorrect trees would affect the estimation of ancestral gene copy numbers. The analysis of the robustness to the input tree error is important for the gene-content reconstruction analysis. Additionally, in phylogenetic tree inference, to avoid inaccurate estimation, phylogenetic relationships are often estimated not as a perfect binary tree but as a consensus tree or a phylogenetic network. Improving Mirage to accept them as input is important in mitigating the impact of estimation errors of phylogenetic trees on gene-content evolution.
The datasets in this research are unreduced datasets but not unbiased datasets because the datasets do not include OGs that are not possessed by extant organisms. Examples of these OGs are those that possessed by extinct organisms and have now been lost. The unbiased datasets have to contain these OGs because our probabilistic model can generate the all-absent patterns. Therefore, our estimation may include biases based on the unobserved patterns even if we use unreduced datasets (Cohen et al., 2008;Cs} urö s, 2005;Felsenstein, 1992). The development of the bias correction methods for the EM algorithm is an essential future task.
The setting of the gene-content category number K is an important problem in Mirage. Although the Akaike Information Criterion (AIC) is a widely used estimator for model selection in phylogenetics, AIC can only be applied to statistically regular models, whose ML estimator asymptotically follows a normal distribution. Mixture models are generally nonregular models, and thus we Fig. 3. Log-likelihood values of various model settings by the holdout validation of the experiment 1. The x-axis and y-axis represent the number of gene-content clusters K and the log-likelihood of the test dataset, respectively. The BD, C&M, BDI and BDARD models are represented by blue, green, yellow and red lines, respectively. In addition, the PM, PDF and discretized C models are represented by solid, dashed and dotted lines, respectively. (A-C) Archaea dataset when l max was set to 2-4, (D-F) Micrococcales dataset when l max was set to 2-4 and (G-I) Fungi dataset when l max was set to 2-4 cannot apply AIC to the heterogeneity evolution models. Another popular technique for model selection is the nonparametric Bayesian method. This method can be applied to nonregular models, but requires a lot of computation time. A practical model selection method for nonregular models is an unsolved problem in statistics, and various methods have been proposed (Fujimaki and Morinaga, 2012;Watanabe, 2013). The integration of Mirage with model selection methods for nonregular models is also a future task.
As an application of Mirage, we envision function prediction of function-unknown genes by integrating it with phylogenetic profiling to be an interesting direction. The phylogenetic profiling method predicts gene functions based on correlated occurrence patterns between genes in an ortholog table (Kensche et al., 2008;Kumagai et al., 2018;Sherill-Rofe et al., 2019). The method generally ignores evolutionary relationships, for example by using simple mutual information as an index of correlation, and such ignorance is known to decrease prediction performance (Kensche et al., 2008). Previous studies showed that the prediction performance can be improved by observing correlation patterns of gene gain/loss events in the reconstructed gene-content evolutionary history instead of gene occurrence patterns in extant species (Barker et al., 2007;Moi et al., 2020;Ta et al., 2011). Precise reconstruction of the gene-content   The bold letter means that genes in the supercategory are likely to appear in top 10% gene families compared to the whole gene families COG category symbols: C, energy production and conversion; E, amino acid transport and metabolism; F, nucleotide transport and metabolism; G, carbohydrate transport and metabolism; H, coenzyme transport and metabolism; I, lipid transport and metabolism; P, inorganic ion transport and metabolism; Q, secondary metabolites biosynthesis, transport, and catabolism. The bold letter means that genes in the category are more than 1.2 times more likely to appear in top 10% gene families compared to the whole gene families. evolutionary history by Mirage would contribute to the improvement of the phylogenetic profiling method.
With the present Mirage implementation, it is still difficult to reconstruct gene-content evolutionary histories of all genomesequenced species to the last universal common ancestor of life because of the huge computation time required. Thus, improving the computation time of Mirage is essential. In particular, application of the series acceleration method, which improves the convergence rate of a series, to the iteration steps of the EM algorithm seems promising. Specifically, the vector-acceleration technique, which does not require derivation of acceleration formula for each statistical model, may be readily applied to Mirage (Kuroda and Sakakihara, 2006). Another powerful approach would be a partitioning method, which does not use probabilistic but deterministic assignment of gene families to each gene cluster in the mixture model. This method has been widely used in molecular evolutionary analyses, but not in genecontent evolutionary analyses (Brown and Lemmon, 2007;Frandsen et al., 2015;Lanfear et al., 2017). Although the partitioning method can be less accurate due to the deterministic approximation, its computational efficiency would be high.
Although Mirage can model differences in the evolutionary rates among gene-content clusters, it assumes the same evolutionary rate among all branches of the phylogenetic tree. However, this assumption does not always hold true. For example, polyploidization events cause massive gene gains (Inoue et al., 2015;Sriswasdi et al., 2016), and parasitization events cause massive gene losses (Sun et al., 2018). Moreover, heterogeneity of evolutionary rates among branches may also be caused by changes in survival strategies (Sriswasdi et al., 2017) or large-scale extinction events (Wolf and Koonin, 2013). Although various programs for modeling heterogeneity among branches have been developed (Han et al., 2013;Iwasaki and Takagi, 2007;Zwaenepoel and Van de Peer, 2020), there are no software that can take various gene gain/loss models and heterogeneity models into account. Therefore, the expansion of Mirage in this direction would be needed to deepen our understanding.