Measuring the relative contribution to predictive power of modern nucleotide substitution modeling approaches

Summary
Traditional approaches to probabilistic phylogenetic inference have relied on information-theoretic criteria to select among a relatively small set of substitution models. These model selection criteria have recently been called into question when applied to richer models, including models that invoke mixtures of nucleotide frequency profiles. At the nucleotide level, we are therefore left without a clear picture of mixture models’ contribution to overall predictive power relative to other modeling approaches. Here, we utilize a Bayesian cross-validation method to directly measure the predictive performance of a wide range of nucleotide substitution models. We compare the relative contributions of free nucleotide exchangeability parameters, gamma-distributed rates across sites, and mixtures of nucleotide frequencies with both finite and infinite mixture frameworks. We find that the most important contributor to a model’s predictive power is the use of a sufficiently rich mixture of nucleotide frequencies. These results suggest that mixture models should be given greater consideration in nucleotide-level phylogenetic inference.


Introduction
Markovian models of nucleotide substitution have now been in development for over 50 years. Starting from the first simple idea of Jukes et al. (1969), modeling substitution events as a Poisson process, this history has been one in which the assumptions of existing models are progressively relaxed in order to better capture the features of molecular evolution when conducting phylogenetic inferences. While many innovations have been proposed, practical and computational limitations have constrained the set of widely used models to those of the general time reversible (GTR) family (Tavaré 1986). The GTR model is parameterized by two distinct sets of parameters: four nucleotide frequency parameters (three degrees of freedom), denoted π = (π_i), 1 ≤ i ≤ 4, and summing to 1, as well as six nucleotide exchangeability parameters (five degrees of freedom), denoted q = (q_ij), 1 ≤ i, j ≤ 4. Together, these parameters specify the rates of the Markov generator governing the nucleotide substitution process along the branches of the phylogeny, Q_ij = q_ij π_j, which is to say that the instantaneous rate from nucleotide i to nucleotide j is given as the product of the relative exchangeability of nucleotides i and j and the frequency of the target nucleotide j. Special cases of the GTR model can be obtained by constraining either set of parameters, or both; for instance, the F81 model (Felsenstein 1981) is obtained by setting all q_ij values to 1, and the JC model (Jukes et al. 1969) by further setting all π_i values to 1/4. Choosing which model from this family to apply to a given dataset was a common concern under the small sample sizes of the early days of probabilistic phylogenetic inference. Thanks to the availability of practical tools (e.g.
Posada and Crandall 1998), a common protocol was to select one of the special cases of the GTR model based on information-theoretic criteria such as the AIC (Akaike 1974) or BIC (Schwarz 1978), and then conduct the phylogenetic inference per se utilizing the chosen model. By the turn of the century, however, as datasets came to incorporate more and more sequences of greater length, these and other model selection methods began to select the GTR substitution model as the most warranted for nearly all datasets (Sumner et al. 2012).
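The parameterization above can be illustrated concretely. The following is a minimal sketch (not from the study; the function and variable names are ours) that builds a normalized GTR generator Q_ij = q_ij π_j and recovers the F81 and JC special cases:

```python
import numpy as np

def gtr_generator(pi, q):
    """Build a GTR rate matrix with off-diagonal entries Q_ij = q_ij * pi_j.

    pi: nucleotide frequencies (length 4, summing to 1).
    q:  symmetric exchangeabilities, given as a 4x4 matrix with zero diagonal.
    """
    pi = np.asarray(pi, dtype=float)
    Q = q * pi[np.newaxis, :]            # Q_ij = q_ij * pi_j for i != j
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # rows of a generator sum to zero
    # normalize to one expected substitution per unit time at stationarity
    rate = -np.dot(pi, np.diag(Q))
    return Q / rate

# F81: all six exchangeabilities set to 1
q_even = np.ones((4, 4)) - np.eye(4)
pi = np.array([0.3, 0.2, 0.2, 0.3])
Q_f81 = gtr_generator(pi, q_even)

# JC: additionally, uniform frequencies of 1/4
Q_jc = gtr_generator(np.full(4, 0.25), q_even)
```

With the uniform JC inputs, every off-diagonal rate equals 1/3 after normalization, and in all cases π is a stationary distribution of the resulting generator.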
Along the way, the importance of accounting for heterogeneity in overall rates across sites became well understood, thanks to the gamma distribution approach of Yang (1993, 1994). The general idea behind the gamma-distributed rates model is to consider that different sites of the alignment have different branch length multipliers, i.e. positive real values that average 1 across sites; loosely speaking, branch lengths seem shorter if the multiplier is less than 1, and longer if it is greater than 1. The distribution across sites of these branch length multipliers follows a gamma law, constrained to a mean of 1, with a variance given by 1/α, where α is itself considered a free parameter of the inference. The likelihood function then takes the form of an integral over the gamma law, usually approximated by discretization into a sum over four categories of rates. The gamma-distributed rates-across-sites model is also now nearly universally utilized on modern datasets. Recent research now often also accounts for pattern heterogeneity in the substitution process. Rather than invoking a single GTR model, the basic idea behind capturing pattern heterogeneity is to invoke several, either through a fixed-effect preset allocation of models to sites of the alignment (i.e. partitioning methods, Lanfear et al. 2012), or through a random-effects mixture modeling strategy (e.g. Pagel and Meade 2004). Within the random-effects mixture approaches, one considers that the sites of a multiple sequence alignment were generated with several different Q matrices. The number of Q matrices, as well as the proportion of sites considered to have been generated by each, is generally unknown. This constitutes another model selection challenge.
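The four-category discretization can be sketched as follows (our own illustration using SciPy, not the authors' code); each category rate is the mean of one quartile of a unit-mean gamma distribution, computed with a standard incomplete-gamma identity:

```python
import numpy as np
from scipy.stats import gamma
from scipy.special import gammainc

def discrete_gamma_rates(alpha, ncat=4):
    """Mean rate in each of ncat equal-probability categories of a
    unit-mean gamma distribution with shape alpha (variance 1/alpha)."""
    # category boundaries at the 1/ncat, 2/ncat, ... quantiles
    cuts = gamma.ppf(np.arange(1, ncat) / ncat, a=alpha, scale=1.0 / alpha)
    bounds = np.concatenate(([0.0], cuts, [np.inf]))
    # partial expectation E[X; a < X < b] = P(alpha+1, b*alpha) - P(alpha+1, a*alpha),
    # where P is the regularized lower incomplete gamma function
    upper = gammainc(alpha + 1.0, bounds[1:] * alpha)
    lower = gammainc(alpha + 1.0, bounds[:-1] * alpha)
    return ncat * (upper - lower)  # divide each slice's mass by its 1/ncat probability

rates = discrete_gamma_rates(alpha=0.5)  # small alpha: strong rate heterogeneity
```

By construction, the discrete rates average exactly 1, preserving the unit-mean constraint of the continuous gamma law.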
The potential richness of these partitioning and mixture approaches is a departure from the models typically considered during the development of selection criteria such as the AIC and BIC, which were comparatively low dimensional. With high-dimensional models, these information criteria have been shown to decrease in reliability (e.g. Yang and Barron 1998, Broman and Speed 2002, Brewer et al. 2016, Liu et al. 2023). Such findings may partly explain some of the conflicting results in studies of partitioning methods based on these criteria (e.g. Brown and Lemmon 2007, Ho and Lanfear 2010, Cameron et al. 2012, Leavitt et al. 2013, Kainer and Lanfear 2015). In contrast, recent work by Lartillot (2023) has shown that some Bayesian cross-validation methods, such as the sitewise cross-validation method, can be reliable even for high-dimensional mixture models.
Here, we apply the sitewise Bayesian cross-validation score to compare a large set of nucleotide substitution models. Our objective is to quantify the relative importance of modeling three aspects of nucleic acid sequence evolution. First, we measure the importance of accounting for uneven exchangeabilities across the six possible pairs of nucleotides, by contrasting the simple F81 model (Felsenstein 1981) and the GTR model. Second, we quantify the importance of modeling the heterogeneity of overall rates, by contrasting a model with homogeneous rates against the gamma-distributed rates model of Yang (1994). Third, we evaluate the importance of capturing pattern heterogeneity, by contrasting single-matrix models against mixture models. The mixture models we consider have components that are distinguished by their own nucleotide frequency parameters, while all other parameters are common across components. Whereas nucleotide-level mixture models of this type have usually been limited to finite mixtures of only a few components (typically between 2 and 4), we explore finite mixtures up to 100 components, as well as infinite mixtures based on the Dirichlet process. These models are the nucleotide-level versions of the CAT models at the amino acid level (Lartillot and Philippe 2004). We also explore every combination of all three modeling approaches, to assess potential synergies or redundancies.
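For intuition on the Dirichlet process underlying the infinite mixtures, the component weights can be illustrated with a truncated stick-breaking construction (a self-contained sketch with names of our choosing; this is not the sampler used in the study):

```python
import numpy as np

def stick_breaking_weights(concentration, n_comp, rng):
    """Truncated stick-breaking draw of Dirichlet process mixture weights:
    v_c ~ Beta(1, concentration), w_c = v_c * prod_{j<c} (1 - v_j)."""
    v = rng.beta(1.0, concentration, size=n_comp)
    v[-1] = 1.0  # truncate so the weights sum to exactly 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(2)
weights = stick_breaking_weights(concentration=1.0, n_comp=50, rng=rng)
```

A small concentration parameter puts most of the weight on a few components, while a larger one spreads weight over many; this is how the process adapts the effective number of frequency profiles to the data rather than fixing it in advance.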

Data
Four multigene datasets were studied here, along with four single-gene datasets. These datasets were chosen for their high quality, completeness, and open-access availability.

Models
We explore the following models:
• F81: The simplest model considered, built from a single set of nucleotide frequency parameters (Felsenstein 1981).
• GTR: A model built from a single set of nucleotide frequency parameters and pairwise nucleotide exchangeability parameters (Tavaré 1986).
• CAT-Poisson: A model built from a Dirichlet process on nucleotide frequency parameters, capturing pattern heterogeneity across sites, but without invoking nucleotide exchangeability parameters (Lartillot and Philippe 2004); the model can be viewed as an infinite mixture of F81 models.
• CAT-GTR: As before, a model built from a Dirichlet process on nucleotide frequency parameters, along with a set of nucleotide exchangeability parameters (Lartillot and Philippe 2004).
• CAT_{f=x}-GTR: A finite mixture version of CAT-GTR with x components (from 2 to 100, with CAT_{f=1}-GTR being the GTR model), along with a set of nucleotide exchangeability parameters. The number of components we explore includes every integer up to 10, then every tenth value until reaching 100.
• F81+Γ: The F81 model with gamma-distributed rates across sites, discretized into four categories of rates (Yang 1994).
• GTR+Γ: The GTR model with gamma-distributed rates across sites; this model serves as the reference in our overall model ranking.
• CAT-Poisson+Γ: The CAT-Poisson model with gamma-distributed rates across sites.
• CAT-GTR+Γ: The CAT-GTR model with gamma-distributed rates across sites.
To our knowledge, the CAT-inspired models had never previously been studied at the nucleotide level, as done in the present work. We used the default priors in PhyloBayes (Lartillot et al. 2009) for all models.
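Common to all the mixture models above, the likelihood of an alignment column is a weighted average over components. A toy sketch (the per-component likelihood values are hypothetical placeholders, not computed from a real tree):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical per-component site likelihoods p(D_n | component c):
# rows are alignment columns, columns are mixture components
site_lik = rng.uniform(0.01, 0.1, size=(200, 5))
weights = np.full(5, 1.0 / 5)  # mixture weights summing to 1

# each column's likelihood is a weighted average over components;
# the alignment log-likelihood sums over columns
mixture_site_lik = site_lik @ weights
log_lik = float(np.log(mixture_site_lik).sum())
```

Under a Dirichlet process, the number of components and their weights are themselves random and sampled, but each sampled configuration contributes to the site likelihood in this same averaged form.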

Bayesian cross-validation
We carried out a 5-fold cross-validation protocol on each multiple sequence alignment (Bujaki and Rodrigue 2022). The protocol is as follows:
1) Subdivide the dataset into a testing set, made from one fifth of the alignment columns drawn at random, and a learning set, made from the remaining four fifths of the alignment columns.
2) For a given model, a Bayesian Markov chain Monte Carlo algorithm is used to sample 1500 draws from the posterior distribution on parameters, conditional on the learning set; discarding the first 500 cycles as burn-in leaves a sample of K = 1000 draws from the posterior distribution. A particular set of parameter values from this sample is denoted θ_k, where 1 ≤ k ≤ K.
3) Writing the likelihood score at site n of the testing set as p(D_n | θ_k), the site cross-validation score is taken as the average site likelihood over the entire sample of 1000 draws, and the overall sitewise cross-validation score is the sum over sites of the natural logarithm of the site cross-validation score, written as:

cv = Σ_n ln[ (1/K) Σ_k p(D_n | θ_k) ]

4) Repeat steps 2 and 3 for all models.
5) Using the GTR+Γ model as a reference, we calculate the difference in cv-score between it and a model of interest (although any model can be used as a reference).
6) Repeat the entire protocol five times, reporting the mean and standard deviation across the five replicates.
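The score in step 3 is best computed in log space; a minimal sketch (our own helper, assuming per-site log-likelihoods are available for each posterior draw):

```python
import numpy as np
from scipy.special import logsumexp

def sitewise_cv_score(site_loglik):
    """site_loglik[n, k] = ln p(D_n | theta_k) for test site n, posterior draw k.

    Returns sum_n ln( (1/K) sum_k p(D_n | theta_k) ), the sitewise cv-score.
    """
    K = site_loglik.shape[1]
    return float(np.sum(logsumexp(site_loglik, axis=1) - np.log(K)))

# toy example: 100 test sites, K = 1000 posterior draws
rng = np.random.default_rng(1)
ll = rng.normal(-8.0, 0.5, size=(100, 1000))
score = sitewise_cv_score(ll)
```

Note that the average is taken over likelihoods, not log-likelihoods, which makes this a posterior predictive score; the logsumexp trick avoids underflow when individual site likelihoods are very small.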
We used PhyloBayes (Lartillot et al. 2009) for all calculations.


Results and discussion

Finite and infinite mixtures
We first ran our model comparisons on multigene datasets. Figure 1 shows the cross-validation results relative to the GTR+Γ model for a multigene dataset taken from Regier et al. (2010). The graph highlights how the finite mixture models' scores climb rapidly with an increasing number of components. The cross-validation score of finite mixtures appears to plateau by around 30 components; increasing the number of components beyond this point does not increase the cross-validation score. Interestingly, a model built solely on a mixture of nucleotide frequency parameters (i.e. with even exchangeability parameters and homogeneous rates across sites) outperforms the reference GTR+Γ model. The best-performing models are those that combine a sufficiently rich mixture with free exchangeability parameters as well as gamma-distributed rates. The variation across cross-validation replicates does not allow for a clear distinction between the rich finite mixture models and the infinite (CAT) models. In other words, for rich mixtures, the sampling error induced from splitting the original data into learning and testing sets (see Methods) leads to average cross-validation scores that are within each other's error margins. These models are thus very similar in overall performance. Table 1 shows that these results generalize across multigene datasets (reporting results for 100-component finite mixtures). From a practical standpoint, it may be simplest to utilize the infinite mixture model (CAT), rather than repeatedly applying finite mixtures with progressively more components until reaching a plateau, given that it automatically adapts to the heterogeneity within the data. These numbers can be interpreted as follows: using the Regier data and comparing the CAT-GTR+Γ model to GTR+Γ, the per-site cv-score is 18,941/41,976 = 0.451, making the average data column e^0.451 ≈ 1.57 times better explained by CAT-GTR+Γ.
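The per-site interpretation above is simple arithmetic; as a check, using the values quoted in the text:

```python
import math

delta_cv = 18941  # cv-score difference, CAT-GTR+Γ minus GTR+Γ (Regier data)
n_sites = 41976   # number of alignment columns in the testing sets combined

per_site = delta_cv / n_sites               # ~0.451 nats per column
bayes_factor_per_site = math.exp(per_site)  # ~1.57x better predicted per column
```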

Single matrix models
Unsurprisingly, the GTR model always outperforms the F81 model, across all datasets. Given the size of the multigene datasets (see Methods), the six (five degrees of freedom) nucleotide exchangeability parameters can be reliably inferred, and it is well understood that nucleotides of any given pair are not evenly exchangeable as assumed by the F81 model. We have not explored other forms of single-matrix models, given that the relative difference between F81 and GTR is modest when either of them is compared to the mixture models we consider here.

Gamma-distributed rates across sites
Also well established is the fact that the gamma-distributed rates model always outperforms its homogeneous counterpart. Indeed, this modeling strategy introduces a single new parameter controlling the shape of the unit-mean gamma distribution, which can be reliably inferred even from single-gene datasets. It is also fully expected that positions across the alignments we study will exhibit variation in their overall substitution rates, given the variable selective pressures under which they have evolved.

Contrasting modeling approaches
Still focusing on multigene datasets, Table 2 shows the difference in cross-validation scores between modeling approaches, rather than comparing all models against GTR+Γ. Reporting results in this way highlights the relative contribution of each approach to predictive power; while it is well established that each of the modeling approaches reported above improves model performance, we still lack quantitative measurements of which of them makes the greatest contribution.
The mixture modeling approach consistently leads to the largest difference in cross-validation score relative to its single-matrix counterpart. Contrasting the F81 model, which has a single nucleotide frequency vector as its parameterization, against the CAT model, which is based on an infinite mixture of nucleotide frequency vectors, shows the largest improvement in cross-validation score brought about by modeling pattern heterogeneity. When contrasting GTR with CAT-GTR, we still see a large improvement in cross-validation score, albeit not quite as large as F81 versus CAT. This suggests that there may be a slight overlap between the GTR and CAT modeling approaches; indeed, a mixture of nucleotide frequency parameters can have the effect of modulating pairwise exchanges between nucleotides. Comparing F81+Γ against CAT+Γ still shows a large improvement, although it is less than the F81 versus CAT comparison. This again suggests that the modeling approaches have a level of overlap; a mixture of nucleotide frequency parameters can induce variations in overall rates across sites. An example of this would be having nucleotide frequency vectors that are dominated by a single nucleotide, which would lead to low overall rates of substitution, and other frequency vectors that are even across the four nucleotides, which would lead to comparatively higher overall rates of substitution. The mixture modeling approach produces the smallest (though still relatively high) improvement in model performance in the GTR+Γ versus CAT-GTR+Γ context.
Table 2 also shows that the gamma-distributed rates-across-sites approach is the second biggest contributor to a model's performance. Its contribution is of a lesser magnitude when coupled to a richer model, as expected from the partial overlap in capturing rate heterogeneity across the different modeling approaches.
This is also the case for the GTR modeling approach, which yields the least improvement in performance of the three approaches. Some of the smallest increases in cross-validation scores are seen in the CAT+Γ versus CAT-GTR+Γ comparisons, as well as CAT versus CAT-GTR.

Single-gene alignments
In order to study the relative model performance on smaller datasets, we performed the same analyses on single-gene alignments. Figure 2 shows the results of the cross-validation model comparisons with the RBCL gene. Overall, the same general trends observed with multigene datasets are found in the single-gene context. However, the 5-fold cross-validation protocol leads to a relatively higher variance in scores across the five replicates. This can be seen from the error bars, which are much larger relative to the differences in cross-validation score than for results on multigene datasets. This is to be expected with smaller, single-gene alignments. Beyond the small size of the datasets, the error bars across the five cross-validation replicates can be attributed to even smaller learning sets (i.e. even less evolutionary signal from which to learn parameter values), and very small testing sets.
Another notable feature of Fig. 2 is the slightly decreasing trend in cross-validation scores for finite mixture models beyond 20 components. Although the error bars are large, preventing us from clearly distinguishing between these different mixture models, we believe this reflects the intrinsic penalty induced by the cross-validation protocol: over-parameterized models lead to a decrease in predictive power, with parameter values that become too specific to the dataset under study. The Dirichlet process-based CAT-inspired models tend to perform as well as the best finite mixture versions. As before, they may provide a more convenient random-effects method of capturing pattern heterogeneity.
Table 3 summarizes the results across all single-gene alignments, again with cross-validation scores relative to the GTR+Γ model. For finite mixture models, the results with 20 components are reported, which is where their performance was best. The same overall model rankings found with multigene alignments are found with all single-gene alignments. Note that a finite mixture with 20 components is a much richer model than traditional partitioning schemes used for protein-coding data, which are usually limited to three models, assigned to first, second, and third codon positions, respectively. This result suggests that pattern heterogeneity at the nucleotide level goes beyond the simple coding nature of the datasets.
The relative contribution of each modeling approach again holds for single-gene datasets, as reported in Table 4. It is important to note that the cross-validation approach we utilize inherently tends to disfavor the richest models. This is because we are utilizing the cv-score as a measure of the predictive power of the model for the dataset as a whole, but we only present the model with four fifths of the data. Since the richest models may need more data to reliably infer parameters than simpler models, depriving them of the full dataset may have a greater impact on their performance. In other words, the true improvement in predictive power of the CAT-inspired models over homogeneous models is likely to be greater than what is reported herein.

Conclusions and future directions
Early methods for comparing the predictive power of nucleotide substitution models were based on information-theoretic criteria. These criteria compared the likelihood scores obtained under a pair of models, but also imposed a greater penalty on the higher-dimensional model. This penalty was meant to ensure that the values of additional parameters were not too specific to the dataset under study, and would generalize well to other datasets. The main advantage of these criteria was their ease of use on small datasets. Cross-validation methods, on the other hand, require larger amounts of data; one must have data to spare (i.e. not used in model fitting) in order to actually test the predictive power of a model on previously unseen data. Indeed, with our single-gene datasets, setting aside a fifth of the data leaves us with a partial gene on which to learn parameter values. The large error bars displayed in Fig. 2 would likely prevent one from clearly distinguishing between some of the models of the GTR family using cross-validation approaches. However, methods based on cross-validation may be important in comparing models high in dimensionality (Liu et al. 2023). Future work should explore other cross-validation approaches, including the leave-one-out scheme.
While the ideas behind the nucleotide substitution modeling approaches studied herein have existed for decades, no careful measurements of their relative contribution to predictive power had yet been conducted, to the best of our knowledge. Our study shows that modeling pattern heterogeneity is by far the most important contributor to a nucleotide substitution model's performance, a result consistent with amino acid-level models (Bujaki and Rodrigue 2022). Of course, these models are parameter-rich in comparison to single-matrix models, but this increased richness is accounted for intrinsically by the cross-validation framework. Given these results, more in-depth studies on the various ways of modeling pattern heterogeneity seem warranted. Partitioning approaches may capture some of the pattern heterogeneity within a dataset, but they are limited to fixed-effect heterogeneity across partitions, which usually consist of genes and/or codon positions. The random-effects mixture models studied here can capture more subtle variation across all positions at once. Still, quantifying the difference in performance between partitioning methods and the random-effects methods studied herein is a pressing task.
Other models that could be incorporated into the comparisons conducted here include: mixtures of nucleotide exchangeabilities (Pagel and Meade 2004), mixtures of gamma distributions (Mayrose et al. 2005) or Dirichlet process approaches (Huelsenbeck and Suchard 2007) for a more subtle account of overall rate heterogeneity across sites, and models accommodating heterogeneity across the branches of the phylogeny (e.g. Blanquart and Lartillot 2006). The typical focus of such works was generally to show that the proposed modeling idea had merit over a null model that ignores the feature being captured. Such a focus is an important step in model development. We have emphasized here, however, that the larger model development program is greatly informed by quantitative comparisons of a new modeling idea relative to the broad classes of modeling approaches that are currently widely available. Historically, most nucleotide-level phylogenetic analyses have utilized a protocol to select a substitution process from the GTR family, along with an account of overall rate heterogeneity, while ignoring the pattern heterogeneity across gene alignments, or doing so in only a limited manner. The CAT-inspired nucleotide substitution models can flexibly accommodate pattern heterogeneity, at only a slight increase in computational cost. Thanks to the data-augmentation-based Markov chain Monte Carlo methods utilized in the PhyloBayes package (Lartillot 2006), the implementation of CAT-GTR+Γ requires only about 1% more computational time than GTR+Γ in the case of the largest dataset studied herein (Regier). We thus suggest that the CAT-inspired models be incorporated in model selection protocols, given that their quantitative difference in predictive power largely overshadows any differences across GTR-family substitution matrices or gamma-distributed rates across sites.

[Figure 2 caption: Cross-validation scores (relative to the GTR+Γ model) for the RBCL dataset, plotted as a function of the number of nucleotide frequency components; models consisting of special cases with a single component are labeled near x = 1.]
[Table 3 caption: Cross-validation scores relative to the GTR+Γ model with single-gene datasets.]