A Novel Test for Absolute Fit of Evolutionary Models Provides a Means to Correctly Identify the Substitution Model and the Model Tree

Abstract A novel test is described that visualizes the absolute model-data fit of the substitution and tree components of an evolutionary model. The test utilizes statistics based on counts of character state matches and mismatches in alignments of observed and simulated sequences. This comparison is used to assess model-data fit. In simulations conducted to evaluate the performance of the test, the test estimator was able to identify both the correct tree topology and substitution model under conditions where the Goldman–Cox test—which tests the fit of a substitution model to sequence data and is also based on comparing simulated replicates with observed data—showed high error rates. The novel test was found to identify the correct tree topology within a wide range of DNA substitution model misspecifications, indicating the high discriminatory power of the test. Use of this test provides a practical approach for assessing absolute model-data fit when testing phylogenetic hypotheses.


Introduction
Substitution model misspecification contributes greatly to phylogenetic uncertainty (Gaut and Lewis 1995;Inagaki et al. 2004;Jermiin et al. 2004;Gruenheit et al. 2008;Sheffield et al. 2009;Nesnidal et al. 2010;Betancur-R et al. 2013;Goremykin et al. 2013Goremykin et al. , 2015Xi et al. 2014) and can lead to phylogenetic error in reconstructed relationships (Lockhart et al. 1996;Bruno and Halpern 1999;Buckley 2002;Lartillot et al. 2007). However, direct estimations of model-data fit are very rarely used in phylogenetic practice. In the vast majority of exploratory phylogenetic studies, only the relative fit of different substitution models to each other is estimated (e.g., via the hierarchical likelihood ratio tests: Frati et al. 1997;Sullivan et al. 1997; via Akaike or Bayesian information criterionbased comparisons, respectively : Akaike 1974;Schwarz 1978; or via cross-validation: Lartillot et al. 2007). This practice occurs, even though there are strong indications that assessment of model-data fit based on direct comparison of observed data and parametric replicates is more reliable for assessing phylogenetic accuracy than are measures of relative model fit (Waddell et al. 2009;Grievink et al. 2010).
Absolute model fit indicators compare simulated and empirical data, and the measures of resemblance used by different approaches are diverse. A characteristic trait of absolute model fit indicators is that the best possible model has a fit of zero. All indicators assess whether an evolutionary model is likely to well predict observed data properties. A weakness of current methods is that the statistics employed are often insufficient to distinguish between models. The most widely used statistics are the frequentist Goldman-Cox (GC) test (Goldman 1993) and the Bayesian posterior predictive model checking approach (Rubin 1984;Bollback 2002). Both methods utilize multinomial likelihood-based statistics to assess absolute model fit (Althoff et al. 2006;Lanfear and Bromham 2008;Grievink et al. 2010;Anisimova et al. 2011;Ekman and Blaalid 2011;Nguyen et al. 2012;Reid et al. 2014;Kitahara et al. 2014;Kaehler et al. 2015;Duchêne et al. 2016). These two tests assess overall model fit, and not specific data features such as composition and saturation (Foster 2004;Duchêne et al. 2016). These tests are also similar in the sense that they check the ability of a model to correctly predict the distribution of frequencies of site patterns in the observed data.
The statistics of the frequentist GC test examine the difference between the model-based likelihood and the multinomial likelihood (the product, across all alignment positions, of frequencies of site patterns). Although multinomial likelihoods are simple to calculate they have also been reported to lack discriminatory power (Bollback 2002;Foster 2004;Waddell et al. 2009;Ripplinger and Sullivan 2010;Duchêne et al. 2016). Lewis et al. (2014) have noted that multinomial likelihoodbased statistics "depend[s] only on the counts associated with site patterns and not the patterns themselves. Thus, two data sets that are extremely different can potentially have identical numerical values for this statistic." Pointing out the irrelevance of information contained in the site patterns for the multinomial likelihood inference, Lewis et al. (2014) have instead suggested a modification of the Gelfand and Ghosh (1998) test statistic (GG) for assessment of similarity between observed data and replicates simulated on an evolutionary model. An advantage of using this statistic is that it evaluates accuracy in prediction of data patterns directly (Lewis et al. 2014). Theoretically, the GG test statistic as applied by Lewis et al. (2014) can evaluate model-based predictions of distinct site pattern frequencies. Practically, the test requires that the same site pattern must occur at least once in the observed data and in each simulated replicate in order to avoid calculation of logarithm of zero (as explained in detail in the section "Estimation of Substitution Model Fit"). Because this is unlikely to always occur, Lewis et al. (2014) group individual patterns into categories composed of A, C, G, T, AC, AG, AT, CG, CT, GT, ACG, ACT, AGT, CGT, and ACGT character states, since these categories are more likely to be encountered in all alignments.
To further explore the potential of using phylogenetic information contained in alignment site patterns, a novel pattern-sensitive frequentist test is proposed for assessing absolute model-data fit. This test incorporates GG statistics, albeit based on different data properties. The discriminatory power of this novel test for model-data fit is compared with the frequentist GC test. The performance of these tests has been examined for a number of DNA evolutionary models. Because phylogenetic analyses of biological data always involve a degree of model misfit, assessment of the tests has considered model-data fit when the substitution model component of the evolutionary model was misspecified. The novel test metric described was found to be sufficiently sensitive to identify the correct tree topology within wide range of DNA substitution model misspecification. The high discriminatory power of the novel test and its robustness to substitution model misspecification encourage its use for hypothesis testing in phylogenetics.

Estimation of Substitution Model Fit
In order to perform the test of substitution model fit, counts are first made of pairwise aligned character states (e.g., A-A, A-C, A-G, A-T, C-C, C-G, C-T, G-G, G-T, T-T) in each multiple sequence alignment analyzed. For a multiple sequence alignment (henceforth referred as msa) which has n sequences and k columns the counts of character state alignments are calculated in the following way: wherein x2{A, C, G, T}, y2{A, C, G, T}, S ia and S ja are the character states at site a for sequences i and j, respectively, and 1 v ¼ 1 if v is true and 0 otherwise. The alignments compared should not contain gaps and ambiguous character states.
The counts of pairwise aligned character states (eq. 1) are quantities which can be expected to be predicted well by a fit substitution model and poorly when a substitution model is misspecified. They represent the basic type of information used by the proposed test. The test statistic is calculated by comparing the counts of pairwise aligned character states in the data, which have evolved under an unknown model of evolution (henceforth referred as the "empirical model," EM), to the counts in replicates simulated under an explicit evolutionary model (henceforth referred as the "simulation model," SM).
The GG criterion for model fit is a sum of two components, one related to goodness-of-fit (GGg) and the other related to variance (GGp): For the purposes of conducting the test, the default GG parameters as recommended in Lewis et al. (2014) following Ibrahim et al. (2001) and Chen et al. (2004) are used. If the sum of all counts of pairwise aligned character states in each msa included in the test is s, and the number of replicates generated is q, then and . Each of the four t functions in equations (3) and (4) is computed as: wherein x represents the corresponding function number as shown in equations (3) and (4), d is an alphabet-specific number of distinct character state alignments and C is a variable indicating the values for counts of character state alignments. For t 1 , the C variable indicates the values for character state alignment counts in each replicate. For t 2 , the C variable indicates the mean value over counts of each such alignment in the distribution of replicates. For t 3 , the C variable indicates the counts of pairwise character state alignment in the observed msa. In the t 4 function, the C variable indicates the mean value for each character state alignment over two corresponding values used in the functions t 2 and t 3 . Although originally suggested for Bayesian analyses (Gelfand and Ghosh 1998), these statistics in fact can also be expected to function well in a frequentist analytical framework. In the case of no variation in the counts among simulated replicates, the indicator of variance (GGp) would equal zero. In addition, a perfect fit (GGg ¼ 0) occurs when the mean values predicted by the SM do not deviate from those of the empirical model. Obviously, such definitions of fit and variance should hold true for comparisons of observed data with parametric replicates generated under maximum likelihood models too.
Since use of the full GG statistic (GG ¼ GGgþGGp) offered no sizable advantage over using only the GGg function in the experiments conducted here, the proposed test uses GGg function as a default option. The full GG statistic, although not recommended, can also be optionally employed.
An important advantage of the GGg-based test of substitution model fit presented here over the test suggested in Lewis et al. (2014) is that the GGg-based test can be performed under conditions where the Lewis et al. test cannot be applied. Regardless of the data properties compared, Gelfand and Ghosh statistics require each C variable in equation (5) to have a positive value in order to avoid calculation of logarithm of zero. In the case of Lewis et al. test, the C variable indicates counts for 15 site pattern categories which contain the following character states: A, C, G, T, AC, AG, AT, CG, CT, GT, ACG, ACT, AGT, CGT, and ACGT. These categories should be present in observed data and in each replicate, otherwise the test cannot be performed. The test presented here requires the following ten character state alignments (in any combination of character states, since C xy ¼ C yx [eq. 1]): A-A, C-C, T-T, G-G, A-C, A-G, A-T, C-G, C-T, G-T to cooccur in the observed data. They must also occur at least once in the distribution of replicates. Thus, the more replicates that are generated, the higher is the chance of failure in calculation of the Lewis et al. test  In preliminary experiments aimed at working out the scoring options in the presented study, I encountered program crashes which could be traced back to the problem of noncalculability of Lewis et al. test with large number of replicates and small alignments used for testing purposes. For this reason, this test cannot be performed on a number of published data sets (Nikiforova et al. 2013;Fu c ıkov a et al. 2016;Johnson et al. 2018;McManus et al. 2018). The alignments from these studies which cause calculation failure of the test (43 in total) are provided as Supplementary Material online. In contrast to the performance of Lewis et al. test, these alignments, ranging in length from 96 to 134,553 positions, fulfill the conditions required by the novel test presented here. Assessment of substitution model fit based on counts of pairwise alignments among the character states is illustrated with the following example. Let us consider the two trees shown in figure 1. In Tree I, taxon A is sister to taxon B, while in Tree II, taxon A is sister to taxon C, otherwise the trees are identical.
With the help of the INDELIBLE program (Fletcher and Yang 2009), a 10,000 positions long replicate has been simulated on Tree I. The simulation assumed a GTRþG model (with the numerical parameters taken from arbitrarily selected T01þGTRþG EM model used in table 2). The corresponding evolutionary model is referred to hereafter as the "empirical model 1," EM1. Two further replicates of the same length were generated under Tree I and Tree II assuming a GTR model with the same numerical parameters as the above GTRþG model. These evolutionary models assuming Tree I and Tree II have been termed simulation model 1 (SM1) and simulation model 2 (SM2), respectively. SM1 was used to illustrate a case where the tree and substitution model are correct, while SM2 illustrates a case where the tree is misspecified and the substitution model is correct. Replicates of the same length were also generated under Tree I and Tree II assuming a K80 model (with the numerical parameters taken from arbitrarily selected T01þK80 EM model used in table 2). These evolutionary models assuming Tree I and Tree II have been termed simulation model 3 (SM3) and simulation model 4 (SM4), respectively. SM3 was used to illustrate a case where the tree is correct and the substitution model is misspecified. SM4 illustrates a case where both the tree and substitution model were misspecified.
Counts of alignments among character states (eq. 1) in replicates representing EM1, and four SMs were calculated ( fig. 2). The results presented in figure 2 show similarity in the counts calculated for the replicates simulated under all models assuming a GTRþG substitution model component (EM1, SM1, and SM2). The values calculated for the models assuming a K80 substitution model component (SM3 and SM4) were also similar to each other. These were quite distinct from those calculated for GTRþG-based replicates for a majority of character state alignments. GGg test values calculated in comparisons of EM1 to SM1 (68.14), EM1 to SM2 (69.76), EM1 to SM3 (16,367.30), and EM1 to SM4 (16,135.76) allow visualization of the difference in datamodel fit due to substitution model misspecification.
The test component used for assessing overall substitution model fit is a mean over GGg values computed for a set of evolutionary models sharing a substitution model component: wherein w is a number of such models. The tree model component influences GGg values only slightly and randomly, and so this metric cannot be used to rank alternative tree components of an evolutionary model. For example, SM3 which shared the same tree with EM showed worse fit compared with SM4, an evolutionary model which assumed a wrong tree. In the example presented above, the A value (eq. 6) for a GTRþG model is calculated as (

Ranking Combinations of Each Substitution Model and Model Tree in Terms of Model-Data Fit
The metric presented above cannot be used to rank alternative tree components of an evolutionary model. However, by considering data partitions (groups of sequences) separately, a further statistic can be defined that will provide a means of accomplishing this task. To obtain this statistic, the counts of pairwise aligned character states are first transformed into frequencies. For each msa which has n sequences and k columns: wherein x2{A, C, G, T}, y2{A, C, G, T}, and C xy is the value estimated in equation (1). The frequencies of alignments of nonequal character states should be divided by a large constant positive value (P-factor) to produce the scoring matrix values. As explained later in this section, division of frequencies of character state mismatches by a large positive P-factor value improves the discriminatory power of the test. The scoring matrix values are given by equation (8): wherein P v ¼ P if v is true and 1 otherwise. The script which performs the calculation for the component of the presented test related to model tree fit (test_stage1.pl, available at: https://github.com/vadimgoremykin/absolute-model-data-fit) uses a default P-factor value ¼ 10,000, which was used in all experiments conducted here, unless stated otherwise. The scoring matrix values are used to estimate dissimilarity (D values) between subsets of taxa separated by internal branches in the model trees. D values provide a means to evaluate which fully resolved tree is most supported by the data. Furthermore, although the space of all possible trees can be large, competing hypotheses ofrelationshipoften onlyconcernafinitenumberof possible tree topologies, which are represented by a limited number of alternative internal branches. Thus, while there are 2 (nÀ1) Àn À1 splits or internal branches in a tree space with n taxa, D values do not need to be calculated for all splits in the possible tree space, but rather they need only be calculated for a set of splits that represents competing trees. D values can be used to rank tree models because their value is influenced by the treelike divergence of sequences.
D values for each internal branch are obtained by considering the average heterogeneity of m values for the node on one side of the branch being evaluated against the average heterogeneity of m values across the branch at each alignment position. A D value is calculated for each of the two internal tree nodes separated by the internal branch. The position-based D value for each node is calculated as the ratio of a mean m value (eq. 8) for character alignments within the taxon subset on a side of the branch corresponding to the node (termed here "node-specific taxon subset") at the given alignment column c (V1 c ) to the mean m value for alignments between these characters and all other characters at the same column (V2 c ).
The V1 c value is computed as the mean of the m values (eq. 8) for all possible alignments between the character states in the node-specific taxon subset at the column c.
Since m xy ¼ m yx , V1 c can be computed as a mean over the m values in an upper or lower triangular part of a square matrix which rows and columns represent OTUs included into the subset, populated with the m values for letter alignments among these OTUs, e.g.: wherein h is the number of taxa in the subset, and m x(i), y(j) is the scoring matrix value for the character states in column c (x and y) observed, respectively, in sequences i and j which belong to the subset. The value for V2 c is analogously computed as a mean of m values for the character state alignments between each character state observed in the node-specific taxon subset at the same column c and each character state not included in the subset at the column: where x(i) is a character in the column c observed at the ith OTU from the node-specific taxon subset which has a total of h OTUs, y(j) is a character in the column c observed at the jth OTU which belongs to the rest of OTUs, comprised of b sequences and m x(i), y(j) is the scoring matrix value for x(i) and y(j) character states. The D value for a node-specific taxon subset at the column c is a ratio of the corresponding V1 c value to the V2 c value: The D value for a node-specific taxon subset in a multiple sequence alignment of k positions is calculated as the arithmetical mean of all the corresponding position-based D c values: The D values are computed for all internal tree branches present in all the model trees representing alternative evolutionary hypotheses to be tested. For the test purposes, the trees should be fully resolved. The set of the D values (eq. 12) used for model fit assessment should not contain redundant estimates. Regardless of the frequency of occurrence of an internal branch in the model trees, the corresponding D values should be present in the set only once. A feature of this novel statistics which is utilized to identify the true tree topology is that D values that correspond to splits that are in the true tree will tend to be larger than D values for the same splits if they are not in the true tree. Employing P-factors in calculation of D values helps to emphasize this difference.
In designing a scoring function related to model tree fit, it was assumed that common character state(s) observed across a hypothesized branch at an alignment position indicate that the position (termed "X site") does not support the branch, but rather supports other branches or does not support any branch (e.g., is constant). In addition, it is assumed that a position at which common character state(s) are not shared across a hypothesized branch (termed "S site") supports the branch. In the case of S sites, the denominator in equation (11) is a mean over the m values assigned to mismatches only. The significance of this is that division of frequencies of mismatches by a P-factor value in equation (8) leads to relative increase of sizes of position-based D values for S sites compared with the corresponding values calculated for X sites. Consequently, the numerical size of a D value (eq. 12), which is a mean of the position-based D values calculated over all msa positions, would be largely determined by S sites. Under each substitution model, employing P-factors emphasizes the difference between a D value calculated in the case when S sites are abundant, which is expected if a branch is in the true tree, and the D value for the same node-specific taxon subset calculated when S sites are scarce. The latter is expected when the branch is not in the true tree. The difference between D values for the same node-specific taxon subsets expected under these alternative evolutionary scenarios provides the basis for discriminating between model trees as described below. The effect of using P-factors is illustrated in table 3. At high Pfactor values, there is greater resolution among models (table 3) while numerical test values (table 5) are similar.
The D values are calculated based on each individual msa, representing either EM or SM. The strength of rank correlation between a set of D values calculated based on an alignment representing the empirical model and a set of D values calculated for a SM-based replicate is measured using the Spearman correlation coefficient (r s ). The replicate-specific r s values are each estimated by comparing a set of D values for an individual msa representing EM to a set of D values for each replicate representing SM. They are converted to a normally distributed variable z by the Fisher transformation: In the rare case in which r s is >0.999999999999999, r s is set to this value to enable computation of the Fisher transformation in the Perl programming language. The arithmetical mean for a q-member distribution of z values is calculated as: The test component used for assessing evolutionary model fit to a msa representing EM is obtained by inverting the Fisher transformation for the mean value (eq. 14) and subtracting the resulting value from 1: The smaller is the resulting value, the better is SM-data fit.
Model tree selection based on D values is illustrated employing the replicates simulated on the trees shown in figure 1 and used to provide an example for the calculation of substitution model fit (eq. 6). D values were calculated based on each replicate for all the node-specific taxon subsets which could be sampled from the model trees ( fig. 3).
The results presented in figure 3A and B show that, under each distinct substitution model, each D value that corresponds to a split that is in the model tree is larger than the D value for the same node-specific taxon subset if the split is not in the model tree. D values for each of the AB and CDEF node-specific taxon subsets which are separated by a branch in the Tree I but not in Tree II are larger for the evolutionary models assuming Tree I compared with those assuming Tree II. D values for each of the AC and BDEF node-specific taxon subsets, which are separated by a branch in Tree II but not in Tree I, are smaller for evolutionary models assuming Tree I compared with those assuming Tree II.
Spearman rank correlation coefficients (r s ) were employed to assess the difference in model-data fit due to an incorrect tree. In the case of no substitution model misfit, comparison of identical evolutionary models (EM1 and SM1) yielded r s ¼ 0.98, and the comparison of models assuming Tree I and Tree II (EM1 and SM2, respectively) yielded r s ¼ 0.38. In comparisons involving GTRþG-based EM1 and evolutionary models that contained a substitution model component (K80 model) that was misspecified, comparison of models assuming the same set of correct branches (EM1 and SM3) yielded r s ¼ 0.98, and comparison of EM1 and SM4, assuming different sets, yielded a relatively lower value (r s ¼ 0.36).
These results illustrate the assessment of model-data fit when distinct SMs share the substitution model component but assume different trees. The highest degree of rank correlation can be expected when tree topologies are equivalent for the empirical model and the simulation evolutionary model. In this case, the sets of correct and wrong branches used for model fit assessment would be the same for the EM and SM. When the SM assumes a misspecified tree, each node-specific D value for the internal branch that is in the SM tree, but not in EM tree, is expected to be larger in the simulation than in the observed data. In contrast, each nodespecific D value for the internal branch that is in the EM tree, but not in the SM tree can be expected to be smaller in the simulation than in the observed data. The rank correlation coefficients provide a simple means to assess these shifts in D values in SM compared with EM.

Test Statistic
It should be mentioned that the proposed rank correlation analysis is not well-suited to distinguish substitution models which exhibit the same or similar rank correlation values and the same tree topologies. A combined test metric for evolutionary model fit (T) which is a product of two components (value A and value B, eqs. 6 and 15, respectively) is free from this limitation: . In the case of perfect fit the test value should approximate zero. In practice, a fit of 0 is rarely achieved even in comparison of identical models due to stochastic differences in replicate samples. In

Assessment of Resolution among Individual Models and Sufficiency of the Number of Replicates for Correct Model Identification
The discriminatory power of the proposed test has been assessed in experiments involving comparison of explicit models, where correct test outcomes were known. In such experiments, stochasticity in test estimates can arise due to the limited size of the SM-based replicate distribution (error source 1) and due to the choice of the replicate representing the "correct model" (which, e.g., can be an outlier, error source 2). In order to quantify the error rate associated with each specific experimental setup directly, a number of EMbased replicates were compared with each distribution of the SM-derived replicates. When both SM and EM have represented the same evolutionary model, the replicates compared were different.
Iterative comparison of EM-based replicates to a distribution of SM-based replicates can yield the mean test value over all the comparisons, which would be less dependent on the error source 2. Series of such comparisons can also show how well the test can discern one model from another. The parametric bootstrap-based estimate for model separation (MS) employed in this study is the percentage of times a preferred simulation model showed better fit to each EM-derived replicate in comparison to a wrong simulation model.
In the experiments aimed at estimation of the test's ability to separate each EM from every other model (table 1), the preferred evolutionary models had both components (tree and substitution model) identical to those of the EM. An obvious assumption of these experiments was that the test statistic used should be able to identify the correct model.
In the experiments aimed at identifying the correct tree topology when the SM substitution model was misspecified, the preferred models had tree topologies that were congruent with the correct tree model. In contrast, unfavored models had incorrect tree topologies under the same substitution models scheme (e.g., K80þG) as in the preferred models. The purpose of these experiments was to test the hypothesis that the specification of a correct tree topology will improve model-data fit in the case when the substitution model is misspecified. In addition, the mean test values for over 500 EM-SM comparisons were also used to assess if the specification of the correct tree topology led to the best estimates of model fit within the SM sets, each characterized by the same misspecified substitution model. Since a degree of model misfit is unavoidable in phylogenetic analyses of the biological data, estimations of the test's performance under increasingly severe model misspecification (table 2) provides another analytical perspective on the discriminatory power of the test metric.
It should be noted that the MS scores can be calculated for any test of model fit which is based on comparison of replicates. For example, in this study, the MS scores described in the above paragraph were also computed for the GC tests (table 4).
Increasing the number of replicates (both EM-and SMbased) in experiments aimed at calculation of MS scores can also give a researcher the means to judge whether the degree of model discrimination observed is due to the resolution limit of the method or rather due to stochasticity in replicate sampling (error source 1). Leveling of the MS values obtained in experiments involving incrementally increased replicate numbers would indicate that the former explanation is correct.

Assessment of Discriminatory Power for the Novel Test and GC Test
Estimators for model-data fit which use different site patternbased statistics were evaluated to determine whether they could identify evolutionary models used to simulate parametric replicates. Performance of the estimators was evaluated in instances where there was specification of the correct substitution model and also when there was substitution model misspecification.
Model trees with branch lengths and substitution model parameters for generation of replicates under arbitrarily selected GTRþG 4 , GTRþIþG 4 , GTR, HKYþIþG 4 , HKYþG 4 , HKY, K80þIþG 4 , K80þG 4 , and K80 substitution models were obtained from RaxML v. 825 (Stamatakis 2014)  The trees used as full topological constraints in all the searches performed based on the above-mentioned biological data set were arbitrarily selected to represent various hypotheses of the relationships among the land plants and their  T1  T2  T3  T4  T5  T6  T7  T8  T9  T10  T11  T12  T13  T14

(C) Estimates of Fit b of the Preferred Models to GTR1G-Based Empirical Models
The best-fitting model for the above data set selected by the ModelFinder pipeline (Kalyaanamoorthy et al. 2017) was a GTR-based model (GTRþR) with across sites rate heterogeneity modeled via FreeRate model (Soubrier et al. 2012), which allows site rates to vary freely and allows automatic determination of the number of rate categories. The best model assumed five rate categories. Model trees and substitution model parameters for generation of replicates under the optimal GTRþR model were obtained from 15 IQ-TREE (Nguyen et al. 2015) searches performed based on the observed alignment iteratively specifying each of the above mentioned 15 topological constraints.
Multiple sequence alignments were simulated using INDELIBLE (Fletcher and Yang 2009) for all the models, except those that used a GTRþR substitution model component, for which I used Seq-Gen (Rambaut and Grassly 1997). All INDELIBLE configuration files used for replicate generation and Seq-Gen command lines used to generate the GTRþR model-based replicate distributions are available as Supplementary Material online.
The simulations under each of GTRþIþG 4 , GTR, HKYþIþG 4 , HKYþG 4 , HKY, K80þIþG 4 , K80þG 4 , and K80-based evolutionary models (substitution modelþtree combinations, 120 in total) were conducted to sample sets of 500 parametric replicates, each 21,300 pos. long. The replicate sets were chosen to represent 120 simulation evolutionary models (SMs) in follow up experiments. For each GTRþR and GTRþG 4 -based evolutionary model (30 in total), I simulated 1,000 replicates, each 21,300 pos. long. Each of the 30 resulting replicate files was then partitioned to obtain a partition A with the first 500 replicates sampled and a partition B containing the rest of replicates. The A partitions were chosen to represent 30 "empirical" models (EMs) and the B partitions were chosen to represent other 30 simulation evolutionary models (SMs) in follow up experiments.
The discriminatory power of the GC test (Goldman 1993) has also been assessed. The GC metrics have been used to compare GTRþG 4 -based EMs to GTRþG 4 , GTRþIþG 4 , GTR, HKYþIþG 4 , HKYþG 4 , HKY, K80þIþG 4 , K80þG 4 , and K80based SMs. Based on lack of resolution (table 4) observed in these experiments, comparisons involving other models have not been conducted. In order to obtain model-based likelihood values, RAxML searches were run for replicates generated under the above models as described above, specifying the general definition of the substitution model scheme (e.g., GTRþG 4 ) and the correct full topological constraint which was used to create the corresponding replicate. The corresponding RAxML options led to adjustment of branch lengths and model numerical parameters during the program run.
To obtain the GC test statistics (d ¼ Ln(multinomial)ÀLn(model)), the model-based likelihood values obtained have been compared with the corresponding unconstrained likelihood values which were calculated for the replicates. The GC test statistics for each EM-based replicate was compared with the SM-based distributions of the corresponding values to compute the absolute values of the Zscores (numbers of standard deviations between the d value for a EM replicate and the mean d value for each SM-based replicate distribution).
The mean Z-scores over 500 individual tests for each EM-SM pair of evolutionary models were compared and the MS values for the resolution of the preferred models were calculated as described above. The MS values and the mean Zscores were used to assess if the specification of the correct tree topology led to the best estimates of model fit within the 15-component SM models sets, each characterized by the same substitution model. The results have been summarized in table 4.

Comparison of Individual Test Values
Computation of the final test statistics (eq. 16), computation of all the MS values and the mean test values helping to summarize the discriminating ability of the tests compared here have been conducted with the help of test_stage2.pl script (available at: https://github.com/vadimgoremykin/absolute-model-data-fit).

An Empirical Example
Empirical tests have been conducted based on a subset of a concatenated alignment of mammal nuclear genes presented by Song et al. (2012). Although large (1,297,456 aligned columns), the alignment presented by the authors (available from the DataDryad database) has only 10,042 columns which do not contain missing or ambiguous characters. In an attempt to increase the length of the alignment subset Table 3 An Example of Ability of the Presented Test to Identify Preferred Models a in the Presence and Absence of Model Misspecification as a Function of P-Factor Value Size Preferred simulation models in this comparisons assume the correct tree topology of the "empirical" model (GTRþR constrained with the full topological constraint congruent to Tree 1 in supplementary fig. S1, Supplementary Material online) and substitution model components shown above the table. The values shown are the percentages of times when a preferred SM showed better fit to the each of 500 replicates representing "empirical" model in comparison to a distinct SM assuming the same substitution model scheme as the preferred model. Shown are the worst (lowest) MS values for the separation of a preferred SM from any other SM which assumes the same substitution model scheme.
which can be used by the presented test, I have removed five species from the original file and produced a longer alignment partition which contains only DNA alphabet-specific characters. The partition, henceforth referred to as "empirical data set," is available as Supplementary Material online. This data set has 32 OTUs and 29,900 aligned positions. The comparison of the aforementioned replicates to a shorter mitochondrial data set was not conducted because it contained many indels and missing genes/gene regions. The test presented here requires presence of only alphabet-specific characters in each alignment.
The methodological purpose of the experiments presented in this section was to 1) illustrate sampling of conflicting trees necessary to conduct the test, 2) to check for uniformity of the test results under different P-factor values in comparison to an implicit biological model, and 3) to check if the test rejects an obviously wrong evolutionary scenario. An additional goal was to identify the best-fitting model.  The mean model fit estimates between each of the 500 replicates representing an empirical model (EM) and a preferred SM. c The percentages of times when a preferred SM showed better fit to the each of 500 replicates representing EM in comparison to a distinct SM assuming the same substitution model scheme as the preferred model. Shown are the worst (lowest) MS values for the separation of a preferred SM from any other SM which assumes the same substitution model scheme.

(B) MS Values c for the Preferred Models Shown in Subtable
Five cladograms (provided in supplementary fig. S2, Supplementary Material online) used to generate model trees were selected based on the results of unconstrained phylogeny reconstruction experiments performed on the empirical data set. Tree reconstruction experiments assuming a CATþGTR model with rates among sites modeled via a Dirichlet process (henceforth referred to as "CATþGTRþD" model) were performed with the help of Phylobayes v. 3.3 (Lartillot et al. 2009). Maximum likelihood-based tree phylogeny reconstruction experiments were performed using IQ-TREE (v. 1.6, Nguyen et al. 2015). Neighbor-joining trees were built with the help of Seaview alignment editor (Gouy et al. 2010). The cladogram 1 corresponds to the tree obtained under the GTR model with rates across sites heterogeneity modeled via FreeRate model (Soubrier et al. 2012), which could be specified as GTRþR4 in IQ-TREE command line. The above-mentioned FreeRate model assuming four rate classes was selected as the best-fitting under the Bayesian Information Criterion (BIC) by the ModelFinder (Kalyaanamoorthy et al. 2017) pipeline run under specification of empirical data set as input alignment file. The cladogram 2 corresponds to the tree obtained under the GTRþG model. The cladograms 3 and 4 represent alternative consensus topologies recovered from two chains run under CATþGTRþD model by discarding the first 500 cycles as "burn-in"-which was found to be sufficient for both chains-and building consensus trees based on next 1,500 cycles. The cladograms 1-4 differ only in placement of the root of placental mammals and in placement of tree shrew. The tree shrew appears as sister to primates on cladograms 1 and 2 and as sister to the rest of Euarchontoglires, a group which unites tree shrews, primates, lagomorphs, and rodents, on cladograms 3 and 4. The Atlantogenata (a group which includes Xenarthra and Afrotheria) appears as sister to the rest of placental mammals in the cladograms 1 and 3, and Xenartra assumes this position in cladograms 2 and 4.
The cladogram 5 represents the Neighbor-Joining tree topology which was recovered from Jukes and Cantor, K80 and LogDet distances using the Seaview alignment editor. This tree topology was included in analyses to provide an example of an obviously incorrect evolutionary hypothesis assuming rodents as the sister group to the rest of placentals (Churakov et al. 2010;Goremykin et al. 2010). The expectation was that the test would indicate a poor fit for this case.
Model parameters for the replicate generation under the optimal ML substitution model (GTRþR4) were determined in IQ-TREE searches run based on the empirical data set under iterative specification of each of the five above-mentioned topological constraints. Five hundred replicates were generated for each evolutionary model. The model-based replicate distributions obtained were compared with the empirical data set to assess absolute model fit employing P-factor values (eq. 8) 1,000, 10,000, and 100,000.

Identification of the Correct Substitution Models
The ability of the novel test to correctly identify the optimal substitution model among a set of alternatives was assessed, since arguably this is the most common goal of model fitbased investigations in the evolutionary studies. I checked if, under the correct tree topology specification for all evolutionary models compared, the SM showing the best fit to EM-based replicates would be the one that shares the same substitution model scheme with the EM. A less obvious hypothesis was also tested. This was that, within a set of SMs assuming different substitution model components and the same wrong specification of a model tree topology, the SM assuming the same substitution model scheme as the EM would show the best fit. The series of experiments performed to check all above hypotheses involved comparisons of a set of 10 SMs having 10 different substitution model components, all iteratively assuming model tree topologies 1 to 15 to each of 30 EMs.
In all these experiments, which have been conducted by comparing the mean values over 500 estimations of modeldata fit of each SM (represented by a distribution of 500 replicates) to each of 500 EM-based replicates, the test presented here identified the SMs sharing the same substitution model with EMs as the best fitting model without a single exception.
These preliminary observations indicated that the novel test employed in this study was able to register an improvement in model-data fit due to specification of the correct substitution model component of the full evolutionary models under the selected experimental setup. Having checked this aspect of performance of the test, I proceeded to determine whether the test had sufficient discriminatory power to show improvement in model-data fit for the correct model tree topology.

Identification of the Correct Full Evolutionary Models
Simulated data sets for 150 full evolutionary models, including 10 distinct substitution model schemes, were iteratively compared with each of the 30 EMs. In all cases, the lowest mean test scores-estimated over 500 comparisons of each SM to each of 500 replicates simulated under each EM-identified the correct SMs without error. Had the test results been random, the probability of obtaining such results (30 correct identifications, each time out of 150 models) would have been 1/150 30 ¼ 5.2Â10 À66 .
MS values were calculated that indicate the percentage of times when a correct SM, sharing the same tree and substitution model with EM, showed better fit to each of 500 replicates in comparison to any other distinct SM. The lowest MS values for each EM obtained in these experiments (table 1) ranged from 99% to 85% with the mean value 92%. Thus, the proposed test metric was also able to register the positive influence of the correct tree topology and of the correct substitution model scheme onto model-data fit estimates.

Identification of the Correct Tree Topology
Each of the 30 EMs was iteratively compared with 10 SM sets (300 comparisons in total). Each SM set included 15 full evolutionary models sharing a distinct (correct or misspecified) substitution model scheme. In all 300 experiments, the lowest mean test scores-estimated over 500 comparisons of each SM to each of 500 replicates simulated under each EMidentified the SMs assuming correct tree topologies.
The MS values indicating the percentage of times when each SM having correct or wrong substitution model component which assumed the correct tree topology (a "preferred SM") showed better fit to each of 500 replicates simulated under a distinct EM in comparison to any other SM assuming the same substitution model scheme as the preferred SM have been also computed in these experiments. The results of these comparisons are presented in table 2. The MS values in the table 2 show the worst measure of separation between each preferred SM and any other SM which has the same substitution model component and a wrong tree component registered in these experiments.
The above results indicate that the proposed test run employing default P-factor value of 10,000 was able to detect an improvement in the model-data fit due to a correct tree topology specification. This was indicated by the MS values for each preferred model (shown in table 2) regardless of the substitution model misspecification for all the models used in the experiments. The chance of correct model tree component identification for the models showing the highest level of misfit was not much different compared with the models exhibiting no substitution model misspecification (table 2). The chance of identifying the correct model tree component estimated at P-factor values set to 10,000 and 100,000 for a subset of models were identical and very similar to the results obtained employing a P-factor value ¼ 1,000 (table 3).
Comparisons were also made using GC test metric. In these experiments, the MS values for the preferred models obtained were generally low, and in many cases produced the lowest possible value (zero) (table 4). This indicates that the discriminatory power of the GC test was not sufficient to reliably reveal the improvement in the model-data fit due to the choice of the correct model tree topology, even in the case in which the SM substitution model components were correctly specified.

Testing Evolutionary Relationships among Mammals
The test identified as best-fitting to the empirical data set the evolutionary model assuming the tree topology presented in supplementary figure S2 (Supplementary Material online) as cladogram 3. The topology corresponds to a tree (shown in fig. 4) recovered under a CATþGTRþD model using Phylobayes v. 3.3 program (Lartillot et al. 2009).
The tree topology supports the clade subtending Atlantogenata as sister to the rest of the placental mammals and the clade subtending tree shrew as sister to the rest of the Euarchontoglires. This result was obtained with P-factor values (eq. 8) ranging in size by two orders of magnitude: 1,000, 10,000, and 100,000. The ranking of evolutionary models under the test metric using the above P-factor values was identical (in descending order of fit estimated for the evolutionary models assuming full topological constraints shown in supplementary fig. S2: 3, 4, 1, 2, 5, Supplementary Material online). The model-data fit estimates (table 5) for each evolutionary model obtained employing the above P-factor values were very similar.
It should be noted that the worst assessment of modeldata fit was registered for the evolutionary model which assumed model tree topology 5 supporting rodents as sisters to the rest of the placental mammals. The placement of rodents at the base of the eutherian subtree is considered a classical example of LBA (Churakov et al. 2010;Goremykin et al. 2010).

Discussion
Using data to test a model which is assumed to describe its generation is an important principle in statistical analysis. Absolute model-data fit assessment showing how well simulated data fit the observed can be used to quantify model strengths and weaknesses and can be used as a guide for model improvement. However, absolute tests of data model fit are rarely used in the field of phylogenetics. This is arguably so because the task of identification of the correct evolutionary tree, which is fundamental for the discipline, is reportedly difficult to accomplish using previous model-data fit indicators. Concerns about their discriminatory power were raised more than a decade ago (Foster 2004;Waddell et al. 2009;Ripplinger and Sullivan 2010) yet no method of model-data fit assessment, able to reliably discriminate phylogenetic hypotheses assuming different trees, has been forthcoming.
The discriminatory power of model-data fit indicators strongly depends on which aspects of the data are used to compare models. Here, I tested how well test statistics derived from distributions of characters in observed and simulated data were able to detect phylogenetic signal due to treelike divergence of sequences.
In the experiments performed here, the estimator based on the multinomial likelihood failed to reliably identify the correct model trees under the easiest conditions (assuming availability of the correct empirical model and no substitution model scheme misspecification). The evolutionary models selected by the GC test as the most adequate included wrong model trees as their components in the vast majority of analyses (table 4). These observations indicate that multinomial likelihood-based statistic derived from the frequencies of site patterns is poorly suited to detect tree-based phylogenetic signal in the multiple sequence alignments. Since the phylogenetic information contained in the patterns themselves is irrelevant for multinomial likelihood inference, the GC test statistic does not capture the data properties which are relevant for tree building.
The main advantage of the novel test of model-data fit proposed here over the GC test is its accuracy. However, for its successful application certain conditions should be fulfilled. The test presented here was designed to enable discrimination of the distinct, fully specified evolutionary models. Each such model should assume a fixed parameter combination and a fully resolved tree with fixed branch lengths. In order to allow for meaningful model comparison among the evolutionary hypotheses, all model parameters should assume values registered at the ML optimum under each evolutionary model specification. Also, the test accuracy observed in comparison of explicitly postulated models was partially due to the fact that the multiple sequence alignments compared (represented by replicates) had no alignment errors. When the observed alignment contains character states not predicted by DNA substitution models, these must be removed before the method can be used for model evaluation. In this case, a necessary data preparation step might include, in any combination, removal of corresponding sequences and/or removal of sites. The preferred trimming scheme should be selected by an analyst to fit the purpose of the study prior to analysis.
Under such conditions the novel estimator has shown an ability to identify the correct evolutionary models. The probability that this result can be explained by a tolerance of the experimental setup to a random error was small (5.2Â10 À66 ).
Under the conditions studied, this test statistic identified the correct model tree in cases where the substitution model was correctly specified and also misspecified. The ability to identify the true tree topology within broad margins of model  NOTE.-Each test value shown in the table was obtained for the evolutionary model assuming model tree topology shown in the leftmost column employing Pfactor value (eq. 8) shown above the corresponding column of values. Tree topologies of five full topological constraints used in these experiments (as shown in the leftmost column) are presented in supplementary figure S2, Supplementary Material online. misspecification observed here is encouraging that correct identification of the best-fitting evolutionary model is possible when the correct model is unknown (as might be the case with biological data). Some degree of model misfit is unavoidable when substitution process in biological data is modeled.
Nevertheless the robustness of the test's results in the presence of model misspecification does not remove the requirement for realistic SM substitution models in the analyses of the biological data. When EM is unknown or implicit the tree representing the preferred evolutionary hypothesis should be selected as a component of the best-fitting SM.
The discriminatory power of the test reported here encourages the application of the novel test for hypothesis testing in phylogenetics. Given the rarity with which assessment of absolute model-data fit are currently employed in the field, its introduction into the mainstream phylogenetic practice would help to diminish the gap between a large number of phylogenetic relationships claimed to be correctly resolved and the ancillary indicators (bootstrap support, posterior probability support, and recovery of congruent trees with different methods) often used to corroborate such claims. I hope that the observations made in this study can help to develop more realistic substitution models and to help with efforts to test and more reliably reconstruct the Tree of Life.
Concluding, I would like to mention that the results obtained here in analyses involving empirical data support Atlantogenata as the basal-most clade among placental mammals and sister group relationship between tree shrews (Tupaia) and the rest of Euarchontoglires. The former phylogenetic relationship has often been recovered in recent studies (e.g., Song et al. 2012;Morgan et al. 2013;Tarver et al. 2016). The latter relationship has been recently recovered by Tarver et al. (2016) based on a data set comprising microRNA genes and in coalescent analyses presented by Mason et al. (2016).

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.