Alignment-Integrated Reconstruction of Ancestral Sequences Improves Accuracy

Abstract
Ancestral sequence reconstruction (ASR) uses an alignment of extant protein sequences, a phylogeny describing the history of the protein family and a model of the molecular-evolutionary process to infer the sequences of ancient proteins, allowing researchers to directly investigate the impact of sequence evolution on protein structure and function. Like all statistical inferences, ASR can be sensitive to violations of its underlying assumptions. Previous studies have shown that, whereas phylogenetic uncertainty has only a very weak impact on ASR accuracy, uncertainty in the protein sequence alignment can more strongly affect inferred ancestral sequences. Here, we show that errors in sequence alignment can produce errors in ASR across a range of realistic and simplified evolutionary scenarios. Importantly, sequence reconstruction errors can lead to errors in estimates of structural and functional properties of ancestral proteins, potentially undermining the reliability of analyses relying on ASR. We introduce an alignment-integrated ASR approach that combines information from many different sequence alignments. We show that integrating alignment uncertainty improves ASR accuracy and the accuracy of downstream structural and functional inferences, often performing as well as highly accurate structure-guided alignment. Given the growing evidence that sequence alignment errors can impact the reliability of ASR studies, we recommend that future studies incorporate approaches to mitigate the impact of alignment uncertainty. Probabilistic modeling of insertion and deletion events has the potential to radically improve ASR accuracy when the model reflects the true underlying evolutionary history, but further studies are required to thoroughly evaluate the reliability of these approaches under realistic conditions.

Table S1. Empirical protein domain family phylogenies used to simulate sequence data.
We simulated protein sequence data along empirical phylogenies with the indicated properties. Tree depth indicates the number of nodes on the longest path from the root to any leaf node on the phylogeny. Max distance to root indicates the largest cumulative number of expected substitutions/site along any path from the root to any leaf node. Branch length summary statistics are also indicated.

Table S2. Structure-guided and sequence-alignment methods typically underestimate correct alignment lengths and gap proportions while overestimating the proportions of variable and parsimony-informative sites. We simulated alignments of 5 protein domain families, using empirically-derived simulation conditions, and aligned the resulting replicate sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). For each alignment method and protein domain, we report the mean and standard error (in parentheses, over 10 replicates) in total alignment length (alignment_len), the proportions of gap characters (P(gap)), variable sites (P(variable)) and parsimony-informative sites (P(parsimony)). The number of sequences in each protein domain tree is also reported.

Table S3. Alignment distances differ across alignment methods and protein domain families. We simulated alignments of 5 protein domain families, using empirically-derived simulation conditions, and aligned the resulting replicate sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). For each sequence alignment, we measured the position-wise distance of that alignment to the correct simulated alignment, which estimates the probability of randomly selecting an incorrectly-aligned residue from the inferred alignment (see Methods). We estimated the distribution of alignment distances across replicate simulations using kernel density estimation. We report the mean (and standard error), median and mode of each alignment-distance distribution.

Table S4.
Ancestral sequence reconstruction (ASR) error rates are low when the correct alignment is known in advance, higher when the alignment is inferred using sequence-alignment methods, and intermediate when structure-guided alignments or alignment-integrated ASR is used. We simulated extant and ancestral sequences for 5 protein domain families, using empirically-derived conditions, and aligned the resulting extant sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at each node on the phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequences to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). ASR errors were divided into 4 "errorType" categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has

Table S5. Total ancestral sequence reconstruction (ASR) error rates are low when the correct alignment is known in advance, higher when the alignment is inferred using sequence-alignment methods, and intermediate when structure-guided alignments or alignment-integrated ASR is used. We simulated extant and ancestral sequences for 5 protein domain families, using empirically-derived conditions, and aligned the resulting extant sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at each node on the phylogeny.
Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequences to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). ASR errors were divided into 4 "errorType" categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences. Error rate distributions were inferred by kernel density estimation. For each protein domain family, we report the mean (and standard error), median and mode of the total ASR-error distribution.

Table S6. Ancestral sequence reconstruction (ASR) residue-reconstruction error rates are low when the correct alignment is known in advance, higher when the alignment is inferred using sequence-alignment methods, and intermediate when structure-guided alignments or alignment-integrated ASR is used. We simulated extant and ancestral sequences for 5 protein domain families, using empirically-derived conditions, and aligned the resulting extant sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at each node on the phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods).
In each case, we compared the inferred ancestral sequences to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). ASR errors were divided into 4 "errorType" categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences. Error rate distributions were inferred by kernel density estimation. For each protein domain family, we report the mean (and standard error), median and mode of the residue ASR-error distribution.

Table S7. Ancestral sequence reconstruction (ASR) insertion error rates are low when the correct alignment is known in advance, higher when the alignment is inferred using sequence-alignment methods, and intermediate when structure-guided alignments or alignment-integrated ASR is used. We simulated extant and ancestral sequences for 5 protein domain families, using empirically-derived conditions, and aligned the resulting extant sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at each node on the phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods).
In each case, we compared the inferred ancestral sequences to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). ASR errors were divided into 4 "errorType" categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences. Error rate distributions were inferred by kernel density estimation. For each protein domain family, we report the mean (and standard error), median and mode of the insertion ASR-error distribution.

Table S8. Ancestral sequence reconstruction (ASR) deletion error rates are low when the correct alignment is known in advance, higher when the alignment is inferred using sequence-alignment methods, and intermediate when structure-guided alignments or alignment-integrated ASR is used. We simulated extant and ancestral sequences for 5 protein domain families, using empirically-derived conditions, and aligned the resulting extant sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at each node on the phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods).
In each case, we compared the inferred ancestral sequences to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). ASR errors were divided into 4 "errorType" categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences. Error rate distributions were inferred by kernel density estimation. For each protein domain family, we report the mean (and standard error), median and mode of the deletion ASR-error distribution.

Table S9. Support for erroneously-inferred ancestral states is reduced by alignment-integrated ancestral sequence reconstruction. We simulated extant and ancestral sequences for 5 protein domain families, using empirically-derived conditions, and aligned the resulting extant sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at each node on the phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequences to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods).
ASR errors were divided into 4 "errorType" categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. The posterior probability (PP) of each erroneously-inferred ancestral state was determined using an empirical-Bayesian approach (see Methods). The distribution of posterior probabilities for erroneously-inferred ancestral states was calculated by kernel density estimation. We report the mean (and standard error), median and mode of the posterior-probability distribution for each type of ASR error.

Table S10. Support for the correct ancestral state is increased by alignment-integrated ancestral sequence reconstruction when an incorrect ancestral state is inferred. We simulated extant and ancestral sequences for 5 protein domain families, using empirically-derived conditions, and aligned the resulting extant sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at each node on the phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequences to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods).
ASR errors were divided into 4 "errorType" categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. For each error type, when an erroneous ancestral state of that error type was inferred as the maximum-likelihood state, we calculated the posterior probability (PP) of the correct state using an empirical-Bayesian approach (see Methods). The distribution of posterior probabilities for the correct ancestral state, when ASR errors of each type were made, was inferred by kernel density estimation. We report the mean (and standard error), median and mode of the posterior-probability distribution of correct ancestral states for each type of ASR error.

Table S11. Support for the correct ancestral state is reduced by alignment-integrated ancestral sequence reconstruction when the correct state is inferred by maximum likelihood. We simulated extant and ancestral sequences for 5 protein domain families, using empirically-derived conditions, and aligned the resulting extant sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at each node on the phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequences to the correct simulated sequence to identify ancestral sequence reconstruction (ASR) errors (see Methods).
When the correct ancestral state was inferred as the maximum-likelihood state, we calculated the posterior probability (PP) of the correct state using an empirical-Bayesian approach (see Methods). The distribution of posterior probabilities for the correct ancestral state, when no ASR error was made, was inferred by kernel density estimation. We report the mean (and standard error), median and mode of the posterior-probability distribution of correct ancestral states for each type of ancestral state (residues, gaps).

Table S12. Structure-guided and alignment-integrated methods reduce errors in estimated structural and functional properties of ancestral sequences, compared to sequence-alignment methods. We simulated replicate extant and ancestral sequence data sets by evolving the DSRM1 protein domain family along its empirically-determined phylogeny, using a structure-guided alignment of empirical DSRM1 sequences to determine the amino-acid composition and pattern of insertions/deletions (see Methods). Structural and functional properties were unconstrained during the simulations. We modeled the structure of each simulated ancestral sequence and estimated its structural stability (ΔG) and dsRNA-binding affinity (pKd) using computational approaches (see Methods). For each alignment method, we inferred the maximum-likelihood ancestral sequence at each node on the phylogeny, modeled the structure of the inferred ancestral sequence at each node, and estimated structural stability and binding affinity from the modeled structure. Structural stability and binding affinity errors were calculated as the absolute value of the difference between the value calculated from the correct simulated ancestral sequence and that calculated from the inferred ancestral sequence. The distributions of structural-stability and binding-affinity errors were calculated by kernel density estimation. We report the mean (and standard error), median and mode of each error distribution.

Table S13.
Errors in estimated structural stability and binding affinity of ancestral sequences are weakly correlated with errors in the inferred ancestral sequences. We simulated replicate extant and ancestral sequence data sets by evolving the DSRM1 protein domain family along its empirically-determined phylogeny, using a structure-guided alignment of empirical DSRM1 sequences to determine the amino-acid composition and pattern of insertions/deletions (see Methods). Structural and functional properties were unconstrained during the simulations. We modeled the structure of each simulated ancestral sequence and estimated its structural stability (ΔG) and dsRNA-binding affinity (pKd) using computational approaches (see Methods). For each alignment method, we inferred the maximum-likelihood ancestral sequence at each node on the phylogeny, modeled the structure of the inferred ancestral sequence at each node, and estimated structural stability and binding affinity from the modeled structure. Structural stability and binding affinity errors were calculated as the absolute value of the difference between the value calculated from the correct simulated ancestral sequence and that calculated from the inferred ancestral sequence. ASR errors were divided into 4 "errorType" categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences.
For each type of ASR error and alignment method, we determined the best-fit linear relationship between ASR error rate and structural stability (left) or binding affinity (right) using ordinary least squares linear regression. We report the slope (and standard error), intercept and correlation (r²) of each linear regression.

Table S14. Ancestral sequence reconstruction error rates were positively correlated with increasing branch lengths and weakly correlated with increasing insertion-deletion rates in 3-taxon simulations. We simulated protein sequences along a 3-taxon phylogeny with equal branch lengths and the same insertion-deletion (indel) rate across the phylogeny (see Methods, Supplementary Fig. S12). The resulting extant sequence data was aligned using 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at the single node on the 3-taxon phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequence to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). Sequence-wide error rates (errors/site) were computed by dividing the total number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences. Top table reports results from linear regression of indel rate vs ASR error rate, for each branch length and alignment method. Bottom table reports results from linear regression of branch length vs ASR error rate, for each indel rate and alignment method.
In each case, we report the slope (and standard error) of the best-fit regression line, the p-value obtained by testing if the slope is significantly different from zero, the intercept of the best-fit regression line, and the correlation (r²) between the predictor (branch length or indel rate) and response (ASR error rate) variables.

Figure S1 Part 1, CARD domain. Empirical protein domain family phylogenies used to simulate sequence data. We simulated protein sequence data along the indicated empirical phylogenies. Branch lengths are scaled to the expected number of substitutions/site. Part 1 shows the empirical phylogeny of the CARD domain family; other domain family phylogenies are shown in parts 2-5, below.
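
The two tree summaries reported for these phylogenies in the supplementary tables, tree depth and maximum distance to root, can be illustrated with a minimal sketch. The `Node` class, the toy tree and its branch lengths are hypothetical and are not taken from the study. Counting both the root and the leaf as nodes on the path is an assumption about the counting convention.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node; branch_length is the expected number of
    substitutions/site on the branch leading to this node."""
    name: str
    branch_length: float = 0.0
    children: List["Node"] = field(default_factory=list)


def tree_depth(node: Node) -> int:
    """Number of nodes on the longest root-to-leaf path.

    Assumption: both the root and the leaf are counted.
    """
    if not node.children:
        return 1
    return 1 + max(tree_depth(child) for child in node.children)


def max_distance_to_root(node: Node) -> float:
    """Largest cumulative branch length along any root-to-leaf path."""
    if not node.children:
        return 0.0
    return max(
        child.branch_length + max_distance_to_root(child)
        for child in node.children
    )


# Toy tree (illustrative only): root -> A, root -> internal -> (B, C)
toy_tree = Node("root", children=[
    Node("A", 0.1),
    Node("internal", 0.2, children=[Node("B", 0.3), Node("C", 0.05)]),
])

depth = tree_depth(toy_tree)
max_dist = max_distance_to_root(toy_tree)
print(depth, max_dist)
```

For real phylogenies, a tree parsed from Newick format by an existing phylogenetics library could be traversed in the same way.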

Figure S1
Part 2, DSRM1 domain.

Figure S1 Part 3, DSRM2 domain.

Figure S1 Part 4, DSRM3 domain.

Figure S1 Part 5, RD domain.

Figure S2. Structure-guided and sequence-alignment methods typically underestimate correct alignment lengths and gap proportions while overestimating the proportions of variable and parsimony-informative sites. We simulated alignments of 5 protein domain families, using empirically-derived simulation conditions, and aligned the resulting replicate sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). Colors indicate the proportion of the correct-alignment's value by which the inferred value deviates from the correct-alignment's value, with negative numbers (blue) indicating underestimation of the correct-alignment's value, and positive numbers (red) indicating overestimation of the correct-alignment's value. A value of 1 (-1) indicates over- (under-) estimation of the correct value by 50%.

Figure S3 Part 1, CARD Domain. Alignment distance distributions differ across alignment methods and protein domain families. We simulated alignments of 5 protein domain families, using empirically-derived simulation conditions, and aligned the resulting replicate sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). For each sequence alignment, we measured the position-wise distance of that alignment to the correct simulated alignment, which estimates the probability of randomly selecting an incorrectly-aligned residue from the alignment (see Methods). We estimated the distribution of alignment distances across replicate simulations using kernel density estimation. Part 1 shows results for the CARD domain family; results for other domain families are shown in parts 2-5, below.

Figure S4 Part 1, CARD Domain.
Total ancestral sequence reconstruction (ASR) error rates are low when the correct alignment is known in advance, higher when the alignment is inferred using sequence-alignment methods, and intermediate when structure-guided alignments or alignment-integrated ASR is used. We simulated extant and ancestral sequences for 5 protein domain families, using empirically-derived conditions, and aligned the resulting extant sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at each node on the phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequences to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). ASR errors were divided into 4 "errorType" categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences. Error rate distributions were inferred by kernel density estimation. Here we report the total ASR-error distributions. Part 1 shows results for the CARD domain family; results for other domains are shown in parts 2-5 below.
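
The four error categories and the sequence-wide error rate described in this caption can be made concrete with a short sketch. It assumes the correct and inferred ancestral sequences have already been pairwise-aligned to equal length, that `-` is the gap symbol, and that columns in which both sequences have a gap do not count toward the pairwise alignment length. The function name and the example sequences are hypothetical and are not taken from the study.

```python
GAP = "-"  # assumed gap symbol


def classify_asr_errors(correct: str, inferred: str) -> dict:
    """Count residue, insertion, deletion and total ASR errors between a
    correct and an inferred ancestral sequence (already pairwise-aligned),
    and convert the counts to errors/site."""
    if len(correct) != len(inferred):
        raise ValueError("sequences must be pairwise-aligned to equal length")

    counts = {"residue": 0, "insertion": 0, "deletion": 0}
    n_sites = 0
    for true_state, inferred_state in zip(correct, inferred):
        if true_state == GAP and inferred_state == GAP:
            # Assumption: gap-only columns are not part of the pairwise alignment.
            continue
        n_sites += 1
        if true_state == GAP:
            counts["insertion"] += 1   # residue inferred where a gap is correct
        elif inferred_state == GAP:
            counts["deletion"] += 1    # gap inferred where a residue is correct
        elif true_state != inferred_state:
            counts["residue"] += 1     # both are residues, but they differ

    counts["total"] = counts["residue"] + counts["insertion"] + counts["deletion"]
    rates = {k: (v / n_sites if n_sites else 0.0) for k, v in counts.items()}
    return {"counts": counts, "n_sites": n_sites, "rates": rates}


# Toy example (illustrative sequences only)
result = classify_asr_errors(correct="MK-LVA-", inferred="MR-L-AG")
print(result["counts"], result["rates"]["total"])
```

In the toy example, six columns contribute to the pairwise alignment length, with one error of each type, giving a total error rate of 0.5 errors/site.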

Figure S5
Part 1, CARD Domain. Ancestral sequence reconstruction (ASR) residue-reconstruction error rates are low when the correct alignment is known in advance, higher when the alignment is inferred using sequence-alignment methods, and intermediate when structure-guided alignments or alignment-integrated ASR is used. We simulated extant and ancestral sequences for 5 protein domain families, using empirically-derived conditions, and aligned the resulting extant sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at each node on the phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequences to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). ASR errors were divided into 4 "errorType" categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences. Error rate distributions were inferred by kernel density estimation. Here we report the ASR-error distributions for residue errors. Part 1 shows results for the CARD domain family; results for other domains are shown in parts 2-5 below.
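
The error-rate distributions in these figures are inferred by kernel density estimation, and the accompanying tables summarize them by mean, standard error, median and mode. The source does not state the kernel or bandwidth used, so the sketch below assumes a Gaussian kernel with a normal-reference rule-of-thumb bandwidth, and takes the mode as the grid point with the highest estimated density. The function names and the sample values are hypothetical and are not data from the study.

```python
import math
import statistics


def gaussian_kde(samples, bandwidth=None):
    """Return a density function estimated with a Gaussian kernel.

    Assumption: when no bandwidth is given, use the normal-reference
    rule of thumb, 1.06 * sd * n**(-1/5).
    """
    n = len(samples)
    if bandwidth is None:
        sd = statistics.stdev(samples)
        if sd == 0:
            raise ValueError("samples have zero variance; supply a bandwidth")
        bandwidth = 1.06 * sd * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def summarize_distribution(samples, grid_points=512):
    """Mean, standard error, median and KDE-based mode of a sample."""
    density = gaussian_kde(samples)
    lo, hi = min(samples), max(samples)
    step = (hi - lo) / (grid_points - 1)
    grid = [lo + k * step for k in range(grid_points)]
    return {
        "mean": statistics.mean(samples),
        "se": statistics.stdev(samples) / math.sqrt(len(samples)),
        "median": statistics.median(samples),
        "mode": max(grid, key=density),  # grid point with highest density
    }


# Illustrative error rates (errors/site), not values from the study
error_rates = [0.02, 0.03, 0.03, 0.04, 0.05, 0.03, 0.10]
summary = summarize_distribution(error_rates)
print(summary)
```

With a skewed sample like this one, the mode and median sit below the mean, which is one reason to report all three alongside the standard error.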

Figure S6
Part 1, CARD Domain. Ancestral sequence reconstruction (ASR) insertion error rates are low when the correct alignment is known in advance, higher when the alignment is inferred using sequence-alignment methods, and intermediate when structure-guided alignments or alignment-integrated ASR is used. We simulated extant and ancestral sequences for 5 protein domain families, using empirically-derived conditions, and aligned the resulting extant sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at each node on the phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequences to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). ASR errors were divided into 4 "errorType" categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences. Error rate distributions were inferred by kernel density estimation. Here we report the ASR-error distributions for insertion errors. Part 1 shows results for the CARD domain family; results for other domains are shown in parts 2-5 below.

Figure S7
Part 1, CARD Domain. Ancestral sequence reconstruction (ASR) deletion error rates are low when the correct alignment is known in advance, higher when the alignment is inferred using sequence-alignment methods, and intermediate when structure-guided alignments or alignment-integrated ASR is used. We simulated extant and ancestral sequences for 5 protein domain families, using empirically-derived conditions, and aligned the resulting extant sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at each node on the phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequences to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). ASR errors were divided into 4 "errorType" categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences. Error rate distributions were inferred by kernel density estimation. Here we report the ASR-error distributions for deletion errors. Part 1 shows results for the CARD domain family; results for other domains are shown in parts 2-5 below.

Figure S8.
Support for the correct ancestral state is reduced by alignment-integrated ancestral sequence reconstruction when the correct state is inferred by maximum likelihood. We simulated extant and ancestral sequences for 5 protein domain families, using empirically-derived conditions, and aligned the resulting extant sequence data using structure-guided and 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at each node on the phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequences to the correct simulated sequence to identify ancestral sequence reconstruction (ASR) errors (see Methods). When the correct ancestral state was inferred as the maximum-likelihood state, we calculated the posterior probability (PP) of the correct state using an empirical-Bayesian approach (see Methods). The distribution of posterior probabilities for the correct ancestral state, when no ASR error was made, was inferred by kernel density estimation. We report kernel density distributions for correctly-inferred ancestral gap states, residues and total (gaps + residues).

Figure S9. Alignment-integrated and structure-guided approaches produce less error in inferred structural and functional properties of ancestral proteins than single sequence-alignment methods. We simulated replicate extant and ancestral sequences by evolving an RNA-binding protein domain along its empirically-determined phylogeny, using a structure-guided alignment to determine the amino-acid composition and pattern of insertions/deletions. Ancestral sequences were inferred using structure-guided alignment, 7 different sequence-alignment methods and alignment-integration. We modeled the structure of each ancestral sequence and estimated its dsRNA binding affinity (pKd) using a computational approach.
Errors in binding affinity were calculated by comparing values estimated from the correct ancestral sequences to those estimated using each alignment method. We used kernel density estimation to calculate the frequency distribution of binding-affinity errors.

Figure S10. Simulated ancestral DSRM1 protein domains generated a range of values for structural stability and dsRNA-binding affinity. We simulated replicate extant and ancestral sequence data sets by evolving the DSRM1 protein domain family along its empirically-determined phylogeny, using a structure-guided alignment of empirical DSRM1 sequences to determine the amino-acid composition and pattern of insertions/deletions (see Methods). Structural and functional properties were unconstrained during the simulations. We modeled the structure of each simulated ancestral sequence and estimated its structural stability (ΔG) and dsRNA-binding affinity (pKd) using computational approaches (see Methods). The distributions of simulated structural stability (left) and binding affinity (right) values were inferred using kernel density estimation. Table inset reports the mean (and standard error), median, mode, standard deviation and absolute-maximum of each distribution.

Figure S11. Errors in estimated structural stability and binding affinity of ancestral sequences are weakly correlated with errors in the inferred ancestral sequences. We simulated replicate extant and ancestral sequence data sets by evolving the DSRM1 protein domain family along its empirically-determined phylogeny, using a structure-guided alignment of empirical DSRM1 sequences to determine the amino-acid composition and pattern of insertions/deletions (see Methods). Structural and functional properties were unconstrained during the simulations. We modeled the structure of each simulated ancestral sequence and estimated its structural stability (ΔG) and dsRNA-binding affinity (pKd) using computational approaches (see Methods).
For each alignment method, we inferred the maximum-likelihood ancestral sequence at each node on the phylogeny, modeled the structure of the inferred ancestral sequence at each node, and estimated structural stability and binding affinity from the modeled structure. Structural stability and binding affinity errors were calculated as the absolute value of the difference between the value calculated from the correct simulated ancestral sequence and that calculated from the inferred ancestral sequence. ASR errors were divided into 4 "errorType" categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences. For each type of ASR error and alignment method, we determined the best-fit linear relationship between ASR error rate and structural stability (left column) or binding affinity (right column) using ordinary least squares linear regression (dotted lines). Results for total ASR error rates are reported in the main text.

Figure S12. A minimal 3-taxon simulation provides a model system for investigating ancestral sequence reconstruction errors. We simulated protein sequences along a 3-taxon phylogeny with equal branch lengths (b substitutions/site) and the same insertion-deletion (indel) rate across the phylogeny (r indels/substitution/site). Red dot indicates the single ancestral node, and blue bars indicate possible aligned extant sequences generated by the simulation process.
While large-scale simulations are useful for evaluating the potential impact of alignment errors on ancestral sequence reconstruction studies in practice, it is difficult to systematically isolate the relevant variables necessary to determine why alignment errors generate ASR errors from such large-scale systems. Small model problems are needed to reliably infer causative mechanisms. In phylogenetic inference, the '4-taxon problem' represents the smallest possible tree with multiple unrooted topologies, and studies of this simple 'toy' problem have led to major advancements in our understanding of the causes of phylogenetic inference errors and their potential solutions.
Here we propose a '3-taxon problem' as the equivalent minimalist system for examining ancestral sequence reconstruction accuracy. For each node on the phylogeny, the most commonly-used marginal ASR algorithms use information from the 3 directly-connected nodes (typically 2 descendant nodes, e.g., A and B, and one node representing the local 'root', e.g., C) to infer the ancestral state probability distribution. The simple 3-taxon problem therefore allows us to learn potentially generalizable information about the factors impacting ASR accuracy while keeping the number of simulation variables small.

Figure S13. Alignment-integrated ancestral sequence reconstruction reduces total ASR error rates in 3-taxon simulations across a variety of conditions. We simulated protein sequences along a 3-taxon phylogeny with equal branch lengths and the same insertion-deletion (indel) rate across the phylogeny (see Methods, Supplementary Fig. S12). The resulting extant sequence data was aligned using 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at the single node on the 3-taxon phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequence to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). ASR errors were divided into 4 categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors.
Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences. For each combination of branch-length and indel-rate, we plot the average rate of total ASR errors over 100 replicate data sets for each alignment method, with purple indicating low ASR error rates, and yellow indicating high rates of ASR errors.

Figure S14. Alignment-integrated ancestral sequence reconstruction reduces residue ASR error rates in 3-taxon simulations across a variety of conditions. We simulated protein sequences along a 3-taxon phylogeny with equal branch lengths and the same insertion-deletion (indel) rate across the phylogeny (see Methods, Supplementary Fig. S12). The resulting extant sequence data was aligned using 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at the single node on the 3-taxon phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequence to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). ASR errors were divided into 4 categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences.
For each combination of branch-length and indel-rate, we plot the average rate of residue ASR errors over 100 replicate data sets for each alignment method, with purple indicating low ASR error rates, and yellow indicating high rates of ASR errors.

Figure S15. Alignment-integrated ancestral sequence reconstruction reduces insertion ASR error rates in 3-taxon simulations across a variety of conditions. We simulated protein sequences along a 3-taxon phylogeny with equal branch lengths and the same insertion-deletion (indel) rate across the phylogeny (see Methods, Supplementary Fig. S12). The resulting extant sequence data was aligned using 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at the single node on the 3-taxon phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequence to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). ASR errors were divided into 4 categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences.
For each combination of branch-length and indel-rate, we plot the average rate of insertion ASR errors over 100 replicate data sets for each alignment method, with purple indicating low ASR error rates, and yellow indicating high rates of ASR errors.

Figure S16. Alignment-integrated ancestral sequence reconstruction reduces deletion ASR error rates in 3-taxon simulations across a variety of conditions. We simulated protein sequences along a 3-taxon phylogeny with equal branch lengths and the same insertion-deletion (indel) rate across the phylogeny (see Methods, Supplementary Fig. S12). The resulting extant sequence data was aligned using 7 different sequence-alignment methods (see Methods). We used each alignment to infer the most likely ancestral sequence at the single node on the 3-taxon phylogeny. Additionally, alignment-integrated ancestral sequences were generated by combining inferences from the 7 sequence-alignment methods (see Methods). In each case, we compared the inferred ancestral sequence to the correct simulated sequence to estimate ancestral sequence reconstruction (ASR) error rates (see Methods). ASR errors were divided into 4 categories: 1) residue errors, in which both correct and inferred ancestral sequences have a residue at a given position in the alignment, but the residues differ; 2) insertion errors, in which the inferred sequence has a residue at a given alignment position, but the correct sequence has a gap state; 3) deletion errors, in which the inferred sequence has a gap, but the correct sequence has a residue; and 4) total errors, which combine residue, insertion and deletion errors. Sequence-wide error rates (errors/site) were computed by dividing the number of errors by the length of the pairwise alignment of the inferred and correct ancestral sequences.
For each combination of branch-length and indel-rate, we plot the average rate of deletion ASR errors over 100 replicate data sets for each alignment method, with purple indicating low ASR error rates, and yellow indicating high rates of ASR errors.
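
The per-cell averages plotted in the 3-taxon heat maps can be sketched as a simple grouping step. The following Python example is illustrative only and is not taken from the study: the function name `mean_error_grid`, the record layout `(branch_length, indel_rate, error_rate)` and the numeric values are all hypothetical. It assumes that one error rate has already been computed for each replicate data set.

```python
from collections import defaultdict
from statistics import mean


def mean_error_grid(records):
    """Average replicate error rates for each (branch length, indel rate) cell.

    `records` is an iterable of (branch_length, indel_rate, error_rate)
    tuples, one per replicate data set (hypothetical layout).
    """
    grouped = defaultdict(list)
    for branch_length, indel_rate, error_rate in records:
        grouped[(branch_length, indel_rate)].append(error_rate)
    # One mean per simulation condition; this is the value a heat-map cell shows
    return {cell: mean(values) for cell, values in grouped.items()}


# Invented replicate results for two simulation conditions
records = [
    (0.1, 0.01, 0.0),
    (0.1, 0.01, 0.2),
    (0.5, 0.05, 0.4),
]
grid = mean_error_grid(records)
print(grid)
```

In the simulations described above, each cell would average 100 replicates per alignment method rather than the one or two values used in this toy input.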