Please Mind the Gap: Indel-Aware Parsimony for Fast and Accurate Ancestral Sequence Reconstruction and Multiple Sequence Alignment Including Long Indels

Abstract

Despite having important biological implications, insertion and deletion (indel) events are often disregarded or mishandled during phylogenetic inference. In multiple sequence alignment, indels are represented as gaps and are estimated without considering the distinct evolutionary history of insertions and deletions. Consequently, indels are usually excluded from subsequent inference steps, such as ancestral sequence reconstruction and phylogenetic tree search. Here, we introduce indel-aware parsimony (indelMaP), a novel way to treat gaps under the parsimony criterion by considering insertions and deletions as separate evolutionary events and accounting for long indels. By identifying the precise location of an evolutionary event on the tree, we can separate overlapping indel events and use affine gap penalties for long indel modeling. Our indel-aware approach harnesses the phylogenetic signal from indels, incorporating them into all inference stages. Validation and comparison to state-of-the-art inference tools on simulated data show that indelMaP is most suitable for densely sampled datasets with closely to moderately related sequences, where it can reach alignment quality comparable to probabilistic methods and accurately infer ancestral sequences, including indel patterns. Due to its speed, our method is well suited for epidemiological datasets, eliminating the need for downsampling and enabling the exploitation of the additional information provided by dense taxonomic sampling. Moreover, indelMaP offers new insights into the indel patterns of biologically significant sequences and advances our understanding of genetic variability by considering gaps as crucial evolutionary signals rather than mere artefacts.
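To illustrate why affine gap penalties favor long indels over many short ones, the following is a minimal sketch of a generic affine gap cost. It is not indelMaP's implementation: the function name `affine_gap_cost`, the parameterization (one opening cost plus a per-character extension cost), and the numeric values are assumptions chosen only for illustration.

```python
def affine_gap_cost(length: int, gap_open: float, gap_extend: float) -> float:
    """Cost of one contiguous gap of `length` characters.

    Generic affine scheme (an assumption, not taken from the paper):
    the first gap character costs `gap_open`, and each further
    character costs `gap_extend`.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


# Hypothetical penalty values, chosen only for the example.
GAP_OPEN = 2.5
GAP_EXTEND = 0.5

# One long indel event covering six characters.
one_long_event = affine_gap_cost(6, GAP_OPEN, GAP_EXTEND)

# The same six gap characters explained by three separate events of length two.
three_short_events = sum(affine_gap_cost(2, GAP_OPEN, GAP_EXTEND) for _ in range(3))

print(one_long_event, three_short_events)
```

Under these assumed values, the single long event is cheaper than the three short events, so a parsimony criterion with affine penalties prefers explaining adjacent gap characters by one event where the tree allows it.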

Ancestral sequence reconstruction accuracy for tree height 1.2 and 1.7

Multiple sequence alignment quality for tree height 0.8, 1.2 and 1.7

Figure S2: Deletion error for all parameter combinations for trees with tree height 0.8.

Figure S4: Overall character reconstruction accuracy for all parameter combinations for tree height 1.2.

Figure S5: Overall character reconstruction accuracy for all parameter combinations for tree height 1.7.

Figure S12: Correlation between added distance to the child nodes and substitution error for all parameter combinations with tree height 0.8 and indel rate 0.01.

Figure S13: Correlation between added distance to the child nodes and substitution error for all parameter combinations with tree height 0.8 and indel rate 0.05.

Figure S14: Correlation between added distance to the child nodes and deletion error for all parameter combinations with tree height 0.8 and indel rate 0.01.

Figure S15: Correlation between added distance to the child nodes and deletion error for all parameter combinations with tree height 0.8 and indel rate 0.05.

Figure S16: SPS and TCS scores for all parameter combinations with tree height 1.2.

Figure S17: SPS and TCS scores for all parameter combinations with tree height 1.7.

Figure S18: Relative insertion and deletion rates for all parameter combinations with tree height 0.8.

Figure S22: Proportion of accurately inferred insertion and deletion rates for all parameter combinations with tree height 1.2.

Figure S23: Proportion of accurately inferred insertion and deletion events for all parameter combinations with tree height 1.7.

Figure S24: SPS and TCS scores for a subset of parameter combinations with tree height 0.8 and 64 taxa. All methods received the same estimated guide tree for alignment.

Figure S25: The top figure compares relative indel rates (the estimated rate divided by the simulated rate) for Historian, indelMaP, and PRANK +F. The bottom figure displays the proportion of accurately inferred insertion and deletion events for the same methods. For PRANK +F, we used the events reported by the tool and additionally reconstructed ancestors with indelMaP based on the PRANK +F alignment. All methods received the same estimated guide tree.

Figure S26: Effect of different combinations of gap opening factor and gap extension factor on the total column score.

Figure S27: Computational time benchmark of MSA methods for an extended dataset with 400- and 800-taxon trees.

Figure S28: Computational time benchmark of ASR methods, under various parameter combinations with tree height 0.8.

Figure S29: Computational time benchmark of ASR methods for an extended dataset with 400- and 800-taxon trees.