Accuracy of multiple sequence alignment methods in the reconstruction of transposable element families

Abstract The construction of a high-quality multiple sequence alignment (MSA) from copies of a transposable element (TE) is a critical step in the characterization of a new TE family. Most studies of MSA accuracy have been conducted on protein or RNA sequence families, where structural features and strong signals of selection may assist with alignment. Less attention has been given to the quality of sequence alignments involving neutrally evolving DNA sequences such as those resulting from TE replication. Transposable element sequences are challenging to align due to their wide divergence ranges, fragmentation, and predominantly-neutral mutation patterns. To gain insight into the effects of these properties on MSA accuracy, we developed a simulator of TE sequence evolution, and used it to generate a benchmark with which we evaluated the MSA predictions produced by several popular aligners, along with Refiner, a method we developed in the context of our RepeatModeler software. We find that MAFFT and Refiner generally outperform other aligners for low to medium divergence simulated sequences, while Refiner is uniquely effective when tasked with aligning high-divergent and fragmented instances of a family.


INTRODUCTION
The ongoing explosion in the number of sequenced organisms highlights the need for reliable and thorough automated genome annotation pipelines. Most of the vertebrate genome finds its ultimate origin in transposable elements (TEs) (1)(2)(3)(4)(5), which have an enormous impact on genome activity and evolution (6)(7)(8)(9). Due to the volume and diversity of TEs, complete annotation of genomes depends on accurate identification and modeling of TE families (10). A central aspect of that process is the gathering of instances of each family, and the creation of multiple sequence alignments (MSAs) of those instances; these MSAs are often used to derive a consensus sequence (1,(11)(12)(13) and a profile hidden Markov model (pHMM) (14) for each family. Profile HMMs have been demonstrated to outperform consensus sequences for the identification of distant copies (15), however consensus sequences remain an important tool for interpreting sequence features such as open reading frames, splice sites, and transcription regulatory sites. In addition, many analysis tools are tailored to work with sequences rather than pHMMs and therefore are the focus of this work. Regardless of the sequence modeling methods, a high quality family-level TE MSA is critical for sensitive annotation of genomic copies, precise classification of TE families, reconstruction of encoded proteins, and family age estimation; this motivates an intense interest in the accuracy of methods producing MSAs for these sequence families.
Computational MSA approaches seek to optimize one of several scoring models, and an optimal solution of commonplace models is computationally intractable (16). Over the years, a multitude of MSA tools have been developed, each employing its own set of heuristics for achieving good alignment speed. The combination of heuristic and scoring functions leads to varying alignment accuracy. The alignment of multiple TE instances poses unique challenges, in that these copies can exhibit high sequence divergence, are often very fragmentary, and are dominated by neutral mutation patterns. Here, we seek to evaluate the efficacy of several commonly used tools in recovering accurate MSAs of neutrally evolving fragments of transposable element sequences.
TEs are prodigious generators of repetitive sequences in most genomes; their relationships can be difficult to recover due to rapid lineage bursts, complex recombination histories, and high rates of neutral mutation. The generation of an MSA from copies of a TE family is an important step in reconstructing the ancestral state of the TE and generating sequence models for genome annotation (15). TE sequences complicate alignment in several important ways: (i) Instances are often fragmented due to poor insertion fidelity, large deletions, or interruptions by insertions of other TEs. (ii) Due to their mostly neutral decay, there are generally no conserved regions that can anchor the alignment or open reading frames free from indel accumulation. (iii) Copies are often derived from a TE that was rapidly evolving in a genome; therefore, they represent a mixture of ancestral forms. (iv) Low complexity regions and internal repetition are common features. 5. The oldest detectable TE copies have accumulated over 35% substitutions since their arrival and given their neutral decay, have a substitution level of more than 70% between copies. In addition, rarer nonlinear events such as microduplication and inversion can further confound alignment.
Most MSA tools follow the progressive alignment approach, whereby a guide tree is estimated from the unaligned sequences and used to control the order in which sequences are merged into an increasingly complete MSA (17). These approaches often re-estimate the guide tree from the MSA and iterate this process until convergence. All tools evaluated except Refiner and Dialign employ this general framework. The differentiating factor for many of these tools is the objective function for scoring the pairwise alignments. Clustal Omega (18), Muscle (19), Kalign (20) and Dialign-TX (21) employ matrix-based scoring schemes to either pairs of sequence symbols or a column profile. T-Coffee (22) introduced a consistency-based scoring scheme in which a library of global/local pairwise alignments is used to generate position-specific scoring matrices (PSSMs) for the progressive alignment phase. Variations on this approach were later adopted by ProbCons (23), and MAFFT (24). Dialign bypasses the construction of a guide tree, and instead constructs an MSA by assembling pairwise collinear segment-to-segment alignments. Refiner (25), based on an ad-hoc approach that we have employed in the curation of TE families for over a decade, follows a pattern that we call iterative transitive alignment: all sequences are locally aligned to a single template sequence and the MSA is produced by aligning sequences to each other based on their alignments to the common template. The sequence representing the centroid of the set is used as the initial template. When complete, a consensus is computed from the resulting MSA, and the process is iterated using this new consensus as the template; iteration continues until convergence.
Sequence simulation tools have themselves evolved over time. Early efforts focused primarily on the generation of sequences along a fixed phylogeny, allowing for mutations based on time reversible substitution models (40)(41)(42)(43). These led to more sophisticated evolvers with support for insertion and deletion (indels) mutations, empirical substitution matrices, and branch dependent mutation rates (44- 50). Context dependent mutation rates have also been developed in some simulators; for instance the Evolver package (51) models special mutation rates for highly mutagenic CpG dinucleotides, and Trevolver (52) implements a triplet substitution model that accounts for first-order flanking effects.
Several metrics have been widely used in the assessment of predicted MSA datasets. These include the Sum-of-Pairs Score (SPS, aka developer score) (53,54), Column Score (CS) (53), and the Alignment Metric Accuracy (AMA) metric (55). SPS is the fraction of aligned residue pairs in the reference alignment that are correctly aligned in the predicted alignment. The CS score is the fraction of aligned columns in the reference alignment that are perfectly reconstructed in the predicted alignment and is often used as a measure of specificity in reconstruction of the MSA. A single misaligned sequence in a MSA can decimate the CS score, so that CS provides limited power to discriminate between alignments of highly diverged sequences. The AMA metric is the fraction of characters in the reference alignment that are correctly arranged in the predicted alignment, either aligned to another character or not aligned (i.e. aligned to a gap character). This is an appealing metric, in that it penalizes overalignment in MSA (failure to allow gaps as appropriate), but it does not factor in gap position and can therefore be thrown off by radically different gap positions within the predicted MSA (56). Motivated by the standard practice of generating centroid sequence models (consensus sequences) for MSAs of TE families, herein we introduce a new metric, 'consensus score loss' (CSL). CSL assesses the quality of the MSA by comparing the consensus produced from a predicted MSA to the consensus derived from the reference MSA ( Figure 1). We also demon-strate an additional approach for assessing the quality of an MSA in the case of TEs with coding capacity and apply it to natural copies of several ancient families of DNA transposons.
To the best of our knowledge, no formal evaluation of MSA tools has been previously conducted in the context of TE sequence families. To study this, we developed a new sequence simulator (TEForwardEvolve), and used it to generate a benchmark of simulated MSAs. Using this benchmark, supplemented with a small collection of ancient mammalian DNA Transposon families, we evaluated: MUSCLE, MAFFT, Dialign-TX, Kalign, FSA, Clustal Omega, ProbCons and T-Coffee, along with our Refiner approach. We demonstrate that MAFFT generally outperforms other generic alignment tools, and that our Refiner method produces comparable results for low-divergence sequences, and superior alignments when confronted with high levels of sequence fragmentation and sequence divergence.

Tree generation
A custom-made tool (genRandomTETrees.pl) was used to simulate phylogenetic trees for DNA Transposon and LINE TE families. For DNA Transposons, which exhibit a starlike phylogeny (Figure 2A), the tree is expanded by randomly choosing a parent node from the existing tree and appending a new child node with a random branch length between 0 and 10. Once the target number of extant nodes has been reached (100), the post-extinction phase of the sequence lifecycle is simulated by adjusting extant (leaf) node branch lengths to reach the target root-to-leaf length (100). Branch lengths in the tree do not equate to a specific unit of time; rather, they establish the relative duration of each branch. For the purposes of sequence simulation, the notional duration of a branch (i.e. the amount of mutation) is controlled by the simulator parameter 'generations per unit time' (GPUT).
To simulate master-gene model phylogenies such as seen in LINE families ( Figure 2B), the tree is expanded by adding a randomly determined number of children (2)(3)(4)(5) with randomly chosen branch lengths (0-5) to the current parent node (initially the root of the tree). One of the new children is randomly picked to be the new parent node (or 'master gene') and the process is iterated until the target number of extant nodes is reached (100). The branch lengths for extant nodes are then adjusted in a similar manner to the DNA transposon trees.

Sequence simulation
While there are many sequence evolution simulators currently available, none provide all the features necessary for realistic TE sequence simulation in one package: nucleotide simulation, indel simulation, context dependent (trinucleotide) substitution matrices, and fragmentation simulation. Inspired by the release of TRevolver (52), which supports tri-nucleotide substitution context, we developed TEForwardEvolve, which supplements tri-nucleotide substitution with simulation of indels and fragmentation.
For each class of TE, the simulator is provided a prototypical TE sequence (in this study: Tigger1/Charlie1 for DNA transposons and L2/CR1 for LINEs), a simulated phylogenetic tree, a context-dependent substitution rate matrix, indel parameters, the number of generations represented in the tree, and fragmentation parameters. The substitution rate matrix consists of all triplet pairs where the center base is allowed to change and the edge bases provide 1bp of flanking context (64 × 64 matrix); rates were derived from a study of 160 000 non-coding sites in a set of mammalian genomes (57). Indel lengths were modeled using a power law (Zipfian) probability distribution (insertion/deletion mean length = 1.7, insertion/deletion max length = 20) with an occurrence rate of 0.20 (insertion rate = 0.08, deletion rate = 0.12) relative to an average substitution rate of 1. Finally, fragmentation is simulated by selecting fragment sizes from a log normal distribution, a minimum fragment size and a randomly chosen start position to select only a portion of the parent sequence for duplication. Optionally, a minimum number of full-length sequences may be set such that fragmentation begins only after the minimum number of full-length sequences has been generated.
To study the impact of sequence substitution level, TEFowardEvolve was run with increasing values for the generations-per-unit-time (GPUT) simulation parameter (100-6000). This translates to an average Kimura divergence range of 1-52% for Tigger1 and 1-44% for L2 simulations. To study the impact of fragmentation, TEForwardEvolve was run with increasing levels of fragmentation (mean copy lengths ranging from 1200 down to 75, with a standard deviation of 300, minimum fragment size = 50, minimum full-length sequences = 2) at two substitution levels (GPUT 100, 3000). For each parameterization, 10 replicate simulations were run.

MSA evaluation
For each sequence evolution simulation, TEForwardEvolve provides a reference MSA for comparison to the MSAs predicted by the alignment tools. The qscore (19) tool was used to compute various metrics on the predicted MSAs including SPS, which is the fraction of aligned residue pairs in the reference alignment that are correctly aligned in the predicted alignment.
We also define a new score metric 'consensus score loss' (CSL), which assesses the quality of the predicted MSA by comparing (aligning) the consensus derived from it to the consensus derived from the reference MSA. If the predicted MSA is accurate, its induced consensus will be highly similar to the consensus from the reference MSA, and an alignment of the two consensus sequences will produce a high score, whereas an inaccurate predicted MSA will produce a low score; CSL characterizes MSA quality by computing the amount of the optimal consensus alignment score that is lost by using a predicted MSA. Specifically: let C p be the consensus sequence derived from the predicted MSA, and C r the consensus sequence produced from the reference MSA (Figure 1). To measure similarity of C p to C r , CSL aligns them to each other via Needleman-Wunch global pairwise alignment (58) C p ∼ C r , and aligns C r to it-  self to produce alignment C r ∼ C r . If the predicted MSA is perfectly accurate, C p will be identical to C r , so that score(C p ∼ C r ) = score(C r ∼ C r ). An inaccurate predicted MSA will cause score(C p ∼ C r ) < score(C r ∼ C r ). CSL quantifies this by calculating the fraction of the ideal score that is lost with the predicted alignment: (score(C r ∼ C r )score(C p ∼ C r )) / score(C r ∼ C r ) In the case of an extremely poor predicted MSA, the predicted consensus may lead to a negative global alignment score(C p ∼ C r ), so that >100% of the score is lost. For the analysis presented here, alignment was performed with a custom scoring matrix and gap parameterization described in the supplemental materials.

MSA tools and parameters
The MSA tools covered in this evaluation are shown in Ta

Refiner methodology
Our Refiner method works by establishing a single template sequence, aligning all sequences to that template, then producing an MSA based on the way that those sequences align to the template (in what we call a transitive alignment).
In the first pass, the template is chosen from the input sequences by picking the sequence with the best cumulative pairwise alignment score to all other sequences or roughly the centroid of the set. The resulting MSA is used to produce a consensus, which is used as the template for iterative rounds of transitive alignment, in the form of Expectation Maximization. The tool currently supports either RMBlast (59) or ABBlast (60) for pairwise alignment, although any sensitive aligner would suffice. This process is repeated until convergence; see (25,61) for details. The final reference sequence is the consensus for the family.
In this local-alignment strategy, characters that do not align to the template sequence are either arbitrarily aligned to each other (internal insertions) or not included in the alignment at all. This is not a problem in the context of RepeatModeler, since these are not part of high occupancy columns, so not part of the final consensus. For the purpose of MSA SPS evaluation, all characters must be present in the final alignment, so we simply add all such characters to the MSA such that they are not aligned to any other character.
The consensus caller used by Refiner employs two stages. The caller initially identifies the highest scoring character ('A','C','G','T','N' or '-') for each column from the subset of sequences aligning over it. The first step uses a matrix that reflects observed neutral DNA substitution patterns and is similar to matrices developed for RepeatMasker. For organisms with CpG methylation, which causes high conversion of CG to CA and TG, a second pass evaluates all dinucleotides in the initial consensus sequence for reassignment to 'CG' by registering the frequency of the most common products of CpG mutation, aligned CA and TG dinucleotides (61).

Simulated trees and sequences
We chose two TE classes to simulate, the Long Interspersed Nuclear Element (LINE), and the DNA transposon, to determine if their starkly different phylogenies lead to differences in the relative performance of MSA tools. The relationship of DNA transposon copies (Figure 2A) is typically a (near) star phylogeny (62,63), with any branching occurring randomly and early on. This reflects the fact that, due to random selection of the genomic template by the transposase, class II transposons tend to exhibit a short burst of activity before going extinct (1,64), and leave many neutrally decaying copies in the genome.
DNA transposon star phylogenies may be contrasted with those of LINEs ( Figure 2B), in which most copies are derived from a single dominant lineage of LINE TEs (the so-called master-gene model of evolution (65,66)); the resulting phylogenies approach those of pseudogenes of a rapidly evolving cellular gene (67). Phylogenetic trees approximating the evolutionary patterns of DNA transposons and LINE families were randomly generated using a custom-made tool (see Methods).
Sequences were simulated along these trees using a forward evolution sequence simulator seeded with a classspecific TE consensus sequence (see Materials and Methods). The DNA transposon sequence simulation was seeded with the Tigger1 family consensus (68), and the LINE tree was seeded with the L2 consensus (1). Simulation was run with ten replicates at 18 evolutionary time increments, producing 180 simulated sequence sets and reference MSAs (100 sequences each). The evolutionary time increments generated sequences ranging from 0.01 to 0.5 average Kimura (69) sequence divergence. Evaluation with Char-lie1 (DNA transposon) and CR1 (LINE) produced similar results, and are presented in the supplementary material (S2.2; Figures 1 and 2).

Alignment reconstruction accuracy
We computed SPS for all methods over a wide range of sequence divergence and for both TE classes (Figure 3). SPS results were significantly separated for both the DNA transposon simulations (P-value = 2.49e-22) and the LINE simulations (P-value = 1.98e-29; both p-values computed using the Kruskal-Wallis H test (70)). Furthermore, MAFFT and Refiner significantly outperformed other methods, according to a Wilcoxon signed rank test (71) see Supplementary material S1.1 for the full pairwise comparison table).

Alignment as a basis for consensus sequences
Consensus sequences can be derived from MSAs by choosing the most likely ancestral base at each position in the MSA (considering only positions with high occupancy) (72); this has, for example, long been the source of family consensus sequences used in annotating TEs. When TE copies form a star phylogeny, which appears to be the case for most class II transposon copies in mammals (73), the consensus will be identical to the ancestral sequence of the active TE. In case of the LINE MSA, a consensus may approach an average of the evolving active TE. We computed the CSL measure for each tool at a variety of divergence levels (Figure 4), showing the extent to which computed alignments support recovery of accurate consensus sequences.

Effect of sequence fragmentation
We evaluated the effect of fragmentation on MSA reconstruction as above, comparing the predicted MSA to the reference MSA, assessing both low divergence (1% avg Kimura divergence (69)) and high divergence (28% avg Kimura) sequences. The mean fragmentation size was varied from 75 to 1200 bp, based on observed fragmentation patterns in mammalian TE copies (Supplementary material S2.5). Figure 5 shows the effects of fragmentation level on SPS for a DNA transposon simulation and provides a visualization of the patterns for the fragmentation extremes. Most tools performed well on low-divergence sequences over a wide range of fragment sizes; at higher sequence divergence and fragmentation, Refiner outperforms all methods tested (Wilcoxon P-value ≤ 1.1e-11), with MAFFT, Dialign and FSA outperforming the rest. We also explored the effect of fragmentation on the accuracy of MSA-derived consensus sequences, as in the previous section (Figure 6). At high sequence divergence levels, Refiner was the only tool with partial retention of correct alignment score, showing that it effectively produces MSAs that yield accurate consensus sequences. We found similar results when we seeded the simulation with the LINE tree and the L2 sequence (Supplementary material S2.3).

Comparison with natural sequence
We selected four DNA transposons: Zaphod (74), Zaphod2 (75), Tigger10 (76) and Arthur2 (77) to compare reconstruc- Figure 4. Accuracy of derived consensus sequence with respect to sequence divergence. Comparison of the predicted MSA-based consensus with the reference MSA-based consensus, for a range of sequence divergences. (A) For simulated Tigger1 sequences with variable levels of Kimura divergence, this plot shows the fraction (score(C r ∼ C r )-score(C p ∼ C r ))/ score(C r ∼ C r ), which corresponds to how effective the computed MSA is at producing a consensus sequence (C p ) that agrees with the one for the simulated sequence (C r ). Scores are for the Needleman-Wunsch (NW) global alignment algorithm (see Materials and Methods). Center line of each band shows the mean loss of score for each tool, while the surrounding shaded region shows the 95% confidence interval. (B) The same fraction-of-optimal-score is captured, but for sequences simulated from L2.  Figure 4; input sequences are low-divergence fragments (1% Kimura divergence) as from Figure 5A. (B) Same as in A, but with high-divergence fragment inputs (28% Kimura divergence) as from Figure 5B. tion accuracy at the protein level. These were selected for their extreme age (Tigger10 and Arthur2 predate the common ancestor of marsupials and placental mammals) and the high divergence of human copies to each other are expected to pose a challenge for MSA tools. For these families, there exist high quality manually created consensus sequences, supplemented by copies from reconstructed ancestral mammalian genomes (78). As DNA transposons, they are expected to have a star phylogeny, so that an accurate MSA should recreate the ORFs of the active elements. For each family, 100 annotated genomic instances from the human genome were sampled randomly from all members of the family, and aligned with each tool. A consensus was generated from each predicted MSA, then blastx (default: matrix = blosum62, E = 10.0, gap open = 11, gap ext = 1) was used to compare the consensus to the curated transposase proteins from our RepeatMasker Repeat Protein Database. To avoid the potential for circularity due to the fact that we were involved in both the curation of the protein sequence and the present assessment of alignment, we also searched the MSA consensi against homologs for these proteins found in the RefSeq NR database. For each comparison, the union of blastx results with an evalue <0.001 were plotted (Figure 7) against the full length protein. Though none of the MSA-derived consensus sequences is able to produce a full-length blastx match to the related protein sequence, one tool (Refiner) demonstrates clearly superior MSA-based blastx results: (i) only consensus sequences derived from the Refiner MSAs had matches for all elements and, (ii) Refiner-based blastx alignments were much longer and higher-scoring. This surpris-ing result underscores the value of evaluating reverse translation quality among other metrics for TEs with coding regions.

DISCUSSION
Using our new method TEForwardEvolve, we simulated the evolution of TE families over a wide variety of sequence divergence and fragmentation to investigate the impact on MSA prediction quality. The evaluated tools exhibited similar patterns of performance degradation for both TE classes as the sequence divergence was increased, with statistically significant performance differences among them. For fulllength sequence inputs, MAFFT and Refiner maintained the highest alignment accuracy over the range of divergences. Surprisingly, the phylogenetic structure of the simulated trees did not produce a noticeable effect on alignment performance, suggesting that the results would hold for other TE classes.
SPS (aka the developer score) is an established and generally informative measure of global MSA reconstruction; however, it doesn't appear to fully reflect the performance of MSA tools in the context of consensus sequence prediction. This can be seen most clearly in the performance of Refiner at high sequence divergences, where it is outcompeted by MAFFT's SPS score while simultaneously producing superior accuracy of the MSA-derived consensus sequence. Local alignment appears to be key to explaining the difference. The consensus is negatively influenced by the tendency of many MSA tools to force mismatched sequence regions together in the MSA (55,79); local alignment avoids this at the expense of a complete alignment. In addition, short regions of misalignment in an MSA, while not penalized heavily by the SPS metric, can lead to incorrect consensus generation. For example: with highly fragmented sequences, FSA occasionally produced alignments where a majority of the sequences were incorrectly anchored at the start of the alignment for a short stretch (3 bp) followed by large gaps before continuing in roughly the correct location in the overall MSA (see Supplementary material S2.4). In cases such as this, the alignment artifact would cause a consensus caller to consider the sequences contained in the gap as probable insertions and generate a dramatically shortened consensus.
Our results suggest that even low levels of fragmentation, when combined with higher sequence divergence, poses a significant challenge to MSA tools. Faced with fragmentary input, Refiner produced MSAs that were significantly more accurate than other tools, highlighting the value of its iterative transitive alignment approach for this challenging form of input. For this study, fragmentation is simulated as a random process, whereas fragmentation processes exhibit biases for many TE classes (e.g. 5' truncation in LINE elements, or the generation of solo LTRs through recombination). We plan to extend our simulator, TEFowardEvolve, to explore the effect of these fragmentation architectures on MSA prediction as well as additionally consider SINE and LTR TE classes.
Finally, simulation results were validated with an evaluation using manually curated protein sequences, known homologs, and instances of the TE family sequences that once encoded them. Only one tool (Refiner) was consistently successful in computing an MSA-based consensus that matched the corresponding proteins in a blastx search. The classification of a TE family is greatly facilitated by comparing to existing TE protein databases, underscoring the importance of an accurate MSA reconstruction. This analysis was restricted to DNA transposons due to the limited availability of ancient TE protein reconstructions. Natural sequences present the most compelling benchmark; however, it is difficult to obtain large enough samples for a complete parametric analysis. We plan to further extend this aspect of the benchmark as a complement to the simulated datasets.