HMMerge: an ensemble method for multiple sequence alignment

Abstract

Motivation: Despite advances in method development for multiple sequence alignment over the last several decades, the alignment of datasets exhibiting substantial sequence length heterogeneity, especially when the input sequences include very short sequences (either as a result of sequencing technologies or of large deletions during evolution), remains an inadequately solved problem.

Results: We present HMMerge, a method to compute an alignment of datasets exhibiting high sequence length heterogeneity, or to add short sequences into a given 'backbone' alignment. HMMerge builds on the technique from its predecessor alignment methods, UPP and WITCH, which build an ensemble of profile HMMs to represent the backbone alignment and add the remaining sequences into the backbone alignment using the ensemble. HMMerge differs from UPP and WITCH by building a new 'merged' HMM from the ensemble, and then using that merged HMM to align the query sequences. We show that HMMerge is competitive with WITCH, with an advantage over WITCH when adding very short sequences into backbone alignments.

Availability and implementation: HMMerge is freely available at https://github.com/MinhyukPark/HMMerge.

Supplementary information: Supplementary data are available at Bioinformatics Advances online.

WITCH and UPP, in fact, share the exact same code for decomposing the backbone alignment. In this study, WITCH decomposed the input alignments down to subsets of size 10, while UPP decomposed the input alignments down to subsets of size 2. The decomposed subsets passed to HMMerge were obtained through a modification of the decomposition code used in PASTA [3].

Tables

Table S1: Simulated DNA/RNA dataset overview. Here, we show the basic empirical statistics for the datasets used in this study. All datasets have 1000 sequences. Length is the length of the true alignment averaged over the replicates. The first 15 rows are ROSE model conditions, which have 20 replicates each; the remaining rows have 10 replicates each. Results shown are averaged across the replicates. The p-distances refer to the normalized Hamming distance, computed prior to fragmentation.

Figures

Figure S1: Total alignment error for seven benchmark methods on ROSE simulated datasets with introduced fragmentation. Alignment error (average of SPFN and SPFP) of all of the sequences in the final alignment produced by each method. These are highly fragmented ("HF" for short) datasets created from the ROSE simulated datasets introduced in the SATé study [2]. All datasets have 1000 sequences, with substitution rates that roughly increase from left to right. The error bars indicate standard error over 20 replicates. MAFFT here is MAFFT-linsi.
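The decomposition described above (WITCH stopping at subsets of size 10, UPP at subsets of size 2) can be sketched as follows. This is a simplified illustration only: the actual PASTA/UPP code splits on the centroid edge of a guide tree, whereas this sketch splits the sequence list in half; the function name and threshold parameter are ours, not the tools'.

```python
def decompose(sequences, max_subset_size):
    """Recursively split a list of sequence IDs until every subset has at
    most max_subset_size members.  Illustrative stand-in for the
    centroid-edge decomposition used by PASTA/UPP/WITCH."""
    if len(sequences) <= max_subset_size:
        return [sequences]
    mid = len(sequences) // 2
    return (decompose(sequences[:mid], max_subset_size)
            + decompose(sequences[mid:], max_subset_size))

seqs = [f"s{i}" for i in range(16)]
witch_subsets = decompose(seqs, 10)  # WITCH-style: subsets of size <= 10
upp_subsets = decompose(seqs, 2)     # UPP-style: subsets of size <= 2
```

In both tools, one profile HMM would then be built on the induced sub-alignment of each subset.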

S4 Impact of larger eHMM
In this section, we explore the impact of using larger eHMMs. By default, HMMerge uses HMMs built only on the minimal alignment subsets from the UPP(50) ensemble; here we explore the impact of also including the HMM built on the full backbone alignment, or, going further, of including all the HMMs in the UPP(50) ensemble.
We show in Figure S15 that including the HMM for the full backbone alignment in the eHMM enables jumping between HMMs, which results in improved accuracy.
In contrast, in our studies on biological and simulated datasets where we added the HMM for the full backbone (see Table S3), we did not see any difference in alignment error; this suggests that the extreme case illustrated in Figure S15 does not arise in these datasets. On the other hand, using the full UPP(50) eHMM, which includes every HMM created during the hierarchical decomposition rather than just the HMMs for the minimal sequence sets, did sometimes improve accuracy (see Table S3). This observation is consistent with the studies in [4], which explored UPP with changes to the eHMM (including using only the HMMs for the minimal sequence sets, as done here). Thus, there is likely room for improvement in HMMerge's accuracy through the use of larger eHMMs.
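The three eHMM variants discussed above differ only in which decomposition subsets contribute HMMs: the minimal (leaf) subsets only, the minimal subsets plus the full backbone, or every subset ever produced. A minimal sketch, again using simple halving in place of PASTA's centroid-edge decomposition (the function and variable names are ours):

```python
def decompose_collect(sequences, max_subset_size):
    """Hierarchically decompose a list of sequence IDs, returning both
    every subset produced along the way and the minimal (leaf) subsets.
    One profile HMM would be built per subset."""
    all_subsets, leaf_subsets = [], []
    stack = [sequences]
    while stack:
        subset = stack.pop()
        all_subsets.append(subset)        # every HMM ever created
        if len(subset) <= max_subset_size:
            leaf_subsets.append(subset)   # minimal sequence sets
        else:
            mid = len(subset) // 2
            stack.extend([subset[:mid], subset[mid:]])
    return all_subsets, leaf_subsets

seqs = [f"s{i}" for i in range(8)]
full_ensemble, minimal = decompose_collect(seqs, 2)
# Default HMMerge eHMM:      HMMs on `minimal` only
# "+ full backbone" variant: HMMs on `minimal` plus `seqs` itself
# Full UPP(50) eHMM:         HMMs on all of `full_ensemble`
```

Because the full ensemble also contains HMMs on intermediate subsets, it grows roughly linearly in the number of leaves, which is consistent with the higher memory use reported in Section S5.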

S5 Scalability of HMMerge
Although HMMerge was able to complete on the datasets we studied, which ranged up to 5751 sequences for the 5S.T dataset, it required more memory and longer runtimes than WITCH, and WITCH in turn required longer runtimes than UPP. Furthermore, some HMMerge analyses we attempted, such as aligning datasets using the entire UPP(50) ensemble, required more memory than was available to us on the University of Illinois Campus Cluster, whose queue nodes typically have at most 64 GB of memory, and some analyses exceeded the four-hour time limit. For these compute-intensive analyses, we used a high-memory machine with up to 1 TB of memory that permitted longer runtimes. For example, HMMerge was given up to 512 GB of memory on some runs on the INDELible datasets, which have long sequences. In general, HMMerge did not require more than 512 GB of memory, except for some runs using the entire UPP(50) ensemble, where it was given up to 1 TB.