DCAlign v1.0: aligning biological sequences using co-evolution models and informed priors

Abstract Summary DCAlign is a new alignment method able to cope with the conservation and the co-evolution signals that characterize the columns of multiple sequence alignments of homologous sequences. However, the pre-processing steps required to align a candidate sequence are computationally demanding. We show in v1.0 how to dramatically reduce the overall computing time by including an empirical prior over an informative set of variables mirroring the presence of insertions and deletions. Availability and implementation DCAlign v1.0 is implemented in Julia and it is fully available at https://github.com/infernet-h2020/DCAlign.


I. INTRODUCTION
A common task in Bioinformatics is to cast evolutionary-related biological sequences into a Multiple Sequence Alignment (MSA).The objective of this task is to identify and align conserved regions of the sequences by maximizing the similarity among the columns of the MSA.State-of-the-art alignment methods, like HMMER for proteins [4], and Infernal [12] for RNAs, use hand-curated MSAs of small representative subsets of sequences to be aligned (the so-called seed alignments).Whereas for proteins, HMMER builds the Hidden Markov Model (HMM) by using only the seed alignment, Infernal needs also secondary structure information to generate a Covariance Model (CM).In both cases, HMM (for proteins) or CM (for RNAs) are used to align query sequences.However, homologous sequences show signals of correlated mutations (epistasis) undetected by profile models.Conservation and co-evolution signals are at the basis of Direct Coupling Analysis (DCA)-based statistical models [3,9].Recently, these models have been used to align biological sequences [10] and perform remote homology search [17] by alignment of the sequences to a seed model, or by pairwise alignments of seed models [15].The method in [10], viz.DCAlign, returns the ordered sub-sequence of a query unaligned sequence which maximizes an objective function related to the DCA model of the seed.In this latter case, standard DCA models fail to adequately describe the statistics of insertions and gaps.To alleviate this limitation, we added to the objective function gap and insertion penalties learned from the seed alignment.While for the insertions, the computational complexity is negligible, inferring gap penalties is a timeconsuming problem (see [10] and Supplementary text).Here, we treat penalties in terms of informed priors computed from the seed sequences.The parameters for gaps and insertions, extracted from the seed alignment, are determined in an unsupervised manner.Finally, to further speed up the learning of the seed-based objective function, we obtain the parameters of the DCA model using pseudo-likelihood maximization [5] instead of Boltzmann Machine Learning [6,11].DCAlign v1.0, is a computational pipeline that allows for the computation of the seed-model parameters in a few minutes, contrary to its original implementation which required at least a day of computation in the best scenario.The alignment problem is then solved approximately through a message-passing algorithm (see Supplementary text).

II. METHODS
Our alignment algorithm estimates the optimal ordered sub-sequence compatible with a DCA model and empirical knowledge of insertions and gaps of the seed.Let A be an unaligned sequence of length N , and S be its aligned counterpart of length L (which is the length of the seed MSA).We only consider the L ≤ N case.At each i = 1, . . ., L, we define a Boolean variable x i ∈ {0, 1} and a pointer n i ∈ {0, . . ., N + 1}.The variable x i indicates whether the position i is a gap '-' (x i = 0) or a match, i.e. a symbol in A. When i is a match, n i identifies where S i matches A, i.e. S i = A ni ; instead, for x i = 0, the value of n i is used for keeping track of the last matched symbol in A. Let us define a pointer-difference variable as ∆n i,j = n j − n i for i = 1, . . ., L and j > i.Each auxiliary variable ∆n i,j quantifies how many symbols of the unaligned sequence A are present between two i, j positions of the aligned counterpart S. If a configuration of the n is given, the full set of the pointer differences reveal the presence of insertions and gaps between any columns i and j of the alignment (see Supplementary text).

A. Seed modeling
Together with a DCA model of the aligned seed (see Fig. 1, central panel), for every site i (in red), we compute the ∆n i,j for j > i for all the seed sequences, and we learn an empirical probability P i,j (∆n i,j ) as shown in the bottom central panel of Fig. 1 (this procedure is computationally very fast).The color gradient is associated with the value of j, the lighter the color, the larger is j.In Fig. 1 (bottom central panel) we consider as an example three sequences differing in the nature of the ∆n i,i+1 .

B. Alignment procedure
We can express the alignment problem in terms of the following optimization problem: where H is the DCA model describing the seed (see Fig. 1, top central panel), Z is a normalization factor, and β is a free parameter whose relevance will be discussed below.The maximization only runs over the feasible assignment of the variables, i.e. we impose that n i+1 > n i for every column i.The informed prior will guide the optimization process towards solutions that, among those that maximize the Boltzmann distribution associated with H, reproduce the statistics of the seed pointer differences.Unfortunately, the problem thus stated is unfeasible as the normalization function Z cannot be efficiently computed.Similarly to the first DCAlign version, we use an approximate messagepassing algorithm coupled with an annealing scheme over β (i.e.we iteratively increase β) to get the best alignment for the query sequence A (see Supplementary text and Fig. S2).

III. RESULTS
We can classify the type of tests performed to assess the performance of our computational strategy into three different categories: • Comparison with the previous implementation: As in [10], we compared our results against HMMER, Infernal (the last algorithm only for RNA sequences) on four Pfam (PF00035, PF00677, PF00684, PF00763), and Rfam (RF00059, RF00162, RF00167, RF01734) families.A detailed description of the dataset is contained in Tabs.S2-3.We utilized the following comparison metrics: (i) the positive predictive value (PPV) of the DCA-based contact prediction [3,9], (ii) the proximity measures between the generated and the seed MSAs.As far as the contact map prediction is concerned, we observe either a mild improvement or a similar performance.With respect to the proximity measures, we notice a negligible increase in the average distance between seed sequences and generated alignments (see Figs. S3-6, and Tabs.S7-10).
• Leave-one-out experiment: As a stress test for DCAlign v1.0 we also compared our results to twenty-five groundtruth MSAs either extracted from benchmark sets [2,7,16] or built from structural alignments [1] (see Tabs.S2, S4-5).The numerical experiments consist of iteratively excluding one of the sequences of the reference alignment and training HMM, CM, or DCAlign using the remaining sequences.The excluded sequence is then aligned and quantitatively compared to the ground truth (viz.the structural alignment, or the benchmark sets).The emerging picture depends on the data type considered: for benchmark sets all computational strategies seem to perform reasonably well.In particular, HMMER (resp.Infernal) and our algorithm provide similar outcomes for protein (resp.RNA) domains (see Figs. S7-10, Tabs.S11-12).However, when we consider structural alignments as our reference ground truth, our method significantly outperforms HMMER as shown in Figs.S11-12 and Tabs.S13-14.
• Divergent sequence alignment: Finally, to assess our algorithm's remote homology detection performance, we considered three RNA benchmark sets (the seed of Rfam RF00162 [8], Twister type P1 [13], tRNA [14], see Tab.S6) from [17].Results suggest that Infernal is the best-performing method on two of the three datasets, while our method achieves the best alignment for the tRNA case.Note that Infernal is trained using secondary structure information that our algorithm does not use.All results are presented in Fig. S13 and Tab.S15.
From a computational efficiency point of view, the time needed to train the algorithm is significantly smaller than both our old implementation and CM-Infernal (see Supplementary text and Fig. S14).However, the time necessary to align a sequence is equivalent compared to DCAlign, and probably to other computational strategies taking into account epistasis [15,17].

FIG. 1 .
FIG.1.Schematic representation of the DCAlign v1.0 pipeline.From a (given) hand-curated alignment (the seed, shown in the left panel), our algorithm learns (i) a DCA model H exploiting the one-site and two-site statistics of the seed (upper central box), and (ii) the gap and insertion penalties by means of the empirical distribution of the pointer differences P (∆nij) for i = 1, . . ., L, and j > i (bottom central box).The three sequences represent the three scenarios that can occur between position i and j = i + 1: some insertion can appear, no insertion and no gap is present, or i + 1 contain a gap, so ∆ni,i+1 = 0.For j > i + 1 (lighter blue cases), both insertions and matched symbols contribute to the computation of the ∆ni,j, while gaps do not carry any contribution (see Fig.S1for a more detailed example).The alignment problem is then mapped into a constrained optimization problem over the (x, n) variables.The constraints on the variables and an example of alignment are shown in the right panel.
IV. CONCLUSIONDCAlign v1.0 is a new implementation of the DCA-based alignment technique, DCAlign, which conversely to the first implementation, allows for a fast parametrization of the seed alignment.The new modeling significantly drops the pre-processing time and guarantees a qualitatively equivalent alignment of a set of target sequences.