From pairwise to multiple spliced alignment

Abstract

Motivation: Alternative splicing is a ubiquitous process in eukaryotes that allows distinct transcripts to be produced from the same gene. Yet, the study of transcript evolution within a gene family is still in its infancy. One prerequisite for this study is the availability of methods to compare sets of transcripts while accounting for their splicing structure. In this context, we generalize the concept of pairwise spliced alignments (PSpAs) to multiple spliced alignments (MSpAs). Beyond enabling the study of transcript evolution, MSpAs serve several other purposes. For instance, they are key to improving the prediction of gene models, which is important for addressing the growing problem of genome annotation. Despite their importance, a formal definition of the concept and methods to compute MSpAs are still lacking.

Results: We introduce the MSpA problem and the SplicedFamAlignMulti (SFAM) method to compute the MSpA of a gene family. Like most multiple sequence alignment (MSA) methods, which are generally greedy heuristics that assemble pairwise alignments, SFAM combines all PSpAs of coding DNA sequences and gene sequences of a gene family into an MSpA. It produces a single structure that represents the superstructure and models of the gene family. Using real vertebrate and simulated gene family data, we illustrate the utility of SFAM for computing accurate gene family superstructures and MSAs, inferring splicing orthologous groups, and improving gene-model annotations.

Availability and implementation: The supporting data and implementation of SFAM are freely available at https://github.com/UdeS-CoBIUS/SpliceFamAlignMulti.

Supplementary information: Supplementary data are available at Bioinformatics Advances online.


T-Coffee-based multiple spliced alignment:
The SFAM tcoffee algorithm is composed of three steps:
1. Generate the primary library of residue pairs:
• Library generation for the SFAM tcoffee p (p for pairwise) method: The method makes direct use of the blocks of the pairwise spliced alignments given as input to SFAM tcoffee. For any CDS c ∈ C of a gene g ∈ G, and any gene h ∈ G such that g ≠ h, we consider the pairwise spliced alignment […] parameters: 2 (match score), 0 (mismatch score), -10 (gap opening penalty), -1 (gap extension penalty). […] (gpos_{c→g}(s_i^c), gpos_{c→g}(e_i^c)) induced by the multiple sequence alignment. Each pair of aligned residues in the pairwise alignment is added to the library with a weight equal to the PID (percent identity) of the pairwise alignment.

2. Compute a multiple sequence alignment M of G: Using the T-Coffee algorithm with default parameters and the library of residue pairs computed at the previous step, a multiple sequence alignment of all gene sequences of G is computed.
3. Compute a multiple spliced alignment of C ∪ G given the multiple sequence alignment M of G: The set of all segments of all gene structures is partitioned such that, for any pair of genes (g, h) ∈ G², segments S(g)[i] and S(h)[j] are included in the same group if:
• More than half of the residues of segment S(g)[i] are aligned with residues of segment S(h)[j]; or […]
If a group contains multiple segments of a gene structure S(g), these segments are merged into a single segment of g whose start location (resp. end location) is the minimum start (resp. maximum end) location of all these segments. Each resulting group is then defined as a multi-block of the multiple spliced alignment.
For instance, given the following alignment of genes g and h from Figure […] h:(46,57)}}, and the resulting multiple spliced alignment is depicted in Figure 1(B).
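The grouping and merging of step 3 can be sketched as follows. This is a minimal illustration, not the SFAM implementation: the function names, the representation of M as gapped strings, and the 0-based half-open segment coordinates are all assumptions, and the toy sequences are invented. Only the criterion stated above is implemented; because it is tested on ordered pairs (g, h), it is applied in both directions.

```python
from itertools import product


def column_of_residue(aligned):
    """Map each residue index of the ungapped sequence to its MSA column."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]


def group_segments(msa, segments):
    """Group gene segments into multi-blocks (hypothetical helper).

    msa:      {gene: gapped row of the alignment M}
    segments: {gene: [(start, end), ...]}, 0-based, half-open (assumption)
    """
    cols = {g: column_of_residue(row) for g, row in msa.items()}
    nodes = [(g, seg) for g in segments for seg in segments[g]]
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (g, sg), (h, sh) in product(nodes, repeat=2):
        if g == h:
            continue
        cols_h = set(cols[h][sh[0]:sh[1]])
        shared = sum(1 for c in cols[g][sg[0]:sg[1]] if c in cols_h)
        # More than half of the residues of S(g)[i] align with S(h)[j].
        if 2 * shared > sg[1] - sg[0]:
            parent[find((g, sg))] = find((h, sh))

    groups = {}
    for g, (s, e) in nodes:
        block = groups.setdefault(find((g, (s, e))), {})
        old_s, old_e = block.get(g, (s, e))
        # Segments of the same gene in one group merge to (min start, max end).
        block[g] = (min(old_s, s), max(old_e, e))
    # Sorting is only for a stable display order.
    return sorted(groups.values(), key=lambda b: min(s for s, _ in b.values()))


# Invented toy alignment of two genes.
msa = {"g": "ACGT--ACGTAC", "h": "ACGTTTAC--AC"}
segments = {"g": [(0, 2), (2, 4), (4, 8), (8, 10)],
            "h": [(0, 6), (6, 8), (8, 10)]}
blocks = group_segments(msa, segments)
```

In this toy input, the two leading segments of g both fall in the group of h's first segment and are merged into a single segment (0, 4). The second group forms only through the h-to-g direction, since just half of g's segment (4, 8) is aligned with h's segment (6, 8).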

Graph-based multiple spliced alignment:
The SFAM mblock algorithm is composed of four steps:
1. Compute the alignment graph graph(X); […]
[…], e_{i2}^x) corresponding to two non-overlapping segments of the same sequence x, a set of low-confidence edges is removed from cc in such a way as to disconnect the two vertices. The procedure is as follows: as long as the two vertices are connected, iteratively find a shortest path between them, and remove an edge e that first maximizes connect(e) and then minimizes PID(e). The rationale behind this step is that a multi-block must contain at most one segment of each sequence. Therefore, a connected component containing two non-overlapping segments of the same sequence cannot represent a multi-block. The result of this step is a new graph denoted graph′(X). For instance, in Figure 2, the edge (g : (41, 46), h[c3] : (17, 24)) will be removed in order to disconnect vertices g : (41, 46) and g : (51, 55).
4. Consider connected components of graph′(X) as candidate multi-blocks, and build the multiple spliced alignment in a progressive manner: For each connected component cc of graph′(X), a candidate multi-block composed of the segments (vertices) in cc is built. The resulting set M of candidate multi-blocks is ordered by decreasing multi-block size. The multiple spliced alignment A is initialized to an empty chain.
At each iteration until M is empty, the first multi-block a ∈ M is removed. If a is consistent with A, then a is added to A; otherwise, a minimum number of gene segments, together with their corresponding CDS segments, are removed from a to make it consistent with A. The resulting multi-block, which is smaller than a, is then added back to M while preserving the ordering of M by decreasing size.
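The edge-removal procedure and the progressive construction described above can be sketched as follows. This is an illustration under several assumptions rather than the SFAM implementation: connect(e) and PID(e) are taken as precomputed edge scores, the removed edge is chosen among the edges of the current shortest path, multi-block size is taken to be the number of segments, and the consistency test and trimming rule are supplied by the caller. The toy criterion used here (segments of the same gene must not overlap) is a stand-in, and all names and values are hypothetical.

```python
import heapq
from collections import deque
from itertools import count


def shortest_path(adj, src, dst):
    """Breadth-first search: vertices of a shortest src-dst path, or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return None


def disconnect(adj, connect, pid, u, v):
    """Delete edges (in place) until u and v are no longer connected."""
    removed = []
    while (path := shortest_path(adj, u, v)) is not None:
        edges = [frozenset(pair) for pair in zip(path, path[1:])]
        # Highest connect(e) first, lowest PID(e) as the tie-breaker.
        worst = min(edges, key=lambda e: (-connect[e], pid[e]))
        a, b = worst
        adj[a].discard(b)
        adj[b].discard(a)
        removed.append(worst)
    return removed


def build_chain(candidates, is_consistent, trim):
    """Greedy assembly; trim must return a strictly smaller multi-block."""
    tie = count()  # tie-breaker so the heap never compares two dicts
    heap = [(-len(b), next(tie), b) for b in candidates]
    heapq.heapify(heap)
    chain = []
    while heap:
        _, _, block = heapq.heappop(heap)  # largest remaining multi-block
        if is_consistent(block, chain):
            chain.append(block)
        else:
            smaller = trim(block, chain)
            if smaller:
                heapq.heappush(heap, (-len(smaller), next(tie), smaller))
    return chain


# Toy stand-in criterion: segments of one gene must not overlap.
def clashing(block, chain):
    return {g for g, (s, e) in block.items() for kept in chain
            if g in kept and s < kept[g][1] and kept[g][0] < e}


def is_consistent(block, chain):
    return not clashing(block, chain)


def trim(block, chain):
    bad = clashing(block, chain)
    return {g: seg for g, seg in block.items() if g not in bad}


# Invented data: two segments of sequence x linked through a segment of y.
X1, Y, X2 = ("x", 1, 5), ("y", 1, 6), ("x", 9, 14)
adj = {X1: {Y}, Y: {X1, X2}, X2: {Y}}
connect = {frozenset((X1, Y)): 2, frozenset((Y, X2)): 1}
pid = {frozenset((X1, Y)): 0.4, frozenset((Y, X2)): 0.9}
removed = disconnect(adj, connect, pid, X1, X2)

candidates = [{"g": (0, 10), "h": (0, 12), "k": (0, 9)},
              {"g": (8, 20), "h": (15, 30)},
              {"k": (20, 30)}]
chain = build_chain(candidates, is_consistent, trim)
```

A priority queue keeps the candidates ordered by decreasing size even after a trimmed multi-block is re-inserted, which matches the requirement that the order of M be preserved. In the toy run, the second candidate clashes with the first on gene g, loses that segment, and re-enters the queue as a single-segment multi-block.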