FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets

Abstract Motivation In bioinformatics, multiple sequence alignment (MSA) is a crucial task. However, conventional methods often struggle with aligning ultralong sequences. To address this issue, researchers have designed MSA methods rooted in a vertical division strategy, which segments sequence data for parallel alignment. A prime example of this approach is FMAlign, which utilizes the FM-index to extract common seeds and segment the sequences accordingly. Results FMAlign2 leverages the suffix array to identify maximal exact matches, redefining the approach of FMAlign from searching for global chains to partial chains. By using a vertical division strategy, large-scale problem is deconstructed into manageable tasks, enabling parallel execution of subMSA. Furthermore, sequence-profile alignment and refinement are incorporated to concatenate subsets, yielding the final result seamlessly. Compared to FMAlign, FMAlign2 markedly augments the segmentation of sequences and significantly reduces the time while maintaining accuracy, especially on ultralong datasets. Importantly, FMAlign2 enhances existing MSA methods by conferring the capability to handle sequences reaching billions in length within an acceptable time frame. Availability and implementation Source code and datasets are available at https://github.com/malabz/FMAlign2 and https://zenodo.org/records/10435770.


Suffix Array and LCP Array Construction
The construction of the suffix array, as outlined by Manber and Myers [1993], serves as a fundamental component in bioinformatics and various applied string processing fields.This is attributable to the capacity of suffix arrays, in conjunction with additional data structures, to effectively tackle stringrelated problems such as pattern matching.Commonly employed additonal data structures include the Longest Common Prefix array (LCP) and the Document Array (DA) [Muthukrishnan, 2002].Despite the existence of numerous efficient algorithms for creating suffix arrays of individual strings and associated additional data structures, a scarcity of algorithms focusing on constructing suffix array for string collections remains.Currently, gsufsort [Louza et al., 2020] stands as the most expeditious algorithm for generating suffix array of string collections along with additional data structures.As an open-source software, gsufsort supports fasta, fastq, and text files containing multiple strings as input, subsequently constructing suffix array and additional data structures before writing them to disk.FMAlign2 utilizes gsufsort to address suffix array and additonal data structures for multiple sequences provided as input.
The gsufsort algorithm facilitates the construction of a suffix array for a collection of strings through the concatenation of these strings.Considering a collection of m strings, s 1 , s 2 , • • • , s m , derived from an alphabet Σ = [1, σ] comprising ASCII symbols with respective lengths n 1 , n 2 , • • • , n m .The concatenation of these symbols results in a string S = s 1 $s 2 $ • • • $s m $#, where $ represents the separator between distinct sequences, and # serves as the end-marker for the sequence.It is important to note that # and $ do not appear in any string s i , and # < $ < α for any other symbol α ∈ Σ.The total length of S is given by m i=1 (n i + 1) + 1 = N .Before delving further into the intricacies of gsufsort, we take a moment to revisit the definitions of some data structures.Given a string S with a length of N , the suffix initiating at position i is denoted as S i , where 0 ≤ i < N .The suffix array (SA) of a string S with a length of N comprises an array containing a permutation of [0, N − 1] that presents the suffixes of S in lexicographic order.The length of the longest common prefix shared by strings R and T is represented by lcp(R, T ).The LCP array for S supplies the lcp between consecutive suffixes in the order of SA, such that LCP[0] = 0 and LCP The gsufsort algorithm employs the gSACA-K method [Louza et al., 2017] to construct the SA for the concatenated string S. In doing so, it resolves ties between equivalent suffixes originating from distinct strings s i and s j based on their respective ranks, denoted by i and j.Notably, the gSACA-K algorithm is capable of concurrently computing the LCP and DA during the SA construction process, ensuring that LCP values do not surpass the separator symbols.The gSACA-K algorithm operates in O(N ) time complexity, utilizing O(σ) working space.

LCP inervals Finding
The LCP-interval [Abouelhoda et al., 2002], or Longest Common Prefix interval, is a fundamental construct in the analysis of strings and their comparative properties.Formally, an LCP-interval is an Interval[i..j] of lcp-value l min in the LCP array, which satisfies the following properties: push < left, right − 1 > into intervals 18: end if An LCP-interval, as defined, encapsulates a contiguous subarray of the LCP array that adheres to specific conditions concerning the LCP values at the interval's boundaries and the elements within the interval.Specifically, within the LCP-interval, all elements possess LCP values that are greater than or equal to the lcp-value l min .Moreover, the LCP values immediately preceding (LCP[i]) and succeeding (LCP[j + 1]) the interval are strictly less than l min .Consequently, an LCP-interval represents a region in the LCP array where the elements consistently exhibit LCP values that surpass those at the boundaries, which are comparatively smaller.
The process of identifying LCP-intervals, which represent contiguous sections of the LCP array that satisfy certain conditions, is a critical aspect of string analysis.Algorithm S1 provides an efficient method for extracting these intervals from a given LCP array.
The algorithm operates in a linear pass over the LCP array, leading to a time complexity of O(n), where n is the size of the LCP array.The space complexity is also O(n), as in the worst-case scenario, every position in the array could form a valid interval.This efficient design makes the algorithm highly practical for processing even large-scale data sets.

MEMs Left Expension
Figure S1: The method for extending LCP-intervals towards the left to identify a broader set of potential MEMs.Initially, each substring within an LCP-interval is inspected, focusing on the characters to its immediate left.If these characters are uniform, the interval extends leftwards by reducing the substring's start position and increasing its length.The process iteratively repeats until characters are no longer identical or the string's left boundary is reached, leading to the transformation of extended LCP-intervals into MEMs.
After identifying the LCP-intervals, our subsequent step is to explore their possibilities by extending towards the left, aiming to pinpoint a broader set of potential MEMs.By design, the LCP-intervals are bound such that they can't extend any further to the right.In contrast, leftward expansion is unrestricted and forms a crucial part of our comprehensive string data analysis.The leftward expansion is conducted in the following way.As shown in Fig S1 , for each LCP-interval, we look at each substring it contains.We then inspect the characters immediately left of each substring.If all these characters are the same, we extend the LCP-interval to the left: the beginning position of each substring in the interval is reduced, and the length of each substring is increased.This process repeats iteratively, inspecting the newly included characters and further extending the interval to the left if they are all identical.The extension halts when we reach a stage where the characters left of the substrings are no longer identical or when we hit the left boundary of the original string, past which further extension is not possible.Through this detailed process, we extend the LCP-intervals to include a larger set of substrings, transforming them into MEMs.

MEMs Filtering
Although FMAlign2 employs MEMs and MUMmer [Marçais et al., 2018] uses MUM to segment sequences in a manner that seems quite similar, segmenting multiple sequences is a more complex problem than segmenting just two sequences.Considering the added challenge of segmenting multiple sequences, it's vital to emphasize the role of MEMs colinearity and size.If two MEMs each contain at most one substring per sequence and these substrings consistently maintain the same relative order across all sequences without overlap, We refer to these two MEMs as being colinear.Two MEMs being colinear means they won't conflict when segmenting sequences.Let's define the size of the i-th MEM as the product of its substring length l i and the number of its substrings k i : Given that larger MEMs are more likely to contribute to a MSA, our goal is to select a set of MEMs that not only have the largest size but are also colinear with each other.We will use Figure S2 to provide a specific example to illustrate the steps of MEMs filtering.

MEMs preprocessing
First, we perform a two-step preprocessing on each MEMs.
(1) We compute the average position of substrings within their respective sequences.For sequences with multiple substrings, we retain only the substring closest to this average position, guaranteeing that each MEMs contains at most one substring per sequence.For instance, in Figure S2A, the fourth sequence has two substrings under MEMs II.In Figure S2B, the substring that is farther from the average position is removed.
(2) We establish a minimum sequence coverage threshold c(0 ≤ c ≤ 1).MEMs that do not cover a proportion greater than c of the total sequences n will be discarded.Subsequently, Given m retained MEMs, we sort them from left to right based on the average position of their substrings.In our example, the value of c is set to 0.6, meaning each MEMs must span across three sequences.Therefore, since MEMs IV spans only two sequences, it is entirely removed in Figure S2B.

Global and Local Mode
Algorithm S2 DP Algorithm of Global Mode Require: dp[0, . . ., n] = 0 and prev[0, . . ., n] = −1, with n as |M EM s| Ensure: N ewM EM s containing colinear MEMs with largest size 1: for each j ∈ [0, i) do 5: if After preprocessing, we introduced two dynamic programming strategies to filter MEMs: the global mode, which focus on the entire MEMs, and the local mode, which concentrates on substrings within an individual sequence.
Global Mode -In the global mode, we precompute a MEMs adjacency matrix A ∈ R m×m by examining the positions of substrings in two MEMs.If A[i, j] = 1, it signifies that the i-th MEMs and the j-th MEMs are colinear; otherwise, its value is 0. Based on the recursive formula 2, we could get the set of MEMs with the largest size.Therefore, as shown in Figure S2C, when there is an overlap between MEMs I and MEMs II, since MEMs I has a larger size, MEMs II is entirely removed.
After verifying colinearity, we obtain the colinear MEMs set with the largest size as final result.The detailed explanation of the global mode is provided in the Algorithm 2.
Local Mode -The global mode is straightforward, removing entire MEMs when faced with noncolinear instances.In contrast, the local mode, considering the permissible deletion of some MEM substrings, employs dynamic programming on all substrings of a sequence.For a given sequence with s MEMs substrings, these are sequentially numbered based on their start positions.We then construct a matrix A ∈ R s×s to discern overlaps; an entry A[i, j] = 1 implies non-overlap between the i-th and j-th substrings, and 0 indicates overlap.Using the recursive formula 3, we can obtain the set of non-overlapping substrings on this sequence with the maximum combined length.After executing the dynamic programming n times, all substrings not selected will be removed.MEMs with a substring count falling short of the floor value of c×n are eliminated, and the remaining MEMs are sorted based on average positions.As shown in Figure S2D, the dynamic programming on the third sequence will remove the substring of MEMs II, but the remaining number of substrings in MEMs II still meets the minimum sequence coverage and is thus retained.Ensuring substrings on each sequence align with the MEMs' order, we obtain the colinear MEMs set with the largest size.
5 Sequence-Profile Alignment    Show parameters details: ./FMAlign2 -h Parameters Details: -i [file path] [required] The path to the input file.
-t [int] [default: cpu number] The maximum number of threads that the program can run.It is recommended to set this to the number of CPUs.
-l [int] [default: square root of mean length] The minimum length of MEMs, the default value is the square root of the mean length of the sequences.
-c [float] [default: 1] A floating-point parameter that specifies the minimum coverage across all sequences.Acceptable values range from 0 to 1.
-f [mode] [default: global or local] The filter MEMs mode.If the sequence number is less than 100, the local mode is used; otherwise, the global mode is used.
-d [int] [default:0] The depth of recursion, which can generally be ignored.
-v [int] [default:1] The verbose option, can be set to 0 or 1.This can usually be ignored.
-h [help] To print help information.

Figure S2 :
Figure S2: Four states of MEMs in filtering.(A) MEMs before preprocessing.(B) MEMs after preprocessing.(C) MEMs after global mode.(D) MEMs after local mode.

Figure S4 :
FigureS3: Sequence-Profile Alignment Example: (A) In this example, there are 4 partial chains, with the second and third chains not spanning the entire sequence.(B) After cutting the sequence by partial chains, a total of 5 segments and two fragments are generated.The three segments that will be involved subsequently are marked as A, B, C. Each Segment will undergo multiple sequence alignment, which is not shown in the figure.(C)The sequences with fragments are the first and second ones, and since the fragment on the second sequence is shorter than that on the first, we prioritize aligning fragment 1 from the second sequence to the backbone.The creation of fragment 1 is due to the cutting by partial chains 1 and 3; hence, we first splice segments A and B as well as partial 2 to form Segment A&B.Fragment 2 and Segment A&B are then aligned using sequence-profile alignment.(D) Fragment 1 is aligned to the backbone using the same steps, with this backbone being formed by the previous alignment results, partial chain 3, and segment C.
dp[j] + size > dp[i] and MEMs[i] and MEMs[j] are colinear then EndIndex ← index of MaxSize in dp 12: Initialize N ewM EM s as an empty list 13: while EndIndex > 0 do

Table S1 :
Detailed information of dataset

Table S2 :
Time consumption and average SP scores for mitochondrial genome database

Table S3 :
Peak memory usage for mitochondrial genome database

Table S4 :
Peak memory usage for virus genome datasets -' indicates that the method cannot complete the alignment on the dataset due to memory overflow.Windows user:./FMAlign2.exe-i /path/to/data [other options]