Parsnp 2.0: scalable core-genome alignment for massive microbial datasets

Abstract Motivation Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014. Results To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4× and reduce runtime by over 2×, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. Availability and implementation Parsnp v2 is available at https://github.com/marbl/parsnp.

1 Supplementary Materials 1.1 Identifying the final core-genome coordinates from a set of partitioned alignments In order to identify the coordinates of the final core genome with respect to a reference sequence, we compute the intersection of the aligned reference regions for each partition.Let q be the number of partitions of n query genomes resulting in p genomes per partition, and R i = {(b 1 , e 1 ), (b 2 , e 2 ), ..., (b m , e m )} be the begin and end coordinates of the reference for each of the m LCBs in partition i.
We compute the interval intersection of all R i interval sets to obtain the coordinates of the final coregenome intervals, R * .Intervals which are smaller than 10 base pairs are removed from R * .The output alignments of each partition are then trimmed to match the coordinates in R * .At the end of this step, each partition will have an alignment file with |R * | LCBs, where each LCB corresponds to one of the intervals in R * .

Merging partitioned alignments
As the merging process is identical across each LCB, we will describe the case for merging a set of LCBs corresponding to the same reference interval.We use two operations, appending and stacking, where appending LCB a is the process of adding the rows in LCB b to LCB a .We LCB i as the alignment in partition i.

28
In the case of LCBs where there are no indels in the reference for every LCB, the task is straightforward: 29 stack all of the LCBs together.However, special care must be taken when indels are present in the reference , then instead of stacking all columns and then appending to 37 LCB * , we append the stacked columns to a temporary LCB X.Once we come across a column where no 38 LCB has any indels in the reference, we compute a multiple sequence alignment A of the sequences in X 39 using SPOA [1], where each sequence is the concatenation of non-gap characters in each row.The alignment 40 A is then appended to LCB * , X is reset to the empty LCB and the process continues.1).For all three datasets, there were many nearly-identical sequences included 45 which yielded a large number of effectively zero-length branches.To account for this, we used the normalized 46 weighted Robinson-Foulds distance (N-wRF) [3], a branch-length-weighted metric, to compare phylogenies.
30 sequences.Consider column j of LCB i , denoted as LCB i [j] = [c 1 , c 2 , c 3 , ...c 1+p ], where LCB i [j][k] = c k 31 represents the character in row k, column j of LCB i , and row 0 always corresponds to the reference.If 32 LCB i [j][0] = " − " and LCB i [j][k] ̸ = " − ", then we have no information on how the base at LCB i [j][k] is 33 related to the query sequences from other partitions.34 To remedy this, we assemble the output LCB, LCB * column by column.At column j * in LCB * , we keep 35 track of the corresponding column in each of the partition LCBs, denoted by j of phylogenies under different partition sizes 42 To ensure that downstream results would be similar across different partition sizes, we compared the 43 phylogenies obtained from RAxML [2] using the core variants output by the different Parsnp runs for the 44 bacterial datasets (Table

47Figure 1 :
Figure 1: The Parsnp outlined with dotted are XMFA is the genomealignment format used by Parsnp and Mauve [5].MAF is another common genome-alignment format often used by non-core multiple-genome methods.