Pasa: leveraging population pangenome graph to scaffold prokaryote genome assemblies

Abstract Whole genome sequencing has increasingly become the essential method for studying the genetic mechanisms of antimicrobial resistance and for surveillance of drug-resistant bacterial pathogens. The majority of bacterial genomes sequenced to date have been sequenced with Illumina sequencing technology, owing to its high-throughput, excellent sequence accuracy, and low cost. However, because of the short-read nature of the technology, these assemblies are fragmented into large numbers of contigs, hindering the obtaining of full information of the genome. We develop Pasa, a graph-based algorithm that utilizes the pangenome graph and the assembly graph information to improve scaffolding quality. By leveraging the population information of the bacteria species, Pasa is able to utilize the linkage information of the gene families of the species to resolve the contig graph of the assembly. We show that our method outperforms the current state of the arts in terms of accuracy, and at the same time, is computationally efficient to be applied to a large number of existing draft assemblies.


Introduction
The increasing availability of DNA sequencing technologies has profoundly transformed biomedical research and in particular microbial genomics.The ability to decode the whole genomes of a large number of bacterial isolates enables the study of the genetic mechanisms of antimicrobial resistance in drug-resistant pathogens ( 1 ,2 ), which have been rapidly emerging and are considered to be one of the main threats to public health in the next decades.Whole genome sequencing is also an effective tool for the surveillance of infectious diseases, and direct infection control measures in clinics (3)(4)(5).Substantial global efforts have resulted in large amounts of sequencing data for genome assemblies being generated to provide insights into the resistant causes and effects.To date, short-read sequencing technology using Illumina platforms remains the most common method for whole genome sequencing owing to its high throughput, sequence accuracy, and costeffectiveness.However, the relatively short read length cannot unambiguously resolve the repetitive sequences that are frequently present in most genomes.As a result, most genome assemblies are fragmented into large numbers of contigs and require additional information to improve their contiguity and completeness.
Much recent research has resulted in both technological and computational methodologies to generate better complete genome assemblies.Third-generation sequencing technologies such as Oxford Nanopore Technology and PacBio provide long reads spanning repetitive genomic regions.These long reads can be used for de novo assembling complete prokaryote genomes by long-read assemblers ( 6 ,7 ), or for scaffolding fragmented assemblies from short-read sequencing (8)(9)(10).However, their higher costs and less mature ecosystems make long-read sequencing less attractive for largescale sequencing projects.Long-range sequencing technologies such as mate-pairs, Hi-C ( 11 ), 10X Genomics linkedreads ( 12 ), and optical mapping ( 13 ) capture linkage information for inferring the distance and orientation between contigs in scaffolding assemblies.These technologies however often provide low-resolution information while having specific errors and biases, providing challenges to the computational analysis ( 14 ,15 ), in addition to extra steps in data generation.
Computational approaches are also developed to utilize available public data resources as the reference to guide the scaffolding process.The rationale is to use an available, more complete assembly of the closest-possible genome as the backbone to place the contigs in the correct order.For de novo assembly, identifying a close-related genome can be a difficult task.Even if possible, the issue of reference bias can become severe for those that are highly structurally variable, such as in multi-drug resistance strains.To alleviate this problem, a class of scaffolding methods that use multiple references for scaffolding have been developed.The prominent methods suitable for microbial genomes include Ragout ( 16 ,17 ) and Multi-CSAR ( 18 ,19 ).
The idea of using the genome variant graph as a comprehensive but compact reference model to alleviate the bias issue of using individual references has been proposed in various applications of human genomics ( 20 ,21 ), microbial genomics ( 22 ,23 ), and error correction for long-read sequencing metagenomic data ( 24 ).This technique has shown advantages over the traditional linear reference in e.g.alignment, variant calling and typing analyses.However, due to costly computation, the use of the variant graph usually requires high-performance computing resources and extensive running time.Instead, a pangenome graph at the gene level is adequate for almost all bacterial genomics analyses.It can provide the graph structure regarding the proximity relationships of all genes involved.A pangenome graph can be built from a set of bacterial genome assemblies by using tools such as Roary ( 25 ), Panaroo ( 26 ) or PanTA ( 27 ).The latter can construct the graph progressively by adding new genomes into an existing pangenome.
Here we introduce Pasa, the first computational method that makes use of the population information of a species to improve draft assemblies of new isolates belonging to the species.Specifically, it uses the pangenome graph built from existing genomes to resolve the assembly graph of a newly sequenced isolate for scaffolding.Pasa builds the former graph by using PanTA ( 27 ) on a large set of genomes and using it as the guide to resolve the assembly graphs generated by SPAdes assembler ( 28 ).We demonstrate the utility of Pasa by applying it to scaffold the draft assemblies of bacterial isolates of species with various levels of genomics diversification, namely, Klebsiella pneumoniae , Esc heric hia coli and Streptococcus pneumoniae .In all cases, we show that pangenome graphs are helpful in resolving the assembly graph for better scaffolding quality compared to other multi-reference-based methods.

Overview of algorithm
To assemble the target genome sequenced by a short-read technology, an assembler such as SPAdes ( 28 ) and Velvet ( 29 ) constructs a de Bruijn graph from overlapping reads, and then identifies contigs by finding walks through the de Bruijn graph that correspond to continuous sequences.Most genomes of both eukaryotic and prokaryotic organisms contain an abundance of repetitive sequences whose sizes are longer than the length of the reads.In such cases, a walk going into a repetitive sequence has multiple possible paths and thus the corresponding contig cannot be extended unambiguously, leading to the fragmentation of the assembly to multiple contigs.
Genome scaffolding involves ordering and orientating the contigs in a fragmented assembly, and thereby joining the disconnected contigs to improve the assembly's completeness and contiguity.Pasa (PAngenome-based ScAffolding) achieves this by exploiting the connectivity information obtained from the population genomes of the species.Figure 1 illustrates an overview of the approach of Pasa.In the first step, Pasa runs PanTA ( 27 ) on the reference genomes to obtain a pangenome, which is a collection of gene clusters.Pasa then uses the pangenome to build a pangenome graph that captures the structural variant landscape within the population.In this pangenome graph, nodes are clusters of genes, and two nodes are connected by an edge if they are adjacent in any genome from the population (Figure 1 A).In scaffolding the assembly of a target genome T , Pasa identifies the positions of gene clusters containing genes from the target genome T in the pangenome graph, thereby establishing the relative orientation of the genes in T .It then infers the relative positions of the contigs harboring the genes.Pasa then computes the matching score for each pair of contigs in T .The matching score measures the likelihood of two contigs being adjacent, i.e. the higher the score between two contigs, the more likely they are adjacent in the target genome.In addition, Pasa leverages the information from the assembly graph to obtain a more reliable estimation of the matching scores.Finally, it solves the constrained maximum matching problem to obtain an ordering of the contigs, which is then further refined to obtain the final scaffolds.The details of Pasa are outlined below, and the Supplementary Material contains the corresponding command lines and instructions.

Construction of the pangenome graph
Pasa builds the pangenome graph of the species from the genome assemblies of a collection of isolates including target genome T. The population genomes are annotated with Prokka ( 30 ), which generates a gff file for each genome assembly.It then constructs the pangenome of the species using a pangenome building method, which infers the gene families of the collection by clustering genes in these genomes based on sequence similarity .Essentially , any pangenome construction method such as Roary ( 25 ) and Panaroo ( 26 ) can be used but we opt to use PanTA ( 27 ) for its speed and scalability.Particularly, PanTA can build the pangenome of the species before hand, and in its progressive mode, can add the target genome into the existing species pangenome with minimum computation, instead of rebuilding the entire pangenome for every new target genome (see Supplementary Information 2.1) Next, Pasa orients the gene-level genomes such that they have the most common consecutive gene pairs.The orientations of the gene-level genomes are determined by the following procedure: The algorithm begins with the first genome, and its orientation is chosen arbitrarily.Pasa then identifies an orientation of the second genome that maximizes the number of common pairs of consecutive genes with the first genome.
Similarly, Pasa finds an orientation of the third genome that has the largest number of common pairs of consecutive genes with the first two genomes, and the procedure is repeated for the remaining genomes (Figure 1 A).Our later experiments reveal that the order in which the genomes are employed to create the pangenome graph has only a marginal effect on Pasa's performance (see Section 'Evaluation on simulated data' and Supplementary Figure S1 ).Pasa then constructs a directed graph with weighted edges, where nodes represent clusters of genes, two nodes are connected by an edge if they are adjacent in any genome from the population, and the weight of each edge corresponds to the number of times two nodes are adjacent in the oriented genomes.As a result, the edges along the conserved regions of DNA sequences throughout evolution are expected to have substantial weights.Lower-weight edges, on the other hand, are likely to correspond to infrequent or underrepresented portions of the genome.To reduce the impact of spurious edges, which may arise from incorrect assemblies and misclustering, Pasa removes edges with weights that are less than 20% of the number of reference genomes.For example, if the pangenome graph is constructed using 100 reference genomes, any edges with weights below 20 (20% of 100) will be removed from the graph.It is worth mentioning that Roary ( 25 ) and Panaroo ( 26 ) also construct a graphical representation of the pangenome, they however only consider a simple undirected unweighted graph to represent the gene arrangements in the population.Pasa, in contrast, utilizes a directed graph with weighted edges.This graphical representation is chosen because it effectively captures the connectivity information from the population genomes.The edge weights in the pangenome graph are later used to determine the optimal order of the contigs in the target genome.

Construction of contigs overlap graph
Pasa employs SPAdes to produce the draft assembly from sequencing data for its stability and high performance; however any assembler that generates the assembly graph can also be used.Based on the de Bruijn assembly graph from SPAdes, Pasa builds a sequence overlap graph of all final contigs, or the contigs graph for short throughout the scope of this article to distinguish it from the other graph structures.The SPAdes assembly graph is originally saved in a FASTG file, namely assembly _ graph .fastg .The sequences in this file are edges from the assembly graph, also known as preliminary contigs before the repeat resolution.The final contigs are then constructed from the subsequent repeat resolving step, for each of them comprises a unique path of preliminary contigs traversing this graph ( contigs .paths ).By combining these sources of information, we can construct the graph of interest with contigs as vertices and their overlapping connections as edges.A formal description and further details for implementation can be found in Supplementary Methods Section 2.3

Alignment of contigs in T to the pangenome graph
The draft assembly of the target genome T is annotated with prokka and is added into the pangenome graph by the 'add' function in PanTA.The orientation of a contig, when aligned to the pangenome graph, is selected in such a way that it maximizes the number of common consecutive gene pairs with the pangenome graph.

Multiplicity estimation
Pasa estimates the multiplicity of contigs, which refers to the number of copies of a contig in the target genome.Pasa uses the length and coverage information to estimate the multiplicity of the contigs.In particular, it utilizes a simplified model from ( 8 ).Specifically, let D be the median coverage of the five largest contigs.Then, for each contig, its multiplicity (number of copies) is defined as the ratio of its read coverage and D , rounded to the nearest integer.Intuitively, Pasa creates k different copies for each unresolved repeat in the target genome, where k is the estimated copy number of the repeat in the complete genome.As some repeats could already be resolved by the NGS assembler, the corresponding synteny blocks in the target contigs will be surrounded by other unique synteny blocks.Pasa uses this 'context' information to map repeat instances in contigs to the corresponding repeats in reference genomes.See additional details about the repeat resolution algorithm in the Methods section 'Repeat resolution algorithm'.

Matching scores
Given a collection of contigs C = { c 1 , c 2 , . . ., c n } , Pasa assigns a score to each pair of contigs in C.This scoring function utilizes both population information in the pangenome graph and sample-specific information in the contigs graph.The score indicates the likelihood that the two contigs are adjacent in the original genome.The score of two contigs c i and c j is defined as where d ( c i , c j ) is the shortest distance in nucleotides between the two contigs in the contigs graph , and s ( c i , c j ) mimics the number of reference genomes that support c i and c j to be adjacent.Specifically, s ( c i , c j ) = d g ( c i , c j ) / n g ( c i , c j ), where d g ( c i , c j ) is the shortest distance between the terminal gene in c i and the starting gene in c j in the pangenome graph, and n g ( c i , c j ) is the number of genes between the two end genes plus one.Note that the score ( c i , c j ) should be zero if c i and c j are not adjacent in the contigs graph.However, since only contigs with at least one gene are able to align to the pangenome graph and hence being considered, such a connection between c i and c j is still possible if the contigs between them are the ones that do not have any genes.Therefore, Pasa allows a connection between c j and c j if their distance in the contigs graph is at most 5000 bp, which is empirically twice the length of the largest contig that does not have any genes (in all bacterial genomes in our experiments).The score of two contigs is high if they are close to each other in many reference genomes.Pasa employs multiple genes in the contigs rather than just two end genes to obtain a more accurate score between the two contigs.In summary, Pasa first utilizes the sample contigs graph to determine whether two contigs will be adjacent in the complete target genome.If such a connection between the two contigs is possible, it uses the population pangenome graph to compute a score for each pair of contigs in the target genome.

Constrained maximum matching model
Similar to Ragout ( 17 ) and Multi-CSAR ( 19 ), Pasa splits each contig into two parts, a head and a tail node.This allows for finding an assembly that corresponds to a maximum weight matching in C. Specifically, given the set of contigs . ., n } and the node c t i is matched to the node c h j iff c i is just before c j (clockwise direction) in the target genome.Hence, the order of the contigs C in the target genome induces a perfect matching in G and finding the most likely arrangement of the contigs C corresponds to finding the maximum weight matching in G , with score ( c i , c j ) representing the weight of the edge (c t i , c h j ) .It should be noted that Ragout and Multi-CSAR do not take relationships among more than two contigs into account.Therefore, if score ( c i , c j ) and score ( c j , c k ) are high, they will join c i − c j − c k no matter the relationship between c i and c k .If c j is a repeat contig, this may be problematic.Pasa, on the other hand, exploits the long-range dependency between contigs.The pair ( c i , c k ) is not considered in the matching if c j is a repeat contig and s ( c i , c k ) ≤ γ.Here, γ is a user-defined parameter with a default value of 1.0.The higher value of γ will result in a lower number of misassemblies, it is however less likely to produce a complete assembly.The default value of γ is chosen based on the simulated experiments.Note that the default value of γ is used in all experiments.Pasa uses a modified version of the greedy algorithm to find the constrained perfect matching (see Algorithm 2).

Refinement
There are certain types of contigs that are not considered in the model.These include (i) contigs with no gene and (ii) contigs with genes that are not part of an existing gene cluster.To include these contigs in the final assembly, Pasa uses the contigs graph, which has been constructed from all input contigs in the target genome.The genome traverses the graph with a certain unknown path.However, since initial scaffolds are now available, Pasa uses these scaffolds to restore small or repetitive fragments.Given a contigs graph and a set of merged scaffolds from the previous step of the algorithm, for each pair of consecutive contigs from these scaffolds, Pasa finds all possible paths connecting them in the contigs graph that do not contain contigs from the scaffold.If there exists such a path, it inserts all the intermediate contigs along this path between the two contigs.
The refinement is further strengthened as follows: for each joined contig from the above refinement step, Pasa considers all single-copy contigs in C that made the joined contig, called backbone contigs .For two close backbone contigs in the joined contig (i.e.their distance in the joined contig is less than 2 kbp), the joined contig is split into two at one of the two backbone contigs if their matching score is < 3.0.This refinement step is omitted in Pasa's sensitive mode.

Evaluation on simulated data
We compared the performance of Pasa to existing state of the art reference-based scaffolders including Multi-CSAR ( 19 ) and Ragout (v2) ( 16 ).We excluded MEDUSA ( 31 ) from the comparison because it was shown to perform poorly in previous benchmarks (17)(18)(19).We evaluated Pasa and competing methods on a simulated dataset constructed from the genomes across three different bacterial species, namely Klebsiella pneumoniae , Esc heric hia coli and Streptococcus pneumoniae ; these species were chosen to represent differing levels of genomic diversity: conservative, moderate, and divergent, respectively.For each species, we randomly chose 10 complete genomes from the NCBI Reference Sequence Database as the test genomes.The accessions of the genomes are listed in Supplementary Table S1 .We also ensured that the test isolates were not included in the set of samples in the reference genomes.The customized Jupyter notebook for downloading and preparing data is included in https://github.com/amromics/pasa (in the 'Reproducibility' directory).
For each test genome, we simulated Illumina sequencing reads using ART (v2.5.8) ( 32 ) with the following configuration: paired-end sequencing, read length of 100 bp, fold coverage of 70 and mean fragment size of 400 bp.We ran SPAdes assembler (v3.13.0) ( 28 ) on the simulated sequencing data to construct the draft assembly for the genome.We then applied the competing methods to scaffold the draft assembly using the information from the reference genomes of the species.
We evaluated the scaffolders using commonly used metrics including NGA50 statistics, the aligned length, the number of resulting contigs and the number of misassemblies.The NG50 metric is the length (in kbp) for which the collection of all contigs of that length or longer covers at least 50% of the genome.We also used the NGA50 metric, which is similar to NG50, but uses aligned blocks instead of contigs for the calculation.The total aligned length is the total number of aligned bases in the scaffolds.This value is usually smaller than a value of total length because some contigs may be unaligned or partially unaligned to the reference.In principle, NGA50, the aligned length, and the contig number are metrics to access the contiguity of scaffolds whereas the number of misassemblies measures the accuracy of the scaffolding.These metrics were calculated using QUAST (v5.0.2) ( 33 ).
Table 1 and Figure 2 report the average performance of the scaffolding methods on the test genomes for each species.We found that Pasa (s) in sensitive mode achieved the most complete assemblies, as measured by NGA50 and the aligned length metrics.Pasa and Multi-CSAR showed similar performance in terms of contiguity of scaffolds, which both outperformed Ragout (Table 1 ).The NGA50 metrics of Ragout were consistently low across all test genomes.Ragout's usage of synteny blocks to bridge the contigs may be one of the causes.Because these synteny blocks are frequently short, only a small number of contigs are linked in Ragout.
Multi-CSAR is comparable with both versions of Pasa in terms of contiguity metrics.It however produced the largest number of misassemblies, generally 4-6 times higher than Pasa and Ragout (Figure 2 ).The misassembly rates of Pasa and Ragout were very low since both methods incorporate the connectivity information in the assembly graph into the pipeline.Specifically, Ragout made fewer errors than Pasa in the K. pneumoniae dataset but was more erroneous than both versions of Pasa in the other bacterial species.Figure 2 shows that relocations are the most common source of misassemblies, which is expected given that all techniques employed reference genomes to bridge the contigs, and those reference genomes may have different gene orders than the target genome, resulting in the incorrect order of contigs in the scaffolds.Both Ragout and Pasa made few translocations and inversions misassemblies, with a low number of errors (typically < 3).
As expected, Pasa in its conservative mode is less likely to produce a complete assembly than its sensitive mode but it carries a low risk of misassembly and is appropriate for contexts where assembly accuracy is important.In sensitive mode, the refinement is less restricted.This mode is most likely to complete the assembly but carries a slightly greater risk of error.For each bacterial species, the NGA50, aligned length, and # contig are averaged over the 10 genomes assembled from synthetic sequencing data.All reported numbers are rounded to the nearest integer.The length of the best corrected scaffolds is highlighted in bold.It is suitable for cases where completeness is more important than accuracy.In summary, Pasa and Ragout performed similarly in terms of misassemblies, whereas Pasa outperformed Ragout in terms of contiguity (up to 6-fold in K. pneumoniae isolates).Despite the fact that Multi-CSAR produced fewer contigs than Pasa and performed similarly to our method in terms of contiguity, the misassembly rates were substantially greater, highlighting the challenges of resolving repeats using the reference alone.We further addressed the dependence of Pasa's completeness and accuracy on the order of the reference genomes from which Pasa computes the pangenome graph.We computed the NGA50 metrics and number of misassemblies for five E. coli isolates.Overall, Pasa's performance was robust to the order of the reference genomes used in constructing the pangenome graph ( Supplementary Figure S1 ).

Evaluation on real
Next we compared Pasa to the competing methods on real data where the genomes were sequenced Illumina technology.We obtained sequencing data for three isolates belonging to three species: Staphylococcus aureus , Klebsiella pneumoniae and Salmonella enterica .For each isolate, we ran the competing pipelines on the short-read sequencing and benchmarked the results against the complete genome.The complete genome and the sequencing data for the S. aureus isolate were obtained from the Genome Assembly Gold-Standard Evaluation (GAGE) benchmark study ( 34 ).The S. aureus dataset in the GAGE benchmark was previously sequenced and finished using conventional Sanger technology, and later resequenced Illumina technology.For the other two species, we selected the complete genomes of two isolates from the RefSeq database with Accession number GCF_003030145.1 and GCF_000439415.1, respectively.The sequencing data of these two isolates was obtained from the SRA archive database (Run accessions: SRR9042857, SRR9043663).
We used NGA50 metrics and the number of misassemblies as the representative metrics for contiguity and accuracy.Figure 3 summarizes the performance of the scaffolding tools on real datasets.Specifically, for the S. aureus dataset from the gold-standard database, we found that Pasa made fewer incorrect joins than all other scaffolders while achieving the best scaffolding results in NGA50 metrics (Figure 3 , left).Note that this dataset had a low NGA50 metric due to the low sequencing coverage (45-fold) and the early, less mature sequencing technology (Illumina Genome Analyzer II).For the K. pneumoniae dataset, both Pasa and Pasa (s) were significantly more accurate than Multi-CSAR; they produced genome assemblies with 10-fold and 2-fold fewer misassemblies, respectively while still exhibiting only slightly lower NGA50 (Figure 3 , center).It can be seen that the NGA50 and the number of misassemblies of Pasa and Ragout consistent with the report of K. pneumoniae in the simulation study.This indicates that our simulation strategy captures real data well.For the S. enterica dataset, Pasa (s) outperformed the methods in terms of NGA50 metrics.The NGA50 metrics of Pasa (s) were around 2800 kbp, which were approximately four times higher than Multi-CSAR ( ≈700 kbp) and seven times higher than Ragout ( ≈400 kbp).In addition, the numbers of misassemblies of Pasa and Pasa (s) were the same, and were slightly higher than Ragout but much lower than Multi-CSAR (Figure 3 , right).In summary, the results of the real datasets were consistent with those from the simulated datasets.Multi-CSAR generally produced the largest number of misassemblies and Ragout had the worst performance in the NGA50 metrics.In contrast, Pasa generally produced more complete assemblies than the competing methods while maintaining a low error rate.

Scaffolding using the pangenome of a related species
When reference pangenome genomes are unavailable, a closely related species can be used as a substitute.Scaffolding using a related reference pangenome is crucial in practice since reference genomes for some species, particularly uncommon and novel bacterial species, are not always readily available.To show our scaffolder's ability to use related reference pangenomes, we applied Pasa to scaffold the genome assembly of a K. quasipneumoniae isolate using the pangenome constructed from a population of K. pneumoniae species as the reference.We compared the results with Multi-CSAR and Ragout.The results are shown in Figure 4 .Despite using a reference pangenome of a different species, Pasa obtained reasonable results in terms of contiguity and accuracy.In particular, the NGA50 of Pasa (sensitive) was two times higher than that of Multi-CSAR and 6-7 times higher than that of Ragout.Ragout had the lowest number of misassemblies, followed by Pasa and Pasa (sensitive).When reference pangenome of the same species was used, the NGA50 of Pasa and Multi-CSAR increased significantly (Figure 4 , right).Accordingly the NGA50 of Pasa and its sensitive version were still better than that of Multi-CSAR and exceeded that of Ragout.Pasa had slightly more misassemblies than Ragout but less than Multi-CSAR.Overall, this shows the practical usage of our scaffolder in the scaffolding the genomes of novel species.

Time and memory usage
We demonstrate the scalability of Pasa to large data sets.Table 2 gives the average CPU times (in minutes) and peak memory usage (in GB) of the scaffolders on the synthetic data sets from the three bacterial species used in this study.Since the running time and memory usage of Pasa and its sensitive mode were similar, we only reported the performance of Pasa.It can be shown that Pasa completed a genome in around 2 minutes, which was 8 times faster than the second fastest method, Multi-CSAR.Note that the running time of Pasa did not include the running times of the pangenome construction steps because the reference pangraph only needs to be built once and can then be applied to datasets of the same bacterial species.Nevertheless, the total running time of Pasa and the pangenome construction steps (Pasa+Pangraph) was around 4 min, which was approximately four times faster than Multi-CSAR.Ragout was the slowest, taking around 2 h to finish a genome.Table 2 (right) shows peak memory usage for the same simulated data sets.Multi-CSAR required the least amount of memory ( < 0.2 GB), followed by Pasa (0.6-0.8 GB) and Pasa with the pangenome construction steps (around 1.3 GB).Ragout used substantially more memory (7-9 GB) than the other methods.

Discussion
Obtaining the complete genome assemblies is important in microbial genomics, especially in the context of antimicrobial resistance research and surveillance.Complete genomes provide crucial information of gene positions in the chromosomes and plasmids to elucidate the development and transmission of drug resistance.Pasa offers an efficient way to readily improve the contiguity and completeness of the large number of bacterial genomes that are sequenced by the affordable and high-throughput Illumina sequencing technology.Instead of relying on extra laboratory experiments that are time  Average running times (in minutes) and memory usage (in GB) of Pasa, Multi-CSAR, and Ragout are reported on the synthetic data sets used in this study.*Running times exclude the construction of the pangenome graph step as it is performed only once and then reused for subsequent genomes.The last row (Pasa+Pan.)reports the total running times and peak memory usage of Pasa and the pangenome graph construction step.consuming and costly, Pasa utilizes the connectivity information of the existing population genomes to resolve the assembly graph of the short-read assembly.Pasa is also fast and efficient, making it suitable to scaffold the large number of shortread assemblies available.
The use of population information for the scaffolding problem has been exploited in several reference-based scaffolder methods such as Multi-CSAR and Ragout.When compared with these two methods, the benefits of using Pasa are twofold: first, it produces a more complete genome with fewer errors; second, it has the lowest running time.The key difference between Pasa and scaffolders such as Multi-CSAR and Ragout is that Pasa employs a compact graphical representation of the pangenome from the population genomes.In addition, Pasa solves the constrained maximum matching problem, which allows modeling longer range dependencies between contigs in the target genome.The method may further be improved by developing an exact algorithm for the constrained maximum matching problem.In the future, it will be interesting to investigate the performance quality of our tool Pasa using different assemblers.
Prokaryotic genomes are known for enormous intraspecific variability owing to many variation events such as horizontal gene transfers, differential gene losses and gene duplication ( 35 ).Pangenome analysis was introduced as a methodology to capture the diversity of bacterial genomes ( 36 ) and has been a dispensable tool in microbial genomics studies ( 37 ) to generate biological insights such as understanding the evolution of bacterial species ( 38 ,39 ), variant detection ( 40 ) and studying antimicrobial resistance ( 41 ).In this work, we extend the utility of the pangenome by introducing a novel application to exploring the species information to scaffold and improve the quality of new genome assemblies.
The construction of the pan-genome graph allowed us to learn the gene order information in a population.The conservation of gene order can thus be used to order contigs along a chromosome by inferring their positions based on the location within the pangenome graph of the genes found in the contigs.The target genome however diverges from the population, we therefore also explored the sample-specific information by means of the assembly graph constructed from the target genome.Exploiting the information from both sample-specific information and population resulted in a highly efficient scaffolding method, Pasa.We demonstrated that Pasa could improve the contiguity and accuracy of the genome scaffolding.We observed consistent results on both real and simulated datasets for various bacterial species.Furthermore, we demonstrated that Pasa does not require the reference pangenome to be the same species as the target genome by utilizing a related species as the reference.This is especially useful for studying isolates from rare and novel species.It is also interesting to investigate Pasa ability to resolve a fused assembly graph to extend its application to scaffold a mix of different strains of the same species.
The vast majority of existing microbial genome assemblies are produced by short-read sequencing technology; many more genomes are being sequenced in research laboratories, medical organizations and public health agencies.Pasa is developed to meet the need for efficient bioinformatics methods to make sense of the data.It was demonstrated to be able to improve the contiguity and accuracy of the genome scaffolding by using the information from both the pangenome graph and the contig graph.We observed consistent results on both real and simulated datasets for various bacterial strains.Pasa offers an accurate and efficient method to scaffold and improve the completeness and continuity of these genomes, without the need for extra laboratory experiments.

Figure 1 .
Figure 1.Ov ervie w of the Pasa algorithm.Genes are represented b y arro ws and colored b y the orthologous groups.( A ) Pasa takes the pangenome built from a collection of genomes from a species as input.It then constructs the pangenome graph, where nodes represent gene families and edges represent genomic neighborhood.( B ) The contig graph represents the possible connections between the contigs in the target genome.Pasa employs information of the gene order of conserved regions in the pangenome graph to resolve the multiple connections in the contig graph.

Figure 2 .
Figure 2. Average number of misassemblies (relocations, translocations, inversions) of Pasa and competing methods on ten datasets across three different strains.A relocation is a misassembly e v ent (breakpoint) where the left flanking sequence and the right flanking sequence align o v erlap or are a w a y from each other on the same chromosome of the reference genome.A translocation is a misassembly e v ent (breakpoint) where the flanking sequences align on different chromosomes.In v ersion corresponds to a breakpoint where the flanking sequences align on opposite strands of the same chromosome.

PAGE 7 OF 10 Figure 3 .
Figure 3. Performance of scaffolders on real datasets.Scatter plots show the NGA50 metrics and the number of misassemblies made by Pasa and competing methods on S. aureus (left), K. pneumoniae (center) and S. enterica (right).

Figure 4 .
Figure 4. Performance of scaffolders using the related and exact reference.Scatter plots shows the NGA50 metrics and number of misassemblies made by Pasa and competing methods on K. pneumoniae (close) and K. quasipneumoniae (exact).

Table 1 .
Average performance of the evaluated scaffolders on the simulated datasets

Table 2 .
Comparison of running times and peak memory usage on 30 synthetic data sets