OMMA enables population-scale analysis of complex genomic features and phylogenomic relationships from nanochannel-based optical maps

Abstract Background Optical mapping is an emerging technology that complements sequencing-based methods in genome analysis. It is widely used in improving genome assemblies and detecting structural variations by providing information over much longer (up to 1 Mb) reads. Current standards in optical mapping analysis involve assembling optical maps into contigs and aligning them to a reference, which is limited to pairwise comparison and becomes bias-prone when analyzing multiple samples. Findings We present a new method, OMMA, that extends optical mapping to the study of complex genomic features by simultaneously interrogating optical maps across many samples in a reference-independent manner. OMMA captures and characterizes complex genomic features, e.g., multiple haplotypes, copy number variations, and subtelomeric structures when applied to 154 human samples across the 26 populations sequenced in the 1000 Genomes Project. For small genomes such as pathogenic bacteria, OMMA accurately reconstructs the phylogenomic relationships and identifies functional elements across 21 Acinetobacter baumannii strains. Conclusions With the increasing data throughput of optical mapping system, the use of this technology in comparative genome analysis across many samples will become feasible. OMMA is a timely solution that can address such computational need. The OMMA software is available at https://github.com/TF-Chan-Lab/OMTools.

Keywords: optical mapping; comparative genomics; structural variation; copy number variation; haplotypes; single-molecule analysis Findings Background Optical mapping captures the labeling patterns of long DNA molecules and has become a complementary approach to sequencing-based methods [1]. DNA labels are usually created by a short but specific nicking restriction enzyme (nickase), but alternative labeling strategies, such as methylations [2,3] and sequence-specific labeling based on CRISPR technology [4] have also been described. With optical map ranges from 100 kb to as high as 1 Mb, optical mapping is well suited to assist with sequence scaffolding in genome assembly and in the detection of large structural variations.
Multiple alignment (MA) is a process in which multiple queries are aligned without relying upon a reference. This type of comparison is especially useful when a standard reference is not available or is of poor quality. Also, in certain variable regions, the alignment of multiple queries to a reference could only show a 1-to-1 difference between the individual query and the reference. In contrast, MA of these queries helps to differentiate groups of patterns among the queries. Unlike pairwise alignment, for which many algorithms have been designed [5][6][7][8][9][10], MA remains an underdeveloped method for optical mapping. The other available tool is included in the proprietary software Bionumerics v7 [11], but this method is not sensitive to genomic rearrangement and requires complete genomes as inputs.
We developed a novel MA algorithm for population-scale analysis of optical mapping data: optical mapping by multiple alignment (OMMA). The OMMA program was designed for comparing assembled optical maps, which are usually longer than raw optical maps, and with the intrinsic errors that have already been addressed during the de novo assembly steps. We describe herein the algorithm and demonstrate its effectiveness with both simulated and experimental data. We also demonstrate how OMMA can be used to resolve complex genomic features and reconstruct phylogenomic relationships.

OMMA pipeline
The complete OMMA pipeline has 2 steps-the preparation step and the MA step (Fig. 1). During the preparation step, optical maps ( Fig. 1A) are aligned in a pairwise manner (Fig. 1B). Two segments from 2 different optical maps are said to match if their left and right labels are both aligned. In the MA step, information about the matching segments is used to produce chains of MA-blocks as the MA results; an MA-block is defined as a collection of segments, each from a different optical map, that all match each other. This step can be further divided into 3 main substeps (modules) (Fig. 1C-E). In the first substep, MA-blocks are formed on the basis of the results of the preparation step (Fig. 1C). In the second substep, the MA-blocks are sorted to maximize matching and to minimize rearrangement events (Fig. 1D). Finally, in the third substep, proximate MA-blocks that are similar to each other are merged (Fig. 1E). The details of the pipeline are described in the Methods.

Performance analysis
In this section we describe the accuracy of MA and phylogenetic tree reconstruction. We evaluated performance by comparing the results to multiple sequence alignments with complete DNA sequences supplied as input. Genomic sequences of Acinetobacter baumannii strains were selected for in silico digestion to produce simulated genomes that mimic the assemblies of optical maps. For analysis of accuracy, the MA of these simulated genomes based on optical mapping patterns was compared with their respective multiple sequence alignment as the gold standard. We used 2 measures-precision and recall (see Methods for definitions)-to evaluate the performance. Our MA method for error-free genomes shows highly accurate results, with precision of 93.1% and recall of 93.7%. The precision and recall remain high upon introduction of missing sites and extra sites in various error rates. For disruptive errors including segmental insertion and segmental deletion, the precision and recall remains high until the error rate reaches a level so high that would not be seen in real data (Fig. S1).
Next, we evaluated the phylogenetic tree reconstruction method based on the MA results of OMMA. We assessed the accuracy of the reconstruction method using 100 sets of 32 simulated genomes generated by the introduction of accumulated mutations according to a virtual phylogeny of these genomes (see Methods for details). Based on the locations of the nicking sites, a simulated optical map was generated for each of these 32 genomes, and OMMA was used to form an MA. The phylogenetic tree of the 32 simulated genomes was then reconstructed using the unweighted pair-group method with an arithmetic mean approach based on the similarities among the genomes suggested by OMMA results (Fig. S2; see Methods for details). The overall accuracy of phylogenetic tree reconstruction was 96.5% or 99.4% with or without errors introduced, respectively, which indicates that the reconstruction method was accurate.

Computational resources analysis
We separated the pipeline into pairwise alignment and MA for computational resources analysis in A. baumannii, Escherichia coli, and Saccharomyces cerevisiae. Pairwise alignment takes longer and more memory to run, while the MA requires fewer computational resources (Fig. S3-5). As the number of genomes increases, the number of pairwise alignments increases exponentially, and hence the number of segment links also increases exponentially. Therefore, CPU time and memory usage grows exponentially in both pairwise alignment and the MA process.

Regions with multiple haplotypes at the population level
Pairwise alignment only provides evidence for the presence or absence of variations from a reference, without considering the different labeling patterns contained in the queries. MA, however, provides a summary of all haplotypes present in the region. We illustrated the improved clarity of MA using OMMA at chromosome 1p44, which contains the olfactory receptor genes ( Fig. 2A), compared with the traditional reference-based align- ment view in IrysView (Fig. 2B). These genes are known to contain many small deletions and duplications [13]. The haplotype differences in this region are described separately in 3 subregions. The first subregion overlapped the gene olfactory receptor family 2 subfamily T member 1 (OR2T1). There are various haplotypes, including 3 major haplotypes that we denote as Types 1, 2, and 3, of which the Type 1 haplotype was the most abundant (70.8%). Interestingly, the African contigs contained no Type 2 or Type 3 haplotypes (Fig. 2C). The second subregion contained the gene olfactory receptor family 2 subfamily G member 6 (OR2G6). Although hg38 was composed of a segment pattern to which we refer as medium-medium-medium ( Fig. 2A; Subregion 2) for the middle 3 MA-blocks at this region, the contigs from various human populations actually reflected 3 segment patterns: long-medium, medium-long, and medium-short-medium. The Type 1 haplotype (59.0%) was more dominant than the Type 2 (25.0%) and Type 3 (16.0%) haplotypes (Fig. 2D). The third subregion spanned 4 genes (OR2T34, OR2T10, OR2T11, and OR2T35). The Type 1 and Type 3 haplotypes differed by only a single label. Despite this minor difference, only the contigs from the African and American populations had the Type 3 haplotype. The Type 2 haplotype was a deletion from the other types. The African contigs had a greater abundance (40.7%) of the Type 2 haplotype than all other populations (15.7%) (Fig. 2E). We also observed relationships in the haplotype distribution across 2 subregions. For example, Type 2 haplotypes in Subregion 3 are more likely to be followed by the Type 1 haplotype (30.5%) than by the Type 2 (3.1%) and Type 3 (9.5%) haplotypes in Subregion 2.

Copy number variations
OMMA not only allowed direct visualization of the presence of CNVs; it also enabled the deduction of the exact copy number of the query. In the second complex case, we illustrated the characterization of CNV based on MA of contigs spanning the gene ANKRD30A, which contained tandem repeats including a very large repeat unit of size 11.1 kb [14]. The MA of contigs containing the CNV easily revealed the variable length of the contigs within this region, in comparison to adjacent conserved regions that had mostly the same length in the various contigs (Fig. 3A).
By choosing 2 flanking MA-blocks that corresponded to the boundary of the CNV region, the number of segments between them was used to deduce the number of copies in each contig. As an example, a contig from the sample NA19795 from the American population contains 16 alternating long and short segments. Because each repeat unit contains 2 segments (1 long and 1 short), we deduced that this contig has 8 copies. In comparison, the reference genome hg38 contains 8 segments, or 4 copies. The number of tandem repeat copies in other contigs ranged from 2 to 10. Based on the copy numbers in different contigs of each population, we investigated the correlation between copy number and ethnicity and found that European contigs had fewer copies than other populations (with marginal statistical significance; Tukey test, P = 0.06) (Fig. 3B).

Reconstruction of patterns in genomic regions not found in the reference
In the third case, we used MA to characterize the haplotype differences in the subtelomeric region of chromosome 20p, the sequence of which does not exist in the reference human genome. This region has been explored using optical maps and has been shown to display a pattern that is not found in the reference genome [15]. In our analysis, the extension of MA of the contigs beyond the sequence-containing portion of the reference chro-mosome 20p allowed us to discover a large indel (Type 1: without inserted pattern; Type 2: with inserted pattern) as a major haplotype difference among the contigs (Fig. 4A). The reliability of this extension is confirmed by alignment of molecules (Fig.  S9). Notably, the African contigs contained only the Type 1 haplotype, whereas the contigs from other populations contained both haplotypes (Fig. 4B).

OMMA revealed conservation of genomic structures and predicted colicin and bacteriophage integration
We applied OMMA to optical maps generated from 21 drugresistant A. baumannii genomes of various strains (Fig. 5). Briefly, the optical mapping data were generated and assembled into a single consensus optical mapping assembly for each of the 21 samples. OMMA combined the 8,315 segments from all strains into 823 MA-blocks.
The different regions in the MA could be roughly divided into 3 categories: (1) conserved among most strains, (2) conserved among close strains only, and (3) highly variable and not conserved across strains (Fig. 5). Although the segments shared by most strains likely represented the evolutionarily conserved regions, the segments shared only among the close strains could be the key to separation among the clusters of strains. The highly variable segments, in contrast, could be hot spots for the integration of genomic islands.
The categories could be partially supported by different conservation levels of the MA-blocks. Most MA-blocks were conserved, and 23.8% of MA-blocks contained segments from all genomes, which accounted for 49.5% of all segments (Fig. S12). These mainly constitute the region in category 1. Notably, some MA-blocks were evolutionarily conserved only within a certain set of genomes, as shown by the major peaks of MA-blocks with 8 (12.1%) and 13 (12.8%) segments, that constitute the region in category 2. This local conservation is likely the result of our experimental strains that were distributed into 2 main groups with 8 and 13 strains, respectively. (See the phylogenetic tree analysis section below.) The remaining MA-blocks that are not conserved constitute the region in category 3.
Segments that were not completely conserved that involved size changes were of greater interest than those with only label additions or deletions because the former corresponded to the indel of a large piece of DNA, which would be more likely associated with the gain or loss of an entire set of functional elements, whereas the latter would be associated with the introduction or disruption of a single functional element due to single-nucleotide variations or small indels. We deduced the identity of these functional elements by combining sequencing and optical mapping data (Fig. 5). For example, a potential insertion was likely the integration of a 9,286-bp sequence, including the Colicin V secretion gene cvaA. Another example was the integration of a 40,243-bp sequence, including several bacteriophage genes.

OMMA-based phylogenetic tree reconstruction revealed an evolutionary relationship among strains
Next, we applied our phylogenetic tree reconstruction method to reconstruct the phylogenetic tree of the 21 strains of A. baumannii based on their OMMA alignment (Fig. 5). The reconstructed tree was separated into 2 major clusters of 13 and 8 strains. The larger cluster was further separated into 2 subclusters of 9 and 4 strains, in line with their respective multilocus sequence typing (MLST) information. Specifically, the tree from optical mapping  . In fact, the 4 STs in the first cluster were highly similar and differed only in the single-nucleotide polymorphisms of a single gene. One advantage of our method is that it separates strains on the basis of whole-genome structures, unlike the traditional MLST method, which relies on only a selected set of genes. As a result, our method provides substantially greater detail about the evolutionary relationships among the strains. For example, the strains classified as ST96 by MLST could be further divided into multiple groups based on our results.

Discussion
Our OMMA method combines multiple queries into a single comparison, which assists with a wide range of analyses, including those in comparative genomics and population genomics. Multiple alignment provides a comprehensive view of the comparison of queries. With variations captured across different queries, our method is able to reflect the level of variability or conservation within a certain region and thus locate potential hot spots for genomic variations.
OMMA is also a useful tool for population genomics analyses. Variable haplotypes can be classified quickly, and their rela- tive abundance in various populations can then be directly visualized and analyzed. With the phylogenetic tree reconstructed, the elements responsible for the differentiation into clusters can also be determined. One major advantage of the use of optical mapping over sequencing is its ability to directly visualize and study the genome structure. The assembled optical mapping contigs are usually much longer than sequence assemblies, which usually break into pieces at short repetitive regions. It also becomes more challenging when we need to characterize complex structural changes in larger genomes like the human genome by examining sequence assembly.
The OMMA pipeline offers great flexibility in customizing procedures. The flexibility of this method allows the task of MA of a very large genome to be divided into smaller jobs. The results can then be combined into a single MA during the final step. Our method also supports aligning queries with rearrangements including inversions and intrachromosomal translocations. In addition to the applications demonstrated here, our MA method could be extended for other studies, such as pan-genome construction and genomic island prediction. Our method has limitations, including a low tolerance to very large segmental duplications (e.g., 250 kb) with multiple copies. However, with the recently launched direct label and stain (DLS) chemistry that no longer introduces double-strand DNA breaks at 2 closely located nicks, the improved quality of assemblies should reduce the effect of this limitation on MA.

Conclusions
In this paper, we describe the first rearrangement-tolerant and nonproprietary MA method for optical mapping of data. The method's accuracy was assessed using an A. baumannii dataset with high precision and recall. We demonstrate the application of the MA results to phylogenetic analyses and complex region analyses (e.g., multiple haplotypes, CNVs, and novel genomic region). OMMA could serve as a fundamental tool for the further development of more specific analytic methods for the study of comparative and population genomics.

OMMA overview
The OMMA program was developed to compute segmentmatching information to generate a series of MA-blocks as a collection of matching segments of entries. The entire pipeline is separated into a preparation step and an MA step that can be further divided into 3 main substeps (modules) (Fig. 1). In the first substep, the matching segments from pairwise alignments or other sources are used to construct the blocks. In the second substep, these blocks are sorted to minimize rearrangement events. Finally, in the third substep, proximal segments are merged if their sizes are similar.

Preparation step
At the preparation step before the core modules are run in MA, we must generate some clues about which segments on differ-ent queries should be put together. Usually, this process considers not only the size matching of a pair of target segments alone but also accounts for the size matching of their proximate segments. We define a piece of evidence that helps to determine the matching of segments as a "source." A source is composed of a set of "segment links" that are denoted as a matching pair of segments from different queries. Our method accepts 2 sources that provide clues about the segment links based on similar labeling patterns: pairwise alignment among queries and the MA results.

Pairwise alignment result as sources
All queries are aligned in a pairwise manner (Fig. 1B), followed by derivation of the segment links from the alignments of each pair of queries. The pairwise alignment results were generated by OMBlast [5], which could output partial alignments (local alignments between a pair of queries) that are critical to MA of regions with rearrangement.

Multiple alignment result as a source
The MA result (generated by the OMMA pipeline) can also be used as a source because the results are intuitively a collection of matching segments. The result is particularly useful in largescale MA, such as whole-genome MA in humans, for which 1step MA is not computationally feasible. By dividing the large number of queries into separate subtasks, several MAs on a smaller scale can be achieved, and the results can serve as the sources for a global MA.

Construction of MA-blocks based on segment links
For MA of Q queries, each MA-block is represented by a binary vector of length Q, where a "non-empty" entry or "1" indicates that a query participates in this block, and an "empty" entry or "0" indicates that it does not. A non-empty entry can be occupied by only 1 segment from the query. No more than 1 segment from the same query can be assigned to each MA-block. An undirected graph is constructed with the segments as vertices and the segment links as edges (Fig. 1C). Each connected component is treated as an MA-block "candidate." A "valid" MAblock is defined as a collection of segments with ≤1 segment from the same query. If an MA-block candidate fulfills this criterion, all segments in the candidate are directly converted into an MA-block. Otherwise, each individual segment in the candidate is instead assigned to a separate MA-block.

Segment links from multiple sources
The selection of a proper set of parameters in pairwise alignment is difficult. To solve this problem, segment links from various sources can be supplied in the MA processes. Briefly, segment links from the most confident sources (such as pairwise alignment results with more stringent parameters) are first used to construct MA-blocks. The segment links from less confident sources (such as pairwise alignment results with more lenient parameters) are then used to connect the MA-blocks. Two connected MA-blocks are merged if their binary vector does not overlap (i.e., if they do not have segments from the same queries after merging).

MA-block sorting
The MA-blocks are sorted to better illustrate the overall pattern of the queries. Here, we briefly demonstrate the idea using Fig. 1D before discussing the details of implementation. With different MA-blocks generated from the previous step, it is obvious to place MA-block [A1, B1, C1] before MA-block [A2, B2, C2] (i.e., segment A1 is followed by segment A2 on query A, and the same applied to query B and C) because the segments from all queries in these 2 blocks are consecutive. However, the problem becomes more complicated after rearrangement such as inversion or translocation. For example, we must choose what follows the MA-block [A3, B3, C3]; it could be MA-block [A4, C8] (i.e., because segment A3 is followed by segment A4 on query A), MAblock [B4] (i.e., because segment B3 is followed by segment B4 on query B), or MA-block [A8, B9, C4] (i.e., because segment C3 is followed by segment C4 on query C).
More formally, the MA-blocks are sorted to minimize the number of rearrangement events and maximize the matching events. For B MA-blocks constructed from the previous module, consider a non-empty qth entry in the bth MA-block and a nonempty qth entry in the (b + x)th MA-block, where x is a positive integer and any qth entries are empty in the (b + 1)th to (b + x − 1)th MA-blocks. A "matching event" occurs if the 2 entries represent consecutive segments in the original qth query and have the same orientation. In contrast, a "rearrangement event" occurs when the 2 entries are in different orientations or if they are not consecutive segments ordered in the original qth query. Note that the sum of matching and rearrangement events in the final chain remains constant for the same set of queries. This problem is equivalent to a nondeterministic polynomialtime hard (NP-hard) traveling salesman problem, in which all vertices (MA-blocks) must be traversed exactly once, except that no limitation is set on the start and end points. Because determination of an optimized solution is computationally intensive, the nearest neighbor-joining algorithm was devised to approximate a suboptimal solution.

Nearest neighbor-joining algorithm
In the nearest neighbor-joining algorithm, the MA-blocks are connected into multiple chains. The goal is to repeatedly connect chains until a single chain remains as the solution to the order of the MA-blocks. In the beginning, 1 chain is created for each MA-block, resulting in C chains. The connection candidates from each pair of chains are added into a priority queue. In each round of connection, a pair of chains that results in the highest connection priority is pulled from the queue and connected to a new chain, followed by an update on the connection candidates for the new chain. The connection process is repeated for C − 1 rounds to connect all C chains into 1 single chain.

Connection priority for 2 chains of MA-blocks
The connection priority for 2 chains of MA-blocks is based entirely on the relationship of the entries in the 2 chains. In total, Q relationships are built on the basis of Q sets of entries compared within the same query. Only the last non-empty entry from the former chain and first non-empty entry from the latter chain are compared. We define the relationship of the entries from a particular query as matching and rearrangement if the 2 selected entries are consecutive and nonconsecutive segments, respectively. In addition to matching, if the 2 entries are taken from the last MA-block of the former chain and first MA-block of the latter chain, they are further categorized as direct matching. The relationship is set as empty if no non-empty entry exists in either chain. An example is shown in Fig. S13, in which the relationships of the 4 sets of entries from queries A, B, C, and D represent matching, direct matching, rearrangement, and empty relationships, respectively. The parameters are defined for the count of relationships between the 2 chains as follows: r n m : matching (note that a direct matching relationship is also counted here) r n d : direct matching r n r : rearrangement r n rr : reference rearrangement (if reference is provided by the user) r n e : empty The parameters above always meet the following criteria: r n m + n r + n e = Q r n d ≤ n m r n rr ≤ n r The connection priority for a pair of chains is listed below: 1. At least 1 matching relationship exists (n m ≥ 1). 2. No rearrangement is found (n r = 0). 3. If no rearrangement is found, n d is maximized, followed by maximizing n m . 4. If rearrangement is found, (a) the connection results in no rearrangement in reference (n rr = 0, if reference is provided by the user). (b) n r is minimized, followed by maximizing n m , followed by maximizing n d .

Directed graph representation for multiple alignment
The MA results can be represented as an acyclic directed graph. Consider an acyclic directed graph constructed with MA-blocks as the vertices. A directed edge from MA-block b to b + x is built if 2 non-empty entries exist from the same query q in bth and (b + x)th MA-blocks, where x is a positive integer and all entries from query q are empty in the (b + 1)th to (b + x − 1)th MA-blocks. An edge is built between 2 MA-blocks with a weight equal to the sum of the matching and rearrangement relationships. Its direction follows the order of the MA-blocks. Such a representation simplifies the interpretation of the actual variation and noise, which can be removed by filtering the edges with minimum weight.

Merging
We wish to merge MA-blocks without introducing new rearrangements. The order of the MA-blocks other than the merging target remains unchanged in the merging actions. With such a constraint, the merging step can effectively increase the sensitivity of the final MA result. From the directed graph representation of the sorted MA-blocks, the merging of 2 MA-blocks leads to new rearrangements if the set of descendant vertices of 1 MAblock intersects the set of ancestor vertices of another MA-block.

Merging of MA-blocks by proximity
Consider a directed graph representation of the sorted MAblocks (Fig. 1E), with MA-blocks as vertices. Two MA-blocks that share ≥1 incoming neighbor or ≥1 outgoing neighbor are defined as proximate. Two proximate MA-blocks are merged if the average size of the segments in 1 MA-block matches those in the other MA-block. The merging of 2 MA-blocks can be disruptive and lead to rearrangement. To avoid the introduction of new rearrangements, we must check for overlapping of the ancestor and descendant vertices. The ancestor and descendant vertices of an MA-block are obtained by traversing the directed graph, and the ancestor and descendant vertices are updated by dynamic programming. After each merging step, the directed graph and the ancestor and descendant vertices are updated.

Merging of MA-blocks by segment links
Another merging strategy relies on segment links, as described above in the MA-block construction step. This time, the connections are taken as evidence for the potential merging of 2 MA-blocks that include 2 segments that form a connection. Like merging proximate MA-blocks, the checking and update steps described in the previous section are used to prevent the introduction of rearrangements during the merging of 2 connected MA-blocks. Figure S1: The performance of multiple alignment on genomes with various types of errors at different rates Figure S2: Phylogenetic tree reconstructed based on multiple alignment by OMMA of simulated genomes Figure S3: CPU time and memory usage for pairwise alignment and multiple alignment of different number of A. baumannii genomes Figure S4: CPU time and memory usage for pairwise alignment and multiple alignment of different number of E. coli genomes Figure S5: CPU time and memory usage for pairwise alignment and multiple alignment of different number of S. cerevisiae genomes Figure S6: Multiple alignment by OMMA of the contigs with (A) and without (B) hg38 of Figure 2 using the same parameters Figure S7: Multiple alignment by OMMA of the contigs from all populations at the olfactory receptor region (1q44) Figure S8: Multiple alignment of contigs by OMMA from all populations at gene ANKRD30A Figure S9: Alignment of optical maps from individual GM19921 on chromosome 20 using IrysView