Chromosomer: a reference-based genome arrangement tool for producing draft chromosome sequences

Background As the number of sequenced genomes rapidly increases, chromosome assembly is becoming an even more crucial step of any genome study. Since de novo chromosome assemblies are confounded by repeat-mediated artifacts, reference-assisted assemblies that use comparative inference have become widely used, prompting the development of several reference-assisted assembly programs for prokaryotic and eukaryotic genomes. Findings We developed Chromosomer – a reference-based genome arrangement tool, which rapidly builds chromosomes from genome contigs or scaffolds using their alignments to a reference genome of a closely related species. Chromosomer does not require mate-pair libraries and it offers a number of auxiliary tools that implement common operations accompanying the genome assembly process. Conclusions Despite implementing a straightforward alignment-based approach, Chromosomer is a useful tool for genomic analysis of species without chromosome maps. Putative chromosome assemblies by Chromosomer can be used in comparative genomic analysis, genomic variation assessment, potential linkage group inference and other kinds of analysis involving contig or scaffold mapping to a high-quality assembly. Electronic supplementary material The online version of this article (doi:10.1186/s13742-016-0141-6) contains supplementary material, which is available to authorized users.


Background
Chromosome assembly is an important part of virtually any eukaryotic genome project. The number of assembled genomes increases each year and many of them are anchored to physical chromosome maps [1]. A robust de novo chromosome assembly requires not only matepair reads with different insert sizes, but also physical and genetic maps [2][3][4]. The large number of high quality assembled 'reference genomes' leads to an alternative approach -a reference-assisted chromosome assembly. Using this approach, the benefits of assembled chromosomes can be exploited without additional sequencing or map construction. These benefits include a known number of linkage groups and an estimated distance between markers, which is important for inferences of linkage and synteny. An assisted assembly also connects and orders large numbers of small contigs or scaffolds based on comparative analysis. In many cases, the initial number of contigs and scaffolds can exceed several hundred thousand following de novo assembly; working with such a fragmented genome can prove challenging [5]. Arranging contigs and scaffolds into putative chromosomes using information from the reference genome of a closely related species reduces the overall number of fragments from thousands to hundreds or dozens and also simplifies the annotation and analysis of different genomic features such as repeats, genes, single-nucleotide polymorphisms, copy number variations and segmental duplications.
A disadvantage of this approach is the introduction of occasional assembly errors driven by evolutionary chromosomal rearrangements. Even a closely related reference can differ in synteny from the target genome to some degree. The number of introduced assembly artifacts generally correlates with the evolutionary distance between the target and reference genomes [6] although rates of chromosome rearrangements are hardly clocklike, at least for mammals [7,8]. These assembly artifacts are easily corrected if a physical map for the target genome is developed, using a tool such as the single molecule nextgeneration mapping system (Irys) developed by BioNano Genomics [9].
SyMap was designed to facilitate reference-assisted chromosome assembly for eukaryotic genomes; however, it has important limitations. SyMap uses MUMmer [20] or NUCmer [21] for the alignment phase, requires a separate structured query language (SQL) database to work efficiently and takes a very long time to align large genomes to each other.
The most promising approach for reference-assisted assembly is based on using several reference genomes instead of a single one. RACA implements such an approach, using alignments of target, reference and outgroup genomes as inputs to generate predicted chromosome fragments (PCFs) [19]. However, RACA also requires additional evidence from mate-pair libraries for joining genome fragments, while most de novo sequenced genomes have no such libraries available. Furthermore, RACA requires extensive computations for assembling chromosomes.
In this paper we introduce Chromosomer -an opensource cross-platform software that automates the reference-assisted building of genomic chromosomes and is especially effective for large genomes (> 1 giga base pairs). Chromosomer constructs draft chromosomes based only on alignments between fragments (contigs or scaffolds) to be arranged and a reference genome, thereby improving analytical and annotation opportunities for the index species assembly. Although Chromosomer does not use any sophisticated models or algorithms for chromosome assembly, we show that its results are comparable with state-of-the-art assemblies and can be used for further genomic analysis.

Algorithm
To map fragments to a reference genome, Chromosomer uses results of pairwise alignments between the fragments (contigs and scaffolds) and the chromosomes of the reference genome. The alignments are required to have associated score values that reflect the length and identity of the aligned regions (for example, the BLAST bit score [22]). In addition, the start and end positions of aligned regions in both the fragments and the reference chromosomes are required.
Chromosomer analyzes alignment positions and scores to map fragments to a reference. The mapping process takes the following steps (see Fig. 1).

From pairwise alignments, determine fragments that
can be anchored to a reference according to the ratio of their first and second greatest alignment scores. If the ratio is greater than the predefined threshold, which is the algorithm parameter, then the fragment is anchored to a position corresponding to its  alignment with the greatest score. Otherwise, the fragment is considered unplaced if these two alignments are located on different reference chromosomes or unlocalized if both alignments are located on the same chromosome. 2. Using fragment anchors, map the fragments to the reference chromosomes (see Fig. 2a and b). Unlocalized and unplaced fragments are excluded from the assembly. 3. Resolve overlaps between mapped fragments by inserting gaps between them (see Fig. 3a and b). 4. Produce a map describing fragment positions at a reference genome and output assembled chromosome sequences and lists of unlocalized and unplaced fragments.
Besides reference-assisted chromosome assembly, Chromosomer also offers the following options: • transfer annotations from fragments to assembled chromosomes using a fragment map; • visualize a reference-assisted chromosome assembly as a genome browser track containing fragment positions; • obtain statistics on a reference-assisted chromosome assembly.
We further describe several aspects of the Chromosomer workflow: mapping fragments to reference chromosomes, transferring annotations from fragments to the assembled chromosomes and defining parameters that tune the Chromosomer assembly process. We consider all sequence coordinates to be zero-based and half-opened (that is, the first nucleotide is considered as position 0 and the last nucleotide position is equal to the sequence length).

Mapping fragments to reference genome
Assume we have a fragment of length L base pairs (bp) and an anchor between it and a reference chromosome that is formed by the alignment of the [ S A , E A ) region in the fragment and the [ S A , E A ) region in the reference. The S A and E A terms denote start and end coordinates of the alignment in the fragment and the S A and E A terms denote start and end coordinates of the alignment in the reference genome. We derive fragment coordinates S F and E F in the reference genome for two cases: a direct fragment orientation that is the same as in the reference (Fig. 2a) and an orientation that is reversed relative to the reference (Fig. 2b). Equations for S F and E F in the direct orientation case are: Equations for S F and E F in the reversed orientation case are:

Transferring annotations to assembled chromosomes
Next, assume we have a fragment of length L bp that is mapped to position [S F , E F ) in an assembled chromosome. We are interested in where the region [ S R , E R ) of the original fragment will be placed. Let S R and E R be the start and end positions of the region in the chromosome; then the following equations hold if the fragment is mapped in direct orientation (Fig. 4a):  If the region is mapped in the reverse orientation, then S R and E R satisfy the following equations (Fig. 4b):

Assembly parameters
Chromosomer introduces two parameters that influence the assembly process. The first parameter is the alignment score ratio threshold, which is used to distinguish anchored and unplaced fragments. If the score ratio of the two fragment alignments with the highest scores exceeds the threshold, then the fragment is considered anchored, otherwise it is considered unplaced and is excluded from further analysis. The alignment score ratio threshold must be a positive number greater than one.
The second parameter is the insertion size -the size of a gap which is inserted between overlapping regions (see Fig. 3b). The insertion size is recommended to be equal to or greater than the sequencing library size.

Chromosomer assembly evaluation
To evaluate the performance of Chromosomer, we assembled the following bacterial, yeast and mammalian genomes.
We also assembled the bacterial and yeast genomes using ABACAS and compared ABACAS-derived assemblies with Chromosomer-derived ones. Although ABACAS is not designed for assembling multichromosome genomes, we used separate ABACAS runs  for each chromosome from the reference genome. The Chromosomer assembly of Tibetan antelope was compared with the RACA assembly presented in [19]. The Chromosomer-derived chimpanzee chromosomes were assessed by comparison with the GenBank assembly and by checking the coding region accuracy. LASTZ [23] was used to perform whole-genome alignments for assessing chromosomes obtained with Chromosomer.

Escherichia coli assembly
The E. coli Sakai strain genome was assembled in two steps. First, we assembled its reads (SRA accession numbers SRR530851 and SRR587217) to scaffolds using the SPAdes assembler [24] (Additional file 1). Next, we applied Chromosomer and ABACAS to assemble the scaffolds using the E. coli K-12 strain genome assembly (RefSeq accession number NC_000913.3) as a reference (Additional files 2 and 3). Finally, we compared the derived assembly with the RefSeq assembly of the E. coli Sakai strain (RefSeq accession number NC_002695.1). The dot plots of LASTZ whole-genome alignments between the derived assemblies and the RefSeq assembly are given in Fig. 5a and b. The comparison of the assemblies is given in Table 1.

Saccharomyces cerevisiae assembly
The S. cerevisiae CLIB324 strain genome was assembled from its scaffolds (GenBank accession number GCA_000192495.1) using S. cerevisiae S288c strain genome as a reference (RefSeq accession number GCF_000146045.2). The chromosome sequences assembled by ABACAS and Chromosomer are given in Additional files 4 and 5, respectively. Dot plots comparing the LASTZ alignments of chromosome 1 between the reference genome and those from the ABACAS or Chromosomer assemblies are shown in Fig. 6a and b. The comparison of the assemblies is given in Table 2.

Pantholops hodgsonii assembly
The P. hodgsonii genome was assembled from its scaffolds (GenBank accession number GCA_000400835.1) using the B. taurus UMD3.1 assembly as a reference  Coverage (in bp) 7,920,800 9,595,323 The table shows statistics derived from LASTZ alignments of ABACAS and Chromosomer assemblies compared with the S. cerevisiae S288c strain assembly that was used as a reference and the net alignments between the scaffolds and the cow chromosomes from [19]. The fragment map of the Chromosomer-derived Tibetan antelope chromosomes is given in Additional file 6. We compared the Tibetan antelope chromosomes obtained by Chromosomer and the PCFs produced by RACA in Fig. 7. The comparison shows that both sets of chromosomes are similar to each other; however, RACAderived PCFs are longer than Chromosomer-derived ones and the reference cow chromosomes. This result may be due to the difference in the Chromosomer and RACA algorithms: while RACA tends to gather as many genome fragments as possible to a larger fragment, Chromosomer determines scaffolds that have sufficient evidence for being placed on a chromosome; otherwise, Chromosomer considers a scaffold unlocalized or unplaced and does not include it in a chromosome. Thus, Chromosomer preserves the structure of the reference chromosomes, see Fig. 8.
Although the assemblies are fairly similar, two main differences can be distinguished: 1. The PCFs assembled by RACA tend to be longer than the original reference genome (cow) chromosomes. 2. RACA predicted two chromosomal translocations in the Tibetan antelope genome compared with the cow genome: the first one between chromosomes 7 and 10 and the second one between chromosomes 21 and 27. The predicted translocations led to elongation of chromosome 7 and shortening of chromosome 10; chromosomes 21 and 27 are also related in the same way but to a lesser extent (see Fig. 7). The ability to detect cross-species rearrangements is a feature of RACA that is related to its more complex assembly model and integration of paired-end reads, which Chromosomer does not use.
In addition, Chromosomer demonstrated better time performance and required fewer computational resources than RACA. It took about 1.7 hours and 1.5 GB of random access memory (RAM) for Chromosomer to produce the chromosomes from the net alignments using one CPU (central processing unit). RACA spent 55 hours and required 59 GB of RAM using three CPUs to get the result from the same net alignments. We used the SuperMicro server for the benchmark (12 Intel Xeon E5-2690 CPUs and 396 GB RAM).

Pan troglodytes assembly
The P. troglodytes genome was assembled using Chromosomer from its scaffolds (GenBank assembly accession GCA_000001515.4) using the H. sapiens GRCh38.p2 assembly as a reference and the net alignment of the chimpanzee genome to the human genome from the    Only scaffolds consisting of two or more contigs were considered. A contig was considered misplaced if its neighboring contigs were different from the neighboring contigs in the GenBank assembly. For each scaffold, the percentage of its misplaced contigs and the percentage of the total misplaced contig length were calculated; the average values for all scaffolds of the specified contig number are shown UCSC Genome Browser [25,26]. The fragment map constructed by Chromosomer is given in Additional file 7. The dot plot of the alignment of chromosome 1 assembled by Chromosomer from the scaffolds to its GenBank sequence is given in Fig. 9.
We assessed the obtained chimpanzee chromosomes against two criteria: adjacency of contigs and scaffolds and gene integrity. We checked the adjacency for three levels: contigs within scaffolds (Fig. 10), contigs within chromosomes (Fig. 11) and scaffolds within chromosomes (Fig. 12). The figures show that Chromosomer performs best for assembling large genomic fragments such as scaffolds and may misplace short genomic fragments like contigs for the local structure. This conclusion is supported   (Figs. 13 and 14), which shows that Chromosomer might break genes located on multiple contigs.
From the examples shown above, we conclude that Chromosomer is comparable to existing reference-genome assembly tools and is able to assemble and process large genomes. Chromosomer may increase efficiency of genome annotation studies by replacing numerous genome fragments with draft chromosome assemblies. Fig. 13 P. troglodytes genes on contigs misarranged by Chromosomer. Genes located on two or more contigs were considered; there were 10,041 such genes. A gene was considered misarranged if the contigs it was located on were placed in an order that differed from the GenBank assembly Fig. 14 P. troglodytes genes on scaffolds misarranged by Chromosomer. Genes located on two or more scaffolds were considered; there were 240 such genes. A gene was considered misarranged if the scaffolds it was located on were placed in an order that differed from the GenBank assembly. Among genes located on two scaffolds, there were no misarranged genes