-
PDF
- Split View
-
Views
-
Cite
Cite
Junwei Luo, Yawei Wei, Mengna Lyu, Zhengjiang Wu, Xiaoyan Liu, Huimin Luo, Chaokun Yan, A comprehensive review of scaffolding methods in genome assembly, Briefings in Bioinformatics, Volume 22, Issue 5, September 2021, bbab033, https://doi.org/10.1093/bib/bbab033
- Share Icon Share
Abstract
In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.
Introduction
The basis of genomic research, including promoter cloning, gene identification, gene function analysis, gene expression regulation and comparative genomics analyses, is to obtain an accurate and complete reference genome [1–4]. The reference genome provides a persist structure which can be used for identifying specific sequence locations that are possibly related with diseases. For instance, we could align sequenced reads from one target in related reference genome to analyze DNA methylation, differential gene expression and structural variations. The reference genome also helps the analysis of the transcriptomic heterogeneity within populations of cells studies [5]. Hence, one accurate and complete reference genome can improve the quality and accuracy of the downstream analysis. Due to the complexity of the genome and the limitations of sequencing technologies, genome assemblers usually first produce fragmented long sequences (contigs). Next, scaffolding methods are utilized to infer the orientations and orders among these contigs and to produce some longer sequences (scaffolds). A scaffold comprises some ordered and oriented contigs, and the gap between any two adjacent contigs is filled with ‘N’s. Hence, scaffolding can make draft assembly more complete and contiguous and is an important step in genome assembly.
With the advancement of high-throughput sequencing (HTS) technologies, the reduction in the sequencing cost per genome has surpassed Moore’s law, and the cost of sequencing an entire genome as now dropped below $1000 genome [6]. The decreasing cost of HTS facilitates the widespread use of sequencing data, and typical HTS devices can generate over 100 gigabases of data in 24 h [7].
As a result of the development of HTS technologies, numerous scaffolding methods have been presented in the past decade, greatly advancing the field of genome assembly. Scaffolding methods generally utilize reads from sequencing technologies to link the different contigs and take advantage of the linked information to infer the order and orientation of contigs. The scaffolding result quality is highly dependent on the quality and length of the reads. However, different sequencing technologies produce reads with different characteristics [8–12]. Therefore, scaffolding methods based on different sequencing technologies have large differences in both their method designs and scaffolding results.
Short paired reads from next-generation sequencing (NGS), whose insert sizes can reach several kilobases, can span some repetitive regions and link different contigs. Moreover, paired reads from NGS have the advantages of low cost and high accuracy and have been used in many research fields. However, short reads reaches are only a few 100 bases in long, and the middle region between paired reads is unknown, resulting in important limitations, including ambiguous mapping, a limited ability to span long regions and amplification artifacts during library construction. Hence, the limitations of short paired reads make chromosome-level scaffolding impossible, and scaffolding errors occur frequently.
In the past few years, several new sequencing technologies have become available that are increasingly used to resolve problems in scaffolding. Comparing with NGS, long-range sequencing, represented by single-molecule real-time (SMRT) sequencing from Pacific Biosciences (PacBio; [13]) and nanopore-based sequencing from Oxford Nanopore Technologies (ONT; [14]), is capable of producing much longer reads. SMRT sequencing technology can produce an average read length of 10–30 kb [15], but reads can exceed 80 kb. Nanopore read length is depended on the length of the DNA molecules to be sequenced [16]. The longest read obtained by this method can be longer than 1 Mb [17]. These long reads can span most repetitive regions in the genome and link two or more different contigs, which greatly improves scaffolding quality [18]. However, the sequencing error rate for long reads is ~15% [17, 19–22], and obtaining accurate alignment between long reads and contigs is difficult. For error-prone long reads, high consensus accuracy can be achieved by hybrid error correction or selfcorrection methods, but the computation cost is expensive [23]. ONT has developed a new method that can reduce the error rate to ~3%, but the throughput is also reduced [16]. Recently, PacBio obtained a consensus sequence (HiFi read) from multiple passes of a long individual molecule. Its average length was ~10–25 kb, and the accuracy rate was >99.5% [24, 25].
The Hi-C protocol is capable of generating paired reads that are in close proximity in the 3-D space. The number of Hi-C paired reads from the same chromosome is significantly larger than that from different chromosomes [26, 27]. If two contigs come from the same chromosome, the number of paired reads linking them decreases approximately exponentially as a function of the genomic distance [28, 29]. Based on these two patterns, Hi-C paired reads can be used to distinguish contigs from the same chromosome as other contigs and deduce the order among contigs. Unfortunately, the true linear genomic distance between one set of paired reads varies greatly and ambiguously, as the minimum is only a few 1000 bases, and the maximum is a few 100 million bases, which prevents Hi-C paired reads from yielding accurate scaffolding results.
Recently, linked reads generated by chromium technology from 10X Genomics have become available for scaffolding [30]. This technology can capture a long-range sequence fragment with an average length of 100 kb and use the data to sequence some short barcoded reads. Therefore, linked reads with the same barcode can be utilized to link the different contigs. In addition, this technology cost less than that for long reads, and its accuracy is the same as that of NGS.
Optical mapping technologies [31], commercialized by Bionano and Opgen, can identify the locations of markers (restriction enzyme and nicking endonucleases) in a single molecule of DNA and provide a list of the ordered distances between any two markers. This information is finally used to construct a genomic optical map. Since the markers are known, it is easy to determine the distances between any two markers in each contig. Then, the contigs can be aligned in the optical map, and their orders and orientations can be deduced.
The technologies mentioned above commonly identify some or all parts of a long-range fragment and the identified parts can link the different contigs at a distance less than the length of the long-range fragment. However, the distribution characteristics of the identified parts, the lengths of the long-range fragments and the error rates substantially differ among these technologies. Current scaffolding methods commonly select one type of data mentioned previously as input, and various strategies are developed to exploit the links among contigs for scaffolding. Some scaffolding methods using hybrid data, protein sequences, or related reference genomes also exist. Although the field of scaffolding has been improved by these technologies, obtaining accurate scaffolding results remains challenging.
In this review, we summarize the widely used scaffolding methods and discuss both the strengths and weaknesses of sequencing data for scaffolding. Our purpose is to help readers gain a deeper understanding of the current development and future research in this domain. This article comprises the following parts. First, we introduce the difficulties of scaffolding. Second, according to read type, we divide the scaffolding methods into seven categories, including methods based on paired reads, long reads, Hi-C paired reads, linked reads, optical mapping, reference genomes and other data. Then, we review in detail how these methods take advantage of these reads to tackle the problems in scaffolding. Third, we review some scaffolding evaluation tools, and compare these scaffolding methods. Finally, the opportunities and directions of future research are discussed.
Scaffolding
After obtaining a set of contigs from an assembler, the strand from which they originate is unknown, and their locations in the genome are also unclear. However, many types of reads can link different contigs, and this information can be used to estimate the approximate distance between two contigs and whether they are on the same strand. Scaffolding methods take advantage of these links to deduce the order and orientation among contigs. Their input includes a set of contigs and a set of reads, and their final output is a set of scaffolds.
Before scaffolding, two issues should be considered. (i) The correction of contigs: generally, the contigs input to scaffolding methods may contain some misassemblies (MA), which may introduce incorrect links between contigs and complicate the process of scaffolding. Hence, it is important to detect MA in contigs, and a contig can be broken into two parts at the site of misassembly. At present, several methods are available to detect and correct MA in contigs. REAPR [32], NxRepair [33], PECC [34] and MEC [35] were designed based on paired reads.
And, (ii) Alignment tool: we need to utilize an alignment tool to align reads against the contigs and produce the link information. Bowtie [36], Bowtie2 [37] and BWA [38] can align short reads [paired-end (PE) reads, mate-paired reads, Hi-C paired reads and linked reads] produced by NGS against contigs. Basic local alignment with successive refinement (BLASR [39]), BWA, Minimap [40] and Minimap2 [41] can align long reads against contigs. For optical mapping data, each scaffolding method usually designs a scoring mechanism to determine the location of each contig on the optical map. However, obtaining accurate alignment is challenging, and the current alignment tools often score each alignment and list all possible alignments from which users can evaluate and select.
The difficulties of scaffolding include sequencing errors and repetitive regions in the genome, which are the primary reasons for the production of ambiguous alignments [42, 43], thus increasing the difficulty of inferring orders and orientations among contigs. If the sequencing accuracy is 100% and sequenced fragments are sufficiently long to span all repetitive regions, scaffolding is simple. However, the current sequencing technologies cannot maintain a high accuracy and yield and sufficiently long sequences simultaneously, as they always leverage sequencing error and length of sequenced fragments.
The reads produced by different sequencing technologies are very different and have different abilities to overcome the difficulties of scaffolding. Protein sequences and related reference genomes can also be used for scaffolding. Hence, the design of existing scaffolding methods is often based on one type of input data. Therefore, we divide existing scaffolding methods into seven categories based on data types: (i) scaffolding methods based on paired reads; (ii) scaffolding methods based on long reads; (iii) scaffolding methods based on Hi-C paired reads; (iv) scaffolding methods based on linked reads; (v) scaffolding methods based on optical mapping; (vi) scaffolding methods based on reference genomes and (vii) other scaffolding methods. We show the draft processes of these scaffolding methods in Figure 1. Generally, these scaffolding methods include two primary steps. First, alignment tools are utilized to obtain the links among contigs. Second, based on the alignment information, scaffolding methods use various strategies to deduce the order and orientation among contigs and produce the final scaffolds. We list the scaffolding methods in Table 1. Below, we will detail these different types of scaffolding methods.

Processes of six scaffolding method types. First, these reads are aligned against the contigs, or these contigs are aligned against optical maps or the reference genomes. Second, based on the alignment information, the order and orientation among contigs are deduced. Finally, scaffolds are output by these methods.
Scaffolding methods. The first column shows the type of scaffolding method, the second column shows the name of the scaffolding method and the publication year, and the last column shows the link used to obtain the method
Scaffolding methods. The first column shows the type of scaffolding method, the second column shows the name of the scaffolding method and the publication year, and the last column shows the link used to obtain the method
Scaffolding methods based on paired reads
With the continuous evolution of NGS technologies, sequencing costs continue to decrease and sequencing throughput continues to increase. NGS technologies have been widely applied to many research fields with great success [44–47]. NGS technologies can generally produce both single-end and PE reads. However, the read length of NGS is short. Single-end reads can link only two contigs that are closely together in the genome, and it is difficult to obtain accurate alignments in repetitive regions [48].
Paired reads (PE/mate-paired reads) come from two ends of a long sequence fragment and are not on the same strand. The size of the long sequence fragment, called the insert size, can reach several 1000 bases. Alignment of one set of paired reads is aligned in two different contigs usually indicates that the distance of the two contigs is smaller than the insert size. If the alignment orientations are the same, the two contigs are in the reverse orientation; otherwise, they are in the same orientation. Therefore, paired reads are typically used to infer order and orientation among contigs. Currently, many scaffolding methods based on paired reads have been developed. These methods generally include the following three steps.
Preprocessing
Due to the limitations of read length, repetitive regions and sequencing errors, one short read may be aligned at incorrect or multiple locations. Most scaffolding methods adopt stand-alone alignment tools (Bowtie, Bowtie2 and BWA) to obtain the positions of reads in the contigs. SOAPDenovo2 [49] uses its own alignment tool. SCARPA [50] and ScaffMatch [51] utilize a simple approach to address these ambiguous alignments, as alignment of one paired reads at multiple positions results in the deletion of alignment information. SSPACE [52] removes the paired reads that are both aligned in multiple positions. By analyzing the characteristics of paired reads that are aligned in the same region, MIP [53] removes inconsistent reads. To determine whether the contigs are repetitive regions, ScaffMatch first calculates the average read coverage of the contigs and then removes the contigs with high read coverage. For the detection and removal of MA in contigs, SOPRA infers whether the misassembly occurs by analyzing the number of paired reads aligned in the same contig.
Constructing a scaffold graph
Scaffolding methods usually involve construction of a graph to store links among contigs. These methods extract some paths from the graph, and each path corresponds to a scaffold. In different studies, this graph has several different names, such as a scaffold graph/scaffolding graph (MIP, Opera [54], SCAPRA, BESST [55], integer linear programming (ILP) [56], ScaffMatch, SWALO [57]), a contig connectivity graph (SOPRA [58]), a contig link graph (GRASS [59]), a contig graph (SLIQ [60]) and a directed graph (ScaffoldScaffolder [61]). In this review, we uniformly call these graph scaffold graphs.
In a scaffold graph, each vertex usually denotes a contig or one end of a contig, and each edge means that paired reads link related vertices. An edge in a scaffold graph determines the order and orientation of two linked vertices. In constructing a scaffold graph, the most important issues are how to determine whether to add an edge between two contigs, and how to set the edge weights. The weight of an edge reflects the reliability of the edge.
Scaffolding methods, such as MIP, Bambus [62], Bambus2 [63], Opera, SSPACE and ScaffoldScaffolder, add an edge between two vertices if the number of paired reads linking them exceeds one threshold to eliminate the misaligned paired reads. Commonly, this threshold is set arbitrarily by users. Unlike other methods, Opera involves the development of a simulation method to set this threshold based on genome size and sequence coverage. The weights of the edges in the above methods are set to be the number of links. In SCARPA each contig corresponds to two vertices, representing the 3′-end and the 5′-end of a contig. The forward and reverse chains of each contig in ScaffMatch correspond to two vertices. BESST and BOSS [64] are two scaffolding methods that use the distribution of paired reads to construct a scaffold graph. BESST considers paired reads that link two contigs to analyze the difference between the expected and observed standard deviation of gap distance and determines whether to add an edge based on the difference. However, BOSS further considers the paired reads in which only one mate read is aligned in the contigs to find the expected number of paired reads linking two contigs. Based on the ratio of the expected number and real number, BOSS determines whether to add an edge and sets the weight to be the ratio.
Determining orientation and order among contigs
For two contigs, the edge linking them determines whether they are on the same strand and the approximate distance between them. Due to misaligned paired reads, some incorrect edges are inevitably introduced in a scaffold graph, which commonly leads to some contradictions in orientation and order.
Hence, based on the different optimization strategies, scaffolding methods usually initially detect and the remove edges that could lead to the elimination of contradictions in the scaffold graph. Next, scaffolding methods use various strategies to select consistent paths from the scaffold graph, which determines the orientation and order of contigs in each path. These strategies can be divided into three categories as follows:
(i) Many different optimization algorithms are used for this problem. Bambus turns the contig orientation problem into the problem of coloring a bipartite graph. Bambus2 removes the minimum weighted edges to prohibit inconsistent directions in the graph and then uses the optimal linear arrangement model to determine the orientation and order of the contigs. To determine the orientations of the contigs, SOPRA aims to assign an orientation to each contig to satisfy as many paired reads as possible. In addition, it transfers finding an optimization solution to the maximum weight cut problem. To determine the order of the contigs, SOPRA continually assigns a position to each contig and utilizes a greedy approach to determine the distances between contigs that best agrees with the sizes of the gaps, as suggested by the edges. SLIQ first identifies and removes junctions from the scaffold graph. By deleting the edge with the lowest weight and strongly connected components, SLIQ guarantees that the scaffold graph is a directed acyclic graph. Finally, each connected component is extracted as a separate scaffold. SCARPA first determines only the orientation of all contigs. It transfers the contig orientation problem to a minimum odd cycle transversal problem and utilizes a fixed-parameter approach to delete some edges and to guarantee that there are no odd cycles in the scaffoldgraph. Second, SCARPA eliminates all directed cycles from the scaffold graph for ordering contigs. The problem is defined as the feedback arc set problem. Finally, by assigning a starting coordinate to each contig, SCARPA sets the order constraint for each contig and uses a linear programming approach to find a solution. Opera defines ordering and orienting contigs as a bounded version of the graph bandwidth problem. Then, Opera takes advantage of dynamic programming to find the exact scaffolding solution based on the scaffold graph. ScaffMatch translates the entire orientation and order inference problem into a maximum-weight acyclic 2-matching problem and iteratively removes some edges to prohibit cycles in the graph. In the final constructed acyclic graph, the vertices are linearized. ScaffoldScaffolder defines the contig orientation problem as a MAX-DIR problem, and aims to find at least k directed edges in a bidirected graph by assigning an orientation to each vertex. Scaftools [65] uses an iterative method of breaking cycles to determine the orientation and order relationship of the contigs in the contig connectivity graph.
(ii) Many scaffolding methods have been developed based on linear programming methods. The important issues in this kind of method are constraint functions and optimization objectives. For each edge in a scaffold graph, MIP obtains six constraints. A variable exists to relax each constraint. Then, an optimization objective is presented to minimize the amount of stretching in these constraints. Finally, MIP utilizes mixed-integer programming to find the solution. In addition, MIP divides the scaffold graph into biconnected components and then conducts scaffolding on each subgraph. GRASS presents a mixed-integer quadratic programming formulation of the scaffolding problem. Each edge is designed as a distance constraint, order constraint and orientation constraint. The optimization objective is to minimize the number of unsatisfactory constraints, whereas the expectation maximization (EM) approach is employed to solve the problem. SILP2 [66] builds a maximum likelihood model to remove links with low reliability. For links between two contigs, the reliability is calculated based on the number of links, the probability of repeated alignment, and the probability of abnormal coverage. Then, the model is solved by using ILP. BOSS utilizes an iterative strategy to find spurious edges by first constructing a subgraph that includes only edges with a large weight. Next, it iteratively adds the remaining edges to the subgraph from large to small weights. Each iteration includes a subgraph, and BOSS builds two linear programming models to resolve orientation and position contradictions in the subgraph. Finally, BOSS employs a heuristic approach to sort vertices (contigs) and then generates scaffolds.
(iii) Based on the scaffold graph, heuristic algorithms are designed to resolve the problem of fork paths. SSPACE utilizes a direct heuristic approach to extract paths from the scaffold graph. It selects the largest contig as the initial path and extends the path from its two ends by iteratively combining contigs. When there are multiple contigs with edges linking the end of the path, it calculates a ratio between the two best edge weights. If the ratio is below a user-defined threshold, the best candidate is added to the path. BESST constructs a scaffold graph that includes only the large contigs and selects paths from the graph using a method similar to SSPACE. Next, BESST utilizes a breadth-first search to place the small contigs between two large contigs. An MP library usually contains some PE reads that confuse the scaffolding. Based on BESST, an ILP model is constructed, whose objective function considers the discrepancies between MP and PE links. iLSLS [67] is a scaffolding method based on a greedy strategy. This method first determines the valuable contigs among all contigs based on the relationship between the paired reads and the contigs and removes the error regions of some unreliable contigs; then, iLSLS is adapted for the problem of uneven sequence depth and employs a ‘loose- strict-loose’ extended path strategy, which not only prevents the loss of correct connections between contigs but also reduces false contig extension. Finally, iLSLS also utilizes a scoring approach for more than one path. The use of a series of methods enables iLSLS to not only accurately estimate the gap size but also effectively improve the accuracy of the scaffolds. WiseScaffolder [68] uses a manual intervention model to optimize the scaffolding method. Weller et al. [69] used a tree decomposition method to solve the orientation and order relationship of contigs. inGap-sf [1] uses direct links and paired links to order contigs. A direct link is created through the detection of overlaps between all contig pairs. Paired links are created by paired reads that can be mapped to any pair of contigs. Then, inGap-sf travels a direct link graph to order contigs and uses a statistic-based estimation model based on paired links to solve fork problems.
Identifying repetitive regions in contigs is significant for scaffolding. Opera identifies and filters repetitive contigs whose read coverage is large. BATISCAF [70] classifies contigs as either repeat contigs or unique contigs. It first orders and orients only unique contigs and forms a sketch of all contigs. Finally, it inserts repeat and short contigs into the sketch and produces the final scaffolds. According to the GC-content and alignments of paired reads, SCOP [71] first classifies the contigs into three categories (true, uncertain and misassembled). This classification helps to detect MA and repetitive regions in contigs. After constructing a scaffold graph, SCOP not only detects spurious edges, but also splits contigs based on their categories. Most contradictions in the scaffold graph can be removed, and the accuracy of the scaffold graph is improved.
Scaffolding methods based on long reads
With their read length advantage, third-generation sequencing technologies have been widely used in many biological applications and have greatly improved the development of scaffolding methods [72]. Currently, many methods use various strategies to resolve repetitive problems based on long reads [73, 74]. A long read can be aligned in multiple contigs and can link two or more contigs. Then, based on their alignment orientations and positions, their orders and orientations can be inferred. Hence, obtaining accurate alignment results is important for scaffolding.
SSPACE-LongRead [75] uses BLASR to obtain alignments between long reads and contigs. For two contigs, the number of long reads linking them is their weight. Thus, SSPACE-LongRead considers two contigs with high weight to be adjacent. SMSC [76] also uses BLASR and Nucmer as alignment tools and first detects MA in contigs. Then, SMSC constructs a scaffold graph, and the scaffolding problem is transferred into a maximum alternating path cover problem.
To improve the accuracy of alignment, LINKS [77] first extracts paired k-mers from the long reads. Then, these paired k-mers are aligned against the contigs, and used to link different contigs. SMIS [78] also splits the long reads into short segments to produce fake mate reads and link different contigs. RAILS [79] was developed based on the engines of SSAKE and LINKS. This strategy can be used to obtain numerous paired k-mers linking contigs. However, one paired k-mer is easily aligned at multiple positions which may confuse the subsequent steps.
Unlike other methods, DBG2OLC [80] first detects the order and orientation of the long reads. Next, DBG2OLC transfers the long reads to the contigs. Specifically, DBG2OLC detects and finds overlaps between long reads and contigs. Then, based on the contigs that can be mapped in the same long read, the long read is represented by contig identifiers and termed a compressed read. Because the compressed reads are much shorter than the original long reads, DBG2OLC can quickly find overlaps among the compressed reads. Next, DBG2OLC constructs a read overlap graph, in which a node is a compressed read, and nodes are chained to one another in both directions with the best possible overlap. After chaining, each chain is converted back to raw nucleotide reads, and a consensus module is called to align these reads to each backbone and calculate the polished assembly.
Because most repetitive regions in contigs can be identified by the long reads, some methods have been proposed based on classifying contigs into two categories: repetitive contigs and nonrepetitive contigs. These methods first order and orient nonrepetitive contigs and produce draft scaffolds, and repetitive contigs are then inserted into these draft scaffolds. npScarf [81] classifies contigs as either unique contigs or repetitive contigs based on sequencing coverage. High coverage of a contig usually indicates that it is a repetitive. LRScaf [82] identifies and removes repetitive regions in the contigs, based on the alignment coverage between the long reads and the contigs. Then, such regions are ignored in the following steps of LRScaf. LRScaf also constructs a scaffold graph, from which it extracts simple paths. SLR [83] develops a method to divide contigs into unique and ambiguous contigs. SLR produces local scaffolds, each of which includes ordered and oriented contigs that can be aligned in the same long read. Based on the positions of a contig in the local scaffolds, SLR determines whether it is ambiguous. After constructing a scaffold graph that includes only unique contigs, SLR adopts a linear programming model to optimize and extract scaffolds from the scaffold graph.
Scaffolding methods based on linked reads
Recently, 10X Genomics developed a new long-range sequencing technology. Unlike TGS, it does not capture all the information about a long sequence. It extracts only some short reads, named linked reads, from the same long DNA molecule, which can guarantee the correctness of linked reads and reduce sequencing cost. It is an effective and less expensive alternative for scaffolding. Note that each linked read is combined with a barcode. Therefore, linked reads with the same barcode are from the same long sequence and should be grouped together. Alignment of linked reads in the same group with different contigs usually means that these contigs can be linked.
ARCS [84] first groups the linked reads by barcode. When one contig is aligned with some linked reads, ARCS uses a binomial test to validate whether the alignment is reliable. ARCS also constructs a scaffold graph in which an edge is added when the link orientation has the most types of supporting barcodes. After the scaffold graph is completed, ARCS adopts LINKS to orient and order the contigs. Other stand-alone scaffolding methods that can construct a scaffold graph can also be used within the ARCS pipeline.
ARKS [85] developed a new method to connect contigs based on linked reads. To reduce the run time, ARKS uses the k-mer strategy to align linked reads and contigs. The k-mers at the two ends of the contigs are stored in a hash table. By computing the shared k-mers between one end of a contig and a linked read, ARKS gives an alignment score and determines whether the linked read can be aligned at the end of the contigs. If the ends of two contigs can be linked by the linked reads from the same group, one edge can be added between them. To estimate the distance between two contigs, ARKS trains an approach based on the distribution of linked reads aligned at the two ends of the given contigs. After the edges are determined, a scaffold graph is constructed. The subsequent step of orienting and ordering the contigs is the same as that of ARCS.
fragScaff [86] computes a shared barcode fraction metric between any paired contigs. When the value is higher than a threshold, an edge is added between two contigs. For a scaffold graph, fragScaff uses the maximum weight minimum spanning tree to traverse the scaffold graph and produce the final scaffolds. In addition, Architect [87], a scaffolder designed based on molecular read cloud specifications, also uses barcoded reads to improve the contiguity of genome assemblies. AnVIL [88] was developed for not only scaffolding but also minimizing gaps in the scaffolds. AnVIL uses linked reads to detect the linkages between the contigs, and adopts an overlap-layout-consensus strategy to fill the gaps.
Since the order information of linked reads with the same barcode is unknown, the methods to determine whether two contigs can be linked are not straightforward, and the estimation of the gap size between two linked contigs is unclear. All these methods require more sophisticated algorithms.
Scaffolding methods based on Hi-C paired reads
Hi-C technology is derived from chromosome conformation capture and high-throughput PE sequencing technology [26, 89]. Every Hi-C paired read refers to two genome locations and indicates a spatial interaction between these two genome locations. After aligning Hi-C paired reads against a reference genome, the number of paired reads that link any two regions in a reference genome can be computed. There are two interesting findings. First, the frequency of links between two regions in two different chromosomes is much less than that between two regions in the same chromosome, and this value can even be separated by tens of megabases. Second, for two regions in the same chromosome, the frequency of their links decays rapidly with the linear genome distance. Therefore, Hi-C paired reads can be used to study the spatial relationship of the entire chromatin DNA and to obtain high-resolution three-dimensional structural information on chromatin.
In addition, Hi-C paired reads can be used in combination with scaffolding to generate chromosome-scale scaffolds. Multiple scaffolding methods based on Hi-C paired reads are presented. LACHESIS [28] is the first method of genome assembly that use Hi-C paired reads and consists mainly of three steps. After the alignment of Hi-C paired reads against the contigs, the number of contacts between any pair of contigs can be obtained. Then, the contigs with more contacts are first clustered into one group by using the hierarchical clustering approach; when the number of groups reaches the number of user-specified chromosomes, the grouping approach ends. Next, the contigs are ordered within chromosome groups, and they may be adjacent if the number of their contacts is large. Finally, the orientation is determined for each contig within the groups by calculating an orientation quality score. However, the number of chromosomes is user-specified, and for an unknown chromosome, the LACHESIS approach cannot be utilized. dnaTri [90] is a statistical scaffolding method and trains a model using a supervised machine learning approach to verify that the pattern can be used to accurately predict the chromosome to which the contigs belongs, verifying that the pattern can accurately predict the location of the contigs. One disadvantage of LACHESIS and dnaTri is that they both rely on hierarchical clustering algorithms, and the cost of calculating the score for each contig is high. Another disadvantage is that there are no error correction steps for contigs or scaffolds before scaffolding, which could result in the output of some erroneous contigs or scaffolds to the final scaffold. GRAAL [91] is a probabilistic scaffolding method, but its verification is limited to a single chromosome. Based on GRAAL, instaGRAAL [92] is presented to overcome some limitations. instaGRAAL enhances the computational efficiency of GRAAL and adds a step to correct MA in contigs.
Hi-C paired reads can also be used to identify errors in contigs. 3D-DNA [29] first identifies and corrects errors in the input contigs by using Hi-C paired reads. First, the approach for misjoining correction uses many iterative steps to correct misjoins in the input contigs, exploiting the fact that the correct contigs have more contacts than the wrong contigs. Next, the scaffolding method is run to cluster, order and orient the contigs and output a megascaffold containing all the contigs. Finally, a merged method is run to merge the contigs with overlapping regions and output final scaffolds. However, the megascaffold is broken into a user-specified number of chromosomes based on the Hi-C contact map.
SALSA [93] defines the physical coverage for each base in the contigs, which is the sum of paired reads that span it. A sudden drop in physical coverage at a certain position signalizes for misassembly. For the efficient detection of misassembled signals in contigs, SALSA adopts a variation of Kadane’s approach for the maximum sum subarray problem. After correction, SALSA constructs a scaffold graph, and the weight of an edge is set based on the number of links and the lengths of contigs. Two long contigs tend to have more links. SALSA2 [94] was developed based on SALSA and uses a hybrid graph that combines Hi-C links with an assembly graph to generate scaffolds. SALSA2 calculates the score for each edge and retains the edges with an edge score > 1. Then, the edges are selected using the greedy weighted maximum matching approach, and the edges from the same contig are added to generate the initial scaffold.
Recently, a new approach called HiCAssembler was proposed [95]. It uses HiCExplorer to create a Hi-C contact matrix that filters out some low-quality reads. It also uses two methods to remove MA: the TAD detection approach based on the TAD separation score and a manual adjustment parameter method based on visual inspection. HiCAssembler obtains chromosome-scale scaffolds by iteratively joining scaffold paths to form paths of increasing size, as in 3D-DNA. The scaffolding method uses a maximum spanning tree approach to determine the order of contigs similar to LACHESIS.
Scaffolding methods based on optical maps
Single-molecule optical maps are ordered genome wide optical restriction maps (ORMs) derived from a single DNA molecule. Optical maps provide a macroscopic framework that reflects structural information on the whole genome. Single-molecule optical maps for scaffolding need to solve two problems: how to find the best matches between the contigs and the optical maps, and how to place the contigs on the map in the correct order. The following are some methods for scaffolding using optical maps.
SOMA [96] is a scaffolding method that overcomes limitations such as the length of the sequencing reads by combining the information of the optical maps. The scaffolding problem is decomposed into two simple subproblems: seeking good matches between the optical maps and the contigs and searching for a consistent placement when given many matches. AGORA [97] uses an optical map within the de Bruijn graph framework to promote genome assembly and simplifies the process of genome assembly into the process of finding a ‘Chinese postman path’. The method process mainly includes two steps. The landmark edges can be found first by using the greedy approach, which is an edge in the graph that matches one position on the optical map. The landmark edges are then sorted in the order of the placement of the optical map. Next, the path between consecutive landmark edges can be found by using the depth-first search approach. The experiments discussed in this article not only indicated that optical maps are effective for genome assembly in the de Bruijn graph framework but also demonstrated the advantages of more accurate optical mapping technologies such as nanocoding.
Due to the insufficiency of traditional optical mapping methods, such as the extremely low efficiency of the image capture process, the difficulty of expanding DNA molecules and the incompletely automated analysis of the optical image, an approach called whole genome mapping (WGM) [98] is used to generate optical mapping data and scaffolds. The WGM process includes the following steps. Briefly, the approach first converts contigs into in silico maps and then uses an alignment approach in which single-molecule maps are aligned to the in silico maps to extend the scaffolds. Then, after all the contigs are expanded, the contigs are used to generate regions including overlaps in adjacent contigs, and the contigs with overlapping regions are finally connected by using a modified Smith–Waterman dynamic programming approach to form a super-scaffold. The use of WGM technology for large genome assembly was verified in goats. Based on ORMs, a greedy scoring approach and three greedy placement algorithms are utilized for scaffolding [99]. The score of each contig corresponding to its possible placement in the ORM is first calculated and then used to determine the nonoverlapping placement of each contig in the ORM.
For Bionano optical maps, SewingMachine [100] and optical map-based genome scaffolding (OMGS [101]) are the two main methods for scaffolding. SewingMachine iteratively uses RefAligner and Stitch. First, the contigs are converted into new maps, and RefAligner then aligns the new maps to optical maps. Based on the alignment information, Stitch orders and orients the contigs. Because most methods can process only one optical map at a time, OMGS was developed to solve this problem by handling two or more optical maps simultaneously. Multiple optical maps can supply more linkage information between contigs and are helpful for resolving conflicts and repetitive regions.
Scaffolding methods based on reference genomes
The methods mentioned above are reference-free and use the reads from the targeted organism for scaffolding. To date, there are many available complete genomes. During evolution, although DNA sequences vary, some conserved genes and their order are retained. Hence, to assemble a target genome, one or more closely related reference genomes can serve as a reference and are useful for guiding the process of scaffolding. Obviously, more related reference genomes are more conducive to identifying conserved genes and their order.
Recently, some reference-based scaffolding methods were presented, most of which are faster and cheaper than reference-free methods [102]. These methods can be further divided into alignment-based methods [102–107] and rearrangement-based methods [102, 108–113].
Alignment-based methods usually directly align contigs against the references and then deduce their order and orientation based on their aligned positions and orientations in the reference genome. For example Projector 2 [103], Mauve [104] and OSLay [105]. MeDuSa [107] take advantage of multiple sets of reference genomes from related organisms for scaffolding. A scaffold graph is first constructed in which each vertex denotes a contig, and the edge is weighted by the number of reference supports adjacent to the two vertices. MeDuSa utilizes a constant factor approximation approach to order and orient contigs in a scaffold graph. As an efficient scaffolding method, RaGOO [102] adopts Minimap2 to align contigs and a reference. Before scaffolding, RaGOO can identify chimeric contigs and break them at their incorrect positions. After that, each contig is assigned to a reference chromosome, and all contigs in that chromosome are ordered and oriented. RaGOO assigns three confidence scores (clustering, location and orientation) for each contig, which can be used to judge whether the scaffolding result is correct.
First, rearrangement-based methods commonly find synteny blocks or genomic markers between contigs and references. The synteny blocks are a collection of conserved genomic regions that simultaneously appear in the contigs and the references, and they can be detected by alignment tools, such as Cactus [114] or MUMmer [115]. Then, the contigs and references can be represented by ordered and oriented synteny blocks. By using the references as backbones, these methods try to minimize the rearrangement distance between the contigs and the reference. CAR [110] uses a related reference genome to guide the scaffolding process. By utilizing permutation groups in algebra, CAR measures the rearrangement distance based on genetic markers (conserved sequences) in the contigs and reference genomes, which can be detected by Nucmer or Promer. In contrast to CAR, Multi-CAR [111] simultaneously uses multiple reference genomes for scaffolding. If two contigs can be joined by CAR with one reference genome, the weight between the two contigs increases by one. Then, Multi-CAR iteratively finds the paths with the maximum weights, and the paths correspond to the scaffolding result. CSAR-web [112] is a web server developed based on CSAR [113] that uses an incomplete reference genome for scaffolding. Ragout2 [102] used Cactus to detect conserved sequences, and develops a new graph simplification algorithm to generate synteny blocks containing consecutive conserved sequences. Then Ragout2 adopts a two-break rearrangement algorithm to break MA in the contigs. Finally, Ragout2 iteratively finds scaffolds with different synteny blocks sizes.
Other scaffolding methods
In addition to the mainstream methods mentioned previously, some researchers have tried to combine different reads or protein sequences for scaffolding.
OPERA-LG [116] is a hybrid scaffolding method that can simultaneously adopt short paired reads and long reads as input. OPERA-LG transfers only the long reads to synthetic paired reads with different insert sizes. Then, it uses multiple paired read libraries and an exact approach for scaffolding. SLR-superscaffolder [117] employs a top-to-bottom scheme to combine long fragment reads and paired reads. It first uses long fragment reads in large-scale scaffolding and then uses paired reads for local scaffolding.
Protein sequences are usually well conserved, and exons in the same protein sequence can be located on different contigs. Therefore, two different contigs can be linked when they are aligned in the same protein sequence. SWiPS [118] uses proteome sets from different species for scaffolding. By mapping the contigs to protein sequences using TBLASTN [119], SWiPS can find contigs aligned in the same protein sequence and compute their similarity scores. Because the same region of the protein sequence can usually be aligned with multiple contigs, SWiPS selects the path with the highest similarity score as the scaffold. PEP_scaffolder [120] uses BLAST [121] to align protein sequences with contigs. The protein sequence is divided into blocks, and the longest alignment region in one block is selected. Then, one contig corresponds to one block; therefore, the contigs can be sorted based on the order of the blocks.
Evaluation
Metrics
Evaluating the performance of scaffolding methods is crucial for assessing how well a scaffolding method performs and for fairly comparing the performances of different methods. Here, we introduce the most widely used evaluation metrics in the literature.
The traditional evaluation approach for scaffolding methods is to compute the metrics N50 and corrected N50 (NGA50/NA50) values. N50 is the length of the longest scaffold such that all the scaffolds of that length or longer cover at least half of the length of all scaffolds. After aligning the scaffolds in the reference genome and breaking the scaffolds at their error points, N50 can be recomputed to obtain the corrected N50. These metrics can be obtained by GAGE [122] and QUAST [123], which are typically used to evaluate the contiguity and completion of scaffolds. However, it is difficult to evaluate the number of correct orders and orientations. Joining long contigs correctly usually increases the values of N50 and corrected N50. However, joining short contigs correctly sometimes has no effect on the values of N50 and corrected N50.
To compute the number of correct joins among contigs, a novel evaluation pipeline is first provided for the scaffolding methods, termed a scaffolding evaluation [124]. For the contigs produced by any assembler, scaffold evaluation leads to the identification of assembly errors by aligning them against the reference genome using NUCmer. The best alignments are selected, and a new contig set with no errors is then generated. Each new contig is attached with a tag, and their orders, orientations and gap distances in the reference genome are kept. Finally, scaffolding methods are run on the new contig set, and the scaffolding results are evaluated based on the orders and orientation of contigs determined in the reference genome. They present five metrics: correct joins, incorrect joins, skipped tags, lost tags and running time. Correct join indicates that two contigs are joined in the expected order, orientation and gap distance. Incorrect join indicates that two contigs are joined with an incorrect order, orientation and gap distance. A skipped tag indicates that two contigs are joined correctly but that the gap region between them should contain another contig. A lost tag indicates that a contig has been lost in the scaffolding result. The running time is the total CPU time required to run one scaffolding method.
ScaffoldMatch computes precision, recall and the F1-score based on the number of correct joins and incorrect joins. Precision is calculated as (correct joins/correct joins + Incorrect joins), Recall is calculated as (correct joins/total joins), and the F1-score equals [2*precision*recall/(precision + recall)], thus yielding a comprehensive evaluation. However, some repetitive regions in the contigs are commonly used for scaffolding. Some contigs or subcontigs can be aligned at two or more positions. When producing perfect new contigs, the scaffold evaluation leads to the selection of only the best alignment for a contig or subcontig. However, Igor Mandric et al. [125] considered multiple alignments for repetitive contigs or subcontigs over a given identity level to compute correct joins and incorrect joins. This strategy enhanced the evaluation accuracy for scaffolding methods.
Comparisons
Current studies usually compare the same kinds of scaffolding methods, and the differences in performance among different kinds of scaffolding methods are unclear. To evaluate different scaffolding methods, we selected some and obtained experimental results from the corresponding studies. The scaffolding results shown in Table 2 are from studies on SCOP and BOSS. The scaffolding results shown in Tables 3 and 4 are from the studies on SLR and LRscaf, respectively. In Table 5, we show the scaffolding results from the study of SALSA2, whereas Table 6 shows the scaffolding results from the study of ARCS, and Table 7 shows the scaffolding results from the study of CSAR. Two important metrics are used: NGA50 and MA, which represent the continuity and correctness of scaffolding results, respectively.
Genome . | Staphylococcus aureus . | Rhodobacter sphaeroides . | R. sphaeroides . | Bacillus cereus . | Plasmodium falciparum . | Human chr 14 . | Human chr 14 . | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Genome size | 2.9 M | 4.6 M | 4.6 M | 5.4 M | 23 M | 88 M | 88 M | |||||||
Insert size | 3500 | 3700 | 540 | 600 | 2700 | 2900 | 34 500 | |||||||
Coverage | 45 | 46 | 100 | 100 | 39 | 26 | 2 | |||||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
ABySS | 609 | 1 | 263 | 186 | 23 | 3 | 103 | 5 | 6 | 40 | 114 | 38 | 12 | 22 |
Bambus2 | 240 | 2 | 113 | 240 | 28 | 3 | 109 | 3 | − | − | 28 | 1451 | 72 | 3178 |
OPERA | 677 | 10 | 108 | 260 | 23 | 2 | 63 | 3 | 24 | 199 | 133 | 603 | 20 | 3903 |
SCARPA | 112 | 8 | 35 | 363 | 24 | 4 | 159 | 3 | 17 | 263 | 45 | 441 | 17 | 1499 |
SGA | 304 | 1 | 41 | 344 | 33 | 6 | 223 | 4 | 5 | 89 | 88 | 16 | 86 | 3595 |
SOAP2 | 304 | 3 | 2504 | 101 | 35 | 6 | 169 | 4 | 6 | 349 | 119 | 627 | 34 | 2788 |
SOPRA | 109 | 2 | 29 | 320 | 24 | 4 | 165 | 3 | 11 | 64 | 19 | 1081 | 30 | 2562 |
SSPACE | 298 | 3 | 109 | 213 | 23 | 2 | 103 | 3 | 5 | 307 | 44 | 64 | 8 | 111 |
ScaffMatch | 335 | 15 | 2468 | 89 | 24 | 13 | 143 | 9 | 47 | 112 | 195 | 877 | 42 | 641 |
BOSS | 337 | 17 | 360 | 53 | 26 | 7 | 209 | 10 | 30 | 328 | 42 | 2514 | 43 | 5129 |
SCOP | 341 | 13 | 449 | 61 | 37 | 3 | 456 | 5 | 31 | 350 | 45 | 2457 | − | − |
Genome . | Staphylococcus aureus . | Rhodobacter sphaeroides . | R. sphaeroides . | Bacillus cereus . | Plasmodium falciparum . | Human chr 14 . | Human chr 14 . | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Genome size | 2.9 M | 4.6 M | 4.6 M | 5.4 M | 23 M | 88 M | 88 M | |||||||
Insert size | 3500 | 3700 | 540 | 600 | 2700 | 2900 | 34 500 | |||||||
Coverage | 45 | 46 | 100 | 100 | 39 | 26 | 2 | |||||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
ABySS | 609 | 1 | 263 | 186 | 23 | 3 | 103 | 5 | 6 | 40 | 114 | 38 | 12 | 22 |
Bambus2 | 240 | 2 | 113 | 240 | 28 | 3 | 109 | 3 | − | − | 28 | 1451 | 72 | 3178 |
OPERA | 677 | 10 | 108 | 260 | 23 | 2 | 63 | 3 | 24 | 199 | 133 | 603 | 20 | 3903 |
SCARPA | 112 | 8 | 35 | 363 | 24 | 4 | 159 | 3 | 17 | 263 | 45 | 441 | 17 | 1499 |
SGA | 304 | 1 | 41 | 344 | 33 | 6 | 223 | 4 | 5 | 89 | 88 | 16 | 86 | 3595 |
SOAP2 | 304 | 3 | 2504 | 101 | 35 | 6 | 169 | 4 | 6 | 349 | 119 | 627 | 34 | 2788 |
SOPRA | 109 | 2 | 29 | 320 | 24 | 4 | 165 | 3 | 11 | 64 | 19 | 1081 | 30 | 2562 |
SSPACE | 298 | 3 | 109 | 213 | 23 | 2 | 103 | 3 | 5 | 307 | 44 | 64 | 8 | 111 |
ScaffMatch | 335 | 15 | 2468 | 89 | 24 | 13 | 143 | 9 | 47 | 112 | 195 | 877 | 42 | 641 |
BOSS | 337 | 17 | 360 | 53 | 26 | 7 | 209 | 10 | 30 | 328 | 42 | 2514 | 43 | 5129 |
SCOP | 341 | 13 | 449 | 61 | 37 | 3 | 456 | 5 | 31 | 350 | 45 | 2457 | − | − |
Genome . | Staphylococcus aureus . | Rhodobacter sphaeroides . | R. sphaeroides . | Bacillus cereus . | Plasmodium falciparum . | Human chr 14 . | Human chr 14 . | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Genome size | 2.9 M | 4.6 M | 4.6 M | 5.4 M | 23 M | 88 M | 88 M | |||||||
Insert size | 3500 | 3700 | 540 | 600 | 2700 | 2900 | 34 500 | |||||||
Coverage | 45 | 46 | 100 | 100 | 39 | 26 | 2 | |||||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
ABySS | 609 | 1 | 263 | 186 | 23 | 3 | 103 | 5 | 6 | 40 | 114 | 38 | 12 | 22 |
Bambus2 | 240 | 2 | 113 | 240 | 28 | 3 | 109 | 3 | − | − | 28 | 1451 | 72 | 3178 |
OPERA | 677 | 10 | 108 | 260 | 23 | 2 | 63 | 3 | 24 | 199 | 133 | 603 | 20 | 3903 |
SCARPA | 112 | 8 | 35 | 363 | 24 | 4 | 159 | 3 | 17 | 263 | 45 | 441 | 17 | 1499 |
SGA | 304 | 1 | 41 | 344 | 33 | 6 | 223 | 4 | 5 | 89 | 88 | 16 | 86 | 3595 |
SOAP2 | 304 | 3 | 2504 | 101 | 35 | 6 | 169 | 4 | 6 | 349 | 119 | 627 | 34 | 2788 |
SOPRA | 109 | 2 | 29 | 320 | 24 | 4 | 165 | 3 | 11 | 64 | 19 | 1081 | 30 | 2562 |
SSPACE | 298 | 3 | 109 | 213 | 23 | 2 | 103 | 3 | 5 | 307 | 44 | 64 | 8 | 111 |
ScaffMatch | 335 | 15 | 2468 | 89 | 24 | 13 | 143 | 9 | 47 | 112 | 195 | 877 | 42 | 641 |
BOSS | 337 | 17 | 360 | 53 | 26 | 7 | 209 | 10 | 30 | 328 | 42 | 2514 | 43 | 5129 |
SCOP | 341 | 13 | 449 | 61 | 37 | 3 | 456 | 5 | 31 | 350 | 45 | 2457 | − | − |
Genome . | Staphylococcus aureus . | Rhodobacter sphaeroides . | R. sphaeroides . | Bacillus cereus . | Plasmodium falciparum . | Human chr 14 . | Human chr 14 . | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Genome size | 2.9 M | 4.6 M | 4.6 M | 5.4 M | 23 M | 88 M | 88 M | |||||||
Insert size | 3500 | 3700 | 540 | 600 | 2700 | 2900 | 34 500 | |||||||
Coverage | 45 | 46 | 100 | 100 | 39 | 26 | 2 | |||||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
ABySS | 609 | 1 | 263 | 186 | 23 | 3 | 103 | 5 | 6 | 40 | 114 | 38 | 12 | 22 |
Bambus2 | 240 | 2 | 113 | 240 | 28 | 3 | 109 | 3 | − | − | 28 | 1451 | 72 | 3178 |
OPERA | 677 | 10 | 108 | 260 | 23 | 2 | 63 | 3 | 24 | 199 | 133 | 603 | 20 | 3903 |
SCARPA | 112 | 8 | 35 | 363 | 24 | 4 | 159 | 3 | 17 | 263 | 45 | 441 | 17 | 1499 |
SGA | 304 | 1 | 41 | 344 | 33 | 6 | 223 | 4 | 5 | 89 | 88 | 16 | 86 | 3595 |
SOAP2 | 304 | 3 | 2504 | 101 | 35 | 6 | 169 | 4 | 6 | 349 | 119 | 627 | 34 | 2788 |
SOPRA | 109 | 2 | 29 | 320 | 24 | 4 | 165 | 3 | 11 | 64 | 19 | 1081 | 30 | 2562 |
SSPACE | 298 | 3 | 109 | 213 | 23 | 2 | 103 | 3 | 5 | 307 | 44 | 64 | 8 | 111 |
ScaffMatch | 335 | 15 | 2468 | 89 | 24 | 13 | 143 | 9 | 47 | 112 | 195 | 877 | 42 | 641 |
BOSS | 337 | 17 | 360 | 53 | 26 | 7 | 209 | 10 | 30 | 328 | 42 | 2514 | 43 | 5129 |
SCOP | 341 | 13 | 449 | 61 | 37 | 3 | 456 | 5 | 31 | 350 | 45 | 2457 | − | − |
Genome . | Escherichia coli . | E. coli . | Saccharomyces cerevisiae . | S. cerevisiae . | Chr X . | |||||
---|---|---|---|---|---|---|---|---|---|---|
Genome size | 4.6 M | 4.6 M | 12.1 M | 12.1 M | 155.2 M | |||||
Coverage | 41 | 29 | 234 | 196 | 57 | |||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
SSPACE-LR | 538 | 10 | 1175 | 6 | 200 | 101 | 190 | 215 | 809 | 176 |
LINKS | 445 | 6 | 693 | 5 | 203 | 54 | 231 | 52 | 183 | 100 |
npscarf | 444 | 13 | 1672 | 6 | 312 | 80 | 374 | 83 | 2325 | 157 |
SLR | 723 | 4 | 2927 | 4 | 374 | 52 | 374 | 46 | 2390 | 83 |
Genome . | Escherichia coli . | E. coli . | Saccharomyces cerevisiae . | S. cerevisiae . | Chr X . | |||||
---|---|---|---|---|---|---|---|---|---|---|
Genome size | 4.6 M | 4.6 M | 12.1 M | 12.1 M | 155.2 M | |||||
Coverage | 41 | 29 | 234 | 196 | 57 | |||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
SSPACE-LR | 538 | 10 | 1175 | 6 | 200 | 101 | 190 | 215 | 809 | 176 |
LINKS | 445 | 6 | 693 | 5 | 203 | 54 | 231 | 52 | 183 | 100 |
npscarf | 444 | 13 | 1672 | 6 | 312 | 80 | 374 | 83 | 2325 | 157 |
SLR | 723 | 4 | 2927 | 4 | 374 | 52 | 374 | 46 | 2390 | 83 |
Genome . | Escherichia coli . | E. coli . | Saccharomyces cerevisiae . | S. cerevisiae . | Chr X . | |||||
---|---|---|---|---|---|---|---|---|---|---|
Genome size | 4.6 M | 4.6 M | 12.1 M | 12.1 M | 155.2 M | |||||
Coverage | 41 | 29 | 234 | 196 | 57 | |||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
SSPACE-LR | 538 | 10 | 1175 | 6 | 200 | 101 | 190 | 215 | 809 | 176 |
LINKS | 445 | 6 | 693 | 5 | 203 | 54 | 231 | 52 | 183 | 100 |
npscarf | 444 | 13 | 1672 | 6 | 312 | 80 | 374 | 83 | 2325 | 157 |
SLR | 723 | 4 | 2927 | 4 | 374 | 52 | 374 | 46 | 2390 | 83 |
Genome . | Escherichia coli . | E. coli . | Saccharomyces cerevisiae . | S. cerevisiae . | Chr X . | |||||
---|---|---|---|---|---|---|---|---|---|---|
Genome size | 4.6 M | 4.6 M | 12.1 M | 12.1 M | 155.2 M | |||||
Coverage | 41 | 29 | 234 | 196 | 57 | |||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
SSPACE-LR | 538 | 10 | 1175 | 6 | 200 | 101 | 190 | 215 | 809 | 176 |
LINKS | 445 | 6 | 693 | 5 | 203 | 54 | 231 | 52 | 183 | 100 |
npscarf | 444 | 13 | 1672 | 6 | 312 | 80 | 374 | 83 | 2325 | 157 |
SLR | 723 | 4 | 2927 | 4 | 374 | 52 | 374 | 46 | 2390 | 83 |
Genome . | Escherichia coli . | E. coli . | E. coli . | Saccharomyces cerevisiae . | S. cerevisiae . | S. cerevisiae . | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Genome size | 4.6 M | 4.6 M | 4.6 M | 12.1 M | 12.1 M | 155.2 M | ||||||
Coverage | 1 | 5 | 10 | 1 | 5 | 10 | ||||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
SSPACE-LR | 58 | 3 | 600 | 1 | 2700 | 1 | 36 | 7 | 182 | 10 | 354 | 9 |
LINKS | 59 | 6 | 292 | 9 | 205 | 11 | 35 | 12 | 111 | 15 | 113 | 16 |
npscarf | 58 | 1 | 245 | 5 | 429 | 5 | 30 | 6 | 18 | 3 | 100 | 17 |
LRScaf | 59 | 0 | 1300 | 6 | 1600 | 1 | 32 | 6 | 154 | 9 | 299 | 9 |
Genome . | Escherichia coli . | E. coli . | E. coli . | Saccharomyces cerevisiae . | S. cerevisiae . | S. cerevisiae . | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Genome size | 4.6 M | 4.6 M | 4.6 M | 12.1 M | 12.1 M | 155.2 M | ||||||
Coverage | 1 | 5 | 10 | 1 | 5 | 10 | ||||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
SSPACE-LR | 58 | 3 | 600 | 1 | 2700 | 1 | 36 | 7 | 182 | 10 | 354 | 9 |
LINKS | 59 | 6 | 292 | 9 | 205 | 11 | 35 | 12 | 111 | 15 | 113 | 16 |
npscarf | 58 | 1 | 245 | 5 | 429 | 5 | 30 | 6 | 18 | 3 | 100 | 17 |
LRScaf | 59 | 0 | 1300 | 6 | 1600 | 1 | 32 | 6 | 154 | 9 | 299 | 9 |
Genome . | Escherichia coli . | E. coli . | E. coli . | Saccharomyces cerevisiae . | S. cerevisiae . | S. cerevisiae . | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Genome size | 4.6 M | 4.6 M | 4.6 M | 12.1 M | 12.1 M | 155.2 M | ||||||
Coverage | 1 | 5 | 10 | 1 | 5 | 10 | ||||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
SSPACE-LR | 58 | 3 | 600 | 1 | 2700 | 1 | 36 | 7 | 182 | 10 | 354 | 9 |
LINKS | 59 | 6 | 292 | 9 | 205 | 11 | 35 | 12 | 111 | 15 | 113 | 16 |
npscarf | 58 | 1 | 245 | 5 | 429 | 5 | 30 | 6 | 18 | 3 | 100 | 17 |
LRScaf | 59 | 0 | 1300 | 6 | 1600 | 1 | 32 | 6 | 154 | 9 | 299 | 9 |
Genome . | Escherichia coli . | E. coli . | E. coli . | Saccharomyces cerevisiae . | S. cerevisiae . | S. cerevisiae . | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Genome size | 4.6 M | 4.6 M | 4.6 M | 12.1 M | 12.1 M | 155.2 M | ||||||
Coverage | 1 | 5 | 10 | 1 | 5 | 10 | ||||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
SSPACE-LR | 58 | 3 | 600 | 1 | 2700 | 1 | 36 | 7 | 182 | 10 | 354 | 9 |
LINKS | 59 | 6 | 292 | 9 | 205 | 11 | 35 | 12 | 111 | 15 | 113 | 16 |
npscarf | 58 | 1 | 245 | 5 | 429 | 5 | 30 | 6 | 18 | 3 | 100 | 17 |
LRScaf | 59 | 0 | 1300 | 6 | 1600 | 1 | 32 | 6 | 154 | 9 | 299 | 9 |
Genome . | Homo sapiens . | H. sapiens . | H. sapiens . | |||
---|---|---|---|---|---|---|
Genome size | 2.9G | 2.9G | 2.9G | |||
Coverage | 40 | 17 | 18 | |||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
SALSA2 | 57 200 | 587 | 26 460 | 273 | 4630 | 259 |
3D-DNA | 28 610 | 816 | 21 470 | 828 | 4520 | 930 |
SALSA1 | 14 810 | 371 |
Genome . | Homo sapiens . | H. sapiens . | H. sapiens . | |||
---|---|---|---|---|---|---|
Genome size | 2.9G | 2.9G | 2.9G | |||
Coverage | 40 | 17 | 18 | |||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
SALSA2 | 57 200 | 587 | 26 460 | 273 | 4630 | 259 |
3D-DNA | 28 610 | 816 | 21 470 | 828 | 4520 | 930 |
SALSA1 | 14 810 | 371 |
Genome . | Homo sapiens . | H. sapiens . | H. sapiens . | |||
---|---|---|---|---|---|---|
Genome size | 2.9G | 2.9G | 2.9G | |||
Coverage | 40 | 17 | 18 | |||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
SALSA2 | 57 200 | 587 | 26 460 | 273 | 4630 | 259 |
3D-DNA | 28 610 | 816 | 21 470 | 828 | 4520 | 930 |
SALSA1 | 14 810 | 371 |
Genome . | Homo sapiens . | H. sapiens . | H. sapiens . | |||
---|---|---|---|---|---|---|
Genome size | 2.9G | 2.9G | 2.9G | |||
Coverage | 40 | 17 | 18 | |||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
SALSA2 | 57 200 | 587 | 26 460 | 273 | 4630 | 259 |
3D-DNA | 28 610 | 816 | 21 470 | 828 | 4520 | 930 |
SALSA1 | 14 810 | 371 |
In these studies, although different kinds of scaffolding methods were utilized on the different datasets, we still found some interesting results as follows:
(i) Even for scaffolding methods of the same kind, their performances on different datasets of the same genome are usually different. The characteristics of the sequencing data have a large impact on the final result.
(ii) For long-range sequencing data, their performances are good. As shown in Tables 3 and 4, scaffolding methods based on long reads perform better in terms of the MA value. The NGA50 values for different datasets are also large, which can be attributed to the length of long reads, thus resolving more problems caused by repetitive regions. To improve the accuracy of scaffolding results, users can select this kind of scaffolding method in their research.
Table 5 shows that scaffolding methods based on Hi-C paired reads perform better in terms of the NGA50 value, which demonstrates that Hi-C paired reads are helpful for distinguishing contigs from different chromosomes and guarantees the continuity of scaffolding results. Hence, Hi-C paired reads are significant for chromosome-level scaffolding. As shown in Tables 6 and 7, linked reads and reference genomes are helpful for scaffolding, and the NGA50 values are good.
(iii) For small genomes, scaffolding methods based on paired reads also perform well in terms of the NGA50 value, as shown in Table 2. A repetition problem rarely exist for small genomes, and paired reads can be more exactly aligned in contigs. If cost is a limitation, users can choose this type of scaffolding method.
(iv) Table 4 shows that as the sequencing coverage of long reads increases, the continuity of the scaffolding results improves. Ou et al. [126] also claimed that a more sequencing coverage (>40) is helpful for identifying more transposable elements in complex genomes. As shown in Table 2, the results are affected by not only the sequencing coverage, but also the insert size. A small insert size is not conducive to resolving the repetitive region problem. Table 5 shows that high coverage improves the value of NGA50 for the same species. Accurate and complete assembly of highly complex genomes will require more sequencing data [127].
Discussion
With the rapid development of sequencing technologies and assembly algorithms, the ability to obtain more complete and accurate complex genomes has advanced [9]. The most prominent problem in genome assembly is repetitive regions in the target genome. Since the long reads produced by TGS can more proportionally span the repetitive regions, the long reads are useful for addressing problems caused by repetitive regions and can effectively improve the assembly quality. Assembly methods based on long reads can be used to obtain high reference-quality genomes with contig sizes measured in megabases instead of kilobases and are common for second-generation attempts [17]. During the assembly contig process, the extension of a contig commonly ends when complex repetitive regions, low coverage regions or regions with sequencing errors are encountered.
Genome . | Caenorhabditis elegans . | Homo sapiens . | ||
---|---|---|---|---|
Genome size | 100 M | 2.9G | ||
Coverage | 50 | 136 | ||
NGA50 (kb) | MA | NGA50 (kb) | MA | |
ARKS | 8280 | 1359 | 448 | 217 |
ARCS | 8450 | 1473 | 410 | 207 |
fragScaff | 7730 | 1791 | 267 | 117 |
Architect | 7170 | 1452 | 174 | 454 |
Genome . | Caenorhabditis elegans . | Homo sapiens . | ||
---|---|---|---|---|
Genome size | 100 M | 2.9G | ||
Coverage | 50 | 136 | ||
NGA50 (kb) | MA | NGA50 (kb) | MA | |
ARKS | 8280 | 1359 | 448 | 217 |
ARCS | 8450 | 1473 | 410 | 207 |
fragScaff | 7730 | 1791 | 267 | 117 |
Architect | 7170 | 1452 | 174 | 454 |
Genome . | Caenorhabditis elegans . | Homo sapiens . | ||
---|---|---|---|---|
Genome size | 100 M | 2.9G | ||
Coverage | 50 | 136 | ||
NGA50 (kb) | MA | NGA50 (kb) | MA | |
ARKS | 8280 | 1359 | 448 | 217 |
ARCS | 8450 | 1473 | 410 | 207 |
fragScaff | 7730 | 1791 | 267 | 117 |
Architect | 7170 | 1452 | 174 | 454 |
Genome . | Caenorhabditis elegans . | Homo sapiens . | ||
---|---|---|---|---|
Genome size | 100 M | 2.9G | ||
Coverage | 50 | 136 | ||
NGA50 (kb) | MA | NGA50 (kb) | MA | |
ARKS | 8280 | 1359 | 448 | 217 |
ARCS | 8450 | 1473 | 410 | 207 |
fragScaff | 7730 | 1791 | 267 | 117 |
Architect | 7170 | 1452 | 174 | 454 |
Genome . | Brucella pinnipedialis B2/94 . | B. pinnipedialis B2/94 . | B. pinnipedialis B2/94 . | V. antiquarius Ex25 . | V. antiquarius Ex25 . | V. antiquarius Ex25 . | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Genome size | 3.4 M | 3.4 M | 3.4 M | 5 M | 5 M | 5 M | ||||||
Reference | B. pinnipedialis M163/99/10 | B. melitensis KU/RCF-31 | B. canis SCL | V. antiquarius Ex25 939 | V. parahaemolyticus GCSL R74 | V. campbellii DS40M4 | ||||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
CSAR (PROmer) | 254 | 7 | 241 | 7 | 24 | 6 | 1205 | 7 | 667 | 4 | 441 | 6 |
CSAR (NUCmer) | 266 | 7 | 241 | 4 | 241 | 4 | 1205 | 7 | 287 | 17 | 153 | 25 |
Mauve | 435 | 30 | 190 | 40 | 190 | 46 | 469 | 27 | 261 | 39 | 138 | 57 |
OSLay | 311 | 9 | 221 | 2 | 221 | 0 | 266 | 10 | 153 | 22 | 92 | 18 |
Projector2 | 301 | 19 | 157 | 11 | 157 | 8 | 233 | 23 | 233 | 24 | 138 | 24 |
Genome . | Brucella pinnipedialis B2/94 . | B. pinnipedialis B2/94 . | B. pinnipedialis B2/94 . | V. antiquarius Ex25 . | V. antiquarius Ex25 . | V. antiquarius Ex25 . | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Genome size | 3.4 M | 3.4 M | 3.4 M | 5 M | 5 M | 5 M | ||||||
Reference | B. pinnipedialis M163/99/10 | B. melitensis KU/RCF-31 | B. canis SCL | V. antiquarius Ex25 939 | V. parahaemolyticus GCSL R74 | V. campbellii DS40M4 | ||||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
CSAR (PROmer) | 254 | 7 | 241 | 7 | 24 | 6 | 1205 | 7 | 667 | 4 | 441 | 6 |
CSAR (NUCmer) | 266 | 7 | 241 | 4 | 241 | 4 | 1205 | 7 | 287 | 17 | 153 | 25 |
Mauve | 435 | 30 | 190 | 40 | 190 | 46 | 469 | 27 | 261 | 39 | 138 | 57 |
OSLay | 311 | 9 | 221 | 2 | 221 | 0 | 266 | 10 | 153 | 22 | 92 | 18 |
Projector2 | 301 | 19 | 157 | 11 | 157 | 8 | 233 | 23 | 233 | 24 | 138 | 24 |
Genome . | Brucella pinnipedialis B2/94 . | B. pinnipedialis B2/94 . | B. pinnipedialis B2/94 . | V. antiquarius Ex25 . | V. antiquarius Ex25 . | V. antiquarius Ex25 . | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Genome size | 3.4 M | 3.4 M | 3.4 M | 5 M | 5 M | 5 M | ||||||
Reference | B. pinnipedialis M163/99/10 | B. melitensis KU/RCF-31 | B. canis SCL | V. antiquarius Ex25 939 | V. parahaemolyticus GCSL R74 | V. campbellii DS40M4 | ||||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
CSAR (PROmer) | 254 | 7 | 241 | 7 | 24 | 6 | 1205 | 7 | 667 | 4 | 441 | 6 |
CSAR (NUCmer) | 266 | 7 | 241 | 4 | 241 | 4 | 1205 | 7 | 287 | 17 | 153 | 25 |
Mauve | 435 | 30 | 190 | 40 | 190 | 46 | 469 | 27 | 261 | 39 | 138 | 57 |
OSLay | 311 | 9 | 221 | 2 | 221 | 0 | 266 | 10 | 153 | 22 | 92 | 18 |
Projector2 | 301 | 19 | 157 | 11 | 157 | 8 | 233 | 23 | 233 | 24 | 138 | 24 |
Genome . | Brucella pinnipedialis B2/94 . | B. pinnipedialis B2/94 . | B. pinnipedialis B2/94 . | V. antiquarius Ex25 . | V. antiquarius Ex25 . | V. antiquarius Ex25 . | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Genome size | 3.4 M | 3.4 M | 3.4 M | 5 M | 5 M | 5 M | ||||||
Reference | B. pinnipedialis M163/99/10 | B. melitensis KU/RCF-31 | B. canis SCL | V. antiquarius Ex25 939 | V. parahaemolyticus GCSL R74 | V. campbellii DS40M4 | ||||||
NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | NGA50 (kb) | MA | |
CSAR (PROmer) | 254 | 7 | 241 | 7 | 24 | 6 | 1205 | 7 | 667 | 4 | 441 | 6 |
CSAR (NUCmer) | 266 | 7 | 241 | 4 | 241 | 4 | 1205 | 7 | 287 | 17 | 153 | 25 |
Mauve | 435 | 30 | 190 | 40 | 190 | 46 | 469 | 27 | 261 | 39 | 138 | 57 |
OSLay | 311 | 9 | 221 | 2 | 221 | 0 | 266 | 10 | 153 | 22 | 92 | 18 |
Projector2 | 301 | 19 | 157 | 11 | 157 | 8 | 233 | 23 | 233 | 24 | 138 | 24 |
Scaffolding methods have been widely used in many applications. In addition to reconstructing a single genome, scaffolding methods can also be used in pan-genome assembly. The pan-genome refers to a collection of all DNA sequences that occur in a species, which is significant for understanding genetic diversity. Compared with a single reference genome, the pan-genome can identify more variants that are related to phenotypes. Constructing of a pan-genome usually requires the sequencing of many individuals and the production of large amounts of sequencing data. Scaffolding methods can produce more complete individual references efficiently and improve the subsequent analysis of variants. RaGOO was used to order and orient 103 draft Arabidopsis thaliana genomes and to construct the corresponding pan-genome. Recently, genome assemblies for five sesame varieties were scaffolded using a reference-based method, and a pan-genome was produced for comparative genomic analyses and gene discovery [128].
Construction of a haplotype-resolved genome for species is important for population genetic analyses. Delaneau et al. [129] employed the scaffolding method SSPACE to generate supper-scaffolds, resulting in halotype-improved assembly. Yang et al. [130] evaluated three scaffolding methods based on Hi-C paired reads and selected SALSA2 to create haplotype-resolved, chromosome-level genome assemblies of Angus (taurine) and Brahman (indicine) cattle. Low et al. [131] generated scaffolds using the ARCS + LINKS pipeline which increased the assembly continuity.
More complete assembly allows the identification of new features of transposable elements and is helpful for reconstructing subgenome. Xu et al. [132] obtained draft contigs of white lupin by Canu, and the N50 value of the contigs was 1.76 M. Then, they employed 3ddna based on Hi-C paired reads to scaffold these contigs and obtained more continuous scaffolds, obtaining an N50 value of 18.66 Mb. Based on the assembly and analysis of the white lupin genome, the authors deciphered its diploid ancestral genome and reconstructed the three subgenomes. Cai et al. [133] reported an updated assembly of the Brassica oleracea reference genome and then used syntenic gene pairs between the updated assembly and Arabidopsis thaliana to construct three subgenomes.
Users must know which scaffolding methods to employ for assembly, and short paired reads may be a better choice for the scaffolding of small and noncomplex genomes. Scaffolding methods based on long reads are suggested for more continuous assemblies, and methods based on Hi-C paired reads can usually yield chromosome-level results. However, in addition to their advantages, each scaffolding method has some biases. Scaffolding methods based on short paired reads usually get stuck in repetitive regions, estimate the insert size and link two adjacent contigs that are far apart. Scaffolding methods based on Hi-C paired reads cannot provide the exact distance and orientation information about two contigs. The order of linked reads with the same barcode is unclear, which limits the application of linked reads for scaffolding.
Currently, most studies prefer to use hybrid data for scaffolding. Wallberg et al. [134] used three different scaffolding methods (ARCS + LINKS + LACHESIS) based on linked reads, long reads and Hi-C paired reads to produce a new genome assembly of the honeybee Apis mellifera. Xu et al. [135] first produced contigs using long reads and then used ARCS based on linked reads to obtain the final assembly of Malania oleifera. To reconstruct the grape phylloxera genome, Rispe et al. [136] used short paired reads to generate a draft assembly and then employed SSPANCE-Longread for scaffolding based on the long reads. Finally, the gaps were filled by GapFiller based on short paired reads. Li et al. [137] used a combination of long reads, short paired reads, linked reads and Hi-C paired reads to obtain the genome sequence of Asparagus setaceus. The scaffolding methods ARKS, LINKS and LACHESIS were all used in the assembly process. Arimoto et al. [138] assembled the Praesagittifera naikaiensis genome using BESST and LINKS to perform scaffolding with short paired reads and long reads.
Hence, when assembling more complex genomes, such as repeat-rich plant and mammalian genomes, combining multiple types of data and scaffolding methods can usually yield a high-quality genome.
After scaffolding, the DNA sequences of gaps in the scaffold are unknown. To improve the completeness of the assembly, some stand-alone gap filling methods have been proposed, which can be divided into two categories. (i) The first involves the methods based on paired reads, which usually involve an initial collection of reads that possibly cover the gaps based on the alignments between paired reads and scaffolds. Then, different strategies are utilized to construct a local assembly to fill the gaps. IMAGE [139] employs Velvet to obtain contigs for filling gaps, and GapCloser [49] is a gap filling module in the assembler SOAP2. When filling gaps, GapFiller [140] considers the difference between the gap size and the length filled. Sealer [141] selects paths from De Bruijn graphs that are represented by bloom filter data structures. Gap2Seq [142] solves the gap filling problem by finding a path in a directed graph whose length is in a given range. For a gap and its related reads, GapReduce [143] iteratively constructs De Bruijn graphs based on different k-mer sizes and frequencies and selects paths for filling the gaps. (ii) The second category involves the methods based on long reads. PBJelly [144] uses BLASR to align the long reads to scaffolds, and identify the long reads that possibly cover the gap. Then, it employs the OLC assembly algorithm to fill the gaps. GMcloser [145] uses the likelihood model to select reliable long reads to fill the gaps. LR_Gapcloser [146] segments each long read into short sequence fragments with the same length, which can be used to find more accurate alignments. TGS-GapCloser [147] corrects errors for long reads related to a gap.
Conclusions and future directions
Scaffolding is a significant step in genome assembly and can greatly improve the completeness and contiguity of assembly. With the continuous development of sequencing technologies, many different types of scaffolding methods have been developed and are widely used. Although the latest sequencing technologies have substantially advanced scaffolding, resolving intricate ambiguities is also challenging. In this review, the difficulties of scaffolding, characteristics of various kinds of sequencing data, and scaffolding methods are summarized and discussed.
Scaffolding methods commonly implement a scaffold graph based on the alignments between contigs and reads. Although many scaffolding methods based on distinct sequencing technologies are available, the problems caused by genome complexity and the drawbacks of sequencing technologies are difficult to resolve. Furthermore, sequencing errors, highly repetitive regions, and alignment errors lead to false edges and increase the complexity of the scaffold graph. In addition, graph search approaches, which are used to extract scaffolds, usually require many more constraints with respect to the number of vertices and edges in the scaffold graph. Moreover, the most powerful latest sequencing technologies and scaffold graph structure have not been fully explored.
For future research, our recommendations are as follows:
(i) By taking complete advantage of various sequencing data obtained by different technologies, scaffolding methods could yield more accurate results. Generally, one type of sequencing data supplies only part of the information about the order and orientations of the contigs. For example, paired reads can be more accurately aligned in contigs, long reads can span more repetitive regions, and Hi-C paired reads are helpful for chromosome-level scaffolding. More efforts should be made to integrate multiple sequencing data into a unified workflow, which is obviously challenging. For example, Hi-C paired reads can be used to distinguish contigs from distinct chromosomes first, and related reference genomes can provide a draft contig backbone. Then, repetitive regions in the same chromosome are approached by long reads. Paired reads improve the quality of local assembly. One new strategy that can simultaneously consider linked information from multiple kinds of data should be paid more attention. In particular, the consistencies and contradictions of various types of alignment information should be analyzed in-depth. For example, whether two contigs can be linked simultaneously by paired and long reads? And how to address two contigs being supported by only one type of read need to be further elucidated. Hence, for sequencing data integration, scaffolding methods should consider the different contributions of various data and design rational methods for weightings.
(ii) For the long reads produced by TGS technologies, the greatest advantage is the read length. Compared with paired reads, one long read can link two contigs separated by repetitive regions, and three or more contigs can be aligned with the same long read. Paired reads can link only two contigs, and it is difficult to span repetitive regions. For current methods based on paired reads or long reads, when constructing an optimization model to represent the scaffolding problem, the constraint condition usually refers to two contigs. Hence, for three or more contigs that can be aligned in the same long read, current methods based on long reads should pay more attention to designing a constraint condition that can include them. In addition, more accurate methods to identify repetitive regions in contigs based on long reads also require more effort.
(iii) Due to the low cost and high sequencing accuracy of NGS, paired reads from NGS are still widely used in many applications. Further studies on scaffolding methods based on paired reads still make sense. For future studies based on paired reads, the primary issue is to determine whether two contigs can be linked based on the positional information of paired reads aligned in them, and how to weight the link to reflect its confidence. For two contigs, more focus can be applied to the positional distribution of aligned paired reads to analyze their order and orientation. Moreover, identifying repetitive contigs based on paired reads also needs to be further investigated.
(iv) For scaffolding methods based on Hi-C paired reads, the gap sizes in the scaffolds are difficult to estimate. Hi-C paired reads reflect only the spatial proximity but do not directly reflect the linear distance. Current clustering methods, which distinguish contigs from different chromosomes, are usually developed based on scaffold graphs. The weights of the edges in the scaffold graph are important for clustering results and are usually related to the number of paired reads linking two vertices and to the length of contigs. Due to the complexity of the three-dimensional chromosome structure, some contigs from different chromosomes can be spatially close. This situation should be considered to enhance the accuracy of clustering. In addition, for two contigs, the correlation between the number of aligned Hi-C paired reads and their distance is currently ambiguous.
(v) The sequencing technology for linked reads balances the sequencing error rate and the length of fragments sequenced. Although linked reads with the same barcode can be grouped together, their order is unclear, which introduces some problems regarding the determination of whether two contigs can be linked. When linking contigs, current methods usually consider only the linked reads that can be aligned at the ends of the contigs. The ability to simultaneously align the linked reads in a group at two contig ends separately commonly means that the two contigs are adjacent. However, for two contigs, more information (the order of the aligned linked reads and the number of the unaligned linked reads from the same group, the consistency of aligned information between different groups) could be further utilized to determine their order and gap size.
(vi) Optical maps also supply long-range information that can guide scaffolding. Before optical maps are aligned in the contigs, the restriction sites in the contigs should be identified. However, the length of one restriction enzyme is commonly short, and more restriction sizes than expected are usually found. Moreover, the fragment sizes often have small deviations, which also influence the alignment results. Exact alignment methods for contigs and optical maps should be given more attention.
(vii) Reference-based scaffolding methods can take advantage of one or more related reference genomes to guide the process of scaffolding and are cost-efficient alternatives for scaffolding. However, some differences (small or large structural variations and polymorphisms) commonly exist between the related reference genome and the target genome. Hence, these methods may introduce some reference bias. For example, some contigs may not be aligned to the reference genomes or aligned at incorrect positions or orientations, and these contigs may be treated as chimaeras. Commonly, multiple related reference genomes can provide more consistent adjacency information about contigs than a single reference genome. Although some methods using multiple reference genomes have been developed, obtaining accurate alignments and synteny blocks between the contigs and the reference genomes is challenging. Combining long-range sequencing technologies with reference genomes may improve the completeness and accuracy of scaffolding results. Therefore, a draft scaffold set based on reference genomes can be produced. Then, long reads can be used to tackle the problems caused by repetitive regions or small structural variations, and Hi-C paired reads can be used to address large differences between contigs and reference genomes.
(viii) For scaffolding methods, most computational costs are consumed in the alignment step. Different alignment tools vary greatly in efficiency. Minimap and Minimap2 usually perform better than other tools when aligning long reads and contigs. As the scale of sequencing data increases, the running time and scaffolding memory will also increase. Therefore, selecting a rational alignment tool is important for scaffolding methods.
(ix) To fairly compare and evaluate various kinds of scaffolding methods, benchmark datasets and evaluation metrics should be constructed by collecting real sequencing data. In particular, the contig set should match the real situation, which could contain some MA and repetitive regions. Therefore, more efforts should be made to verify the scaffolding results.
Scaffolding methods help scientists gain deep insights into genomic research, including promoter cloning, gene identification, gene function analysis, gene expression regulation and comparative genomics analyses.
According to sequencing technologies and sequencing data, scaffolding methods are summarized and discussed in details. And, the metrics used for evaluating scaffolding methods are analyzed comprehensively.
To investigate the performance of the different types of scaffolding methods, some related computational experiments are analyzed.
Combination of multiple types of sequencing data using effective integrate methods can improve the performance of scaffolding methods in the future.
Funding
This work was supported in part by the National Natural Science Foundation of China under (grant nos. 61972134, 61602156, 61802113, 61772557 and 11601129), Henan Provincial Department of Science and Technology Research Project under (grant no. 192102210118), Doctor Foundation of Henan Polytechnic University under (grant no. B2018-36).
Junwei Luo is an Associate Professor in the College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China. His research interests include genome assembly, scaffolding, gap filling and structural variant detection.
Yawei Wei is a graduate student in the College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China. His research interests include bioinformatics and scaffolding.
Mengna Lyu is a graduate student in the College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China. Her research interests include bioinformatics and scaffolding.
Zhengjiang Wu is an Associate Professor in the College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China. His research interests include bioinformatics and data mining.
Xiaoyan Liu is a lecturer in the College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China. Her research interests include scaffolding methods and artificial intelligence.
Huimin Luo is an Associate Professor in the School of Computer and Information Engineering, Henan University, Kaifeng, China. Her research interests include bioinformatics and computational drug repositioning.
Chaokun Yan is an Associate Professor in the School of Computer and Information Engineering, Henan University, Kaifeng, China. His research interests include artificial intelligence and computational biology.