Abstract

In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.

Introduction

The basis of genomic research, including promoter cloning, gene identification, gene function analysis, gene expression regulation and comparative genomics analyses, is to obtain an accurate and complete reference genome [1–4]. The reference genome provides a persist structure which can be used for identifying specific sequence locations that are possibly related with diseases. For instance, we could align sequenced reads from one target in related reference genome to analyze DNA methylation, differential gene expression and structural variations. The reference genome also helps the analysis of the transcriptomic heterogeneity within populations of cells studies [5]. Hence, one accurate and complete reference genome can improve the quality and accuracy of the downstream analysis. Due to the complexity of the genome and the limitations of sequencing technologies, genome assemblers usually first produce fragmented long sequences (contigs). Next, scaffolding methods are utilized to infer the orientations and orders among these contigs and to produce some longer sequences (scaffolds). A scaffold comprises some ordered and oriented contigs, and the gap between any two adjacent contigs is filled with ‘N’s. Hence, scaffolding can make draft assembly more complete and contiguous and is an important step in genome assembly.

With the advancement of high-throughput sequencing (HTS) technologies, the reduction in the sequencing cost per genome has surpassed Moore’s law, and the cost of sequencing an entire genome as now dropped below $1000 genome [6]. The decreasing cost of HTS facilitates the widespread use of sequencing data, and typical HTS devices can generate over 100 gigabases of data in 24 h [7].

As a result of the development of HTS technologies, numerous scaffolding methods have been presented in the past decade, greatly advancing the field of genome assembly. Scaffolding methods generally utilize reads from sequencing technologies to link the different contigs and take advantage of the linked information to infer the order and orientation of contigs. The scaffolding result quality is highly dependent on the quality and length of the reads. However, different sequencing technologies produce reads with different characteristics [8–12]. Therefore, scaffolding methods based on different sequencing technologies have large differences in both their method designs and scaffolding results.

Short paired reads from next-generation sequencing (NGS), whose insert sizes can reach several kilobases, can span some repetitive regions and link different contigs. Moreover, paired reads from NGS have the advantages of low cost and high accuracy and have been used in many research fields. However, short reads reaches are only a few 100 bases in long, and the middle region between paired reads is unknown, resulting in important limitations, including ambiguous mapping, a limited ability to span long regions and amplification artifacts during library construction. Hence, the limitations of short paired reads make chromosome-level scaffolding impossible, and scaffolding errors occur frequently.

In the past few years, several new sequencing technologies have become available that are increasingly used to resolve problems in scaffolding. Comparing with NGS, long-range sequencing, represented by single-molecule real-time (SMRT) sequencing from Pacific Biosciences (PacBio; [13]) and nanopore-based sequencing from Oxford Nanopore Technologies (ONT; [14]), is capable of producing much longer reads. SMRT sequencing technology can produce an average read length of 10–30 kb [15], but reads can exceed 80 kb. Nanopore read length is depended on the length of the DNA molecules to be sequenced [16]. The longest read obtained by this method can be longer than 1 Mb [17]. These long reads can span most repetitive regions in the genome and link two or more different contigs, which greatly improves scaffolding quality [18]. However, the sequencing error rate for long reads is ~15% [17, 19–22], and obtaining accurate alignment between long reads and contigs is difficult. For error-prone long reads, high consensus accuracy can be achieved by hybrid error correction or selfcorrection methods, but the computation cost is expensive [23]. ONT has developed a new method that can reduce the error rate to ~3%, but the throughput is also reduced [16]. Recently, PacBio obtained a consensus sequence (HiFi read) from multiple passes of a long individual molecule. Its average length was ~10–25 kb, and the accuracy rate was >99.5% [24, 25].

The Hi-C protocol is capable of generating paired reads that are in close proximity in the 3-D space. The number of Hi-C paired reads from the same chromosome is significantly larger than that from different chromosomes [26, 27]. If two contigs come from the same chromosome, the number of paired reads linking them decreases approximately exponentially as a function of the genomic distance [28, 29]. Based on these two patterns, Hi-C paired reads can be used to distinguish contigs from the same chromosome as other contigs and deduce the order among contigs. Unfortunately, the true linear genomic distance between one set of paired reads varies greatly and ambiguously, as the minimum is only a few 1000 bases, and the maximum is a few 100 million bases, which prevents Hi-C paired reads from yielding accurate scaffolding results.

Recently, linked reads generated by chromium technology from 10X Genomics have become available for scaffolding [30]. This technology can capture a long-range sequence fragment with an average length of 100 kb and use the data to sequence some short barcoded reads. Therefore, linked reads with the same barcode can be utilized to link the different contigs. In addition, this technology cost less than that for long reads, and its accuracy is the same as that of NGS.

Optical mapping technologies [31], commercialized by Bionano and Opgen, can identify the locations of markers (restriction enzyme and nicking endonucleases) in a single molecule of DNA and provide a list of the ordered distances between any two markers. This information is finally used to construct a genomic optical map. Since the markers are known, it is easy to determine the distances between any two markers in each contig. Then, the contigs can be aligned in the optical map, and their orders and orientations can be deduced.

The technologies mentioned above commonly identify some or all parts of a long-range fragment and the identified parts can link the different contigs at a distance less than the length of the long-range fragment. However, the distribution characteristics of the identified parts, the lengths of the long-range fragments and the error rates substantially differ among these technologies. Current scaffolding methods commonly select one type of data mentioned previously as input, and various strategies are developed to exploit the links among contigs for scaffolding. Some scaffolding methods using hybrid data, protein sequences, or related reference genomes also exist. Although the field of scaffolding has been improved by these technologies, obtaining accurate scaffolding results remains challenging.

In this review, we summarize the widely used scaffolding methods and discuss both the strengths and weaknesses of sequencing data for scaffolding. Our purpose is to help readers gain a deeper understanding of the current development and future research in this domain. This article comprises the following parts. First, we introduce the difficulties of scaffolding. Second, according to read type, we divide the scaffolding methods into seven categories, including methods based on paired reads, long reads, Hi-C paired reads, linked reads, optical mapping, reference genomes and other data. Then, we review in detail how these methods take advantage of these reads to tackle the problems in scaffolding. Third, we review some scaffolding evaluation tools, and compare these scaffolding methods. Finally, the opportunities and directions of future research are discussed.

Scaffolding

After obtaining a set of contigs from an assembler, the strand from which they originate is unknown, and their locations in the genome are also unclear. However, many types of reads can link different contigs, and this information can be used to estimate the approximate distance between two contigs and whether they are on the same strand. Scaffolding methods take advantage of these links to deduce the order and orientation among contigs. Their input includes a set of contigs and a set of reads, and their final output is a set of scaffolds.

Before scaffolding, two issues should be considered. (i) The correction of contigs: generally, the contigs input to scaffolding methods may contain some misassemblies (MA), which may introduce incorrect links between contigs and complicate the process of scaffolding. Hence, it is important to detect MA in contigs, and a contig can be broken into two parts at the site of misassembly. At present, several methods are available to detect and correct MA in contigs. REAPR [32], NxRepair [33], PECC [34] and MEC [35] were designed based on paired reads.

And, (ii) Alignment tool: we need to utilize an alignment tool to align reads against the contigs and produce the link information. Bowtie [36], Bowtie2 [37] and BWA [38] can align short reads [paired-end (PE) reads, mate-paired reads, Hi-C paired reads and linked reads] produced by NGS against contigs. Basic local alignment with successive refinement (BLASR [39]), BWA, Minimap [40] and Minimap2 [41] can align long reads against contigs. For optical mapping data, each scaffolding method usually designs a scoring mechanism to determine the location of each contig on the optical map. However, obtaining accurate alignment is challenging, and the current alignment tools often score each alignment and list all possible alignments from which users can evaluate and select.

The difficulties of scaffolding include sequencing errors and repetitive regions in the genome, which are the primary reasons for the production of ambiguous alignments [42, 43], thus increasing the difficulty of inferring orders and orientations among contigs. If the sequencing accuracy is 100% and sequenced fragments are sufficiently long to span all repetitive regions, scaffolding is simple. However, the current sequencing technologies cannot maintain a high accuracy and yield and sufficiently long sequences simultaneously, as they always leverage sequencing error and length of sequenced fragments.

The reads produced by different sequencing technologies are very different and have different abilities to overcome the difficulties of scaffolding. Protein sequences and related reference genomes can also be used for scaffolding. Hence, the design of existing scaffolding methods is often based on one type of input data. Therefore, we divide existing scaffolding methods into seven categories based on data types: (i) scaffolding methods based on paired reads; (ii) scaffolding methods based on long reads; (iii) scaffolding methods based on Hi-C paired reads; (iv) scaffolding methods based on linked reads; (v) scaffolding methods based on optical mapping; (vi) scaffolding methods based on reference genomes and (vii) other scaffolding methods. We show the draft processes of these scaffolding methods in Figure 1. Generally, these scaffolding methods include two primary steps. First, alignment tools are utilized to obtain the links among contigs. Second, based on the alignment information, scaffolding methods use various strategies to deduce the order and orientation among contigs and produce the final scaffolds. We list the scaffolding methods in Table 1. Below, we will detail these different types of scaffolding methods.

Processes of six scaffolding method types. First, these reads are aligned against the contigs, or these contigs are aligned against optical maps or the reference genomes. Second, based on the alignment information, the order and orientation among contigs are deduced. Finally, scaffolds are output by these methods.
Figure 1

Processes of six scaffolding method types. First, these reads are aligned against the contigs, or these contigs are aligned against optical maps or the reference genomes. Second, based on the alignment information, the order and orientation among contigs are deduced. Finally, scaffolds are output by these methods.

Table 1

Scaffolding methods. The first column shows the type of scaffolding method, the second column shows the name of the scaffolding method and the publication year, and the last column shows the link used to obtain the method

TypeMethodURL
Scaffolding methods based on paired reads (paired-end/mate-paired reads)Bambus (2003) [62]http://www.tigr.org/software/bambus
SSPACE (2011) [52]www.baseclear.com/bioinformatics-tools/
MIP (2011) [53]http://www.cs.helsinki.fi/u/lmsalmel/mip-scaffolder/
Bambus2 (2011) [63]http://amos.sf.net
Opera (2011) [54]http://sourceforge.net/projects/operasf
GRASS (2012) [59]http://code.google.com/p/tud-scaffolding/
SLIQ (2012) [60]https://omictools.com/sliq-tool
SOAPdenovo2 (2012) [49]https://sourceforge.net/project/soapdenovo2/
SCARPA (2013) [50]http://compbio.cs.toronto.edu/scarpa
SILP2 (2014) [66]http://dna.engr.uconn.edu/software/SILP2
Scaftools (2014) [65]https://hal.archives-ouvertes.fr/hal-01198359
BESST (2014) [55]https://github.com/ksahlin/BESST
ScaffMatch (2015) [51]http://alan.cs.gsu.edu/NGS/?q=content/scaffmatch
WiseScaffolder (2015) [68]http://abims.sb-roscoff.fr/wisescaffolder
SOPRA (2016) [58]http://www.physics.rutgers.edu/∼anirvans/SOPRA/
BOSS (2017) [64]https://github.com/bioinfomaticsCSU/BOSS
inGap-sf (2017) [1]https://sourceforge.net/projects/ingap-sf/
BATISCAF (2018) [70]https://github.com/mandricigor/batiscaf
iLSLS (2018) [67]https://github.com/bioinfomaticsCSU/iLSLS
SCOP (2019) [71]https://github.com/bioinfomaticsCSU/SCOP
Scaffolding methods based on long readsSSPACE-LongRead (2014) [75]https://www.baseclear.com/services/bioinformatics/basetools/
LINKS(2015) [77]https://github.com/warrenlr/LINKS/
SMIS (2015) [78]https://www.sanger.ac.uk/tool/smis/
RAILS (2016) [79]https://github.com/warrenlr/RAILS
DBG2OLC (2016) [80]https://github.com/yechengxi/DBG2OLC
SMSC (2017) [76]https://bitbucket.org/NDBL/smsc
npScarf (2017) [82]https://github.com/mdcao/npScarf
LRScaf (2019) [82]https://github.com/shingocat/lrscaf
SLR (2019) [83]https://github.com/luojunwei/SLR
Scaffolding methods based on linked readsfragScaff (2014) [86]https://sourceforge.net/projects/fragscaff/
ARCS (2017) [84]https://github.com/bcgsc/ARCS/
ARKS (2018) [85]https://github.com/bcgsc/arks
AnVIL(2020) [88]https://github.com/markhilt/AnVIL
Scaffolding methods based on Hi-C paired readsLACHESIS (2013) [28]https://github.com/shendurelab/LACHESIS
GRAAL (2014) [91]https://github.com/koszullab/GRAAL
3D-DNA (2017) [29]https://github.com/theaidenlab/3d-dna
SALSA (2017) [93]https://github.com/marbl/SALSA
SALSA2 (2019) [94]https://github.com/machinegun/SALSA
HiCAssembler (2019) [95]https://github.com/maxplanck-ie/HiCAssembler
instaGRAAL(2020) [92]https://github.com/koszullab/instaGRAAL
Scaffolding methods based on optical mapsSOMA (2008) [96]ftp://ftp.cbcb.umd.edu/pub/software/soma
AGORA (2012) [97]https://static-content.springer.com/esm/art%3A10.1186%2F1471-2105-13-189/MediaObjects/12859_2012_5306_MOESM3_ESM.zip
WGM (2012) [98]https://www.nature.com/articles/nbt.2478
SewingMachine (2015) [100]https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
OMGS(2020) [101]https://github.com/ucrbioinfo/OMGS
Scaffolding methods based on reference genomesCAR (2014) [110]https://genome.cs.nthu.edu.tw/CAR/
MeDuSa (2015) [107]https://combo.dbe.unifi.it/medusa
Multi-CAR (2016) [111]https://genome.cs.nthu.edu.tw/Multi-CAR/
Ragout2 (2018) [102]http://fenderglass.github.io/Ragout/
CSAR (2018) [113]https://github.com/ablab-nthu/CSAR
CSAR-web (2018) [112]https://genome.cs.nthu.edu.tw/CSAR-web/
RaGOO (2019) [102]https://github.com/malonge/RaGOO
Other scaffolding methodsSWIPS (2013) [118]https://www.well.ox.ac.uk/yli142/swips.html
OPERA-LG (2016) [116]https://sourceforge.net/projects/operasf
PEP_scaffolder (2016) [120]https://www.fishbrowser.org/software/PEPscaffolder/
TypeMethodURL
Scaffolding methods based on paired reads (paired-end/mate-paired reads)Bambus (2003) [62]http://www.tigr.org/software/bambus
SSPACE (2011) [52]www.baseclear.com/bioinformatics-tools/
MIP (2011) [53]http://www.cs.helsinki.fi/u/lmsalmel/mip-scaffolder/
Bambus2 (2011) [63]http://amos.sf.net
Opera (2011) [54]http://sourceforge.net/projects/operasf
GRASS (2012) [59]http://code.google.com/p/tud-scaffolding/
SLIQ (2012) [60]https://omictools.com/sliq-tool
SOAPdenovo2 (2012) [49]https://sourceforge.net/project/soapdenovo2/
SCARPA (2013) [50]http://compbio.cs.toronto.edu/scarpa
SILP2 (2014) [66]http://dna.engr.uconn.edu/software/SILP2
Scaftools (2014) [65]https://hal.archives-ouvertes.fr/hal-01198359
BESST (2014) [55]https://github.com/ksahlin/BESST
ScaffMatch (2015) [51]http://alan.cs.gsu.edu/NGS/?q=content/scaffmatch
WiseScaffolder (2015) [68]http://abims.sb-roscoff.fr/wisescaffolder
SOPRA (2016) [58]http://www.physics.rutgers.edu/∼anirvans/SOPRA/
BOSS (2017) [64]https://github.com/bioinfomaticsCSU/BOSS
inGap-sf (2017) [1]https://sourceforge.net/projects/ingap-sf/
BATISCAF (2018) [70]https://github.com/mandricigor/batiscaf
iLSLS (2018) [67]https://github.com/bioinfomaticsCSU/iLSLS
SCOP (2019) [71]https://github.com/bioinfomaticsCSU/SCOP
Scaffolding methods based on long readsSSPACE-LongRead (2014) [75]https://www.baseclear.com/services/bioinformatics/basetools/
LINKS(2015) [77]https://github.com/warrenlr/LINKS/
SMIS (2015) [78]https://www.sanger.ac.uk/tool/smis/
RAILS (2016) [79]https://github.com/warrenlr/RAILS
DBG2OLC (2016) [80]https://github.com/yechengxi/DBG2OLC
SMSC (2017) [76]https://bitbucket.org/NDBL/smsc
npScarf (2017) [82]https://github.com/mdcao/npScarf
LRScaf (2019) [82]https://github.com/shingocat/lrscaf
SLR (2019) [83]https://github.com/luojunwei/SLR
Scaffolding methods based on linked readsfragScaff (2014) [86]https://sourceforge.net/projects/fragscaff/
ARCS (2017) [84]https://github.com/bcgsc/ARCS/
ARKS (2018) [85]https://github.com/bcgsc/arks
AnVIL(2020) [88]https://github.com/markhilt/AnVIL
Scaffolding methods based on Hi-C paired readsLACHESIS (2013) [28]https://github.com/shendurelab/LACHESIS
GRAAL (2014) [91]https://github.com/koszullab/GRAAL
3D-DNA (2017) [29]https://github.com/theaidenlab/3d-dna
SALSA (2017) [93]https://github.com/marbl/SALSA
SALSA2 (2019) [94]https://github.com/machinegun/SALSA
HiCAssembler (2019) [95]https://github.com/maxplanck-ie/HiCAssembler
instaGRAAL(2020) [92]https://github.com/koszullab/instaGRAAL
Scaffolding methods based on optical mapsSOMA (2008) [96]ftp://ftp.cbcb.umd.edu/pub/software/soma
AGORA (2012) [97]https://static-content.springer.com/esm/art%3A10.1186%2F1471-2105-13-189/MediaObjects/12859_2012_5306_MOESM3_ESM.zip
WGM (2012) [98]https://www.nature.com/articles/nbt.2478
SewingMachine (2015) [100]https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
OMGS(2020) [101]https://github.com/ucrbioinfo/OMGS
Scaffolding methods based on reference genomesCAR (2014) [110]https://genome.cs.nthu.edu.tw/CAR/
MeDuSa (2015) [107]https://combo.dbe.unifi.it/medusa
Multi-CAR (2016) [111]https://genome.cs.nthu.edu.tw/Multi-CAR/
Ragout2 (2018) [102]http://fenderglass.github.io/Ragout/
CSAR (2018) [113]https://github.com/ablab-nthu/CSAR
CSAR-web (2018) [112]https://genome.cs.nthu.edu.tw/CSAR-web/
RaGOO (2019) [102]https://github.com/malonge/RaGOO
Other scaffolding methodsSWIPS (2013) [118]https://www.well.ox.ac.uk/yli142/swips.html
OPERA-LG (2016) [116]https://sourceforge.net/projects/operasf
PEP_scaffolder (2016) [120]https://www.fishbrowser.org/software/PEPscaffolder/
Table 1

Scaffolding methods. The first column shows the type of scaffolding method, the second column shows the name of the scaffolding method and the publication year, and the last column shows the link used to obtain the method

TypeMethodURL
Scaffolding methods based on paired reads (paired-end/mate-paired reads)Bambus (2003) [62]http://www.tigr.org/software/bambus
SSPACE (2011) [52]www.baseclear.com/bioinformatics-tools/
MIP (2011) [53]http://www.cs.helsinki.fi/u/lmsalmel/mip-scaffolder/
Bambus2 (2011) [63]http://amos.sf.net
Opera (2011) [54]http://sourceforge.net/projects/operasf
GRASS (2012) [59]http://code.google.com/p/tud-scaffolding/
SLIQ (2012) [60]https://omictools.com/sliq-tool
SOAPdenovo2 (2012) [49]https://sourceforge.net/project/soapdenovo2/
SCARPA (2013) [50]http://compbio.cs.toronto.edu/scarpa
SILP2 (2014) [66]http://dna.engr.uconn.edu/software/SILP2
Scaftools (2014) [65]https://hal.archives-ouvertes.fr/hal-01198359
BESST (2014) [55]https://github.com/ksahlin/BESST
ScaffMatch (2015) [51]http://alan.cs.gsu.edu/NGS/?q=content/scaffmatch
WiseScaffolder (2015) [68]http://abims.sb-roscoff.fr/wisescaffolder
SOPRA (2016) [58]http://www.physics.rutgers.edu/∼anirvans/SOPRA/
BOSS (2017) [64]https://github.com/bioinfomaticsCSU/BOSS
inGap-sf (2017) [1]https://sourceforge.net/projects/ingap-sf/
BATISCAF (2018) [70]https://github.com/mandricigor/batiscaf
iLSLS (2018) [67]https://github.com/bioinfomaticsCSU/iLSLS
SCOP (2019) [71]https://github.com/bioinfomaticsCSU/SCOP
Scaffolding methods based on long readsSSPACE-LongRead (2014) [75]https://www.baseclear.com/services/bioinformatics/basetools/
LINKS(2015) [77]https://github.com/warrenlr/LINKS/
SMIS (2015) [78]https://www.sanger.ac.uk/tool/smis/
RAILS (2016) [79]https://github.com/warrenlr/RAILS
DBG2OLC (2016) [80]https://github.com/yechengxi/DBG2OLC
SMSC (2017) [76]https://bitbucket.org/NDBL/smsc
npScarf (2017) [82]https://github.com/mdcao/npScarf
LRScaf (2019) [82]https://github.com/shingocat/lrscaf
SLR (2019) [83]https://github.com/luojunwei/SLR
Scaffolding methods based on linked readsfragScaff (2014) [86]https://sourceforge.net/projects/fragscaff/
ARCS (2017) [84]https://github.com/bcgsc/ARCS/
ARKS (2018) [85]https://github.com/bcgsc/arks
AnVIL(2020) [88]https://github.com/markhilt/AnVIL
Scaffolding methods based on Hi-C paired readsLACHESIS (2013) [28]https://github.com/shendurelab/LACHESIS
GRAAL (2014) [91]https://github.com/koszullab/GRAAL
3D-DNA (2017) [29]https://github.com/theaidenlab/3d-dna
SALSA (2017) [93]https://github.com/marbl/SALSA
SALSA2 (2019) [94]https://github.com/machinegun/SALSA
HiCAssembler (2019) [95]https://github.com/maxplanck-ie/HiCAssembler
instaGRAAL(2020) [92]https://github.com/koszullab/instaGRAAL
Scaffolding methods based on optical mapsSOMA (2008) [96]ftp://ftp.cbcb.umd.edu/pub/software/soma
AGORA (2012) [97]https://static-content.springer.com/esm/art%3A10.1186%2F1471-2105-13-189/MediaObjects/12859_2012_5306_MOESM3_ESM.zip
WGM (2012) [98]https://www.nature.com/articles/nbt.2478
SewingMachine (2015) [100]https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
OMGS(2020) [101]https://github.com/ucrbioinfo/OMGS
Scaffolding methods based on reference genomesCAR (2014) [110]https://genome.cs.nthu.edu.tw/CAR/
MeDuSa (2015) [107]https://combo.dbe.unifi.it/medusa
Multi-CAR (2016) [111]https://genome.cs.nthu.edu.tw/Multi-CAR/
Ragout2 (2018) [102]http://fenderglass.github.io/Ragout/
CSAR (2018) [113]https://github.com/ablab-nthu/CSAR
CSAR-web (2018) [112]https://genome.cs.nthu.edu.tw/CSAR-web/
RaGOO (2019) [102]https://github.com/malonge/RaGOO
Other scaffolding methodsSWIPS (2013) [118]https://www.well.ox.ac.uk/yli142/swips.html
OPERA-LG (2016) [116]https://sourceforge.net/projects/operasf
PEP_scaffolder (2016) [120]https://www.fishbrowser.org/software/PEPscaffolder/
TypeMethodURL
Scaffolding methods based on paired reads (paired-end/mate-paired reads)Bambus (2003) [62]http://www.tigr.org/software/bambus
SSPACE (2011) [52]www.baseclear.com/bioinformatics-tools/
MIP (2011) [53]http://www.cs.helsinki.fi/u/lmsalmel/mip-scaffolder/
Bambus2 (2011) [63]http://amos.sf.net
Opera (2011) [54]http://sourceforge.net/projects/operasf
GRASS (2012) [59]http://code.google.com/p/tud-scaffolding/
SLIQ (2012) [60]https://omictools.com/sliq-tool
SOAPdenovo2 (2012) [49]https://sourceforge.net/project/soapdenovo2/
SCARPA (2013) [50]http://compbio.cs.toronto.edu/scarpa
SILP2 (2014) [66]http://dna.engr.uconn.edu/software/SILP2
Scaftools (2014) [65]https://hal.archives-ouvertes.fr/hal-01198359
BESST (2014) [55]https://github.com/ksahlin/BESST
ScaffMatch (2015) [51]http://alan.cs.gsu.edu/NGS/?q=content/scaffmatch
WiseScaffolder (2015) [68]http://abims.sb-roscoff.fr/wisescaffolder
SOPRA (2016) [58]http://www.physics.rutgers.edu/∼anirvans/SOPRA/
BOSS (2017) [64]https://github.com/bioinfomaticsCSU/BOSS
inGap-sf (2017) [1]https://sourceforge.net/projects/ingap-sf/
BATISCAF (2018) [70]https://github.com/mandricigor/batiscaf
iLSLS (2018) [67]https://github.com/bioinfomaticsCSU/iLSLS
SCOP (2019) [71]https://github.com/bioinfomaticsCSU/SCOP
Scaffolding methods based on long readsSSPACE-LongRead (2014) [75]https://www.baseclear.com/services/bioinformatics/basetools/
LINKS(2015) [77]https://github.com/warrenlr/LINKS/
SMIS (2015) [78]https://www.sanger.ac.uk/tool/smis/
RAILS (2016) [79]https://github.com/warrenlr/RAILS
DBG2OLC (2016) [80]https://github.com/yechengxi/DBG2OLC
SMSC (2017) [76]https://bitbucket.org/NDBL/smsc
npScarf (2017) [82]https://github.com/mdcao/npScarf
LRScaf (2019) [82]https://github.com/shingocat/lrscaf
SLR (2019) [83]https://github.com/luojunwei/SLR
Scaffolding methods based on linked readsfragScaff (2014) [86]https://sourceforge.net/projects/fragscaff/
ARCS (2017) [84]https://github.com/bcgsc/ARCS/
ARKS (2018) [85]https://github.com/bcgsc/arks
AnVIL(2020) [88]https://github.com/markhilt/AnVIL
Scaffolding methods based on Hi-C paired readsLACHESIS (2013) [28]https://github.com/shendurelab/LACHESIS
GRAAL (2014) [91]https://github.com/koszullab/GRAAL
3D-DNA (2017) [29]https://github.com/theaidenlab/3d-dna
SALSA (2017) [93]https://github.com/marbl/SALSA
SALSA2 (2019) [94]https://github.com/machinegun/SALSA
HiCAssembler (2019) [95]https://github.com/maxplanck-ie/HiCAssembler
instaGRAAL(2020) [92]https://github.com/koszullab/instaGRAAL
Scaffolding methods based on optical mapsSOMA (2008) [96]ftp://ftp.cbcb.umd.edu/pub/software/soma
AGORA (2012) [97]https://static-content.springer.com/esm/art%3A10.1186%2F1471-2105-13-189/MediaObjects/12859_2012_5306_MOESM3_ESM.zip
WGM (2012) [98]https://www.nature.com/articles/nbt.2478
SewingMachine (2015) [100]https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
OMGS(2020) [101]https://github.com/ucrbioinfo/OMGS
Scaffolding methods based on reference genomesCAR (2014) [110]https://genome.cs.nthu.edu.tw/CAR/
MeDuSa (2015) [107]https://combo.dbe.unifi.it/medusa
Multi-CAR (2016) [111]https://genome.cs.nthu.edu.tw/Multi-CAR/
Ragout2 (2018) [102]http://fenderglass.github.io/Ragout/
CSAR (2018) [113]https://github.com/ablab-nthu/CSAR
CSAR-web (2018) [112]https://genome.cs.nthu.edu.tw/CSAR-web/
RaGOO (2019) [102]https://github.com/malonge/RaGOO
Other scaffolding methodsSWIPS (2013) [118]https://www.well.ox.ac.uk/yli142/swips.html
OPERA-LG (2016) [116]https://sourceforge.net/projects/operasf
PEP_scaffolder (2016) [120]https://www.fishbrowser.org/software/PEPscaffolder/

Scaffolding methods based on paired reads

With the continuous evolution of NGS technologies, sequencing costs continue to decrease and sequencing throughput continues to increase. NGS technologies have been widely applied to many research fields with great success [44–47]. NGS technologies can generally produce both single-end and PE reads. However, the read length of NGS is short. Single-end reads can link only two contigs that are closely together in the genome, and it is difficult to obtain accurate alignments in repetitive regions [48].

Paired reads (PE/mate-paired reads) come from two ends of a long sequence fragment and are not on the same strand. The size of the long sequence fragment, called the insert size, can reach several 1000 bases. Alignment of one set of paired reads is aligned in two different contigs usually indicates that the distance of the two contigs is smaller than the insert size. If the alignment orientations are the same, the two contigs are in the reverse orientation; otherwise, they are in the same orientation. Therefore, paired reads are typically used to infer order and orientation among contigs. Currently, many scaffolding methods based on paired reads have been developed. These methods generally include the following three steps.

Preprocessing

Due to the limitations of read length, repetitive regions and sequencing errors, one short read may be aligned at incorrect or multiple locations. Most scaffolding methods adopt stand-alone alignment tools (Bowtie, Bowtie2 and BWA) to obtain the positions of reads in the contigs. SOAPDenovo2 [49] uses its own alignment tool. SCARPA [50] and ScaffMatch [51] utilize a simple approach to address these ambiguous alignments, as alignment of one paired reads at multiple positions results in the deletion of alignment information. SSPACE [52] removes the paired reads that are both aligned in multiple positions. By analyzing the characteristics of paired reads that are aligned in the same region, MIP [53] removes inconsistent reads. To determine whether the contigs are repetitive regions, ScaffMatch first calculates the average read coverage of the contigs and then removes the contigs with high read coverage. For the detection and removal of MA in contigs, SOPRA infers whether the misassembly occurs by analyzing the number of paired reads aligned in the same contig.

Constructing a scaffold graph

Scaffolding methods usually involve construction of a graph to store links among contigs. These methods extract some paths from the graph, and each path corresponds to a scaffold. In different studies, this graph has several different names, such as a scaffold graph/scaffolding graph (MIP, Opera [54], SCAPRA, BESST [55], integer linear programming (ILP) [56], ScaffMatch, SWALO [57]), a contig connectivity graph (SOPRA [58]), a contig link graph (GRASS [59]), a contig graph (SLIQ [60]) and a directed graph (ScaffoldScaffolder [61]). In this review, we uniformly call these graph scaffold graphs.

In a scaffold graph, each vertex usually denotes a contig or one end of a contig, and each edge means that paired reads link related vertices. An edge in a scaffold graph determines the order and orientation of two linked vertices. In constructing a scaffold graph, the most important issues are how to determine whether to add an edge between two contigs, and how to set the edge weights. The weight of an edge reflects the reliability of the edge.

Scaffolding methods, such as MIP, Bambus [62], Bambus2 [63], Opera, SSPACE and ScaffoldScaffolder, add an edge between two vertices if the number of paired reads linking them exceeds one threshold to eliminate the misaligned paired reads. Commonly, this threshold is set arbitrarily by users. Unlike other methods, Opera involves the development of a simulation method to set this threshold based on genome size and sequence coverage. The weights of the edges in the above methods are set to be the number of links. In SCARPA each contig corresponds to two vertices, representing the 3′-end and the 5′-end of a contig. The forward and reverse chains of each contig in ScaffMatch correspond to two vertices. BESST and BOSS [64] are two scaffolding methods that use the distribution of paired reads to construct a scaffold graph. BESST considers paired reads that link two contigs to analyze the difference between the expected and observed standard deviation of gap distance and determines whether to add an edge based on the difference. However, BOSS further considers the paired reads in which only one mate read is aligned in the contigs to find the expected number of paired reads linking two contigs. Based on the ratio of the expected number and real number, BOSS determines whether to add an edge and sets the weight to be the ratio.

Determining orientation and order among contigs

For two contigs, the edge linking them determines whether they are on the same strand and the approximate distance between them. Due to misaligned paired reads, some incorrect edges are inevitably introduced in a scaffold graph, which commonly leads to some contradictions in orientation and order.

Hence, based on the different optimization strategies, scaffolding methods usually initially detect and the remove edges that could lead to the elimination of contradictions in the scaffold graph. Next, scaffolding methods use various strategies to select consistent paths from the scaffold graph, which determines the orientation and order of contigs in each path. These strategies can be divided into three categories as follows:

  • (i) Many different optimization algorithms are used for this problem. Bambus turns the contig orientation problem into the problem of coloring a bipartite graph. Bambus2 removes the minimum weighted edges to prohibit inconsistent directions in the graph and then uses the optimal linear arrangement model to determine the orientation and order of the contigs. To determine the orientations of the contigs, SOPRA aims to assign an orientation to each contig to satisfy as many paired reads as possible. In addition, it transfers finding an optimization solution to the maximum weight cut problem. To determine the order of the contigs, SOPRA continually assigns a position to each contig and utilizes a greedy approach to determine the distances between contigs that best agrees with the sizes of the gaps, as suggested by the edges. SLIQ first identifies and removes junctions from the scaffold graph. By deleting the edge with the lowest weight and strongly connected components, SLIQ guarantees that the scaffold graph is a directed acyclic graph. Finally, each connected component is extracted as a separate scaffold. SCARPA first determines only the orientation of all contigs. It transfers the contig orientation problem to a minimum odd cycle transversal problem and utilizes a fixed-parameter approach to delete some edges and to guarantee that there are no odd cycles in the scaffoldgraph. Second, SCARPA eliminates all directed cycles from the scaffold graph for ordering contigs. The problem is defined as the feedback arc set problem. Finally, by assigning a starting coordinate to each contig, SCARPA sets the order constraint for each contig and uses a linear programming approach to find a solution. Opera defines ordering and orienting contigs as a bounded version of the graph bandwidth problem. Then, Opera takes advantage of dynamic programming to find the exact scaffolding solution based on the scaffold graph. ScaffMatch translates the entire orientation and order inference problem into a maximum-weight acyclic 2-matching problem and iteratively removes some edges to prohibit cycles in the graph. In the final constructed acyclic graph, the vertices are linearized. ScaffoldScaffolder defines the contig orientation problem as a MAX-DIR problem, and aims to find at least k directed edges in a bidirected graph by assigning an orientation to each vertex. Scaftools [65] uses an iterative method of breaking cycles to determine the orientation and order relationship of the contigs in the contig connectivity graph.

  • (ii) Many scaffolding methods have been developed based on linear programming methods. The important issues in this kind of method are constraint functions and optimization objectives. For each edge in a scaffold graph, MIP obtains six constraints. A variable exists to relax each constraint. Then, an optimization objective is presented to minimize the amount of stretching in these constraints. Finally, MIP utilizes mixed-integer programming to find the solution. In addition, MIP divides the scaffold graph into biconnected components and then conducts scaffolding on each subgraph. GRASS presents a mixed-integer quadratic programming formulation of the scaffolding problem. Each edge is designed as a distance constraint, order constraint and orientation constraint. The optimization objective is to minimize the number of unsatisfactory constraints, whereas the expectation maximization (EM) approach is employed to solve the problem. SILP2 [66] builds a maximum likelihood model to remove links with low reliability. For links between two contigs, the reliability is calculated based on the number of links, the probability of repeated alignment, and the probability of abnormal coverage. Then, the model is solved by using ILP. BOSS utilizes an iterative strategy to find spurious edges by first constructing a subgraph that includes only edges with a large weight. Next, it iteratively adds the remaining edges to the subgraph from large to small weights. Each iteration includes a subgraph, and BOSS builds two linear programming models to resolve orientation and position contradictions in the subgraph. Finally, BOSS employs a heuristic approach to sort vertices (contigs) and then generates scaffolds.

  • (iii) Based on the scaffold graph, heuristic algorithms are designed to resolve the problem of fork paths. SSPACE utilizes a direct heuristic approach to extract paths from the scaffold graph. It selects the largest contig as the initial path and extends the path from its two ends by iteratively combining contigs. When there are multiple contigs with edges linking the end of the path, it calculates a ratio between the two best edge weights. If the ratio is below a user-defined threshold, the best candidate is added to the path. BESST constructs a scaffold graph that includes only the large contigs and selects paths from the graph using a method similar to SSPACE. Next, BESST utilizes a breadth-first search to place the small contigs between two large contigs. An MP library usually contains some PE reads that confuse the scaffolding. Based on BESST, an ILP model is constructed, whose objective function considers the discrepancies between MP and PE links. iLSLS [67] is a scaffolding method based on a greedy strategy. This method first determines the valuable contigs among all contigs based on the relationship between the paired reads and the contigs and removes the error regions of some unreliable contigs; then, iLSLS is adapted for the problem of uneven sequence depth and employs a ‘loose- strict-loose’ extended path strategy, which not only prevents the loss of correct connections between contigs but also reduces false contig extension. Finally, iLSLS also utilizes a scoring approach for more than one path. The use of a series of methods enables iLSLS to not only accurately estimate the gap size but also effectively improve the accuracy of the scaffolds. WiseScaffolder [68] uses a manual intervention model to optimize the scaffolding method. Weller et al. [69] used a tree decomposition method to solve the orientation and order relationship of contigs. inGap-sf [1] uses direct links and paired links to order contigs. A direct link is created through the detection of overlaps between all contig pairs. Paired links are created by paired reads that can be mapped to any pair of contigs. Then, inGap-sf travels a direct link graph to order contigs and uses a statistic-based estimation model based on paired links to solve fork problems.

Identifying repetitive regions in contigs is significant for scaffolding. Opera identifies and filters repetitive contigs whose read coverage is large. BATISCAF [70] classifies contigs as either repeat contigs or unique contigs. It first orders and orients only unique contigs and forms a sketch of all contigs. Finally, it inserts repeat and short contigs into the sketch and produces the final scaffolds. According to the GC-content and alignments of paired reads, SCOP [71] first classifies the contigs into three categories (true, uncertain and misassembled). This classification helps to detect MA and repetitive regions in contigs. After constructing a scaffold graph, SCOP not only detects spurious edges, but also splits contigs based on their categories. Most contradictions in the scaffold graph can be removed, and the accuracy of the scaffold graph is improved.

Scaffolding methods based on long reads

With their read length advantage, third-generation sequencing technologies have been widely used in many biological applications and have greatly improved the development of scaffolding methods [72]. Currently, many methods use various strategies to resolve repetitive problems based on long reads [73, 74]. A long read can be aligned in multiple contigs and can link two or more contigs. Then, based on their alignment orientations and positions, their orders and orientations can be inferred. Hence, obtaining accurate alignment results is important for scaffolding.

SSPACE-LongRead [75] uses BLASR to obtain alignments between long reads and contigs. For two contigs, the number of long reads linking them is their weight. Thus, SSPACE-LongRead considers two contigs with high weight to be adjacent. SMSC [76] also uses BLASR and Nucmer as alignment tools and first detects MA in contigs. Then, SMSC constructs a scaffold graph, and the scaffolding problem is transferred into a maximum alternating path cover problem.

To improve the accuracy of alignment, LINKS [77] first extracts paired k-mers from the long reads. Then, these paired k-mers are aligned against the contigs, and used to link different contigs. SMIS [78] also splits the long reads into short segments to produce fake mate reads and link different contigs. RAILS [79] was developed based on the engines of SSAKE and LINKS. This strategy can be used to obtain numerous paired k-mers linking contigs. However, one paired k-mer is easily aligned at multiple positions which may confuse the subsequent steps.

Unlike other methods, DBG2OLC [80] first detects the order and orientation of the long reads. Next, DBG2OLC transfers the long reads to the contigs. Specifically, DBG2OLC detects and finds overlaps between long reads and contigs. Then, based on the contigs that can be mapped in the same long read, the long read is represented by contig identifiers and termed a compressed read. Because the compressed reads are much shorter than the original long reads, DBG2OLC can quickly find overlaps among the compressed reads. Next, DBG2OLC constructs a read overlap graph, in which a node is a compressed read, and nodes are chained to one another in both directions with the best possible overlap. After chaining, each chain is converted back to raw nucleotide reads, and a consensus module is called to align these reads to each backbone and calculate the polished assembly.

Because most repetitive regions in contigs can be identified by the long reads, some methods have been proposed based on classifying contigs into two categories: repetitive contigs and nonrepetitive contigs. These methods first order and orient nonrepetitive contigs and produce draft scaffolds, and repetitive contigs are then inserted into these draft scaffolds. npScarf [81] classifies contigs as either unique contigs or repetitive contigs based on sequencing coverage. High coverage of a contig usually indicates that it is a repetitive. LRScaf [82] identifies and removes repetitive regions in the contigs, based on the alignment coverage between the long reads and the contigs. Then, such regions are ignored in the following steps of LRScaf. LRScaf also constructs a scaffold graph, from which it extracts simple paths. SLR [83] develops a method to divide contigs into unique and ambiguous contigs. SLR produces local scaffolds, each of which includes ordered and oriented contigs that can be aligned in the same long read. Based on the positions of a contig in the local scaffolds, SLR determines whether it is ambiguous. After constructing a scaffold graph that includes only unique contigs, SLR adopts a linear programming model to optimize and extract scaffolds from the scaffold graph.

Scaffolding methods based on linked reads

Recently, 10X Genomics developed a new long-range sequencing technology. Unlike TGS, it does not capture all the information about a long sequence. It extracts only some short reads, named linked reads, from the same long DNA molecule, which can guarantee the correctness of linked reads and reduce sequencing cost. It is an effective and less expensive alternative for scaffolding. Note that each linked read is combined with a barcode. Therefore, linked reads with the same barcode are from the same long sequence and should be grouped together. Alignment of linked reads in the same group with different contigs usually means that these contigs can be linked.

ARCS [84] first groups the linked reads by barcode. When one contig is aligned with some linked reads, ARCS uses a binomial test to validate whether the alignment is reliable. ARCS also constructs a scaffold graph in which an edge is added when the link orientation has the most types of supporting barcodes. After the scaffold graph is completed, ARCS adopts LINKS to orient and order the contigs. Other stand-alone scaffolding methods that can construct a scaffold graph can also be used within the ARCS pipeline.

ARKS [85] developed a new method to connect contigs based on linked reads. To reduce the run time, ARKS uses the k-mer strategy to align linked reads and contigs. The k-mers at the two ends of the contigs are stored in a hash table. By computing the shared k-mers between one end of a contig and a linked read, ARKS gives an alignment score and determines whether the linked read can be aligned at the end of the contigs. If the ends of two contigs can be linked by the linked reads from the same group, one edge can be added between them. To estimate the distance between two contigs, ARKS trains an approach based on the distribution of linked reads aligned at the two ends of the given contigs. After the edges are determined, a scaffold graph is constructed. The subsequent step of orienting and ordering the contigs is the same as that of ARCS.

fragScaff [86] computes a shared barcode fraction metric between any paired contigs. When the value is higher than a threshold, an edge is added between two contigs. For a scaffold graph, fragScaff uses the maximum weight minimum spanning tree to traverse the scaffold graph and produce the final scaffolds. In addition, Architect [87], a scaffolder designed based on molecular read cloud specifications, also uses barcoded reads to improve the contiguity of genome assemblies. AnVIL [88] was developed for not only scaffolding but also minimizing gaps in the scaffolds. AnVIL uses linked reads to detect the linkages between the contigs, and adopts an overlap-layout-consensus strategy to fill the gaps.

Since the order information of linked reads with the same barcode is unknown, the methods to determine whether two contigs can be linked are not straightforward, and the estimation of the gap size between two linked contigs is unclear. All these methods require more sophisticated algorithms.

Scaffolding methods based on Hi-C paired reads

Hi-C technology is derived from chromosome conformation capture and high-throughput PE sequencing technology [26, 89]. Every Hi-C paired read refers to two genome locations and indicates a spatial interaction between these two genome locations. After aligning Hi-C paired reads against a reference genome, the number of paired reads that link any two regions in a reference genome can be computed. There are two interesting findings. First, the frequency of links between two regions in two different chromosomes is much less than that between two regions in the same chromosome, and this value can even be separated by tens of megabases. Second, for two regions in the same chromosome, the frequency of their links decays rapidly with the linear genome distance. Therefore, Hi-C paired reads can be used to study the spatial relationship of the entire chromatin DNA and to obtain high-resolution three-dimensional structural information on chromatin.

In addition, Hi-C paired reads can be used in combination with scaffolding to generate chromosome-scale scaffolds. Multiple scaffolding methods based on Hi-C paired reads are presented. LACHESIS [28] is the first method of genome assembly that use Hi-C paired reads and consists mainly of three steps. After the alignment of Hi-C paired reads against the contigs, the number of contacts between any pair of contigs can be obtained. Then, the contigs with more contacts are first clustered into one group by using the hierarchical clustering approach; when the number of groups reaches the number of user-specified chromosomes, the grouping approach ends. Next, the contigs are ordered within chromosome groups, and they may be adjacent if the number of their contacts is large. Finally, the orientation is determined for each contig within the groups by calculating an orientation quality score. However, the number of chromosomes is user-specified, and for an unknown chromosome, the LACHESIS approach cannot be utilized. dnaTri [90] is a statistical scaffolding method and trains a model using a supervised machine learning approach to verify that the pattern can be used to accurately predict the chromosome to which the contigs belongs, verifying that the pattern can accurately predict the location of the contigs. One disadvantage of LACHESIS and dnaTri is that they both rely on hierarchical clustering algorithms, and the cost of calculating the score for each contig is high. Another disadvantage is that there are no error correction steps for contigs or scaffolds before scaffolding, which could result in the output of some erroneous contigs or scaffolds to the final scaffold. GRAAL [91] is a probabilistic scaffolding method, but its verification is limited to a single chromosome. Based on GRAAL, instaGRAAL [92] is presented to overcome some limitations. instaGRAAL enhances the computational efficiency of GRAAL and adds a step to correct MA in contigs.

Hi-C paired reads can also be used to identify errors in contigs. 3D-DNA [29] first identifies and corrects errors in the input contigs by using Hi-C paired reads. First, the approach for misjoining correction uses many iterative steps to correct misjoins in the input contigs, exploiting the fact that the correct contigs have more contacts than the wrong contigs. Next, the scaffolding method is run to cluster, order and orient the contigs and output a megascaffold containing all the contigs. Finally, a merged method is run to merge the contigs with overlapping regions and output final scaffolds. However, the megascaffold is broken into a user-specified number of chromosomes based on the Hi-C contact map.

SALSA [93] defines the physical coverage for each base in the contigs, which is the sum of paired reads that span it. A sudden drop in physical coverage at a certain position signalizes for misassembly. For the efficient detection of misassembled signals in contigs, SALSA adopts a variation of Kadane’s approach for the maximum sum subarray problem. After correction, SALSA constructs a scaffold graph, and the weight of an edge is set based on the number of links and the lengths of contigs. Two long contigs tend to have more links. SALSA2 [94] was developed based on SALSA and uses a hybrid graph that combines Hi-C links with an assembly graph to generate scaffolds. SALSA2 calculates the score for each edge and retains the edges with an edge score > 1. Then, the edges are selected using the greedy weighted maximum matching approach, and the edges from the same contig are added to generate the initial scaffold.

Recently, a new approach called HiCAssembler was proposed [95]. It uses HiCExplorer to create a Hi-C contact matrix that filters out some low-quality reads. It also uses two methods to remove MA: the TAD detection approach based on the TAD separation score and a manual adjustment parameter method based on visual inspection. HiCAssembler obtains chromosome-scale scaffolds by iteratively joining scaffold paths to form paths of increasing size, as in 3D-DNA. The scaffolding method uses a maximum spanning tree approach to determine the order of contigs similar to LACHESIS.

Scaffolding methods based on optical maps

Single-molecule optical maps are ordered genome wide optical restriction maps (ORMs) derived from a single DNA molecule. Optical maps provide a macroscopic framework that reflects structural information on the whole genome. Single-molecule optical maps for scaffolding need to solve two problems: how to find the best matches between the contigs and the optical maps, and how to place the contigs on the map in the correct order. The following are some methods for scaffolding using optical maps.

SOMA [96] is a scaffolding method that overcomes limitations such as the length of the sequencing reads by combining the information of the optical maps. The scaffolding problem is decomposed into two simple subproblems: seeking good matches between the optical maps and the contigs and searching for a consistent placement when given many matches. AGORA [97] uses an optical map within the de Bruijn graph framework to promote genome assembly and simplifies the process of genome assembly into the process of finding a ‘Chinese postman path’. The method process mainly includes two steps. The landmark edges can be found first by using the greedy approach, which is an edge in the graph that matches one position on the optical map. The landmark edges are then sorted in the order of the placement of the optical map. Next, the path between consecutive landmark edges can be found by using the depth-first search approach. The experiments discussed in this article not only indicated that optical maps are effective for genome assembly in the de Bruijn graph framework but also demonstrated the advantages of more accurate optical mapping technologies such as nanocoding.

Due to the insufficiency of traditional optical mapping methods, such as the extremely low efficiency of the image capture process, the difficulty of expanding DNA molecules and the incompletely automated analysis of the optical image, an approach called whole genome mapping (WGM) [98] is used to generate optical mapping data and scaffolds. The WGM process includes the following steps. Briefly, the approach first converts contigs into in silico maps and then uses an alignment approach in which single-molecule maps are aligned to the in silico maps to extend the scaffolds. Then, after all the contigs are expanded, the contigs are used to generate regions including overlaps in adjacent contigs, and the contigs with overlapping regions are finally connected by using a modified Smith–Waterman dynamic programming approach to form a super-scaffold. The use of WGM technology for large genome assembly was verified in goats. Based on ORMs, a greedy scoring approach and three greedy placement algorithms are utilized for scaffolding [99]. The score of each contig corresponding to its possible placement in the ORM is first calculated and then used to determine the nonoverlapping placement of each contig in the ORM.

For Bionano optical maps, SewingMachine [100] and optical map-based genome scaffolding (OMGS [101]) are the two main methods for scaffolding. SewingMachine iteratively uses RefAligner and Stitch. First, the contigs are converted into new maps, and RefAligner then aligns the new maps to optical maps. Based on the alignment information, Stitch orders and orients the contigs. Because most methods can process only one optical map at a time, OMGS was developed to solve this problem by handling two or more optical maps simultaneously. Multiple optical maps can supply more linkage information between contigs and are helpful for resolving conflicts and repetitive regions.

Scaffolding methods based on reference genomes

The methods mentioned above are reference-free and use the reads from the targeted organism for scaffolding. To date, there are many available complete genomes. During evolution, although DNA sequences vary, some conserved genes and their order are retained. Hence, to assemble a target genome, one or more closely related reference genomes can serve as a reference and are useful for guiding the process of scaffolding. Obviously, more related reference genomes are more conducive to identifying conserved genes and their order.

Recently, some reference-based scaffolding methods were presented, most of which are faster and cheaper than reference-free methods [102]. These methods can be further divided into alignment-based methods [102–107] and rearrangement-based methods [102, 108–113].

Alignment-based methods usually directly align contigs against the references and then deduce their order and orientation based on their aligned positions and orientations in the reference genome. For example Projector 2 [103], Mauve [104] and OSLay [105]. MeDuSa [107] take advantage of multiple sets of reference genomes from related organisms for scaffolding. A scaffold graph is first constructed in which each vertex denotes a contig, and the edge is weighted by the number of reference supports adjacent to the two vertices. MeDuSa utilizes a constant factor approximation approach to order and orient contigs in a scaffold graph. As an efficient scaffolding method, RaGOO [102] adopts Minimap2 to align contigs and a reference. Before scaffolding, RaGOO can identify chimeric contigs and break them at their incorrect positions. After that, each contig is assigned to a reference chromosome, and all contigs in that chromosome are ordered and oriented. RaGOO assigns three confidence scores (clustering, location and orientation) for each contig, which can be used to judge whether the scaffolding result is correct.

First, rearrangement-based methods commonly find synteny blocks or genomic markers between contigs and references. The synteny blocks are a collection of conserved genomic regions that simultaneously appear in the contigs and the references, and they can be detected by alignment tools, such as Cactus [114] or MUMmer [115]. Then, the contigs and references can be represented by ordered and oriented synteny blocks. By using the references as backbones, these methods try to minimize the rearrangement distance between the contigs and the reference. CAR [110] uses a related reference genome to guide the scaffolding process. By utilizing permutation groups in algebra, CAR measures the rearrangement distance based on genetic markers (conserved sequences) in the contigs and reference genomes, which can be detected by Nucmer or Promer. In contrast to CAR, Multi-CAR [111] simultaneously uses multiple reference genomes for scaffolding. If two contigs can be joined by CAR with one reference genome, the weight between the two contigs increases by one. Then, Multi-CAR iteratively finds the paths with the maximum weights, and the paths correspond to the scaffolding result. CSAR-web [112] is a web server developed based on CSAR [113] that uses an incomplete reference genome for scaffolding. Ragout2 [102] used Cactus to detect conserved sequences, and develops a new graph simplification algorithm to generate synteny blocks containing consecutive conserved sequences. Then Ragout2 adopts a two-break rearrangement algorithm to break MA in the contigs. Finally, Ragout2 iteratively finds scaffolds with different synteny blocks sizes.

Other scaffolding methods

In addition to the mainstream methods mentioned previously, some researchers have tried to combine different reads or protein sequences for scaffolding.

OPERA-LG [116] is a hybrid scaffolding method that can simultaneously adopt short paired reads and long reads as input. OPERA-LG transfers only the long reads to synthetic paired reads with different insert sizes. Then, it uses multiple paired read libraries and an exact approach for scaffolding. SLR-superscaffolder [117] employs a top-to-bottom scheme to combine long fragment reads and paired reads. It first uses long fragment reads in large-scale scaffolding and then uses paired reads for local scaffolding.

Protein sequences are usually well conserved, and exons in the same protein sequence can be located on different contigs. Therefore, two different contigs can be linked when they are aligned in the same protein sequence. SWiPS [118] uses proteome sets from different species for scaffolding. By mapping the contigs to protein sequences using TBLASTN [119], SWiPS can find contigs aligned in the same protein sequence and compute their similarity scores. Because the same region of the protein sequence can usually be aligned with multiple contigs, SWiPS selects the path with the highest similarity score as the scaffold. PEP_scaffolder [120] uses BLAST [121] to align protein sequences with contigs. The protein sequence is divided into blocks, and the longest alignment region in one block is selected. Then, one contig corresponds to one block; therefore, the contigs can be sorted based on the order of the blocks.

Evaluation

Metrics

Evaluating the performance of scaffolding methods is crucial for assessing how well a scaffolding method performs and for fairly comparing the performances of different methods. Here, we introduce the most widely used evaluation metrics in the literature.

The traditional evaluation approach for scaffolding methods is to compute the metrics N50 and corrected N50 (NGA50/NA50) values. N50 is the length of the longest scaffold such that all the scaffolds of that length or longer cover at least half of the length of all scaffolds. After aligning the scaffolds in the reference genome and breaking the scaffolds at their error points, N50 can be recomputed to obtain the corrected N50. These metrics can be obtained by GAGE [122] and QUAST [123], which are typically used to evaluate the contiguity and completion of scaffolds. However, it is difficult to evaluate the number of correct orders and orientations. Joining long contigs correctly usually increases the values of N50 and corrected N50. However, joining short contigs correctly sometimes has no effect on the values of N50 and corrected N50.

To compute the number of correct joins among contigs, a novel evaluation pipeline is first provided for the scaffolding methods, termed a scaffolding evaluation [124]. For the contigs produced by any assembler, scaffold evaluation leads to the identification of assembly errors by aligning them against the reference genome using NUCmer. The best alignments are selected, and a new contig set with no errors is then generated. Each new contig is attached with a tag, and their orders, orientations and gap distances in the reference genome are kept. Finally, scaffolding methods are run on the new contig set, and the scaffolding results are evaluated based on the orders and orientation of contigs determined in the reference genome. They present five metrics: correct joins, incorrect joins, skipped tags, lost tags and running time. Correct join indicates that two contigs are joined in the expected order, orientation and gap distance. Incorrect join indicates that two contigs are joined with an incorrect order, orientation and gap distance. A skipped tag indicates that two contigs are joined correctly but that the gap region between them should contain another contig. A lost tag indicates that a contig has been lost in the scaffolding result. The running time is the total CPU time required to run one scaffolding method.

ScaffoldMatch computes precision, recall and the F1-score based on the number of correct joins and incorrect joins. Precision is calculated as (correct joins/correct joins + Incorrect joins), Recall is calculated as (correct joins/total joins), and the F1-score equals [2*precision*recall/(precision + recall)], thus yielding a comprehensive evaluation. However, some repetitive regions in the contigs are commonly used for scaffolding. Some contigs or subcontigs can be aligned at two or more positions. When producing perfect new contigs, the scaffold evaluation leads to the selection of only the best alignment for a contig or subcontig. However, Igor Mandric et al. [125] considered multiple alignments for repetitive contigs or subcontigs over a given identity level to compute correct joins and incorrect joins. This strategy enhanced the evaluation accuracy for scaffolding methods.

Comparisons

Current studies usually compare the same kinds of scaffolding methods, and the differences in performance among different kinds of scaffolding methods are unclear. To evaluate different scaffolding methods, we selected some and obtained experimental results from the corresponding studies. The scaffolding results shown in Table 2 are from studies on SCOP and BOSS. The scaffolding results shown in Tables 3 and 4 are from the studies on SLR and LRscaf, respectively. In Table 5, we show the scaffolding results from the study of SALSA2, whereas Table 6 shows the scaffolding results from the study of ARCS, and Table 7 shows the scaffolding results from the study of CSAR. Two important metrics are used: NGA50 and MA, which represent the continuity and correctness of scaffolding results, respectively.

Table 2

Experimental results of scaffolding methods based on paired reads

GenomeStaphylococcus aureusRhodobacter sphaeroidesR. sphaeroidesBacillus cereusPlasmodium falciparumHuman chr 14Human chr 14
Genome size2.9 M4.6 M4.6 M5.4 M23 M88 M88 M
Insert size350037005406002700290034 500
Coverage454610010039262
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
ABySS60912631862331035640114381222
Bambus224021132402831093281451723178
OPERA6771010826023263324199133603203903
SCARPA11283536324415931726345441171499
SGA30414134433622345898816863595
SOAP23043250410135616946349119627342788
SOPRA10922932024416531164191081302562
SSPACE29831092132321033530744648111
ScaffMatch33515246889241314394711219587742641
BOSS33717360532672091030328422514435129
SCOP3411344961373456531350452457
GenomeStaphylococcus aureusRhodobacter sphaeroidesR. sphaeroidesBacillus cereusPlasmodium falciparumHuman chr 14Human chr 14
Genome size2.9 M4.6 M4.6 M5.4 M23 M88 M88 M
Insert size350037005406002700290034 500
Coverage454610010039262
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
ABySS60912631862331035640114381222
Bambus224021132402831093281451723178
OPERA6771010826023263324199133603203903
SCARPA11283536324415931726345441171499
SGA30414134433622345898816863595
SOAP23043250410135616946349119627342788
SOPRA10922932024416531164191081302562
SSPACE29831092132321033530744648111
ScaffMatch33515246889241314394711219587742641
BOSS33717360532672091030328422514435129
SCOP3411344961373456531350452457
Table 2

Experimental results of scaffolding methods based on paired reads

GenomeStaphylococcus aureusRhodobacter sphaeroidesR. sphaeroidesBacillus cereusPlasmodium falciparumHuman chr 14Human chr 14
Genome size2.9 M4.6 M4.6 M5.4 M23 M88 M88 M
Insert size350037005406002700290034 500
Coverage454610010039262
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
ABySS60912631862331035640114381222
Bambus224021132402831093281451723178
OPERA6771010826023263324199133603203903
SCARPA11283536324415931726345441171499
SGA30414134433622345898816863595
SOAP23043250410135616946349119627342788
SOPRA10922932024416531164191081302562
SSPACE29831092132321033530744648111
ScaffMatch33515246889241314394711219587742641
BOSS33717360532672091030328422514435129
SCOP3411344961373456531350452457
GenomeStaphylococcus aureusRhodobacter sphaeroidesR. sphaeroidesBacillus cereusPlasmodium falciparumHuman chr 14Human chr 14
Genome size2.9 M4.6 M4.6 M5.4 M23 M88 M88 M
Insert size350037005406002700290034 500
Coverage454610010039262
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
ABySS60912631862331035640114381222
Bambus224021132402831093281451723178
OPERA6771010826023263324199133603203903
SCARPA11283536324415931726345441171499
SGA30414134433622345898816863595
SOAP23043250410135616946349119627342788
SOPRA10922932024416531164191081302562
SSPACE29831092132321033530744648111
ScaffMatch33515246889241314394711219587742641
BOSS33717360532672091030328422514435129
SCOP3411344961373456531350452457
Table 3

Experimental results of scaffolding methods based on long reads from SLR

GenomeEscherichia coliE. coliSaccharomyces cerevisiaeS. cerevisiaeChr X
Genome size4.6 M4.6 M12.1 M12.1 M155.2 M
Coverage412923419657
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
SSPACE-LR5381011756200101190215809176
LINKS445669352035423152183100
npscarf444131672631280374832325157
SLR7234292743745237446239083
GenomeEscherichia coliE. coliSaccharomyces cerevisiaeS. cerevisiaeChr X
Genome size4.6 M4.6 M12.1 M12.1 M155.2 M
Coverage412923419657
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
SSPACE-LR5381011756200101190215809176
LINKS445669352035423152183100
npscarf444131672631280374832325157
SLR7234292743745237446239083
Table 3

Experimental results of scaffolding methods based on long reads from SLR

GenomeEscherichia coliE. coliSaccharomyces cerevisiaeS. cerevisiaeChr X
Genome size4.6 M4.6 M12.1 M12.1 M155.2 M
Coverage412923419657
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
SSPACE-LR5381011756200101190215809176
LINKS445669352035423152183100
npscarf444131672631280374832325157
SLR7234292743745237446239083
GenomeEscherichia coliE. coliSaccharomyces cerevisiaeS. cerevisiaeChr X
Genome size4.6 M4.6 M12.1 M12.1 M155.2 M
Coverage412923419657
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
SSPACE-LR5381011756200101190215809176
LINKS445669352035423152183100
npscarf444131672631280374832325157
SLR7234292743745237446239083
Table 4

Experimental results of scaffolding methods based on long reads from LRScaf

GenomeEscherichia coliE. coliE. coliSaccharomyces cerevisiaeS. cerevisiaeS. cerevisiae
Genome size4.6 M4.6 M4.6 M12.1 M12.1 M155.2 M
Coverage15101510
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
SSPACE-LR583600127001367182103549
LINKS59629292051135121111511316
npscarf5812455429530618310017
LRScaf590130061600132615492999
GenomeEscherichia coliE. coliE. coliSaccharomyces cerevisiaeS. cerevisiaeS. cerevisiae
Genome size4.6 M4.6 M4.6 M12.1 M12.1 M155.2 M
Coverage15101510
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
SSPACE-LR583600127001367182103549
LINKS59629292051135121111511316
npscarf5812455429530618310017
LRScaf590130061600132615492999
Table 4

Experimental results of scaffolding methods based on long reads from LRScaf

GenomeEscherichia coliE. coliE. coliSaccharomyces cerevisiaeS. cerevisiaeS. cerevisiae
Genome size4.6 M4.6 M4.6 M12.1 M12.1 M155.2 M
Coverage15101510
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
SSPACE-LR583600127001367182103549
LINKS59629292051135121111511316
npscarf5812455429530618310017
LRScaf590130061600132615492999
GenomeEscherichia coliE. coliE. coliSaccharomyces cerevisiaeS. cerevisiaeS. cerevisiae
Genome size4.6 M4.6 M4.6 M12.1 M12.1 M155.2 M
Coverage15101510
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
SSPACE-LR583600127001367182103549
LINKS59629292051135121111511316
npscarf5812455429530618310017
LRScaf590130061600132615492999
Table 5

Experimental results of scaffolding methods based on Hi-C paired reads

GenomeHomo sapiensH. sapiensH. sapiens
Genome size2.9G2.9G2.9G
Coverage401718
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
SALSA257 20058726 4602734630259
3D-DNA28 61081621 4708284520930
SALSA114 810371
GenomeHomo sapiensH. sapiensH. sapiens
Genome size2.9G2.9G2.9G
Coverage401718
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
SALSA257 20058726 4602734630259
3D-DNA28 61081621 4708284520930
SALSA114 810371
Table 5

Experimental results of scaffolding methods based on Hi-C paired reads

GenomeHomo sapiensH. sapiensH. sapiens
Genome size2.9G2.9G2.9G
Coverage401718
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
SALSA257 20058726 4602734630259
3D-DNA28 61081621 4708284520930
SALSA114 810371
GenomeHomo sapiensH. sapiensH. sapiens
Genome size2.9G2.9G2.9G
Coverage401718
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
SALSA257 20058726 4602734630259
3D-DNA28 61081621 4708284520930
SALSA114 810371

In these studies, although different kinds of scaffolding methods were utilized on the different datasets, we still found some interesting results as follows:

  • (i) Even for scaffolding methods of the same kind, their performances on different datasets of the same genome are usually different. The characteristics of the sequencing data have a large impact on the final result.

  • (ii) For long-range sequencing data, their performances are good. As shown in Tables 3 and 4, scaffolding methods based on long reads perform better in terms of the MA value. The NGA50 values for different datasets are also large, which can be attributed to the length of long reads, thus resolving more problems caused by repetitive regions. To improve the accuracy of scaffolding results, users can select this kind of scaffolding method in their research.

Table 5 shows that scaffolding methods based on Hi-C paired reads perform better in terms of the NGA50 value, which demonstrates that Hi-C paired reads are helpful for distinguishing contigs from different chromosomes and guarantees the continuity of scaffolding results. Hence, Hi-C paired reads are significant for chromosome-level scaffolding. As shown in Tables 6 and 7, linked reads and reference genomes are helpful for scaffolding, and the NGA50 values are good.

  • (iii) For small genomes, scaffolding methods based on paired reads also perform well in terms of the NGA50 value, as shown in Table 2. A repetition problem rarely exist for small genomes, and paired reads can be more exactly aligned in contigs. If cost is a limitation, users can choose this type of scaffolding method.

  • (iv) Table 4 shows that as the sequencing coverage of long reads increases, the continuity of the scaffolding results improves. Ou et al. [126] also claimed that a more sequencing coverage (>40) is helpful for identifying more transposable elements in complex genomes. As shown in Table 2, the results are affected by not only the sequencing coverage, but also the insert size. A small insert size is not conducive to resolving the repetitive region problem. Table 5 shows that high coverage improves the value of NGA50 for the same species. Accurate and complete assembly of highly complex genomes will require more sequencing data [127].

Discussion

With the rapid development of sequencing technologies and assembly algorithms, the ability to obtain more complete and accurate complex genomes has advanced [9]. The most prominent problem in genome assembly is repetitive regions in the target genome. Since the long reads produced by TGS can more proportionally span the repetitive regions, the long reads are useful for addressing problems caused by repetitive regions and can effectively improve the assembly quality. Assembly methods based on long reads can be used to obtain high reference-quality genomes with contig sizes measured in megabases instead of kilobases and are common for second-generation attempts [17]. During the assembly contig process, the extension of a contig commonly ends when complex repetitive regions, low coverage regions or regions with sequencing errors are encountered.

Table 6

Experimental results of scaffolding methods based on linked reads

GenomeCaenorhabditis elegansHomo sapiens
Genome size100 M2.9G
Coverage50136
NGA50 (kb)MANGA50 (kb)MA
ARKS82801359448217
ARCS84501473410207
fragScaff77301791267117
Architect71701452174454
GenomeCaenorhabditis elegansHomo sapiens
Genome size100 M2.9G
Coverage50136
NGA50 (kb)MANGA50 (kb)MA
ARKS82801359448217
ARCS84501473410207
fragScaff77301791267117
Architect71701452174454
Table 6

Experimental results of scaffolding methods based on linked reads

GenomeCaenorhabditis elegansHomo sapiens
Genome size100 M2.9G
Coverage50136
NGA50 (kb)MANGA50 (kb)MA
ARKS82801359448217
ARCS84501473410207
fragScaff77301791267117
Architect71701452174454
GenomeCaenorhabditis elegansHomo sapiens
Genome size100 M2.9G
Coverage50136
NGA50 (kb)MANGA50 (kb)MA
ARKS82801359448217
ARCS84501473410207
fragScaff77301791267117
Architect71701452174454
Table 7

Experimental results of scaffolding methods based on reference genomes

GenomeBrucella pinnipedialis B2/94B. pinnipedialis B2/94B. pinnipedialis B2/94V. antiquarius Ex25V. antiquarius Ex25V. antiquarius Ex25
Genome size3.4 M3.4 M3.4 M5 M5 M5 M
ReferenceB. pinnipedialis M163/99/10B. melitensis KU/RCF-31B. canis SCLV. antiquarius Ex25 939V. parahaemolyticus GCSL R74V. campbellii DS40M4
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
CSAR (PROmer)254724172461205766744416
CSAR (NUCmer)266724142414120572871715325
Mauve435301904019046469272613913857
OSLay31192212221026610153229218
Projector230119157111578233232332413824
GenomeBrucella pinnipedialis B2/94B. pinnipedialis B2/94B. pinnipedialis B2/94V. antiquarius Ex25V. antiquarius Ex25V. antiquarius Ex25
Genome size3.4 M3.4 M3.4 M5 M5 M5 M
ReferenceB. pinnipedialis M163/99/10B. melitensis KU/RCF-31B. canis SCLV. antiquarius Ex25 939V. parahaemolyticus GCSL R74V. campbellii DS40M4
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
CSAR (PROmer)254724172461205766744416
CSAR (NUCmer)266724142414120572871715325
Mauve435301904019046469272613913857
OSLay31192212221026610153229218
Projector230119157111578233232332413824
Table 7

Experimental results of scaffolding methods based on reference genomes

GenomeBrucella pinnipedialis B2/94B. pinnipedialis B2/94B. pinnipedialis B2/94V. antiquarius Ex25V. antiquarius Ex25V. antiquarius Ex25
Genome size3.4 M3.4 M3.4 M5 M5 M5 M
ReferenceB. pinnipedialis M163/99/10B. melitensis KU/RCF-31B. canis SCLV. antiquarius Ex25 939V. parahaemolyticus GCSL R74V. campbellii DS40M4
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
CSAR (PROmer)254724172461205766744416
CSAR (NUCmer)266724142414120572871715325
Mauve435301904019046469272613913857
OSLay31192212221026610153229218
Projector230119157111578233232332413824
GenomeBrucella pinnipedialis B2/94B. pinnipedialis B2/94B. pinnipedialis B2/94V. antiquarius Ex25V. antiquarius Ex25V. antiquarius Ex25
Genome size3.4 M3.4 M3.4 M5 M5 M5 M
ReferenceB. pinnipedialis M163/99/10B. melitensis KU/RCF-31B. canis SCLV. antiquarius Ex25 939V. parahaemolyticus GCSL R74V. campbellii DS40M4
NGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MANGA50 (kb)MA
CSAR (PROmer)254724172461205766744416
CSAR (NUCmer)266724142414120572871715325
Mauve435301904019046469272613913857
OSLay31192212221026610153229218
Projector230119157111578233232332413824

Scaffolding methods have been widely used in many applications. In addition to reconstructing a single genome, scaffolding methods can also be used in pan-genome assembly. The pan-genome refers to a collection of all DNA sequences that occur in a species, which is significant for understanding genetic diversity. Compared with a single reference genome, the pan-genome can identify more variants that are related to phenotypes. Constructing of a pan-genome usually requires the sequencing of many individuals and the production of large amounts of sequencing data. Scaffolding methods can produce more complete individual references efficiently and improve the subsequent analysis of variants. RaGOO was used to order and orient 103 draft Arabidopsis thaliana genomes and to construct the corresponding pan-genome. Recently, genome assemblies for five sesame varieties were scaffolded using a reference-based method, and a pan-genome was produced for comparative genomic analyses and gene discovery [128].

Construction of a haplotype-resolved genome for species is important for population genetic analyses. Delaneau et al. [129] employed the scaffolding method SSPACE to generate supper-scaffolds, resulting in halotype-improved assembly. Yang et al. [130] evaluated three scaffolding methods based on Hi-C paired reads and selected SALSA2 to create haplotype-resolved, chromosome-level genome assemblies of Angus (taurine) and Brahman (indicine) cattle. Low et al. [131] generated scaffolds using the ARCS + LINKS pipeline which increased the assembly continuity.

More complete assembly allows the identification of new features of transposable elements and is helpful for reconstructing subgenome. Xu et al. [132] obtained draft contigs of white lupin by Canu, and the N50 value of the contigs was 1.76 M. Then, they employed 3ddna based on Hi-C paired reads to scaffold these contigs and obtained more continuous scaffolds, obtaining an N50 value of 18.66 Mb. Based on the assembly and analysis of the white lupin genome, the authors deciphered its diploid ancestral genome and reconstructed the three subgenomes. Cai et al. [133] reported an updated assembly of the Brassica oleracea reference genome and then used syntenic gene pairs between the updated assembly and Arabidopsis thaliana to construct three subgenomes.

Users must know which scaffolding methods to employ for assembly, and short paired reads may be a better choice for the scaffolding of small and noncomplex genomes. Scaffolding methods based on long reads are suggested for more continuous assemblies, and methods based on Hi-C paired reads can usually yield chromosome-level results. However, in addition to their advantages, each scaffolding method has some biases. Scaffolding methods based on short paired reads usually get stuck in repetitive regions, estimate the insert size and link two adjacent contigs that are far apart. Scaffolding methods based on Hi-C paired reads cannot provide the exact distance and orientation information about two contigs. The order of linked reads with the same barcode is unclear, which limits the application of linked reads for scaffolding.

Currently, most studies prefer to use hybrid data for scaffolding. Wallberg et al. [134] used three different scaffolding methods (ARCS + LINKS + LACHESIS) based on linked reads, long reads and Hi-C paired reads to produce a new genome assembly of the honeybee Apis mellifera. Xu et al. [135] first produced contigs using long reads and then used ARCS based on linked reads to obtain the final assembly of Malania oleifera. To reconstruct the grape phylloxera genome, Rispe et al. [136] used short paired reads to generate a draft assembly and then employed SSPANCE-Longread for scaffolding based on the long reads. Finally, the gaps were filled by GapFiller based on short paired reads. Li et al. [137] used a combination of long reads, short paired reads, linked reads and Hi-C paired reads to obtain the genome sequence of Asparagus setaceus. The scaffolding methods ARKS, LINKS and LACHESIS were all used in the assembly process. Arimoto et al. [138] assembled the Praesagittifera naikaiensis genome using BESST and LINKS to perform scaffolding with short paired reads and long reads.

Hence, when assembling more complex genomes, such as repeat-rich plant and mammalian genomes, combining multiple types of data and scaffolding methods can usually yield a high-quality genome.

After scaffolding, the DNA sequences of gaps in the scaffold are unknown. To improve the completeness of the assembly, some stand-alone gap filling methods have been proposed, which can be divided into two categories. (i) The first involves the methods based on paired reads, which usually involve an initial collection of reads that possibly cover the gaps based on the alignments between paired reads and scaffolds. Then, different strategies are utilized to construct a local assembly to fill the gaps. IMAGE [139] employs Velvet to obtain contigs for filling gaps, and GapCloser [49] is a gap filling module in the assembler SOAP2. When filling gaps, GapFiller [140] considers the difference between the gap size and the length filled. Sealer [141] selects paths from De Bruijn graphs that are represented by bloom filter data structures. Gap2Seq [142] solves the gap filling problem by finding a path in a directed graph whose length is in a given range. For a gap and its related reads, GapReduce [143] iteratively constructs De Bruijn graphs based on different k-mer sizes and frequencies and selects paths for filling the gaps. (ii) The second category involves the methods based on long reads. PBJelly [144] uses BLASR to align the long reads to scaffolds, and identify the long reads that possibly cover the gap. Then, it employs the OLC assembly algorithm to fill the gaps. GMcloser [145] uses the likelihood model to select reliable long reads to fill the gaps. LR_Gapcloser [146] segments each long read into short sequence fragments with the same length, which can be used to find more accurate alignments. TGS-GapCloser [147] corrects errors for long reads related to a gap.

Conclusions and future directions

Scaffolding is a significant step in genome assembly and can greatly improve the completeness and contiguity of assembly. With the continuous development of sequencing technologies, many different types of scaffolding methods have been developed and are widely used. Although the latest sequencing technologies have substantially advanced scaffolding, resolving intricate ambiguities is also challenging. In this review, the difficulties of scaffolding, characteristics of various kinds of sequencing data, and scaffolding methods are summarized and discussed.

Scaffolding methods commonly implement a scaffold graph based on the alignments between contigs and reads. Although many scaffolding methods based on distinct sequencing technologies are available, the problems caused by genome complexity and the drawbacks of sequencing technologies are difficult to resolve. Furthermore, sequencing errors, highly repetitive regions, and alignment errors lead to false edges and increase the complexity of the scaffold graph. In addition, graph search approaches, which are used to extract scaffolds, usually require many more constraints with respect to the number of vertices and edges in the scaffold graph. Moreover, the most powerful latest sequencing technologies and scaffold graph structure have not been fully explored.

For future research, our recommendations are as follows:

  • (i) By taking complete advantage of various sequencing data obtained by different technologies, scaffolding methods could yield more accurate results. Generally, one type of sequencing data supplies only part of the information about the order and orientations of the contigs. For example, paired reads can be more accurately aligned in contigs, long reads can span more repetitive regions, and Hi-C paired reads are helpful for chromosome-level scaffolding. More efforts should be made to integrate multiple sequencing data into a unified workflow, which is obviously challenging. For example, Hi-C paired reads can be used to distinguish contigs from distinct chromosomes first, and related reference genomes can provide a draft contig backbone. Then, repetitive regions in the same chromosome are approached by long reads. Paired reads improve the quality of local assembly. One new strategy that can simultaneously consider linked information from multiple kinds of data should be paid more attention. In particular, the consistencies and contradictions of various types of alignment information should be analyzed in-depth. For example, whether two contigs can be linked simultaneously by paired and long reads? And how to address two contigs being supported by only one type of read need to be further elucidated. Hence, for sequencing data integration, scaffolding methods should consider the different contributions of various data and design rational methods for weightings.

  • (ii) For the long reads produced by TGS technologies, the greatest advantage is the read length. Compared with paired reads, one long read can link two contigs separated by repetitive regions, and three or more contigs can be aligned with the same long read. Paired reads can link only two contigs, and it is difficult to span repetitive regions. For current methods based on paired reads or long reads, when constructing an optimization model to represent the scaffolding problem, the constraint condition usually refers to two contigs. Hence, for three or more contigs that can be aligned in the same long read, current methods based on long reads should pay more attention to designing a constraint condition that can include them. In addition, more accurate methods to identify repetitive regions in contigs based on long reads also require more effort.

  • (iii) Due to the low cost and high sequencing accuracy of NGS, paired reads from NGS are still widely used in many applications. Further studies on scaffolding methods based on paired reads still make sense. For future studies based on paired reads, the primary issue is to determine whether two contigs can be linked based on the positional information of paired reads aligned in them, and how to weight the link to reflect its confidence. For two contigs, more focus can be applied to the positional distribution of aligned paired reads to analyze their order and orientation. Moreover, identifying repetitive contigs based on paired reads also needs to be further investigated.

  • (iv) For scaffolding methods based on Hi-C paired reads, the gap sizes in the scaffolds are difficult to estimate. Hi-C paired reads reflect only the spatial proximity but do not directly reflect the linear distance. Current clustering methods, which distinguish contigs from different chromosomes, are usually developed based on scaffold graphs. The weights of the edges in the scaffold graph are important for clustering results and are usually related to the number of paired reads linking two vertices and to the length of contigs. Due to the complexity of the three-dimensional chromosome structure, some contigs from different chromosomes can be spatially close. This situation should be considered to enhance the accuracy of clustering. In addition, for two contigs, the correlation between the number of aligned Hi-C paired reads and their distance is currently ambiguous.

  • (v) The sequencing technology for linked reads balances the sequencing error rate and the length of fragments sequenced. Although linked reads with the same barcode can be grouped together, their order is unclear, which introduces some problems regarding the determination of whether two contigs can be linked. When linking contigs, current methods usually consider only the linked reads that can be aligned at the ends of the contigs. The ability to simultaneously align the linked reads in a group at two contig ends separately commonly means that the two contigs are adjacent. However, for two contigs, more information (the order of the aligned linked reads and the number of the unaligned linked reads from the same group, the consistency of aligned information between different groups) could be further utilized to determine their order and gap size.

  • (vi) Optical maps also supply long-range information that can guide scaffolding. Before optical maps are aligned in the contigs, the restriction sites in the contigs should be identified. However, the length of one restriction enzyme is commonly short, and more restriction sizes than expected are usually found. Moreover, the fragment sizes often have small deviations, which also influence the alignment results. Exact alignment methods for contigs and optical maps should be given more attention.

  • (vii) Reference-based scaffolding methods can take advantage of one or more related reference genomes to guide the process of scaffolding and are cost-efficient alternatives for scaffolding. However, some differences (small or large structural variations and polymorphisms) commonly exist between the related reference genome and the target genome. Hence, these methods may introduce some reference bias. For example, some contigs may not be aligned to the reference genomes or aligned at incorrect positions or orientations, and these contigs may be treated as chimaeras. Commonly, multiple related reference genomes can provide more consistent adjacency information about contigs than a single reference genome. Although some methods using multiple reference genomes have been developed, obtaining accurate alignments and synteny blocks between the contigs and the reference genomes is challenging. Combining long-range sequencing technologies with reference genomes may improve the completeness and accuracy of scaffolding results. Therefore, a draft scaffold set based on reference genomes can be produced. Then, long reads can be used to tackle the problems caused by repetitive regions or small structural variations, and Hi-C paired reads can be used to address large differences between contigs and reference genomes.

  • (viii) For scaffolding methods, most computational costs are consumed in the alignment step. Different alignment tools vary greatly in efficiency. Minimap and Minimap2 usually perform better than other tools when aligning long reads and contigs. As the scale of sequencing data increases, the running time and scaffolding memory will also increase. Therefore, selecting a rational alignment tool is important for scaffolding methods.

  • (ix) To fairly compare and evaluate various kinds of scaffolding methods, benchmark datasets and evaluation metrics should be constructed by collecting real sequencing data. In particular, the contig set should match the real situation, which could contain some MA and repetitive regions. Therefore, more efforts should be made to verify the scaffolding results.

Key Points
  • Scaffolding methods help scientists gain deep insights into genomic research, including promoter cloning, gene identification, gene function analysis, gene expression regulation and comparative genomics analyses.

  • According to sequencing technologies and sequencing data, scaffolding methods are summarized and discussed in details. And, the metrics used for evaluating scaffolding methods are analyzed comprehensively.

  • To investigate the performance of the different types of scaffolding methods, some related computational experiments are analyzed.

  • Combination of multiple types of sequencing data using effective integrate methods can improve the performance of scaffolding methods in the future.

Funding

This work was supported in part by the National Natural Science Foundation of China under (grant nos. 61972134, 61602156, 61802113, 61772557 and 11601129), Henan Provincial Department of Science and Technology Research Project under (grant no. 192102210118), Doctor Foundation of Henan Polytechnic University under (grant no. B2018-36).

Junwei Luo is an Associate Professor in the College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China. His research interests include genome assembly, scaffolding, gap filling and structural variant detection.

Yawei Wei is a graduate student in the College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China. His research interests include bioinformatics and scaffolding.

Mengna Lyu is a graduate student in the College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China. Her research interests include bioinformatics and scaffolding.

Zhengjiang Wu is an Associate Professor in the College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China. His research interests include bioinformatics and data mining.

Xiaoyan Liu is a lecturer in the College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China. Her research interests include scaffolding methods and artificial intelligence.

Huimin Luo is an Associate Professor in the School of Computer and Information Engineering, Henan University, Kaifeng, China. Her research interests include bioinformatics and computational drug repositioning.

Chaokun Yan is an Associate Professor in the School of Computer and Information Engineering, Henan University, Kaifeng, China. His research interests include artificial intelligence and computational biology.

References

1.

Shi
W
,
Ji
P
,
Zhao
F
.
The combination of direct and paired link graphs can boost repetitive genome assembly
.
Nucleic Acids Res
2017
;
45
(
6
):
e43
3
.

2.

Ameur
A
,
Kloosterman
WP
,
Hestand
MS
.
Single-molecule sequencing: towards clinical applications
.
Trends Biotechnol
2019
;
37
(
1
):
72
85
.

3.

Liao
X
,
Li
M
,
Zou
Y
, et al.
Current challenges and solutions of de novo assembly
.
Quant Biol
2019
;
7
(
2
):
1
20
.

4.

Ding
X
,
Guo
X
.
A survey of snp data analysis
.
Big Data Min Anal
2018
;
1
(
3
):
173
90
.

5.

Hou
Y
,
Guo
H
,
Cao
C
, et al.
Single-cell triple omics sequencing reveals genetic, epigenetic, and transcriptomic heterogeneity in hepatocellular carcinomas
.
Cell Res
2016
;
26
(
3
):
304
19
.

6.

Lightbody
G
,
Haberland
V
,
Browne
F
, et al.
Review of applications of high-throughput sequencing in personalized medicine: barriers and facilitators of future progress in research and clinical application
.
Brief Bioinform
2019
;
20
(
5
):
1795
811
.

7.

Naccache
SN
,
Federman
S
,
Veeraraghavan
N
, et al.
A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples
.
Genome Res
2014
;
24
(
7
):
1180
92
.

8.

Chaisson
MJP
,
Wilson
RK
,
Eichler
EE
.
Genetic variation and the de novo assembly of human genomes
.
Nat Rev Genet
2015
;
16
(
11
):
627
40
.

9.

Ghurye
J
,
Pop
M
.
Modern technologies and algorithms for scaffolding assembled genomes
.
PLoS Comput Biol
2019
;
15
(
6
):
e1006994
.

10.

Ghurye
JS
,
Cepeda-Espinoza
V
,
Pop
M
.
Metagenomic assembly: overview, challenges and applications
.
Yale J Biol Med
2016
;
89
(
3
):
353
62
ISBN, 1551-4056
.

11.

Simpson
JT
,
Pop
M
.
The theory and practice of genome sequence assembly
.
Annu Rev Genomics Hum Genet
2015
;
16
:153–72.

12.

Luo
J
,
Chen
R
,
Zhang
X
, et al.
LROD: an overlap detection algorithm for long reads based on k-mer distribution
.
Front Genet
2020
;
11
:
632
.

13.

Eid
J
,
Fehr
A
,
Gray
J
, et al.
Real-time DNA sequencing from single polymerase molecules
.
Science
2009
;
323
(
5910
):
133
8
.

14.

Jain
M
,
Olsen
HE
,
Paten
B
, et al.
The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community
.
Genome Biol
2016
;
17
(
1
):
239
.

15.

Kraft
F
,
Kurth
I
.
Long-read sequencing in human genetics
.
Medizinische genetik
2019
;
31
(
2
):
198
204
.

16.

van
Dijk
EL
,
Jaszczyszyn
Y
,
Naquin
D
, et al.
The third revolution in sequencing technology
.
Trends Genet
2018
;
34
(
9
):
666
81
.

17.

Sedlazeck
FJ
,
Lee
H
,
Darby
CA
, et al.
Piercing the dark matter: bioinformatics of long-range sequencing and mapping
.
Nat Rev Genet
2018
;
19
(
6
):
329
46
.

18.

Alonge
M
,
Soyk
S
,
Ramakrishnan
S
, et al.
Fast and accurate reference-guided scaffolding of draft genomes
.
BioRxiv
2019
;519637.

19.

Laehnemann
D
,
Borkhardt
A
,
McHardy
AC
.
Denoising DNA deep sequencing data—high-throughput sequencing errors and their correction
.
Brief Bioinform
2016
;
17
(
1
):
154
79
.

20.

Bowden
R
,
Davies
RW
,
Heger
A
, et al.
Sequencing of human genomes with nanopore technology
.
Nat Commun
2019
;
10
(
1
):
1
9
.

21.

Dohm
JC
,
Peters
P
,
Stralis-Pavese
N
, et al.
Benchmarking of long-read correction methods
.
NAR Genom Bioinform
2020
;
2
(
2
):
lqaa037
.

22.

Morisse
P
,
Lecroq
T
,
Lefebvre
A
.
Long-read error correction: a survey and qualitative comparison
.
BioRxiv
2020
.

23.

Zhang
H
,
Jain
C
,
Aluru
S
.
A comprehensive evaluation of long read error correction methods
.
BMC Genomics
2020
;
21
(
6
):
1–15
.

24.

Wenger
AM
,
Peluso
P
,
Rowell
WJ
, et al.
Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome
.
Nat Biotechnol
2019
;
37
(
10
):
1155
62
.

25.

Hon
T
,
Mars
K
,
Young
G
, et al.
Highly accurate long-read HiFi sequencing data for five complex genomes
.
Scientific data
2020
;
7
(
1
):
1
11
.

26.

Putnam
NH
,
O'Connell
BL
,
Stites
JC
, et al.
Chromosome-scale shotgun assembly using an in vitro method for long-range linkage
.
Genome Res
2016
;
26
(
3
):
342
50
.

27.

Edge
P
,
Bafna
V
,
Bansal
V
.
HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies
.
Genome Res
2017
;
27
(
5
):
801
12
.

28.

Burton
JN
,
Adey
A
,
Patwardhan
RP
, et al.
Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions
.
Nat Biotechnol
2013
;
31
(
12
):
1119
25
.

29.

Dudchenko
O
,
Batra
SS
,
Omer
AD
, et al.
De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds
.
Science
2017
;
356
(
6333
):
92
5
.

30.

Zheng
GXY
,
Lau
BT
,
Schnall-Levin
M
, et al.
Haplotyping germline and cancer genomes with high-throughput linked-read sequencing
.
Nat Biotechnol
2016
;
34
(
3
):
303
11
.

31.

Mendelowitz
L
,
Pop
M
.
Computational methods for optical mapping
.
GigaScience
2014
;
3
(
1
):2047-217X-3-33.

32.

Hunt
M
,
Kikuchi
T
,
Sanders
M
, et al.
REAPR: a universal tool for genome assembly evaluation
.
Genome Biol
2013
;
14
(
5
):
R47
.

33.

Murphy
RR
,
O’Connell
J
,
Cox
AJ
, et al.
NxRepair: error correction in de novo sequence assembly using Nextera mate pairs
.
PeerJ
2015
;
3
:
e996
.

34.

Li
M
,
Wu
B
,
Yan
X
, et al.
PECC: correcting contigs based on paired-end read distribution
.
Comput Biol Chem
2017
;
69
:
178
84
.

35.

Wu
B
,
Li
M
,
Liao
X
, et al.
MEC: Misassembly error correction in contigs based on distribution of paired-end reads and statistics of GC-contents
.
IEEE/ACM Trans Comput Biol Bioinform
2018
;
17
(
3
):847–57.

36.

Langmead
B
.
Aligning short sequencing reads with Bowtie
.
Curr Protoc Bioinformatics
2010
;
32
(
1
):11.7. 1-11.7. 14.

37.

Langmead
B
,
Salzberg
SL
.
Fast gapped-read alignment with Bowtie 2
.
Nat Methods
2012
;
9
(
4
):
357
.

38.

Li
H
,
Durbin
R
.
Fast and accurate short read alignment with burrows–wheeler transform
.
Bioinformatics
2009
;
25
(
14
):
1754
60
.

39.

Chaisson
MJ
,
Tesler
G
.
Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory
.
BMC Bioinformatics
2012
;
13
(
1
):
238
.

40.

Li
H
.
Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences
.
Bioinformatics
2016
;
32
(
14
):
2103
10
.

41.

Li
H
.
Minimap2: pairwise alignment for nucleotide sequences
.
Bioinformatics
2018
;
34
(
18
):
3094
100
.

42.

Schmid
M
,
Frei
D
,
Patrignani
A
, et al.
Pushing the limits of de novo genome assembly for complex prokaryotic genomes harboring very long, near identical repeats
.
Nucleic Acids Res
2018
;
46
(
17
):
8953
65
.

43.

Kolmogorov
M
,
Yuan
J
,
Lin
Y
, et al.
Assembly of long, error-prone reads using repeat graphs
.
Nat Biotechnol
2019
;
37
(
5
):
540
6
.

44.

Celniker
SE
,
Dillon
LAL
,
Gerstein
MB
, et al.
Unlocking the secrets of the genome
.
Nature
2009
;
459
(
7249
):
927
30
.

45.

Nik-Zainal
S
,
Davies
H
,
Staaf
J
, et al.
Landscape of somatic mutations in 560 breast cancer whole-genome sequences
.
Nature
2016
;
534
(
7605
):
47
54
.

46.

Dunham
I
,
Birney
E
,
Lajoie
BR
, et al.
An integrated encyclopedia of DNA elements in the human genome
.
2012
;
489
:57–74.

47.

1000 Genomes Project Consortium
.
A global reference for human genetic variation
.
Nature
2015
;
526
(
7571
):
68
74
.

48.

Chaisson
MJP
,
Huddleston
J
,
Dennis
MY
, et al.
Resolving the complexity of the human genome using single-molecule sequencing
.
Nature
2015
;
517
(
7536
):
608
11
.

49.

Luo
R
,
Liu
B
,
Xie
Y
, et al.
SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler
.
Gigascience
2012
;
1
(
1
):2047-217X-1-18.

50.

Donmez
N
,
Brudno
M
.
SCARPA: scaffolding reads with practical algorithms
.
Bioinformatics
2013
;
29
(
4
):
428
34
.

51.

Mandric
I
,
Zelikovsky
A
.
ScaffMatch: scaffolding algorithm based on maximum weight matching
.
Bioinformatics
2015
;
31
(
16
):
2632
8
.

52.

Boetzer
M
,
Henkel
CV
,
Jansen
HJ
, et al.
Scaffolding pre-assembled contigs using SSPACE
.
Bioinformatics
2011
;
27
(
4
):
578
9
.

53.

Salmela
L
,
Mäkinen
V
,
Välimäki
N
, et al.
Fast scaffolding with small independent mixed integer programs
.
Bioinformatics
2011
;
27
(
23
):
3259
65
.

54.

Gao
S
,
Sung
WK
,
Nagarajan
N
.
Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences
.
J Comput Biol
2011
;
18
(
11
):
1681
91
.

55.

Sahlin
K
,
Vezzi
F
,
Nystedt
B
, et al.
BESST-efficient scaffolding of large fragmented assemblies
.
BMC Bioinformatics
2014
;
15
(
1
):
281
.

56.

Lindsay
J
,
Salooti
H
,
Măndoiu
I
, et al.
Ilp-based maximum likelihood genome scaffolding[C]//BMC bioinformatics
.
BioMed Central
2014
;
15
(
S9
):
S9
.

57.

Rahman
A
,
Pachter
L
.
SWALO: scaffolding with assembly likelihood optimization
.
bioRxiv
2016
;081786.

58.

Dayarian
A
,
Michael
TP
,
Sengupta
AM
.
SOPRA: scaffolding algorithm for paired reads via statistical optimization
.
BMC Bioinformatics
2010
;
11
(
1
):
345
.

59.

Gritsenko
AA
,
Nijkamp
JF
,
Reinders
MJT
, et al.
GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies
.
Bioinformatics
2012
;
28
(
11
):
1429
37
.

60.

Roy
RS
,
Chen
KC
,
Sengupta
AM
, et al.
SLIQ: simple linear inequalities for efficient contig scaffolding
.
J Comput Biol
2012
;
19
(
10
):
1162
75
.

61.

Bodily
PM
,
Fujimoto
MS
,
Snell
Q
, et al.
ScaffoldScaffolder: solving contig orientation via bidirected to directed graph reduction
.
Bioinformatics
2016
;
32
(
1
):
17
24
.

62.

Pop
M
,
Kosack
DS
,
Salzberg
SL
.
Hierarchical scaffolding with Bambus
.
Genome Res
2004
;
14
(
1
):
149
59
.

63.

Koren
S
,
Treangen
TJ
,
Pop
M
.
Bambus 2: scaffolding metagenomes
.
Bioinformatics
2011
;
27
(
21
):
2964
71
.

64.

Luo
J
,
Wang
J
,
Zhang
Z
, et al.
BOSS: a novel scaffolding algorithm based on an optimized scaffold graph
.
Bioinformatics
2017
;
33
(
2
):
169
76
.

65.

Briot
N
,
Chateau
A
,
Coletta
R
, et al. An integer linear programming approach for genome scaffolding[C]. In: ,
2014
.

66.

Lindsay
J
,
Salooti
H
,
Zelikovsky
A
, et al.
Scalable genome scaffolding using integer linear programming
.
Proceedings of the ACM Conference on Bioinformatics Computational Biology and Biomedicine
2012
;377–
83
.

67.

Li
M
,
Tang
L
,
Liao
Z
, et al.
A novel scaffolding algorithm based on contig error correction and path extension
.
IEEE/ACM Trans Comput Biol Bioinform
2018
;
16
(
3
):
764
73
.

68.

Farrant
GK
,
Hoebeke
M
,
Partensky
F
, et al.
WiseScaffolder: an algorithm for the semi-automatic scaffolding of next generation sequencing data
.
BMC Bioinformatics
2015
;
16
(
1
):
281
.

69.

Weller
M
,
Chateau
A
,
Giroudeau
R
.
Exact approaches for scaffolding
.
BMC Bioinformatics
2015
;
16
(
S14
):
S2
.

70.

Mandric
I
,
Zelikovsky
A
.
Solving scaffolding problem with repeats
.
bioRxiv
2018
;330472.

71.

Li
M
,
Tang
L
,
Wu
FX
, et al.
SCOP: a novel scaffolding algorithm based on contig classification and optimization
.
Bioinformatics
2019
;
35
(
7
):
1142
50
.

72.

Jiao
Y
,
Peluso
P
,
Shi
J
, et al.
Improved maize reference genome with single-molecule technologies
.
Nature
2017
;
546
(
7659
):
524
7
.

73.

Jain
M
,
Koren
S
,
Miga
KH
, et al.
Nanopore sequencing and assembly of a human genome with ultra-long reads
.
Nat Biotechnol
2018
;
36
(
4
):
338
45
.

74.

Michael
TP
,
Jupe
F
,
Bemm
F
, et al.
High contiguity Arabidopsis thaliana genome assembly with a single nanopore flow cell
.
Nat Commun
2018
;
9
(
1
):
1
8
.

75.

Boetzer
M
,
Pirovano
W
.
SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information
.
BMC Bioinformatics
2014
;
15
(
1
):
211
.

76.

Zhu
S
,
Chen
DZ
,
Emrich
SJ
.
Single molecule sequencing-guided scaffolding and correction of draft assemblies
.
BMC Genomics
2017
;
18
(
10
):
879
.

77.

Warren
RL
,
Yang
C
,
Vandervalk
BP
, et al.
LINKS: scalable, alignment-free scaffolding of draft genomes with long reads
.
GigaScience
2015
;
4
(
1
):s13742-015-0076-3.

78.

Z.
Ning
.
SMIS (Single Molecular Integrative Scaffolding): an assembly pipeline to improve genome scaffolding using Oxford Nanopore or PacBio long reads
. Retrieved 1 September
2020
from https://www.sanger.ac.uk/tool/smis/

79.

Warren
RL
.
RAILS and cobbler: scaffolding and automated finishing of draft genomes using long DNA sequences
.
J Open Source Software
2016
;
1
(
7
):
116
.

80.

Ye
C
,
Hill
CM
,
Wu
S
, et al.
DBG2OLC: efficient assembly of large genomes using long erroneous reads of the third generation sequencing technologies
.
Sci Rep
2016
;
6
:
31900
.

81.

Cao
MD
,
Nguyen
SH
,
Ganesamoorthy
D
, et al.
Scaffolding and completing genome assemblies in real-time with nanopore sequencing
.
Nat Commun
2017
;
8
(
1
):
1
10
.

82.

Qin
M
,
Wu
S
,
Li
A
, et al.
LRScaf: improving draft genomes using long noisy reads
.
BMC Genomics
2019
;
20
(
1
):
955
.

83.

Luo
J
,
Lyu
M
,
Chen
R
, et al.
SLR: a scaffolding algorithm based on long reads and contig classification
.
BMC Bioinform
2019
;
20
(
1
):
539
.

84.

Yeo
S
,
Coombe
L
,
Warren
RL
, et al.
ARCS: scaffolding genome drafts with linked reads
.
Bioinformatics
2018
;
34
(
5
):
725
31
.

85.

Coombe
L
,
Zhang
J
,
Vandervalk
BP
, et al.
ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers
.
BMC Bioinform
2018
;
19
(
1
):
1
10
.

86.

Adey
A
,
Kitzman
JO
,
Burton
JN
, et al.
In vitro, long-range sequence information for de novo genome assembly via transposase contiguity
.
Genome Res
2014
;
24
(
12
):
2041
9
.

87.

Kuleshov
V
,
Snyder
MP
,
Batzoglou
S
.
Genome assembly from synthetic long read clouds
.
Bioinformatics
2016
;
32
(
12
):
i216
24
.

88.

Hiltunen
M
,
Ryberg
M
,
Johannesson
H
.
AnVIL: an overlap-aware genome assembly scaffolder for linked reads
.
BioRxiv
2020
.

89.

Lieberman-Aiden
E
,
Van Berkum
NL
,
Williams
L
, et al.
Comprehensive mapping of long-range interactions reveals folding principles of the human genome
.
Science
2009
;
326
(
5950
):
289
93
.

90.

Kaplan
N
,
Dekker
J
.
High-throughput genome scaffolding from in vivo DNA interaction frequency
.
Nat Biotechnol
2013
;
31
(
12
):
1143
7
.

91.

Marie-Nelly
H
,
Marbouty
M
,
Cournac
A
, et al.
High-quality genome (re) assembly using chromosomal contact data
.
Nat Commun
2014
;
5
(
1
):
1
10
.

92.

Baudry
L
,
Guiglielmoni
N
,
Marie-Nelly
H
, et al.
instaGRAAL: chromosome-level quality scaffolding of genomes using a proximity ligation-based scaffolder
.
Genome Biol
2020
;
21
(
1
):
1
22
.

93.

Ghurye
J
,
Pop
M
,
Koren
S
, et al.
Scaffolding of long read assemblies using long range contact information
.
BMC Genomics
2017
;
18
(
1
):
1
11
.

94.

Ghurye
J
,
Rhie
A
,
Walenz
BP
, et al.
Integrating Hi-C links with assembly graphs for chromosome-scale assembly
.
PLoS Comput Biol
2019
;
15
(
8
):
e1007273
.

95.

Renschler
G
,
Richard
G
,
Valsecchi
CIK
, et al.
Hi-C guided assemblies reveal conserved regulatory topologies on X and autosomes despite extensive genome shuffling
.
Genes Dev
2019
;
33
(
21-22
):
1591
612
.

96.

Nagarajan
N
,
Read
TD
,
Pop
M
.
Scaffolding and validation of bacterial genome assemblies using optical restriction maps
.
Bioinformatics
2008
;
24
(
10
):
1229
35
.

97.

Lin
HC
,
Goldstein
S
,
Mendelowitz
L
, et al.
AGORA: assembly guided by optical restriction alignment
.
BMC Bioinformatics
2012
;
13
(
1
):
189
.

98.

Dong
Y
,
Xie
M
,
Jiang
Y
, et al.
Sequencing and automated whole-genome optical mapping of the genome of a domestic goat (Capra hircus)
.
Nat Biotechnol
2013
;
31
(
2
):
135
41
.

99.

Saha
S
,
Rajasekaran
S
.
Efficient and scalable scaffolding using optical restriction maps
.
BMC Genomics
2014
;
15
(
S5
):
S5
.

100.

Shelton
JM
,
Coleman
MC
,
Herndon
N
, et al.
Tools and pipelines for BioNano data: molecule assembly pipeline and FASTA super scaffolding tool
.
BMC Genomics
2015
;
16
(
1
):
734
.

101.

Pan
W
,
Jiang
T
,
Lonardi
S
.
OMGS: optical map-based genome scaffolding
.
J Comput Biol
2020
;
27
(
4
):
519
33
.

102.

Kolmogorov
M
,
Armstrong
J
,
Raney
BJ
, et al.
Chromosome assembly of large and complex genomes using multiple references
.
Genome Res
2018
;
28
(
11
):
1720
32
.

103.

van
Hijum
SAFT
,
Zomer
AL
,
Kuipers
OP
, et al.
Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies
.
Nucleic Acids Res
2005
;
33
(
suppl_2
):
W560
6
.

104.

Richter
DC
,
Schuster
SC
,
Huson
DH
.
OSLay: optimal syntenic layout of unfinished assemblies
.
Bioinformatics
2007
;
23
(
13
):
1573
9
.

105.

Rissman
AI
,
Mau
B
,
Biehl
BS
, et al.
Reordering contigs of draft genomes using the mauve aligner
.
Bioinformatics
2009
;
25
(
16
):
2071
3
.

106.

Husemann
P
,
Stoye
J
.
r2cat: synteny plots and comparative assembly
.
Bioinformatics
2010
;
26
(
4
):
570
1
.

107.

Bosi
E
,
Donati
B
,
Galardini
M
, et al.
MeDuSa: a multi-draft based scaffolder
.
Bioinformatics
2015
;
31
(
15
):
2443
51
.

108.

Muñoz
A
,
Zheng
C
,
Zhu
Q
, et al.
Scaffold filling, contig fusion and comparative gene order inference
.
BMC Bioinformatics
2010
;
11
(
1
):
304
.

109.

Dias
Z
,
Dias
U
,
Setubal
JC
.
SIS: a program to generate draft genome sequence scaffolds for prokaryotes
.
BMC Bioinformatics
2012
;
13
(
1
):
96
.

110.

Lu
CL
,
Chen
KT
,
Huang
SY
, et al.
CAR: contig assembly of prokaryotic draft genomes using rearrangements
.
BMC Bioinformatics
2014
;
15
(
1
):
381
.

111.

Chen
KT
,
Chen
CJ
,
Shen
HT
, et al.
Multi-CAR: a tool of contig scaffolding using multiple references
.
BMC Bioinformatics
2016
;
17
(
17
):
185
92
.

112.

Chen
KT
,
Lu
CL
.
CSAR-web: a web server of contig scaffolding using algebraic rearrangements
.
Nucleic Acids Res
2018
;
46
(
W1
):
W55
9
.

113.

Chen
KT
,
Liu
CL
,
Huang
SH
, et al.
CSAR: a contig scaffolding tool using algebraic rearrangements
.
Bioinformatics
2018
;
34
(
1
):
109
11
.

114.

Paten
B
,
Earl
D
,
Nguyen
N
, et al.
Cactus: algorithms for genome multiple sequence alignment
.
Genome Res
2011
;
21
(
9
):
1512
28
.

115.

Kurtz
S
,
Phillippy
A
,
Delcher
AL
, et al.
Versatile and open software for comparing large genomes
.
Genome Biol
2004
;
5
(
2
):
R12
.

116.

Gao
S
,
Bertrand
D
,
Chia
BKH
, et al.
OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees
.
Genome Biol
2016
;
17
(
1
):
102
.

117.

Deng
L
,
Guo
L
,
Xu
M
, et al.
SLR-superscaffolder: a de novo scaffolding tool for synthetic long reads using a top-to-bottom scheme
.
BioRxiv
2019
;762385.

118.

Li
YI
,
Copley
RR
.
Scaffolding low quality genomes using orthologous protein sequences
.
Bioinformatics
2013
;
29
(
2
):
160
5
.

119.

Gertz
EM
,
Yu
YK
,
Agarwala
R
, et al.
Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST
.
BMC Biol
2006
;
4
(
1
):
1
14
.

120.

Zhu
BH
,
Song
YN
,
Xue
W
, et al.
PEP_scaffolder: using (homologous) proteins to scaffold genomes
.
Bioinformatics
2016
;
32
(
20
):
3193
5
.

121.

Kent
WJ
.
BLAT—the BLAST-like alignment tool
.
Genome Res
2002
;
12
(
4
):
656
64
.

122.

Salzberg
SL
,
Phillippy
AM
,
Zimin
A
, et al.
GAGE: a critical evaluation of genome assemblies and assembly algorithms
.
Genome Res
2012
;
22
(
3
):
557
67
.

123.

Gurevich
A
,
Saveliev
V
,
Vyahhi
N
, et al.
QUAST: quality assessment tool for genome assemblies
.
Bioinformatics
2013
;
29
(
8
):
1072
5
.

124.

Hunt
M
,
Newbold
C
,
Berriman
M
, et al.
A comprehensive evaluation of assembly scaffolding tools
.
Genome Biol
2014
;
15
(
3
):
R42
.

125.

Mandric
I
,
Knyazev
S
,
Zelikovsky
A
.
Repeat-aware evaluation of scaffolding tools
.
Bioinformatics
2018
;
34
(
15
):
2530
7
.

126.

Ou
S
,
Liu
J
,
Chougule
KM
, et al.
Effect of sequence depth and length in long-read assembly of the maize inbred NC358
.
Nat Commun
2020
;
11
(
1
):
1
10
.

127.

Liu
J
,
Seetharam
AS
,
Chougule
K
, et al.
Gapless assembly of maize chromosomes using long-read technologies
.
Genome Biol
2020
;
21
(
1
):
1
17
.

128.

Yu
J
,
Golicz
AA
,
Lu
K
, et al.
Insight into the evolution and functional characteristics of the pan-genome assembly from sesame landraces and modern cultivars
.
Plant Biotechnol J
2019
;
17
(
5
):
881
92
.

129.

Delaneau
O
,
Zagury
JF
,
Robinson
MR
, et al.
Accurate, scalable and integrative haplotype estimation
.
Nat Commun
2019
;
10
(
1
):
1
10
.

130.

Yang
J
,
Moeinzadeh
MH
,
Kuhl
H
, et al.
Haplotype-resolved sweet potato genome traces back its hexaploidization history
.
Nature plants
2017
;
3
(
9
):
696
703
.

131.

Low
WY
,
Tearle
R
,
Liu
R
, et al.
Haplotype-resolved genomes provide insights into structural variation and gene content in Angus and Brahman cattle
.
Nat Commun
2020
;
11
(
1
):
1
14
.

132.

Xu
W
,
Zhang
Q
,
Yuan
W
, et al.
The genome evolution and low-phosphorus adaptation in white lupin
.
Nat Commun
2020
;
11
(
1
):
1
13
.

133.

Cai
X
,
Wu
J
,
Liang
J
, et al.
Improved Brassica oleracea JZS assembly reveals significant changing of LTR-RT dynamics in different morphotypes
.
Theor Appl Genet
2020
;
133
(
11
):
3187
99
.

134.

Wallberg
A
,
Bunikis
I
,
Pettersson
OV
, et al.
A hybrid de novo genome assembly of the honeybee, Apis mellifera, with chromosome-length scaffolds
.
BMC Genomics
2019
;
20
(
1
):
275
.

135.

Xu
CQ
,
Liu
H
,
Zhou
SS
, et al.
Genome sequence of Malania oleifera, a tree with great value for nervonic acid production
.
GigaScience
2019
;
8
(
2
):
giy164
.

136.

Rispe
C
,
Legeai
F
,
Nabity
PD
, et al.
The genome sequence of the grape phylloxera provides insights into the evolution, adaptation, and invasion routes of an iconic pest
.
BMC Biol
2020
;
18
(
1
):
1
25
.

137.

Li
SF
,
Wang
J
,
Dong
R
, et al.
Chromosome-level genome assembly, annotation and evolutionary analysis of the ornamental plant Asparagus setaceus
.
Hortic Res
2020
;
7
(
1
):
1
11
.

138.

Arimoto
A
,
Hikosaka-Katayama
T
,
Hikosaka
A
, et al.
A draft nuclear-genome assembly of the acoel flatworm Praesagittifera naikaiensis
.
GigaScience
2019
;
8
(
4
):
giz023
.

139.

Tsai
IJ
,
Otto
TD
,
Berriman
M
.
Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps
.
Genome Biol
2010
;
11
(
4
):
R41
.

140.

Boetzer
M
,
Pirovano
W
.
Toward almost closed genomes with GapFiller
.
Genome Biol
2012
;
13
(
6
):
R56
.

141.

Paulino
D
,
Warren
RL
,
Vandervalk
BP
, et al.
Sealer: a scalable gap-closing application for finishing draft genomes
.
BMC Bioinformatics
2015
;
16
(
1
):
1
8
.

142.

Salmela
L
,
Sahlin
K
,
Mäkinen
V
, et al.
Gap filling as exact path length problem
.
J Comput Biol
2016
;
23
(
5
):
347
61
.

143.

Luo
J
,
Wang
J
,
Shang
J
, et al.
GapReduce: a gap filling algorithm based on partitioned read sets
.
IEEE/ACM Trans Comput Biol Bioinform
2018
.

144.

English
AC
,
Richards
S
,
Han
Y
, et al.
Mind the gap: upgrading genomes with Pacific biosciences RS long-read sequencing technology
.
PLoS One
2012
;
7
(
11
):
e47768
.

145.

Kosugi
S
,
Hirakawa
H
,
Tabata
S
.
GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments
.
Bioinformatics
2015
;
31
(
23
):
3733
41
.

146.

Xu
GC
,
Xu
TJ
,
Zhu
R
, et al.
LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly
.
GigaScience
2019
;
8
(
1
):
giy157
.

147.

Xu
M
,
Guo
L
,
Gu
S
, et al.
TGS-GapCloser: fast and accurately passing through the Bermuda in large genome using error-prone third-generation long reads
.
bioRxiv
2019
;831248.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data