A critical comparison of technologies for a plant genome sequencing project

Abstract Background A high-quality genome sequence of any model organism is an essential starting point for genetic and other studies. Older clone-based methods are slow and expensive, whereas faster, cheaper short-read–only assemblies can be incomplete and highly fragmented, which minimizes their usefulness. The last few years have seen the introduction of many new technologies for genome assembly. These new technologies and associated new algorithms are typically benchmarked on microbial genomes or, if they scale appropriately, on larger (e.g., human) genomes. However, plant genomes can be much more repetitive and larger than the human genome, and plant biochemistry often makes obtaining high-quality DNA that is free from contaminants difficult. Reflecting their challenging nature, we observe that plant genome assembly statistics are typically poorer than for vertebrates. Results Here, we compare Illumina short read, Pacific Biosciences long read, 10x Genomics linked reads, Dovetail Hi-C, and BioNano Genomics optical maps, singly and combined, in producing high-quality long-range genome assemblies of the potato species Solanum verrucosum. We benchmark the assemblies for completeness and accuracy, as well as DNA compute requirements and sequencing costs. Conclusions The field of genome sequencing and assembly is reaching maturity, and the differences we observe between assemblies are surprisingly small. We expect that our results will be helpful to other genome projects, and that these datasets will be used in benchmarking by assembly algorithm developers.

.1: Summary of short read library sequencing. Insert size is given where a mode is appropriate.

Germplasm
S. verrucosum accesssion Ver-54 was grown in the glass house in James Hutton Institute in Scotland. Both fresh and frozen leaves from this accession and its clones were used for DNA extraction.

DNA extraction 1.2.1 CTAB DNA extraction
Genomic DNA was extracted using CTAB lysis, phenol-chloroform, and Qiagen MagAttract HMW DNA Kit (QIAGEN, Manchester, UK) purification. Young frozen S. verrucosum Ver-54 leaves (5 g) were ground to a fine powder using a liquid nitrogen cooled pestle and mortar, distributed over two 50 ml falcon tubes and mixed with 20 ml CTAB buffer (100mM Tris-HCl, pH 8.0, 2 % (w/v) CTAB, 1.4 M NaCl, 20 mM EDTA) containing 20 µg/ml proteinase K and incubated at 55 • C for 20 minutes. Next, 0.5 volumes (8 ml) chloroform was added and carefully mixed by 15× inversion followed by centrifugation at 2990×g on an Eppendorf Centrifuge 5810 R (Eppendorf, Stevenage, UK) for 30 minutes. The aqueous phase was carefully transferred into a new tube to which 1 volume phenol/chloroform/isoamyl alcohol was added followed by centrifugation for 30 minutes at 2990×g. The aqueous phase was ethanol precipitated by addition of 3 M sodium acetate (1/10 of DNA volume), (pH 5.2) and 2.5 volumes of ethanol, mixed and precipitated at 2990×g and 4 • C. DNA pellets were washed with ice-cold 70 % ethanol, air dried and resuspended in 350 µl of 1× TE buffer (10 mM Tris-HCl, 1 mM EDTA pH 8.0) containing 10 U RNase One (Promega, Southampton, UK). The DNA was dissolved overnight at 4 • C with occasional mixing by inversion. Finally, we purified the DNA with the QIAGEN MagAttract HMW DNA Kit with minor changes to the manufacture instructions at the beginning; we added 150 µl AL Buffer to 200 µl DNA (10 µg in TE) and 15 µl magnetic bead suspension. From this point we followed the Quick-Start Protocol and we eluted the DNA in 200 µl of 1× TE buffer.

IrysPrep DNA extraction
Fresh young leaves of the S. verrucosum Ver-54 accession were collected after 48-hour treatment in the dark. Earlham Institute's Platforms and Pipelines group followed IrysPrep "Fix'n'Blend" Plant DNA extraction protocol supplied by BioNano Genomics. 2.5 g of fresh young leaves were fixed with 2 % formaldehyde. After washing, leaves are disrupted and homogenized in the presence of isolation buffer. The isolation buffer contains PVP10 and BME to prevent oxidation of polyphenols. Triton X-100 is added to facilitate the release of nuclei from the broken cells.
The nuclei are then purified on a Percoll cushion. A nuclei phase is taken and washed several times in isolation buffer before embedding into low melting point agarose. 2 plugs of 90 µl were cast using the CHEF Mammalian Genomic DNA Plug Kit . Once set at 4 • C the plugs were added to a lysis solution containing 200 µl proteinase K (QIAGEN 158920) and 2.5 ml of BioNano lysis buffer in a 50 ml conical tube. These were put at 50 • C for 2 hours on a thermomixer, making a fresh proteinase K solution to incubate overnight. The 50 ml tubes were then removed from the thermomixer for 5 minutes before 50 µl RNAse A (Qiagen158924) was added and the tubes returned to the thermomixer for a further hour at 37 • C.The plugs were then washed 7 times in Wash Buffer supplied in Chef kit and 7 times in 1xTE. One plug was removed and melted for 2 minutes at 70 • C followed by 5 minutes at 43 • C before adding 10 µl of 0.2 U /µl of GELase (Cambio Ltd G31200). After 45 minutes at 43 • C the melted plug was dialysed on a 0.1 µM membrane (Millipore VCWP04700) sitting on 15 ml of 1xTE in a small petri dish. After 2 hours the sample was removed with a wide bore tip and mixed gently 5 times and left overnight at 4 • C.

Tight and Long Library (TALL) paired-end library construction protocol
DNA was extracted using the CTAB protocol given in Section 1.2.1. A total of 3 µg of DNA was sheared in a 60 µl volume on a Covaris S2 (Covaris, Massachusetts, USA) for 1 cycle of 40 seconds with a duty cycle of 5 %, cycles per burst of 200 and intensity of 3. The fragmented DNA was then subjected to size selection on a Blue Pippin (Sage Science, Beverly, USA). The 40 µl in each of collection wells was replaced with fresh buffer and the separation and elution current checked prior to loading the sample. To 30 µl of the end repaired molecules 10 µl of R2 marker solution was added and then loaded onto a 1.5 % Cassette. The Blue Pippin was configured to collect fragments at 800 bp using the tight settings. Post size selection, the 40 µl from the collection well was recovered and the size isolated estimated on High Sensitivty BioAnalyser Chip and DNA concentration determined using a Qubit HS Assay. The size selected molecules were then end repaired in 100 µl volume using the NEB End Repair Module (NEB, Hitchin, UK) incubating the reaction at 22 • C for 30 minutes. Post incubation 100 µl beads of CleanPCR beads (GC Biotech, Alphen aan den Rijn, The Netherlands) were added and the DNA precipitated onto the beads. They were then washed twice with 70 % ethanol and the end repaired molecules eluted in 25 µl Nuclease free water (Qiagen, Manchester, UK).
End repaired molecules were then A tailed in 30 µl volume using in the NEB A tailing module (NEB) incubating the reaction at 37 • C for 30 minutes. To the A tailed library molecules 1 µl of an appropriate Illumina TruSeq Index adapter (Illumina, San Diego, USA) was added and mixed then 31 µl of Blunt/ TA ligase (NEB) added and incubated at 22 • C for 10m. Post incubation 5 µl of stop ligation was added and then the reaction incubated at room temperature for 5 minutes. Following this incubation 67 µl beads of CleanPCR beads (GC Biotech, Alphen aan den Rijn, The Netherlands) were added and the DNA precipitated onto the beads. They were then washed twice with 70 % ethanol and the end repaired molecules eluted in 100 µl nuclease free water. Two further 1× CleanPCR bead based purifications were undertaken to remove any adapter dimer molecules that may have formed during the adapter ligation step and the final library eluted in 25 µl Resuspension Buffer (Illumina).
Library QC was performed by running a 1 µl aliquot on a High Sensitivity BioAnalyser chip (Agilent, Stockport, UK) and the DNA concentration measured using the High Sensitivity Qubit (Thermo Fisher, Cambridge, UK). To determine the number of viable library molecules the library was subjected to quantification by the Kappa qPCR Illumina quantification kit (Kapa Biosystems, London, UK) and then sequenced on the Hiseq2500 (Illumina) with a 2×150 bp read metric.

Amplification free paired-end library construction (DISCOVAR)
DNA was extracted using the CTAB protocol given in Section 1.2.1. A total of 600 ng of DNA was sheared in a 60 µl volume on a Covaris S2 (Covaris, Massachusetts, USA) for 1 cycle of 40 seconds with a duty cycle of 5 %, cycles per burst of 200 and intensity of 3. The fragmented molecules were then end repaired in 100 µl volume using the NEB End Repair Module (NEB, Hitchin, UK) incubating the reaction at 22 • C for 30 minutes. Post incubation 58 µl beads of CleanPCR beads (GC Biotech, Alphen aan den Rijn, The Netherlands) were added using a positive displacement pipette to ensure accuracy and the DNA precipitated onto the beads. They were then washed twice with 70 % ethanol and the end repaired molecules eluted in 25 µl Nuclease free water (Qiagen, Manchester, UK).
End repaired molecules were then A tailed in 30 µl volume using in the NEB A tailing module (NEB) incubating the reaction at 37 • C for 30 minutes. To the A tailed library molecules 1 µl of an appropriate Illumina TruSeq Index adapter (Illumina, San Diego, USA) was added and mixed then 31 µl of Blunt/TA ligase (NEB) added and incubated at 22 • C for 10m. Post incubation 5 µl of stop ligation was added and then the reaction incubated at room temperature for 5 minutes. Following this incubation 67 µl beads of CleanPCR beads (GC Biotech, Alphen aan den Rijn, The Netherlands) were added and the DNA precipitated onto the beads. They were then washed twice with 70 % ethanol and the end repaired molecules eluted in 100 µl nuclease free water. Two further CleanPCR bead based purifications were undertaken to remove any adapter dimer molecules that may have formed during the adapter ligation step. The first with 0.9× volume beads, the second with 0.6× and the final library eluted in 25 µl Resuspension Buffer (Illumina).
Library QC was performed by running a 1 µl aliquot on a High Sensitivity BioAnalyser chip (Agilent, Stockport, UK) and the DNA concentration measured using the High Sensitivity Qubit (Thermo Fisher, Cambridge, UK). To determine the number of viable library molecules the library was subjected to quantification by the Kappa qPCR Illumina quantification kit (Kapa Biosystems, London, UK) and then sequenced at a loading concentration of 9 pM on the Hiseq2500s (Illumina) with a 2×250bp read metric.

Long mate pair library construction protocol
DNA was extracted using the CTAB protocol given in Section 1.2.1. For the Tagmentation reactions 4 µg of Genomic DNA was prepared in 308 µl and then 80 µl 5× Tagment Buffer Mate Pair (Illumina, San Diego, USA) added followed by 12 µl Mate Pair Tagmentation Enzyme (Illumina) and the reaction gently vortexed to mix. This was then incubated for 30 minutes at 55 • C, 100 µl of Neutralize Tagment Buffer (Illumina) added and then incubated at room temperature for 5 minutes. A 1× volume bead clean-up was performed with CleanPCR beads (GC Biotech, Alphen aan den Rijn, The Netherlands) and the DNA eluted in 165 µl of Nuclease free Water . A 1 µl aliquot was run on a BioAnalyser 1200 chip and DNA concentration determined using a Qubit HS Assay.
Strand Displacement was performed by combining 162 µl of tagmented DNA, 20 µl 10× Strand Displacement Buffer (Illumina), 8 µl dNTPs (Illumina) and 10 µl Strand Displacement Polymerase (Illumina). This was then incubated at room temperature for 30 minutes. A 0.75× volume bead clean-up was performed with CleanPCR beads and the DNA eluted in 16 µl of Nuclease free Water and the eluted DNA from the 3 µg and 6 µg reactions pooled. A 1 µl aliquot was diluted 1:6 and run on a BioAnalyser 1200 chip chip (Agilent, Stockport, UK) and DNA concentration determined using a Qubit HS Assay (Thermo Fisher, Cambridge, UK).
Size selection was performed on a Sage Science Blue Pippin (Sage Science, Beverly, USA). The 40 µl in the collection well was replaced with fresh buffer and the collection and elution current checked prior to loading the sample. To 30 µl of the pooled Strand Displaced reaction 10 µl of loading solution was added and then loaded onto a 0.75 % Cassette which was configured to collect fragments at 10 kbp using the tight setting. Post size selection, the 40 µl from the collection well was recovered and the DNA concentration determined using a Qubit HS Assay.
This was then incubated at 30 • C overnight. Linear DNA was digested by adding 3.75 µl Exonuclease (Illumina) and incubating at 37 • C for 30 minutes followed by 70 • C for 30 minutes to denature the enzyme and 5 µl of stop ligation (Illumina) added. During exonuclease treatment 20 µl of M280 Dynabeads (Thermo Fisher) were prepared by washing twice with 100 µl Bead Bind Buffer (Illumina) before resuspending in 130 µl Bead Bind Buffer. Circularised DNA was then sheared in a 130 µl volume on a Covaris S2 for 2 cycles of 37 seconds with a duty cycle of 10 %, cycles per burst of 200 and intensity of 4.
To 130 µl fragmented DNA 130 µl of washed M280 beads was added, mixed and then placed on a lab rotator at room temperature for 20 minutes. Library molecules bound to M280 beads were then washed four times with 200 µl Bead Washer Buffer (Illumina) and twice with 200 µl Resuspension Buffer (Illumina).
A master mix containing 85 µl nuclease free water (Qiagen, Manchester, UK), 10 µl 10× End Repair Reaction Buffer (NEB, Hitchin, UK) and 5 µl end repair enzyme mix (NEB) was prepared and added to the tube, mixed with the beads and incubated at room temperature for 30 minutes. End repaired library molecules bound to M280 beads were then washed four times with 200 µl Bead Washer Buffer and twice with 200 µl Resuspension Buffer.
A master mix containing 25 µl nuclease free water, 3 µl A Tailing 10× Reaction Buffer (NEB) and 2 µl A tailing enzyme mix (NEB) was prepared and added to the tube, mixed with the beads and incubated at 37 • C for 30 minutes.
To the A tailed library molecules 1 µl of the appropriate Illumina Index adapter (Illumina) was added and mixed then 31 µl of Blunt/TA ligase (NEB) added and incubated at room temperature for 10 minutes. Post incubation 5 µl of stop ligation added and then the adapter ligated library molecules bound to M280 beads were then washed four times with 200 µl Bead Washer Buffer and twice with 200 µl Resuspension Buffer.
A master mix containing 20 µl nuclease free water, 25 µl 2× Kappa HiFi (Kappa Biosystems) and 5 µl Illumina Primer Cocktail (Illumina) was prepared and added to each tube, mixed with the beads and the contents, including beads, transferred to a 200 µl PCR tube. Each sample was then subjected to amplification on a Veriti Thermal Cycler (Thermo Fisher) with the following conditions:-98 • C for 3 minutes, 15 cycles of PCR of 98 • C for 10 seconds, 60 • C for 30 seconds, 72 • C for 30 seconds followed by 72 • C for 5 minutes and Hold at 4 • C.
Post amplification the PCR tube was placed on a magnetic plate, the beads allowed to pellet and then 45 µl of the PCR transferred to a 2 ml Lobind Eppendorf Tube. To this 31.5 µl beads of CleanPCR beads were added to precipitate the DNA, the beads washed twice with 70 % ethanol and the final library eluted in 20 µl resuspension buffer. Library QC was performed by running a 1 µl aliquot on a High Sensitivity BioAnalyser chip and the DNA concentration measured using the High Sensitivity Qubit. The quantification of the pool was determined by the Kappa qPCR Illumina quantification kit with the pool run at 10 pM on a MiSeq with a 2×300 bp reads read metric.

PacBio library preparation
DNA was extracted using the CTAB protocol given in Section 1.2.1. We created a 20 kbp fragment length library according to the manufacturer's instructions as 20 kbp Template Preparation Using BluePippinTM Size-Selection System, and sequenced 65 SMRT cells with the P6C4 chemistry on the PacBio RSII instrument.
The total yield was 32 Gb of Data, the final N50 of read length was 13 499 bp. The total coverage of the raw data was 50×.

10x library preparation
DNA was extracted using the CTAB protocol given in Section 1.2.1. DNA material was diluted to 1.1ng/ul with EB (Qiagen) and checked with a QuBit Flourometer 2.0 (Invitrogen) using the QuBit dsDNA HS Assay kit. The Chromium User Guide was followed as per the manufacturer's instructions (10X Genomics, PN-120229).
The final library was quantified using qPCR (KAPA Library Quant kit (Illumina), ABI Prism qPCR Mix, Kapa Biosystems). Sizing of the library fragments were checked using a Bioanalyzer (High Sensitivity DNA Reagents, Agilent). Samples were pooled based on the molarities calculated using the two QC measurements.
The library was clustered at 8 pM with a 1 % spike in of PhiX library (Illumina). The pool was run on a HiSeq2500 250 bp Rapid Run V2 mode (Illumina). The following run metrics were applied: Read 1: 250 cycles, Index 1: 8 cycles, Index 2: 0 cycles and Read 2: 250 cycles.

BioNano preparation
DNA was extracted using the IrysPrep protocol given in Section 1.2.2. A small amount was removed to QC on an Opgen Argus Q-Card and Qubit HS for the DNA concentration. 300 ng of DNA was taken into the NLRS (Nick, Label, Repair and Stain) reaction using 1 µl Nt.BspQI (NEB R0644S). Following the NLRS reaction 16 µl was loaded onto a single flow cell on a BioNano chip. The Chip loading was optimised and run for 30 cycles on the BioNano Irys using ICS1.6. The same chip was run a total of 8 times on each side to generate 252 Gb of raw data with molecule length over 100 kbp with a nick density of 6.79/100 kbp. Images were converted to .bnx files using AutoDetect 2.1.0.6656 before analysis.
HMW DNA was isolated and the nicking endonuclease Nt.BspQ1 (New England BioLabs) was used label highquality HMW DNA molecules at specific sequence motifs (GCTCTTC). The nicked DNA molecules were then stained according to the instructions of IrysPrep Reagent Kit (BioNano Genomics).
The HMW DNA with flourescent labeles was loaded onto the nanochallen array of the IrysChip (BioNano Genomics) and was automatically imaged by the Irys system (BioNano Genomics). Raw DNA molecules of at least 100 kbp were collected and converted into BNX files by AutoDetect software to obtain basic labeling and DNA length information. The filtered raw DNA molecules in BNX format were aligned, clustered, and assembled into the BNG map by using the BioNano Genomics assembly pipeline as described in previous publications (40, 41). The P value thresholds used for pairwise assembly, extension/refinement, and final refinement stages were 1×109, 1×1010, and 1×1010, respectively. The initial BNG map was then checked for potential chimeric BNG contigs and was further refined. To compare the draft sequence assembly with the BNG map, sequences were digested in silico according to the restriction site of Nt.BspQ1 by using Knickers (BioNano Genomics). The alignment of sequence assemblies with the BNG map was computed with RefAligner, and the visualization of the alignment was performed with snapshot in IrysView (http://bionanogenomics.com/support/software-downloads/).

Dovetail extraction and library preparation
20 g of fresh leaf material was sent to Dovetail Genomics, LLC (Santa Cruz, CA, USA). They extracted HMW DNA that was used to construct a Chicago library.

BAC library
Very young, partially expanded leaves from dark treated S. verrucosum Ver-54 plantlets were harvested and immediately frozen in liquid nitrogen. A 5× pooled BAC library was generated by Bio S&T (Canada) and contained 192 pools of approximately 600 independent recombinants per pool. For the library generation, genomic DNA was subjected to HindIII restriction enzyme digest, ligated in to the vector pindigoBAC-5 (HindIII-Cloning Ready) (Epicentre) and transformed into compatible DH10B cells (Invitrogen). The average insert size was estimated to be 125 kbp following NotI restriction enzyme digest and PFGE separation of nine randomly selected clones. Each well from one plate was then grown on a single plate and a single colony from each was selected for sequencing.

ABySS
The Tight Amplification free Library (TALL) was assembled using ABySS. We used RAMPART [mapleson_rampart:_2015] to automate the running of ABySS and select an appropriate k value (k = 113 was selected). The configuration files for RAMPART is given in Supplemental File S1 (pair_end_potato.xml. We used ABySS version 1.5.1 and RAMPART version 0.10.3. The command use to run Rampart was: rampart pair_end_potato.xml

Mate-pair scaffolding
The raw reads were processed using NextClip which filters out paired-end reads (reads which do not include the Nextera adaptor), and removes the adaptor from the mate-pair reads. We used NextClip version 1.3 which depends on R version 2.15.2, bwa vesrion 0.6.2 and texlive version 1.2.2013. We used the following command for NextClip: nextclip -d -m 30 -t 0 -n 600000000 -l output.log \ -i mp_R1.fastq -j mp_R2.fastq -o nextclip_potato The number of reads before processing was 44 864 446, which were divided into four categories based on whether the Nextera adaptor was found in the reads or not. The reads with adaptor were further filtered for sequences that were long enough (less than 25 bp), and total of 33 044 388 (73.65 %) were deemed usable. Total 11 173 807 647 (15.4×) was written for a file to be used in the scaffolding stage.

PacBio
Python version 2.7.9 was used for running the PacBio software.

Falcon
We used Falcon version 0.3.0 and the following command line: fc_run.py fc_run_ver.cfg The configuration file is in fc_run_ver.cfg (Supplemental File S6) which needs the list of input files input.fofn (Supplemental File S3).

Canu
We used Canu version 1.0 with the following command line: We initially used filtered subreads from the standard PacBio pipeline and then did a second assembly using subreads from the HGAP 3 pipeline (the filtered reads are the result of running the HGAP pipeline, below, up to the P_PreAssemblerDagcon stage).

HGAP
We used HGAP 3 as part of smrtanalysis 2.3.0p5 and the following command line:

10x Genomics Supernova
To assemble the 10x Genomics library we used Supernova version 1.1.1 and the following commands: supernova run --id=verrusocum5 \ --fastqs=potato/10x/supernova/fastq \ --sample=potato supernova mkoutput --asmdir=genome-assembly/10x/verrusocum5/outs/assembly \ --outprefix=pseudohap_ver_500 \ --style=pseudohap --minsize=500 The manufacturer recommends the use of 150 bp paired-end reads with a coverage of between 38× to 56×. Since our library was higher coverage, and with 250 bp reads, we generated subsamples of the read sets to simulate the use of 150 bp reads and coverages of 48× and 52×. The reads for each subsample were selected using a pseudorandom number generator with a known seed. Three different seeds were produced for each subsample to provide triplicates for testing. We found that in each case the subsampled libraries with shorter read lengths produced assemblies with significantly less contiguity. Our original assembly has an N50 of 2.38 Mbp, while the 52× version has 2.15 Mbp and the 48× version has 1.94 Mbp.

Scaffolding
Scaffolding is performed using the optical map assembly and one of our S. verrucosum genome assemblies.

Dovetail
Dovetail scaffolding was performed for the discovar-mp and falcon assemblies by Dovetail Ltd. using HiRise.

BAC assembly 2.6.1 Illumina
The BAC reads were filtered to remove any phiX, E. coli, and the pIndigo5 BAC vector sequence. The filtered reads were then assembled using DISCOVAR de novo and the assembled contigs were filtered to remove any residual pieces of BAC vector and contigs under 500 bp in length.

PacBio
The BAC pools were assembled with HGAP as part of smrtanlysis 2.3.0p5 with standard settings.

Evaluation
All of the code used to perform the evaluation is available on GitHub in the georgek/potato-figures repository. The complete process from raw data to figures can be found at https://georgek.github.io/potato-figures/ figures.html.

K-mer content
In Section we used KAT to see the k-mer content of each assembly. Since we were comparing the assemblies to the Illumina reads, we polished the Falcon assembly using Pilon to reduce the difference due to the sequencing platform. Figure S3.1 shows the KAT plots for Falcon with and without the polishing step for comparison.

Synteny
The complete set of mummer plots showing alignments to the S. tuberosum reference v4.04 for the Falcon assembly, DISCOVAR assembly, and Supernova assembly are shown in Figures S3.2, S3.3, and S3.4, respectively.