The draft nuclear genome assembly of Eucalyptus pauciflora: a pipeline for comparing de novo assemblies

Abstract Background Eucalyptus pauciflora (the snow gum) is a long-lived tree with high economic and ecological importance. Currently, little genomic information for E. pauciflora is available. Here, we sequentially assemble the genome of Eucalyptus pauciflora with different methods, and combine multiple existing and novel approaches to help select the best genome assembly. Findings We generated high coverage of long- (Nanopore, 174×) and short- (Illumina, 228×) read data from a single E. pauciflora individual and compared assemblies from 5 assemblers (Canu, SMARTdenovo, Flye, Marvel, and MaSuRCA) with different read lengths (1 and 35 kb minimum read length). A key component of our approach is to keep a randomly selected collection of ∼10% of both long and short reads separated from the assemblies to use as a validation set for assessing assemblies. Using this validation set along with a range of existing tools, we compared the assemblies in 8 ways: contig N50, BUSCO scores, LAI (long terminal repeat assembly index) scores, assembly ploidy, base-level error rate, CGAL (computing genome assembly likelihoods) scores, structural variation, and genome sequence similarity. Our results showed that MaSuRCA generated the best assembly, which is 594.87 Mb in size, with a contig N50 of 3.23 Mb, and an estimated error rate of ∼0.006 errors per base. Conclusions We report a draft genome of E. pauciflora, which will be a valuable resource for further genomic studies of eucalypts. The approaches for assessing and comparing genomes should help in assessing and choosing among many potential genome assemblies from a single dataset.

Introduction

[...] produce less fragmented assemblies at a fraction of the cost of previous methods. Nevertheless, many challenges still remain, not least of which is that different genome assembly software, and small changes to the parameters of a single piece of software, can produce substantially different assemblies. In light of this, methods for choosing the most accurate assembly from a set of possible assemblies have become increasingly important.

Two metrics are commonly used to assess and compare genome assemblies: contig N50 and Benchmarking Universal Single-Copy Orthologs (BUSCO [18], RRID:SCR_015008) scores. The contig N50 is the size of the contig such that at least 50% of the assembled nucleotides can be found in contigs of that size or larger.
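The N50 definition above can be made concrete with a minimal sketch. The function name `contig_n50` and the toy contig lengths are assumptions for illustration only; they are not taken from the assemblies or scripts in this study.

```python
def contig_n50(lengths):
    """Return the contig N50 for an iterable of contig lengths (in bases).

    The N50 is the length of the contig at which the cumulative length,
    counted from the largest contig downwards, first reaches at least
    half of the total assembled length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths): total = 300, half = 150.
# The two largest contigs (80 + 70) already cover 150 bases.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(toy_contigs))  # prints 70
```

In practice the lengths would be read from the assembly FASTA file; the sketch only illustrates the arithmetic behind the metric.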

N50 is a measure of genome contiguity, where a higher N50 suggests a genome that has been assembled into fewer and larger contigs.

[...] representation of a diploid assembly, the minimum possible base-level error rate will be higher, because by necessity a haploid representation of a heterozygous site will not [...]

Here, we used long- and short-reads to create a draft haploid assembly of the E. [...]

Sample collection, DNA sequencing and quality control

We collected leaves from the single E. pauciflora tree near Thredbo, Kosciuszko [...]

Creation of assembly and validation datasets

We separated our long-read and short-read data into assembly and validation datasets by randomly assigning the trimmed and filtered reads into the two datasets with custom [...]

Genome assembly

Here, we compared seven long-read-only assemblies and two hybrid assemblies. For each combination of data and genome assembler, we followed the same genome assembly pipeline. We first used the assembler to produce an initial assembly.

Following this, we identified and removed contigs from contaminant sequences, and then polished the resulting assembly. We then identified and removed haplotigs from the assembly. Each assembly was re-polished after haplotig removal.

[...] our data (174x). We then put the corrected long-read datasets into two sets for assembly.

The first dataset contained all corrected long-reads, such that the minimum read length was 1 kb (174x of coverage). The second dataset contained all corrected reads longer than 35 kb (~40x of coverage). We refer to these datasets as the 1 kb and the 35 kb datasets, respectively.
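The two read-length subsets and their coverage can be illustrated with a minimal sketch. The function names, the toy read lengths, and the toy genome size below are assumptions for illustration, not values or scripts from this study. Coverage is taken here as total read bases divided by genome size, and the length cutoff is treated as inclusive, which the text does not specify.

```python
def filter_by_min_length(read_lengths, min_length):
    """Keep only reads whose length is at least min_length (in bases)."""
    return [n for n in read_lengths if n >= min_length]


def coverage(read_lengths, genome_size):
    """Approximate sequencing depth: total read bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


# Hypothetical toy data: six read lengths and a 10 kb "genome".
toy_reads = [1_000, 5_000, 12_000, 36_000, 40_000, 60_000]
toy_genome_size = 10_000

reads_1kb = filter_by_min_length(toy_reads, 1_000)
reads_35kb = filter_by_min_length(toy_reads, 35_000)

print(len(reads_1kb), coverage(reads_1kb, toy_genome_size))
print(len(reads_35kb), coverage(reads_35kb, toy_genome_size))
```

The same arithmetic explains why raising the minimum read length from 1 kb to 35 kb lowers the coverage: shorter reads are dropped, so fewer bases remain per position of the genome.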

We first compared the performance of corrected and uncorrected long-reads for assembling the genome with two efficient assemblers, Flye [...] Supplementary result). The results showed clearly that corrected long-reads produced better assemblies than uncorrected long-reads using Flye, while the differences with wtdbg2 were less pronounced (Table S1). Nevertheless, the Flye assemblies with corrected reads were the best overall, so we decided to use corrected long-reads for the rest of the assemblies in the study.

We attempted eight long-read-only assemblies and two hybrid assemblies. Assemblies [...]

In what follows, we refer to these assemblies as Canu_1kb, Canu_35kb, [...]

[...] Pilon because MaSuRCA is a hybrid assembler, and using error-prone long-reads to polish hybrid assemblies tends to induce more errors rather than remove them (Additional file 5: Table S2).

We ran each polishing algorithm for multiple iterations until the accuracy of the resulting assembly stopped improving or improved only slightly. We assessed the improvements using BUSCO scores and the base-level error rate by re-mapping validation long- and short-reads to each assembly (mapped as above). We evaluated the BUSCO scores using BUSCO with the embryophyta_odb9 lineage (1440 genes in total).
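The stopping rule for iterative polishing described above can be sketched as a simple loop. This is a simplified illustration under stated assumptions, not the procedure used in this study: it tracks only a single error-rate score (the text also uses BUSCO scores), and the names `polish_until_converged`, `polish_once`, `validation_error_rate`, the `min_gain` threshold, and the toy error rates are all hypothetical.

```python
def polish_until_converged(assembly, polish, error_rate,
                           min_gain=1e-4, max_rounds=10):
    """Polish repeatedly while the error rate drops by more than min_gain.

    `polish` maps an assembly to a polished assembly and `error_rate`
    scores an assembly (lower is better). Both are caller-supplied
    placeholders; min_gain and max_rounds are hypothetical settings.
    """
    best = assembly
    best_rate = error_rate(best)
    history = [best_rate]
    for _ in range(max_rounds):
        candidate = polish(best)
        rate = error_rate(candidate)
        if best_rate - rate <= min_gain:
            break  # no improvement, or only a slight one: stop here
        best, best_rate = candidate, rate
        history.append(rate)
    return best, history


# Toy stand-ins: an "assembly" is just its polishing round number, and
# the error rate per round is looked up in a made-up table.
toy_rates = [0.020, 0.010, 0.007, 0.00695, 0.0069]


def polish_once(round_number):
    return round_number + 1


def validation_error_rate(round_number):
    return toy_rates[min(round_number, len(toy_rates) - 1)]


final_round, history = polish_until_converged(
    0, polish_once, validation_error_rate)
print(final_round, history)  # prints 2 [0.02, 0.01, 0.007]
```

In this toy run the third round improves the error rate by only 0.00005, which is below the assumed threshold, so the result of the second round is kept.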

The high assembly ploidy for some assemblies after running Purge Haplotigs suggested that these assemblies retained haplotigs that covered up to 29% of the genome. We [...] (Table 2).

Following removal of haplotigs, we re-evaluated each assembly using BUSCO scores (Fig. 2B and 2C). We noted that, depending on the genome assembly, the number of complete BUSCO genes sometimes dropped and sometimes increased slightly after removing haplotigs (Fig. 2B) [...]

The other three metrics assess the correctness of every assembly, and also suggest that the best assemblies for our data are produced by MaSuRCA (Table 3) [...]

Finally, to further investigate the different assemblies, we compared the genome sequence similarity between different assemblies using the NUCmer module of MUMmer (Fig. 4), with the minimum identity set to 75. Notably, around 8% of the sequence of the Canu/SMARTdenovo/Flye/MaSuRCA assemblies failed to align to the Marvel_35kb assembly (Fig. 4), which, along with the low genome completeness (BUSCO scores) of the Marvel_35kb assembly ( [...]

Based on the eight metrics we used above (Table 3), we suggest that the MaSuRCA_35kb assembly is the most accurate representation of the E. pauciflora genome. We note, though, that the Flye assembler took only 1-3% of the runtime of the other assemblers used in this paper (Table 1), and produced genome assemblies that were of similar quality to the MaSuRCA_35kb assembly in many respects. The Marvel_35kb assembly received the worst scores on many metrics, and also appears to be missing roughly 10% of the genome according to BUSCO scores and genome sequence similarity analyses compared to other assemblies (Table 3).

[...] pauciflora and E. grandis genomes (Fig. 5B).

The authors declare that they have no competing financial interests.