de novo assembly and population genomic survey of natural yeast isolates with the Oxford Nanopore MinION sequencer

Abstract
Background: Oxford Nanopore Technologies Ltd (Oxford, UK) have recently commercialized MinION, a small single-molecule nanopore sequencer that offers the possibility of sequencing long DNA fragments from small genomes in a matter of seconds. The Oxford Nanopore technology is truly disruptive; it has the potential to revolutionize genomic applications due to its portability, low cost, and ease of use compared with existing long-read sequencing technologies. The MinION sequencer enables the rapid sequencing of small eukaryotic genomes, such as the yeast genome. Combined with existing assembler algorithms, near-complete genome assemblies can be generated and comprehensive population genomic analyses can be performed.
Results: Here, we resequenced the genome of the Saccharomyces cerevisiae S288C strain to evaluate the performance of nanopore-only assemblers. Then we de novo sequenced and assembled the genomes of 21 isolates representative of the S. cerevisiae genetic diversity using the MinION platform. The contiguity of our assemblies was 14 times higher than that of the Illumina-only assemblies, and we obtained one or two long contigs for 65% of the chromosomes. This high contiguity allowed us to accurately detect large structural variations across the 21 studied genomes.
Conclusion: Because of the high completeness of the nanopore assemblies, we were able to produce a complete cartography of transposable element insertions and inspect structural variants that are generally missed using a short-read sequencing strategy. Our analyses show that the Oxford Nanopore technology is already usable for de novo sequencing and assembly; however, non-random errors in homopolymers require polishing the consensus using an alternate sequencing technology.


Today, long-read sequencing technology offers interesting alternatives to solve genome assembly difficulties and improve the completeness of genome assemblies, mostly in repetitive regions (Jain et al. 2015) where short-read sequencing has failed. Microbial or small eukaryotic genomes can now be fully assembled using Oxford Nanopore (Loman et al. 2015) or Pacific Biosciences reads alone (Chin et al. 2013; Koren and Phillippy 2015) or in combination with short but high-quality reads (Koren et al. 2012; Goodwin et al. 2015; Madoui et al. 2015). Application of the single-molecule real-time (SMRT) sequencing platform to large complex eukaryotic genomes demonstrated the possibility of considerably improving genome assembly quality (Huddleston et al. 2014; Chaisson et al. 2015). Similar improvements were also accomplished using the 10x Genomics platform, and its application to the human genome produced encouraging results (Mostovoy et al. 2016; Zheng et al. 2016) and showed the importance of obtaining long and high-quality reads.

The most used sequencing technologies are based on the synthesis of new DNA strands, including the Illumina and Pacific Biosciences technologies (Mardis 2008). These sequencing technologies, based on optical detection of nucleotide incorporations, are often commercialized through large and expensive instruments. For example, the cost of the commercially available Pacific Biosciences RS II instrument is high, and the infrastructure and implementation needs make it inaccessible to large sections of the research community. This year Oxford Nanopore Technologies Ltd (ONT, Oxford, UK) commercialized MinION, a single-molecule nanopore sequencer that can be connected to a laptop through a USB interface (Loman and Watson 2015; Deamer et al. 2016). This system is portable (close to the size of a harmonica) and low-cost (currently USD 1,000 for the instrument).
The MinION technology is based on an array of nanopores embedded on a chip that detects consecutive 6-mers of a single-strand DNA molecule by electrical sensing (Kasianowicz et al. 1996; Cherf et al. 2012; Manrao et al. 2012; Laszlo et al. 2014). In addition to its small size and low price, this new technology has several advantages over the older technologies. Library construction involves a simplified method, no amplification step is needed, and data acquisition and analyses occur in real time (Loose et al. 2016). Library preparation can be performed in two ways: (i) a 10-minute library preparation based on an enzymatic method for '1D' sequencing (sequencing one strand of the DNA) or (ii) a library preparation based on ligation for '2D' sequencing (sequencing both the template and complement strands of the DNA). In the 2D sequencing mode, the two strands of a DNA molecule are linked by a hairpin and sequenced consecutively. When the two strands of the molecule are read successfully, a consensus sequence is built to obtain a more accurate read (called a 2D read). Otherwise, only the template or complement strand sequence is provided (called a 1D read).

Here, we sequenced the genomes of 22 Saccharomyces cerevisiae isolates to determine if the MinION system could be used in population genomic projects that require a deeper view of the genetic variation landscape. Even though the throughput of MinION was still heterogeneous, we were able to perform the sequencing in a reasonable time using six MinION devices. First, we resequenced the Saccharomyces cerevisiae S288C reference genome using a nanopore long-read sequencing strategy to evaluate recent assembly methods. We generated a complete benchmark of the assembly structures, as well as the completeness of complex regions. Next, we selected 21 strains of S. 
cerevisiae that were genetically diverse, based on preliminary results of the 1002 Yeast Genomes Project, a large-scale short-read resequencing project (http://1002genomes.u-strasbg.fr/). The genomes of these 21 strains were de novo sequenced and assembled with Nanopore long reads to gain better insight into the variation of their genomic architecture. We obtained near-complete assemblies in terms of genes, as well as transposable elements and telomeric regions. The most contiguous assembly produced a single contig per chromosome, except for chromosomes 3 and 12, the latter of which contains the large repeated rDNA cluster.

MinION data evaluation
We first sequenced the S288C genome by doing 11 MinION Mk1 runs with the R7.3 chemistry. On average, a 48-hour run produced more than 200 Mb of sequence, and the best run throughput was 400 Mb. Two 2D library types with 8 kb and 20 kb mean fragmentation sizes were used. They yielded nearly 360,000 reads with a cumulative length of approximately 2.3 Gb, and 63% of the nucleotides were in 2D reads, which represented 187x and 118x genome coverage for 1D and 2D reads, respectively. Template reads had a median length of 8.9 kb, while 2D reads had a median length of 7.7 kb. All sequencing reads were aligned to the S288C reference genome using BWA (Li and Durbin 2009) to assess their quality. We successfully aligned 95.6% of the 2D reads with an average error rate of 17.2% (Figure 1a). ONT tagged high-quality 2D reads as "2D pass" reads (reads with an average per-base quality higher than 9), and 99.7% of the 2D pass reads were aligned to the reference genome with an average error rate of 12.2%. We then parsed the alignment files to search for errors in stretches of the same nucleotide (homopolymers). About 85% of A, T, C, and G homopolymers of size 2 were present correctly in the reads. This percentage decreased rapidly for homopolymers of size 4, to 65% for A and T homopolymers and to 70% for C and G homopolymers. For size 7 homopolymers, it was 30% for A and T homopolymers and 35% for C and G homopolymers (Figure S1a).

We also sequenced the S288C genome using the R9 chemistry, the recently released version of the pore. We obtained approximately 1 Gb of reads; 568 Mb were 2D reads, which represents an 85x coverage with 1D reads and a 47x coverage with 2D reads. The mean 2D length was 6.1 kb. We aligned 82.1% of the 1D reads with a mean identity percentage of 82.8% and 94.3% of the 2D reads with a mean identity percentage of 85.2% (Figure 1b).
As we did with the R7.3 reads, we also searched for errors in homopolymers (Figure S1b). The proportion of correct A, T, C, and G homopolymers started at about 90% for size 2, then decreased to 75% for A and T homopolymers of size 4 and to 60% for the C and G homopolymers. For size 7 homopolymers, it was 32% for A, T, and C homopolymers and 35% for G homopolymers.

Comparison of Nanopore-only assemblers
We tested Canu (Berlin et al. 2015), Miniasm (Li 2016), SMARTdenovo (Ruan) and ABruijn with different subsets of 1D, 2D, and 2D pass reads (Supplementary File 2) and kept the best assembly for each assembler.

With Canu, the best assembly was obtained with the whole set of 2D pass reads (67x coverage). The assembly was composed of 37 contigs with a cumulative length of 12 Mb, and seven chromosomes were assembled in one or two contigs. After aligning the contigs to the S288C reference genome using Quast (Gurevich et al. 2013), we detected a high number of deletions (120,365), which were often localized in homopolymers (58%). As a consequence, only 454 of the 6,243 genes found in the assembly were insertion/deletion (indel)-free (Table S1). With Miniasm, the best assembly was obtained using the 2D reads corrected by Canu, which represented a coverage of approximately 108x. The Miniasm assembly was composed of 28 contigs with a cumulative length of 11.8 Mb, and 13 chromosomes were assembled in one or two contigs. The consensus sequence contained a high proportion of mismatches and indels. With SMARTdenovo, 30x of the longest 2D reads produced the best assembly. It was composed of 26 contigs, with a total length of 12 Mb, and 14 chromosomes were assembled in one or two contigs.
The SMARTdenovo assembly better covered the reference genome (>99%) and contained the highest number of genes (98.8% of the 6,350 S288C genes), but the Quast output again revealed a high number of deletions (128,050). With ABruijn, we obtained the best results using all the 2D reads as input, which represented a coverage of approximately 120x. The assembly contained 23 contigs with a cumulative length of 11.9 Mb, and 14 chromosomes were assembled in one or two contigs (Table S1).

Next, we aligned the assemblies (Canu, Miniasm, SMARTdenovo, and ABruijn) to the S288C reference genome using NUCmer (Kurtz et al. 2004), and visualized the alignments with mummerplot (Figures S2, S3, S4 and S5). We also examined the coordinates of the alignments to search for chimeras. We did not detect any chimeric contigs in the Canu, Miniasm, or SMARTdenovo assemblies; however, we did find some in the ABruijn assembly. Three chimeric contigs in the ABruijn assembly showed links between chromosomes 3 and 13 (first contig), chromosomes 3 and 2 (second contig), and chromosomes 10 and 2 (third contig). To verify that these contigs were indeed chimeric, we aligned the Nanopore reads back to the assembly and could not find any read that validated these links. Unsurprisingly, these three chimeric contigs were fused at Ty1 transposable element locations.

The alignment of each assembly to the reference genome showed that neither Canu, Miniasm, nor SMARTdenovo could assemble the mitochondrial (Mt) genome completely. Because ABruijn was the only assembler to assemble the complete Mt genome sequence, we decided to use it to assemble the Mt DNA of the remaining 21 yeast strains (see below).

Generally, long reads allow tandem duplicated genes to be resolved, for instance the CUP1 and ENA1-2 gene families.
We compared the maximum number of copies found in the Nanopore reads and the estimated number of copies based on Illumina read coverage of these two tandem-repeated genes with the number of copies of these two genes in the four assemblies (Table S2). After aligning the paired-end reads to the reference sequence and computing the coverage, we estimated that CUP1 and ENA1-2 were present in seven and four copies, respectively. The maximum numbers of copies of these genes in a single Nanopore read were eight for CUP1 and five for ENA1-2. The numbers of copies of CUP1 and ENA1-2 were, respectively, nine and three in the Canu assembly, seven and two in the Miniasm assembly, and seven and four in the SMARTdenovo and ABruijn assemblies.

The number of indels was high for every assembler. Thus, we tested Nanopolish, the most commonly used Nanopore-only error corrector. For this test, we used the SMARTdenovo assembly, which was the most continuous and gene-rich assembly, and all 2D reads. After the error correction step, the cumulative length of the contigs increased to 12.2 Mb and the N50 increased to 783 kb (at best it was 924 kb for the reference genome). The numbers of mismatches, insertions, and deletions decreased to 1,930, 7,707, and 17,445, respectively. The number of complete genes increased to 6,273, of which 2,590 were without an indel (Table S3).

Although all metrics were improved, the number of indels still seemed too high, especially in the coding regions of the genes. We decided to polish all assemblies with 2x250 bp Illumina paired-end reads, using Pilon (Walker et al. 2014), to verify whether the general quality of the assemblies improved. The polishing step increased the N50 of each assembly, and the maximum of 816 kb was obtained with the ABruijn assembly.
Pilon reduced the number of errors in each assembly, and the Canu and ABruijn assemblies had the best base quality, with about 16 mismatches (15.85 and 17.88 for Canu and ABruijn, respectively) and 22 indels (22.49 and 21.76 for Canu and ABruijn, respectively) per 100 kb. The SMARTdenovo assembly contained the highest number of complete genes (6,266) and the Canu assembly contained the highest number of genes without any indels (5,921) (Table 1).

Finally, we evaluated the composition of each assembly for various elements (genes, repeated elements, centromeres and telomeric regions). We also generated an Illumina-only assembly using the Spades assembler (Bankevich et al. 2012) to compare the number of features found in each assembly. All the assemblies contained nearly the same number of centromeres (120 bp regions in the reference genome assembly) and genes (Figure 2). The Nanopore assemblies contained between 45 and 50 Long Terminal Repeat (LTR) retrotransposons (average size of 5.8 kb), while the Illumina-only assembly contained only one. The smallest number of telomeres (three) was found in the ABruijn assembly, while nine, 18, 13, and 14 telomeres were found in the Illumina, Canu, Miniasm, and SMARTdenovo assemblies, respectively. The Illumina-only assembly contained five telomeric repeats (average size 100 bp), while the Nanopore-only assemblies contained between six and nine telomeric repeats. The ABruijn assembly contained the same number of genes encoded by the mitochondrial genome as the reference sequence because it was the only assembler to fully assemble the Mt genome.
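
The homopolymer analysis reported earlier in this section starts from a list of homopolymer stretches of each size. As a minimal sketch of that first step only, assuming a plain nucleotide string as input (the function names `homopolymer_runs` and `tally_runs` are illustrative and are not taken from the pipeline used in this study), the runs could be enumerated as follows:

```python
from collections import Counter
from itertools import groupby


def homopolymer_runs(seq, min_len=2):
    """Yield (base, start, length) for every run of identical bases
    that is at least min_len long. Positions are 0-based."""
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            yield base, pos, length
        pos += length


def tally_runs(seq, min_len=2):
    """Count homopolymer runs by (base, length)."""
    return Counter(
        (base, length) for base, _, length in homopolymer_runs(seq, min_len)
    )
```

Checking whether each reference run is reproduced at the same length in an aligned read would additionally require walking the alignment, which this sketch does not attempt.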

The R9 version of the pore was released too late for us to use it to sequence all the natural S. cerevisiae isolates. However, we did produce some data to compare the R7.3 and R9 assemblies. Because SMARTdenovo produced the best results, we used it to assemble the genome of the S288C strain. We input four different read datasets: all 1D and 2D reads, only 2D reads, 30x of the longest 2D reads, or 30x of the longest 1D and 2D reads (Table S4). This time, the dataset of 30x of the longest 1D and 2D reads gave the best results. Indeed, the continuity of the assembly increased, and the number of contigs decreased from 26 with the R7.3 assembly to 23 with the R9 assembly. The number of indels also decreased from 133,676 with the R7.3 version to 95,012 with the R9 version. A direct consequence of using the R9 version was that almost all the genes were found: 6,302 of the 6,350 known genes were complete, and 1,226 did not contain any indels.
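
Selecting "30x of the longest reads" amounts to sorting reads by length and accumulating them until the target depth is reached. The following is a minimal sketch under that assumption; the function name, the dictionary input, and the stopping rule (stop once the cumulative length first reaches the target) are illustrative choices, not the exact procedure used here.

```python
def select_longest_reads(read_lengths, genome_size, target_coverage):
    """Pick the longest reads until their cumulative length reaches
    genome_size * target_coverage.

    read_lengths: dict mapping read id -> read length in bases.
    Returns (selected ids, achieved coverage).
    """
    target = genome_size * target_coverage
    selected, total = [], 0
    ranked = sorted(read_lengths.items(), key=lambda kv: kv[1], reverse=True)
    for read_id, length in ranked:
        if total >= target:
            break
        selected.append(read_id)
        total += length
    return selected, total / genome_size
```

For a 12 Mb genome, `select_longest_reads(lengths, 12_000_000, 30)` would return the subset to pass to the assembler together with the depth actually achieved.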

To explore the variability of the genomic architecture within S. cerevisiae, 21 natural isolates were sequenced in addition to the S288C reference genome using the same strategy, namely, a combination of long Nanopore and short Illumina reads. Sequenced isolates were selected to include as much diversity as possible in terms of global locations (including Europe, China, Brazil, and Japan), ecological sources (such as fermented beverages, dairy products, trees, fruit, soil, and wine), as well as genetic variation highlighted in the frame of the extensive resequencing 1002 Yeast Genomes project (http://1002genomes.u-strasbg.fr/) (Table S5). Among these isolates, the nucleotide variability was distributed across 491,076 segregating sites and the genetic diversity, estimated by the average pairwise divergence (π), was 0.0062, which is close to what is observed for the whole species (Peter and Schacherer 2016).

A total of 78 MinION Mk1 runs were performed, and the highest throughput we obtained was 650 Mb (1D and 2D reads). This led to 1.4 million 2D reads with a cumulative length of 12 Gb. We obtained 2D coverage that ranged from 22x to 115x (Figure S6) among the strains, with a median read length of approximately 5.4 kb and a maximum size of 75 kb (Figure S7). In general, three runs or fewer were sufficient to obtain the expected coverage. Next, for each strain, we gave varying coverages of the longest 2D reads as input to SMARTdenovo and retained the most contiguous assembly. These assemblies were then given as input to Pilon for a polishing step with around 300x of Illumina paired-end reads. After polishing, we obtained a median number of contigs of 27.5 (Table 2); the minimum number was for the CEI strain (18 contigs) and the maximum was for the BAM strain (105 contigs).
The median cumulative length was 11.93 Mb and ranged from 11.83 Mb for the ADQ strain to 12.2 Mb for the CNT strain. The median N50 contig size was 593 kb and varied from 201 kb for the CIC strain to 896 kb for the ADQ strain. The L90 varied from 14 for the BCN, CEI, and CNT strains to 72 for the BAM strain, with a median equal to 19.5.

To assemble the mitochondrial (Mt) genome, we used all the 2D reads as input to ABruijn. As a result, we obtained an assembly for each strain and extracted the Mt genome after mapping the contigs against the reference Mt genome. As was the case for the chromosomes, we used Pilon with Illumina paired-end reads to obtain a corrected consensus sequence.
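
The N50 and L90 values quoted for these assemblies follow the usual definitions: the N50 is the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, first reaches 50% of the assembly size, and the L90 is the number of contigs needed to reach 90%. A minimal sketch, assuming only a list of contig lengths (the helper name `nx_lx` is hypothetical):

```python
def nx_lx(contig_lengths, fraction):
    """Return (Nx, Lx) for the given fraction of the total assembly length.

    Nx is the length of the contig at which the running sum of the
    length-sorted contigs first reaches fraction * total; Lx is the
    number of contigs used to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    threshold = fraction * sum(lengths)
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("contig_lengths must not be empty")
```

With this helper, `nx_lx(lengths, 0.5)[0]` gives the N50 and `nx_lx(lengths, 0.9)[1]` gives the L90.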

The availability of high-quality assemblies allowed us to establish an extensive map of the transposable elements (TEs) to obtain a global view of their content and positions within the 21 natural yeast isolates (Figure 3). Using a reference sequence for each of the five known TE families in yeast (namely Ty1 to Ty5), we mapped the TEs in each assembled genome. Among the 50 annotated TEs in the S288C reference genome, 47 were detected at the correct chromosomal locations in our assembly, but three Ty1 locations were not recovered. Seven additional Ty1 elements were found at unannotated sites, three of which have already been detected in the reference genome (Bleykasten-Grosshans et al. 2011). These results attest to the high accuracy of our assembly strategy for TE detection and localization. Among the 22 isolates, the TE content was highly variable (Table 3), ranging from five to 55 elements, with a median value of 15. While the frequency of the Ty4 and Ty5 elements was clearly low in all the isolates (up to four and two elements, respectively), the Ty1, Ty2, and Ty3 elements were found in most of the isolates. The most abundant TEs were Ty1 and Ty2, except in the Chinese BAM isolate, in which 12 Ty3 elements were detected. As already described (Bleykasten-Grosshans et al. 2013), the pattern of insertion of these mobile elements is either specific to a given isolate or shared by only a small number of isolates (mostly two or three). However, four insertion hotspots (shared by seven or more isolates) were highlighted on chromosomes 2, 3, and 9. The shared insertion hotspots were generally not specific to a particular Ty family, except for the hotspot located in a subtelomeric region of chromosome 3, which was specific to Ty5.

Structural variations such as copy number variants, large insertions and deletions, duplications, inversions and translocations are of great importance at the phenotypic variation level (Weischenfeldt et al. 2013). Compared with single nucleotide polymorphisms (SNPs) and small indels, these variants are usually more difficult to identify, in particular because resequencing strategies have until recently focused mainly on the generation of short reads and reference-based genome analysis. Nanopore long-read sequencing data allow the copy numbers of tandem genes to be determined. As a testbed, we focused on two loci that are known to contain multi-copy genes, namely ENA and CUP1. ENA genes encode plasma membrane Na+-ATPase exporters, which play a role in the detoxification of Na+ ions in S. cerevisiae. CUP1 genes encode metallothioneins, which bind copper and are involved in resistance to copper exposure by amplification of this locus. To determine the degree of divergence among the 21 strains, we searched for the numbers of copies of the CUP1 and ENA1-2 tandem-repeated genes in the assemblies (Table S6). For this purpose, we extracted the corresponding sequence from the S288C reference genome and aligned it to the assemblies of each strain. As expected and already reported (Strope et al. 2015), the copy numbers of ENA1-2 and CUP1 varied greatly across the strains. We found that the copy numbers of ENA genes in the 21 isolates ranged from one in 12 of the genomes to five in the BHH strain (Table S6). The copy numbers of CUP1 genes fluctuated even more, ranging from one to 10 copies in the ABH and AEG strains. We also determined the fitness of the 21 isolates in the presence of CuSO4 and observed a correlation between the number of CUP1 genes and the resistance of the strain to high concentrations of CuSO4 (Figure S8).
Besides copy number variants, we also focused on larger structural variants, such as translocations and inversions, because our highly contiguous assemblies allowed us to investigate these events. We aligned the polished assemblies of the 21 strains to the reference genome using NUCmer and inspected the alignments with the mummer software suite to search for structural variations. We detected 29 translocations and four inversions within the assemblies of 17 strains (Table S7). The median length of an inversion was 94 kb and their breakpoints were located mostly in intergenic regions. It is well recognized that SVs might play a major role in the genetic and phenotypic diversity in yeast (Hou et al. 2014; Naseeb et al. 2016). However, up to now, it was impossible to assemble and have an exhaustive view of the SV content in any S. cerevisiae natural isolate. Indeed, short-read sequencing approaches are not suitable for SV studies because they result in a high number of false-positive as well as false-negative events.

Among the detected events, one translocation between chromosomes 5 and 14 in the ABH isolate and another translocation between chromosomes 7 and 12 in the AVB isolate have already been described and confirmed in a reproductive isolation study in S. cerevisiae (Hou et al. 2014). A deeper investigation of our assemblies highlighted the presence of full-length Ty transposons at some junctions of the translocation events. For example, the complex Ty-rich junctions of the translocation between chromosomes 7 and 12 in the AVB isolate were in complete accordance with previously reported results (Hou et al. 2014). Our results underline the high resolution of the constructed assemblies, and show that complex events, such as translocations, can be detected accurately with our strategy.
Among the 22 isolates, six were devoid of translocation events whereas the other 16 carry one to four such structural rearrangements compared to the reference.

However, several limitations can be highlighted for these detections. Contrary to expectations, no translocation that specifically affected subtelomeric regions was identified, underlining the difficulty of discriminating regions that are variable and contain a large number of repeated segments. Moreover, the detection accuracy is highly dependent on the completeness of the assembly because, if translocation breakpoints are located on contig boundaries, they will not be detectable.
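
Inspecting alignment coordinates for translocations (or, as in the assembler benchmark, for chimeric contigs) comes down to finding contigs that align substantially to more than one reference chromosome. The sketch below illustrates that screening step only. It assumes the alignments have already been parsed into `(contig, chromosome, aligned_length)` tuples; the function name and the 10 kb threshold are illustrative assumptions, not parameters from this study.

```python
from collections import defaultdict


def multi_chromosome_contigs(alignments, min_aligned=10_000):
    """Flag contigs with substantial alignments to two or more chromosomes.

    alignments: iterable of (contig, chromosome, aligned_length) tuples.
    min_aligned: hypothetical minimum total aligned length (bp) for a
                 chromosome to count, to ignore short repeat-driven hits.
    Returns a dict mapping contig -> sorted list of chromosomes.
    """
    per_contig = defaultdict(lambda: defaultdict(int))
    for contig, chrom, aligned_len in alignments:
        per_contig[contig][chrom] += aligned_len

    candidates = {}
    for contig, by_chrom in per_contig.items():
        chroms = sorted(c for c, total in by_chrom.items() if total >= min_aligned)
        if len(chroms) >= 2:
            candidates[contig] = chroms
    return candidates
```

A flagged contig is only a candidate: distinguishing a genuine translocation from an assembly chimera still requires reads that span the junction, which is the check applied to the ABruijn contigs in the benchmark. Breakpoints that coincide with contig boundaries cannot be found this way.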

The ABruijn assembler allowed the construction of a single contig corresponding to the Mt genome for each isolate. To assess the quality of the assemblies, we aligned the polished S288C Mt contig to the reference sequence (GenBank: KP263414). Only four SNPs and indels totaling 15 bp were detected. For all but two natural isolates, all the Mt genes (eight protein-coding genes, two rRNA subunits and 24 tRNAs) were conserved and syntenic. The Mt genomes of the two remaining isolates (CNT and CFF) contained one and two repeated regions covering a total of 6.5 and 8 kb, respectively. In the CNT isolate, the repeated region was in the COX1 gene and affected its coding sequence. In the CFF isolate, the COX1, ATP6, and ATP8 genes would have been tandemly duplicated. However, because we could not identify reads that clearly covered the repeated regions, we excluded these two Mt genome assemblies from our dataset.

The sizes of the 20 considered assemblies ranged from 73.5 to 86.9 kb, which is close to the size reported previously (Wolters et al. 2015). The differences in size between the assemblies can mainly be attributed to the intron content of the COX1 and COB genes (from two to eight introns in COX1 and from two to six introns in COB). These variations lead to extensive gene length variability, ranging from 5.7 kb to 14.9 kb for COX1 and from 3.2 kb to 8.6 kb for COB, while the coding sequences of these two genes were exactly the same length among the 20 isolates. Intergenic regions also accumulate many small indels, including those that affect the interspersed GC-clusters, and a few large indels that sometimes correspond to variable hypothetical open reading frames (ORFs), leading to sizes that range from 51.6 to 58 kb. To a lesser extent, the 21S rRNA gene is also subject to size variation, ranging from 3.2 to 4.4 kb.

One of the major advantages of the Oxford Nanopore technology is the possibility of sequencing very long DNA fragments. In our analyses, we obtained 2D reads up to 75 kb in length, indicating that the system was able to read without interruption a flow of at least 150,000 nucleotides. Furthermore, the results of this analysis indicate that the error rate of the ONT R7.3 reads was in the range that is obtained using existing long-read technologies (i.e., about 15% for 2D reads). However, the errors are not random and they significantly impact stretches of the same nucleotide (homopolymers), which seems to be a feature inherent to the ONT sequencing technology. Because the pore detects six nucleotides at a time, segmentation of events is problematic in genomic regions with homopolymers longer than six bases (David et al. 2016). With the current R7.3 release, homopolymers are prone to base deletion (representing 66% of the errors observed in homopolymers). It may be improved with a steadier passing speed through the pore or by increasing the speed of the molecule through the pore. In the same way, the basecaller algorithm could be optimized to increase the accuracy [...] problematic for genome assembly because they lead to the construction of less accurate consensus sequences. Furthermore, indels negatively impact gene prediction because they can create frameshifts in the coding regions of genes. We concluded that nanopore-only assemblies are difficult to use for analysis at the gene level unless they are polished. However, polishing based only on nanopore reads was not sufficient: although it reduced the number of indels more than sevenfold, about 3,700 genes were still affected by potential frameshifts.
The recently developed R9 chemistry greatly improved the overall quality of the consensus sequences: starting with only 45x of 2D reads, we obtained an assembly with the same contiguity but with a decrease of nearly 30% in the number of indels (95,012 compared with 133,676). We expect that the ONT sequencing platform will evolve in the coming years to produce high-quality long reads. Until then, a mixed strategy using high-quality short reads remains the only way to obtain high-quality consensus sequences as well as a high level of contiguity. Indeed, for the assembly of repetitive regions, the nanopore-only assemblies outperformed the short-read assemblies.

Our benchmark of nanopore-only assemblers shows that, unfortunately, a single "best assembler" does not exist. Canu reconstructed the telomeric regions better and provided a consensus of higher quality than Miniasm and SMARTdenovo. ABruijn seemed to produce the most continuous assembly, but some of the contigs were chimeric. However, ABruijn was the only assembler to fully assemble the mitochondrial genome, which is why we chose it to assemble the Mt genomes of the 22 yeast strains. SMARTdenovo provided good overall results for repetitive regions, completeness, contiguity, and speed. It was the most appropriate choice to assemble the genomes of all the yeast strains, even though its major drawback was the absence of the Mt genome sequence from the contig output.

The high contiguity of the 22 nanopore-only assemblies allowed us to detect transposable element insertions and to provide a complete cartography of these elements. Ty1 was the most abundant element and it was spread across the entire genome. Chromosome 12 was always the most fragmented in our assemblies due to the presence of the rDNA cluster (around 100 copies in tandem).
Furthermore, we easily identified known translocations (between chromosomes 5 and 14 in the ABH isolate and between chromosomes 7 and 12 in the AVB isolate). The high contiguity of the assemblies seemed to be limited by the read size rather than the error rate. Work is still needed to prepare high-molecular-weight DNA enriched in long fragments. The yeast genomes were successfully assembled with 8 kb and 20 kb fragment-sized libraries, but more complex genomes will require longer reads.

After the Illumina sequencing, an in-house quality control process was applied to the reads that passed the Illumina quality filters. First, low-quality nucleotides (Q<20) were discarded from both ends of the reads. Next, Illumina sequencing adapter and primer sequences were removed from the reads. Then, reads shorter than 30 nucleotides after trimming were discarded. These trimming and removal steps were achieved using in-house-designed software based on the FastX package (FASTX-Toolkit). Finally, read pairs that mapped to the phage phiX genome were identified and discarded, using SOAP (Li et al. 2009b) and the phiX reference sequence (GenBank: NC_001422.1). This processing resulted in high-quality data and improved the subsequent analyses.
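
The end-trimming and length-filtering rules of this quality control process (trim Q<20 from both ends, discard reads shorter than 30 nucleotides) can be illustrated with a short sketch. It is not the in-house FastX-based software: it assumes the per-base Phred scores are already decoded into integers, and it omits the adapter removal and phiX filtering steps. The function name `trim_read` is hypothetical.

```python
def trim_read(qualities, min_quality=20, min_length=30):
    """Trim low-quality bases from both ends of a read.

    qualities: list of integer Phred scores, one per base.
    Returns the (start, end) slice to keep, or None when fewer than
    min_length bases remain after trimming.
    """
    start, end = 0, len(qualities)
    # Advance past leading bases below the quality threshold.
    while start < end and qualities[start] < min_quality:
        start += 1
    # Retreat past trailing bases below the quality threshold.
    while end > start and qualities[end - 1] < min_quality:
        end -= 1
    if end - start < min_length:
        return None
    return start, end
```

The returned slice would be applied to both the sequence and the quality string, for example `seq[start:end]`. Low-quality bases in the interior of the read are kept, since the rule only concerns the read ends.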

To determine which assembler to use on the 22 de novo sequenced yeast strains, tests were conducted on S288C, the only S. cerevisiae strain for which there is an established reference genome. We used different subsets of the reads as input to Canu (GitHub commit ae9eecc), Miniasm (GitHub commit 17d5bd1), SMARTdenovo (GitHub commit 61cf13d), and ABruijn (GitHub commit dc209ee), four assemblers that can take advantage of long reads. These subsets consisted of varying coverages of 1D reads, 2D reads, 2D pass reads (2D reads that have an average quality greater than nine), and reads corrected by Canu. Canu was executed with the following parameters: genomeSize=12m, minReadLength=5000, mhapSensitivity=high, corMhapSensitivity=high, errorRate=0.01 and corOutCoverage=500. Miniasm was run with the default parameters indicated on the GitHub website. SMARTdenovo was executed with the default parameters and -c 1 to run the consensus step. ABruijn was run with default parameters. After the assembly step, we polished each set of contigs with Pilon, using 300x of Illumina 2x250 bp paired-end reads. Assemblies were aligned to the S288C reference genome using Quast in conjunction with the GFF file of S288C to detect assembly errors, and complete and partial genes. We also visualized the alignments using mummerplot to detect chimeric contigs.
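As an illustration of the "2D pass" criterion mentioned above (average quality greater than nine), the following Python sketch classifies a read from its FASTQ quality string. It assumes that the average is the arithmetic mean of the per-base Phred scores and that qualities use the Phred+33 encoding; the exact definition used by the ONT base-calling pipeline is not given in the text, and the function names are hypothetical.

```python
def mean_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores encoded in a FASTQ quality string.

    Assumes Phred+33 encoding; returns 0.0 for an empty string.
    """
    if not quality_string:
        return 0.0
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def is_2d_pass(quality_string, threshold=9.0):
    """True if the mean quality is strictly greater than the threshold."""
    return mean_quality(quality_string) > threshold


# Hypothetical quality strings: '+' encodes Q10 and '(' encodes Q7.
examples = {"++++": is_2d_pass("++++"), "((((": is_2d_pass("((((")}
```

A read whose mean quality is exactly nine is not counted as "pass" here, since the text states "greater than nine".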

Gene and transposon detection
To detect genes and transposons in the assemblies, we extracted the corresponding sequences from the reference genome. We then mapped these elements to the assemblies using the Last aligner. Only alignments that showed more than 80% identity over at least 90% of the sequence length were retained and considered as a match. We used a similar procedure to count the maximum number of genes in the Nanopore reads dataset; the only modification was that the percentage identity had to be at least 70% to account for the high error rate of the reads. To estimate the number of copies in the Illumina reads, we aligned paired-end reads to the reference genome with BWA aln, computed the coverage using the samtools mpileup algorithm (Li et al. 2009a), and divided the value obtained for each region of interest by the median coverage of the corresponding chromosome.
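The two criteria above can be sketched in Python as follows. This is only an illustration under stated assumptions: identity is taken as matching bases divided by aligned length, the fraction of the sequence covered is taken as aligned length divided by the length of the gene or transposon, and the per-region coverage value is taken as the mean depth of the region (the text does not say which summary was used). The function names are hypothetical and no Last or samtools output is parsed.

```python
from statistics import median


def is_match(matching_bases, aligned_length, feature_length,
             min_identity=0.80, min_fraction=0.90):
    """Apply the 'more than 80% identity over at least 90% of the length' rule."""
    if aligned_length <= 0 or feature_length <= 0:
        return False
    identity = matching_bases / aligned_length      # assumed definition
    fraction = aligned_length / feature_length      # assumed definition
    return identity > min_identity and fraction >= min_fraction


def estimate_copy_number(region_depths, chromosome_depths):
    """Region depth (mean, an assumption) divided by the chromosome median depth."""
    region_value = sum(region_depths) / len(region_depths)
    return region_value / median(chromosome_depths)


# Hypothetical values: a 1,000 bp element aligned over 950 bp with 900 matches,
# and a region sequenced at twice the typical depth of its chromosome.
kept = is_match(matching_bases=900, aligned_length=950, feature_length=1000)
copies = estimate_copy_number([200, 200, 200], [100, 100, 100, 200, 100])
```

With these toy inputs the alignment is retained and the region is estimated at two copies.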

Feature number estimation
We generated an Illumina-only assembly using SPAdes v3.7.0 with default parameters and compared the completeness of this assembly to that of the nanopore-only assemblies. To estimate the number of features across all S288C assemblies, we aligned each post-polishing consensus sequence to the S288C reference genome using NUCmer. Only the best alignments were conserved by using the delta-filter -1 command. Next, we used the bedtools suite (Quinlan and Hall 2010) with the command bedtools intersect -u -wa -f 0.99 to compare the alignments to the reference GFF file. Finally, we counted the number of features of interest.
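The effect of the -f 0.99 overlap threshold can be illustrated with a short Python sketch. It assumes that the GFF features play the role of the first (-a) input, so that a feature is counted once if at least 99% of its length is covered by a single alignment interval; the text does not state which file was given as -a. Intervals are assumed to be half-open and on the same chromosome, and the function names are hypothetical.

```python
def overlap_fraction(feature, alignment):
    """Fraction of the feature covered by one alignment (half-open intervals)."""
    f_start, f_end = feature
    a_start, a_end = alignment
    overlap = min(f_end, a_end) - max(f_start, a_start)
    return max(0, overlap) / (f_end - f_start)


def count_covered_features(features, alignments, min_fraction=0.99):
    """Count features covered at >= min_fraction by at least one alignment."""
    return sum(
        1
        for feature in features
        if any(overlap_fraction(feature, aln) >= min_fraction for aln in alignments)
    )


# Hypothetical coordinates: the second feature is only half covered.
features = [(0, 100), (200, 300), (400, 500)]
alignments = [(0, 150), (250, 600)]
n_features = count_covered_features(features, alignments)
```

Each feature is reported at most once, which mirrors the behavior of the -u option.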

Circularization of mitochondrial genomes
To circularize the Mt genomes, we split the contig corresponding to the Mt sequence in each strain into two distinct contigs. Then, we gave the two contigs as input to the minimus2 tool (Schatz et al. 2013) from the AMOS package. As a result, we obtained a single contig that did not contain the overlap corresponding to the circularization zone. Finally, to start the Mt sequence of all isolates at the same position as the reference, we mapped each Mt sequence to the reference using NUCmer.

[...] 300x of 2x250 bp Illumina reads as input to Pilon. The resulting corrected assembly was then aligned to the S288C reference genome using Quast.
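For the circularization of the Mt genomes described in this section, the two string operations involved can be sketched in Python. This is a simplified illustration, not the authors' procedure: the split point is assumed to be the midpoint of the contig, the merge itself (done with minimus2 in the text) is not reproduced, and the new start coordinate is assumed to have been read from the NUCmer alignment to the reference. The function names are hypothetical.

```python
def split_contig(sequence):
    """Split a linear contig at its midpoint (assumed split point).

    The redundant ends of a circular molecule then sit at the end of the
    second piece and the start of the first, where an overlap-based merger
    can join them.
    """
    mid = len(sequence) // 2
    return sequence[:mid], sequence[mid:]


def rotate_circular(sequence, new_start):
    """Rotate a circular sequence so that it begins at index new_start."""
    if not sequence:
        return sequence
    new_start %= len(sequence)
    return sequence[new_start:] + sequence[:new_start]


# Hypothetical toy sequence and start coordinate.
first_half, second_half = split_contig("ACGTTGCA")
rotated = rotate_circular("ACGTTGCA", 3)
```

Rotation leaves the circular molecule unchanged and only moves the position at which its linear representation starts.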