Long-read only assembly of Drechmeria coniospora genomes reveals widespread chromosome plasticity and illustrates the limitations of current nanopore methods

Abstract Background Long-read sequencing is increasingly being used to determine eukaryotic genomes. We used nanopore technology to generate chromosome-level assemblies for 3 different strains of Drechmeria coniospora, a nematophagous fungus used extensively in the study of innate immunity in Caenorhabditis elegans. Results One natural geographical isolate demonstrated high stability over decades, whereas a second isolate not only had a profoundly altered genome structure but exhibited extensive instability. We conducted an in-depth analysis of sequence errors within the 3 genomes and established that even with state-of-the-art tools, nanopore methods alone are insufficient to generate eukaryotic genome sequences of sufficient accuracy to merit inclusion in public databases. Conclusions Although nanopore long-read sequencing is not accurate enough to produce publishable eukaryotic genomes, in our case, it has revealed new information about genome plasticity in D. coniospora and provided a backbone that will permit future detailed study to characterize gene evolution in this important model fungal pathogen.


Background
Drechmeria coniospora (NCBI:txid98403) is an obligate parasitic fungus belonging to the order of Hypocreales. This fungus forms spores that adhere to the cuticle of a range of different nematodes to infect them [1]. We adopted D. coniospora strain ATCC 96-282, derived from a strain isolated in Sweden, as a model pathogen for Caenorhabditis elegans 20 years ago [2]. We have cultured this strain, referred to here as Swe1, continuously since then, using it to understand innate immune mechanisms in its nematode host [3,4].
As part of our characterization of the interaction between D. coniospora and C. elegans, in 2013, we extracted DNA from our laboratory strain of the time (referred to here as Swe2) and deter- mined its genome. Despite attempts to complete the assembly, the Swe2 genome remained fragmented, with an N50 of 3.86 Mb [5]. In addition to the genome of Swe2, a second D. coniospora genome is available (referred to here as Dan2) [6], derived from a strain related to a Danish isolate (Dan1; Fig. 1). Although corresponding to a chromosome-level assembly, this latter genome still contains large stretches (up to 500 kb) of undetermined sequence. In this study, we used Oxford Nanopore Technology (ONT) long-read sequencing to assemble complete fungal genomes. This revealed that the 2 isolates (Swe1 and Dan1) display strikingly different levels of genomic stability. We provide a detailed analysis that illustrates the continuing challenges to using only ONT long-read sequencing for genome assembly. Because the genome sequences were of insufficient quality to allow accurate gene prediction, we polished the genomes using short DNA reads to generate high-quality sequences, providing a resource for future comparative studies.

Results
An all-against-all in silico genome comparison of the 2 publicly available D. coniospora genome sequences, for Dan2 [6] and Swe2 [5], indicated the presence of extensive genomic rearrangements ( Fig. 2A). These could reflect real differences or assembly errors in 1 or both genomes. We directly confirmed 1 major rearrangement by PCR ( Fig. 2B and C), suggesting that the differences could be real. To characterize this genomic plasticity, we determined the genomes of 3 strains related to the 2 that had been sequenced previously (Fig. 1). We used ONT nanopore sequencing to generate long reads and current assembly tools to construct chromosome-level assemblies for all 3 strains ( Supplementary  Fig. S1, Supplementary Table S1). Manual curation allowed complete ∼30 kb mitochondrial genomes to be predicted from the assemblies generated by Canu [7]. All 3 nuclear genomes were divided in 3 similarly sized chromosomes, an unusual arrangement for such a fungus, as previously noted by Zhang et al. for Dan2 [6]. For the 2 strains related to Swe2, there was almost complete synteny of their nuclear Circos plot representing regions >6 kb that are very similar between Dan2 (left, olive) and Swe2 (right, light blue) assemblies as determined by an all-againstall LAST analysis. Swe2 contig numbers are the last 2 digits of the accession ID (shown in B), preceding the suffix. Red and dark blue rectangles represent rearrangement junctions probed by PCR. B. Conceptual design of the PCR primers. C. Amplicons from the PCR were visualized after electrophoresis. Each pair gave 1 specific band of the expected size. The colour code is the same for the 3 panels.
genomes. Inspection of the 1 anomalous region in Swe1 where synteny broke down revealed that it was supported by only 1 long (215 kb) read and corresponded to a local discontinuity in the read coverage, as well as a break in the alignment between Canu-generated contigs and unitigs. All these factors indicated that this was an assembly artefact with a contig misassembled on the basis of an individual very long chimeric read (Supplementary Fig. S2). The same was true for the distinct unique nonsyntenic region of the Swe3 assembly ( Supplementary Fig. S3).
These were exceptional cases because almost all chimeric reads were identified and either trimmed or excluded from the assembly process by Canu ( Supplementary Figs S4 and S5). An in-depth analysis of the Swe1 chimeric reads revealed that a large proportion were in fact the consequence of sequencing errors. In almost 40% of cases (1,010 of 2,566), the 2 regions flanking the presumptive site of chimerism mapped to within 50 nucleotides of each other on the corresponding single scaffold. There was no discernible pattern to the distribution of this interval in the remaining candidate chimeric reads (Supplementary Fig. S6A and B), nor were there any regions that were more likely to be the site of chimeric junctions ( Supplementary Fig. S6C).
Notably the single chimeric read that escaped censoring, leading to a misassembly of Swe1, was not identified by the dedicated tool YACRD but was flagged as anomalous in reads recalled by Guppy (see Methods). This is an indication of the continuing improvement to base-calling tools. Also, these specific Swe1 and Swe3 misassemblies were absent from the corresponding chromosome assemblies produced by the de novo assembler Flye [8] (Fig. 3A). This latter, however, introduced other assembly artefacts, including an erroneous fusion of contigs for the Dan1 assembly. This could not be ascribed to the inclusion of chimeric reads but rather seemed to result from the incorrect treatment of repeat sequences, including telomeric repeats at the extremity of 1 of the fused contigs ( Fig. 3B-D). These results illustrate the interest of using >1 tool to aid in genome assembly. Therefore, starting with the Canu-generated sequences, we manually corrected anomalous regions and thereby produced assemblies for Swe1 and Swe3 that were entirely collinear (Fig. 4A). These 2 genomes have 3 large chromosomes (8.5, 11.6, and 11.6 Mb), each with identifiable telomeric [9] and centromeric regions, indicating that the overall genome structure has remained constant over 20 years of laboratory culture. This allowed us then to use the Swe1 sequence to scaffold the fragmented Swe2 genome (Fig. 4B). To our great satisfaction, we were able to produce an entirely collinear chromosome-scale assembly. Thus, it appears that there were no assembly errors in the published Swe2 genome; it was simply incompletely scaffolded. This applies equally to the genomic regions containing copies of some mitochondrial genes that we previously suggested might indicate assembly errors [5]. They were revealed to be accurate; D. coniospora has nuclear paralogous copies of 10 mitochondrial protein-coding and 15 transfer RNA genes (so-called "numts" [nuclear sequences of mitochondrial origin] [10]). These results give further support to the existence of long-term stability of the genome of the Swe2-related strains. A whole-genome comparison between Swe1 and Dan2, however, revealed multiple and extensive genome rearrangements, involving intra-and interchromosomal translocations and inversions (Fig. 4C).
Using the same strategy described above, we assembled and polished the Dan1 genome to give chromosome-level sequences. When we compared Dan1 and Dan2, we were surprised to find 2 major events of reciprocal exchange of chromosome ends, and an intra-chromosomal inversion (Fig. 4D). These events were supported in a coherent and consistent manner by all the available data ( Supplementary Fig. S7). In other fungal species, such chromosomal rearrangements have been reported to be the result of ectopic recombination between non-allelic homologous sequences, including repeated DNA elements [11,12]. A search of the 50kb regions flanking each break point for transposable elements [13] and repetitive DNA families [14] failed to reveal any significant repeat sequence signature (see Supplementary Methods). Because the Dan2 assembly is of high confidence, supported by long reads and optical mapping [6], given the short time of in vitro culture that separates it from Dan1, this suggests that the genome of the Dan1 isolate is not stable.
In alignments of the sequence of Swe1, generated using only nanopore reads, with that of Swe2, there were stretches of complete nucleotide identity extending over >25 kb. This is a testament to the general reliability of nanopore sequencing. We therefore identified the complete set of proteins identical in Swe2 and Dan2 corresponding to single-copy, single-exon genes (see Methods). These would be expected to be present in the newly assembled Swe1, Swe3, and Dan1 genomes. Indeed, using these 305 genes as a query, we could identify homologous sequences for each in all 3 genomes. Fewer than onesixth of the corresponding genes, however, were predicted to encode full-length proteins in any of the 3 new genomes (Fig. 5A). While nanopore reads are very useful for genome assembly, they are hindered by a high error rate, especially in homopolymer stretches. Sequence quality can be improved using polishing tools that aim to ameliorate consensus sequences generally by going back to raw reads and applying integrative algorithms [15]. In our case, applying current best practices, while providing a very substantial improvement (up to 5-fold in the best case), did not take the prediction level beyond 82% accuracy. The quality of the prediction seen with the Dan1 genome was strikingly lower than that of the other 2 genomes (Fig. 5A, Supplementary Table S1).
Inspection suggested that the majority of errors were in homopolymer sequences, as expected, with nucleotide insertions and deletions leading to alterations of the reading frame. To investigate this poor homopolymer predictive performance systematically, we computed the number of G/C or A/T homopolymer stretches of ≥4 nucleotides for each of the 305 genes. We plotted these values, indicating the proportion of genes that encoded the expected full-length predicted protein for each of the 3 genomes. While there was the expected inverse relationship between accuracy and the number of homopolymer stretches, there were striking exceptions. Curiously some of these exceptions were specific to a single genome ( Fig. 5B-D). Furthermore, and unexpectedly, polishing introduced more nucleotide insertion errors than deletions, frequently on the basis of tenuous read support. Overall, however, there was no obvious pattern to explain why errors were introduced, given the underlying reads used to build the consensus sequence (Supplementary Table S2).
During the inspection of the assembled and polished genomes, we found 2 other types of anomalies. The first concerned the regions flanking the nuclear genomic copies of mitochondrial genes (numts), where polishing added short extraneous low-complexity sequences (mean length 15 nt, mainly As or Ts), for which, surprisingly, there was no sequence support from the reads used by the assembler (Fig. 6A). This probably arose because of the very high nucleotide similarity between regions of the nuclear and mitochondrial genomes that extended across >25 kb, including a repeat of 9.8 kb (Supplementary Fig. S8A and B). Notably, despite using high-coverage ONT long reads, we could not establish with absolute certainty the precise copy number for the unit sequence in the Swe genomes (Supplementary Fig. S8B).
In the second case, for the Swe3 genome, a large (∼10 kb) region, with a complex sequence, well supported by the Canucorrected and trimmed reads, was inexplicably excluded from the initial Canu assembly and only imprecisely restored by polishing ( Fig. 6B and C, Supplementary Fig. S8C). Here, while there was no evidence for repeated DNA elements on both sides of the point of sequence discontinuity, there was a single such 1.2 kb duplication (Supplementary Fig. S8C and D). These few regions were identified because of discontinuities in the depth of read coverage, which otherwise was remarkably constant across the complete genomes. With the resolution of these assembly errors, we were able to generate complete genomes of high overall structural quality using ONT long reads only.
As explained above, however, these assemblies were not of sufficient sequence quality to allow accurate gene prediction. Therefore, to extend our analysis, we used Illumina sequencing to generate very deep short-read coverage for the Swe1, Swe3, and Dan1 genomes. This allowed high-quality final sequences to be generated for all 3 strains. While short-read-based polishing did not alter the global structure, it allowed homopolymer length errors to be corrected and the generation of entirely contiguous chromosome sequences (Supplementary Table S1).
To confirm the correctness of the short-read polished assemblies, we returned to our 305 single-copy orthologues. After the short-read polishing, all 305 genes could be identified in each of   Table S1). We also benchmarked our successive assemblies using BUSCO that searches for a set of universal single-copy orthologues (USCOs) by sequence similarity. While the initial genome assemblies gave low scores, with ∼65% of complete USCOs and 35% fragmented or missing (Table 1), after long-read polishing the score for complete USCOs increased up to as high as 97%. Given the demonstrably low quality of the genome sequences (Fig. 5), we investigated the basis of this disparity. We identified among the USCOs those that corresponded to single-exon genes in the Dan2 and Swe2 reference genomes. These genes were then used as queries for high-stringency searches of the Dan1, Swe1, and Swe3 genomes at successive steps of assembly and polishing and the results compared with the results of the corresponding BUSCO analysis.   While BUSCO gave no false-negative results, it gave a large number of false-positive results, except in the analysis of the shortread polished genomes (Fig. 7A). These arose because BUSCO was not sufficiently sensitive to the presence of short indels. As an example, the Swe1 gene corresponding to RJ55 06485 had the expected sequence after short-read polishing. Two errors in homopolymer sequences led to 2 frameshifts in the unpolished assembly. One of these was corrected by long-read polishing, but for the other there was an overcompensation, leading to a dif-ferent frameshift (Fig. 7B). In both assemblies, these errors were compatible with open reading frames that collectively reconstituted a close ortholog of RJ55 06485 leading to the erroneous BUSCO result. As discussed below, this analysis highlights the fact that BUSCO scores based on sequence alignments are not an appropriate measure for ONT-only eukaryotic genomes. The BUSCO score increased to nearly 99% after the short-read polishing. In this case, the figures accurately reflect genome completeness and quality (Fig. 7A). These figures are comparable to those for the previous Dan2 and Swe2 assemblies. The new Swe1, Swe3, and Dan1 genomes therefore represent the starting point for future detailed analysis to characterize the molecular evolution of D. coniospora.

Discussion
Previous genome assemblies for D. coniospora required a combination of sequencing approaches [5,6]. Here, using only long reads and Canu, we produced the first complete circular mitochondrial genome for D. coniospora and were able to generate chromosome-scale assemblies for the nuclear genome. The rare misassembled contigs, formed by Canu because of single very long chimeric reads, as previously described [16], could be detected by read coverage anomalies and comparisons with unitigs, suggesting that solutions to avoid their creation could be implemented within Canu. The majority of reads that were flagged as chimeric arose from sequencing or polishing errors. They reflected a short (<50 bp) discrepancy between the individual reads and the final sequence. There was no indication of any sequence bias at the break points of the remaining chimeric reads, supporting the notion that these reads arise from excessively rapid reloading of the sequencing pore [17]. The use of other genome assembly tools, and the comparison of assembly discrepancies, is an additional method to produce high-confidence genomes. Here, we used Flye, which for these genomes required run times that were 10-fold shorter than Canu. A comparison of the assemblies highlighted ambiguous regions in the genome that could then be resolved by manual inspection. On the other hand, Flye was confounded by telomeric repeats. Because telomeres can be identified on the basis of their sequence, there is also clear room for algorithmic improvement to Flye through the explicit definition of chromosome ends.
One clear and well-established advantage of using long reads is the possibility of resolving very extended stretches of complex tandem repeats (VeCTRs) [18] and other repetitive sequences including centromeres. These correspond to most of the breaks in the continuity of the published Swe2 genome. In addition to acrocentric regional centromeres, Zhang et al. reported the presence of a vestigial centromere from a putative chromosomal fusion event [6]. These were also found in the fully assembled Swe1 and Swe3 genomes, indicating that chromosomal fusions were present in the common ancestor of the Swe1 and Dan1 strains.
For Swe1, Swe3, and Dan1 we were able to reconstruct complete mitochondrial genomes, with features typical of fungi of the order Hypocreales. On the other hand, unlike Dan1 (and Dan2), the nuclear genomes of Swe1 and its derivatives Swe2 and Swe3 contained different numbers of copies of sequence very similar to parts of their own mitochondrial DNA. This type of event, and more generally repeated regions with long and nearly identical sequences, are more readily detectable with long reads [19] and are particularly challenging for polishing even with short reads [20].
The duplication of mitochondrial genes in the nuclear genome has been described in other fungal genomes [10] and must have occurred after the divergence of Dan1 and Swe1. Despite this genome plasticity, even after 20 years of continuous laboratory culture the Swe1 and Swe3 genomes were entirely collinear. This contrasts with the rearrangements seen between the Dan1 and Dan2 genomes, which in principle should be from strains that have had little opportunity to diverge (L. Castrillo, Curator, ARS Collection of Entomopathogenic Fungal Cultures, personal communication). It will be interesting in the future to characterize the reasons for the marked difference in genomic stability between Dan1 and Swe1.
The accuracy of ONT long-read sequencing is increasing because of improvements in the chemistry used and signal detection, as well as base-calling [21]. Despite good read depth, however, our assemblies were not of sufficient quality at the nucleotide level to allow accurate gene prediction. Furthermore, we noted that although polishing using only long reads dramatically increased overall sequence accuracy, it introduced errors around the numts. Similar errors during polishing of nearidentical sequences have been noted in ONT-based metagenomic studies [22]. Despite these limitations, research groups are publishing and submitting to public sequence databases genomes for fungi, plants, and animals based on nanopore sequencing alone (86 for eukaryotes in addition to the 134 bacterial genomes in "Assembly" from GenBank release 236 from 15 February 2020). This is problematic because low-quality genome sequences compromise the accuracy of sequence similarity searches in public databases. On the basis of our results, a reanalysis of the completeness of these "nanopore-only" genomes is merited, to confirm that they are indeed of low quality. Similar concerns do not apply to fungal genomes assembled using only long reads generated with Pacific Biosciences technology [23] because these are not hindered by the intrinsic problem of homopolymer length errors that we found to be the most significant quality barrier when using ONT reads. On the basis of our detailed analysis and in line with the consensus regarding de novo assembly with ONT long reads (e.g., [24]), we polished our 3 assemblies with short reads. This greatly improved their quality.
Regarding the homopolymer sequence errors, as noted above, they were not consistent across the sequenced genomes; even between Swe1 and Swe3 there were instances of widely differing rates of errors in orthologous genes, despite very similar underlying reads. Indeed there was no clear pattern in the inaccuracies, which will render bioinformatic approaches to remedy this problem more difficult. On the other hand, the errors were more often overprediction of homopolymer length, despite having a majority of reads supporting the correct sequence. It is possible that polishing tools have not kept pace with improvements in base-calling, leading to an overcompensation in the inference of homopolymer length.
It is standard practice to check the completeness of de novo genome assemblies with a strategy based on the detection of predicted groups of conserved orthologous proteins. One popular and much-cited tool is BUSCO [25], which was developed before ONT-based sequencing became prevalent. Because BUSCO relies on in silico translation, small indels can be overlooked because the resulting virtual sequence can be recapitulated despite a frameshift. This explains the disparity between the BUSCO results and our own analyses that were deliberately restricted to mono-exonic genes. Contrary to BUSCO, our analysis indicated that roughly one-fifth of the genes after long-read polishing had an incorrect sequence. Current BUSCO-type approaches, based on sequence similarity and not excluding genes with improbably short introns, cannot be used as a quality metric for ONT-only assemblies and are appropriate only after shortread correction.

Conclusions
Nanopore long-read sequencing provides a powerful way to assemble complex genomes with limited manual curation but still falls short of the quality required to produce publishable eukary-otic genomes. In our case, it has revealed new information about genome plasticity in D. coniospora and provided a backbone that will permit future detailed study to characterize gene evolution in this important model fungal pathogen.

DNA extraction
D. coniospora spores were cultured in liquid NGMY medium [26] at 37 • C for 5 days. Fungal DNA was extracted according to a published protocol (from p. 13 onwards of [27]) [28], with the following modifications: instead of centrifugation to collect DNA after precipitation with isopropanol, we recovered the DNA filaments with a glass hook, washed and dried them as described [29], and resuspended the DNA without agitation in Tris-EDTA buffer.

Illumina sequencing library preparations
The same DNA samples were used to prepare paired-end libraries with insert size of ∼680 bp, following the manufacturer's instructions for the kit NEBNext R Ultra TM II DNA (New England Biolabs Inc., Ipswich, MA, USA). The libraries were sequenced using an Illumina NextSeq500 system (Illumina NextSeq 500, RRID:SCR 014983) (serial No. NB501764).

Whole-genome alignments
Genomes were aligned using LAST v979 (LAST, RRID:SCR 006119) [32]. A database was first generated (last-db -cR01), and then lastal and last-dotplot with default parameters were used to generate, respectively, an alignment file and a dot-plot. For the circular visualization of genome alignments, we used the command lastal with -f BlastTab parameter, then parsed the alignment to filter out short alignments and generate the links file needed by Circos (Circos, RRID:SCR 011798) [33].
For the manual curation of the assemblies, we generated whole assembly alignments and dot-plots of Swe1, Swe2, and Swe3 two by two. For Swe1 and Swe3, Canu contigs were ordered by synthesizing the results from the 3 possible all-against-all alignments. To confirm a link between 2 contigs, we used the following strategy: when a contig of the Swe1 assembly spanned 2 contigs of Swe3, long reads of Swe1 present in this spanning area were extracted from the Swe1 corrected and trimmed reads provided by Canu. Then this set of reads was mapped on Swe2 and Swe3 assemblies. The 2 targeted contigs of Swe3 were considered "linked" if different parts of several unique reads mapped on the 2 Swe3 contigs' ends. If the reads that supported the link had different mapping orientation (forward or reverse), 1 contig was complemented before the last step (see Solving links between contigs) to ensure a correct orientation of the final chromosome.
To guide correct assembly, we also searched for centromeres in the contigs. They were identified as highly duplicated regions in the all-against-all alignment dot-plots produced by LAST. The identification of the repeated canonical telomeric sequence (TTAGGG) n [9] and its reverse complement (CCCTAA) n at the beginning or end of certain contigs allowed the identification of chromosome ends. The Dan1 assembly was manually curated using a similar strategy with the Dan2 genome as a reference.

Solving links between contigs
Overlaps between linked contigs were identified by a BLASTN (NCBI BLAST, RRID:SCR 004870) [38] alignment of their last 100 kb. Any duplicate sequence was trimmed out from 1 contig and both contigs were joined. The inferred junction was then validated by verification of the underlying read support. For the linked contigs that did not overlap, the sequence in the gap was extrapolated from the reads that matched and extended the ends of contigs, on the basis of alignments at the last 1 kb of each contig. These sequences were aligned with MAFFT v7.427 (MAFFT, RRID:SCR 011811) [39]. The alignment was visualized with SeaView (SeaView, RRID:SCR 015059) [40], and only the portion of the alignment strictly between the 2 contigs sequences was kept. SeaView also generated a consensus sequence (on the basis of 60% sequence identify by default). The resulting sequence was inserted between the 2 contigs to link them and the supposed continuity verified by a further cycle of read mapping.

Mitochondrial genome circularization
Canu assembles small circular elements as contigs with tandem duplications of the element. We resolved the mitochondrial genomes as recommended by Canu's authors [7]. MUMmer suite v4.0.0.beta2 (MUMmer, RRID:SCR 018171) [42] was used to align the contig identified as the putative mitochondria on itself with NUCmer and parameters -maxmatch -nosimplify. Coordinates of a full copy were identified with the show-coords command and -lrcT parameters.

PCR
PCR was carried out to test a genome rearrangement between Swe2 and Dan2 genomes, with primers P1F (GAGATATC-GAACGTCGCATGG), P1R (ACATCAAGCCTTTGTCGAGGA), and P3F (GCTCAGGACCGACGTACAAG). PCR reactions were run according to the GoTaq R G2 Flexi DNA polymerase instructions (Promega, Madison, WI, USA), with 50 ng of template DNA and 1 mM of each forward and reverse primer, in a final volume of 25 μL. The reaction started by initial denaturation at 95 • C for 2 min, followed by 30 amplification cycles (95 • C for 30 sec, 60 • C for 30 sec, and 72 • C for 30 sec), and a final elongation for 5 min at 72 • C.

Defining a set of 305 identical proteins
Identical proteins shared by the 2 D. coniospora genomes available (Swe2 and Dan2) were recovered using a reciprocal best BLAST [38] hit strategy on the 2 proteomes. Proteins that were duplicated in 1 or both genomes were filtered out. The set was further refined by only retaining proteins corresponding to mono-exonic genes.

Assessment of gene sequence in ONT-only assemblies
TBLASTN searches were run using the amino acid sequence of the set of 305 identical proteins against the different nanoporeonly assemblies. A gene was considered correct if the query coverage, i.e., the ratio of alignment length to the query length, was equal to 1.

Assessing the genome integrity
The genome integrity was assessed with BUSCO v3.1.0 (BUSCO, RRID:SCR 015008) and the curated set ascomycota odb9 version 2016-02-13 [25]. A BLASTP search enabled Swe2 monoexonic genes present among USCOs to be identified. This list of 219 Swe2 genes was then used as a TBLASTN query against the different assemblies of Swe1 and Swe3. A gene was considered correct when it matched the corresponding Swe2 gene perfectly in length. An analogous analysis was carried out for Dan1, on the basis of the 273 Dan2 monoexonic genes that are USCOs.

Characterization of chimeric reads
Swe1 reads identified as chimeric by YACRD were aligned on the final (short-read polished) Swe1 assembly. The main alignment was identified using samtools view -F 2308. The CIGAR string was then parsed to determine whether the longest residual part of the read was 5 or 3 to the main alignment, thereby giving an orientation to the putative chimeric read and localizing the potential chimeric break point. The 500 bp of sequence 5 and 3 of this point were extracted and individually mapped back on the Swe1 final assembly and the number of unique reads in a 10-kb non-overlapping sliding window was calculated. For the reads for which both 500 bp fragments mapped on the same chromosome, the smallest distance between the 2 fragments was calculated.

Availability of Supporting Data and Materials
Genomes of the strains Swe1, Swe3, and Dan1 are available on our institute website [46]. All supporting data can be accessed at the GigaScience GigaDB database [47]. The reads used in this work can be found at the European Nucleotide Archive (ENA) under the study numbers PRJEB35969, PRJEB35970, and PRJEB35971. The raw signal runs are available under the accessions ERR3774158, ERR3774162, and ERR3774163; the FASTQ files of base-called reads (Guppy v3.0.3) are available under the accessions ERR3997391, ERR3997394, and ERR3997483; the FASTQ files of Illumina paired-end reads are available under the accessions ERR3997389, ERR3997392, and ERR3997395. Accession numbers are given in the order Swe1, Swe3, and Dan1.

Additional Files
Supplementary Figure Table 2. Read support, in Swe1, Swe3, and Dan1 assemblies, for the predicted correct sequence for each homopolymer stretch in the genes corresponding to 10 of the 305 candidate identical proteins. Supplementary Methods. Additional methodological details.