Pushing the limits of de novo genome assembly for complex prokaryotic genomes harboring very long, near identical repeats

Abstract Generating a complete, de novo genome assembly for prokaryotes is often considered a solved problem. However, we here show that Pseudomonas koreensis P19E3 harbors multiple, near identical repeat pairs up to 70 kilobase pairs in length, which contained several genes that may confer fitness advantages to the strain. Its complex genome, which also included a variable shufflon region, could not be de novo assembled with long reads produced by Pacific Biosciences’ technology, but required very long reads from Oxford Nanopore Technologies. Importantly, a repeat analysis, whose results we release for over 9600 prokaryotes, indicated that very complex bacterial genomes represent a general phenomenon beyond Pseudomonas. Roughly 10% of 9331 complete bacterial and a handful of 293 complete archaeal genomes represented this ‘dark matter’ for de novo genome assembly of prokaryotes. Several of these ‘dark matter’ genome assemblies contained repeats far beyond the resolution of the sequencing technology employed and likely contain errors, other genomes were closed employing labor-intense steps like cosmid libraries, primer walking or optical mapping. Using very long sequencing reads in combination with assembly algorithms capable of resolving long, near identical repeats will bring most prokaryotic genomes within reach of fast and complete de novo genome assembly.


INTRODUCTION
The enormous pace in next generation sequencing (NGS) technology development (1) has led to an exponential increase in the number of publicly available, complete prokaryotic genome assemblies (2). Due to their lower complexity, the genome assembly problem, mainly caused by the presence of long repeats, had been considered solved for prokaryotes. However, despite advances from Pacific Biosciences (PacBio) and more recently Oxford Nanopore Technologies (ONT) to sequence very long reads (>15 kb and well beyond) which allow de novo bacterial genome assembly (3), the percentage of complete genomes is still low compared to the large number of fragmented assemblies based on Illumina short reads (2,4), most of which remain at a permanent draft stage. This can in part be attributed to a considerable lag between the development of a new technology and its broader adoption, and to the higher costs of PacBio and ONT. In addition, the higher error rate of these technologies raised initial concerns. However, over the past years both theoretical work (5) and several detailed studies have shown that the error of these long read technologies is random. It can thus be removed with suitable computational methods both for PacBio data (6) and for ONT data (3,7), provided that a sufficient genome coverage is available. The fragmented Illumina-based genome assemblies can represent serious limitations for follow up analyses (8)(9)(10). We and many others could recently illustrate the benefits of relying on complete genomes, e.g., for comparative genomics, where a study of repeat-rich genomes indicated that even core genes could be missed on top of accessory genes, and for metagenomic studies, which benefited from the availability of complete genomes as a larger fraction of reads could be assigned down to the strain level (11). Moreover, complete genome sequences represent the optimal basis for accurate genome annotation (12), to create accurate, genome-scale metabolic models (13), to track the spread of mobile genetic elements including those carrying antimicrobial resistance genes (4), to explore patterns for the emergence of drug resistance (14) and as the optimal starting point for functional genomics studies (8). The development of sophisticated assembly algorithms ( (6,15) and bioRxiv: https://doi.org/10.1101/247148) has and will continue to increase the percentage of complete genomes. Several large genome sequencing initiatives for prokaryotes will benefit from these advances. These include the Genomic Encyclopedia of Bacteria and Archaea (GEBA) initiative, which aims to sequence the type strain of ∼11,000 species with a validly published name (16) and the NCTC 3000 project, which aims to completely sequence reference bacteria and viruses of public health importance (http://www. sanger.ac.uk/resources/downloads/bacteria/nctc/). Furthermore, efforts to isolate individual strains from the complex soil or rhizosphere microbiomes which synthesize urgently needed, novel antibiotics (17) or carry out promising biological functions, e.g., the protection of crops from pathogens (18,19), are gaining momentum. The genus Pseudomonas, which belongs to the gammaproteobacteria, is of particular interest as it includes several pathogens like Pseudomonas aeruginosa and Pseudomonas syringae, but also many species with a potential use as biocontrol agents (18,19) or for bio-remediation (20). P. koreensis was first isolated from Korean agricultural soil (21). It too exhibited antagonistic activities: a biosurfactant was active both against Phytium ultimum in a tomato model system (22) and against Phytophthora infestans infecting potato plants (23). Another P. koreensis strain, CRS05-R05, was isolated from the rice rhizosphere and showed biocontrol activity against the rice weevil Sitophilus oryzae (24), a pest of stored foods.
Challenged by the complexity of the Pseudomonas koreensis isolate (P19E3) genome, we here push the limits of de novo genome assembly for a very complex bacterial genome. We exclusively relied on sequencing data without laborintense steps like cosmid or BAC libraries, optical mapping or primer-walking that otherwise would be required to scaffold and close gaps of highly complex prokaryotic genomes. The long, near identical repeats of strain P19E3 could only be resolved by very long ONT reads; in addition, the assembly was further complicated by a variable shufflon region. Importantly, the analysis of the repeat complexity of 9331 complete bacterial and 293 archaeal genomes, which we make publicly available here, established that ∼10% of genomes represented this 'dark matter' of prokaryotic de novo genome assembly (25), i.e. genomes that are particularly difficult to assemble, either due to the presence of several hundred repeats (ca. 7%), or very long, highly similar repeats (ca. 3%).

Repeat analysis for genus Pseudomonas and all completely sequenced prokaryotes
All publicly available, completely sequenced genomes of the genus Pseudomonas (a total of 270) were downloaded from the National Center of Biotechnology Information's (NCBI) GenBank (Feb. 8, 2018), i.e. NCBI assembly level 'Complete genome', the highest level compared to levels 'Chromosome', 'Contigs', and 'Scaffold'. A repeat analysis (repeats longer than 500 bp and with greater 95% identity) was carried out as described earlier (11), allowing us to triage the genomes into three genome assembly complexity classes (26) and to visualize the data. The same analysis was performed for all complete prokaryotic genomes from Gen-Bank (9331 bacteria and 293 archaea, 23 February 2018).

Bacterial strain, genomic DNA extraction & sequencing
Pseudomonas koreensis P19E3 was isolated from healthy marjoram (Origanum marjorana) leaf material during an isolation survey on an organic herb farm (Boppelsen, Switzerland) in summer 2014 and classified by matrixassisted laser desorption/ionization time of flight (MALDI-TOF) biotyping analysis (27). Genomic DNA (gDNA) was isolated (Sigma GenElute kit), and PacBio SMRT sequencing was performed on an RS II machine (4 SMRT cells, P6-C4 chemistry) aiming to de novo assemble a complete genome as described (12). A size selection was performed with the BluePippin system, first for fragments larger than 10 kb (2 SMRT cells), later for fragments larger 20 kb (2 SMRT cells). For ONT sequencing, high molecular weight gDNA was extracted according to a recent protocol (7). The ONT libraries were prepared using a 1D 2 sequencing kit (SQK-LSK308) and sequenced on two R9.5 flow cells (FLO-MIN107). Base calling was performed using Albacore v.2.1.3. Finally, a 2 × 300 base pair (bp) paired end library was prepared using Illumina's Nextera XT DNA kit and sequenced on a MiSeq. ONT, PacBio and Illumina data were uploaded to NCBI SRA and can be accessed via Bio-Project PRJNA436895.

Genome assembly
Assemblies based on PacBio data. PacBio reads were assembled, once with HGAP v.3 (6) and once with Flye v.2.3 (bioRxiv: https://doi.org/10.1101/247148). HGAP3 was run on SMRT Analysis v.2.3.0 and Protocol 'RS HGAP Assembly.3' with data of all four SMRT cells (default parameters, except: min. subread length: 500; min. polymerase read quality: 0.80; estimated genome size: 7.5 Mb). To assemble the data with Flye, we first extracted the subreads using Protocol 'RS Subreads.1' of SMRT Analysis (min. subread length: 500, min. polymerase read quality: 0.83). The resulting subread FASTA files were then processed using Flye v.2.3 applying standard parameters and an estimated genome size of 7.5 Mb. No further polishing was performed for these two assemblies on top of the polishing and correction steps included in the assembly pipelines.
Flye assembly of ONT data. Base called ONT reads (we used 1D reads, which corresponded to 93% of the reads, and which on average are longer than 1D 2 reads) were filtered (30 kb or longer), and assembled with Flye v.2.3 (bioRxiv: https://doi.org/10.1101/247148) using standard parameters and performing three polishing iterations on Flye's polishing stage (-i 3 -nano-raw [ont reads.fastq] -g 7.7m). The genome size was estimated by a first, explorative Flye assembly, before removing a few spurious contigs, i.e., short Nucleic Acids Research, 2018, Vol. 46, No. 17 8955 contigs mainly consisting of repeated short DNA motifs and with a low coverage. The resulting circular contigs were linearized (start of dnaA gene for chromosome, putative intergenic region for plasmids). Also, a repeat analysis was performed to ensure that repeats longer than 10 kb were not disrupted when linearizing the circular chromosome and plasmids.
SPAdes assembly based on Illumina data. The 2 × 300 bp MiSeq data was assembled using SPAdes (v.3.11.1) applying standard parameters, aiming to catch potential small plasmids which could be missed due to size selection for long reads. The resulting Illumina assembly was also mapped to the final ONT assembly using minimap2 v.2.6 (with parameter '-x asm5'), and its output was converted to BED format, allowing to detect genomic regions not covered in the Illumina assembly compared to the ONT assembly using BEDtools v2.21.0 (28), as well as genes not covered or disrupted. The results were plotted using Circos v.0.69 (29).

Error correction & polishing of the ONT Flye assembly.
The post-processed ONT draft assembly was subsequently corrected by three iterative Racon runs (30). For each run, the filtered ONT reads were first mapped to the draft assembly using GraphMap v.0.5.2 (31). Reads entirely contained in long genomic repeats (longer than 10 kb) were excluded from mapping using samtools v.1.6 (32). The resulting SAM file was then provided to Racon v.0.5.0 for correction of the assembly. A polishing step using the ONT data was performed applying Nanopolish v.0.8.5 (methylation aware mode '-methylation-aware = dcm,dam') (33). Like before, reads which were entirely contained in a repeat region were discarded for polishing. To correct any remaining small assembly errors, a last polishing step was performed using the 2 × 300 bp Illumina data by mapping the raw reads to the assembly (bwa mem v.0.7.12) and using FreeBayes v.1.1.0 (https://arxiv.org/abs/1207.3907v2). As final step, a repetitive rearrangement region was manually corrected (see below).
Comparison of different assembly stages and check of final assembly. The different assembly stages were compared to the final Illumina-polished assembly using Quast v.4.6.3 (34). A last validation of the final assembly with respect to potential large scale mis-assemblies was performed using Sniffles v.1.0.8 (35) based on a mapping of ONT data with the NGM-LR mapper v.0.2.6 (35). The ONT mapping was subsequently visually inspected using the Integrative Genomics Viewer (IGV) (36). Potential small-scale errors were checked using Illumina data and FreeBayes in combination with visual inspection of the mapping. An overview of the steps carried out for genome assembly, polishing and comparison of different assemblies is shown as workflow in Supplementary Figure S1.
Local hybrid assembly of a highly repetitive rearrangement region. A highly repetitive--and by the Flye assembler mis-assembled--rearrangement region on plasmid 2 of the polished assembly was resolved using a hybrid assembly strategy applying ONT and Illumina data. First, all ONT reads spanning the region and covering about 10 kb of unique regions on both sides were extracted from a previous mapping with GraphMap (see above). Second, proovread v.2.14.1 (37) was used to polish the ONT reads with Illumina data. Two specific ONT reads which could be polished by proovread (without apparent dips in Illumina data coverage) were selected and again processed with proovread and all Illumina reads. Both polished ONT reads were cut at the same unique positions up-and downstream of the rearrangement region and compared using progressiveMauve v.2.3.1 (38). One of the polished reads was then used to replace the mis-assembled region of the polished assembly of plasmid 2. Next, the inverted repeats on the rearrangement region were identified and a multiple sequence alignment was created using MUSCLE v.3.8 (39) and visualized with WebLogo (40).
Repeat analysis for ONT assembly and comparison to the PacBio assemblies. A repeat analysis was performed for the final ONT assembly as described earlier (26), only considering repeats longer than 30 kb. The output was simplified and formatted as BED file to enable visualization as a circular plot. Nucmer 3.1 (41) was used to compare and map the PacBio assemblies (as query) onto the final ONT assembly (serving as reference) (parameters: '-mum -simplify -b 5000 -D 30 -d 0.25'). Matches smaller than 20 kb and below 95% identity were removed. The output was again formatted as BED file. The final ONT assembly, results from its repeat analysis and the mapping of PacBio assemblies were combined and plotted using Circos v.0.69 (29). The large repeats were also analysed for their gene content, which was classified into functional subsystems with the RAST annotation server.
Phylogenetic analysis. GenBank records of selected, completely sequenced Pseudomonas strains were downloaded from NCBI RefSeq (March 2018). The predicted coding sequences (CDS) of 107 known housekeeping genes (42) were extracted from 18 strains (including P. koreensis P19E3, and Azotobacter vinelandii DJ as outgroup) and used to calculate a maximum likelihood phylogenetic tree using bcgTree (43), as described previously (11).
Genome annotation. The genome sequence of P. koreensis P19E3 was submitted to NCBI GenBank and annotated with the genome annotation pipeline for prokaryotes (PGAP) (44). It is available under GenBank accession numbers CP027477 to CP027481.

Repeat analysis for the genus Pseudomonas
A bacterial strain isolated during a screening of herbal plants for food spoiling and pathogenic bacteria was assigned to the species P. koreensis by MALDI biotyping (27). However, although PacBio long read sequencing was used, it was not possible to assemble the genome de novo into a single complete chromosome.
Therefore, we explored the predicted genome assembly complexity (26) for the genus Pseudomonas, relying on the same thresholds used in that publication. Koren and colleagues had classified bacterial genomes in three categories based on (i) the number of repeats >500 bp with a nucleic acid sequence similarity higher than 95% and (ii) the length of the longest repeat >95% similarity: class I genomes contain <100 repeats of up to 7 kb (approximate length of the rDNA operon), class II genomes with >100 repeats of up to 7 kb, and finally class III genomes with repeats that can be substantially longer than the 7 kb of the rDNA operon.
We analyzed 270 publicly available complete Pseudomonas genome sequences (see Methods), which included two P. koreensis strains ( Figure 1). When selecting a plot range identical to that used by Koren et al., 245 genomes with repeat lengths of up to 30 kb and up to 300 repeats of 500 bp or above were captured ( Figure 1A).
Our repeat analysis indicated that 15 Pseudomonas genomes had longer repeats and 10 exhibited a higher overall number of repeats (see upper right box, Figure 1A), corresponding to roughly 9% of the Pseudomonas strains analyzed. When extending the plot range ( Figure 1B), we captured these highly complex genomes as well. Of note was a cluster of 10 Pseudomonas genomes with more than 400 repeats ( Figure 1B, Supplementary Table S1), 8 of which represented different P. syringae pathovar (pv.) actinidiae (Psa) strains, the causal agent of Kiwi fruit plant canker (a plant disease marked by gradual decay) (45). The only completely sequenced Psa strain not covered in this extended range (>300 repeats) was strain MAFF212063 that harbored 277 repeats ( Figure 1B). Among the most complex genomes in terms of repeat length were many P. aeruginosa strains (11 of 15), P. koreensis strain 26 (with repeats of 38 kb), P. knackmussii strain B13 with repeats of around 103 kb and P. putida DLL-E4 with repeats close to 120 kb (Figure 1B, Table 1). These strains also included our P. koreensis strain P19E3, which harbored near identical repeats of almost 70 kb (red circle, Figure 1B). The steps that finally enabled us to de novo assemble its complete genome purely by using long sequencing reads are described in more detail in the sections below.

Genome assembly of P. koreensis P19E3
Due to the genome complexity classification of previously sequenced P. koreensis strains, where one strain had a complex class III genome with repeats of around 38 kb (Table  1), we first attempted to sequence and de novo assemble the genome using PacBio's long read technology (46) combined with a size selection step (inserts greater 10 kb) and two SMRT cells, a strategy that had proven useful for a class III genome before (12). However, an assembly with HGAP3 resulted in 25 contigs and could not resolve the chromosome and plasmids; presumably due to the presence of long repeat sequences. Adding data from two more SMRT cells, this time relying on an even longer BluePippin size selection cut-off (>20 kb), improved the assembly; it returned 14, i.e. fewer contigs, but did not allow us to generate a complete genome assembly.
To assess the possibility that HGAP3 had difficulties in resolving long, near identical repeats, we then explored the Flye assembler (bioRxiv: https://doi.org/10.1101/247148), which was specifically designed to address this issue. The assembly created by Flye reduced the number of contigs to 7 and also showed less redundancy in repeat regions com-pared to the HGAP3 assembly. However, it did not result in a completely gapless assembly either.
To create a gapless complete genome assembly, we therefore ended up sequencing high molecular weight gDNA (see Materials and Methods) on the ONT MinION platform, relying on reads from two ONT 1D 2 libraries (Table 2). With an N50 read length of >44 kb (after having filtered for reads longer 30 kb), the read length of these two runs was significantly longer than the average N50 of close to 17 kb from the previous PacBio runs (Table 2, Figure 2).
A final de novo genome assembly attempt based on ONT MinION data resulted in one chromosome (6.44 Mb) and three large plasmids (468 kb, 300 kb, 283 kb) with a coverage exceeding 150-fold (Supplementary Table S3). Illumina data (∼100-fold coverage) was mainly used for correction of potential single base-pair errors (48), to achieve a very low final error rate (3,4) and to capture small plasmids that otherwise would be missed when relying on size selection approaches. Notably, a small plasmid of 2.8 kb size was identified by using SPAdes (47) (Figure 3, Table  3). As noted before (11), a de novo genome assembly based only on Illumina reads did miss a substantial fraction of the genomic sequence (here ∼4%), including about 3% of the CDS (Supplemental Figure 2). The various steps carried out for assembly, polishing and comparison of different assemblies are shown in Figure S1. A multilocus sequence analysis based on the comparison of 107 conserved housekeeping genes (42) allowed to establish the phylogenetic relationship of our strain with selected Pseudomonas strains and confirmed the MALDI biotyping assignment as a P. koreensis strain (Supplementary Figure S2).
We also analyzed the repeat complexity of the final genome assembly and identified three repeat pairs and one repeat triplet when filtering for repeats longer than 30 kb (Table 4). All repeats had a sequence identity higher than 99.7% with few gaps in the respective blastn alignment of repeat pairs. The longest repeat pair (69 kb) was shared between the chromosome and plasmid 3 ( Figure 3). This repeat pair could not be resolved when relying solely on PacBio data. The second longest repeat was a triplet of 46 kb located on the chromosome, on plasmid 1, and on plasmid 2, which could not be resolved either using only PacBio data. The same held true for the third longest repeat (37 kb) located on the chromosome and on plasmid 2. In contrast to HGAP3, a fourth repeat pair of 34 kb length could be resolved by the Flye assembler using only PacBio data. Of note, one copy of this fourth repeat pair was directly adjacent to the longest repeat on the copy located on the chromosome, potentially further complicating this assembly.
Finally, a putative shufflon recombination site on plasmid 2 (see below, and Figure 3, red mark), which had not been spanned by the HGAP3 PacBio assembly, was also resolved by the Flye PacBio assembly.

Genes encoded by the long repeats of the P. koreensis P19E3 genome
Of the 516 genes found in the long repeats described above, 159 (31%) could be assigned to a functional subsystem, 129 (25%) were identified as non-hypothetical genes that were not assigned to a functional subsystem and 228 (44%)  Fifteen genomes are shown whose longest repeat pair was >30 kb (sequence identity above 95%). Pseudomonas koreensis P19E3, de novo assembled in this study is shown in bold.   Table 4; they coincide with areas where the PacBio-based assemblies were fragmented. A genomic region identified as a structural variation (ca. 5 kb) is also shown (red mark on plasmid 2). For the HGAP3 assembly with PacBio reads (innermost circle), regions are marked (in yellow) which, due to an assembly error, mapped to both chromosome and plasmid; Flye was able to resolve this region. Since plasmid 4 did not get assembled with the long read technologies, due to its short size, there is no PacBio counterpart in the PacBio tracks. were identified as hypothetical using the RAST annotation server (49). These included genes related to mobile genetic elements such as Tn7-transposon like genes, indicating the presence of several different transposon insertion events. Additionally, phage related proteins and a toxin-antitoxin system were identified. A total of 31 genes were annotated as transposase related protein coding genes. The repeats in P. koreensis P19E3 therefore may be the result of a combination of phage and transposon activities. Interestingly, next to mobile genetic element related genes, many genes putatively encoding for copper, cobalt-zinc-cadmium or arsenic resistance proteins, copper transporters, and heavy metal sensor histidine kinases were detected. Of note were also genes annotated to encode for metabolically relevant proteins such as vanillate transporters and aromatic compound degradation proteins, which may confer a fitness advantage to P. koreensis when growing on aromatic herbs, that exude aromatic compounds to their leaf surfaces. Strikingly, genes encoding DNA helicases which are usually involved in DNA repair occurred in multiple copies in the repeats, which may reflect an adaptation of the bacterium to its phyllosphere habitat (see Discussion).

Polishing and quality assessment of the final assembly
Error correction and polishing of the P19E3 genome was done in several stages (Supplementary Figure S1; for software and respective versions used, see Methods). As the Flye assembler does not use quality value information, we first carried out three rounds of error correction with the ONT data using Racon (30), which can employ quality information encoded in FASTQ files. Compared to the initial Flye assembly, this step reduced the number of indel and mismatch assembly errors (Supplementary Table S2). Another step using Nanopolish (33) further reduced their amount. Remaining errors (compared to the final assembly), i.e. mostly single base-pair errors, were removed in a final polishing step using Illumina MiSeq data. We next mapped the Illumina data back against the polished genome assembly to assess if small scale misassemblies could be detected, combining an automated detection using Freebayes and a visual inspection with IGV.
No smaller scale mis-assemblies were detected. To find evidence for potential larger scale errors like inversions, we first mapped the ONT data on the final assembly using NGM-LR and then ran Sniffles on the mapping (35). However, except for a complex rearrangement region which was resolved later (see below), no evidence for a mis-assembly was found. The mapping statistics for all sequencing technologies are listed in Supplementary Table S3.

Resolving a highly repetitive rearrangement region
During the error correction of the ONT assembly, we identified a region of ∼5 kb on plasmid 2, which stood out due to the poor mappability of both ONT and Illumina reads (Positions 99 590 to 10 400). A tblastx search indicated that this region contained a shufflon system (50) ( Figure 4A). To resolve this putative shufflon region, a tailored hybrid polishing strategy using a combination of ONT and Illumina data from this region (see Methods) and the software proovread (37) was applied, which allowed to uncover evidence for two major variants of the shufflon ( Figure 4A and B). Since a mapping of ONT and Illumina data did not indicate an abundance difference of the two variants, one variant was picked to represent the shufflon in the final assembly. The analysis of the shufflon required a significant time investment, much greater than that required to arrive at the final, polished assembly. Once we had the ONT long reads, the de novo genome assembly and polishing could in fact be done quite fast (in less than a week), while the resolution of the shufflon region required substantially more time and manual input.
Of note, we detected 19 bp inverted repeats in the shufflon sequence, which all pointed inwards to a position at around 102.5 kb (track 4, Figure 4A). The coverage of mapped Illumina reads (track 5, Figure 4A) exhibited characteristic coverage dips around most repeat positions, with a coverage peak exactly at the repeat position (shown in black on 'Coverage Illumina' track, Figure 4A) (51). Importantly, rearrangements for most inverted repeat sites between the two recombination variants were observed (black arrows, Figure 4B), except for two repeat positions which did not show any rearrangement (white arrows). To additionally check the replaced region, ONT and Illumina data were mapped to this region. We could confirm that the mapping in the corrected region was now of a high quality, although showing a lower coverage compared to the rest of plasmid 2, since the mapped reads originated from differently recombined versions of the shufflon. Importantly, a highly conserved sequence motif of the inverted repeats could be identified for this shufflon system ( Figure 4C), where only the bases at position 14 and 16 of the 19 bp motif showed variation. The motif sequence was closely related to a repeat reported for an Inc2 plasmid shufflon (51).

Prevalence of very complex to assemble prokaryotes
To explore the extent of genome assembly complexity for prokaryotes in general, we first analyzed the repeats of 9331 publicly available, complete bacterial genomes. Notably, the existence of complex genomes was not unique for the genus Pseudomonas, but a general feature of bacteria. Overall, 663 genomes (7.1%) harbored more than 300 repeats and 300 strains had genomes that contained near identical repeats larger than 30 kb (3.2%) ( Figure 5, Supplementary Table  S4). A total of 151 genomes were not covered in the plot (all of them bacteria), including 102 genomes with >600 repeats and 49 genomes whose longest repeats exceeded 100 kb ( Figure 5, upper right box).
Among the complex, overall very repeat-rich genomes the following five species predominated (the top 10 species are listed in Table 5 (Table 5). For the opportunistic pathogen E. faecium, 53% (23 of 43) of the genomes harbored more than 300 repeats. These data might reflect a greater impetus to sequence the full genomes of bacteria with pathogenic lifestyles. In support of this hypothesis, the histogram on top of Figure 5 showed a peak of strains with around 300 repeats (see also Table 5). This peak originated from an overrepresentation of Bordetella pertussis genomes in NCBI GenBank, apparently due to a focused sequencing project for this pathogen.
Furthermore, for strains whose genomes contained very long near identical repeats, the following five species were most prevalent (the top 10 species are listed in Table 5, lower panel): Escherichia coli (25 of 300), Francisella tularensis (14 of 300), Pseudomonas aeruginosa (11 of 300), Burkholderia mallei (8 of 300) and Mycobacterium bovis (7 of 300). For these species, all described as pathogens, only a smaller fraction of the sequenced strains contained many repeats: 10% (11 of 106) for P. aeruginosa strains and 5% (25 of 457) for E. coli. A higher percentage was observed for F. tularensis strains with 38% (14 of 37) and for M. bovis strains with 44% (7 of 16). Strikingly, B. mallei strains apparently combine both very long repeats (8 of 28, 29%) with a large number of repeats (28 of 28, 100%). They will thus represent a formidable challenge when aiming for a complete de novo genome assembly.
We also explored complete genomes of archaea, for which 293 genomes have been deposited at NCBI (Supplementary  Table S4). While seven genomes harbored more than 200 repeats, none of them contained more than 300 repeats. In Figure 5. Genome assembly complexity of 9331 bacteria and 293 archaea whose genomes have been completed. Blue/orange dots represent bacterial/archaeal assemblies, respectively, with the overall number of repeats as x-axis coordinate (up to 600 repeats) and the longest repeat as y-axis coordinate (up to 100 kb). The dotted lines separate the three genome assembly complexity classes; the solid black line represents the original plot dimensions by Koren and colleagues (26). The light gray zone represents genomes that potentially can be assembled with PacBio reads depending on the quality of the library and sequencing run. Above 45 kb repeat length (dark gray zone), ONT reads are expected to provide significant benefit whereas closed genomes with PacBio reads only would depend on further advances in technology. Barplots visualize the distribution of the overall number of repeats per strain on top and longest repeat per strain on the right. Table 5. Overview of top 10 species among 663 genomes with more than 300 repeats (upper panel) and among 300 genomes with very long repeats (> 30 kb), lower panel. contrast, six genomes featured long repeats >30 kb (Supplementary Table S4), indicating that a sizeable number of archaea likely also contain very complex genomes. Finally, for 70 bacterial assemblies we found repeats longer than the 69 kb reported for P. koreensis P19E3 (0.7% of 9331 bacterial assemblies; Supplementary Table S4). An examination of 23 of these cases, where the assembly could be easily linked to a publication via its GenBank record, indicated that virtually all of the assemblies had required additional, very labor-intensive methods like cosmid libraries, optical mapping, primer walking or some form of manual curation. In another case, ONT data was used in combination with PacBio to resolve repeats of 212 kb for E. coli O157 (see Supplementary Table S4). The longest repeat (2 974 674 bp) was identified for Calothrix sp. NIES-4101 (Supplementary Table S4), whose assembly was based on short Illumina reads and with a moderate coverage of 53fold. A closer inspection revealed that the sequence of plasmid 1 (2,97 Mbp) was entirely contained in the chromosome sequence (7.24 Mbp) and was identical over its entire length (Supplementary Figure S3), suggesting a mis-assembly.

DISCUSSION
The de novo genome assembly of P. koreensis P19E3 turned out to be extremely challenging, which could be attributed to the presence of long, near identical repeat pairs of 34, 37, 46 and 69 kb, and a complex shufflon region. Only by using very long reads from the ONT platform (mean read length 44 kb) was it possible to fully resolve this genome, indicating the presence of 'dark matter' of genome assembly also in prokaryotes (25). Notably, the long repeats were shared between the chromosome and three large plasmids (Figure 3), a feature we had not observed before in ∼50 other complete de novo assembly projects. Two long, near identical repeats of 69 and 34 kb were located adjacent to each other on the chromosome, only separated by a few bases of unique sequence. This example indicates that the assembly complexity for some genomes will even be greater than suggested by mere analysis of individual repeats. In this specific case, an ONT read of 164 kb spanned these two repeats and unique sequences on both ends. To create our final assembly, we relied on 1D ONT reads, which are longer than 1D 2 reads and are particularly valuable for de novo genome assembly of complex genomes. Of note, the mappability of ONT reads drops substantially for very long reads (Supplementary Table S5). A significant advantage of the long reads compared to an Illumina short read only assembly concerned the correct resolution of all large plasmid sequences (and thereby clear separation from the chromosome), which otherwise would have been missed at least partially, thereby losing important information for a biological interpretation of the genome sequence. Conversely, the short reads allowed iden-tification of a small plasmid of 2.8 kb, indicating that a combination of technologies is required for a complete assembly.
On top of the long near-identical repeats, a highly repetitive shufflon region represented a second major obstacle for complete genome assembly of P. koreensis P19E3 (Figure 4) (54). Again, long read data was critical in allowing us to resolve two similarly abundant variants; identification of potential additional, lower abundant variants would require a deeper sequencing of this genomic region, which was beyond the scope of this study. A genome assembly based only on Illumina data resulted in a highly fragmented assembly in this shufflon region (Supplementary Figure S2). The large majority of the time-consuming manual analysis was spent on this shufflon system. In contrast, the final de novo assembly using ONT data and the Flye assembly algorithm plus the downstream sequence polishing only took a few workdays. This represents a significant time gain over the time-consuming steps otherwise required to close the gaps in highly complex genomes, as indicated by our exploratory analysis of the most complex genomes (see below). While we do not claim that our approach is the best for all complex prokaryotic assemblies, in our hands Flye was faster than Canu (15) and did not require post-processing, i.e. removal of terminal direct repeats, which are an artefact of the assembly for many assemblers, like for example with Canu.
The original repeat analysis study by Koren et al. (26) reported that Class III genomes contain large phage-mediated repeats, segmental duplications, or large tandem arrays that are significantly larger than the rDNA operon. Previously, we could show that repeat-rich class II genomes can harbor a large number of transposon and insertion sequences (11). Moreover, our analysis of Streptomyces genomes established that a fraction of the very long repeats are linear, inverted, terminal repeats. The analysis of the long P. koreensis P19E3 repeats indicates that some of the duplication events may have conveyed a fitness advantage to the strain, which was isolated from plant leaves of aromatic herbs: First, leaves are light harvesting organs, and bacteria living on leaves are therefore constantly exposed to DNA damaging sunlight. The multiple DNA helicases harbored in the long repeats may therefore be advantageous and carry out a DNA repair function (55,56). Second, aromatic herbs are sources of aromatic compounds; the ability to degrade aromatic compounds may thus confer a fitness advantage to the strain in its natural habitat. Third, in Switzerland, organic farms often use copper for treatment of plant diseases; the abundance of copper and heavy metal resistance genes in the genome may therefore be a direct result of such treatments. Together, these data suggest that at least part of the long repeat sequences in prokaryotic genomes may indeed be functionally relevant.
As a service to the community, we here publicly release the results of the repeat analysis for over 9600 prokaryotic genomes (Supplementary Table S4). An analysis as shown in Table 5 could thus be readily carried out for many taxonomic groups enabling researchers to assess the expected difficulty for complete de novo genome assembly projects and develop the best-possible strategy for their organisms of interest. An identification of complex repeat regions as mentioned above and a graphical visualization on top of identifying individual repeat elements could represent a valuable Nucleic Acids Research, 2018, Vol. 46, No. 17 8963 future addition to the genome assembly complexity classification.
Importantly, the analysis of all publicly available, complete prokaryotic genomes from NCBI Genbank uncovered an unexpected high percentage of complex genomes. Roughly 7% were class II genomes, which are characterized by a large number or repeats, but none longer than the rDNA operon. For the genus Pseudomonas, the plant pathogenic Psa strains contained the largest number of repeats (ranging from 277 to 587). We have recently shown that such class II genomes can be readily assembled into complete genomes relying on PacBio data (11). In that case study, complete genomes greatly facilitated a comparative genomics study, as the repeats also contained core genes, that, by relying on Illumina data alone, would have been missed. Similarly, PacBio long reads have facilitated complete de novo assembly for many class III genomes (12). Based on our experience with ∼50 de novo genome assemblies, they provide a very high quality genomic sequence information, that has fewer errors or homopolymer stretches than an ONT assembly. However, we here identified a sizeable and particularly difficult subset of strains (∼3%) whose genomes harbor long near identical repeats above 30 kb to over 100 kb in length. These cases will greatly benefit from very long ONT reads and appropriate assembly algorithms like Flye (bioRxiv: https://doi.org/10.1101/ 247148). As complex genomes will be under-represented per se among completely sequenced genomes, we assume that the fraction of complex genomes may even be higher than this estimate. The fraction of class II and III genomes may also increase when results of larger sequencing efforts of diverse prokaryotes like the GEBA initiative (16) and the NCTC 3000 project will become available. So far, we noted a bias for prokaryotes with pathogenic lifestyles, that likely reflects research priorities or funding opportunities. For archaea, more data is needed for a more meaningful comparison with bacteria.
Considering the difficulties in resolving the 69 kb repeats of P. koreensis P19E3, even with PacBio data, the repeat analysis results also raise questions: How were researchers able to assemble genomes with near identical repeats of 100 kb and longer? While only a fraction of the complete genome assemblies specified the sequencing technology they had used, we expect that some of the assemblies that did not rely on PacBio or even ONT data contain assembly errors. In the case of Pseudomonas aeruginosa PA1R (57), we noted that the assembly differed in its repeat structure between two assembly versions: while the first assembly relied on Illumina data and contained a 44 kb repeat, the second had relied on PacBio data; there, the repeat was not present any more. Also, the assembly with the longest repeat of 2.97 Mb, upon closer inspection, appeared to represent a mis-assembly. We believe that NCBI Genbank would benefit from flagging assemblies with large repeats that are only based on Illumina short read data without evidence for further efforts to close the gaps. A closer inspection of 23 assembly projects of genomes with repeats longer than 69 kb (70 projects in total) indicated that the researchers had to put in a serious effort to arrive at a complete genome sequence: they had either relied on larger insert clone libraries like cosmids (up to about 45 kb inserts), bacterial artificial chromosomes (BACs, with larger insert sizes), primer-walking or some form of manual curation to generate a complete genome sequence. For P. putida DLL-E4, the most complex Pseudomonad, time-consuming primer-walking was used to close the genome (58), whereas in the case of P. knackmussi B13, optical mapping was used (59). With the availability of very long DNA reads and appropriate bioinformatics tools, such labor-intensive steps likely will become obsolete for prokaryotes. In our hands, the longest mappable ONT read was 288 kb in length. These advances will further accelerate the exponential increase in complete genome sequences and will also have a big impact on the de novo genome assembly of eukaryotic genomes, as exemplified for the human genome (7), and for nematodes (60).
Finally, we like to caution researchers that plan to rely on a reference-based genome assembly strategy, i.e. mapping sequencing reads against a reference instead of opting for a de novo genome assembly: they may easily oversee existing genome differences among closely related strains like single nucleotide variations or larger regions unique to either genome (12). Combined with a publicly available solution for accurate genome annotation of prokaryotes by proteogenomics (12), long reads and modern de novo genome assemblers will enable the research community to exploit the increasing number of complete bacterial genome sequences. Among other benefits, we expect this will help to unravel new functions encoded by diverse microbiomes, push application of SMRT sequencing for diagnostics (61) and accelerate the identification of urgently needed new antibiotics.

DATA AVAILABILITY
The genome assembly difficulty classification along with the total number of repeats and the length of the longest repeat is available for more than 9300 bacteria and close to 300 archaea in Supplementary Table S4. The genome sequence of P. koreensis P19E3 is available under GenBank accession numbers CP027477 to CP027481. The raw sequencing data was submitted to the NCBI Sequence Read Archive (NCBI BioProject PRJNA436895, BioSample SAMN08634678).