Single-molecule sequencing and optical mapping yields an improved genome of woodland strawberry (Fragaria vesca) with chromosome-scale contiguity

Abstract Background Although draft genomes are available for most agronomically important plant species, the majority are incomplete, highly fragmented, and often riddled with assembly and scaffolding errors. These assembly issues hinder advances in tool development for functional genomics and systems biology. Findings Here we utilized a robust, cost-effective approach to produce high-quality reference genomes. We report a near-complete genome of diploid woodland strawberry (Fragaria vesca) using single-molecule real-time sequencing from Pacific Biosciences (PacBio). This assembly has a contig N50 length of ∼7.9 million base pairs (Mb), representing a ∼300-fold improvement of the previous version. The vast majority (>99.8%) of the assembly was anchored to 7 pseudomolecules using 2 sets of optical maps from Bionano Genomics. We obtained ∼24.96 Mb of sequence not present in the previous version of the F. vesca genome and produced an improved annotation that includes 1496 new genes. Comparative syntenic analyses uncovered numerous, large-scale scaffolding errors present in each chromosome in the previously published version of the F. vesca genome. Conclusions Our results highlight the need to improve existing short-read based reference genomes. Furthermore, we demonstrate how genome quality impacts commonly used analyses for addressing both fundamental and applied biological questions.

Have you included the information requested as detailed in our Minimum Standards Reporting Checklist?
Availability of data and materials All datasets and code on which the conclusions of the paper rely must be either included in your submission or deposited in publicly available repositories (where available and ethically appropriate), referencing such data using a unique identifier in the references and in the "Availability of Data and Materials" section of your manuscript.
Have you have met the above requirement as detailed in our Minimum Standards Reporting Checklist?

Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation
Eukaryotic genomes, particularly plants, are notoriously difficult to assemble because of issues related to high repeat content, a history of gene and whole genome duplications, and regions of highly skewed nucleotide composition 1 . The short-reads (50-300 bp) generated by next-generation sequencing (NGS) technologies are often insufficient to resolve complex genomic features and regions. NGS reads are unable to span large repetitive regions resulting in sequence gaps and ambiguities in the assembly graph structures. Despite this known limitation, NGS has been used for the majority of genome sequencing projects over the past decade resulting in a series of unfinished, fragmented draft genome assemblies 2 . For instance, the genome of woodland strawberry (Fragaria vesca 'Hawaii-4') was assembled using a mixture of different short read technologies and yielded 16,487 contigs in 3,263 scaffolds with an N50 length of 27 kb 3 . Dense linkage maps were later utilized to split multiple chimeric scaffolds and improve anchoring to the seven pseudomolecules 4 . However, the F. vesca (version 2; V2) genome remains incomplete with 6.99% gaps, missing megabase-sized regions, and scaffolding errors.
Fragaria vesca serves as an important model system for genetic studies for the Rosaceae community, due to its small stature, short generation time, a simple and efficient system for genetic transformation, and an increasing number of genetic resources [5][6][7] . With more than 2,500 described species, Rosaceae is one of the most speciose eudicot families and includes a breadth of important crops (e.g. almonds, apples, apricots, blackberries, cherries, peaches, pears, plums, raspberries, roses and strawberries) 8 . Furthermore, F. vesca is a valuable genetic resource because it is the putative diploid progenitor of the A subgenome of the cultivated octoploid strawberry (F. x ananassa) 9 . Strawberries are of major economic importance worldwide with 373,435 hectares planted and 8,114,373 metric tonnes of fruit produced in 2014 10 . Previous versions of the F. vesca genome have been used to uncover underlying genetic factors regulating plant and fruit development, seasonal flowering, sex determination, metabolite diversity, and disease resistance [11][12][13][14][15][16] . A high-quality reference genome for F. vesca would further enable family-wide comparative studies and leverage the strengths offered by this model system for both fundamental and applied research.
We aimed to improve the F. vesca 'Hawaii-4' reference genome using a long-read PacBio single-molecule real-time (SMRT) sequencing approach. We generated 2.5 million PacBio reads collectively spanning 19.4 Gb (80.8x coverage) with a subread N50 length of 9.2 kb (Supplemental Figure 1; NCBI BioProject ID PRJNA383733). The raw PacBio reads were error corrected and assembled using the Canu 17 assembler followed by two rounds of polishing with Quiver 18 . High coverage (~40x) Illumina data was aligned to the PacBio assembly and residual errors were corrected using Pilon 19 . After removing the complete chloroplast and mitochondrial genomes, the final assembly spanned 219 Mb across 61 contigs with an N50 length of 7.9 Mb. Half of the assembly is contained in the largest 9 contigs, including five that exceed 10 Mb. The assembly graph is relatively simple with few ambiguities excluding a small cluster of five contigs corresponding to rRNA gene arrays from the nucleolar organizer region (Supplemental Figure 2). This represents a ~300 fold improvement in contiguity compared to the Illumina and 454 based F. vesca assembly 3 . 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 The PacBio based contigs were anchored into a chromosome-scale assembly using a two-enzyme BioNano genome map. Contigs were scaffolded first using the BsqQI map and this hybrid assembly was used as a reference for the BssSI map. The combined BioNano and PacBio assembly spans 220.8 Mb across 31 scaffolds with an N50 length of 36.1 Mb and 99.8% of the assembly captured in 9 scaffolds (Supplemental Table 1). Five of the seven F. vesca chromosomes are complete and two chromosomes were assembled into chromosome arms. The two pairs of chromosome arms were anchored using support from genetic maps 3 . The PacBio and BioNano assembly (hereon referred to as F. vesca V4) captures ~24.96 Mb of additional sequences with significant improvements in contiguity. F. vesca V4 has nine terminal telomere tracks with sequence and genome map support (Figure 1, Supplemental Figure 3), suggesting that the assembly is largely complete. Tandem arrays of centromeric repeats with monomeric lengths of 140, 143, and 147 bp were found in all seven chromosomes, consistent with previous findings 3 . F. vesca V4 contains three nucleolus organizer regions (NOR) at the beginning of Fvb1 and Fvb7 and at the end of Fvb5, consistent with previous cytological observations 20 . NOR rRNA arrays are complete on Fvb1 and Fvb5, but fragmented on Fvb7, based on sequence and genome map support. The 5S rRNA array is located 5 Mb upstream of the NOR on Fvb7 (Supplemental Figure 4). The F. vesca V4 assembly and annotation will be made publicly available on Genome Database for Rosaceae (https://www.rosaceae.org/), Phytozome (www.phytozome.net/) and CyVerse CoGe platform (https://genomevolution.org/; Genome ID: 34925).
A whole genome comparison of F. vesca V4 to V2 4 uncovered numerous, large-scale scaffolding errors made in each of the chromosomes in the previous version (Figure 2). The overall quality of the F. vesca V4 assembly, compared to V2, is also supported by the distribution pattern of DNA methylation across chromosomes (Supplemental Figure 5). These types of errors considerably hinder various genomic analyses, including fine-mapping genes underlying traits 21 and identifying structural variants via comparative genomics. Here we demonstrate the superior quality of F. vesca V4 by making comparisons to a high-density linkage map of Fragaria iinumae 22 , which is another putative diploid progenitor species of the cultivated octoploid strawberry. The total number of collinear markers against the F. iinumae genetic map increased by over 10% using F. vesca V4, compared to V2, and identified a distinctive chromosomal inversion between the two species near the pericentromeric region on chromosome 3 (Supplemental Figure 6, Supplemental Table 2, Table S1).
Although the quality of previous annotations of the F. vesca genome 3,23 is comparable to other annotations of short-read assemblies, they are, unavoidably, incomplete and fragmented resulting in errors in gene identification and gene number predictions 24 . Thus, despite the increasing volume of transcript and protein sequence information generated from various experimental studies, the task of improving genome annotation of such genomes remains a major challenge. Using the MAKER-P annotation pipeline 25 , publicly available transcriptome data of F. vesca, and protein sequences from Arabidopsis thaliana and the UniprotKB database as evidence, we identified 28,588 gene models in F. vesca V4, of which 70% have a known Pfam domain. The mean length of the predicted genes is 1,475 bp (Supplemental Table 3). Repetitive elements were annotated, including long terminal repeat retrotransposons (LTR-RTs) 1 2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 (e.g., gypsy and copia; Figure 1), non-LTR retrotransposons, and DNA transposons, using RepeatModeler 26 , MITE_Hunter 27 , and LTR_retriever 28 . Most repetitive elements are unassembled, incomplete or collapsed in short-read based reference genomes, which results in the underestimation of the repeat content of most eukaryotic genomes 29 . The improvement in genome quality of F. vesca V4 permitted the identification of additional LTR-RTs compared to previous versions of the genome (Supplemental Table 4). Furthermore, an analysis of the insertion times of each LTR-RTs indicates that there were two major LTR-RT bursts; approximately 1.8 and 1.2 million years before present (Supplemental Figure 7). Organellar genomes from the plastid and mitochondrion were also annotated and verified for completeness (Supplemental Figures 8-9).
The Benchmarking Universal Single-Copy Orthologs (BUSCO V2 30 ) method was used to estimate the completeness of genome assembly and quality of gene annotation of F. vesca V4. The majority (95%) of the 1,440 core genes in the embryophyta dataset were identified in the annotation, which is supportive of a high-quality assembly and annotation similar to other 'platinum' grade genomes [31][32][33] . The overall quality of the annotation is further supported by the distribution of DNA methylation across the gene bodies (Figure 3). The F. vesca V4 annotation shows much sharper distribution patterns, especially in the CG context, and lower CHG and CHH (where H=A, T or C) methylation in the gene bodies. These patterns are expected for annotations that are more accurate and contain fewer mis-annotations (e.g., pseudogenes, transposons, etc). Additionally, F. vesca V4 contains 1,496 newly predicted gene models, with a mean length of 1,505 bp, that were not present in the previous versions of the genome 3,23 . The vast majority of these new genes (1,463 total) are expressed in different fruit tissues and developmental stages (Figure 4; Table S2). Thus, previous expression studies may have missed key genes controlling fruit development and maturation in F. vesca 34,35 . Of the new genes in F. vesca V4, 810 genes did not show similarity at the protein level (query length < 30%, E= 10 -10 ) to any paralogs in the V2 genome but exhibit unique expression patterns ( Figure  4). We also identified significantly more tandemly duplicated genes and larger tandem arrays in F. vesca V4 (Supplemental Figure 10). Long-read single molecule sequencing approaches have been shown to better resolve tandemly repeated copies [36][37][38] . The identification of tandemly duplicated genes is important since such genes are known to be highly enriched for both abiotic and biotic stress related functions 39 . For example, many important plant defense genes, including nucleotide-binding site leucine-rich repeat (NBS-LRR) 40 and cytochrome p450s (CYPs) 41 , are tandemly duplicated and exhibit high levels of copy number variation within a species.