Construction of a chromosome-scale long-read reference genome assembly for potato

Abstract Background Worldwide, the cultivated potato, Solanum tuberosum L., is the No. 1 vegetable crop and a critical food security crop. The genome sequence of DM1–3 516 R44, a doubled monoploid clone of S. tuberosum Group Phureja, was published in 2011 using a whole-genome shotgun sequencing approach with short-read sequence data. Current advanced sequencing technologies now permit generation of near-complete, high-quality chromosome-scale genome assemblies at minimal cost. Findings Here, we present an updated version of the DM1–3 516 R44 genome sequence (v6.1) using Oxford Nanopore Technologies long reads coupled with proximity-by-ligation scaffolding (Hi-C), yielding a chromosome-scale assembly. The new (v6.1) assembly represents 741.6 Mb of sequence (87.8%) of the estimated 844 Mb genome, of which 741.5 Mb is non-gapped with 731.2 Mb anchored to the 12 chromosomes. Use of Oxford Nanopore Technologies full-length complementary DNA sequencing enabled annotation of 32,917 high-confidence protein-coding genes encoding 44,851 gene models that had a significantly improved representation of conserved orthologs compared with the previous annotation. The new assembly has improved contiguity with a 595-fold increase in N50 contig size, 99% reduction in the number of contigs, a 44-fold increase in N50 scaffold size, and an LTR Assembly Index score of 13.56, placing it in the category of reference genome quality. The improved assembly also permitted annotation of the centromeres via alignment to sequencing reads derived from CENH3 nucleosomes. Conclusions Access to advanced sequencing technologies and improved software permitted generation of a high-quality, long-read, chromosome-scale assembly and improved annotation dataset for the reference genotype of potato that will facilitate research aimed at improving agronomic traits and understanding genome evolution.

We have revised our manuscript to address the reviewers' comments and have provided (below) a point-by-point response to their comments. We have also added the RRIDs and made a few other minor edits to the manuscript. We have uploaded a marked-up copy of the revised manuscript, along with the final revised manuscript, to the GigaScience website. We have also released our files on the Dryad Digital Repository and in the NCBI SRA. We hope our manuscript is now suitable for publication in GigaScience.

C. Robin Buell
Response to Reviewers' Comments
Reviewer reports:
Reviewer #1: Review: Construction of a chromosome-scale long-read reference genome assembly for potato
The authors describe an updated genome assembly for potato and provide genome annotations, notably the annotation of centromeres. The reported genome assembly is a substantial improvement over the previously released ones. This study and the associated data are very valuable to the potato genetics and breeding communities.
While the manuscript is well written, we have a few minor comments:
1. First of all, the main text has no line numbers for reviewers, which makes it a little hard to give specific comments.
Author Response: We have inserted line numbers in the revised document.
2. Would the authors like to report the ONT sequencing depth in both the main text and the tables, for example Table S2?
Author Response: The coverage of the reads used in the assembly has been added to the main text and Table S2.
3. For each polishing step, would you like to report the improvement (or changes) you gained? Also, I would like to know why you chose three rounds of Pilon polishing. Why not 2 or 4 rounds?
Author Response: Polishing is a tradeoff between fixing true assembly errors and polishing errors into the assembly, especially in the later stages, where you run the risk of degrading repetitive regions while fixing few true errors. At the third round of Pilon, the number of errors fixed reached a plateau and the BUSCO score reached its maximum. We feel the final BUSCO metrics for the genome assembly and annotation, and the LAI score of the genome assembly, show that the polishing was sufficient.
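The plateau criterion described in this response can be illustrated with a small sketch. This is not the authors' procedure: the function name `plateau_round`, the 5% threshold, and the per-round correction counts in the usage note are all hypothetical and serve only to show one way such a stopping rule could be expressed.

```python
def plateau_round(fixes_per_round, min_gain=0.05):
    """Return the 1-based polishing round at which corrections plateau.

    fixes_per_round[i] is the number of corrections applied in round i + 1.
    A round is treated as the plateau when it adds less than `min_gain`
    (as a fraction) to the cumulative corrections of all earlier rounds.
    Returns None if no round meets that condition.
    The 5% default is an arbitrary illustrative threshold.
    """
    total = 0
    for round_no, fixes in enumerate(fixes_per_round, start=1):
        if total > 0 and fixes / total < min_gain:
            return round_no
        total += fixes
    return None
```

With made-up counts such as `[120000, 9000, 2500, 2300]`, the third round adds under 2% to the corrections already made, so the function returns 3. In practice such a rule would be read alongside an independent quality measure such as the BUSCO score mentioned above.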
Author Response: As the DM potato is a doubled monoploid propagated by cloning, it is homozygous, although mutation could introduce variants. To illustrate this, we have performed a GenomeScope analysis using the Illumina whole-genome shotgun sequencing reads. We have added the estimated heterozygosity (0.0383%) to the genome assessment section and added a new supplemental figure (Figure S2) showing the k-mer distribution generated by GenomeScope, which clearly shows DM to be homozygous, so the presence of haplotigs is not expected.

5. In the genome assessment section, would you like to report the heterozygous sites (polymorphic sites) you called from the shotgun read alignments? This would be interesting for readers.
Author Response: See response to #4 above.
Reviewer #2: Pham et al. present a reference-quality genome assembly for a doubled monoploid potato clone using Oxford Nanopore long reads and Hi-C scaffolding. Previously generated resources, including a genetic map from 190 individuals, were used to validate the placement of scaffolds onto chromosome-sized pseudomolecules. New Oxford Nanopore cDNAs and published RNA-seq libraries were used to annotate gene models, which yielded complete representation of ~93% of the BUSCO orthologs.
This new cultivated potato assembly is a considerable improvement over previous versions and will be a welcome addition to the growing number of high-quality plant genome assemblies. Overall, the manuscript is well written and organized with adequate detail to reproduce the assembly and annotations. I think the depth of analysis here is probably more than sufficient for a data note, and all figures, tables, and supplementary materials are warranted and clearly presented.
I have some minor comments on a few places where I feel additional details or clarification would be helpful:
1. Was the average size of the isolated high molecular weight DNA measured?
Author Response: Based on the Fragment Analyzer results, we estimate the size of the high molecular weight DNA as > 60 kbp.
subcommand. This is described in our methods for the polishing: "An updated consensus VCF file was generated using nanopolish variants --consensus -x 5000 and the polished assembly generated using the VCF file with nanopolish vcf2fasta."
6. How many Illumina reads were used with Pilon? The ~459 million mentioned in the contiguity and accuracy section?
Author Response: Correct. We have updated the text to specify the Illumina library ID (PEP_AA_01) in the section describing Pilon polishing and in the contiguity and accuracy section. Also, Table S1 previously reported the read pair count for the Illumina libraries; it has been updated to show the total read count.
7. Could you elaborate on how "recombination bins [were] manually adjusted to eliminate incorrect bins"? How were these bins identified as incorrect?
Author Response: There are occasionally mistakes in the genotyping data of one or two individuals in the population that create the appearance of double recombination events in the genetic map. A true double recombination within such a short interval is highly unlikely in one individual, so these positions were rescored as 'no call'.
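The rescoring described in this response was done manually. Purely as an illustration of the underlying idea, the sketch below flags apparent double recombinants in one individual's genotype calls along a chromosome. The function name, the one-character-per-marker encoding, the `-` no-call symbol, and the `max_run` parameter are assumptions made for this example and are not taken from the manuscript.

```python
NO_CALL = "-"  # hypothetical symbol for a 'no call' genotype


def rescore_double_recombinants(genotypes, max_run=1):
    """Mask short genotype runs that imply an unlikely double recombination.

    `genotypes` is a string with one call per marker, in map order, for a
    single individual (e.g. "AAAABAAAA"). Any run of at most `max_run`
    identical calls whose neighbours on both sides carry the same, different
    call is replaced by NO_CALL. Flanks are compared on the original calls.
    """
    calls = list(genotypes)
    n = len(genotypes)
    i = 0
    while i < n:
        # Find the end (exclusive) of the run of identical calls starting at i.
        j = i
        while j < n and genotypes[j] == genotypes[i]:
            j += 1
        flanked = 0 < i and j < n  # run has a neighbour on each side
        if (flanked and j - i <= max_run
                and genotypes[i] != NO_CALL
                and genotypes[i - 1] != NO_CALL
                and genotypes[i - 1] == genotypes[j]):
            for k in range(i, j):
                calls[k] = NO_CALL
        i = j
    return "".join(calls)
```

For example, `rescore_double_recombinants("AAAABAAAA")` returns `"AAAA-AAAA"`, whereas a genuine single crossover such as `"AAAABBBB"` is left unchanged.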
8. Intact LTRs were annotated using LTRharvest, LTR_finder and LTR_retriever for assessing assembly continuity using the LAI metric. Were these identified LTRs later used in RepeatModeler to mask the assembly or included in the final custom repeat library? Which set(s) of repeats were used to soft mask the genome prior to gene prediction?
Author Response: The annotation of the genome assembly and the genome LAI analysis were performed independently. The construction of the custom repeat library, repeat masking, and the use of the repeat-masked genomes in the annotation, including the programs and commands used, are fully described in the methods.
9. How much cDNA and RNA-Seq transcript data were ultimately aligned to the genome and used for gene annotation?
Author Response: We have added the alignment rates for the nanopore and the RNA-seq data to the text.
10. What do the green boxes in Fig S1 represent?
Author Response: The green boxes are the individual scaffolds within the pseudomolecule. The boundaries of the pseudomolecules are represented by blue boxes. We have added text to the legend of Figure S1 to clarify what the blue and green boxes represent.

The genome of the vegetable crop potato (Solanum tuberosum L., NCBI:txid4113) was published in 2011 by the Potato Genome Sequencing Consortium (PGSC) using a whole-genome shotgun sequencing approach [1]. At that time, Illumina sequencing was a newly available approach with high accuracy and throughput relative to previously available technologies. The reference genome was generated from the doubled monoploid clone, DM1-3 516 R44 (hereafter referred to as DM; Figure 1), to reduce assembly difficulties due to the heterozygous and polyploid nature of tetraploid potato.
The PGSC DM genome was assembled using a combination of 36 nucleotide (nt) reads from the Illumina Genome Analyzer platform and scaffolded using longer end sequence reads from fosmid and bacterial artificial chromosome clones generated using Sanger sequencing technology. This resulted in a highly fragmented genome assembly, with 90% of the assembly contained in 443 super-scaffolds with an N90 super-scaffold length of 359 kb and an N50 contig length of 31.4 kb [1]. With access to additional genetic maps and comparative data with tomato, the ordering, orientation and anchoring of the initial PGSC assembly to the 12 chromosomes of potato was improved, yielding v4.03 of the DM genome [2]. DM v4.03 was then supplemented by the addition of new, unscaffolded contigs (v4.04) [3] (Table 1) generated through whole-genome sequencing and assembly of unaligned reads.
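The N50 and N90 statistics quoted above follow a standard definition: sort the sequences from longest to shortest and report the length of the sequence at which the running total first reaches 50% (or 90%) of the assembly size. A minimal sketch of that calculation is given below. The function name `nx` and the toy lengths in the usage note are illustrative only and are not values from the potato assemblies.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx length statistic (N50 for fraction=0.5, N90 for 0.9).

    `lengths` is an iterable of contig or scaffold lengths. The result is the
    length L such that sequences of length >= L together contain at least
    `fraction` of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    target = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return ordered[-1]  # reached only through floating-point rounding
```

For a toy assembly with lengths `[100, 60, 25, 10, 5]` (total 200), `nx(lengths)` gives 100 and `nx(lengths, 0.9)` gives 25. A fragmented assembly therefore shows a small N50 even when its total size is close to the genome size.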