De novo genome assembly of two tomato ancestors, Solanum pimpinellifolium and Solanum lycopersicum var. cerasiforme, by long-read sequencing

Abstract The ancestral tomato species are known to possess genes that are valuable for improving traits in breeding. Here, we aimed to construct high-quality de novo genome assemblies of Solanum pimpinellifolium ‘LA1670’ and S. lycopersicum var. cerasiforme ‘LA1673’, originating from Peru. The Pacific Biosciences (PacBio) long-read sequences with 110× and 104× coverages were assembled and polished to generate 244 and 202 contigs spanning 808.8 Mbp for ‘LA1670’ and 804.5 Mbp for ‘LA1673’, respectively. After chromosome-level scaffolding with reference guiding, 14 scaffold sequences corresponding to 12 tomato chromosomes and 2 unassigned sequences were constructed. High-quality genome assemblies were confirmed using the Benchmarking Universal Single-Copy Orthologs and long terminal repeat assembly index. The protein-coding sequences were then predicted, and their transcriptomes were confirmed. The de novo assembled genomes of S. pimpinellifolium and S. lycopersicum var. cerasiforme were predicted to have 71,945 and 75,230 protein-coding genes, including 29,629 and 29,185 non-redundant genes, respectively, as supported by the transcriptome analysis results. The chromosome-level genome assemblies coupled with transcriptome data sets of the two accessions would be valuable for gaining insights into tomato domestication and understanding genome-scale breeding.


Introduction
Plant domestication and improvements enable the selection of genes and alleles that are favourable to humans, whereas numerous genes/ alleles that are unfavourable to humans are lost in the process. Some of these losses have proved undesirable for breeding. Cultivated tomato [Solanum lycopersicum var. lycopersicum (SLL)], which originates from South America, likely occurred in both Peru and Ecuador. 1 Cultivated tomatoes have become one of the most important vegetable crops worldwide; they are a rich source of vitamins, minerals, and fibre, as well as a dietary source of antioxidants, although some valuable genes (e.g. a subset of genes associated with biotic and abiotic stresses) were likely lost as an unwanted effect of domestication and improvements. [2][3][4][5] Due to its global nutritional and economical importance, whole-genome sequence (WGS) of a domesticated tomato accession 'Heinz 1706 (LA4345)' was determined in 2012, and it has been updated to date, serving as a valuable resource in both basic and applied research. 6,7 On the contrary, ancestral species close to cultivated accessions carry genetic potentials that exert favourable traits. For example, some accessions of S. pimpinellifolium (SP), known to be ancestral tomatoes, exhibit higher tolerance to salt stress and tomato yellow leaf curl virus. [8][9][10] Furthermore, S. lycopersicum var. cerasiforme (SLC), which originated from SP, is also expected to be an ancestor of domesticated tomato and possesses rich genetic loci contributing to increased fruit size. 11,12 Additionally, genome-wide association studies of 163 accessions within these Solanum species have revealed a large extent of morphological variations in their flowers and fruit and high intraspecific genome diversity and interspecific diversity. 1 Recently, tomato pan-genome analyses have revealed several presence/absence variations within both inter-and intra-genic regions, and they are likely associated with fruit and flower morphological variations and contribute to improve traits. 13,14 Thus, increasing available genome information of these ancestral species will facilitate the identification of unique or novel genes that have been lost during the breeding processes.
To obtain WGS information, 'LA1589' was used to obtain the first draft genome of SP, 6 and genome assembly and annotation of 'LA0480' were achieved using 101 bp paired-end (PE) short-read libraries and 5 mate-pair libraries. However, both data sets are highly fragmented and incomplete. 15 Recently, chromosome-scale genome sequence of 'LA2093' was determined using a long-read PacBio sequencer and was annotated to contain 35,761 protein-coding genes, of which 35,535 were validated by the published transcriptome data and/or homologs in the NCBI non-redundant protein database. 16 Furthermore, draft genome sequences of several SP and SLC accessions have been determined using a long-read Nanopore sequencer, and several structural variant hotspots have been identified. 14 However, chromosome-level genome assemblies, particularly those generated using long-read sequencers, are still limited and should be further improved. In this study, we newly constructed genome assemblies of two ancestral accessions, SP accession 'LA1670' and SLC accession 'LA1673', which originated in Tana and Lima, Peru, respectively. We first assembled long-read sequences generated using the PacBio Sequel system, and they were polished with Illumina short reads to generate chromosome-level scaffolds. Over 70,000 proteincoding sequences were annotated in both accessions and transcriptome datasets supported the mRNA expression for approximately 39% of the annotated genes. The newly constructed genome assemblies of tomato ancestors will be valuable to explore interspecific and intraspecific variations as well as to identify genes that could improve traits in breeding.

Processing data for the WGS of tomato
DNA from young leaves of 'LA1670' and 'LA1673' (Fig. 1 and Supplementary Fig. S1) was extracted from a single plant of each line using the Maxwell 16 Cell DNA Purification Kit (Promega, Madison, WI, USA) and used for Illumina sequencing. Sequencing data were obtained using Illumina HiSeq 2000 (Illumina, San Diego, CA, USA) with a 150 bp PE mode. Sequenced bases with a quality score of <30 and adaptor sequences were removed using TrimGalore (version 0.6.4) with option -length 90. The quality of the trimmed sequences was checked using FastQC. SMRT sequence libraries were constructed using the SMRTbell Express Template Prep Kit (PacBio, Menlo Park, CA, USA) and used to sequence on a PacBio Sequel system (PacBio, Menlo Park, CA, USA). The longread sequences were assembled, and the two haplotype sequences of the diploid genome were resolved using Falcon-unzip. Errors in the resultant assembled data were corrected twice by Arrow and polished with 47Â Illumina PE data using Pilon, which includes correcting bases, fixing misassemblies, and filling gaps. Chromosome-scale pseudomolecules were produced using reference-guided scaffolding software RaGOO. The pseudomolecule sequences of SP 'LA1670' (SPI_r1.0 pmol) and SLC 'LA1673' (SLYcer_r1.0 pmol) were named from SPI1.1ch00 to SPI1.1ch12 and from SLYcer1.1ch00 to SLYcer1.1ch12, respectively, according to the mapped chromosomes from SL4.0ch00 to SL4.0ch12 in the SLL 'Heinz' genome assemblies (SL4.0). The remaining unassigned sequences of SP and SLC were grouped into SPIUN and SLYcerUN, respectively (UN, unknown). The genome structure of 'LA1670' (SPI_r1.0 pmol) and 'LA1673' (SLYcer_r1.0 pmol) were compared against 'LA4345' (SL4.0, Heinz 1706) using D-GENIES with Minimap2 aligner, and resultant genome assemble of SP and SLC were named as SPI_r1.1 pmol and SLYcer_r1.1 pmol.
To estimate the genome size, k-mer counting analysis was performed using jellyfish software as described previously. 17 The software and related references used in this study are shown in Supplementary Table S1.

Assessing assembly completeness
We first assessed the completeness of 'LA1670' and 'LA1673' genome assemblies using the Benchmarking Universal Single-Copy Orthologs (BUSCO) database (v.4.1.1) in the genome mode to search for genes conserved in Embryophyta species. The embryophy-ta_odb10 data set, created on 20 November 2019, comprises 57 species and 425 genes. The genome completeness of SP and SLC was further analysed using long terminal region (LTR) assessing index (LAI), which is used to evaluate genome quality based on intergenic genome information using LTR retrotransposon (LTR-RT). 18 LTR-RT candidates were obtained using LTRharvest and LTR_FINDER_parallel and their accurate LTR-RTs were determined using LTR_retriever as described previously. 18 The accurate LTR-RTs were then used to evaluate assembly continuity of the input genome using the LAI with default parameters.

Gene annotation validation with RNA-Seq
Putative protein-coding genes in the genome sequences were predicted using MAKER pipeline, for which we used the RNA-Seq data and peptide sequences predicted from genomes of a tomato (SLL, ITAG4.0), potato (SolTub_3.0), and pepper (ASM51225v2). Two training sets from AUGUSTUS and SNAP were also used in the prediction. Sequences with an annotation edit distance score of >0.5 were selected as high-confidence genes.
The total RNA was extracted from 17 organ samples consisting of seven anther samples of different developmental stages (bud size: 3-4, 4-5, 5-6, 6-7, 7-8, and 8-9 mm, and anthesis), anthesis pollen, calyx, green fruit, red fruit, leaf, ovary, white petal of a flower 1-day before anthesis, petal of an anthesis flower, root, and stem from hydroponic-cultivated plants of 'LA1670' and 'LA1673'. The plants were grown in a greenhouse at the University of Tsukuba, Japan. All organs were sampled in the morning between December 2018 and January 2019, and RNA was extracted using RNeasy Mini Kit, and then purified using RNase-Free DNase Set (QIAGEN, USA). Genome-wide RNA expression levels were obtained using Novaseq-PE150 (Illumina) at a depth of 78 Gbp reads. The RNA-Seq data were first trimmed; the bases with a quality score of <30 and adapters were removed using TrimGalore with the option -length 90. We filtered ribosomal RNA (rRNA) reads by applying SortMeRNA. For Iso-Seq analysis, the total RNA was purified as described above and used for cDNA synthesis and amplification using the NEBNext Single Cell/Low Input cDNA Synthesis & Amplification Module and Iso-Seq Express Oligo Kit. A sequencing library was constructed using the SMRTbell Express Template Prep Kit 2.0 and was run on a PacBio Sequel system using the Sequel Binding Kit 3.0/Sequel Sequencing Kit 3.0 according to the manufacturer's protocol.

Estimating gene and isoform expression levels from RNA-Seq data
The cleaned reads, devoid of rRNA, were mapped against the assembled genomes and the transcription levels of each gene were calculated using STAR-RSEM pipeline. Genes with transcripts per million (TPM) values of >0 were regarded as expressed. After filtering genes whose expression levels were zero in all tissue samples, the expressed genes were counted using an in-house python script. The transcripts whose isoforms were identified by full-length Iso-Seq sequencing were also regarded as expressed genes.

Results and discussion
3.1 Overview of the two unimproved tomato assembly process  Tables S2 and S3). The obtained PacBio data of SP and SLC were then assembled into 244 and 201 primary contigs. To correct sequencing errors caused by PacBio sequencers, these contigs were polished using the 47Â coverage of cleaned Illumina PE reads with Pilon. The resultant error-corrected contig sequences were 808.8 Mbp with contig N50 of 10.12 Mbp for SP and 804.5 Mbp with N50 of 9.48 Mbp for SLC; these contig sets were named as SPI_r1.1, SLC_r1.1, respectively (Table 1). These assembly sizes were almost equivalent to 786.2 and 770.4 Mbp of genome sizes estimated using the k-mer counting analyses (k-mer ¼ 21; Supplementary Fig. S2) and were comparable size of the SLL reference 'LA4345' (SL4.0; 782,520,133).

Comparison of genome assemblies
Reference-guided scaffolding of the error-corrected contigs was performed using RaGOO combined with the SLL (SL4.0), which included the tomato 12 chromosomes (ch01-ch12) and unanchored genome sequences designated ch00 that are not mapped to any of the 12 chromosomes. Polished contigs were ordered and orientated using the SL4.0 reference. Then, 242 and 200 contigs of SP and SLC, respectively, were mapped to the 12 tomato chromosomes (SL4.0ch01-ch12; Table 2). Furthermore, two contigs consisting of 1,449,357 bp from SP and 61,018 bp from SLC were non-localized to the corresponding reference genome and assigned as 'unknown' (SPIUN and SLYcerUN). These unassigned 'unknown' contigs were considered as genomic segments specifically present in SP and SLC. Finally, RaGOO analysis resulted in the total scaffold lengths of 808.8 and 804.5 Mbp, and average scaffold length of 57.8 and 57.5 Mbp, respectively, for SP and SLC (Table 3). These chromosomelevel assemblies of SP and SLY were named SPI_r1.1 pmol and SLYcer_r1.1 pmol, respectively ( Table 3). The dot plot analysis using D-GENIES between SLL (SL4.0) and both SP and SLC revealed significant consistency for genome organization, while identifying potential structural variants (i.e. indels and inversions) between these species (Fig. 3).

Estimation of assembly completeness
The completeness of the genome assembly for all scaffolds and nonlocalized contigs were quantified using the BUSCO database. We used the embryophyta_odb10 data set containing 1,624 core genes. Scaffolds and non-localized contigs of SP contained a higher number of benchmarking genes (96.2%), and a few of those (1.4%) were duplicated as isoform genes according to our BUSCO analyses (Table 4). In contrast, 96.5% of the expected Embryophyta genes were detected as complete BUSCOs in the scaffolded genome sequence of SLC. The benchmarking gene preservation of SP and SLC was almost equivalent to that of SLL (SL4.0; 97.5%). 19 The completeness of these genome assemblies and that of SLL (SL4.0) were further validated using LAI, which is used to evaluate assembly continuity and quality of intergenic and repetitive sequences by counting the proportion of intact LTR-RTs in the assembled genome. A higher LAI score is associated with a higher assembly quality of repetitive and intergenic sequence spaces, because of the likely identification of a higher number of intact LTR-RTs. 18 The overall distribution of LAI scores was similar among three accessions ( Supplementary Fig. S3), but the whole genome LAI score of SLL SL4.0 was higher (LAI ¼ 10.37) than that of SLL SL2.4 (LAI ¼ 8.0), 19 which were sequenced by short reads. Interestingly, SP and SLC presented relatively higher LAI scores, 14.18 and 13.10, respectively. Wild tomato species most likely have a higher number of identifiable intact LTRs and/or longer internal regions than cultivated species, as the LAI score of other SP (LA2079; LAI ¼ 13.14) and S. pennellii (LAI ¼ 14.8) is higher than that of SLL. 16,18 According to these quality assessment results, the SP and SLC genomes were classified to be of reference quality based on the assembly of repetitive and intergenic sequence spaces (draft quality, LAI < 10; reference quality, 10 < LAI < 20; and gold quality LAI > 20). 19 We then analysed repetitive and transposal sequences in SP (SPI_r1.1 pmol) and SLC (SLYcer_r1.1 pmol). The repetitive  In the Chromosome column, each chromosome name (from Ch00 to Ch12) refers from SPI00 to SPI12 in S. pimpinellifolium 'LA1670' and from SLYcer00 to SLYcer12 in S. lycopersicum var. cerasiforme 'LA1673' which assigned to from Solyc00 to Solyc12 of the reference genome of S. lycopersicum var.     Table S4). The dominant types in the pseudomolecule sequences were LTR-RT (409.7 Mb for SP and 402.0 Mb for SLC), followed by DNA transposons. Specific repeat sequences that were currently unavailable in public databases were 54.3 Mb for SP and 55.3 Mb for SLC. These results indicated that both SP and SLC contain a similar proportion of repetitive and transposal sequences.
Next, putative protein-coding genes in the constructed SP (808.8 Mbp) and SLC (804.5 Mbp) genome assemblies were predicted using MAKER pipeline. As a result, 71,945 and 75,230 potential coding sequences were predicted, respectively ( Table 2). We then analysed the number of genes commonly or uniquely present in the reference SLL (ITAG4.0), SP, and SLC. Among 34,075 SLL genes, 32,275 (27,201 þ 5,074) and 27,760 (27,201 þ 559) homologous genes existed in SP and SLC, respectively (Fig. 4). Furthermore, 27,201 SLL genes (79.8%) were present in these three genotypes, with 40,245 and 30,487 genes in SP and SLC, respectively. The fact that 94.7% (32,275/34,075) and 81.5% (27,760/34,075) of SLL genes were present in SP and SLC, respectively, suggests that the genome of 'LA1670' is more closely related to 'LA4345' than to 'LA1673'.

Expression gene analyses
Together with the Illumina RNA-Seq and Iso-Seq reads, 29,629 and 29,185 non-redundant protein-coding sequences for SP and SLC were supported by transcriptome datasets, respectively (Table 5). We obtained RNA-Seq Illumina reads derived from 17 samples of each line to validate gene prediction. All raw reads were trimmed using TrimGalore, which generated 79.7 and 78.1 Gb of high-quality data with a high mapping rate (>96%) from 112.4 and 110.8 Gb of raw data in 'LA1670' and 'LA1673', respectively (Supplementary Tables  S5 and S6). After removing rRNA contamination using SortMeRNA, the cleaned reads were mapped to each of the corresponding genome assemblies of SP (SPI_r1.1 pmol) and SLC (SLYcer_r1.1 pmol) using the STAR-RSEM pipeline, resulting in a mapping rate of 97.0% for SP and 97.8% for SLC. We regarded genes with a TPM value of >0 as expressed genes and identified 29,343 (SP) and 29,075 (SLC) expressed genes in the 17 samples. The protein-coding sequences were further validated by Iso-Seq to generate full-length cDNA sequences or isoforms. The Iso-Seq produced 38,569 and 27,302 isoforms with an average length of 2,657 and 2,428 bp from SP pollen and SLC mature anther, respectively (Supplementary Table S7). The isoform average size was approximately 12 kb with N50 of 2.6-2.8 kb. These isoforms were also mapped on the constructed genome assemblies using Minimap2, resulting in a high mapping rate of 99.98% for SP and 99.88% for SLC. Among the 71,945 SP and 75,230 SLC protein-coding sequences predicted from genome information, 8,296 and 4,135 of them were supported by Iso-Seq reads, respectively, and they were also regarded as expressed genes.
In conclusion, in this study, we constructed novel genome assemblies for two Solanum accessions, SP and SLC, using the PacBio long-read sequencer, and predicted their protein-coding sequences, 39% of which were supported by transcriptome datasets ( Table 5). The genome size of SP and SLC was 808.8 and 804.5 Mbp, respectively (Table 3), which were almost equivalent to the genome sizes estimated using the k-mer counting analysis (Supplementary Fig. S2). Furthermore, using reference-guided scaffolding, we constructed 12 pseudomolecules at the chromosome level with the average scaffold  Total number of expressed genes for SP was calculated as non-redundant genes that were supported by 29,343 illumina RNA-Seq reads and 8,296 Iso-Seq reads. Total number of expressed genes for SLC was calculated as non-redundant genes that were supported by 29,075 illumina RNA-Seq reads and 4,135 Iso-Seq reads. a Gene, the number of genes annotated on each chromosome. b (%), the proportion of gene numbers to the total gene number. length of 57.8 and 57.5 Mbp (Fig. 3, Table 3), which were considerably higher than that of the reference generated by short-read sequences (scaffold average length ¼ $5 kbp). 6,15 Besides, both SP and SLC genome assemblies presented high BUSCO completeness values (Table 4, 96.2% and 96.5%, respectively), and the LAI values were equivalent to another high-quality genomes, indicating the effectiveness of the long-read sequencing-based de novo assemble strategy.
SP has substantial intraspecific genome diversity because of the rich geographic and climatic varieties in their originated and migrated areas (e.g. Ecuador and Peru), [20][21][22] whereas SLC shares SP genome structure as the most recent common ancestor with the cultivated SLL. 23 Thus, our newly constructed genome assemblies will facilitate dissection of the genetic and molecular aspects of tomato domestication and evolution. For instance, SP 'LA1670' is known to show heat susceptibility, 24 whereas SLC 'LA1673' shows salinity tolerance and accumulates higher levels of sugars, organic acids, and volatiles associated with tomato flavour than cultivated varieties. 25,26 Moreover, in this study, we identified 6,452 and 10,241 unique genes from SP and SLC, respectively, which are not present in SLL (SL4.0; Fig. 4). Genome and gene information obtained in this study will be beneficial for dissecting these traits for breeding; it can also be used to explore gene function of these ancestral-specific genes.