Genome sequence of a rice pest, the white-backed planthopper (Sogatella furcifera)

Abstract Background: Sogatella furcifera is an important phloem sap-sucking and plant virus-transmitting migratory insect of rice. Because of its high reproductive potential, dispersal capability and transmission of plant viral diseases, S. furcifera causes considerable damage to rice grain production and has great economical and agricultural impacts. Comprehensive studies into ecological aspects and virus–host interactions of S. furcifera have been limited because of the lack of a well-assembled genome sequence. Findings: A total of 241.3 Gb of raw reads from the whole genome of S. furcifera were generated by Illumina sequencing using different combinations of mate-pair and paired-end libraries from 17 insert libraries ranging between 180 bp and 40 kbp. The final genome assembly (0.72 Gb), with average N50 contig size of 70.7 kb and scaffold N50 of 1.18 Mb, covers 98.6 % of the estimated genome size of S. furcifera. Genome annotation, assisted by eight different developmental stages (embryos, 1st-5th instar nymphs, 5-day-old adults and 10-day-old adults), generated 21 254 protein-coding genes, which captured 99.59 % (247/248) of core CEGMA genes and 91.7 % (2453/2675) of BUSCO genes. Conclusions: We report the first assembled and annotated whole genome sequence and transcriptome of S. furcifera. The assembled draft genome of S. furcifera will be a valuable resource for ecological and virus–host interaction studies of this pest.


Data description
The white-backed planthopper, Sogatella furcifera (Horvath), an rstrategy Hemiptera insect species, primarily feeds on rice plants and can migrate over long distances in the temperate and tropical regions of Asia and Australia [1]. The sucking of plant sap by S. furcifera reduces plant vigor, delays tillering and causes stunting, chlorosis, shriveling grains, and hopper burn, ultimately leading to rice plant death; it is responsible for the destruction of approximately 10 million hectares of rice crops annually [2]. More importantly, the planthopper transmits devastating rice viruses, including the Southern rice black-streaked dwarf virus (SRBSDV), which poses an additional threat to rice plants [3]. Insecticide misuse, cultivation of hybrid high-nutritional rice varieties, cultural and climatic factors, long-distance migration capability and robust fecundity of S. furcifera, together result in outbreaks of S. furcifera [2,4].

Samples and sequencing
Inbred laboratory strains of Sogatella furcifera (Fig. 1) originated from the University of Science and Technology of China. A laboratory colony was maintained at 26 • C with 70 % humidity under a 16:8 h light/dark photoperiod for two years, spanning at least 20 generations. An inbreeding line was obtained by single pair sib-mating for 18 generations. For genome sequencing, S. furcifera specimens from the F6 generation were used. Seventeen DNA libraries with insert sizes ranging between 180 bp and 40 kb were constructed to perform whole genome shotgun (WGS) sequencing following the standard protocol (Illumina, San Diego, CA, USA). This generated 241.3 Gb raw sequence reads with coverage of approximately 330× (Table 1).
For whole transcriptome sequencing, total RNA was prepared from S. furcifella specimens at different developmental stages. Briefly, eight different developmental stages of S. furcifera were washed three times with 95 % ethanol to reduce microbial contamination from the body surface. After ethanol volatilization, each sample was quickly ground into a fine powder in liquid  (Table S1). For RNA-seq libraries with an insert size of 300 bp, low quality bases (Phred score< 30) were trimmed and duplicated reads were removed.

Evaluation of genome size
Two different methods were used to estimate the genome size of S. furcifera. Firstly, flow cytometry was used according to a previous method [5] to estimate a genome size of about 730 Mb. Briefly, the heads of five male or female S. furcifera and five male Drosophila melanogaster (W118) specimens were ground in a tube with pestle in 200 μL labeling solution, because the genome size of Drosophila melanogaster has been clearly determined [6]. The mixture was filtered with a 70 μm filter(BD Falcon),then treated with RNaseA for 10 min at 37 • C and stained with 5 μg/mL propidium iodide (PI) for 2 h on ice. About 10 000 cell particles were analyzed on a flow cytometer (BD Falcon). The fluorescence value of PI was analyzed using Flowjo [7]. The mean C value (0.184 pg) and genome size (180 Mb) was calculated based on the internal D. melanogaster control. As a result, the genome size of a male S. furcifera is about 733 Mb or 0.75 pg (Fig. S1A). Experiments were conducted in triplicate. In addition, genome size was estimated based on the k-mer approach (k-mers, with k = 17) using Jellyfish [8]. In this study, approximately 56 Gb of the short-insert library sequencing data was used to generate a 17-mer depth-frequency curve (Fig. S1B). The S. furcifera genome size was estimated to be 735 Mb (Table S2) with 0.38 % heterozygosity as calculated by mlRho [3].

Genome assembly and evaluation
Adapter sequences, low-quality and duplicated reads were filtered out prior to read assembly; error correction was also performed to eliminate sequencing errors. For whole-genome assembly, short reads were first assembled using SOAPdenovo2 [9], longer reads were further used for scaffolding using SSPACE [10], and finally GapCloser [11] was used to fill the gaps with short reads. Briefly, sequences derived from the short-insert libraries were decomposed into k-mers to construct the de Bruijn graph, which was simplified to allow remaining k-mers to be joined as contigs. All short-insert and large-insert libraries were mapped onto contigs for scaffold building by utilizing pairedend and mate-pair information. Paired-end and mate-pair information was subsequently applied to link contigs into scaffolds using a step-wise approach (from small to large-insert libraries). Finally, intra-scaffold gaps were filled using short-insert libraries in which one read uniquely mapped to a contig, and the other member of the pair located to a gap region. The resulting genome assembly size was 720 Mb, representing 98.6 % of the estimated genome,with a final scaffold N50 length of 1 185 287 bp and a contig N50 length of 70 730 bp. The longest contig and scaffold were 799 kb and 12.7 Mb, respectively ( Table 2). In addition, reads which were mapped onto the mitochondria and Wolbachia symbiont genome sequence were extracted and assembled, and the assembled genome sizes were 16 Kb and 1.6 Mb, respectively. Four independent measures were used to assess the accuracy of the assembly. First, reads from paired-end libraries were mapped onto the assembly, and a greater than 10-fold effective depth was obtained across 95.55 % of the draft genome (Table S3). Second, when assembled transcripts derived from mRNA, expressed sequence tags (ESTs), and RNA-seq data from multiple developmental stages (Table S4) were mapped to the genome assembly, this revealed a transcript coverage rate of ∼98 %, suggesting that the genome was sufficiently complete for gene prediction and analysis (Tables S5-S7). Third, a core eukaryotic genes (CEG) mapping approach (CEGMA) dataset comprising 248 CEGs [12] was used to evaluate the completeness of the draft: 94.8 % (235/248) of genes were completely covered by the assembly, and 99.6 % (247/248)of genes were at least partially covered by the assembly genome (Table S8). The 5 % (13/248) incomplete genes, including three transcription factors, one ubiquitin-conjugating enzyme E2, and one mannose-6-phosphate isomerase, exhibited no obvious functional bias. Fourth, a benchmarking universal single-copy orthologs (BUSCO) dataset was used to evaluate the completeness of the draft: 91.7 % (2453/2675) of genes were covered by the assembly genome (Table S9), which was higher than 81.34% of Nilaparvata lugens and 81.71% of Acyrthosiphon pisum, respectively, because S. furcifera, N. lugens and A. pisum all belong to the Hemiptera order and were used for comparison.

Genome characterization and repeat annotation
The S. furcifera genome has a G+C content of 31.6 %; compared to other hemipteran species, this is slightly higher than that of the pea aphid A. pisum (29.6 %) [13] but lower than that of the brown planthopper N. lugens (34.6 %) [14]. Homology-based and de novo prediction analyses were both used to identify the transposable elements (TEs) and other repetitive content in the S. furcifera genome. For homology-based analysis, Repbase (version 20120418) [15] was used to perform a TE search with Repeat-Masker (3.3.0) [16] and the WuBlast [17] search engine. For de novo prediction analysis, RepeatModeler [18] was used to construct a TE library. Elements within the library were then classified using a homology search with Repbase and a Support Vector Machine (SVM) method (TEClass) [19]. A total of 44.3 % of the S. furcifera genome consists of tandem repeats and TEs (Table 3); lower than the brown planthopper (48.6 %) but much higher than the pea aphid (33.3 %). Class I TEs (retroelements) represent 15.3 % of the total genome assembly (9.52 % long interspersed nuclear elements (LINEs), 4.30 % long terminal repeats (LTRs) and 1.48 % short interspersed nuclear elements (SINEs)), whereas class II TEs (DNA transposons) account for 17.33 %. In addition, the S. furcifera genome was annotated with seven centromeric sequences [20], eight telomeric sequences [20] (Table S10) and 169 932 microsatellite regions respectively.

Annotation of coding and non-coding genes
The S. furcifera genome was annotated by combining homologybased methods referring to protein sequences from seven representative insects, ab initio gene prediction (GENSCAN [21] and AUGUSTUS [22]), and RNA-seq data from eight different developmental stages (embryos, 1 st -5 th instar nymphs, 5-day-old adults and 10-day-old adults) (Table S1). For homology-based gene prediction, we aligned Bombyx mori, D. melanogaster, Apis mellifera, A. pisum, Rhodnius prolixus, Tribolium castaneum and Pediculus humanus proteins (from the Ensembl database [23]) to the S. furcifera genome using TblastN [24] with an E-value ≤ 1E −5 , and then used GeneWise2.2.0 [25] for spliced alignment and prediction of gene structures. Secondly, for ab initio prediction, GENSCAN [21] and AUGUSTUS [22] were used to predict genes based on repeat-masked genome sequences. Short genes of <150 bp coding DNA sequences (CDS) were filtered from the resultant data sets. Thirdly, gene structure was identified using a transcriptome-based approach by mapping all RNA reads of the eight different developmental stages onto the S. furcifera genome using TopHat [26]. Mapping results were subsequently sorted and merged, and Cufflinks [27] was used to identify gene structures for gene annotation (Table 4). Finally, all predicted gene structures were integrated with EVidenceModeler (EVM) [28] to yield a consensus gene set containing 21 254 protein-encoding genes (Table 4), and an estimated 88 184 splice junctions [26]. Based on OrthoMCL analysis [29], the 21 254 protein-encoding genes can be assigned into 9096 single family genes, and 2963 families with 12 158 genes, indicating that S. furcifera has 57% duplicated genes. To designate gene names to each predicted protein-encoding locus, gene function information, protein motifs and domains were assigned through comparison with public databases, including Swiss-Prot [30], the Kyoto Encyclopedia of Genes and Genomes (KEGG) [31], Gene Ontology (GO) [32], TrEMBL [33] and InterPro [34]. A BLASTP [35] search of those proteomes was performed against the SwissProt [30] and TrEMBL database, with an E-value ≤ 1E −5 . KEGG annotation was based on the KEGG Automatic Annotation Server (KAAS) [36] and GO annotation analysis was based on InterPro [34]. The 21 254 protein-encoding genes had 12 699 hits in InterPro, of which 8633 also had GO associations. In total, 59.7 %, 40.61 %, 31.26 %, 52.23 % and 68.47 % of the protein-coding gene can be assigned to known homologs in InterPro [34], GO [32], KEGG [31], SwissProt [30] and TrEMBL databases [33], respectively (Table 5). In combination, 14 990 (70.52 %) were similar to proteins from known databases. Or-thoMCL analysis [29] revealed that 59.55 % and 63 % of S. furcifera proteins have homologs in the brown planthopper and pea aphid, respectively.
By performing a homology search across the whole genome sequence, four types of non-coding RNAs (ncRNAs) were annotated in our analysis: miRNA, tRNA, rRNA, and snRNA. tRNAscan-SE [37] was used to predict tRNA. snRNAs were predicted by alignment using BlastN and using INFERNAL (v.0.81) [38] to search against the Rfam database [39]. rRNAs were found by BlastN alignment against other insects' rRNA as reference sequences. In total, 172 rRNAs were identified and assigned into four rRNA families (Table S11): 2256 tRNAs (including all 20 tRNA genes used for decoding standard amino acids) (Tables S11 and S12), 176 snRNAs (Table S11) comprising 37 H/ACA box RNAs and 139 snRNAs involved in splicing, as well as 382 identified miR-NAs, were assigned into 110 families [40].

Transcriptome analyses of ontogenetic development of S. furcifera
Low-quality bases were trimmed from RNA-seq data using a PERL script and adapters were removed using Cutadapt (version 1.3) [41]. The remaining reads were mapped against the reference sequences of rRNA and Southern rice blackstreaked dwarf virus (SRBSDV) by Bowtie2 [42] using default parameters, and reads matching rRNA or SRBSDV were discarded. Filtered reads were mapped to the assembled S. furcifera genome with STAR [43] with the parameters: -runThreadN 8, -outFilterMultimapNmax 20, -outFilterMismatchNmax 4, -outFilterIntronMotifs: RemoveNoncanonical. Cuffdiff2 [44] was subsequently employed to calculate normalized fragments per kilobase of exon per million fragments mapped (FPKMs) based on the aligned BAM files of all RNA-seq libraries, in which the original FPKMs of each library were scaled by a library size factor computed as the median of ratios to the geometric means of original FPKMs across all libraries. Differentially expressed genes (DEGs, P-value<0.05 and -log 2 (fold change)->1) in at least one comparison between any two conditions were identified with Cuffdiff2 [44] using the blind mode suited for a single replicate in each condition. k-means clustering was then performed on the gene expression profiles of DEGs with k = 8 using Gene Cluster 3.0 [45], and differential expression patterns were visualized using Java TreeView [46].
To obtain a global overview of S. furcifera development, the transcriptomes of eight different developmental stages, including embryo (emb), 1 st -5 th instar nymphs (1in, 2in, 3in, 4in, and 5in), 5-day-old adult (5d) and 10-day-old adult (10d), were investigated using RNA-seq, with pairwise comparisons between each pair of developmental stages. A total of 4166 genes were identified as significant DEGs. Analysis of the number of genes reaching their maximum expression levels at each developmental stage showed that gene expression levels in the embryo and 10-day-old adult differed significantly from those in the other stages. GO enrichment analysis using the R package phyper [47] for all DEGs across all development stages demonstrated that these DEGs are implicated in diverse biological processes, including chitin metabolic processes, proteolysis, oxidationreduction processes, homophilic cell adhesion and microtubulebased processes (Fig. 2).

Expression profile clustering and expression pattern identification
k-means clustering on the 4166 DEGs revealed 8 different expression patterns (Fig. 3, Figs. S1-S8 and Table S13).
(i) A total of 864 genes were highly expressed in the embryo and at the 1 st instar nymph stage, with slightly higher expression levels in the embryo than in the 1 st instar nymph and down-regulation in later stages. GO enrichment analysis [47] associated genes in expression pattern 1 with the G protein-coupled receptor signaling pathway, ion transport, multicellular organismal development, and the Wnt receptor signaling pathway, among others. Wnt genes are important in embryogenesis and cell differentiation in both vertebrates and insects [48]. Wnt-4 (Sfur-175.24), Wnt-10a (Sfur-90.42) and Wnt-16 (Sfur-884.2) were clustered in pattern 1 and exhibited highest expressions during the embryo stage, suggesting that the three Wnt genes are involved in the embryo development of the pest. (ii) A total of 134 genes were classified into expression pattern 2. These genes were specifically expressed in nymph instar stages 1-5. GO enrichment analysis associated these genes with catecholamine biosynthetic processes and aromatic amino acid family metabolic processes. Catecholamine is required for insect cuticle sclerotization; catecholamine conjugates are sequestered in the hemolymph during nymphal feeding periods for later use as tanning-agent precursors in cockroaches [49]. Tyrosine 3-monooxygenase catalyzes the initial and rate-limiting step in catecholamine biosynthesis [49]. Two tyrosine 3monooxygenase (Sfur-187.36 and Sfur-7.15), belonging to expression pattern 2, were specifically expressed in the nymph stages, suggesting that these might be responsible for catecholamine biosynthesis and have roles in the later cuticle sclerotization process. (iii) A total of 525 genes exhibited specifically low expression in embryos but high expression in the post-embryonic stages. These genes have been implicated in processes such as metabolism, oxidation-reduction, and transmembrane transport. Several genes involved in fatty acid synthesis, including fatty acid synthase (Sfur-509.5, Sfur-698.4 and Sfur-72.518) and long-chain-fatty-acid-CoA ligase (Sfur-188.12 and Sfur-215.3), belonged to pattern 3 and were highly expressed in the nymph and adult stages. (iv) A total of 591 genes exhibited relatively higher expression levels from the 2 nd to the 5 th instar nymph, but were slightly down-regulated in the 4 th instar nymph. These genes are enriched in nucleosome assembly, reciprocal meiotic recombination, DNA catabolism, and chitin metabolic processes. (v) A total of 523 genes were specifically expressed in the 3 rd and 4 th instar nymphs, including genes participating in microtubule-based movement, glycolysis, glycerol metabolic processes, and positive regulation of apoptotic processes. Glycolysis can be suppressed during molting to direct a feeding, growing larva to convert to the immobile non-feeding stage [50]. Several genes participating in glycolysis, including fructose-bisphosphate aldolase (Sfur-330.7) and pyruvate kinase (Sfur-409.10 and Sfur-78.49), belonged to pattern 5; these were highly expressed in the 3 rd and 4 th instar nymphs and down-regulated in the 5 th instar nymph. (vi) A total of 218 genes were highly expressed from the 3 rd instar nymph to the 5-day-old adult, but the expression levels were higher in the 3 rd and 4 th instar nymphs than in the 5 th instar nymphs and 5-day-old adults. These genes are related to processes including protein metabolism, microtubule-based processes, and negative regulation of biosynthesis. (vii) A total of 609 genes were highly expressed from the embryo to the 5 th instar nymph, then down-regulated in adults. These genes were enriched in biological processes including chitin metabolism and catabolism, alcohol metabolism, steroid hormone mediated signaling pathways, and ecdysis, chitin-based cuticles. Many genes belonging to pattern 7 were involved in chitin metabolic processes and were highly expressed from embryo to the 5 th instar nymphs; for example,endochitinase (Sfur-47.4), chitinase (Sfur-18.277, Sfur-203.28 and Sfur-453.6), chitotriosidase (Sfur-97.70), and peritrophin (Sfur-105.51, Sfur-84.51 and Sfur-203.26). Chitin is the main component of insect exoskeleton and peritrophic matrix. Chitin metabolism is coupled with insect growth and development. Because the exoskeleton and peritrophic matrix are regularly replaced and renewed during insect growth, chitin metabolic process should be active throughout the embryo and nymph stages until growing and melting stages are completed. The crucial role of chitin in insect development and survival has led to chitin-related genes being targeted for the development of pest control strategies [51]. (viii) A total of 702 genes were specifically expressed in the 5day-old and 10-day-old adults. These genes are involved in processes such as lipid transport, DNA repair, cell-cell signaling, and microtubule-based movement. Vitellogenin, a member of the lipid transport protein family, is a precursor of egg yolk, which is specifically expressed in females and provides the essential nutrients required for egg development [52]. Vitellogenin (Sfur-20.301, Sfur-20.304, Sfur-3160.1, Sfur-496.9, Sfur-15.299 and Sfur-20.302) belonged to pattern 8 and was highly expressed in 5-day-old and