Genome sequencing of the sweetpotato whitefly Bemisia tabaci MED/Q

Abstract The sweetpotato whitefly Bemisia tabaci is a highly destructive agricultural and ornamental crop pest. It damages host plants through both phloem feeding and vectoring plant pathogens. Introductions of B. tabaci are difficult to quarantine and eradicate because of its high reproductive rates, broad host plant range, and insecticide resistance. A total of 791 Gb of raw DNA sequence from whole genome shotgun sequencing, and 13 BAC pooling libraries were generated by Illumina sequencing using different combinations of mate-pair and pair-end libraries. Assembly gave a final genome with a scaffold N50 of 437 kb, and a total length of 658 Mb. Annotation of repetitive elements and coding regions resulted in 265.0 Mb TEs (40.3%) and 20 786 protein-coding genes with putative gene family expansions, respectively. Phylogenetic analysis based on orthologs across 14 arthropod taxa suggested that MED/Q is clustered into a hemipteran clade containing A. pisum and is a sister lineage to a clade containing both R. prolixus and N. lugens. Genome completeness, as estimated using the CEGMA and Benchmarking Universal Single-Copy Orthologs pipelines, reached 96% and 79%. These MED/Q genomic resources lay a foundation for future ‘pan-genomic’ comparisons of invasive vs. noninvasive, invasive vs. invasive, and native vs. exotic Bemisia, which, in return, will open up new avenues of investigation into whitefly biology, evolution, and management.

Individual BAC pools were assembled independently using SOAPdenovo and the whole genome shotgun reads from PE and MP libraries were used to fill gaps in the BAC scaffolds. After sequencing, the raw reads were filtered as described above. In addition, reads representing contamination by E. coli or the plasmid vector were filtered. The pooled reads were separated according to the BAC-reads index, and each BAC was assembled using a combination of "hierarchical assembly" and "de Bruijn graph assembly". First, the reads linked to each BAC were assembled using SOAPdenovo [14], with various combinations of parameters with a K-mer range from 27 to 63 and a step size of 6. The assembly with the longest scaffold N50 was defined as the "best" for each BAC. The resulting BACs were mapped with the large shotgun MP read data to optimize the assembly for each BAC.
The final draft assembly was produced by integrating sequences that overlapped among the scaffolds independently assembled from genome shotgun and BAC reads, and in doing so eliminated the redundant scaffolds using the following steps. In order to integrate the two assemblies, the software Rabbit [15] was applied to identify any relationship between scaffolds, to connect the overlapping regions that shared at least 90% similarity, and to remove redundancy based on a 17-mer frequency. Finally, SSPACE [16] was used to construct super-scaffolds containing 800 bp-40 kb WGS reads, and the 170-800 bp genome shotgun read data were used to fill the gaps using GapCloser [14]. Post-assembly processing included removal of contaminating bacterial and viral DNA sequences, by aligning all assembled sequences to the genome sequences of viruses and bacteria, obtained from previous local BLASTn alignments and by NCBI upload filter. Aligned sequences that shared >90% identity and were >200 bp in size were filtered from the final assembly. The assembled sequences that were covered by at least one EST sequence were retained. Process read data was mapped the the draft MED/Q genome using SOAPaligner software and read counts were made from .bam files and the average depth was computed from all bases in the window. The relation graph of base pair percentages, and each given sequencing depth along the genome, was obtained.
Using genomic DNA from the MED/Q colony, a total of 20 whole genome sequence (WGS) shotgun sequencing libraries were generated (18 pooled male and female PE and MP libraries, and two haploid-male derived WGA PE libraries), from which sequences were generated on an Illumina Hiseq2500 platform. Library sequencing produced a total of 428.2 Gb or an approximate 594.7-fold genome coverage assuming a 0.72 Gbp genome size (based on 17-mer analysis). For the 10 short-insert PE libraries, there were a total of 229.4 gigabases (Gb) (100 bp or 150 bp read length, approximately 318.6-fold genome coverage). Sequencing the eight large-insert (>1 kb) MP libraries produced 80.3 Gb of reads (49 bp read length, 111.5-fold coverage) for use in scaffold construction (S1 Table). The two male WGA libraries produced a total of 118.5 gigabases (Gb) of data (S1 Table) or approximately 164.6fold genome coverage Sequencing of 13 BAC pools generated 362.6 Gbp of raw data (288.4 Gbp processed data; results not shown). The subsequent assembly of this sequence data using our pipeline (S1 Fig) generated a 658 Mbp draft genome assembly for MED/Q consistent with recent flow cytometry estimates [17]. The mean read depth across 10 kb windows indicated that all genome regions were highly represented within the read data, with < 1.5% having a depth of < 10X (remaining data not shown).

Annotation of repetitive elements
Repetitive elements were searched for and identified using Repbase [18] implemented in TRF software [19], and a de novo approach implemented in Piler [20]. For the Repbase-based method, two software programs named RepeatMasker [21] and RepeatProteinMask were used to identify repetitive sequences. In the de novo approach, Piler-DF-1.0 [20], RepeatScout-1.0. 5 [22], and LTR-FINDER-1.0. 5 [23] were used to build de novo repeat libraries from the genome sequences. Finally, the repeated sequences were searched for and classified using the RepeatMasker software. Homology-based annotation of MED/Q repetitive elements was queried against Repbase v.20.05 [18] with RepeatMasker [21]. We found a total of 265.0 Mb TEs, or 40.3% of the MED/Q genome size. This was about 10% higher than the repeat contents of Acyrthosiphon pisum and Rhodnius prolixus, but similar with than that of Nilaparvata lugens (39.8%) (S2 Table). This suggests that long terminal repeat (LTR) (18.5%) are more abundant and contain more nucleotides than all other TE classes. This proliferation of LTR retrotransposons has only been found in one other Hemipteran genome, that of N. lugens (12.29%). The MED/Q genome also contains the high proportion of the DNA-transposon TEs (12.92%) found in other fully-described Hemipteran genomes. As with both N. lugens (0.5%) and R. prolixus (0.01%), the MED/Q genome also appears devoid of short interspersed nuclear elements (SINEs; 0.96%). These other Hemipteran genomes also contain a small amount of long interspersed nuclear elements (LINEs; A. pisum: 2.6%; MED/Q: 3.18%; R. prolixus: 3.2%), but N. lugens (12.84%). This suggests that MED/Q-specific TEs, especially the LTRs, have evolved relatively recently and contribute to the large number of gene sets.

Annotation of coding regions
Initial evaluation of gene coverage rate in the draft MED/Q genome assembly was assessed by comparing against 248 core eukaryotic genes were obtained using CEGMA 2. 4 3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 Preliminary evaluation of transcribed regions within the draft MED/Q genome assembly coverage found that ~95.2% of B. tabaci ESTs > 200 bp were present, with 90,652 ESTs showing ≥ 90% length coverage on one scaffold (S7 Table). This alignment encompassed 92.9% of nucleotides within the EST dataset. Analogously, 229 (96%) of the 248 sequences in the CEGMA gene set and 79% complete and fragmented BUSCOs were present in the MED/Q genome assembly (remaining data not shown). The final GLEAN gene models predicted a reference gene set of 20,786 protein-coding genes, a consensus result derived from de novo, orthology, and evidence (RNA-seq)-based prediction methods (S3 Table) and integrated into GLEAN gene models (S4 Table). Among the GLEAN gene models, 16,622 (79.97%) received functional gene annotations using the various databases queried in our analysis pipeline (S5 Table). Tetranychus urticae (C. L. Koch, 1836) (O. Arachnida, Tetranychidae), were used to predict orthologs and to reconstruct the phylogenetic tree. Gene families were identified using TreeFam [38,39], and single-copy gene families were assembled to reconstruct phylogenetic relationships. i) Coding sequences of each single-copy family were concatenated to form one super gene group for each species. ii) All of the nucleotides at codon position 2 of these concatenated genes were extracted to construct the phylogenetic tree by PhyML [40], with a gamma distribution across sites and an HKY85 substitution model. iii) The same set of sequences at codon position 2 was used to estimate divergence times among lineages. iv) The fossil calibrations were set with two previous node data [41,42]. v) The PAML mcmctree program (v.4.5) [43,44] was used to compute split times using the approximate likelihood calculation algorithm. The software Tracer (v.1.5.0) (http://beast.bio.ed.ac.uk/software/tracer/) was utilized to examine the extent of convergence for two independent runs.

Prediction of gene orthology
Phylogenetic analysis based on orthologs across 14 arthropod taxa (S6 Table) suggested that MED/Q is clustered into a hemipteran clade containing A. pisum, and is a sister lineage to a clade containing both R. prolixus and N. lugens (Fig 1A). The range of species-specific genes within the four hemipteran genomes ranged from 38-60%, with higher values for the three phloem-feeding specialists. This led us to investigate interspecific changes in the number and diversity of gene family members (orthologs and paralogs) within this group of Hemiptera (Fig 1C; Fig S2).
c Only contigs and scaffolds>=500 bp were included in the genome assembly. Table 1 Click here to download Table Table 1.docx