WaspBase: a genomic resource for the interactions among parasitic wasps, insect hosts and plants

Abstract
Insect pests reduce yield and cause economic losses, which are major problems in agriculture. Parasitic wasps are the natural enemies of many agricultural pests and thus have been widely used as biological control agents. Plants, phytophagous insects and parasitic wasps form a tritrophic food chain. Understanding the interactions in this tritrophic system should be helpful for developing parasitic wasps for pest control and deciphering the mechanisms of parasitism. However, the genomic resources for this tritrophic system are not well organized. Here, we describe WaspBase, a new database that contains 573 transcriptomes of 35 parasitic wasps and the genomes of 12 parasitic wasps, 5 insect hosts and 8 plants. In addition, we identified long non-coding RNAs, untranslated regions and 25 widely studied gene families from the genome and transcriptome data of these species. WaspBase provides conventional web services such as Basic Local Alignment Search Tool, search and download, together with several widely used tools such as profile hidden Markov model, Multiple Alignment using Fast Fourier Transform, automated alignment trimming and JBrowse. We also present a collection of active researchers in the field of parasitic wasps, which should be useful for constructing scientific networks in this field.


Introduction
Insects are the most widely distributed group of animals on earth. Most insects are herbivores that cause huge yield losses when feeding on crops. Insects such as houseflies and mosquitos are vectors of pathogens that cause disease in humans and domesticated animals (1). Many methods have been developed to combat these insect pests, and some are used in agriculture. Insecticides are one of the main methods of pest control in agriculture. Unfortunately, overuse of insecticides causes serious environmental pollution and food safety problems (2). Therefore, alternative, environment-friendly pest control methods should be developed.
Biological control is an environment-friendly pest control method. Parasitic wasps are well-known biological control agents (3,4) as they are effective natural enemies of many economically important insect pests. Parasitic wasps are a group of hymenopteran insects that lay eggs in or on the bodies of hosts (5). The wasp larvae feed on the host until pupation and eventually kill the host (6). However, pest control using parasitic wasps has some apparent disadvantages, such as wasp development lagging behind pest outbreaks and low control efficiency. Understanding the antagonistic interactions between parasitic wasps and their hosts is important for improving control efficiency (7). At present, the genomes of 34 parasitic wasps have been deposited in public databases such as the National Center for Biotechnology Information (NCBI). In addition, the genomes of six hosts of these wasps and eight plants that are damaged by these insect hosts are available. Among these species, five parasitic wasps (4,8-11), six insect hosts (12-18) and six plants (19-24) were publicly reported.
Though these data can be retrieved from NCBI, they are not well organized and thus have not been fully explored. Here, we collected the genome and transcriptome data of 34 parasitic wasps, 9 insect hosts and 8 plants from NCBI, i5k workspace@NAL (25) and InsectBase (7). Then, we constructed a database, which we named WaspBase, to serve as an integrated genomic resource for a tritrophic system of wasps, hosts and plants.

OGS
The General Feature Format version 3 (GFF3) files containing annotation information were downloaded with the genome data, and the official gene sets (OGSs) were extracted from the genomes based on the annotations in the GFF3 files. The nucleotide and protein sequences of the annotated genes were then produced (Table 2).
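The coordinate-based extraction described above can be illustrated with a minimal sketch. It is not the pipeline used for WaspBase: the function names, the toy scaffold and the gene IDs are hypothetical, and the sketch only returns the genomic span of each `gene` feature. It does not splice exons or translate coding sequences, which producing an OGS would require.

```python
# Minimal sketch: pull gene sequences out of a genome using GFF3 coordinates.
# All names and records below are made up for illustration.

def read_fasta(text):
    """Parse FASTA text into {sequence_id: sequence}."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # ID is the first word of the header
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {k: "".join(v) for k, v in seqs.items()}

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def extract_features(gff3_text, genome, feature_type="gene"):
    """Return {feature_id: sequence} for GFF3 rows of the given type."""
    out = {}
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank lines and directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # GFF3 coordinates are 1-based and inclusive.
        seq = genome[seqid][start - 1:end]
        if strand == "-":
            seq = reverse_complement(seq)
        out[attrs.get("ID", f"{seqid}:{start}-{end}")] = seq
    return out

genome = read_fasta(">scaffold1 toy\nAAACCCGGGTTT\n")
gff3 = (
    "##gff-version 3\n"
    "scaffold1\ttoy\tgene\t4\t9\t.\t+\t.\tID=gene1\n"
    "scaffold1\ttoy\tgene\t1\t6\t.\t-\t.\tID=gene2\n"
)
genes = extract_features(gff3, genome)
print(genes)
```

Features on the minus strand are reverse-complemented so that every returned sequence reads in the direction of transcription.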

Transcriptomes
The raw data of 34 samples of parasitic wasps were downloaded from the NCBI SRA (Sequence Read Archive) database (https://www.ncbi.nlm.nih.gov/sra). We assembled 22 transcriptomes using Trinity and TopHat-Cufflinks with default parameters (26,27). Together with 21 other available transcriptomes, we collected a final transcriptome dataset of 573 RNA-Seq samples from 35 parasitic wasps (Table 3).

lncRNA
Long non-coding RNAs (lncRNAs) are transcribed RNA molecules >200 nucleotides in length that are not protein coding (28,29). We predicted lncRNAs of eight parasitic wasps using a previously reported pipeline (30). In total, we predicted 49 607 lncRNAs from eight parasitic wasps.

UTR
We developed a pipeline to predict untranslated regions (UTR) from the transcriptomes and genomes using

Gene families
We used manual annotation by Blastp against known genes (e-value = 10^-5), GO annotation and phylogenetic analysis to identify the members of a gene family. We obtained the information of 25 gene families that have been widely studied, including those related to chemoreception, the immune system and detoxification (Figure 3). We also provide a web server for phylogenetic analysis of selected gene members, which uses ClustalW2 (31) to construct a phylogenetic tree by the neighbor-joining clustering method. The number of bootstrap replicates was set to 500. Newick Utilities V1.6 (32) was used to display the phylogenetic tree.
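The e-value cutoff applied in the Blastp step can be illustrated with a short sketch. It assumes the hits are available in the standard 12-column BLAST tabular layout, in which the e-value is the 11th column. The query and subject names are invented, and treating the cutoff as inclusive (e-value <= 10^-5) is an assumption, since the text does not say whether the boundary is included.

```python
# Minimal sketch: keep BLAST tabular hits at or below an e-value cutoff.
# Column layout assumed: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore (tab-separated).

E_VALUE_CUTOFF = 1e-5  # cutoff quoted in the text; inclusive by assumption

def filter_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return (query, subject, evalue) for rows with evalue <= cutoff."""
    kept = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        cols = line.split("\t")
        evalue = float(cols[10])
        if evalue <= cutoff:
            kept.append((cols[0], cols[1], evalue))
    return kept

# Hypothetical hits, for illustration only.
example = "\n".join([
    "wasp_g1\tOBP_ref1\t85.0\t120\t18\t0\t1\t120\t1\t120\t2e-60\t210",
    "wasp_g2\tOBP_ref1\t31.0\t60\t40\t2\t5\t64\t10\t69\t0.003\t35.1",
    "wasp_g3\tP450_ref7\t45.2\t300\t150\t5\t1\t300\t3\t299\t1e-5\t88.0",
])
hits = filter_hits(example)
for query, subject, evalue in hits:
    print(query, subject, evalue)
```

Hits that pass such a filter are only candidates; in the workflow described above they are further checked by GO annotation and phylogenetic analysis.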

Database system implementation
WaspBase was developed on an Apache HTTP (Apache 2.4.25) server in a Linux (RedHat 4.8.2) operating system.
The web pages were written using PHP (PHP 5.6.30), HTML, Cascading Style Sheets and JavaScript. All data are stored in a MySQL (MySQL 5.7.17) database. The Apache server handles queries from web clients through PHP scripts that perform the searches.

Search function
WaspBase provides a search function that accepts keywords, gene IDs, gene names, annotation keywords, KEGG IDs, KEGG annotations (33), Pfam IDs or Pfam annotations (34). Once a gene is searched for, all related gene information is presented on the result webpages. The search results include genes from parasitic wasps, insect hosts and plants.
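A search of this kind can be sketched as a single query that matches a term against several fields at once. The example below is only an illustration and does not reflect the actual WaspBase schema: the table, its columns and the records are hypothetical, and an in-memory SQLite database stands in for MySQL so that the sketch is self-contained.

```python
# Minimal sketch of a multi-field gene search. The schema and the records
# are hypothetical; SQLite is used here only as a stand-in for MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_id TEXT, species TEXT, trophic_level TEXT, "
    "annotation TEXT, pfam_id TEXT)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
    [
        ("g001", "wasp species A", "wasp", "G protein-coupled receptor", "PF00001"),
        ("g002", "host species B", "host", "cytochrome P450", "PF00067"),
        ("g003", "plant species C", "plant", "cytochrome P450", "PF00067"),
    ],
)

def search(conn, term):
    """Match a term exactly on IDs and as a substring of the annotation."""
    rows = conn.execute(
        "SELECT gene_id, trophic_level FROM gene "
        "WHERE gene_id = ? OR pfam_id = ? OR annotation LIKE ? "
        "ORDER BY gene_id",
        (term, term, f"%{term}%"),   # parameters are bound, not interpolated
    )
    return rows.fetchall()

by_keyword = search(conn, "P450")
by_pfam = search(conn, "PF00001")
print(by_keyword, by_pfam)
```

Because one query spans all three trophic levels, a keyword such as a gene family name returns the matching genes from wasps, hosts and plants together.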

Tools module
The tools module contains Basic Local Alignment Search Tool (BLAST) (35), profile hidden Markov model (HMMER), Multiple Alignment using Fast Fourier Transform (MAFFT), automated alignment trimming (TrimAl) and JBrowse (36).
BLAST (35) is provided using the Web-based BLAST server 2.6.0+. The data used for nucleotide BLAST (BLASTN, TBLASTN) searches include 12 insect genomes and 9 insect OGSs. The protein data used for amino acid BLAST (BLASTP, TBLASTX, BLASTX) searches contain nine insect protein sequences. On the BLAST results webpage, users can choose to display the top 5 hits, the top 10 hits or all hits. The top five BLAST hits are displayed by default. Users can also adjust other parameters such as similarity percentage and BLAST score. Links for the BLAST hits are provided to connect directly to NCBI for full annotation information. All sequences can be downloaded.
Multiple sequence alignment (MSA) is important for evolutionary analyses. MAFFT (37) is a widely used program for MSA analysis because of its high performance. WaspBase provides a web server of MAFFT and uses TrimAl to trim the aligned sequences (38). To use the MAFFT web server, users input the sequences in FASTA format and run the alignment with either the default or customized parameters. To use TrimAl, users input the aligned sequences on the TrimAl webpage. The trimmed sequences are shown on the TrimAl result webpage. If there are more than four sequences, a phylogenetic tree can be constructed using the abovementioned method.
A web server of HMMER is provided to search for sequence homologs and to make sequence alignments. It uses probabilistic models called profile hidden Markov models (profile HMMs) (39). To use HMMER, users input the protein sequences on the HMMER webpage. The protein sequences are then searched against the Pfam database, and the resulting protein domain information is shown on the HMMER result webpage.
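The idea behind alignment trimming can be illustrated with a simple gap-fraction filter. This is a deliberately simplified sketch and not TrimAl's implementation: the function name, the 0.5 threshold and the toy alignment are assumptions chosen for illustration.

```python
# Minimal sketch: drop alignment columns that consist mostly of gaps.
# Simplified illustration only; not the algorithm implemented in TrimAl.

def trim_gappy_columns(aligned, max_gap_fraction=0.5):
    """Keep columns whose fraction of '-' characters is <= max_gap_fraction."""
    names = list(aligned)
    rows = [aligned[n] for n in names]
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        i for i in range(length)
        if sum(r[i] == "-" for r in rows) / len(rows) <= max_gap_fraction
    ]
    return {n: "".join(r[i] for i in keep) for n, r in zip(names, rows)}

# Hypothetical aligned protein fragments.
alignment = {
    "seqA": "MK-LV--A",
    "seqB": "MKTLV--A",
    "seqC": "MK-LI-GA",
    "seqD": "M--LVQ-A",
}
trimmed = trim_gappy_columns(alignment)
for name, seq in trimmed.items():
    print(name, seq)
```

Columns in which three of the four sequences carry a gap are removed, while a column with a single gap is retained, so the trimmed sequences stay aligned and of equal length.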

Genome visualization
JBrowse is a well-known browser that displays genome annotations by integrating the databases and interactive web pages (36). We used JBrowse in WaspBase to provide interactive views of annotations along the genome scaffolds. The genome data and the GFF3 files required for JBrowse are stored in a MySQL database using prepare-refseqs.pl, flatfile-to-json.pl, add-bam-track.pl and add-track-json.pl provided by BioPerl. In WaspBase, JBrowse displays the annotations and transcriptomes as tracks showing coding sequences and the coverage of transcriptome reads. Pop-up balloons in the gene model track display links to gene sequences of interest.

Wasp researchers
To construct a scientific network in the field of parasitic wasp research, we performed reference mining of parasitic wasp studies, which yielded 189 references. Based on publications in the last 5 years, we collected a list of active researchers studying parasitic wasps.

Download
All data can be downloaded, including genomes, transcriptomes, UTRs, gene families and lncRNAs. For convenience, the gene data of parasitic wasps, insect pests and plants are provided for download on the same webpage (Figure 4).

Conclusions
We constructed WaspBase for parasitic wasps and their corresponding insect hosts and plants. WaspBase provides conventional functions of search, download, domain analysis and phylogenetic analysis, JBrowse display of annotations and other functions described herein. In addition to genomes and transcriptomes, WaspBase also provides lncRNA, UTR and gene family information. A distinctive feature of WaspBase is that it integrates the gene information of parasitic wasps, their insect hosts and the plants targeted by these insect pests. Thus, gene data of the tritrophic system in food chains (parasitic wasp-insect pest-plant) were analyzed together, which should be useful for studying cross-species regulation in parasitism and for analyzing convergent evolution among wasps, hosts and plants.

Future plan
1. As the cost of sequencing has been significantly reduced in recent years, the genomes of an increasing number of parasitic wasps will be sequenced. We plan to update WaspBase periodically to keep the database up-to-date.
2. Genome annotation is still a time-consuming task and significantly lags behind genome sequencing. We noticed that a number of parasitic wasp genomes are not annotated at present though their genome sequences have been uploaded to the NCBI genome database. We will annotate these genomes using OMIGA (Optimized Maker-Based Insect Genome Annotation) (40), a genome annotation pipeline that we developed.
3. It is important to understand cross-species regulation mechanisms and convergent evolution in parasitism. To this end, we will carry out a systematic analysis of more gene families from the OGSs of 'wasps-insects-plants', which should be useful to improve control efficiencies in biological control.