iPathCons and iPathDB: an improved insect pathway construction tool and the database

Insects are one of the most successful animal groups on earth. Some insects, such as the silkworm and honeybee, are beneficial to humans, whereas others are notorious pests of crops. At present, the genomes of 38 insects have been sequenced and made publicly available. In addition, the transcriptomes of dozens of insects have been sequenced. As gene data rapidly accumulate, constructing pathways of molecular interactions becomes increasingly important for entomological research. Here, we developed an improved tool, iPathCons, for knowledge-based construction of pathways from transcriptomes or the official gene sets of genomes. Considering the high evolutionary diversity of insects, iPathCons uses a voting system for Kyoto Encyclopedia of Genes and Genomes Orthology assignment. Both stand-alone software and a web server of iPathCons are provided. Using iPathCons, we constructed the pathways of molecular interactions of 52 insects, including 37 genome-sequenced and 15 transcriptome-sequenced ones. These pathways are available in iPathDB, which provides search functions, a web server and data downloads. This database will be highly useful for the insect research community. Database URL: http://ento.njau.edu.cn/ipath/


Introduction
Insects are one of the most successful animal groups on earth. They comprise more than a million species, representing about half of all known living organisms. Some insects, such as the silkworm (Bombyx mori) and the honeybee (Apis mellifera), are beneficial to humans because they provide valuable products and services (silk, honey, pollination).
In contrast, other species damage crops by feeding on leaves or fruits, causing huge economic losses.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
In addition, dozens of insects have been sequenced for their transcriptome (the SRA database, September 2014).
Constructing pathways of molecular interactions from insect genomes or transcriptomes is important for gene function analysis. Large-scale gene expression analysis is an efficient and widely used technique in molecular biology experiments. However, selecting the right candidate genes for experimental validation is still a challenge. One solution is to find differentially expressed genes in a related pathway. To construct pathways, several knowledge-based resources have been developed, such as PANTHER (24), Gene Ontology (25), Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology (KO) (26), Reactome (27) and PharmGKB (28). Although these methods can be applied to insect gene data, insect pathway construction remains difficult. First, most insects have a heterozygous genome, which reduces the quality of genome assembly and annotation and makes pathway construction harder. Second, compared with mammals and other groups of animals, insects are species-rich and have high evolutionary diversity. Here, we developed an improved tool for insect pathway construction and built an insect pathway database, which should be helpful to the entomological community.

Data resources
Official gene sets of 37 insect species
The SRA database only provides the sequences of raw reads. For most insects, the assembled transcriptomes are not available. Therefore, we assembled these 15 transcriptomes de novo. Assembly statistics are presented in Supplementary Table S3. First, the raw data were cleaned by removing adaptor sequences, empty reads and low-quality reads, that is, reads that contain N or whose average base quality is below 15. Second, we merged the raw reads from different samples of the same species to obtain as many contigs as possible. Third, Trinity was used with default parameters to assemble Illumina Solexa raw reads (38). The insects assembled in this way were A. cerana, L. sericata, Ch. suppressalis, P. xylostella, N. lugens, Be. tabaci and C. quinquefasciatus. Newbler was used with default parameters to assemble Roche/454 raw data for Sp. exigua, R. pomonella, Ae. albopictus, De. ponderosae, O. fasciatus and M. sexta. For the EST (Expressed Sequence Tag) data of Ga. mellonella, Me. cinxia and Z. filipendulae, we assembled the transcripts using the Cap3 software (39). Finally, the assembled transcripts were annotated using BLASTX (Basic Local Alignment Search Tool) against the NCBI nr database.
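The read-cleaning rule described above (drop empty reads, reads containing N, and reads with an average quality below 15) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Read` type, the function names and the Phred+33 quality encoding are assumptions made for illustration, and adaptor trimming is deliberately left out.

```python
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    """A single sequencing read (hypothetical container for this sketch)."""
    name: str
    seq: str
    qual: str  # quality string, assumed to be Phred+33 encoded


def mean_quality(qual: str, offset: int = 33) -> float:
    """Average Phred score of a non-empty quality string."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def clean_reads(reads: Iterable[Read], min_mean_q: float = 15.0) -> Iterator[Read]:
    """Yield reads that are non-empty, contain no N and have mean quality >= min_mean_q."""
    for read in reads:
        if not read.seq:                       # empty read
            continue
        if "N" in read.seq.upper():            # ambiguous base present
            continue
        if mean_quality(read.qual) < min_mean_q:
            continue
        yield read


if __name__ == "__main__":
    demo = [
        Read("keep", "ACGT", "IIII"),      # 'I' is Phred 40 under Phred+33
        Read("has_n", "ACNT", "IIII"),
        Read("empty", "", ""),
        Read("low_q", "ACGT", "####"),     # '#' is Phred 2 under Phred+33
    ]
    print([r.name for r in clean_reads(demo)])
```

Reads that pass the filter would then be pooled per species before assembly, as described in the text.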

KEGG data
The KEGG database provides the most commonly used resources for pathway analysis (40). KEGG contains the genes of 21 insects (Table 1), which were downloaded from the NCBI RefSeq database (41). Among these 21 insects, 14 were from Diptera, including 11 Drosophila species and three mosquitoes. The KEGG Markup Language (KGML) is a format of the KEGG pathway maps, which can be used to draw KEGG pathways and to model gene and chemical networks. Given that KEGG requires a subscription to its FTP server, we downloaded the KGML files from their individual web download pages.

Data preparation
We downloaded the KEGG genes of 21 genome-sequenced insects, the official gene sets (OGSs) of 37 genome-sequenced insects and the transcriptome data of 15 species. For the 21 KEGG-annotated insects with OGS data, we compared the sequences in the KEGG gene data and the OGS data: (i) where the two sources differed in length, we used the longer transcript; (ii) we kept genes even if they appeared in only one of the two data sets.
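The two merging rules can be sketched as follows. The sketch assumes that genes in the two sources have already been matched under a shared identifier, which the text does not describe; the function name and the dictionary layout (identifier to sequence) are likewise hypothetical.

```python
def merge_gene_sets(kegg: dict[str, str], ogs: dict[str, str]) -> dict[str, str]:
    """Merge two gene sets keyed by a shared gene identifier (an assumption).

    Rule (i): if a gene is in both sets with different lengths, keep the longer sequence.
    Rule (ii): genes present in only one set are kept.
    """
    merged = dict(kegg)
    for gene_id, seq in ogs.items():
        if gene_id not in merged or len(seq) > len(merged[gene_id]):
            merged[gene_id] = seq
    return merged


if __name__ == "__main__":
    kegg_genes = {"g1": "MKT", "g2": "MAAAA"}
    ogs_genes = {"g1": "MKTLLV", "g3": "MQ"}
    print(merge_gene_sets(kegg_genes, ogs_genes))
```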

KO assignment
Assigning KO terms is a crucial step in pathway construction. iPathCons uses a voting system for KO assignment. The protein data sets of the 21 KEGG-annotated insects were divided by species, and the protein sequences of each insect were formatted as a separate local BLAST database. These databases served as templates for KO assignment in the other 16 genome-sequenced insects. For each insect to be annotated, its protein sequences were searched with BLASTP against the template of every KEGG-annotated insect. The best BLASTP hit (E-value cutoff 10^-5) was used to assign a KO term, an approach that has been widely used in KOBAS (42), KAAS (43) and Blast2GO (44,45). In this way, each protein was assigned KO terms 21 times, once per template. The term that appeared at the highest frequency (minimum cutoff ≥ 2) was used as the final KO assignment for the protein sequence.
A similar procedure was used to deduce pathways from the transcriptomes of 15 insects. All 37 genome-sequenced insects, including the 21 KEGG-annotated and 16 iPathCons-annotated ones, were used as templates for KO assignment. The protein sequences of each genome-sequenced insect were used as a separate local BLAST database. The transcriptome sequences were searched with BLASTP against each local BLAST database (E-value cutoff 10^-5). The KO term that appeared at the highest frequency (minimum cutoff ≥ 2) was used as the final KO annotation.
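The voting step can be sketched as below. The input is assumed to be a list with one entry per template species, holding the KO term of the best BLASTP hit that passed the E-value cutoff, or `None` when no hit passed. The function name is hypothetical, and the tie-breaking behavior (the first term to reach the top count wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def vote_ko(best_hit_kos: Sequence[Optional[str]], min_votes: int = 2) -> Optional[str]:
    """Return the most frequent KO term across templates, or None.

    best_hit_kos: one entry per template species; None means no hit passed the cutoff.
    min_votes: minimum number of templates that must agree (2 in the text).
    """
    counts = Counter(ko for ko in best_hit_kos if ko is not None)
    if not counts:
        return None
    ko, votes = counts.most_common(1)[0]
    return ko if votes >= min_votes else None


if __name__ == "__main__":
    # Placeholder KO identifiers, one per template species.
    print(vote_ko(["K00001", "K00001", None, "K00002"]))
    print(vote_ko(["K00001", None, None]))
```

A term supported by a single template is rejected, which is what makes the vote more conservative than taking the single best hit across all templates.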

Validation of iPathCons
We used C. quinquefasciatus gene data to validate iPathCons. The protein sequences of C. quinquefasciatus have been annotated by the KEGG database. We removed all protein sequences of C. quinquefasciatus from the KEGG template and used them as queries to deduce pathways with iPathCons. The results indicated that the precision reached 95% and the coverage was 94% (E-value cutoff 10^-5). We also used the transcriptome data of C. quinquefasciatus for pathway construction and obtained a similar result.
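The text does not define precision and coverage, so the sketch below rests on one plausible reading, stated here as an assumption: precision is the fraction of predicted KO assignments that agree with the KEGG annotation, and coverage is the fraction of KEGG-annotated proteins that receive any prediction. The function name and the dictionary layout (protein identifier to KO term) are hypothetical.

```python
def precision_and_coverage(
    predicted: dict[str, str], reference: dict[str, str]
) -> tuple[float, float]:
    """Compare predicted KO terms with reference (KEGG) KO terms.

    Assumed definitions, not taken from the paper:
      precision = correct predictions / predictions made for reference proteins
      coverage  = reference proteins with a prediction / all reference proteins
    """
    shared = [protein for protein in predicted if protein in reference]
    correct = sum(1 for protein in shared if predicted[protein] == reference[protein])
    precision = correct / len(shared) if shared else 0.0
    coverage = len(shared) / len(reference) if reference else 0.0
    return precision, coverage


if __name__ == "__main__":
    kegg = {"p1": "K1", "p2": "K2", "p3": "K3", "p4": "K4"}   # placeholder KO terms
    ipathcons = {"p1": "K1", "p2": "K2", "p3": "K9"}
    print(precision_and_coverage(ipathcons, kegg))
```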
We compared the results of iPathCons with those of a related tool, KAAS, which is a widely used pathway annotation tool provided by the KEGG database. The transcriptomes of A. cerana cerana and Ga. mellonella were used to deduce pathways with both iPathCons and KAAS. The results indicated that the two tools annotated similar numbers of KO terms and pathways in A. cerana cerana, whereas in Ga. mellonella iPathCons found 1675 KO terms and 255 pathways, considerably more than the 1511 KO terms and 239 pathways annotated by KAAS (Table 2). iPathCons annotated more contigs than KAAS, possibly because many more templates were used in iPathCons. However, it should be noted that both iPathCons and KAAS rely on homology analysis to deduce pathways. The results are therefore likely to contain some false positives and need to be confirmed by molecular experiments.

Availability of iPathCons software
Both stand-alone software and a web server are provided. The stand-alone, command-line program is written in Perl. The program consists of three parts: the main program, the 'doc' folder containing the index of K numbers and KO terms, and the 'db' folder containing the local BLAST databases. iPathCons can complete the following tasks: (i) annotating an insect transcriptome or gene set for pathway construction; (ii) generating KGML files that can be opened by VANTED (22) and KEGG-ED (23); and (iii) generating links for each pathway that display the corresponding KEGG pathway.
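To give a sense of what task (ii) produces, the sketch below writes a minimal KGML-like document with Python's standard `xml.etree.ElementTree`. It is an illustration only, not the output of iPathCons, which is written in Perl: the `pathway`, `entry` and `graphics` elements follow the general KGML layout, whereas the function name, the pathway number, the KO identifiers and the node coordinates are placeholders.

```python
import xml.etree.ElementTree as ET


def build_kgml(pathway_number: str, title: str, ko_terms: list[str]) -> str:
    """Build a minimal KGML-like XML string with one ortholog entry per KO term."""
    root = ET.Element("pathway", {
        "name": f"path:ko{pathway_number}",
        "org": "ko",
        "number": pathway_number,
        "title": title,
    })
    for index, ko in enumerate(ko_terms, start=1):
        entry = ET.SubElement(root, "entry", {
            "id": str(index),
            "name": f"ko:{ko}",
            "type": "ortholog",
        })
        # Coordinates are arbitrary placeholders; real KGML files carry map positions.
        ET.SubElement(entry, "graphics", {
            "name": ko,
            "type": "rectangle",
            "x": str(100 * index),
            "y": "100",
            "width": "46",
            "height": "17",
        })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_kgml("00000", "Example pathway", ["K00001", "K00002"]))
```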

Database system implementation
We constructed an insect pathway database named iPathDB, which was developed on a Linux operating system (Redhat 5.6, Raleigh, NC, USA). The Apache HTTP server handles queries from web clients, and PHP scripts perform the searches. The web pages were written using HTML, PHP, CSS and JavaScript. The architecture of iPathDB is presented in Figure 2.

Search
Users can search insect pathways using keywords for species, pathway ID and pathway name. When using species name as the search keyword, all pathways for that species will be presented. When using pathway ID or pathway name as the search keyword, the pathway will be given for all species in the database. Search results provide gene sequences, annotations and a pathway map.
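The search behavior described above can be sketched in a few lines. This is not the iPathDB implementation, which runs on PHP: the record type, the function name and the use of exact, case-insensitive matching are assumptions made for illustration, and the records shown are placeholders.

```python
from typing import NamedTuple


class PathwayRecord(NamedTuple):
    """One pathway of one species (hypothetical record layout)."""
    species: str
    pathway_id: str
    pathway_name: str


def search(records: list[PathwayRecord], keyword: str) -> list[PathwayRecord]:
    """Return records whose species, pathway ID or pathway name equals the keyword."""
    key = keyword.strip().lower()
    return [
        record for record in records
        if key in (record.species.lower(), record.pathway_id.lower(), record.pathway_name.lower())
    ]


if __name__ == "__main__":
    db = [
        PathwayRecord("Bombyx mori", "ko00000", "Example pathway A"),
        PathwayRecord("Bombyx mori", "ko00001", "Example pathway B"),
        PathwayRecord("Apis mellifera", "ko00000", "Example pathway A"),
    ]
    print(len(search(db, "Bombyx mori")))   # all pathways of one species
    print(len(search(db, "ko00000")))       # one pathway across all species
```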

Online server
An online iPathCons server is provided. The KEGG- and iPathCons-annotated gene sets from different insect orders, including Diptera, Lepidoptera, Coleoptera, Hymenoptera, Phthiraptera and Hemiptera, are used as templates for constructing insect pathways. Users can select a template according to their requirements. When fewer than 10 sequences are queried, the results are displayed directly on the web page. When more than 10 sequences are queried, a URL link to the iPathCons results is sent to the user via e-mail.

Disease-associated pathways
Insects have been used to model human diseases. Insect disease models can provide an efficient way to study mechanisms and screen drugs. Interestingly, the results showed that 72% of human disease pathways could be found in insects (Figure 3). In total, 17 human disease-associated pathways were found in insects, including pathways for bacterial, viral and parasitic infectious diseases. In contrast, only two 'immune disease' pathways were found, suggesting that the immune systems of insects and humans are quite different. These results suggest that insects are good candidates for modeling human infectious diseases. A successful example is D. melanogaster, which has been used to model cholera (46).

Xenobiotic metabolism pathways
Most insects feed on plants. To protect themselves, plants produce many kinds of secondary metabolites. In response, insect herbivores have evolved many xenobiotic degradation and metabolism pathways. Almost all insects have pathways belonging to the category 'xenobiotics biodegradation and metabolism'. We found that all insects contained the 'caprolactam degradation' pathway (Figure 3). Caprolactam is a pesticide intermediate.

Signaling pathways
Signaling pathways are signal transduction pathways in which proteins pass signals from the outside of a cell to its inside. In total, 29 signaling pathways were found. Well-studied signaling pathways exist in almost all 52 insects, including the Toll-like receptor, MAPK, NF-κB and Notch signaling pathways. This suggests that these pathways are highly conserved and also play important roles in insects.

Insect hormone biosynthesis
Almost all insects undergo either incomplete metamorphosis, in which immature nymphs resemble the adults, or complete metamorphosis, in which immature larvae differ markedly from the adult. Both molting hormone and juvenile hormone control insect metamorphosis. Hormone biosynthesis pathways were identified in all 52 insects. In the genome-sequenced insects, almost all genes in the insect hormone biosynthesis pathway were found, suggesting that this pathway is highly conserved in insects. All insects with transcriptome data had juvenile hormone epoxide hydrolase, juvenile-hormone esterase and ecdysone oxidase. Ecdysteroid 25-hydroxylase (CYP306A1, Phm) and ecdysteroid 22-hydroxylase (CYP302A1, Dib) were found in almost all insects (Figure 4). We compared the pathway members between holometabolous and hemimetabolous insects and found no apparent difference in the present data. A detailed analysis of the pathway differences is worthy of further investigation. Because of the low quality of the insect transcriptome data, some genes in the insect hormone biosynthesis pathway were missing. The completeness of this pathway can therefore be used as a parameter to estimate the quality of genome annotation or transcriptome assembly.
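One simple way to turn pathway completeness into a number is the fraction of a reference pathway's KO terms that were recovered in a given species. The sketch below assumes this definition, which is not stated in the text; the function name is hypothetical and the KO identifiers are placeholders rather than the real members of the insect hormone biosynthesis pathway.

```python
def pathway_completeness(found_kos: set[str], reference_kos: set[str]) -> float:
    """Fraction of the reference pathway's KO terms found in one species."""
    if not reference_kos:
        raise ValueError("reference pathway must contain at least one KO term")
    return len(found_kos & reference_kos) / len(reference_kos)


if __name__ == "__main__":
    reference = {"K_A", "K_B", "K_C", "K_D"}       # placeholder pathway members
    from_transcriptome = {"K_A", "K_B", "K_D", "K_X"}
    print(pathway_completeness(from_transcriptome, reference))
```

A low score for a conserved pathway would then point to gaps in the assembly or annotation rather than to genuine gene loss, although the two cannot be separated by this number alone.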

Wing development pathway
Insects are characterized by having six legs and four wings, which enable diverse modes of locomotion. Insect wing development is an important research topic. However, no wing development pathway is available in KEGG or other gene network databases. Therefore, we constructed a wing development pathway by mining the literature on wing development in D. melanogaster (47-57) and Ac. pisum (58). KGML files of the wing development pathways in these two species were produced. Then, those files were used as templates to construct the pathways in other species (Figure 5). To the best of our knowledge, this is the first report of an insect wing development pathway. The results indicated that almost all insects have genes in this pathway. However, a major part of the genes associated with wing development were missing in the flightless silkworm, B. mori. Since the silkworm has been domesticated for thousands of years, the impact of domestication on the evolution of wing development requires further investigation.

Conclusion
We developed an improved analysis tool for constructing insect pathways. Both stand-alone software and web servers are provided. Users can construct insect pathways from a list of genes. An insect pathway database was also built that contains well-annotated insect pathways from 52 species.

Future study
1. Knowledge-based construction of insect pathways relies on sequence data. Therefore, we will continually update iPathDB by adding more insect genomes once they are sequenced and published. We will also reconstruct the pathways when new versions of OGSs are released.

2. Evolutionary analysis of insect pathways is an interesting topic that is worthy of further investigation. In the future, as more reliable insect pathways are added to iPathDB, we will carry out insect pathway conservation analysis. iPathDB will display conserved insect pathways in various insect species.

Supplementary Data
Supplementary data are available at Database Online.