AphidBase aims to (i) store recently acquired genomic resources on aphids and (ii) compare them to other insect resources as functional annotation tools. To this end, the Drosophila melanogaster genome has been loaded into the database using the GMOD open source software, for comparison with the 17 069 pea aphid unique transcripts (contigs) and the 13 639 gene transcripts of Anopheles gambiae. Links to the FlyBase and A.gambiae Entrez databases allow rapid characterization of the putative functions of the aphid sequences. Text mining of the D.melanogaster literature was performed to construct a network of co-cited gene or protein names, which should facilitate functional annotation of homologous aphid sequences. AphidBase represents one of the first genomic databases for a hemipteran insect.
Since the sequencing of the Drosophila melanogaster genome, several other insect genomes have been or are currently being sequenced. These new genomic resources require the development of databases, not only to store and analyse each insect genome separately but also to support interspecies comparisons. Aphids are plant-sucking insects that attack plants by feeding on the fluids circulating in the phloem. These hemipterans diverged from other insects 300 million years ago (Grimaldi and Engels, 2005) and are responsible for considerable damage worldwide to cultivated and ornamental plants. Genomic tools have recently been developed for aphids, mainly for the pea aphid Acyrthosiphon pisum. In the last 3 years, a large collection of ESTs (more than 60 000) has been developed, and the complete genome sequence is currently being released. AphidBase is a project of the International Aphid Genomic Consortium, conceived as a resource for users to analyse, compare, retrieve and annotate aphid sequences in comparison with D.melanogaster and Anopheles gambiae.
2 DATABASE STRUCTURE
We used the Generic Model Organism Database (GMOD; http://www.gmod.org/home) open source project for administering genomic databases. Among the GMOD tools, Chado (a database architecture on PostgreSQL; http://www.gmod.org/apollo-chado) and Gbrowse (a genomic data browser; Stein et al., 2002) were installed to set up and develop AphidBase. The complete D.melanogaster chromosome sequences (FlyBase version r4.2.1; Drysdale et al., 2005), 13 639 A.gambiae gene transcripts (MOZ2a) and 17 069 A.pisum putative unique transcribed sequences (or contigs; see Sabater-Muñoz et al., 2006) were retrieved. These 17 069 A.pisum contigs were assembled from 53 190 ESTs, forming the so-called updated ‘v5’ version of contigs (Sabater-Muñoz et al., 2006 and unpublished data). In order to compare A.pisum and A.gambiae sequences to D.melanogaster sequences, tblastx was run with the A.pisum contigs or the A.gambiae gene transcripts against the D.melanogaster genomic sequence. The A.pisum or A.gambiae sequences matching fly sequences were then displayed on the D.melanogaster genome. For A.pisum, 12 280 (72%) contigs had no match to D.melanogaster sequences (e-value cutoff = 10⁻⁶) and 4789 (28%) were homologous to D.melanogaster sequences. In addition, 5551 (32%) A.pisum contigs were homologous to A.gambiae sequences, and 2922 (17%) matched both D.melanogaster and A.gambiae sequences. This low level of homology reflects the divergence between hemipterans and dipterans, as already discussed in Sabater-Muñoz et al. (2006).
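The counting step behind these percentages can be illustrated with a minimal sketch. It assumes that the tblastx results are available as 12-column tab-delimited BLAST output with the e-value in the 11th column, which the text does not state; the function names, contig identifiers and toy rows are hypothetical and are not taken from AphidBase.

```python
from typing import Dict, Iterable, Set


def queries_with_hits(rows: Iterable[str], max_evalue: float = 1e-6) -> Set[str]:
    """Return the query IDs that have at least one hit at or below max_evalue.

    Assumption: each row is a 12-column tab-delimited BLAST line whose
    first field is the query ID and whose 11th field is the e-value.
    """
    hits: Set[str] = set()
    for row in rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            hits.add(query_id)
    return hits


def summarize(total: int, fly: Set[str], mosquito: Set[str]) -> Dict[str, int]:
    """Count contigs matching the fly, the mosquito, both, or either."""
    return {
        "fly": len(fly),
        "mosquito": len(mosquito),
        "both": len(fly & mosquito),
        "either": len(fly | mosquito),
        "no_fly_match": total - len(fly),
    }


# Toy rows (hypothetical): three contigs in total, contig3 has no hit at all.
fly_rows = [
    "contig1\t2L\t55.0\t60\t27\t0\t1\t180\t1000\t1179\t1e-20\t95.0",
    "contig1\t3R\t40.0\t50\t30\t0\t1\t150\t500\t649\t3e-08\t60.0",
    "contig2\tX\t35.0\t40\t26\t0\t1\t120\t200\t319\t0.002\t35.0",
]
mosquito_rows = [
    "contig2\ttranscriptA\t50.0\t40\t20\t0\t1\t120\t1\t120\t1e-10\t70.0",
    "contig1\ttranscriptB\t45.0\t50\t27\t0\t1\t150\t1\t150\t5e-09\t65.0",
]

fly_hits = queries_with_hits(fly_rows)
mosquito_hits = queries_with_hits(mosquito_rows)
summary = summarize(3, fly_hits, mosquito_hits)
print(summary)
```

In this toy input, contig2 is not counted as a fly match because its only fly hit (e-value 0.002) lies above the cutoff, whereas both contig1 and contig2 pass the cutoff against the mosquito transcripts.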
3 FUNCTIONAL ANNOTATION TOOLBOX
In order to facilitate functional annotation of the pea aphid sequences, for which very little data is available on gene and protein functions, several descriptions are provided for each of the 4789 A.pisum contigs homologous to D.melanogaster and/or A.gambiae sequences. A direct link to the FlyBase report for each D.melanogaster gene was activated, in order to take advantage of the full molecular and genetic description of the fly genes homologous to aphid genes. In parallel, a similar direct link was created for the A.gambiae gene transcripts annotated in the Ensembl Mosquito Transcript Report. Finally, a contig report was created for each of the A.pisum sequences, containing a list of several features. First, the sequence of each contig is given, together with its EST composition and the cDNA libraries of origin. Second, the translations in the six reading frames are displayed, together with the largest putative ORF identified by FrameD analysis (Schiex et al., 2003). Third, as each of the aphid transcripts could have several matches to D.melanogaster genomic sequences, only the first hit is displayed in Gbrowse for the sake of clarity; the complete tblastx report is therefore included in the contig report. A major fraction (2872) of the 4789 pea aphid contigs homologous to D.melanogaster sequences had a single match (the one visualized in Gbrowse), but 1827 contigs had between 2 and 7 different matches on the D.melanogaster genome (see “Global Report”). Fourth, the result of the Uniprot annotation (e-value cutoff = 10⁻⁵; Sabater-Muñoz et al., 2006) is displayed. Finally, a text mining analysis was set up, based on the large D.melanogaster literature.
For this, a thesaurus of the D.melanogaster bibliography (about 37 000 bibliographical records from the Medline database) was constructed (i) by automatic extraction of gene names from the abstracts, (ii) by automated pattern recognition and natural language processing methods for the names of genes and their products, and (iii) by compiling the literature annotations available in various databases such as FlyBase, SwissProt or Entrez Gene. Co-citation clusters and networks were also constructed by automatic clustering approaches, and annotated with biological and functional information from the Medline records (indexing keywords) and with the information available for D.melanogaster genes and their products in databases and ontologies. Each of the 4487 pea aphid contigs homologous to a D.melanogaster gene was thus linked to literature records, clusters and networks by using the names of its D.melanogaster homologs. This bibliographic network of genes co-cited in the literature is a quick and efficient tool to infer the biological functions in which a given pea aphid contig might be involved.
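The principle of a co-citation network can be illustrated with a minimal sketch. It is not the pipeline used for AphidBase, which relied on a thesaurus, natural language processing and automatic clustering; it only counts how often two gene names occur in the same record, assuming that the gene names have already been extracted from each abstract. The gene names, records and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def cocitation_counts(records: Iterable[Set[str]]) -> Counter:
    """Count, for every pair of gene names, the records citing both."""
    counts: Counter = Counter()
    for genes in records:
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(genes), 2):
            counts[(a, b)] += 1
    return counts


def neighbours(counts: Counter, gene: str, min_count: int = 1) -> List[Tuple[str, int]]:
    """List the genes co-cited with `gene`, most frequently co-cited first."""
    result = []
    for (a, b), n in counts.items():
        if n >= min_count and gene in (a, b):
            result.append((b if a == gene else a, n))
    return sorted(result, key=lambda item: (-item[1], item[0]))


# Toy records (hypothetical): each set holds the gene names found in one abstract.
records = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneB", "geneD"},
    {"geneC"},
]

counts = cocitation_counts(records)
print(neighbours(counts, "geneB"))
```

An aphid contig whose fly homolog is geneB would then inherit geneA, geneC and geneD as candidate functional partners, ranked by the number of shared records.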
As soon as the sequence and assembly of the 530 Mb genome of A.pisum are available, gameXML files of specific regions of interest will be easily extracted and loaded into genome annotation editors. Comparisons with (i) other aphid EST or cDNA sequences (e.g. Hunter et al., 2003), (ii) the D.melanogaster and A.gambiae genomes or (iii) other insect genomes under annotation (e.g. the honey bee) will support the decisions of human experts. AphidBase will also be extended with new functional annotation modules, mainly for transcript and protein profiling. AphidBase will thus provide the necessary infrastructure for curation, archiving and functional annotation of an aphid genome, one of the first sequenced hemipteran genomes.
S. Cain (GMOD), O. Chenedé (INRA Rennes), J.P. Gaultier (INRA Jouy en Josas), C. Pommier (INRA, URGI), J.C. Simon, C. Soster (INRA Rennes) and L. Stein (CSHL, USA) are acknowledged for their technical support, advice and discussions. Financial support was received from Rennes Metropole and ANR Exdisum.
Conflict of Interest: none declared.