CADRE is a public resource for housing and analysing genomic data extracted from species of Aspergillus. It arose to enable maintenance of the complete annotated genomic sequence of Aspergillus fumigatus and to provide tools for searching, analysing and visualizing features of fungal genomes. By implementing CADRE using Ensembl, a framework is in place for storing and comparing several genomes: the resource will thus expand by including other Aspergillus genomes (such as Aspergillus nidulans) as they become available. CADRE is accessible at http://www.cadre.man.ac.uk.
Received August 15, 2003; Accepted August 20, 2003
Aspergillus is a genus of fungi found worldwide; over 180 species are officially recognized (1), some of which are of medical or industrial importance. Aspergillus fumigatus is the most common mould pathogen of humans, causing both life‐threatening invasive disease in immunocompromised patients and allergic disease in patients with atopic immune systems (2). Aspergillus nidulans, an occasional human pathogen, is a model organism that has contributed to our understanding of genetics, gene regulation and cellular biology (3,4), while Aspergillus niger (5,6) and Aspergillus oryzae (7) are both used in industrial processes. Several other Aspergillus species are known to be significant allergens or to be responsible for mycotoxin production on stored food (8–10).
Interest in A.fumigatus has increased in recent years, not only because it is the Aspergillus species most frequently isolated from patients, but also because the incidence of invasive aspergillosis is rising (11). Moreover, it produces a number of toxins, such as fumagillin, which has been developed as an inhibitor of angiogenesis and as a treatment for microsporidiosis (12). To gain a better insight into the pathogenicity of this organism, an international consortium was established in 1998 to sequence the small (∼30 Mb) A.fumigatus genome (13). Sequencing is almost complete and first pass annotation is being carried out by the Wellcome Trust Sanger Institute (UK) and The Institute for Genomic Research (USA). A Central Aspergillus Data REpository (CADRE) was subsequently established in 2001 to manage the information produced by the sequencing effort, to contribute secondary annotation and to facilitate future comparative studies by incorporating genomic data from A.nidulans and other Aspergillus species as they become available.
The principal role of CADRE is to manage the complete annotated genomic sequence of A.fumigatus. Using a subset of these data as a test case, we have therefore implemented a database and Web‐based software to facilitate searching and visualization of genomic features. These tools offer relatively simple displays for viewing gene and protein annotation, as well as more complex displays for viewing different gene predictions and other sequence features (e.g. RNA‐encoding genes and repeats).
SOURCE DATA AND METHODS
Source data were provided by the A.fumigatus pilot sequence project. Using a bacterial artificial chromosome (BAC) genomic library, which was constructed using DNA from a clinical isolate of A.fumigatus (AF293), a bidirectional clone‐by‐clone walk was undertaken from niaD‐positive BAC clones. Sixteen BAC clones were completely sequenced to yield an assembly of 922 kb, for which 360 protein‐coding genes and eight tRNA genes have been predicted. This sequence and set of annotated genes were used to implement a database that will eventually house the full genome.
The infrastructure used to organize the genomic data was provided by the Ensembl system (14). This comprises: (i) a database schema that has been designed for storing annotated eukaryotic genomes; (ii) BioPerl and Ensembl object‐oriented modules for describing biological entities; (iii) a series of Perl scripts that generate Web pages for viewing genomic data; and (iv) an annotation pipeline for predicting genetic features.
As Ensembl was developed for the management and automated annotation of large eukaryotic genomes, we adapted this system both to handle smaller genomes (such as those of fungi) and to accommodate automatic and manual annotation provided by different research groups. For A.fumigatus, annotation has been provided in an XML format, which we hope will be adopted by the Aspergillus annotation community for data exchange.
The foundation of CADRE is Ensembl version 8.1. The database schema has been implemented using the MySQL relational database management system.
CONTENT OF CURRENT RELEASE
Release 1.0 (September 2003) contains information pertaining to the A.fumigatus pilot sequence project and includes 368 predicted genes (360 protein‐coding genes and eight tRNA genes). Each of these has been given a unique CADRE identifier and has been classified as known, putative or novel. Known genes are those corresponding to previously characterized A.fumigatus genes or orthologous genes from other Aspergillus species, whose protein sequences are available in the public databases. Putative genes are those found to be similar to known publicly available protein sequences. Novel genes are those predicted with no similarity to any known protein. Of the 360 predicted protein‐coding genes, 35 have been classified as known genes, 149 as putative and 176 as novel.
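The three‐way classification described above can be sketched as a small decision rule. This is only an illustration: the function name and its two boolean inputs are hypothetical, and the way similarity evidence would be reduced to such flags is an assumption, not a description of the CADRE annotation procedure. The counts are those reported in the text.

```python
def classify_gene(matches_characterized_aspergillus: bool,
                  similar_to_public_protein: bool) -> str:
    """Return 'known', 'putative' or 'novel' for one predicted gene.

    Both flags are hypothetical summaries of the similarity evidence:
    - matches_characterized_aspergillus: corresponds to a characterized
      A.fumigatus gene or an orthologue from another Aspergillus species.
    - similar_to_public_protein: similar to any publicly available protein.
    """
    if matches_characterized_aspergillus:
        return "known"
    if similar_to_public_protein:
        return "putative"
    return "novel"


# Counts reported in the text for the protein-coding genes of Release 1.0.
reported_counts = {"known": 35, "putative": 149, "novel": 176}
protein_coding_total = sum(reported_counts.values())

# Eight tRNA genes are predicted in addition to the protein-coding genes.
TRNA_GENES = 8
all_predicted_genes = protein_coding_total + TRNA_GENES
```

Summing the reported classes gives the 360 protein‐coding genes, and adding the eight tRNA genes gives the 368 predicted genes quoted for the release.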
DISPLAY AND SEARCH SOFTWARE
Several tools are provided for viewing genomic data within CADRE, the three main ones being ContigView, GeneView and ProtView. ContigView (Fig. 1) is the principal data visualization tool in the Ensembl system. It provides a high‐level view of the contigs that make up a genome assembly, as well as the genomic features that have been mapped onto it. ContigView can also be customized: display colours can be changed and features can be added or removed to aid data assimilation. Within CADRE, this view is provided on two levels: (i) an ‘overview’, displaying predicted genes within a 100 kb region of the assembly, and (ii) a ‘detailed view’, showing a range of predicted features within a smaller region of the assembly (by default 10 kb is shown).
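The two display levels can be thought of as windows of different width placed on the same assembly. The following is a minimal sketch of that idea, assuming a window centred on a chosen position and clamped to the assembly ends; the function name, the centring rule and the 1‐based inclusive coordinates are assumptions made for illustration and are not taken from the Ensembl code.

```python
from typing import Tuple

ASSEMBLY_LENGTH_BP = 922_000  # pilot assembly size quoted in the text (922 kb)
OVERVIEW_BP = 100_000         # width of the 'overview' region
DETAIL_BP = 10_000            # default width of the 'detailed view'


def window(centre: int, width: int,
           length: int = ASSEMBLY_LENGTH_BP) -> Tuple[int, int]:
    """Return a 1-based inclusive (start, end) window around `centre`.

    The window is `width` bp wide (or the whole assembly if that is
    shorter) and is shifted, not truncated, when it would run off an end.
    """
    width = min(width, length)
    start = centre - width // 2
    # Clamp so that the whole window stays inside 1..length.
    start = max(1, min(start, length - width + 1))
    return start, start + width - 1


overview = window(500_000, OVERVIEW_BP)  # (450000, 549999)
detail = window(500_000, DETAIL_BP)      # (495000, 504999)
```

Shifting rather than truncating keeps the displayed region at a constant width near the ends of the assembly, which is one possible design choice among several.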
The main features of ContigView are predicted transcripts, which are colour‐coded according to our classification system and displayed parallel to the assembly in accordance with their position and strand orientation. Each transcript provides a pop‐up menu of additional information (systematic feature name, CADRE transcript and gene identifiers) and hyperlinks to other views (GeneView, ProtView and ExportView). The position of other features, such as BAC clones, tRNAs and start/stop codons, can also be presented alongside the transcripts. The ability to integrate a range of features within a single view is a vital facility provided by Ensembl. As the resource expands, it will allow CADRE to provide results obtained from various prediction programs, as well as data from other research groups, thereby aiding functional assignment. In addition, it will facilitate genome comparison, as transcripts in other Aspergillus genomes found to be similar to those in the currently viewed genome can also be displayed. Ensembl provides two means of handling information gathered by other groups: data can be stored in‐house, as an auxiliary database, or it can be dynamically imported using the Distributed Annotation System (15). ContigView is extensible and can act as a portal to other databases, providing the opportunity for collaborative genome analysis amongst research groups.
GeneView (Fig. 2) provides detailed information about a particular gene. The summary table at the top of the report provides: (i) the systematic feature name; (ii) the standard gene name, as represented in the literature; (iii) the CADRE gene identifier; (iv) the chromosomal location; (v) a short description of the gene, manually transferred from the external sequence database entry [e.g. SWISS‐PROT (16)] to which the predicted gene mapped; (vi) how the gene was predicted; (vii) a list of predicted transcripts, each of which is hyperlinked to the sequence and its translation; (viii) a list of cross‐references to similar sequences; (ix) GO terms that have been mapped to the gene and (x) a link to ExportView, for data download.
Below the summary table are reports describing each predicted transcript. Each report provides: (i) the cDNA sequence; (ii) an image of the exon structure; (iii) the transcript neighbourhood, highlighting the transcript of interest; (iv) exon information (CADRE exon identifier, contig identifier, strand orientation, contig coordinates and exon sequence) and (v) splice‐site information (CADRE identifiers of adjacent exons and splice‐site sequence). For (iv) and (v), an exon may lie across two contigs: in this event, the exon sub‐sequences are distinguished by a numerical suffix, e.g. CADAFUE0000473‐1 and CADAFUE0000473‐2.
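The numerical suffix convention for exons that span two contigs can be handled programmatically by separating the base exon identifier from the part number. The sketch below assumes that exon identifiers consist of the prefix CADAFUE followed by digits, as in the example above; that pattern is inferred from the single example and may not cover every identifier, and the function names are hypothetical. The typographic hyphen used in the printed identifiers is normalized to an ASCII hyphen before matching.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed pattern: 'CADAFUE' + digits, optionally followed by '-<part number>'.
EXON_ID = re.compile(r"^(?P<exon>CADAFUE\d+)(?:-(?P<part>\d+))?$")


def split_exon_id(identifier: str) -> Tuple[str, Optional[int]]:
    """Split an exon identifier into (base identifier, part number or None)."""
    normalised = identifier.strip().replace("\u2010", "-")
    match = EXON_ID.match(normalised)
    if match is None:
        raise ValueError(f"unrecognised exon identifier: {identifier!r}")
    part = match.group("part")
    return match.group("exon"), (int(part) if part is not None else None)


def group_parts(identifiers: Iterable[str]) -> Dict[str, List[int]]:
    """Map each base exon identifier to the sorted part numbers seen for it.

    Exons that lie on a single contig (no suffix) map to an empty list.
    """
    grouped: Dict[str, List[int]] = defaultdict(list)
    for identifier in identifiers:
        exon, part = split_exon_id(identifier)
        if part is None:
            grouped[exon]  # register the exon without adding a part
        else:
            grouped[exon].append(part)
    return {exon: sorted(parts) for exon, parts in grouped.items()}
```

For instance, `group_parts(["CADAFUE0000473-2", "CADAFUE0000473-1"])` groups both sub‐sequences under the single exon CADAFUE0000473.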
ProtView provides information about a particular protein. The summary table at the top of this report provides: (i) the CADRE protein identifier; (ii) the corresponding CADRE gene identifier; (iii) the name of the protein, manually transferred from the external sequence database entry to which the predicted protein mapped and (iv) how the protein was predicted. Below this table, the sequence is provided in FASTA format, with a link to the transcript within GeneView. ProtView is also able to provide information about any matches to family‐ or domain‐based databases [e.g. Pfam (17) and PRINTS (18)] and structural features (e.g. transmembrane, low complexity and coiled‐coil regions). However, this information has not yet been stored for the pilot sequence.
Other views available are CytoView and ExportView. CytoView allows navigation and display of much larger sections of an assembly than ContigView (i.e. up to 50 Mb can be shown). ExportView allows data to be downloaded as a FASTA sequence, a tab‐delimited feature list or a flat file in EMBL or GenBank format.
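Two of the export formats mentioned above, FASTA and a tab‐delimited feature list, are simple enough to illustrate directly. The sketch below is not the ExportView implementation: the function names, the 60‐character line width and the column set of the feature table are all assumptions chosen for the example.

```python
import csv
import io
from typing import Dict, Iterable, Union


def to_fasta(identifier: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as a FASTA record with fixed-width lines."""
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{identifier}"] + body) + "\n"


# Hypothetical column set for the tab-delimited feature list.
COLUMNS = ["id", "type", "start", "end", "strand"]


def to_feature_table(features: Iterable[Dict[str, Union[str, int]]]) -> str:
    """Format features as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for feature in features:
        writer.writerow([feature[column] for column in COLUMNS])
    return buffer.getvalue()


# Usage with made-up records (identifiers and coordinates are illustrative).
print(to_fasta("example_seq", "ACGTACGTAC", width=4), end="")
print(to_feature_table(
    [{"id": "gene_1", "type": "gene", "start": 1, "end": 10, "strand": "+"}]
), end="")
```

The EMBL and GenBank flat‐file formats are considerably more involved and are not sketched here.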
For all of the above views, a search box is provided, allowing searches against any of the main features present in the database (i.e. sequence, gene, transcript and peptide) using identifiers or descriptions.
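A search of this kind amounts to matching a query against the identifier and description of each stored feature, optionally restricted to one feature type. The following in‐memory sketch illustrates the idea only; the `Feature` record, the case‐insensitive substring matching and the example data are assumptions, and the actual search in CADRE runs against the underlying database rather than a Python list.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Feature:
    kind: str         # 'sequence', 'gene', 'transcript' or 'peptide'
    identifier: str
    description: str


def search(features: Iterable[Feature], query: str,
           kind: Optional[str] = None) -> List[Feature]:
    """Return features whose identifier or description contains the query.

    Matching is case-insensitive; an empty query returns no results.
    """
    needle = query.strip().lower()
    if not needle:
        return []
    return [
        feature for feature in features
        if (kind is None or feature.kind == kind)
        and (needle in feature.identifier.lower()
             or needle in feature.description.lower())
    ]


# Made-up records for illustration; these are not real CADRE entries.
records = [
    Feature("gene", "GENE_0001", "nitrate reductase"),
    Feature("peptide", "PEP_0001", "nitrate reductase"),
    Feature("gene", "GENE_0002", "hypothetical protein"),
]
hits = search(records, "nitrate", kind="gene")
```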
To address the need for continuing management and ongoing annotation of Aspergillus genomic data within CADRE, we are implementing an automated annotation pipeline. We will also establish a community annotation notice board to aid manual annotation. Our policy is to provide reusable code, which will be made available for other groups using Ensembl‐based databases. For all stored genomes, we will eventually provide two sets of transcripts: (i) predicted transcripts, those originally annotated by the sequencing centres, and (ii) revised transcripts, those annotated by the sequencing centres and edited over time to reflect current public sequence databases and literature.
Other areas of development will be ‘views’ that form part of the standard Ensembl system but have not yet been implemented in CADRE, e.g. BLASTView and SyntenyView. BLASTView will allow similarity searches against DNA and protein sequences within CADRE. SyntenyView will allow us to provide information pertaining to the conservation of large‐scale gene order between any two stored Aspergillus genomes.
Using Ensembl as a foundation, we have provided the Aspergillus research community with a platform for collaboration. Through data integration and its range of analysis tools, CADRE will increasingly support comparative genomics and functional analyses of an important group of fungi, thereby establishing its role as a central Aspergillus data repository.
We wish to thank the Pathogen Sequencing Unit (Sanger) for providing the primary annotation of the pilot sequence. We also wish to thank the Pathogen Sequencing Unit, the Ensembl Team (Sanger/European Bioinformatics Institute) and the Fungal Genomics Laboratory (North Carolina State University) for useful discussions on implementing CADRE. CADRE is funded by the Wellcome Trust (grant no. 062322).