The Diatom Expressed Sequence Tag (EST) Database was constructed to provide integral access to ESTs from these ecologically and evolutionarily interesting microalgae. It has now been updated with 130 000 Phaeodactylum tricornutum ESTs from 16 cDNA libraries and 77 000 Thalassiosira pseudonana ESTs from seven libraries, derived from cells grown in different nutrient and stress regimes. The updated relational database incorporates results from statistical analyses such as log-likelihood ratios and hierarchical clustering, which help to identify differentially expressed genes under different conditions, and allow similarities in gene expression in different libraries to be investigated in a functional context. The database also incorporates links to the recently sequenced genomes of P. tricornutum and T. pseudonana, enabling an easy cross-talk between the expression pattern of diatom orthologs and the genome browsers. These improvements will facilitate exploration of diatom responses to conditions of ecological relevance and will aid gene function identification of diatom-specific genes and in silico gene prediction in this largely unexplored class of eukaryotes. The updated Diatom EST Database is available at http://www.biologie.ens.fr/diatomics/EST3.
Diatoms are globally distributed, eukaryotic brown microalgae that participate in various biogeochemical cycles and play key roles in maintaining the ecological balance of the earth. They are major contributors to global primary production and CO2 sequestration (1,2), and are also receiving attention as a potential source of biofuels (3). They fall within the heterokont branch of the eukaryotic tree (4) and are believed to have evolved from a secondary endosymbiotic process (5–7). The molecular and cellular biology of diatoms is dramatically underexplored. Previous Expressed Sequence Tag (EST) studies (8,9) together with the first whole genome sequences from diatoms, Thalassiosira pseudonana (10) and Phaeodactylum tricornutum (11), have shown that less than 50% of diatom genes can be assigned a putative function using homology-based methods, due to the lack of genomic information from well studied taxonomically related organisms. Similar observations were also made in a pilot study of ESTs derived from the polar diatom Fragilariopsis cylindrus grown at low temperature (12). Our earlier diatom EST database (9) enabled comparative studies of eukaryotic algal genomes and revealed some interesting differences in genes involved in basic cell metabolism (13,14). It also aided the study of key signalling and regulatory pathways (15), silica metabolism (16,17), nitrogen metabolism (18) and carbohydrate metabolism (19).
Furthermore, elucidation of the functions of diatom-specific genes can be facilitated by identifying conditions in which they are expressed. Non-normalized EST libraries made from cells grown under different conditions can therefore provide a good dataset for comparative, functional and phylogenetic studies. For example, a comparative study of the mRNAs expressed under different conditions can provide a systematic exploration of how a cell adapts at the molecular level through differential gene expression. As a case in point, EST collections derived from cells grown under different conditions have proven to be a good tool for transcriptomics studies and genome annotation in the green alga Chlamydomonas reinhardtii (20–24). By comparing the expression profiles from more than one growth condition, differential gene expression studies can therefore provide a useful means to explore diatom gene function and genome annotation.
In this update we describe EST collections derived from diatom cells grown under different conditions and statistical methods used to explore gene expression. This digital gene expression database contains more than 200 000 ESTs from the two recently sequenced diatom genomes, T. pseudonana (10) and P. tricornutum (11). T. pseudonana is a centric diatom and has been a model organism for physiological studies of widely distributed species belonging to the order Thalassiosirales. P. tricornutum is a pennate diatom for which a range of reverse genetics tools have been generated (25), therefore making it a good model for functional genomic studies. The sequenced diatoms revealed many interesting features of diatom genes and metabolic pathways, although comparative studies also revealed a high level of molecular divergence (11,15). Bearing in mind these striking differences, the updates in the Diatom EST Database described here provide key insights into differential gene expression in diatoms grown in a range of ecologically relevant conditions.
DATA SOURCES AND DATABASE CONSTRUCTION
The Diatom EST Database was initially made with 12 136 ESTs from P. tricornutum and 15 174 ESTs from T. pseudonana, each obtained from a single growth condition (9). These libraries were expanded with 120 411 ESTs from P. tricornutum and 61 913 ESTs from T. pseudonana obtained from cells grown in 15 and 6 additional growth conditions, respectively. The new sets of ESTs were subjected to preliminary processing, including vector clipping and quality control (9), and sequence assembly and redundancy checking were then done in two steps. First, the ESTs were clustered together with the predicted gene models from their respective genomes (http://genome.jgi-psf.org/Thaps3/Thaps3.home.html and http://genome.jgi-psf.org/Phatr2/Phatr2.home.html). We were able to assign 120 575 ESTs to 8944 of the 10 402 gene models in P. tricornutum and 43 114 ESTs to 7268 of the 11 776 gene models in T. pseudonana using the BLASTN programme (cut-off e-value 10^-10) (26). These 8944 and 7268 transcriptional units (TUs) with predicted gene models were directly added to the non-redundant transcript sets with new sequence identifiers containing ‘G’ as a prefix along with the gene model identifier, e.g., G10065 for gene model 10065. The number of ESTs clustering to each gene model gives the redundancy or cluster size of the transcript. Secondly, transcripts which did not have a predicted gene model (11 513 ESTs from P. tricornutum and 18 073 ESTs from T. pseudonana), mainly because ESTs from only a few libraries were used for training the gene prediction programmes (11), were subjected to analysis by CAP3 (27). Sequences with greater than 95% identity over a region longer than 30 base pairs were clustered using this programme, yielding 1330 contigs and 2096 singletons for P. tricornutum and 1769 contigs and 2039 singletons for T. pseudonana. These were added to the non-redundant transcript set with sequence identifiers starting with ‘C’ for contigs and ‘S’ for singletons.
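The first clustering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: it assumes BLASTN hits are already available as (EST identifier, gene model identifier, e-value) tuples, assigns each EST to its lowest e-value hit if that hit passes the 10^-10 cut-off, and names each TU with the ‘G’ prefix described above. The function name, the tuple layout and the example hits are hypothetical.

```python
from collections import Counter

E_VALUE_CUTOFF = 1e-10  # cut-off e-value stated in the text


def assign_ests_to_gene_models(hits, cutoff=E_VALUE_CUTOFF):
    """Return (cluster_sizes, unassigned) from hypothetical BLASTN hits.

    hits: iterable of (est_id, gene_model_id, e_value) tuples.
    An EST is assigned to the gene model of its lowest e-value hit,
    provided that e-value does not exceed the cut-off (assumed rule).
    """
    best = {}     # est_id -> (e_value, gene_model_id)
    seen = set()  # every EST that appeared in the hit list
    for est_id, model_id, e_value in hits:
        seen.add(est_id)
        if e_value > cutoff:
            continue
        if est_id not in best or e_value < best[est_id][0]:
            best[est_id] = (e_value, model_id)

    # TU identifier = 'G' + gene model identifier, e.g. G10065.
    # The count per identifier is the cluster size of that TU.
    cluster_sizes = Counter(f"G{model_id}" for _, model_id in best.values())
    # ESTs without an accepted hit would go on to the CAP3 step.
    unassigned = sorted(seen.difference(best))
    return cluster_sizes, unassigned


# Made-up hits, for illustration only.
example_hits = [
    ("est1", 10065, 1e-50),
    ("est2", 10065, 1e-12),
    ("est2", 20001, 1e-30),  # lower e-value, so est2 goes to model 20001
    ("est3", 10065, 1e-5),   # fails the cut-off, stays unassigned
]
sizes, leftover = assign_ests_to_gene_models(example_hits)
```

In this sketch `sizes` maps each ‘G’ identifier to its cluster size, and `leftover` lists the ESTs that would be passed to CAP3 for assembly into ‘C’ contigs and ‘S’ singletons.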
Adding the TUs with gene models to the contigs and singletons obtained from CAP3, we counted 12 370 non-redundant TUs in P. tricornutum and 11 076 TUs in T. pseudonana. Among the non-redundant TUs which do not have a predicted gene model, we found only 612 TUs in P. tricornutum and 1083 TUs in T. pseudonana that do not align to their respective genomes, likely because of remaining gaps in the genome sequences.
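The TU totals follow directly from the counts given above, as the short check below shows. The dictionary layout and key names are chosen here for illustration only; the numbers are those quoted in the text.

```python
# Counts quoted in the text for each species.
counts = {
    "P. tricornutum": {"gene_model_tus": 8944, "contigs": 1330, "singletons": 2096},
    "T. pseudonana": {"gene_model_tus": 7268, "contigs": 1769, "singletons": 2039},
}

# Non-redundant TUs = TUs with gene models + CAP3 contigs + CAP3 singletons.
totals = {species: sum(parts.values()) for species, parts in counts.items()}

for species, total in totals.items():
    print(f"{species}: {total} non-redundant TUs")
```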
The contribution of ESTs from different libraries to the cluster size of each TU gives the abundance of each expressed transcript across different libraries. The counts were normalized to the library size by converting the counts to frequencies, which allows a statistical comparison to be made of expression levels of transcripts in different conditions. Specifically, the log-likelihood ratio was calculated for each contig (28) to statistically validate whether a difference in frequency across different libraries was random or due to differential expression. The database schematized in Figure 1 provides access to frequency distribution plots (Figure 1E) and log-likelihood ratios (R-values) for each TU, which are catalogued by library (Figure 1C) as well as across libraries (Figure 1D and H). Figure 1E shows an example of a TU with high R-value (i.e. a gene that is strongly differentially expressed in the conditions tested). By cataloguing the TUs based on their R-values we were then able to identify transcripts that are differentially expressed under each condition. For example, transcripts expressed during iron limitation served as a useful starting point to explore the molecular response of P. tricornutum to life at low iron concentrations (29), providing experimental validation of our statistical methods. TUs were also subjected to hierarchical clustering (30) to identify transcripts with similar expression profiles in the different conditions. These analyses together with relevant functional information were visualized using Java Treeview (31). Figure 1F shows a screen shot of hierarchical clustering (30) of P. tricornutum contigs.
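The normalization and the R-value can be sketched as below. The text specifies only that a log-likelihood ratio was calculated following (28). The formula used here is an assumption: it is one commonly used form for EST counts across several libraries, R = sum over libraries of x_i ln(x_i / (N_i f)), where x_i is the count of the TU in library i, N_i is the library size and f is the pooled frequency. Function names and example counts are hypothetical.

```python
import math


def frequencies(counts, library_sizes):
    """Convert raw EST counts per library into frequencies."""
    return [x / n for x, n in zip(counts, library_sizes)]


def log_likelihood_ratio(counts, library_sizes):
    """Assumed R statistic for one TU across several libraries.

    Compares the observed counts with the counts expected if the TU
    had the same frequency in every library (the pooled frequency).
    Libraries with a zero count contribute nothing to the sum.
    """
    pooled = sum(counts) / sum(library_sizes)
    r = 0.0
    for x, n in zip(counts, library_sizes):
        if x > 0:
            r += x * math.log(x / (n * pooled))
    return r


# Made-up example: two libraries of 1000 ESTs each.
sizes = [1000, 1000]
r_even = log_likelihood_ratio([10, 10], sizes)  # same frequency in both
r_skew = log_likelihood_ratio([20, 0], sizes)   # found in one library only
```

Under this assumed form, a TU with the same frequency in every library has an R-value of zero, and the value grows as the counts become more unevenly distributed. This matches the use of high R-values above to flag differentially expressed transcripts.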
The updated dataset and the accompanying results are stored on upgraded servers running Debian Linux ‘etch’: a DELL1850 hosting the relational database PostgreSQL 8.3 and a DELL1855 hosting the Apache 2.0 web server and PHP 5. The relational database was migrated to PostgreSQL for faster access and to enable the dynamic clustering of expression data. The new web interface is also linked to the gene models on the JGI diatom genome browsers (http://genome.jgi-psf.org/Thaps3/Thaps3.home.html and http://genome.jgi-psf.org/Phatr2/Phatr2.home.html), which gives the user direct access to annotation and gene structure for each TU (Figure 1G).
DATABASE CONTENTS AND WEB INTERFACE
The database provides access to details of each cDNA library and corresponding growth conditions (Figure 1A). The raw sequences are catalogued by library and each raw sequence table gives access to DNA sequence, length and BLAST output. These tables also provide links to the TU that each sequence belongs to. The contig tables give access to the TU of each library, catalogued based on the abundance of ESTs in each condition (Figure 1C). The cluster size of each TU is linked to the dynamically generated frequency plot (Figure 1E), which enables comparison of expression levels in the other libraries. This table also shows R-values and the best BLAST results.
The expression of each TU across all the libraries can be accessed by two different methods (Figure 1D), either in tabular form (Figure 1H) or as a hierarchical cluster visualized using Java Treeview (Figure 1F). The tabular view gives access to all TUs expressed more than once in any given condition and they are catalogued based on cluster size, which is again linked to each frequency plot (Figure 1E). This table also provides a link to the ortholog if present in the other diatom and its expression profile, as well as the corresponding gene models hyperlinked to the genome databases hosted at JGI, providing access to further functional annotation and visualization of neighbouring genes (Figure 1G). The Java Treeview visualizes the two-way hierarchical clustering of all the transcripts which are expressed more than once, helping to identify libraries that cluster together and transcripts with similar expression patterns. The annotations for each TU are hyperlinked to the frequency plots and to the JGI genome browsers.
The new web interface is inspired by Google, with a simplified, self-explanatory look and easy retrieval of data. The database can be queried by keyword, based on annotation from homology search methods, or by TU identifier, and sequences can be retrieved using either the sequence identifier or the TU identifier. Homology searches using BLAST against each library and against the total non-redundant sets are also available via the web interface.
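A keyword or TU-identifier query of the kind described above could look like the following sketch. The actual database schema is not given in the text, so the table name, column names and rows here are hypothetical. An in-memory SQLite database from the Python standard library stands in for PostgreSQL so that the example is self-contained.

```python
import sqlite3

# In-memory stand-in for the relational database (the real one is PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tu (tu_id TEXT PRIMARY KEY, annotation TEXT, cluster_size INTEGER)"
)
# Placeholder rows; identifiers follow the G/C/S prefix scheme, annotations are made up.
conn.executemany(
    "INSERT INTO tu VALUES (?, ?, ?)",
    [
        ("G10065", "placeholder annotation: transporter", 42),
        ("C17", "placeholder annotation: kinase", 5),
        ("S3", "placeholder annotation: transporter-like", 1),
    ],
)


def search(connection, term):
    """Return TUs whose annotation contains term or whose identifier equals it."""
    return connection.execute(
        "SELECT tu_id, annotation, cluster_size FROM tu "
        "WHERE annotation LIKE ? OR tu_id = ? "
        "ORDER BY cluster_size DESC",
        (f"%{term}%", term),
    ).fetchall()


by_keyword = search(conn, "transporter")
by_identifier = search(conn, "C17")
```

Passing the search term as a bound parameter rather than inserting it into the SQL string keeps the query safe against malformed input from a web form.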
The diatom genomic repository is rapidly expanding with several sequencing projects. For example, the genomes of two additional pennate diatoms, Pseudo-nitzschia multiseries and F. cylindrus, are currently nearing completion at JGI, together with accompanying EST collections. The database analysis and pipeline described here are semi-automated and can easily incorporate these and other data sets from diatoms and related species. Pilot microarray projects in T. pseudonana and P. tricornutum have already provided experimental validation for this EST-based digital transcriptomics database under some conditions (29,32) and possibilities to link microarray based studies to the existing database are currently being explored, as is the incorporation of transcriptomics data from massively parallel sequencing platforms. Reverse genetics studies are providing additional experimental validation for the expression, localization and functions of individual TUs (33) and so information derived from the database can also be used to train the gene prediction programmes to improve in silico gene annotation in diatoms and related organisms.
The Diatom EST Database is freely available on the web at http://www.biologie.ens.fr/diatomics/EST3. The P. tricornutum ESTs have been submitted to the NCBI dbEST (GenBank accession numbers CD374840–CD384835, BI306757–BI307753, CT868744–CT950687 and CU695349–CU740080). Requests for bulk queries of the expression data and to house EST data from other diatoms can be addressed to Dr Chris Bowler.
Partial funding for the Diatom EST Database was obtained from the EU-funded Diatomics (LSHG-CT-2004-512035) and Marine Genomics Europe projects (GOCE-CT-2004-505403) and the Agence Nationale de la Recherche. P. tricornutum ESTs were funded by Genoscope (Evry, Paris). Generation of T. pseudonana ESTs was funded by a Gordon and Betty Moore Foundation Marine Microbiology Investigator Award (EVA). Funding for open access charge: Centre National de la Recherche Scientifique.
Conflict of interest statement. None declared.
P. tricornutum ESTs were generated and sequenced by Genoscope (Evry, Paris) and the T. pseudonana ESTs were sequenced by JGI, USA. We are grateful to Pierre Vincens and Jean-Pierre Roux for managing the server and the software, to Andrew Allen and Kamel Jabbari for their help and suggestions, to Igor V. Grigoriev and Alan Kuo at JGI for providing links to gene models, and to Alok J. Saldanha for his help in integrating Java Treeview into the database. T. pseudonana cultures were grown with the help of Karie Holtermann. The contact information of the people responsible for each P. tricornutum library can be obtained from the database.