CancerEST: a web-based tool for automatic meta-analysis of public EST data

The identification of cancer-restricted biomarkers is fundamental to the development of novel cancer therapies and diagnostic tools. The construction of comprehensive profiles to define tissue- and cancer-specific gene expression has been central to this. To this end, the exploitation of the current wealth of ‘omic’-scale databases can be facilitated by automated approaches, allowing researchers to directly address specific biological questions. Here we present CancerEST, a user-friendly and intuitive web-based tool for the automated identification of candidate cancer markers/targets, for examining tissue specificity as well as for integrated expression profiling. CancerEST operates by means of constructing and meta-analyzing expressed sequence tag (EST) profiles of user-supplied gene sets across an EST database supporting 36 tissue types. Using a validation data set from the literature, we show the functionality and utility of CancerEST. Database URL: http://www.cancerest.org.uk


Introduction
Identifying novel candidate markers/targets is a key challenge in the development of cancer therapies (1). Tissue- and cancer-specific gene expression profiles provide information about the potential of genes to serve as clinical markers (2). Thus, accessible and automated approaches for analyzing the current wealth of 'omic'-scale data are required to facilitate the full exploitation of expression data. Expressed sequence tags (ESTs) are short DNA sequences (200-500 nucleotides) generated by sequencing the 5′ and/or 3′ ends of cDNAs that are subsequently clustered and counted (3). In the past decade, a large amount of EST data has been deposited in public repositories such as dbEST (4), which currently holds records of 8 692 773 human ESTs. Unigene has grouped these expression data into clusters and assigned them to genes, facilitating the indexing of the EST data (5). Pipelining the retrieval, the integration and the high-throughput investigation of such data in a fashion specifically tailored to the interests of the user should facilitate wider application by putting EST data in the hands of researchers directly addressing focused biological questions, without requiring the involvement of bioinformaticians. Integration and subsequent investigation of EST data can not only enhance reliability and generalizability of results but can also reveal a comprehensive expression profile across numerous tissues, which can be used to uncover information about tissue-specific expression, cancer expression and, above all, cancer marker/target potential (6). For example, Kim et al. (7) and Campagne and Skrabanek (8) identified potential cancer markers by means of EST data analyses, whereas Hofmann et al. (9) used EST data, reverse transcription polymerase chain reaction (RT-PCR) and other high-throughput gene expression data to evaluate the tissue specificity and the cancer gene expression profiles of previously published cancer testis (CT) genes, a group of genes widely used in clinical applications (10).

© The Author(s) 2014. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Here we present CancerEST, a freely accessible pipeline with a user-friendly and intuitive web interface to provide automated high-throughput investigation of public EST data with user-defined sets of biologically significant and related genes to determine (i) their cancer marker/target potential, (ii) their tissue specificity and (iii) their comprehensive expression profiles across 36 tissues (Supplementary Table S1). The underlying method was developed for a previously published study, where we identified a cohort of novel cancer-specific marker genes (11), and has been improved and automated to provide the basis of CancerEST. The tool provides intuitive data analysis and visualizations and allows biologists/clinicians without skills in bioinformatics to exploit the wealth of publicly available data presented by modern databases. It serves to narrow the overwhelming number of putative target genes down to a manageable number of candidates, which can be followed up in the laboratory. To validate our approach, we analyzed a list of testis-restricted genes from the literature (9) and could reproduce the published results.

Methods and structure of CancerEST
CancerEST consists of a web interface, pipelined analyses and three relational databases: one holding the analysis data, one holding the Unigene data and one holding the gene annotation data. The principal workflow is shown in Figure 1.

The CancerEST web interface
First, the CancerEST web interface handles the user specifications and mapping of the user-supplied gene list as well as the job submission. Second, it allows the user to view and download the analysis results and visualizations. When submitting a new job, the user provides a text file consisting of Unigene Cluster IDs, Entrez IDs or curated gene names; these identifiers are then mapped to their appropriate Unigene Cluster IDs to show the user which genes can be fed into the analysis. Furthermore, the user has to specify a tissue focus, in which submitted genes are allowed to show expression in normal individuals; for example, the testis might be of interest to the user, as it is an immunologically privileged tissue (12). The user can optionally select one or more interfering tissues, in which submitted genes are allowed to show additional expression in normal individuals; for example, brain tissue could be selected by the user, as various genes that were originally assumed to be testis-restricted are also expressed in the brain, another immunologically privileged tissue (13). Finally, the job can be submitted by providing an email address.
When viewing a finished job, the results of the analysis and the visualizations are presented to the user in a simple-to-use web interface. All result files are also available for download. The web site makes use of cookies to ensure that a user only has access to his/her own data and thus can access secure areas of the web site (refer to the CancerEST help section available at http://www.cancerest.org.uk/help.html for more information). We also provide an example data set on our web site (available at http://www.cancerest.org.uk).

EST data retrieval, data quality and CancerEST databases
We obtained the complete data available from the Unigene database (Unigene Build #230) (5) and set up a local MySQL database. We excluded ESTs from normalized and subtracted cDNA libraries (6) as well as cDNA libraries deriving from uncharacterized, mixed or embryonic/fetal tissues. The exclusion of cancer cell-line libraries is optional and can be specified by the user. Furthermore, we kept only libraries from cancerous and healthy tissues, and thus excluded libraries deriving from diseases other than cancer. All ESTs of a given tissue type $t$ were merged to a meta-library. However, meta-libraries with a combined EST count for healthy and cancerous tissues below 10 000 were excluded to ensure significance, resulting in cancer and normal meta-libraries for 36 tissue types (Supplementary Table S1, Supplementary Figure S1). For each Unigene cluster, the global expression profile in cancerous and healthy tissues is computed by EST counting, following the concept of the Unigene EST profiles (5). The expression profiles in cancerous and healthy tissues are normalized by calculating the transcripts per million $\mathrm{tpm}_{t,c}$, where $m_{t,c}$ is the number of ESTs for a given cluster $c$ and for a given tissue type $t$, and $n_t$ is the total number of ESTs for that given tissue type $t$:

$$\mathrm{tpm}_{t,c} = \frac{m_{t,c}}{n_t} \times 10^{6}$$

For annotation purposes, the Ensembl database (14) and the HUGO Gene Nomenclature Committee (HGNC) database (15) were established as a local MySQL database.
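The meta-library construction and normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the CancerEST implementation: the function names, the tuple layout of the input and the choice to keep separate totals for normal and cancer meta-libraries of a tissue are assumptions made for this example; only the 10 000 EST threshold and the transcripts-per-million scaling come from the text.

```python
from collections import defaultdict

# Threshold from the text: combined healthy + cancer EST count per tissue.
MIN_META_LIBRARY_SIZE = 10_000


def build_meta_libraries(libraries):
    """Merge library EST totals into per-(tissue, state) meta-library sizes.

    `libraries` is an iterable of (tissue, state, total_ests) tuples, where
    state is "normal" or "cancer" (hypothetical input layout). Tissues whose
    combined total falls below the threshold are dropped.
    """
    totals = defaultdict(int)
    for tissue, state, total_ests in libraries:
        totals[(tissue, state)] += total_ests

    combined = defaultdict(int)
    for (tissue, _state), size in totals.items():
        combined[tissue] += size

    return {
        key: size
        for key, size in totals.items()
        if combined[key[0]] >= MIN_META_LIBRARY_SIZE
    }


def tpm(m_tc, n_t):
    """Transcripts per million for m_tc cluster ESTs in a meta-library of n_t ESTs."""
    if n_t <= 0:
        raise ValueError("meta-library size must be positive")
    return m_tc * 1_000_000 / n_t
```

For instance, a cluster with 5 ESTs in a meta-library of 250 000 ESTs is scaled to 20 transcripts per million.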

The CancerEST pipeline
The pipeline handles the EST meta-analysis, the annotation and the visualizations. For each of the submitted genes, the expression profile is examined to determine the expression in the user-specified tissue focus, in possible interfering tissues, in all other healthy tissues as well as in all cancer-derived tissues. Thus, the weighted average $\mathrm{tpm}_{av,c}$ for these four tissue groups is computed, where $w_t$ is the weight of the given tissue $t$ belonging to the set of tissues $g$, represented by the size of the meta-library:

$$\mathrm{tpm}_{av,c} = \frac{\sum_{t \in g} w_t \, \mathrm{tpm}_{t,c}}{\sum_{t \in g} w_t}$$

Genes are sorted into four classes according to their expression profile to provide information about their potential as cancer antigen-encoding genes: (i) tissue focus-restricted expression in normal individuals as well as cancer expression (Class 1), (ii) tissue focus- and interfering tissue-restricted expression in normal individuals as well as cancer expression (Class 2), (iii) tissue focus- and/or interfering tissue-restricted expression in normal individuals but no cancer expression (Class 3) and (iv) somatic expression in normal individuals (Class 4). The classes are designated with an 'a' if no focus expression was found.
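The weighted average and the four-class scheme can be illustrated with a short Python sketch. It is a hedged reading of the description above, not the pipeline's code: the function names are hypothetical, "expression" is assumed to mean a group average greater than zero, and the 'a' suffix is assumed to apply to any class when the tissue focus shows no expression.

```python
def weighted_average_tpm(tpm_by_tissue, weights, group):
    """Weighted mean tpm over the tissues in `group`.

    `weights` maps each tissue to its meta-library size (w_t); tissues
    missing from `tpm_by_tissue` are treated as not expressed.
    """
    total_weight = sum(weights[t] for t in group)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[t] * tpm_by_tissue.get(t, 0.0) for t in group)
    return weighted_sum / total_weight


def classify(focus, interfering, other_healthy, cancer):
    """Assign Class 1-4 from the four group averages (assumed cutoff: tpm > 0)."""
    if other_healthy > 0:
        label = "4"  # somatic expression in normal individuals
    elif cancer > 0:
        # restricted normal expression plus cancer expression
        label = "2" if interfering > 0 else "1"
    else:
        label = "3"  # restricted normal expression, no cancer expression
    # Assumption: the 'a' designation marks missing focus expression.
    return label + ("a" if focus <= 0 else "")
```

With these assumptions, a gene expressed only in the tissue focus and in cancer falls into Class 1, while the same gene without any focus expression is reported as Class 1a.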
The genes are also sorted into four states to provide information about tissue specificity: (i) tissue-specific (Classes 1-3), (ii) highly tissue-selective ($\sum \mathrm{tpm}_{t,c} \le 2$ for all other healthy tissues), (iii) tissue-selective ($\sum \mathrm{tpm}_{t,c} \le 5$ for all other healthy tissues) and (iv) enriched (the $\mathrm{tpm}_{av,c}$ of the tissue focus is twice the $\mathrm{tpm}_{t,c}$ of each of the other healthy tissues).
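As a rough illustration, the four tissue-specificity states can be expressed as a cascade of checks. The sketch below rests on several assumptions that are not confirmed by the text: the thresholds 2 and 5 are treated as inclusive upper bounds on the summed tpm of the other healthy tissues, "twice" is read as "at least twice", and a fifth label, "unclassified", is introduced for genes that meet none of the criteria. The function name and argument layout are hypothetical.

```python
def specificity_state(expression_class, focus_avg_tpm, other_healthy_tpms):
    """Return the tissue-specificity state for one gene.

    expression_class   -- class label such as "1", "2a" or "4"
    focus_avg_tpm      -- weighted average tpm of the tissue focus
    other_healthy_tpms -- tpm values of every other healthy tissue
    """
    # Classes 1-3 (with or without the 'a' suffix) count as tissue-specific.
    if expression_class.rstrip("a") in {"1", "2", "3"}:
        return "tissue-specific"

    total = sum(other_healthy_tpms)
    if total <= 2:  # assumed inclusive bound
        return "highly tissue-selective"
    if total <= 5:  # assumed inclusive bound
        return "tissue-selective"
    # Assumed reading of "twice": at least twice each other healthy tissue.
    if all(focus_avg_tpm >= 2 * value for value in other_healthy_tpms):
        return "enriched"
    return "unclassified"  # label added for this sketch only
```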
To evaluate the upregulation of genes of interest in cancer, the significance of upregulation is assessed using Fisher's exact test (16). Genes with a p-value <0.05 or with expression in cancerous meta-libraries but not in the corresponding healthy meta-libraries are considered to be upregulated in these cancer types.
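The upregulation rule can be sketched with a small, dependency-free Fisher's exact test. The text does not specify how the contingency table is built or whether the test is one- or two-sided, so the following is an assumption: for one tissue type, the 2x2 table contrasts ESTs of the gene against all remaining ESTs in the cancer and the healthy meta-library, and a one-sided test for enrichment in cancer is used. Function names are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return min(1.0, numerator / comb(n, col1))


def is_upregulated(m_cancer, n_cancer, m_normal, n_normal, alpha=0.05):
    """Apply the rule from the text to one gene in one tissue type.

    m_* are EST counts of the gene, n_* the meta-library sizes.
    """
    # Expression in cancer but not in the corresponding healthy meta-library.
    if m_cancer > 0 and m_normal == 0:
        return True
    p_value = fisher_exact_greater(
        m_cancer, n_cancer - m_cancer, m_normal, n_normal - m_normal
    )
    return p_value < alpha
```

Exact integer arithmetic via `math.comb` keeps the sketch free of external libraries; for large EST counts a statistics package would be the more practical choice.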

Use of CancerEST
CancerEST was developed as a user-friendly and intuitive tool to compute cancer marker/target potential as well as to obtain comprehensive expression profiles and information about the tissue specificity for genes of interest to biologists/clinicians. The CancerEST web interface for viewing the analysis results consists of three sections: the overview, the information and the result section. The overview section provides basic information about the submitted job and a brief explanation of how to interpret the results. The information section includes, among other things, the annotated genes of interest and the 36 tissue types supported by CancerEST. The result section includes the EST meta-analysis results comprising a list of genes ranked according to (i) their cancer marker potential or (ii) their tissue specificity. Furthermore, a comprehensive expression profile across 36 healthy and cancerous tissues is available for each gene. Circos plots visualize the analysis results in their entirety to highlight relationships between the genes and the cancer types. In contrast, bar charts show the complete expression profile across 36 healthy and cancerous tissues for each gene separately. For more information, the CancerEST help section provides detailed documentation, available at http://www.cancerest.org.uk/help.html.

Validation
We used the 39 tight testis-restricted genes determined by Hofmann et al. as a validation data set (four genes could not be mapped to a Unigene cluster ID or to an HGNC gene name, resulting in 35 genes that could be evaluated). Hofmann et al. evaluated the tissue- and cancer-specific expression of 153 CT genes previously published in the CTdatabase (18) using high-throughput expression data in combination with RT-PCR data (9). We selected 'testis' as tissue focus and chose 'brain' as interfering tissue, as it has been shown that various CT genes also exhibit expression in brain tissue (13). To be in accordance with Hofmann et al., we additionally allowed placental gene expression and included cancer cell-line libraries. CancerEST determined 25 of these genes as not expressed in any healthy tissue or as tight testis-restricted (Supplementary Table S2). Additionally, seven genes were found to show limited evidence for brain expression, which could have been below the threshold of Hofmann et al. (Supplementary Table S2). Three genes showed limited evidence for expression in healthy tissues (Supplementary Table S2), which has also been reported in the literature (19). In total, CancerEST predicted that 19 genes have high cancer marker/target potential by exhibiting a testis- or testis-brain-restricted expression profile as well as cancer expression (Figure 2, Supplementary Table S2). For example, the gene MAGEA1, the first CT gene to be discovered (20), is, according to CancerEST, expressed in various cancers including melanoma, lung cancer, breast cancer and bone and connective tissue sarcomas (Figure 3), an observation that is extensively supported by the literature (21)(22)(23)(24)(25).
The results are consistent with Hofmann et al.; however, CancerEST uses a very stringent cutoff, which could explain the weak evidence for expression in the brain that was found for seven genes as well as the limited evidence for expression in healthy tissues that was found for three genes. Furthermore, as more EST data become available, the predictions become increasingly accurate, and CT genes originally believed to have testis-restricted expression profiles have had to be reclassified as testis-selective (9,13). An alternative explanation for the limited evidence for expression in healthy tissues could be undiagnosed neoplastic change in the tissues analyzed, as many normal tissues are extracted from tissue obtained post mortem and are often pooled from tissues from a number of individuals, many of whom were aged at the time of death. In support of this, Chen et al. found discrepancies concerning the expression of some genes in normal tissues, as they detected expression in tissues from one panel of normal tissues, but could not detect expression in similar tissue types from a distinct second source (26). Thus, genes with testis-selective profiles could indeed be suitable candidates and be of clinical use.

Database, Vol. 2014, Article ID bau024, doi:10.1093/database/bau024

[...] range of CT gene expression was found in melanoma (9), which is consistent with our results (Figure 2) and the literature (27).

Comparison to databases and tools currently available
Several tools exist that exploit EST data to construct integrated expression profiles; for example, TissueInfo (34) and TiGER (35) determine the tissue specificity for a given gene or tissue-specific genes for a given tissue, but do not evaluate cancer expression or cancer marker/target potential and, importantly, neither allows the analysis of sets of genes. In contrast, the Unigene tool Digital Differential Display (DDD) (5) compares EST profiles of user-defined EST libraries to identify genes with significantly different expression levels, and another Unigene tool, the EST Profile Viewer (5), shows the approximate expression profile for a given gene. However, neither of the two focuses on the cancer marker/target potential for a set of related genes. Several other tools were published but appear to be currently unavailable [DigiNorthern (36), ZooDDD (37), GBA server (38)]. Therefore, to our knowledge, no simple-to-use web tool comparable to CancerEST is available that computes the cancer marker/target potential, the tissue specificity and comprehensive expression profiles for a set of genes of interest to biologists/clinicians.

Conclusion
In summary, we present CancerEST, an integrated bioinformatic analytical pipeline that was used to automate the identification of novel candidate cancer markers/targets and/or to determine the tissue specificity by means of constructing and analyzing the EST expression profiles of user-supplied gene lists across 36 tissue types. Furthermore, such an automated pipeline with a simple-to-use web interface puts an integrated EST analysis in the hands of researchers who are directly addressing biological questions.

Supplementary Data
Supplementary data are available at Database Online.