The Human Anatomic Gene Expression Library (H-ANGEL), the H-Inv integrative display of human gene expression across disparate technologies and platforms

The Human Anatomic Gene Expression Library (H-ANGEL) is a resource for information concerning the anatomical distribution and expression of human gene transcripts. The tool contains gene expression data from multiple platforms that have been associated with both manually annotated full-length cDNAs from H-InvDB and RefSeq sequences. Of the H-Inv predicted genes, 18 897 have associated expression data generated by at least one platform. H-ANGEL utilizes categorized mRNA expression data from both publicly available and proprietary sources. It incorporates data generated by three types of methods from seven different platforms. The data are provided to the user in the form of a web-based viewer with numerous query options. H-ANGEL is updated with each new release of cDNA and genome sequence build. In future editions, we will incorporate the capability for expression data updates from existing and new platforms. H-ANGEL is accessible at http://www.jbirc.aist.go.jp/hinv/h-angel/.


*To whom correspondence should be addressed. Tel: +81 3 5531 8550; Fax: +81 3 5531 8551; Email: mtanino@jbirc.aist.go.jp

INTRODUCTION
Genome-scale analyses of gene expression have grown exponentially in the last few years, providing clues to the function of genes and genomes and helping our understanding of the molecular basis of health and disease. A growing number of technological platforms are available for conducting these studies, including solid-support approaches, such as oligonucleotide (1) and cDNA arrays (2,3), PCR-based high-throughput expression profiling methods, such as introduced amplified fragment length polymorphism (iAFLP) (4), and random tag identification, such as the serial analysis of gene expression (SAGE) (5) or massively parallel signature sequencing (MPSS) (6). Integrating all the data already produced has the potential to provide unique insight into the expression pattern of a whole genome. In order to assess the compatibility of data, several research groups have compared results from distinct types of high-throughput expression technologies (7,8).
Comparisons have been made at the gene level, and some groups have reported good correlations between results produced by different techniques (9). Nonetheless, others have reported significant discrepancies between the output of certain techniques (10,11). However, no group has yet tried to integrate expression data at the resolution of transcript variation with the intention of resolving discrepancies in gene-level comparisons or across multiple platforms. The Human Anatomic Gene Expression Library (H-ANGEL) was developed for the first international annotation jamboree of the human transcriptome, entitled the Human Full-length cDNA Annotation Invitational (H-Invitational) (12). During the jamboree, the tool was used to present expression data from different methods and platforms in a manner that aided the manual annotation of predicted loci. We combined publicly available expressed sequence tag (EST), SAGE and microarray data with proprietary gene expression data that were generated and analyzed by members of the H-Invitational consortium.
H-ANGEL is the first step towards a global analysis (meta-analysis) of gene expression data, providing an overview of consistencies and discrepancies between expression data generated by different platforms. It is hoped that this display will help us to appreciate the strengths and limitations of the different technologies available, so that in future studies the maximum amount of useful information can be derived from the appropriate use of each method.

Data resources
One of the distinctive features of H-ANGEL is that it contains a substantial amount of disparate and unique data brought together for the jamboree. A large proportion of proprietary data has been created to answer specific questions. For this reason, the experimental information associated with the data can vary in quality and is often brief in content or limited in utility because of intellectual property issues (13). Nevertheless, we found that the benefits of including such data outweighed the caveats inherent in performing an analysis involving this kind of proprietary data. Gene expression data were collected as follows: (i) iAFLP profiling data were generated as described previously (4) using 22 987 primers corresponding to 14 431 independent UniGene clusters for competitive RT-PCR with mRNA from 71 tissue samples. (ii) Long oligomer microarray (Oligoarray) data were generated by dual-color competitive hybridization, in which commercially available pools of human tissue RNAs were hybridized against custom-made oligomers of between 50 and 60 nt in length. (iii) MPSS data for human mRNAs were generated by the Lynx Corporation (6) for the National Institute of Genetics. The manufacturer also provided the tag-to-gene mapping, based on the position and direction of the tags (14), but it may require further qualification. (iv) For the custom-made cDNA array (cDNAarray), total and poly(A)+ RNAs were purchased from Clontech (Palo Alto, CA) and Stratagene (La Jolla, CA). Probes were prepared using a direct labeling protocol with a reference design experiment (e.g. each sample versus a universal reference), and dual-color hybridizations with dye swap were performed on human cDNA glass slides (15,16).
Data already in the public domain were processed as follows. In UniGene Release 157 (17), the number of EST clones from libraries representing normal adult tissues without normalization or subtraction steps amounted to 745 446 in total. They were combined with 91 509 tag sequences from BodyMap (18), which represented 53 normal adult tissue libraries. The counts of cognate clones were based on UniGene and BodyMap. SAGE tags were selected from GEO (http://www.ncbi.nlm.nih.gov/geo/) and processed in a manner similar to the ESTs. Tag-to-gene correspondence was achieved by determining virtual tags for all transcripts in our dataset, using a strategy similar to the one used for SAGE Genie (19). GeneChip data were obtained from the HuGEIndex site (http://zlab.bu.edu/HugeIndex/index.htm) and the Normal Tissue Database site (http://www2.genome.rcast.u-tokyo.ac.jp/tp/).
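The virtual-tag step described above can be illustrated with a minimal sketch. It assumes the common SAGE convention of NlaIII as the anchoring enzyme (recognition site CATG) and 10-base tags taken immediately downstream of the 3'-most site; the text does not state which enzyme or tag length was used for H-ANGEL, so both are parameters here. The function names and the example sequences are hypothetical.

```python
from collections import defaultdict


def virtual_sage_tag(transcript, anchor="CATG", tag_length=10):
    """Return the tag that follows the 3'-most anchor site, or None.

    Assumption: NlaIII (CATG) anchoring and 10-base tags, the usual
    SAGE convention; neither is confirmed by the H-ANGEL text.
    """
    seq = transcript.upper()
    pos = seq.rfind(anchor)          # 3'-most occurrence of the site
    if pos == -1:
        return None                  # transcript has no anchor site
    start = pos + len(anchor)
    tag = seq[start:start + tag_length]
    # A site too close to the 3' end cannot yield a full-length tag.
    return tag if len(tag) == tag_length else None


def build_tag_index(transcripts):
    """Map each virtual tag to the transcript IDs that would produce it."""
    index = defaultdict(list)
    for transcript_id, seq in transcripts.items():
        tag = virtual_sage_tag(seq)
        if tag is not None:
            index[tag].append(transcript_id)
    return dict(index)


# Toy sequences, invented for illustration only.
example = {
    "tx1": "AAACATGTTTTTTTTTTCATGACGTACGTACGG",
    "tx2": "GGGCATGAC",
}
tag_index = build_tag_index(example)
```

Observed SAGE tags can then be looked up in `tag_index`; a tag that maps to several transcripts is ambiguous and would need separate handling.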

Data processing
In order to allow users to compare the results from different platforms in an intuitive way, hundreds of mRNA sources used in the original analysis were manually categorized into 40 practical tissue types, based almost entirely on existing tissue classes used by commercial manufacturers of mRNAs. A table cross-referencing the tissue category originally assigned for each dataset by its provider and the corresponding tissue category manually allocated by H-Invitational consortium members can be viewed at http://www.jbirc.aist.go.jp/hinv/h-angel/title/tissue_html_list.html.
Tags were counted according to the 40 tissue categories, and counts were normalized by the total tag count of each of the 40 tissues. These values, representing relative expression levels across all tissues, and other relative expression values from arrays and iAFLP, were then normalized so that the sum of expression across the 40 tissues equals 1. In studies where expression was not measured in all 40 tissues, the sum of normalized values was set to the number of tissues tested divided by 40. For example, when only 20 tissues were tested by one platform, the sum of normalized values was set to 0.5.
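The normalization described above can be sketched as follows. This is a minimal illustration rather than the H-ANGEL implementation: the function names and the dictionary-based data layout are assumptions, and only the scaling rule (sum equal to the number of tissues tested divided by 40) is taken from the text.

```python
TOTAL_TISSUES = 40  # number of tissue categories stated in the text


def tag_frequencies(counts, library_totals):
    """Divide a gene's tag count in each tissue by that tissue's total tags.

    Tissues with no recorded total are skipped (an assumption of this sketch).
    """
    return {
        tissue: counts[tissue] / library_totals[tissue]
        for tissue in counts
        if library_totals.get(tissue)
    }


def normalize_profile(values, total_tissues=TOTAL_TISSUES):
    """Scale relative expression values so they sum to (tissues tested) / 40."""
    tested = len(values)
    total = sum(values.values())
    if tested == 0 or total == 0:
        return {tissue: 0.0 for tissue in values}
    target_sum = tested / total_tissues
    return {tissue: v * target_sum / total for tissue, v in values.items()}


# A platform that measured only 20 of the 40 tissues (toy values).
partial = normalize_profile({f"tissue_{i}": float(i + 1) for i in range(20)})
```

With 20 tissues tested, the values in `partial` sum to 0.5, matching the example in the text; a profile covering all 40 tissues sums to 1.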
The sparse distribution of expression data among some of the 40 tissues often makes direct comparison between tissues and across multiple platforms difficult. Owing to these difficulties, we decided to create 10 supracategories, groups representing related tissues (Figure S1). When amalgamating the expression data from 40 tissue categories to 10, each supracategory was given the average of the normalized values of its member tissues. For tag frequency data, tags were counted again for each supracategory and normalized by the tag total of that supracategory.
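The amalgamation from 40 tissue categories to 10 supracategories can be sketched in the same spirit. The tissue-to-group mapping used by H-ANGEL is given in Figure S1 and is not reproduced here; the mapping, names and numbers below are placeholders. The sketch assumes that "normalized by sum" for tag data means dividing the pooled tag count by the pooled library size of the group.

```python
from collections import defaultdict


def average_by_group(profile, tissue_to_group):
    """Average normalized values (arrays, iAFLP) over the tissues of each group."""
    sums = defaultdict(float)
    sizes = defaultdict(int)
    for tissue, value in profile.items():
        group = tissue_to_group[tissue]
        sums[group] += value
        sizes[group] += 1
    return {group: sums[group] / sizes[group] for group in sums}


def regroup_tag_counts(gene_counts, library_totals, tissue_to_group):
    """Re-count tags per group, then divide by the group's total tag count."""
    gene_sum = defaultdict(int)
    library_sum = defaultdict(int)
    for tissue, total in library_totals.items():
        group = tissue_to_group[tissue]
        library_sum[group] += total
        gene_sum[group] += gene_counts.get(tissue, 0)
    return {
        group: gene_sum[group] / library_sum[group]
        for group in library_sum
        if library_sum[group] > 0
    }


# Placeholder mapping: three tissues collapsed into two groups.
mapping = {"tissue_a": "group_1", "tissue_b": "group_1", "tissue_c": "group_2"}
averaged = average_by_group(
    {"tissue_a": 0.2, "tissue_b": 0.4, "tissue_c": 0.1}, mapping
)
pooled = regroup_tag_counts(
    {"tissue_a": 2, "tissue_b": 1},
    {"tissue_a": 100, "tissue_b": 200, "tissue_c": 50},
    mapping,
)
```

Pooling raw tag counts before dividing, rather than averaging per-tissue frequencies, keeps small libraries from dominating a group.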
Linking to the full-length cDNA assembly in H-InvDB
Using accession number IDs, we were able to cross-refer the clones from the H-InvDB predicted loci with their counterparts from UniGene. If the corresponding UniGene clone had any SAGE or EST expression data linked to it, these data were then associated with the matching H-Inv clone. This association procedure was repeated for RefSeq sequences which were members of predicted loci. The number of loci that could be associated with SAGE and EST data in this way is shown in Table 1. A total of 18 897 H-Inv loci have associated expression data from at least a single platform. The number increases to 24 520 if we take into account those loci in which all the members are RefSeq sequences. Sequences of SAGE tags, Oligo arrays, ESTs and iAFLP primers were mapped onto individual full-length cDNAs collected for the jamboree and all possible relationships between expression patterns and full-length cDNA clones were described.
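The accession-based cross-referencing described above amounts to a join between locus membership and UniGene cluster membership. The sketch below shows one way to express it; the data structures, the function name and all identifiers in the example are hypothetical and do not reflect the actual H-InvDB schema.

```python
def link_expression(locus_members, cluster_members, cluster_expression):
    """Attach UniGene-linked expression records to clones of predicted loci.

    locus_members:      locus ID -> list of clone accession numbers
    cluster_members:    UniGene cluster ID -> list of accession numbers
    cluster_expression: UniGene cluster ID -> list of expression records
    Returns locus ID -> {accession: records} for clones that have data.
    """
    # Invert cluster membership so each accession points to its cluster.
    accession_to_cluster = {
        accession: cluster_id
        for cluster_id, accessions in cluster_members.items()
        for accession in accessions
    }
    linked = {}
    for locus_id, accessions in locus_members.items():
        for accession in accessions:
            cluster_id = accession_to_cluster.get(accession)
            records = cluster_expression.get(cluster_id, [])
            if records:
                linked.setdefault(locus_id, {})[accession] = records
    return linked


# Placeholder identifiers, invented for illustration.
linked = link_expression(
    locus_members={"LOCUS_1": ["ACC_1", "ACC_2"], "LOCUS_2": ["ACC_3"]},
    cluster_members={"CLUSTER_X": ["ACC_1"], "CLUSTER_Y": ["ACC_3"]},
    cluster_expression={"CLUSTER_X": [{"platform": "SAGE", "tissue": "t1"}]},
)
loci_with_data = len(linked)
```

Counting the keys of the result gives the number of loci with at least one linked record, which is the kind of figure reported in Table 1.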

QUERYING THE DATABASE-THE FUNCTIONALITY OF H-ANGEL
We have developed a web interface to provide easy access to the data stored in the H-ANGEL database. The H-ANGEL home page provides access to two separate web interfaces. These are the 'H-Inv Locus Search for Gene Expression' and the 'Expression Pattern search'. Using the 'H-Inv Locus Search for Gene Expression', the user can search all the expression data available in the database for a particular gene or a gene list using several identifiers, such as H-Inv Cluster ID (HIX), RefSeq/FLcDNA accession numbers from DDBJ/GenBank/EMBL International Nucleotide Sequence Database (INSD), UniGene IDs, LocusLink IDs, definition keywords or gene product name.
After the search has been performed, the resulting web page consists of the following three sections: (i) Display H-Inv Cluster ID Box. This section shows all the H-ANGEL entries (Figure S2A) corresponding to the submitted query. Users can access expression data for a specific locus by selecting the corresponding HIX number and clicking the 'Display' button. (ii) Expression Pattern View. This section is the main view of H-ANGEL and displays an overview of all the expression data stored in H-ANGEL according to classified tissue categories (Figure S2B). All the H-ANGEL expression data related to each HIX number are listed along with the type of platform used for the analysis and the cDNA clone that is most likely to correspond to a given piece of expression data. Additionally, for iAFLP, SAGE and MPSS data, users can see the position of all tags or probes in relation to the locus or cDNA along with the corresponding exon-intron structure.
For SAGE data, we display the location of any internal adenosine stretches to make the user aware of possible internal priming sites. For ESTs, the frequency of exon coverage is shown. Gene expression patterns are displayed for both the 10 and 40 tissue category groups using a histogram. For each bar on the histogram, the user can see the tissue expression level as a percentile value by moving the mouse over the bar. (iii) Expression Information in Text. This section shows publicly available information related to the clones on the locus in text format. It is shown only when a single H-Inv locus entry is selected for display. In the 'iAFLP information Box', conditions of gene expression measured by the iAFLP experiment for each tissue for clones on the locus are reported. In the 'UniGene information Box', tissues in which clone(s) from the UniGene cluster corresponding to the locus were found are reported (Figure S2B). Via the 'Expression Pattern Search View' interface, the user can retrieve H-ANGEL entries using a similarity search based on expression patterns among distinct tissue categories (Figure S2C). Users can set an arbitrary expression pattern across 10 tissue

EXAMPLES OF CONSISTENCY AND DISCREPANCY BETWEEN PLATFORMS
In some cases, alignment of multi-platform expression patterns, individually mapped onto distinct spliced forms, allows users to deduce the expression patterns for each spliced form. Figure 1A shows an H-ANGEL representation of the locus of a potassium channel, subfamily K, member 4 (KCNK4). In this predicted locus, two transcript variants are known among the three clustered RefSeq sequences (see NCBI LocusLink: http://www.ncbi.nlm.nih.gov/LocusLink/LocRpt.cgi?l=50801). Marked discrepancies among expression patterns for this locus suggest that transcripts 2 and 4 are mainly expressed in the testis and transcripts 1 and 3 are mainly expressed in the brain. This observation is generally consistent with the literature (20).
In Figure 1B, a disagreement between platforms in the expression patterns of the dopamine receptor D5 (DRD5) can be clearly observed. The three microarray-based methods report a low, uniform distribution with no high levels of expression in any one tissue. However, the two PCR-based techniques indicate that DRD5 is more highly expressed in neural tissues than in other tissue types. As DRD5 is a well-studied protein, we know from repeated northern blot analyses that the PCR-based results are in accord with those observed previously (21). This disagreement may be due to the greater sensitivity, in many cases, of PCR techniques over microarray-based techniques (22).

CONCLUSION AND FUTURE DEVELOPMENTS
Approximately 90% of H-Inv loci could be assigned expression data from at least one platform. In the majority of the predicted loci, some degree of discrepancy across platforms was noted. However, as shown in the examples, careful inspection using the viewer suggested that many of the discrepancies probably did not simply represent measurement errors, but rather reflected intrinsic factors associated with measuring expression using particular techniques. For example, there is growing concern among users of microarray technologies regarding disagreements between measurements due to alternative splicing (23-25), since several lines of evidence indicate that a large portion of our genes (40-60%) have alternatively spliced forms (26-28). Currently, even when adequate data are available to make an appropriate assessment, an 'informed decision' made by a human is still required in order to confirm the most likely and logical expression pattern for an individual transcript. The next step in the evolution of H-ANGEL will be to automate these informed decisions so that H-ANGEL can systematically deduce the most likely expression pattern for each transcript from conflicting expression data.