Summary: OrderedList is a Bioconductor-compliant package for meta-analysis based on ordered gene lists such as those resulting from differential gene expression analysis. Our package quantifies the similarity between gene lists. The significance of the similarity score is estimated from random scores computed on perturbed data. OrderedList illustrates list similarity in intuitive plots and determines the score-driving genes for further analysis.
Motivation: In microarray studies, researchers often compare gene expression profiles from two different conditions to generate lists of induced genes ordered according to a measure of upregulation and downregulation. For comparing results generated in different studies, we can search for similarities between ordered gene lists. The OrderedList package is dedicated to this task. The underlying algorithm is described in detail in Yang et al. (2006).
Range of applications: We can compare independent microarray studies addressing the same research question to confirm findings. More interestingly, we can compare studies from different but related contexts, e.g. survival in different types of cancer. Here gene list comparison can discover common markers. Moreover, two studies that do not reach statistical significance for differential gene expression on their own may still show significant similarities in the corresponding gene lists. Comparisons are also feasible between different technological platforms, for instance between studies performed on different microarrays. Ordered lists can also be derived from heterogeneous data sources, for example, protein activities measured with immunoprecipitation arrays, allele frequencies determined in SNP studies and brain activity per voxel determined by functional magnetic resonance imaging (fMRI) (Loring et al., 2002). Although the method described in Yang et al. (2006) focuses on the comparison of microarray expression studies holding many profiles, the OrderedList package additionally implements a method based purely on lists. This further widens its field of application.
Similarity score: Yang et al. define a similarity score to quantify list similarity. To compute the score, OrderedList determines the number of shared elements Sn in the first n elements of the lists for each n. The final score is a weighted sum over Sn where the ends of the lists receive larger weights, thus ensuring that the more strongly induced genes dominate the score.
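The scoring idea described above can be illustrated with a minimal Python sketch. It is not the package's implementation (OrderedList is written in R). The exponentially decaying weights exp(-alpha * n), the default value of alpha, the summation of top-of-list and bottom-of-list overlaps, and the function names shared_counts and similarity_score are all assumptions made for illustration; the text above only states that the ends of the lists receive larger weights.

```python
import math


def shared_counts(list_a, list_b):
    """Return [S_1, ..., S_N], where S_n is the number of elements shared by
    the first n entries of both lists. Assumes each list has unique entries."""
    seen_a, seen_b = set(), set()
    shared = 0
    counts = []
    for a, b in zip(list_a, list_b):
        if a == b:
            # The same gene enters both prefixes at once: one new shared element.
            shared += 1
        else:
            if a in seen_b:
                shared += 1
            if b in seen_a:
                shared += 1
        seen_a.add(a)
        seen_b.add(b)
        counts.append(shared)
    return counts


def similarity_score(list_a, list_b, alpha=0.1):
    """Weighted sum over S_n, computed from both ends of the lists.

    Assumption: weights decay as exp(-alpha * n), so small n (the extreme
    ranks at either end) dominate the score.
    """
    top = shared_counts(list_a, list_b)
    bottom = shared_counts(list_a[::-1], list_b[::-1])
    return sum(
        math.exp(-alpha * n) * (t + b)
        for n, (t, b) in enumerate(zip(top, bottom), start=1)
    )


if __name__ == "__main__":
    genes_a = ["g1", "g2", "g3", "g4"]
    genes_b = ["g2", "g1", "g4", "g3"]
    print(shared_counts(genes_a, genes_b))
    print(similarity_score(genes_a, genes_b))
```

With a larger alpha the score concentrates on fewer ranks at either end of the lists; with a smaller alpha overlaps deeper in the lists also contribute.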
Significance analysis: To estimate the significance of detected list similarities, OrderedList randomly perturbs the input data to compute null distributions of the similarity score. Here we distinguish two modes of operation: the first needs complete sets of gene expression profiles, whereas the second works with simple ordered lists. In the first case, OrderedList perturbs the input data by subsampling from the profiles and reordering the genes (Yang et al., 2006). When only single ordered lists are provided, shuffling is used to generate random lists. In both cases, scoring the perturbed lists generates null distributions for similarity scores, from which empirical P-values are deduced. In the presence of sufficient data, however, the first method is preferable, since it prevents constantly expressed genes from obtaining prominent ranks in random lists. This is desirable because otherwise random scores and empirical P-values are underestimated.
Results: An analysis by OrderedList yields an estimate of the significance of the similarity between gene lists. In addition, the package detects how far into the lists striking similarities occur. Finally, our algorithm determines the genes that drive the observed similarity score, i.e. genes with prominent ranks in all compared lists. These genes are the most promising candidates for further analysis and interpretation.
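As an illustration of the shuffling mode only, the following Python sketch estimates an empirical P-value by scoring randomly shuffled lists. It is a simplified sketch under stated assumptions, not the package's code: the score uses only the top of the lists with assumed weights exp(-alpha * n), the (hits + 1) / (n_perm + 1) estimator is one common convention chosen here, and the names overlap_score and empirical_p_value are hypothetical. The subsampling mode, which requires full expression profiles, is not shown.

```python
import math
import random


def overlap_score(list_a, list_b, alpha=0.1):
    """Simplified similarity score: weighted sum of top-n overlaps S_n.
    The exponential weights are an assumption made for this sketch."""
    score = 0.0
    for n in range(1, len(list_a) + 1):
        s_n = len(set(list_a[:n]) & set(list_b[:n]))
        score += math.exp(-alpha * n) * s_n
    return score


def empirical_p_value(list_a, list_b, n_perm=1000, alpha=0.1, seed=0):
    """Fraction of shuffled lists scoring at least as high as the observed pair.

    Adding 1 to numerator and denominator keeps the estimate above zero;
    this convention is an assumption, not taken from the text.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    observed = overlap_score(list_a, list_b, alpha)
    shuffled = list(list_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random list: the null model of the shuffling mode
        if overlap_score(list_a, shuffled, alpha) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    genes = [f"g{i}" for i in range(30)]
    print(empirical_p_value(genes, genes, n_perm=200))        # identical lists
    print(empirical_p_value(genes, genes[::-1], n_perm=200))  # reversed list
```

Because every gene is equally likely to reach a top rank in a shuffled list, this null model ignores whether a gene varies at all across profiles, which is the limitation of the shuffling mode noted above.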
Availability: The OrderedList software package is written in the language R developed within the R Project for Statistical Computing (R Development Core Team, 2004). It is part of release 1.8 of the Bioconductor suite of packages related to life science applications (Gentleman et al., 2004), free for use under the GNU General Public License and easy to install on various UNIX and Windows systems.
Data formats: OrderedList accepts data in two different formats. For the subsampling mode, expression data including several profiles per condition need to be provided in a Bioconductor-specific format. In addition to the expression levels, the data must contain class labels for each profile. For the shuffling mode, OrderedList expects ordered vectors of character strings, each element identifying one gene. By default, OrderedList considers measurements (in expression data) or ranks (in ordered lists) as being related to the same gene when they carry the same name. The user can, however, provide mappings to indicate pairs of differing identifiers relating to the same gene. Thus ordered lists generated on different platforms can be compared.
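The role of an identifier mapping can be sketched in Python as follows. The function name harmonize, the dictionary-based mapping and the identifiers in the example are hypothetical, and restricting both lists to the genes they have in common is an assumption of this sketch rather than documented package behaviour.

```python
def harmonize(list_a, list_b, mapping=None):
    """Translate identifiers in list_b into the naming scheme of list_a and
    keep only genes present in both lists, preserving each list's rank order.

    mapping: dict from identifiers used in list_b to identifiers used in
    list_a. Identifiers without an entry are assumed to match by name.
    """
    mapping = mapping or {}
    translated_b = [mapping.get(gene, gene) for gene in list_b]
    common = set(list_a) & set(translated_b)
    kept_a = [gene for gene in list_a if gene in common]
    kept_b = [gene for gene in translated_b if gene in common]
    return kept_a, kept_b


if __name__ == "__main__":
    # Hypothetical identifiers from two platforms, A and B.
    platform_a = ["A1", "A2", "A3"]
    platform_b = ["B3", "B1", "B9"]
    b_to_a = {"B1": "A1", "B3": "A3"}
    print(harmonize(platform_a, platform_b, b_to_a))
```

After this step both lists use one naming scheme, so they can be compared rank by rank.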
Output: OrderedList determines empirical P-values of similarity scores. It graphically illustrates the list comparison analysis as shown in Figure 1. Here the number of shared genes in the lists up to rank n is related to the number of shared elements expected in randomly shuffled lists. In addition to the observed Sn, OrderedList draws its expectation and the 95% confidence intervals, derived either from the empirical distribution obtained by subsampling (in subsampling mode) or from a hypergeometric distribution (in shuffling mode). For further interpretation, OrderedList determines the genes that dominate the similarity score and returns their identifiers.
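For the shuffling mode, the hypergeometric reference can be sketched in Python. If two lists of N genes are ordered at random, the overlap Sn of their first n entries follows a hypergeometric distribution with population size N, n marked genes and n draws, so its expectation is n²/N. The equal-tailed construction of the 95% interval and the function names below are assumptions of this sketch; the text does not state how the package builds the interval.

```python
from math import comb


def hypergeom_pmf(k, total, n):
    """P(S_n = k) when the top-n sets of two random orderings of `total`
    genes are compared."""
    return comb(n, k) * comb(total - n, n - k) / comb(total, n)


def expected_overlap(total, n):
    """Expected number of shared genes among the first n ranks by chance."""
    return n * n / total


def overlap_interval(total, n, level=0.95):
    """Equal-tailed interval for S_n under the hypergeometric null (assumed form)."""
    tail = (1 - level) / 2
    k_min = max(0, 2 * n - total)  # smallest overlap that is possible at all
    k_max = n
    cdf = 0.0
    lower = upper = None
    for k in range(k_min, k_max + 1):
        cdf += hypergeom_pmf(k, total, n)
        if lower is None and cdf >= tail:
            lower = k
        if cdf >= 1 - tail:
            upper = k
            break
    if upper is None:  # guards against floating-point shortfall in the sum
        upper = k_max
    return lower, upper


if __name__ == "__main__":
    total_genes, rank = 1000, 100
    print(expected_overlap(total_genes, rank))
    print(overlap_interval(total_genes, rank))
```

An observed Sn lying above the upper bound at a given rank n indicates more shared genes than expected for shuffled lists at that depth.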
3 EXEMPLARY ANALYSIS
Example data: We illustrate the functionality of our package by comparing the following two gene expression studies: the breast cancer study by Huang et al. (2003) characterizing differentially expressed genes in patients at high risk versus patients at low risk for relapse, and the prostate cancer study by Singh et al. (2002) relating first diagnosis expression profiles from relapsed patients to those of cured patients. Both datasets were measured on Affymetrix GeneChip® HG-U95av2 arrays.
Results: Within each comparison, OrderedList derived rankings using regularized t-scores. We observed a significant similarity of the two gene lists (P = 0.0470). In Figure 1 we show one graphical output provided by the package. Displayed is the observed number of shared genes for all ranks and the corresponding expectation with 95% confidence intervals. In addition to the P-value of the similarity score, the plot supports the significance of the overlap.
Within the first 1000 top and bottom ranks, we found 102 genes contributing 95% to the total similarity score. In Table 1 we show the top-scorers of the prostate cancer comparison with their corresponding ranks in the breast cancer comparison. Some genes, such as AZGP1, were found at high ranks in both comparisons, whereas other genes, such as MAFF, are far down the list of the breast cancer comparison. This finding shows that OrderedList does not aim for the most significantly induced genes but for an overlap of two independent expression studies that differs substantially from randomness. Among the top-ranking overlaps we found many genes connected to various kinds of cancer, namely AZGP1, MAFF, ODC1, FMOD, JUNB, BTG2 and FOS [see OMIM™ (Online Mendelian Inheritance in Man, 2000)]. This shows that OrderedList is able to pinpoint genes relevant to both compared studies.
Sorted according to ranks in the prostate study.
This research has been supported by BMBF grants 01GS0445 and 01GR0455 of the German Federal Ministry of Education and Research. In addition X.Y. was supported by a DAAD-Fellowship. Conflict of Interest: none declared.