Data from gene expression arrays are influenced by many experimental parameters that lead to variations not simply accessible by standard quantification methods. To compare measurements from gene expression array experiments, quantitative data are commonly normalised using reference genes or global normalisation methods based on mean or median values. These methods are based on the assumption that (i) selected reference genes are expressed at a standard level in all experiments or (ii) that mean or median signal of expression will give a quantitative reference for each individual experiment. We introduce here a new ranking diagram, with which we can show how the different normalisation methods compare, and how they are influenced by variations in measurements (noise) that occur in every experiment. Furthermore, we show that an upper trimmed mean provides a simple and robust method for normalisation of larger sets of experiments by comparative analysis.
Received February 12, 2002; Revised and Accepted April 9, 2002.
Normalisation of data is always required where variations in experimental conditions contribute greatly to the quantitative differences of measured values observed between samples. After normalisation, the measured data should be adjusted in such a way that subsequent comparison reveals only biological differences relevant for the scientific question being addressed. In gene expression array experiments these should be differences in gene expression and not differences in the experimental process or the measurement conditions used. Today, many different experimental methods are used for gene expression array analysis. Therefore, it is impossible to consider all differences with a simple and easy-to-use normalisation strategy. All normalisation methods depend on similarities between the samples analysed and require that basic assumptions be made about the behaviour of analysed parameters which can be expected (e.g. expression of specific genes). The main assumption of normalisation is a functional coherence between a true biological difference and the corresponding measured values, which should follow an unequivocal function. The second assumption is the presence of a property within the samples, which is unchanged. This could be the total amount of mRNA (globalisation methods), the unchanged expression of one or more housekeeping genes or the ‘well behaviour’ of change, which means the change is distributed normally [centralisation (1)]. Only a direct measurement (count) of the molecular amount of each RNA species per cell could lead to an assumption-free gene expression analysis. Until now there is no such method. Introducing external standards also does not solve the uncertainty of these assumptions (2). These standards, however, are crucial to ensure the quality of the experiments and to determine additional parameters of the experimental process, but do not solve the normalisation problem (3).
MATERIALS AND METHODS
We used gene expression data from several medical projects to evaluate and compare the various normalisation methods available. Data were taken from tumorous and normal samples of lung and kidney tissue, leukaemia blood cells (L.Odyvanova, unpublished data) and breast cancer cell lines (4). In these experiments 33P-labeled cDNAs were generated by reverse transcription from mRNA samples. After hybridisation against cDNA gene expression array membranes (Clontech Human Atlas Array 1.2) signals were detected by phosphorimaging using phosphorimaging screens (Fuji) and the STORM-reader (Molecular Dynamics). These image data were analysed with the AIDA array analyser software package (Raytest, Straubenhart, Germany). Further data analysis was done with Microsoft Excel and Mathworks Matlab 6.
All values derived from the image analysis were background corrected. Each normalisation method was done independently for each experiment. The respective scaling factor was determined by applying the appropriate formula or algorithm given in Table 1. All values of a single experiment were scaled with this factor.
The method of ranking is common in statistics. Several determinations of values like median, percentile, quantile or trimmed mean are based on it. All N values of an experiment were sorted by descending value and were given an ordinal number reflecting the relative position within that experiment. The first rank (#1) corresponds to the highest value, the lowest value corresponds to the last rank (#N where N is the amount of analysed genes on the array).
Visualisation and comparison measure
In the rank intensity plot (RIP) we plot intensity values of an experiment versus their rank. Each curve represents one experiment. Using experimental data from different experiments with the same gene set (defined measured parameter) leads to a diagram featuring global distribution of values. We define the absolute rank deviation (ARD) as the standard deviation of intensities of a given rank. The relative rank deviation (RRD) we define as the ARD divided by mean of intensities of that given rank.
In most gene expression experiments a linear normalisation function is applied. This means a scaling factor can be determined based on the assumed unchanged property. The first is a weak assumption because of non-linear effects like saturation, over shining or noise. Several adjustments have been made to circumvent such obstacles. The various normalisations and their respective assumptions are summarised in Table 2. Still it is difficult to compare the real outcome of all these normalisation methods. Most evaluations use scatter plots of two single experiments (5) for visualising the effects of different normalisation methods. Comparing these methods on larger, more variable sets of experiments with such display is almost impossible. When we analysed our expression data using RIPs for visualising the global distribution of gene expression values, we found that all data from our experiments with cDNA-array-membranes showed similar ranking curves. Assuming that this similarity could be a general property, we investigated whether RIPs can be used as an evaluation for the outcome of the normalisation method.
Figure 1 shows not-normalised measured data of an unsorted gene list. Only an unorganised set of peaks can be seen, which is not accessible for interpretation. After background correction all data were ranked by descending intensity, and RIPs were generated plotting all measured data against their rank. The resulting RIPs can be seen in Figure 2. Each curve represents one experiment. It can be clearly seen that the graphs of the ranked data show a high degree of similarity. After a normalisation step of the same data set the similarity of the RIPs becomes even more obvious (Fig. 3). All curves obtained after ranking follow an exponential-like shape. This shape is characteristic for an experimental setting, and is influenced by the selection of genes and the method for measuring.
Although no conclusion can be made for an individual gene, the RIP gives a good impression about the general properties of the data from one experiment. The distribution of noise is smoothed, because noise mainly influences the position of a gene within the ranking, but has only a weak effect on the intensities of neighbouring ranks. For this global view it does not matter which gene occupies a specific rank. Background noise, however, has an influence on all values, but at a much higher extent to lower and non-expressed genes. This can be seen by negative values, which appear after background subtraction. Often negative values are filtered and set to zero, the smallest possible biologically meaningful expression value (not expressed) or marked as absent or not detected. But for globalisation methods this can prove incorrect. This filtering removes only the negative influence of the background noise. The positive influence of background noise is still contained in the data and leads to a bias in all normalisation methods, which is particularly strong for globalisation methods. If a summation-based method is used (mean), the determination of the scaling factor should be done before positive filtering. A normal distributed noise should tend to zero effect on such a scaling factor if the gene set is large enough.
In the following we show how the RIP can be effectively used to evaluate various normalisation strategies with real gene expression data. In addition to the RIP curves of each experiment, our graphical display includes the mean intensity for each rank displayed as the rank mean curve (RMC). The absolute rankwise deviation of intensities is shown by the ARD curve. The better interpretable relative deviation of intensities is also given as the RRD curve. In Figure 2 not-normalised ranked data are shown. As one might expect, a high variation of intensities is visible. This is expressed in a general relative deviation of >150% over the whole range of ranks indicating that non-normalised data sets cannot be compared.
After normalisation with the mean intensity of each experiment, this large variation is reduced to a relative deviation of most ranks below 80% shown in Figure 3. Especially lower ranks (higher signal intensities) show a smaller relative deviation down to 20%.
Another common method is normalisation based on the median. This is thought to be more robust to extreme values that have a strong effect on the mean. In Figure 4 it can be seen that through the definition of the median a zero deviation occurs at the centre of the rank axis and that, therefore, the centre range of low-level genes shows only a small relative deviation. At higher intensities (lower ranks), however, a widening of the curves leads to a high relative deviation up to 80%. This widening is due to the effect of measurement noise, which has an additive effect of intensities near background. Because most ranks are near background, the larger the noise of the measurement, the larger the scaling factor determined. Therefore, the median represents more the extent of noise than a scaling factor proportional to the total mRNA amount.
The noise effect can be reduced by generalising this method using percentiles. Choosing a higher percentile (e.g. 80%), which is defined at lower ranks, reduces the background influence and narrows the lines at higher intensities (Fig. 5).
A still better result for the RRD of higher intensities is given by the trimmed mean. The trimmed mean method provides a tool to filter saturation effects and strong background variation. The 10% trimmed mean, for example, determines the mean of the centred 90% of values, cutting 5% of the highest and 5% of the lowest values. As can be seen in Figure 5, the symmetric trimming leads to a general low relative deviation in a wide range of ranks (% from position x to position y). If a different percentage of trimming on high and low values is necessary one can use an asymmetrically trimmed mean. The influence of a normal distributed noise on the total signal should be zero. Because ranking also leads to an ordering of values near background according to their extent of noise, filtering off negative values (higher ranks) leads to an addition of noise to a scaling factor determined by summation (mean). In this case higher noise would lead to bigger scaling factors. The application of only an upper trimmed mean could avoid this problem. With our data a 5% upper trimming gives a reasonable normalisation result, which is shown in Figure 6. In comparison both trimming adjustments have a similar outcome. The relative deviation between the experiments is reduced in a wide range of values and gives a best fit at medium high intensities (lower ranks).
We demonstrated that our easily applicable method of generating RIPs provides a good tool for evaluating normalisation methods on larger sample sets. The RRD provides a measure for the extent of normalisation. Based on this assumption we can recommend a 5% upper trimmed or 10% trimmed mean for determining the normalisation factor of samples from atlas filter arrays (human 1.2, Clontech). For other gene sets and experiments the percentage of trimming has to be evaluated. For arrays that contain only a highly regulated subset of genes, it must be expected that the values will behave in such a way that they cannot be normalised with globalisation methods. In such cases, the shape of the rank-intensity curves will show less similarity and the measure of RRD could fail. With larger sets of reference genes, which together represent an expression level proportional to the total amount of mRNA, it is possible to apply a trimmed mean to make subgroups of expressed genes accessible for comparison.
Measuring gene expression arrays is not only sensitive to variations in the experimental process but also to differences in the properties of the measuring equipment used. If the measurement process is too diverse, our methods for normalisation are not sufficient to make two independent measurements comparable. Using readers and/or imaging plates from different suppliers strong variations in signal functions can be observed. Even different image-processing software and image formats will influence the data set (6). With RIP such differences can be visualised and it can be avoided that non-comparable data sets are compared and wrong interpretations are made. The trimmed mean is only applicable when the function of measured values for each method is linear proportional to each other in the compared data set.
We mainly analysed experiments in which radioactive labelled probes and membrane arrays were used. Chemoluminescence behaves similarly to radioactivity and gives comparable results but with a higher tendency for saturation. The extension of our methods (RIP and trimmed mean) to fluorescence labelling is only rational if the following facts apply: the gene set should be rather large (greater than 1000); the amount of spotted material (spot size) should be normally distributed; the used fluorescent dyes must be determinable with the same measurement function. But, as Tseng et al. (7) pointed out, the label normalisation function in this competitive situation is non-linear and depends on the label used and on the given slide. RIP still can display this divergence.
Whether the intensity-rank-function can be used for quality assessment or for the adjustment of non-linear normalisation protocols remains an open question. In competitive hybridisation experiments signals of two measured channels are interdependent. Thus, it is required to adjust the normalisation methods to each experiment. Therefore, the basic assumption of a simple channel-wise scaling as used in our analysis is not appropriate. For normalisation in these types of experiments, methods like local linear regression or windowed ratio centring seem to be adequate (8). These methods, however, do not allow the evaluation of the inherent quality of experimental data sets used for comparative gene expression profiling.
Despite these limitations, the intensity-rank-function as displayed with RIP is a versatile tool to visualise general features of data sets used in comparative analysis. In this way it can be easily used to ensure that only comparable data sets are included in large-scale analysis of gene expression measurements.
We thank Larissa Odyvanova, Joachim Clement and Andreas Burchert for providing the experimental data. Our special thanks go to Reinhard Guthke and Daniel Hahn for critical discussions. The work was supported by a scholarship from the Friedrich-Schiller-University Graduate Student Program and the German Ministry for Science and Technology (BMBF).
To whom correspondence should be addressed. Tel: +49 3641 932261; Fax: +49 3641 932300; Email: email@example.com
N, total number of genes of set.
i, index of genes.
r, rank of gene.
si, background corrected signal of gene with index i.
sr, background corrected signal of gene with rank r.
Nref, number of reference genes.
iref, index of reference genes.
η, scaling factor to be applied.
n, amount of substance.