Jakob Raymaekers, Ruben H Zamar, Pooled variable scaling for cluster analysis, Bioinformatics, Volume 36, Issue 12, 15 June 2020, Pages 3849–3855, https://doi.org/10.1093/bioinformatics/btaa243
Abstract
Many popular clustering methods are not scale-invariant because they are based on Euclidean distances. Even methods using scale-invariant distances, such as the Mahalanobis distance, lose their scale invariance when combined with regularization and/or variable selection. Therefore, the results from these methods are very sensitive to the measurement units of the clustering variables. A simple way to achieve scale invariance is to scale the variables before clustering. However, scaling variables is a very delicate issue in cluster analysis: A bad choice of scaling can adversely affect the clustering results. On the other hand, reporting clustering results that depend on measurement units is not satisfactory. Hence, a safe and efficient scaling procedure is needed for applications in bioinformatics and medical sciences research.
We propose a new approach for scaling prior to cluster analysis based on the concept of pooled variance. Unlike available scaling procedures, such as the SD and the range, our proposed scale avoids dampening the beneficial effect of informative clustering variables. We confirm through an extensive simulation study and applications to well-known real-data examples that the proposed scaling method is safe and generally useful. Finally, we use our approach to cluster a high-dimensional genomic dataset consisting of gene expression data for several specimens of breast cancer cell tissue obtained from human patients.
An R-implementation of the algorithms presented is available at https://wis.kuleuven.be/statdatascience/robust/software.
Supplementary data are available at Bioinformatics online.
1 Introduction
Every time cluster analysis is used to find homogeneous groups in the data, we face the issue of how to scale the variables. The choice of scaling (including the option of not scaling at all) has practical implications because, in general, clustering methods are not scale-invariant. Even those clustering methods that are scale-invariant become scale-dependent when they are combined with variable selection or regularization. As a consequence, we may get a different data partition if the variables in the data are rescaled. Unfortunately, there is no generally accepted way to scale the variables for clustering (Jain, 2010; Jain and Dubes, 1988).
There are conflicting recommendations regarding the scaling of the data in the clustering literature. For example, Milligan and Cooper (1988)—often cited as the main benchmark study of this topic in the context of hierarchical clustering—recommend scaling the variables by their range. Steinley (2004) came to the same conclusion using k-means. On the other hand, Vesanto (2001) recommends the use of the SD for k-means clustering. Schaffer and Green (1996) study clustering on real-data examples and argue against scaling with the range or the SD. Stoddard (1979) also argued against the use of the SD in the analysis of laboratory procedures.
While scaling may not always be necessary, it seems that in general the most reliable approach is to use some sort of scaling for two main reasons. The first reason is that scaling gets rid of the measurement scales of the variables. These measurement scales may have a strong influence on the clustering results, to the extent that a single very large variable can solely determine the whole clustering outcome. Furthermore, it can be of practical importance to get rid of measurement scales, e.g. a variable measuring ‘height’ should have the same effect on the clustering procedure when measured in centimeters or in meters. The second reason is that scaling becomes practically mandatory in the context of high-dimensional clustering with variable selection. Variable selection is usually achieved through the addition of a penalization term to the objective function. This penalization is typically not scale-invariant and thus yields different outcomes depending on the variables’ relative sizes. Therefore, without scaling, the penalty acts differently on each variable, which will often lead to ineffective variable selection. For these reasons, we consider feature scaling as a necessary pre-processing step in cluster analysis.
Though the arguments in favor of scaling before cluster analysis are clear, the issue of how to scale is delicate. The reason is that there are two types of variables: informative variables and noise variables. Informative variables help to separate clusters in the data, whereas noise variables do not. If an informative variable is divided by a very large scale, it will be compressed to a small size, and much of its clustering power may disappear. In contrast, if a noise variable is divided by a very small scale, it can be blown up to the point where it solely determines the clustering outcome. Note that this issue is even more pronounced in high-dimensional datasets, as they typically contain many noise variables.
Interestingly, most of the commonly used scale estimators, such as the SD and the range, have a tendency to yield large scale estimates on potentially very informative variables. More precisely, these estimators may produce unduly large scales for a variable with a multimodal distribution because of the dispersion between the groups. Hence, a drawback of feature scaling by these estimates is the possible undesirable reduction of the relative clustering importance of features that clearly separate different groups. While these features are not guaranteed to reveal the whole clustering structure of the data, they show good potential and should be treated carefully.
The discussion so far suggests that there is a need for further study of both the effect of scaling and the best choice of scale estimator in cluster analysis. Despite the considerable potential impact of scaling on clustering outcomes, papers on this topic are scarce, scattered and lack consensus.
In Section 2, we introduce two new scale estimators specially designed to scale variables before clustering: the pooled standard deviation (pSD) and the pooled absolute deviation. By using pooled scale estimators, we aim to scale variables without destroying their clustering power if they have any. If the marginal distribution of a feature shows evidence of clustering, our proposed scaling will enhance its clustering importance. If the marginal distribution of a feature X does not show evidence of clustering, but the joint distribution of some subset of features including X carries important clustering information, our proposed scaling approach will have a neutral effect on this variable. The proposed scaling approach can be used as pre-processing before applying any clustering method, including those with variable selection and subspace clustering, as illustrated with several examples. To calculate our proposed scale, we first run k-means clustering on each of the variables. In order to choose the number of clusters used to estimate the pooled scales, we use the Gap statistic (Tibshirani et al., 2001) with a sped-up bootstrapping procedure that bypasses the otherwise unaffordable computational burden of this approach.
The rest of the article is organized as follows. In Section 3, we compare the new scale estimators with existing scaling procedures in an extensive simulation study. Moreover, we show an application of the pooled scale estimators prior to the hierarchical clustering of gene expressions of breast cancer sample tissues (additional real-data examples are presented in the Supplementary Material). Finally, Section 4 concludes.
2 Materials and methods
2.1 Pooled scale estimators
Our idea is to use $S_k$, with an appropriate (variable-dependent) value of k, for scaling the variables prior to the application of a clustering procedure. We will refer to this scale estimator as the pSD. The intuition behind this scale estimator is the following. If a variable does not appear to separate any clusters in the data, we consider its variability uninformative and use the largest scale, $S_1$ (i.e. the classical SD), to scale it before clustering. However, when a variable does seem to separate, say, k clusters, then $S_k$ will tend to be small relative to $S_1$, and using $S_k$ in that case avoids dampening the variability (i.e. information) in this variable. This way, we hope to preserve as much of the clustering information in each variable as possible, while still making the variables unitless.
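Equation (1) defining $S_k$ is not reproduced in this extract. The sketch below assumes the natural definition of a pooled SD: the square root of the summed within-cluster squared deviations divided by $n - k$, so that $k = 1$ recovers the classical sample SD. The authors distribute an R implementation; this Python function (the name `pooled_sd` is ours) is purely illustrative.

```python
import numpy as np

def pooled_sd(x, labels):
    """Pooled SD of a univariate sample given a cluster assignment.

    Sums squared deviations around each cluster's own mean and divides
    by n - k; with a single cluster (k = 1) this reduces to the
    classical sample SD.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    wss = sum(((x[labels == c] - x[labels == c].mean()) ** 2).sum()
              for c in clusters)
    return float(np.sqrt(wss / (len(x) - len(clusters))))
```

On a variable with two well-separated groups that are labeled correctly, `pooled_sd` is much smaller than the overall SD, so dividing by it preserves the separation instead of compressing it.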
Remark 1. Since we are using univariate k-means and k-medians to obtain the pSD and pmad, it is worth noting that there is an algorithm which guarantees convergence to a global optimum instead of a local one. This algorithm uses dynamic programming, runs in $O(kn^2)$ time and is implemented in the R-package Ckmeans.1d.dp; see Wang and Song (2011).
2.2 Determining k
An important question regarding our proposed pooled scale estimators is how to choose the appropriate number of clusters for each variable in the dataset. There is a vast literature on how to choose the appropriate number of clusters in cluster analysis. Relevant comparative studies giving a good overview of the most common techniques are found in Milligan and Cooper (1985), Halkidi et al. (2001), Maulik and Bandyopadhyay (2002), Brun et al. (2007) and Arbelaitz et al. (2013). In the setting of pooled scale estimators, we need a criterion that is relatively fast to compute and not too sensitive to spurious clusters. Most importantly, we need a criterion that can distinguish between the 'null-case' of one cluster versus the alternative case of two or more clusters. This is an important point, since especially in high-dimensional datasets, we expect that many variables may not have interesting information for clustering the data and thus should be scaled by the largest scale, i.e. the SD or mad. Note that this requirement makes many popular cluster validation indices unsuitable for us, since many of them do not yield a reasonable comparative assessment of the one-cluster case.
A well-known criterion that satisfies our needs is the Gap statistic (Hastie et al., 2009; Tibshirani et al., 2001). Given a certain clustering algorithm and the resulting partition of the data, the gap statistic works as follows. For each value of the number of clusters under consideration, the 'tightness' of the clusters in the found partition is compared with the tightness of clusters obtained by clustering random datasets using the same algorithm. If this difference is large for a given number of clusters, it indicates 'stronger than random' clustering structure and the gap statistic will pick that number of clusters as the true number. A mathematical description of the gap statistic is as follows. Suppose we have clustered the data into k clusters $C_1, \ldots, C_k$, where $C_j$ denotes the indices of the observations in cluster j. Let $W_k$ be the sum of the within-cluster sums of squares around their corresponding cluster means, i.e. $W_k = \sum_{j=1}^{k} \sum_{i \in C_j} (x_i - \bar{x}_j)^2$, where $\bar{x}_j$ is the mean of the observations in cluster j. In order to identify the number of clusters, the value of $\log(W_k)$ is compared to its expected value under a uniform reference distribution on the range of the dataset. If this value deviates too much from its expected value under a uniform distribution, it indicates the existence of clusters in the data. The intuition for the comparison with the uniform distribution is that it is the distribution which is most likely to generate spurious clusters (within the family of log-concave densities) and will thus on average provide the strongest evidence against the alternative hypothesis.
In principle, the reference distribution of $\log(W_k)$ is determined by generating bootstrap samples from the uniform distribution. As we would like to scale every variable in the dataset, this appears to lead to a prohibitive computational cost, which would scale poorly with the number of variables. Fortunately, there is an efficient way to bypass this hindrance based on the results of Proposition 1 below.
Therefore, in our setting, the gap statistic will be large when there is a 'significant difference' between the estimated pooled scale and the pooled scale estimated on data coming from a uniform distribution.
In order to estimate $\mathrm{E}^*[\log(W_k)]$ and its dispersion, we apply k-means to B bootstrap samples of size n. The mean of these samples serves as the estimate for $\mathrm{E}^*[\log(W_k)]$, and the appropriate scale, which accounts for the simulation error, is then the SD of the bootstrap samples multiplied by $\sqrt{1 + 1/B}$.
We now turn to speeding up the bootstrapping procedure for scaling all the variables in a dataset. The speed-up is achieved by exploiting the fact that we are in the univariate setting and by using the properties of the proposed scaling methods established in the following proposition.
Proposition 1. Let $X = \{x_1, \ldots, x_n\}$ be a sample of univariate observations and let $\mathcal{C} = \{C_1, \ldots, C_k\}$ be a partition of $X$ resulting from solving the k-means clustering problem. Denote by $W_k(X)$ the value of the objective function, with $S_k$ as in Equation (1). Let $s > 0$, $t \in \mathbb{R}$ and consider $Y = \{y_1, \ldots, y_n\}$ where $y_i = s\,x_i + t$ for $i = 1, \ldots, n$. We then have:
Shift and scale invariance of k-means clusters: $\mathcal{C}$ is a solution to the k-means clustering problem on $Y$.
Shift invariance and scale equivariance of the k-means objective function: the value of the objective function of this clustering is $s^2\,W_k(X)$.
Shift and scale invariance of the gap statistic: the number of clusters selected by the gap statistic is the same for $X$ and $Y$.
The exact same result holds for clustering with k-medians. These results are intuitive and follow from the fact that Manhattan and Euclidean distances are shift invariant and scale equivariant. We give the proof in the Supplementary Material. Using this proposition, it becomes clear that we need to bootstrap the reference distribution of $\log(W_k)$ only once, instead of once for every variable separately. The reason is that we can first rescale each variable in the dataset by its range, so that all variables have a range of length one. The reference distribution of $W_k$ is then the same for each of these variables, allowing us to bootstrap once from the uniform distribution on $[0,1]$ to obtain the reference distribution for all variables. Note that it does not matter on which interval of length 1 the variable takes its values, since the whole procedure is shift invariant as well.
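As a quick numerical sanity check of the proposition (this is our own snippet, not the authors' code), the following verifies that an affine transformation $y_i = s\,x_i + t$ leaves a fixed partition's within-cluster sum of squares multiplied by exactly $s^2$:

```python
import numpy as np

def within_ss(x, labels):
    # Sum of squared deviations around each cluster's own mean:
    # the k-means objective W_k for a fixed partition.
    return sum(((x[labels == c] - x[labels == c].mean()) ** 2).sum()
               for c in np.unique(labels))

rng = np.random.default_rng(0)
x = rng.normal(size=30)
labels = (x > 0).astype(int)          # some fixed two-cluster partition
s, t = 2.5, -7.0
w = within_ss(x, labels)
# Shift invariance and scale equivariance: W_k(s*x + t) = s^2 * W_k(x)
assert np.isclose(within_ss(s * x + t, labels), s**2 * w)
```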
Remark 2. The uniform distribution is not the only possible reference distribution. Another option, briefly suggested by Tibshirani et al. (2001), is to use log-concave density estimation, which is feasible for univariate distributions. This would yield a fit of a log-concave density to every variable in the data, from which bootstrap samples can then be drawn. While this takes into account the individual distribution of every variable, the drawback is that we would have to fall back to the slow approach of generating a separate reference distribution for every variable.
Remark 3. Instead of the gap statistic, the jump statistic (Sugar and James, 2003) may be used as an alternative. In the univariate setting, the jump statistic considers the 'distortions' $d_k$ for several numbers of clusters k. The jumps are then defined as $J_k = d_k^{-1/2} - d_{k-1}^{-1/2}$, with $d_0^{-1/2} := 0$. Finally, the number of clusters is estimated by taking $\hat{k} = \arg\max_k J_k$. One advantage of the jump statistic over the gap statistic is that it is faster to compute, since we do not need to bootstrap a reference distribution. In our simulations, however, the gap statistic performed better.
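A minimal sketch of this selection rule in Python (the transformation power of $-1/2$ follows Sugar and James' $p/2$ rule with $p = 1$; the function name is ours):

```python
import numpy as np

def jump_select_k(distortions, power=-0.5):
    """Pick the number of clusters via the jump statistic.

    distortions[k-1] holds the within-cluster distortion d_k for
    k = 1..K; power = -1/2 is the univariate (p = 1) transformation.
    """
    d = np.asarray(distortions, dtype=float) ** power
    # J_k = d_k^power - d_{k-1}^power, with d_0^power defined as 0
    jumps = np.diff(np.concatenate([[0.0], d]))
    return int(np.argmax(jumps)) + 1

# Distortions dropping sharply at k = 2 suggest two well-separated groups
print(jump_select_k([4.0, 0.02, 0.015, 0.012]))
```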
2.3 Algorithm
We are now ready to describe the procedure we propose for scaling a dataset prior to clustering. For a p-variate dataset with n observations per variable, we apply the following steps to scale all the variables:
Generate B bootstrap samples of size n from the uniform distribution on $[0,1]$ and cluster each of them using k-means for $k = 1, \ldots, K$. Retain all the values $\log(W_k^{*b})$, $b = 1, \ldots, B$. Denote with $\mu_k$ and $\sigma_k$ the estimates for the location and scale of $\log(W_k^*)$, respectively. Finally, put $s_k = \sigma_k \sqrt{1 + 1/B}$.
For all variables $X_j$, $j = 1, \ldots, p$, do:
Rescale the variable with its range to obtain $Z_j = X_j / R_j$, where $R_j = \max_i x_{ij} - \min_i x_{ij}$.
Cluster $Z_j$ using k-means for $k = 1, \ldots, K$ and retain the values $\log(W_k(Z_j))$.
Calculate the values of the gap statistic: $\mathrm{Gap}(k) = \mu_k - \log(W_k(Z_j))$.
Choose the number of clusters $\hat{k}$ as the smallest k such that $\mathrm{Gap}(k) \geq \mathrm{Gap}(k+1) - c\,s_{k+1}$.
Rescale the pooled scale for the chosen $\hat{k}$ back to the original units, $S_{\hat{k}}(X_j) = R_j\, S_{\hat{k}}(Z_j)$, and use this pSD to scale $X_j$.
The constant c in the above procedure controls the rejection of the null model. As c goes up, it becomes less likely to reject the null model of a single cluster. A default value is c = 1, which works well according to Breiman et al. (1984) and Tibshirani et al. (2001). For the number of bootstrap samples, a default of B = 1000 yields almost no variance in the resulting scale estimates in our experience. Replacing k-means with k-medians yields the scaling procedure with $M_k$, the pmad.
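Putting the steps above together, here is an end-to-end sketch in Python. It is illustrative only: the authors' implementation is in R, the exact dynamic-programming solver of Ckmeans.1d.dp is replaced by a plain Lloyd iteration, and the selection rule Gap(k) ≥ Gap(k+1) − c·s_{k+1} is the standard one from Tibshirani et al. (2001). All function names are ours.

```python
import numpy as np

def kmeans_1d(x, k, iters=100):
    """Univariate k-means via plain Lloyd iteration; returns the
    within-cluster sum of squares W_k. (The paper instead uses the
    exact dynamic-programming solver of Ckmeans.1d.dp.)"""
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        new = np.array([x[labels == j].mean() if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return sum(((x[labels == j] - centers[j]) ** 2).sum() for j in range(k))

def pooled_sd_scale(x, kmax=3, B=100, c=1.0, seed=0):
    """Choose k for one variable with the gap statistic, then return
    (k_hat, pooled SD of x on its original measurement scale)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Step 1: bootstrap the reference distribution of log(W_k) once from U(0,1)
    ref = np.array([[np.log(kmeans_1d(rng.uniform(size=n), k))
                     for k in range(1, kmax + 1)] for _ in range(B)])
    mu = ref.mean(axis=0)
    sk = ref.std(axis=0) * np.sqrt(1 + 1 / B)
    # Step 2: range-scale the variable and compute its gap curve
    range_x = x.max() - x.min()
    z = (x - x.min()) / range_x
    logw = np.array([np.log(kmeans_1d(z, k)) for k in range(1, kmax + 1)])
    gap = mu - logw
    # Step 3: smallest k with Gap(k) >= Gap(k+1) - c * s_{k+1}
    khat = kmax
    for k in range(kmax - 1):
        if gap[k] >= gap[k + 1] - c * sk[k + 1]:
            khat = k + 1
            break
    # Step 4: pooled SD of z for k_hat clusters, rescaled back by the range
    wk = kmeans_1d(z, khat)
    return khat, range_x * np.sqrt(wk / (n - khat))
```

On a clearly bimodal variable this procedure selects two clusters and returns a pooled SD far below the classical SD, so the variable's separative power survives the scaling.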
Example 1: Fisher’s Iris data
As an illustrative example, we consider Fisher's well-known Iris dataset (Fisher, 1936), collected by Anderson (1935). This dataset contains 50 samples from each of three species of iris: Iris setosa, I. versicolor and I. virginica. Each flower is described by four variables giving the dimensions of its sepal and petal. Table 1 illustrates the effect of scaling this data with the proposed pSD and compares it with the SD and the range. For the first two variables, the pSD is equal to the SD, because these variables do not clearly distinguish any groups in the data. However, variables 3 and 4 each seem to distinguish two clear groups, resulting in considerably lower pSDs. The last row of the table reports the adjusted Rand index (ARI) (Hubert and Arabie, 1985) with respect to the true classification when performing k-means with k = 3 on the dataset after scaling. The ARI takes values between −1 and 1, where 1 indicates perfect agreement between partitions and lower values indicate more disagreement. Scaling with the pSD clearly gives better results than scaling with the SD or the range, since neither of those takes into account the individual separative power of the variables. As a reference, k-means clustering without scaling gives an ARI of 0.73.
The effect of variable scaling on Fisher's Iris data

| Variable | SD | Range | pSD |
|---|---|---|---|
| Sepal length | 0.83 | 3.6 | 0.83 |
| Sepal width | 0.44 | 2.4 | 0.44 |
| Petal length | 1.77 | 5.9 | 0.40 |
| Petal width | 0.76 | 2.4 | 0.18 |
| k-means ARI | 0.62 | 0.72 | 0.89 |
Note: The pSD produces smaller scales for variables 3 and 4 than the classical SD and the range. This results in a higher ARI when clustering the data with the k-means algorithm after scaling.
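The ARI reported above can be computed directly from the contingency table of two partitions via the Hubert and Arabie (1985) formula. A compact Python sketch (function name ours; real analyses would typically use an existing implementation):

```python
import numpy as np
from math import comb

def adjusted_rand_index(a, b):
    """Hubert-Arabie adjusted Rand index between two label vectors."""
    a, b = np.asarray(a), np.asarray(b)
    ua, ub = np.unique(a), np.unique(b)
    # Contingency table of the two partitions
    table = np.array([[np.sum((a == i) & (b == j)) for j in ub] for i in ua])
    sum_cells = sum(comb(int(n), 2) for n in table.ravel())
    sum_rows = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_cols = sum(comb(int(n), 2) for n in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(len(a), 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)
```

Note that the index is invariant to relabeling: two partitions that group the points identically get an ARI of 1 regardless of which label names the clusters carry.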
3 Results
3.1 Simulation study
The most well-known comparative study on the scaling of variables in clustering is arguably the one by Milligan and Cooper (1988), building on Milligan (1985). Recently, Qiu and Joe (2006) used the design of this simulation study as a basis for a new algorithm to generate clusters with a specified degree of separation. The R-package clusterGeneration (Qiu and Joe, 2015) contains an implementation of their algorithm, and it forms the basis of our simulation study.
We compare the following types of scaling in our simulation study:
No scaling
The SD: $S(X_j) = \sqrt{\tfrac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}$
The range: $R(X_j) = \max_i x_{ij} - \min_i x_{ij}$
The mad: $\operatorname{mad}(X_j) = 1.4826 \cdot \operatorname{med}_i\,|x_{ij} - \operatorname{med}_l(x_{lj})|$
The pSD: $S_{\hat{k}}$, as introduced in Section 2.1
The pmad: $M_{\hat{k}}$, its k-medians counterpart
In order to get a complete picture of the different scaling methods, we perform an extensive simulation study. Each generated dataset has an even number of clean variables, which we vary between 2 and 10. The clusters are generated from the multivariate standard normal distribution. We consider equally sized clusters of size 100 each. The degree of separation between the clusters is either separated (0.21) or well-separated (0.34), see Qiu and Joe (2006). We then add a percentage of noise variables to the dataset, varied between 0% and 2000% of the number of clean variables. The noise variables can either be multivariate standard normal or uniform over the range of the clean variables. The uniform-noise variables are generated by adding a small Gaussian perturbation to an equally spaced grid over the range of the signal variables to ensure that they do not have any separative power. We ran the simulation on both clean and contaminated data. For the contaminated data, 5% of the observations of each signal variable are replaced with points sampled randomly from a uniform distribution on an interval determined by the mean and SD of the signal variable. Table 2 summarizes the factors of our simulation study and their levels.
Design factors of the simulation study
| Factor | Levels | # |
|---|---|---|
| Number of clean variables | 2, 4, 6, 8, 10 | 5 |
| Number of clusters | 2, 3, 4, 5 | 4 |
| Degree of separation | separated, well-separated | 2 |
| % noise variables added | 0, 50, 100, 150, 200, 500, 1000, 2000 | 8 |
| Type of noise | Gaussian, uniform | 2 |
| % outliers | 0, 5 | 2 |
Altogether this gives 1280 different settings, and for each of these, we generate 100 datasets. Each generated dataset is scaled using the six scale estimators described above. Afterwards, we apply the most popular methods of connectivity-based and centroid-based clustering. More specifically, we use hierarchical clustering (Anderberg, 2014; Hartigan, 1975) on the Euclidean distances with single, average, complete and Ward's linkage functions (Ward, 1963) as well as k-means (Lloyd, 1982; MacQueen et al., 1967) and partitioning around medoids using the Manhattan distance (Kaufman and Rousseeuw, 2009). For k-means, the algorithm of Hartigan and Wong (1979) is used with 100 random starts and 100 maximum iterations for each starting value.
We compare the results using the ARI (Hubert and Arabie, 1985), which lies between −1 and 1, where 1 indicates a perfect clustering. We cluster each dataset for a range of target numbers of clusters around T, where T denotes the true number of clusters. For k-means, this value is a direct input, whereas for hierarchical clustering, we cut the dendrogram at these various levels of k. We then pick the optimal value of the ARI over these different clusterings. The reason for this procedure is that we want to evaluate the effect of scaling on clustering without any distortion from the question of how to choose the optimal number of clusters. Furthermore, particularly in the case of hierarchical clustering, the clusterings resulting from a higher number of partitions are often more reflective of the real underlying structure than cutting the dendrogram at the true number of clusters, since single outlying observations can distort the dendrogram significantly.
Figure 1 shows the big picture of the simulation results for hierarchical clustering on data without outliers. For each type of linkage, the graphs show the average ARI over all settings as a function of the number of noise variables on the x-axis. Several interesting observations can be made from these plots. As the number of noise variables increases, the clustering gets more difficult, resulting in generally lower ARI values. However, scaling with the SD or the mad is clearly more sensitive to noise variables than the other methods. Scaling with the range does fairly well, which is in line with the findings of Milligan and Cooper (1988). The pooled scale estimators outperform all the other methods, especially when the number of noise variables is large. The difference between the pSD and the pmad appears small and not significant in these plots. Finally, not scaling performs similarly to scaling with either the SD or the mad. This is due to the particular simulation setup; one must keep in mind that the performance of not scaling can be completely destroyed by a single noise variable whose variance is much larger than that of the signal variables.
Simulation results for hierarchical clustering with single (a), average (b), complete (c) and Ward (d) linkage functions on outlier-free data. The pooled scale estimators are the least sensitive to an increasing number of noise variables
Figure 2 shows the results for k-means clustering and partitioning around medoids. The conclusions are largely the same as for the case of hierarchical clustering. When scaling with the SD or mad, the true clustering structure seems more difficult to retrieve. The range preserves more of the cluster structure and performs better than the SD and mad, which is in line with Steinley (2004). However, the pooled scale estimators again outperform the competitors.
Simulation results for k-means (a) and partitioning around medoids (b) on data without outliers. The pooled scale estimators are more resistant to the addition of noise variables to the data
The simulation results presented above only give a rough overview of the performance of the methods and do not show the performance in the presence of outliers. The most interesting insight from a more detailed analysis is that the performance of the methods depends strongly on the type of noise variables added to the signal variables. Scaling with the range works well when the noise variables are Gaussian but fails when the noise is uniform: uniform noise variables have a large variance relative to their range, so their impact on the clustering remains large after range scaling. Conversely, scaling with the SD or the mad works much better for uniform noise than for Gaussian noise, since dividing uniform noise variables by their comparatively large SD pushes them toward the center and limits their influence.
The effect of outliers is shown in Supplementary Figures S1 and S2. These results expose the sensitivity of the range to the presence of outliers. Other than that, the relative performance of the different scaling methods is roughly the same as in the case of outlier-free data. An interesting note is that the pmad is clearly more robust to outliers than the pSD, which resembles the robustness of the mad versus that of the SD in classical scale estimation.
3.2 Gene expression example
In a seminal paper, Perou et al. (2000) analyzed gene expression patterns of 65 surgical specimens of human breast tumors. The data are publicly available at https://www.omicsdi.org/dataset. They identified 496 intrinsic genes, which had significantly larger variation between different tumors compared with the variation between paired samples from the same tumor. Using hierarchical clustering with Eisen linkage (Eisen et al., 1998), they clustered the tumors into four different types: basal-like, Erb-B2+, normal-breast-like and luminal epithelial/ER+. We illustrate the effect of the pSD in hierarchical clustering with average, complete and Ward linkage on these 496 genes.
Figure 3 shows the resulting dendrograms when applying each of these clustering algorithms to the dataset after scaling it in various ways. For average linkage, the pSD clearly outperforms the other options. It only misclassifies two observations, whereas the other methods split the large group of luminal epithelial/ER+ tumors (in cyan) in two or more clusters and fail to identify the smallest group (in green), which contains the Erb-B2+ tumors. Complete linkage gives better results for all types of scaling. However, without scaling or when using the range, the largest group of tumors gets split up into two groups. This does not happen when scaling with the SD, yet quite a few tumors are misclassified in this case. When scaling with the pSD, only two tumors are misclassified using complete linkage. Finally, with Ward’s linkage, both the pSD and scaling with the range work well, with two and one misclassified tumor, respectively. Without scaling and with the SD however, the largest cyan cluster gets split up in two smaller clusters. In conclusion, the pSD yields better recovery of the true clusters than the other scaling methods.
The effect of variable scaling on the gene expression data. The colors correspond to the tumor type: basal-like in red, Erb-B2+ in green, normal-breast-like in dark blue and luminal epithelial/ER+ in cyan. The pSD generally yields superior recovery of the true clusters. (Color version of this figure is available at Bioinformatics online.)
Genes for which the pSD is smaller than the SD
| Variable | Description | SD/pSD |
|---|---|---|
| GF200: 96(8C12): 384(2F23) | ESTROGEN REGULATED LIV-1 PROTEIN (LIV-1) MRNA, PARTIAL CDS H29407 45 | 3.5 |
| GF201: 96(88H2): 384(11O4) | GROWTH FACTOR RECEPTOR-BOUND PROTEIN 7 H53703 224 | 2.4 |
| PEROU: 96(7F8): 384(20L16) | REGULATOR OF CHROMATIN, SUBFAMILY E, MEMBER 1 W63613 228 | 2.4 |
| GF200: 96(13D9): 384(4G17) | CYTOCHROME P450, SUBFAMILY IIA (PHENOBARBITAL-INDUCIBLE), POLYPEPTIDE 7 T73031 61 | 2.2 |
| PEROU: 96(8A1): 384(20B1) | 68400 T57034 226 | 2.2 |
| PEROU: 96(6A1): 384(20A2) | 68400 T57034 227 | 2.2 |
| GF200: 96(14D12): 384(4G24) | APOLIPOPROTEIN D H15842 | 2.2 |
| PEROU: 96(9A9): 384(18B18) | IMMUNOGLOBULIN J CHAIN H24896 325 | 2.1 |
Note: The third column shows the ratios of these two scales.
In addition to improved clustering results, the pooled scaling procedure yields a diagnostic tool in the form of scale ratios. More precisely, we can compare the SDs of the variables with their pooled counterparts. Variables for which the ratio of these two scales is large typically show a clear grouping of the data, whereas variables for which this ratio is close to 1 do not distinguish clear groups. These ratios can thus be used as a fast and intuitive variable-screening procedure to identify potentially very informative variables, which is often a main research goal in this context.
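The screening idea behind these ratios can be illustrated with the classical pooled-variance formula. The sketch below is a simplification, not the paper's estimator: the proposed method computes the pooled scale without known cluster labels (scaling happens before clustering), whereas here group labels are assumed given, and the function names `pooled_sd` and `scale_ratio` are our own.

```python
import statistics

def pooled_sd(values, labels):
    """Pooled SD: within-group variances weighted by their degrees of freedom."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    ss = sum((len(g) - 1) * statistics.variance(g)
             for g in groups.values() if len(g) > 1)
    return (ss / (n - k)) ** 0.5

def scale_ratio(values, labels):
    """Diagnostic ratio SD/pSD; values well above 1 suggest cluster structure."""
    return statistics.stdev(values) / pooled_sd(values, labels)

# Two well-separated groups: the overall SD is inflated by the between-group
# gap, while the pooled SD reflects only the within-group spread.
x = [0.0, 0.1, -0.1, 10.0, 10.1, 9.9]
print(scale_ratio(x, [0, 0, 0, 1, 1, 1]))  # large ratio, far above 1
print(scale_ratio(x, [0, 0, 0, 0, 0, 0]))  # 1.0: a single group, no structure
```

With a single group the pooled SD coincides with the SD and the ratio equals 1, which is why ratios near 1 flag uninformative variables.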
In this example, only 8 of the 496 variables had a pooled scale different from the SD. The information on those eight variables is presented in Table 3. Figure 4 shows the gene expressions for the four genes with the highest scale ratios. The colors again correspond to the four types of tumors. The top left panel shows the gene GF200:96(8C12):384(2F23), which clearly groups the red and blue points together and also shows high values for the majority of the cyan group. The top right and bottom left panels show the genes GF201:96(88H2):384(11O4) and PEROU:96(7F8):384(20L16), which distinctly separate the green tumors from the others. In the bottom right panel, the red tumors seem to have slightly lower values than the others, the blue tumors are grouped very tightly together and the cyan tumors appear to contain a sub-cluster with elevated values for this gene.
4 Discussion
We introduced a new approach to variable scaling prior to cluster analysis, which we call pooled scale estimation. The performance of pooled scaling was compared with the most common competitors across several popular clustering techniques, with a particular focus on high-dimensional data containing many uninformative noise variables. Scaling with the pooled scale estimators yields superior cluster recovery for the different clustering methods, in particular when the data contain many noise variables. Since this is a common situation in the clustering of gene expression data, the performance was also illustrated on breast cancer gene expressions, where pooled scaling again outperformed the competition. The pooled scale estimates yield an additional diagnostic tool in the form of the ratio between the default and pooled scales, which quantifies the presence of cluster structure in the individual variables.
Acknowledgements
The authors would like to thank the Associate Editor, Prof. Dr Jonathan Wren, and three anonymous reviewers for their useful comments.
Funding
This work was supported by internal funds of the KU Leuven (to J.R.); and by the Discovery grant from the Natural Sciences and Engineering Research Council (to R.H.Z.).
Conflict of Interest: none declared.