Abstract

Motivation

Many popular clustering methods are not scale-invariant because they are based on Euclidean distances. Even methods using scale-invariant distances, such as the Mahalanobis distance, lose their scale invariance when combined with regularization and/or variable selection. Therefore, the results from these methods are very sensitive to the measurement units of the clustering variables. A simple way to achieve scale invariance is to scale the variables before clustering. However, scaling variables is a very delicate issue in cluster analysis: A bad choice of scaling can adversely affect the clustering results. On the other hand, reporting clustering results that depend on measurement units is not satisfactory. Hence, a safe and efficient scaling procedure is needed for applications in bioinformatics and medical sciences research.

Results

We propose a new approach for scaling prior to cluster analysis based on the concept of pooled variance. Unlike available scaling procedures, such as the SD and the range, our proposed scale avoids dampening the beneficial effect of informative clustering variables. We confirm through an extensive simulation study and applications to well-known real-data examples that the proposed scaling method is safe and generally useful. Finally, we use our approach to cluster a high-dimensional genomic dataset consisting of gene expression data for several specimens of breast cancer tissue obtained from human patients.

Availability and implementation

An R-implementation of the algorithms presented is available at https://wis.kuleuven.be/statdatascience/robust/software.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Every time cluster analysis is used to find homogeneous groups in the data, we face the issue of how to scale the variables. The choice of scaling (including the option of not scaling at all) has practical implications because, in general, clustering methods are not scale-invariant. Even clustering methods that are scale-invariant become scale-dependent when they are combined with variable selection or regularization. As a consequence, we may get a different data partition if the variables in the data are rescaled. Unfortunately, there is no generally accepted way to scale the variables for clustering (Jain, 2010; Jain and Dubes, 1988).

There are conflicting recommendations regarding the scaling of the data in the clustering literature. For example, Milligan and Cooper (1988), often cited as the main benchmark study of this topic in the context of hierarchical clustering, recommend scaling the variables by their range. Steinley (2004) came to the same conclusion using k-means. On the other hand, Vesanto (2001) recommends the use of the SD for k-means clustering. Schaffer and Green (1996) study clustering on real-data examples and argue against scaling with the range or the SD. Stoddard (1979) also argued against the use of the SD in the analysis of laboratory procedures.

While scaling may not always be necessary, it seems that in general the most reliable approach is to use some sort of scaling, for two main reasons. The first reason is that scaling removes the measurement units of the variables. These units may have a strong influence on the clustering results, to the extent that a single variable with very large values can solely determine the whole clustering outcome. Moreover, removing measurement units is often of practical importance: a variable measuring 'height' should have the same effect on the clustering procedure whether it is measured in centimeters or in meters. The second reason is that scaling becomes practically mandatory in the context of high-dimensional clustering with variable selection. Variable selection is usually achieved through the addition of a penalization term to the objective function. This penalization is typically not scale-invariant and thus yields different outcomes depending on the variables' relative sizes. Therefore, without scaling, the penalty acts differently on each variable, which will often lead to ineffective variable selection. For these reasons, we consider feature scaling a necessary pre-processing step in cluster analysis.

Though the arguments in favor of scaling before cluster analysis are clear, the issue of how to scale is delicate. The reason is that there are two types of variables: informative variables and noise variables. Informative variables help to separate clusters in the data whereas noise variables do not. If an informative variable is scaled with a very large scale, it will be compressed to a small size and thus there is a risk that much of its clustering power will disappear. In contrast, if a noise variable is scaled with a very small scale, it can potentially be blown up to the point where it solely determines the clustering outcome. Note that this issue is even more pronounced in high-dimensional datasets, as they typically contain many noise variables.

Interestingly, most of the commonly used scale estimators, such as the SD and the range, have a tendency to yield large scale estimates on potentially very informative variables. More precisely, these estimators may produce unduly large scales for a variable with a multimodal distribution because of the dispersion between the groups. Hence, a drawback of feature scaling by these estimators is the possible undesirable reduction of the relative clustering importance of features that clearly separate different groups. While these features are not guaranteed to reveal the whole clustering structure of the data, they show good potential and should be treated carefully.

The discussion so far suggests that there is a need for further study of both the effect of scaling and the best choice of scale estimator in cluster analysis. Despite the considerable potential impact of scaling on clustering outcomes, papers on this topic are scarce and lack consensus.

In Section 2, we introduce two new scale estimators specially designed to scale variables before clustering: the pooled standard deviation (pSD) and the pooled absolute deviation. By using pooled scale estimators, we aim to scale variables without destroying their clustering power if they have any. If the marginal distribution of a feature shows evidence of clustering, our proposed scaling will enhance its clustering importance. If the marginal distribution of a feature X does not show evidence of clustering, but the joint distribution of some subset of features including X carries important clustering information, our proposed scaling approach will have a neutral effect on this variable. The proposed scaling approach can be used as pre-processing before applying any clustering method, including those with variable selection and subspace clustering, as illustrated with several examples. To calculate our proposed scale, we first run k-means clustering on each of the variables. In order to choose the number of clusters used to estimate the pooled scales, we use the Gap statistic (Tibshirani et al., 2001) with a sped-up bootstrapping procedure that bypasses the otherwise unaffordable computational burden of this approach.

The rest of the article is organized as follows. In Section 3, we compare the new scale estimators with existing scaling procedures in an extensive simulation study. Moreover, we show an application of the pooled scale estimators prior to hierarchical clustering of gene expression data from breast cancer tissue samples (additional real-data examples are presented in the Supplementary Material). Finally, Section 4 concludes.

2 Materials and methods

2.1 Pooled scale estimators

Our starting point is univariate k-means clustering. Assume we have a univariate dataset $x_1, \ldots, x_n$. The well-known k-means method looks for the $k$ cluster centers that minimize the sum of squared deviations from every point in the dataset to its closest cluster center. More precisely, the vector of cluster centers is defined as:
\mu = (\mu_1, \ldots, \mu_k) = \operatorname{argmin}_{t = (t_1, \ldots, t_k)} S_k(t),
where
S_k(t) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d_{2,i}(t)}
(1)
and $d_{2,i}(t) = \min_{1 \le j \le k} \lVert x_i - t_j \rVert_2^2$. Note that $S_k(\mu)$ can be interpreted as an estimator of scale. In particular, if $k = 1$, then $\mu$ is the classical sample mean and $S_1(\mu)$ reduces to the classical SD. If $k > 1$, the squared scale can be interpreted as a pooled variance of the points around their cluster centers. As an example, suppose that $k = 2$ and that the sets $C_1$ and $C_2$ contain the indices of the points in the two clusters. Moreover, let $|A|$ denote the number of elements in the set $A$, so that $|C_1| + |C_2| = n$. We then have $S_2^2(\mu) = \frac{1}{n}\sum_{i=1}^{n} d_{2,i}(\mu) = \frac{1}{n}\sum_{i \in C_1} \lVert x_i - \mu_1 \rVert_2^2 + \frac{1}{n}\sum_{i \in C_2} \lVert x_i - \mu_2 \rVert_2^2 = \frac{|C_1|}{n}\mathrm{Var}(C_1) + \frac{|C_2|}{n}\mathrm{Var}(C_2)$, where $\mathrm{Var}(C_j)$ denotes the sample variance of the elements belonging to cluster $j$. We thus obtain a weighted mean of the within-cluster variances, with weights proportional to the number of observations in each cluster.

Our idea is to use $S_k$, with an appropriate (variable-dependent) value of $k$, for scaling the variables prior to the application of a clustering procedure. We will refer to this scale estimator as the pSD. The intuition behind it is the following. If a variable does not appear to separate any clusters in the data, we consider its variability to be uninformative and we use the largest scale, $S_1$ (i.e. the classical SD), to scale it before clustering. However, when a variable does seem to separate, e.g., $k$ clusters, then $S_k$ will tend to be small relative to $S_1$, and using $S_k$ in that case avoids dampening the variability (i.e. information) in this variable. This way, we hope to preserve as much of the clustering information in each variable as possible, while still making the variables unitless.
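The pSD can be sketched in a few lines of code. The following illustrative Python snippet is not the authors' R implementation: `kmeans_1d` is a hypothetical helper using plain multistart Lloyd iterations rather than the exact dynamic-programming solver mentioned in Remark 1.

```python
import math
import random

def kmeans_1d(x, k, n_starts=20, iters=100, seed=0):
    """Univariate k-means via Lloyd iterations; returns (centers, S_k)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        centers = rng.sample(x, k)
        for _ in range(iters):
            # assign each point to its nearest center
            clusters = [[] for _ in range(k)]
            for xi in x:
                j = min(range(k), key=lambda j: (xi - centers[j]) ** 2)
                clusters[j].append(xi)
            new_centers = [sum(c) / len(c) if c else centers[j]
                           for j, c in enumerate(clusters)]
            if new_centers == centers:
                break
            centers = new_centers
        # S_k^2 = (1/n) * sum of squared distances to the closest center
        s_k = math.sqrt(sum(min((xi - c) ** 2 for c in centers)
                            for xi in x) / len(x))
        if best is None or s_k < best[1]:
            best = (centers, s_k)
    return best

x = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]    # two visible groups
_, s1 = kmeans_1d(x, 1)               # S_1 equals the classical SD
_, s2 = kmeans_1d(x, 2)               # much smaller: within-group spread only
```

On this toy sample, $S_1$ reproduces the SD while $S_2$ captures only the within-group spread, illustrating why dividing an informative variable by $S_k$ rather than $S_1$ preserves its clustering power.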

As an alternative to k-means, k-medians clustering can also be used for the scaling of the variables. Using the notation introduced above, the vector of cluster centers in k-medians is defined as:
\mu = (\mu_1, \ldots, \mu_k) = \operatorname{argmin}_{t = (t_1, \ldots, t_k)} M_k(t),
where
M_k(t) = \frac{1}{n} \sum_{i=1}^{n} d_{1,i}(t)
and $d_{1,i}(t) = \min_{1 \le j \le k} |x_i - t_j|$. The cluster centers now correspond to the median of the observations in each cluster. Note that, once again, $M_k$ can be interpreted as an estimator of scale. If $k = 1$, then $\mu$ is the classical sample median and $M_1(\mu)$ is the mean absolute deviation (mad) from the median. If $k > 1$, $M_k$ can be interpreted as the pooled mean absolute deviation of every point around its cluster center, where the pooling is done by a weighted average with weights determined by the number of observations in each cluster. We will call $M_k$ the pmad. The pmad is expected to be more robust against outliers than the pSD, which is consistent with the results of the simulation study in Section 3.

Remark 1. Since we are using univariate k-means and k-medians to obtain the pSD and pmad, it is worth noting that there is an algorithm that guarantees convergence to a global optimum instead of a local one. This algorithm uses dynamic programming, runs in $O(kn^2)$ time and is implemented in the R-package Ckmeans.1d.dp, see Wang and Song (2011).

2.2 Determining k

An important question regarding our proposed pooled scale estimators is how to choose the appropriate number of clusters k* for each variable in the dataset. There is a vast literature on how to choose the appropriate number of clusters in cluster analysis. Relevant comparative studies giving a good overview of the most common techniques are found in Milligan and Cooper (1985), Halkidi et al. (2001), Maulik and Bandyopadhyay (2002), Brun et al. (2007) and Arbelaitz et al. (2013). In the setting of pooled scale estimators, we need a criterion that is relatively fast to compute and not too sensitive to spurious clusters. Most importantly, we need a criterion that can distinguish between the 'null case' of one cluster versus the alternative case of two or more clusters. This is an important point, since especially in high-dimensional datasets, we expect that many variables may not carry interesting information for clustering the data and thus should be scaled by the largest scale, i.e. the SD or mad. Note that this requirement makes many popular cluster validation indices unsuitable for us, since many of them do not yield a reasonable comparative assessment of the one-cluster case.

A well-known criterion that satisfies our needs is the Gap statistic (Hastie et al., 2009; Tibshirani et al., 2001). Given a certain clustering algorithm and the resulting partition of the data, the gap statistic works as follows. For each number of clusters under consideration, the 'tightness' of the clusters in the found partition is compared with the tightness of clusters obtained by clustering random datasets using the same algorithm. If this difference is large for a given number of clusters, it indicates 'stronger than random' clustering structure and the gap statistic will pick that number of clusters as the true number. A mathematical description of the gap statistic is as follows. Suppose we have clustered the data into $k$ clusters $C_1, \ldots, C_k$, where $C_j$ denotes the indices of the observations in cluster $j$. Let $W_k$ be the sum of the within-cluster sums of squares around the corresponding cluster means, i.e. $W_k = \sum_{j=1}^{k} \sum_{i \in C_j} (x_i - \bar{x}_j)^2$, where $\bar{x}_j = \frac{1}{|C_j|}\sum_{i \in C_j} x_i$ is the mean of the observations in cluster $j$. In order to identify the number of clusters, the value of $\log(W_k)$ is compared to its expected value $E_n^*[\log(W_k)]$ under a uniform reference distribution on the range of the dataset. If $\log(W_k)$ deviates too much from this expected value, it indicates the existence of clusters in the data. The intuition for the comparison with the uniform distribution is that, within the family of log-concave densities, it is the distribution most likely to generate spurious clusters and will thus on average provide the strongest evidence against the alternative hypothesis.

In principle, the reference distribution of $\log(W_k)$ is determined by generating bootstrap samples from the uniform distribution. As we would like to scale every variable in the dataset, this appears to entail a prohibitive computational cost that scales poorly with the number of variables. Fortunately, there is an efficient way to bypass this hindrance, based on the results of Proposition 1 below.

In our case, the clusters come from the k-means (or k-medians) clustering algorithm. This means that for a given clustering $C_1, \ldots, C_k$, we have
W_k = \sum_{r=1}^{k} \sum_{i \in C_r} d_{2,i}(\mu_1, \ldots, \mu_k) = n S_k^2
in the computation of the pSD. For the pmad, we redefine $W_k$ such that it corresponds to the pooled within-cluster sum of absolute deviations around the cluster medians:
W_k = \sum_{r=1}^{k} \sum_{i \in C_r} d_{1,i}(\mu_1, \ldots, \mu_k) = n M_k.

Therefore, in our setting, the gap statistic will be large when there is a 'significant difference' between the estimated pooled scale and the pooled scale estimated on data coming from a uniform distribution.

In order to estimate $E_n^*[\log(W_k)]$ and $\mathrm{Var}_n^*[\log(W_k)]$, we apply k-means to $B$ bootstrap samples of size $n$. The mean over these samples serves as the estimate of $E_n^*[\log(W_k)]$, and the appropriate scale, which accounts for the simulation error in $E_n^*[\log(W_k)]$, is the SD of the bootstrap samples multiplied by $\sqrt{1 + 1/B}$.

We now turn to speeding up the bootstrapping procedure for scaling all the variables in a dataset. The speed-up is achieved by exploiting the fact that we are in the univariate setting and by using the properties of the proposed scaling methods established in the following proposition. 

Proposition 1. Let $x = (x_1, \ldots, x_n)$ be a sample of univariate observations and let $C_1, \ldots, C_k$ be a partition of $x$ resulting from solving the k-means clustering problem. Denote by $S_k$ the value of the objective function, as in Equation (1). Let $s > 0$ and $t \in \mathbb{R}$, and consider $z = (z_1, \ldots, z_n)$, where $z_i = (x_i - t)/s$ for $i = 1, \ldots, n$. We then have:

  1. Shift and scale invariance of the k-means clusters: $C_1, \ldots, C_k$ is a solution to the k-means clustering problem on $z$.

  2. Shift invariance and scale equivariance of the k-means objective function: the value of the objective function of this clustering is $S_k/s$.

  3. Shift and scale invariance of the gap statistic: the number of clusters selected by the gap statistic is the same for $x$ and $z$.

The exact same result holds for clustering with k-medians. These results are intuitive and follow from the fact that the Manhattan and Euclidean distances are scale equivariant. We give the proof in the Supplementary Material. Using this proposition, it becomes clear that we need to bootstrap the reference distribution of $\log(W_k)$ only once, instead of for every variable separately. The reason is that we can first rescale each variable in the dataset by its range, so that all of them have a range of length one. The reference distribution of $W_k$ is then the same for each of these variables, allowing us to bootstrap once from the uniform distribution on $[0,1]$ to obtain the reference distribution for all variables. Note that it does not matter on which interval of length 1 the variable takes its values, since the whole procedure is shift-invariant as well.
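The invariance claims of Proposition 1 can be checked numerically. The sketch below is an illustration under stated assumptions, not a proof: it brute-forces the optimal 2-cluster split (`s2_best` is a hypothetical helper), which is valid in one dimension because optimal k-means clusters are contiguous on the sorted data.

```python
import math

def s2_best(x):
    """Best S_2 over contiguous 2-cluster splits of sorted 1-d data.
    In one dimension optimal k-means clusters are contiguous, so this
    brute force finds the global optimum."""
    xs = sorted(x)
    best = None
    for cut in range(1, len(xs)):
        parts = (xs[:cut], xs[cut:])
        # within-cluster sum of squares around each part's mean
        ss = sum((v - sum(c) / len(c)) ** 2 for c in parts for v in c)
        s_2 = math.sqrt(ss / len(xs))
        if best is None or s_2 < best[0]:
            best = (s_2, cut)
    return best

x = [0.3, 0.5, 0.4, 8.0, 8.2, 8.4, 8.1]
s, t = 2.5, -1.7                      # arbitrary scale and shift
z = [(xi - t) / s for xi in x]
(sk_x, cut_x), (sk_z, cut_z) = s2_best(x), s2_best(z)
# same optimal split for x and z, and the objective is divided by s
```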

Remark 2. The uniform distribution is not the only possible reference distribution one can use. Another option, briefly suggested by Tibshirani et al. (2001), is to use log-concave density estimation, which is feasible for univariate distributions. This would yield a fit of a log-concave density to every variable in the data, from which bootstrap samples can then be drawn. While this takes into account the individual distribution of every variable, the drawback is that we then have to take the slow approach of generating a separate reference distribution for every variable.

Remark 3. Instead of the gap statistic, the jump statistic (Sugar and James, 2003) may be used as an alternative. In the univariate setting, the jump statistic considers the 'distortions' $\hat{d}_k = S_k^2$ for several numbers of clusters $k$. The jumps are then defined as $J_k = \hat{d}_k^{-1/2} - \hat{d}_{k-1}^{-1/2}$, where $\hat{d}_0^{-1/2} \equiv 0$. Finally, the number of clusters is estimated as $k^* = \operatorname{argmax}_k J_k$. One advantage of the jump statistic over the gap statistic is that it is faster to compute, since we do not need to bootstrap a reference distribution. In our simulations, however, the gap statistic performed better.
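A minimal sketch of the jump statistic from Remark 3, assuming the distortions $\hat{d}_k = S_k^2$ have already been computed (the distortion values below are made up for illustration only):

```python
def jump_k(distortions):
    """Pick k maximizing J_k = d_k^{-1/2} - d_{k-1}^{-1/2}, d_0^{-1/2} = 0."""
    inv = [0.0] + [d ** -0.5 for d in distortions]
    jumps = [inv[k] - inv[k - 1] for k in range(1, len(inv))]
    return 1 + max(range(len(jumps)), key=jumps.__getitem__)

# made-up distortions d_k = S_k^2 for k = 1..4: a sharp drop after k = 1
d = [4.0, 0.04, 0.03, 0.025]
k_star = jump_k(d)   # the sharp drop in distortion yields k* = 2
```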

2.3 Algorithm

We are now ready to describe the procedure we propose for scaling a dataset prior to clustering. For a $p$-variate dataset $X_1, \ldots, X_p$ with $n$ observations per variable, we apply the following steps to scale all the variables:

  1. Generate $B$ bootstrap samples of size $n$ from the uniform distribution on $[0,1]$ and cluster each of them using k-means. Retain all the values $W^*_{k,b}$ for $k = 1, \ldots, k_{\max}$ and $b = 1, \ldots, B$. Denote by $m_k = \frac{1}{B}\sum_b \log(W^*_{k,b})$ and $sd_k = \left[\frac{1}{B}\sum_b \left(\log(W^*_{k,b}) - m_k\right)^2\right]^{1/2}$ the estimates of the location and scale of $\log(W_k^*)$, respectively. Finally, put $s_k = \sqrt{1 + 1/B}\, sd_k$.

  2. For all variables $X_j$, $j = 1, \ldots, p$, do:

    • Rescale the variable by its range to obtain $Z_j = X_j / r_j$, where $r_j = \mathrm{range}(X_j)$.

    • Cluster $Z_j$ using k-means for $k = 1, \ldots, k_{\max}$ and retain the values $W_{k,j} = n S_{k,j}^2$.

    • Calculate the values of the gap statistic: $\mathrm{Gap}_j(k) = m_k - \log(W_{k,j})$.

    • Choose the number of clusters $k^*$ as the smallest $k$ such that $\mathrm{Gap}_j(k) \ge \mathrm{Gap}_j(k+1) - c\, s_{k+1}$.

    • Rescale the value of the objective function for the selected $k^*$, i.e. compute $r_j S_{k^*,j}$, and use this pSD to scale $X_j$.

The constant $c$ in the above procedure controls the rejection of the null model. As $c$ increases, it becomes less likely to reject the null model of a single cluster. A default value is $c = 1$, which works well according to Breiman et al. (1984) and Tibshirani et al. (2001). For the number of bootstrap samples, a default of $B = 1000$ yields almost no variance in the resulting scale estimates in our experience. Replacing k-means with k-medians yields the scaling procedure with $M_k$, the pmad.
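Putting the steps together, the following condensed Python sketch illustrates the whole procedure. It is not the authors' R implementation: it assumes plain multistart Lloyd k-means instead of the exact Ckmeans.1d.dp solver, and a much smaller $B$ than the recommended default of 1000, purely to keep the illustration fast; all helper names are hypothetical.

```python
import math
import random
from statistics import pstdev

def kmeans_obj(x, k, rng, n_starts=4, iters=20):
    """W_k = n * S_k^2 from univariate k-means (multistart Lloyd)."""
    best = None
    for _ in range(n_starts):
        centers = rng.sample(x, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for xi in x:
                j = min(range(k), key=lambda j: (xi - centers[j]) ** 2)
                clusters[j].append(xi)
            new = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
            if new == centers:
                break
            centers = new
        wk = sum(min((xi - c) ** 2 for c in centers) for xi in x)
        best = wk if best is None else min(best, wk)
    return best

def pooled_sd_scales(variables, kmax=4, B=30, c=1.0, seed=1):
    rng = random.Random(seed)
    n = len(variables[0])
    # Step 1: bootstrap the reference distribution of log(W_k) once from
    # the uniform distribution on [0, 1] (justified by Proposition 1).
    boot = []
    for _ in range(B):
        u = [rng.random() for _ in range(n)]
        boot.append([math.log(kmeans_obj(u, k, rng))
                     for k in range(1, kmax + 1)])
    m = [sum(b[k] for b in boot) / B for k in range(kmax)]
    s = [math.sqrt(1 + 1 / B) *
         math.sqrt(sum((b[k] - m[k]) ** 2 for b in boot) / B)
         for k in range(kmax)]
    # Step 2: per variable, rescale by the range, compute the gap curve,
    # pick the smallest k satisfying the gap criterion, and return the
    # pooled SD on the original measurement scale, r_j * S_{k*,j}.
    scales = []
    for xj in variables:
        r = max(xj) - min(xj)
        z = [v / r for v in xj]
        w = [kmeans_obj(z, k, rng) for k in range(1, kmax + 1)]
        gap = [m[k] - math.log(w[k]) for k in range(kmax)]
        kstar = next((k + 1 for k in range(kmax - 1)
                      if gap[k] >= gap[k + 1] - c * s[k + 1]), kmax)
        scales.append(r * math.sqrt(w[kstar - 1] / n))
    return scales

rng = random.Random(7)
info = ([rng.gauss(0.0, 0.3) for _ in range(20)]
        + [rng.gauss(10.0, 0.3) for _ in range(20)])  # two clear groups
noise = [rng.uniform(0.0, 5.0) for _ in range(40)]    # no cluster structure
s_info, s_noise = pooled_sd_scales([info, noise])
# the informative variable receives a scale far below its SD, so dividing
# by it preserves (even enhances) that variable's clustering power
```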

Example 1: Fisher’s Iris data

As an illustrative example, we consider Fisher's well-known Iris dataset (Fisher, 1936), collected by Anderson (1935). The dataset contains 50 samples from each of three types of iris: Iris setosa, versicolor and virginica. Each flower is described by four variables measuring the dimensions of its sepal and petal. Table 1 illustrates the effect of scaling this data with the proposed pSD and compares it with the SD and the range. For the first two variables, the pSD is equal to the SD, because these variables do not seem to clearly distinguish any groups in the data. However, variables 3 and 4 each seem to distinguish two clear groups, resulting in substantially lower pSDs. The last row of the table reports the adjusted Rand index (ARI) (Hubert and Arabie, 1985) with respect to the true classification when performing k-means with k = 3 on the dataset after scaling. The ARI takes values between −1 and 1, where an ARI of 1 indicates perfect agreement between partitions, and lower values indicate greater disagreement. Scaling with the pSD clearly gives better results than scaling with the SD or the range, since neither of those takes into account the individual separative power of the variables. As a reference, we mention that k-means clustering without scaling gives an ARI of 0.73.

Table 1.

The effect of variable scaling on Fisher's Iris data

Variable       SD      Range   pSD
Variable 1     0.83    3.6     0.83
Variable 2     0.44    2.4     0.44
Variable 3     1.77    5.9     0.40
Variable 4     0.76    2.4     0.18
k-means ARI    0.62    0.72    0.89

Note: The pSD produces smaller scales for variables 3 and 4 than the classical SD and the range. This results in a higher ARI when clustering the data with the k-means algorithm after scaling.


3 Results

3.1 Simulation study

The most well-known comparative study on the scaling of variables in clustering is arguably the one by Milligan and Cooper (1988), building on Milligan (1985). More recently, Qiu and Joe (2006) used the design of this simulation study as the basis for a new algorithm to generate clusters with a specified degree of separation. The R-package clusterGeneration (Qiu and Joe, 2015) contains an implementation of their algorithm, and it will be the basis of our simulation study.

We compare the following types of scaling in our simulation study:

  1. No scaling

  2. The SD: $\mathrm{SD} = \left[\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{1/2}$

  3. The range: $\mathrm{range} = x_{(n)} - x_{(1)}$

  4. The mad: $\mathrm{mad} = \frac{1}{n-1}\sum_{i=1}^{n} |x_i - \mathrm{med}(x)|$

  5. The pSD: $\mathrm{pSD} = S_{k^*}$

  6. The pmad: $\mathrm{pmad} = M_{k^*}$

In order to get a complete picture of the different scaling methods, we perform an extensive simulation study. Each generated dataset has an even number of clean variables, which we vary between 2 and 10. The clusters are generated from the multivariate standard normal distribution. We consider equally sized clusters of 100 observations each. The degree of separation between the clusters is either separated (0.21) or well-separated (0.34), see Qiu and Joe (2006). We then add a percentage of noise variables to the dataset, varied between 0% and 2000% of the number of clean variables. The noise variables are either multivariate standard normal or uniform over the range of the clean variables. The uniform-noise variables are generated by adding a small Gaussian perturbation to an equally spaced grid over the range of the signal variables, to ensure that they do not have any separative power. We ran the simulation on both clean and contaminated data. For the contaminated data, 5% of the observations of each signal variable are replaced with points sampled randomly from the uniform distribution on $[\bar{X}_j - 4 s(X_j),\ \bar{X}_j + 4 s(X_j)]$, where $\bar{X}_j$ and $s(X_j)$ denote the mean and SD of signal variable $j$. Table 2 summarizes the factors of our simulation study and their levels.
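The uniform-noise construction described above can be sketched as follows. This is an illustrative Python version (the original study uses the clusterGeneration R package); the shuffling step and the perturbation size are our assumptions, labeled as such in the code.

```python
import random

def uniform_noise(signal, sd_frac=0.01, seed=0):
    """Equally spaced grid over the range of a signal variable, plus a
    small Gaussian perturbation, so the result has no separative power.
    (sd_frac and the shuffle are illustrative choices, not from the paper.)"""
    rng = random.Random(seed)
    lo, hi = min(signal), max(signal)
    n = len(signal)
    grid = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    rng.shuffle(grid)  # decouple the noise from the ordering of the signal
    return [g + rng.gauss(0.0, sd_frac * (hi - lo)) for g in grid]

ns = uniform_noise(list(range(11)))   # covers [0, 10] with no clusters
```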

Table 2.

Design factors of the simulation study

Factor                      Levels                                   #
Number of clean variables   2, 4, 6, 8, 10                           5
Number of clusters          2, 3, 4, 5                               4
Degree of separation        separated, well-separated                2
% noise variables added     0, 50, 100, 150, 200, 500, 1000, 2000    8
Type of noise               Gaussian, uniform                        2
% outliers                  0, 5                                     2

Altogether this gives 1280 different settings and for each of them, we generate 100 datasets. Each generated dataset is scaled using the six scale estimators described above. Afterwards, we apply the most popular connectivity-based and centroid-based clustering methods. More specifically, we use hierarchical clustering (Anderberg, 2014; Hartigan, 1975) on the Euclidean distances with single, average, complete and Ward's linkage functions (Ward, 1963), as well as k-means (Lloyd, 1982; MacQueen, 1967) and partitioning around medoids using the Manhattan distance (Kaufman and Rousseeuw, 2009). For k-means, the algorithm of Hartigan and Wong (1979) is used with 100 random starts and a maximum of 100 iterations per start.

We compare the results using the ARI (Hubert and Arabie, 1985), which lies between −1 and 1, where 1 indicates a perfect clustering. We cluster each dataset for a range of target numbers of clusters $k = 1, \ldots, 3T$, where $T$ denotes the true number of clusters. For k-means, this value is a direct input, whereas for hierarchical clustering, we cut the dendrogram at these various levels of $k$. We then pick the optimal value of the ARI over these different clusterings. The reason for this procedure is that we want to evaluate the effect of scaling on clustering without any distortion from the question of how to choose the optimal number of clusters. Furthermore, particularly in the case of hierarchical clustering, the partitions resulting from a larger number of clusters are often more reflective of the real underlying structure than cutting the dendrogram at the true number of clusters, since single outlying observations can distort the dendrogram significantly.
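The ARI used throughout the comparison can be computed with the standard pair-counting formula. The sketch below is illustrative and not tied to any particular clustering output; the toy label vectors are made up.

```python
from collections import Counter
from math import comb

def ari(labels_a, labels_b):
    """Adjusted Rand index via the pair-counting formula."""
    n = len(labels_a)
    # pairs of observations that agree in both partitions
    pairs = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    return (pairs - expected) / (max_index - expected)

a = [0, 0, 0, 1, 1, 1]       # "true" labels
b = [0, 0, 1, 1, 1, 1]       # one observation moved to the other cluster
```

Identical partitions give an ARI of 1, while the single misassignment in `b` pushes the index strictly below 1 but keeps it above the chance level of 0.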

Figure 1 shows the big picture of the simulation results for hierarchical clustering on data without outliers. For each type of linkage, the graphs show the average ARI over all settings, with the increasing number of noise variables on the x-axis. Several interesting observations can be made from these plots. It is clear that as the number of noise variables increases, the clustering gets more difficult, resulting in generally lower ARI values. However, scaling with the SD or the mad is clearly more sensitive to noise variables than the other methods. Scaling with the range does fairly well, which is in line with the findings of Milligan and Cooper (1988). The pooled scale estimators outperform all the other methods, especially when the number of noise variables is large. The difference between the pSD and the pmad appears very small and not significant in these plots. Finally, not scaling performs similarly to scaling with either the SD or the mad. This is due to the particular simulation setup: the performance of not scaling can be completely destroyed by a single noise variable whose variance is much larger than that of the signal variables.

Fig. 1.

Simulation results for hierarchical clustering with single (a), average (b), complete (c) and Ward (d) linkage functions on outlier-free data. The pooled scale estimators are the least sensitive to an increasing number of noise variables

Figure 2 shows the results for k-means clustering and partitioning around medoids. The conclusions are largely the same as for the case of hierarchical clustering. When scaling with the SD or mad, the true clustering structure seems more difficult to retrieve. The range preserves more of the cluster structure and performs better than the SD and mad, which is in line with Steinley (2004). However, the pooled scale estimators again outperform the competitors.

Fig. 2.

Simulation results for k-means (a) and partitioning around medoids (b) on data without outliers. The pooled scale estimators are more resistant to the addition of noise variables to the data

The simulation results presented above only give a rough overview of the performance of the methods and do not show the performance in the presence of outliers. The most interesting insight from a more detailed analysis is that the performance of the methods depends strongly on the type of noise variables added to the signal variables. Scaling with the range works well when the noise variables are Gaussian but fails when the noise is uniform. Conversely, scaling with the SD or the mad works much better for uniform noise than for Gaussian noise. Both observations can be explained by the fact that uniform noise variables have a large variance relative to their range: scaling by the range leaves their impact on the clustering large, whereas dividing by their comparatively large SD compresses them toward the center, which limits their influence.

The effect of outliers is shown in Supplementary Figures S1 and S2. These results expose the sensitivity of the range to the presence of outliers. Other than that, the relative performance of the different scaling methods is roughly the same as in the case of outlier-free data. An interesting note is that the pmad is clearly more robust to outliers than the pSD, which resembles the robustness of the mad versus that of the SD in classical scale estimation.

3.2 Gene expression example

In a seminal paper, Perou et al. (2000) analyzed gene expression patterns of 65 surgical specimens of human breast tumors. The data are publicly available at https://www.omicsdi.org/dataset. They identified 496 intrinsic genes, which had significantly larger variation between different tumors compared with the variation between paired samples from the same tumor. Using hierarchical clustering with Eisen linkage (Eisen et al., 1998), they clustered the tumors into four different types: basal-like, Erb-B2+, normal-breast-like and luminal epithelial/ER+. We illustrate the effect of the pSD in hierarchical clustering with average, complete and Ward linkage on these 496 genes.

Figure 3 shows the resulting dendrograms when applying each of these clustering algorithms to the dataset after scaling it in various ways. For average linkage, the pSD clearly outperforms the other options. It only misclassifies two observations, whereas the other methods split the large group of luminal epithelial/ER+ tumors (in cyan) into two or more clusters and fail to identify the smallest group (in green), which contains the Erb-B2+ tumors. Complete linkage gives better results for all types of scaling. However, without scaling or when using the range, the largest group of tumors gets split into two groups. This does not happen when scaling with the SD, yet quite a few tumors are misclassified in this case. When scaling with the pSD, only two tumors are misclassified using complete linkage. Finally, with Ward's linkage, both the pSD and scaling with the range work well, with two and one misclassified tumors, respectively. Without scaling and with the SD, however, the largest cyan cluster gets split into two smaller clusters. In conclusion, the pSD yields better recovery of the true clusters than the other scaling methods.

Fig. 3.

The effect of variable scaling on the gene expression data. The colors correspond to the tumor type: basal-like in red, Erb-B2+ in green, normal-breast-like in dark blue and luminal epithelial/ER+ in cyan. The pSD generally yields superior recovery of the true clusters. (Color version of this figure is available at Bioinformatics online.)

Table 3.

Genes for which the pSD is smaller than the SD

Variable | Description | SD/pSD
GF200: 96(8C12): 384(2F23) | ESTROGEN REGULATED LIV-1 PROTEIN (LIV-1) MRNA, PARTIAL CDS H29407 45 | 3.5
GF201: 96(88H2): 384(11O4) | GROWTH FACTOR RECEPTOR-BOUND PROTEIN 7 H53703 224 | 2.4
PEROU: 96(7F8): 384(20L16) | REGULATOR OF CHROMATIN, SUBFAMILY E, MEMBER 1 W63613 228 | 2.4
GF200: 96(13D9): 384(4G17) | CYTOCHROME P450, SUBFAMILY IIA (PHENOBARBITAL-INDUCIBLE), POLYPEPTIDE 7 T73031 61 | 2.2
PEROU: 96(8A1): 384(20B1) | 68400 T57034 226 | 2.2
PEROU: 96(6A1): 384(20A2) | 68400 T57034 227 | 2.2
GF200: 96(14D12): 384(4G24) | APOLIPOPROTEIN D H15842 | 2.2
PEROU: 96(9A9): 384(18B18) | IMMUNOGLOBULIN J CHAIN H24896 325 | 2.1

Note: The third column shows the ratios of these two scales.


In addition to improved clustering results, the pooled scaling procedure yields a diagnostic tool in the form of scale ratios. More precisely, we can compare the SDs of the variables with their pooled counterparts. Variables for which the ratio of these two scales is large typically show a clear grouping of the data, whereas variables for which this ratio is close to 1 do not distinguish clear groups. These ratios can thus be used as a fast and intuitive variable-screening procedure to identify potentially very informative variables, which is often a main research goal in this context.
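A hedged sketch of this screening idea in Python: for each variable, approximate its pooled SD by the best two-group split of the sorted values (exact one-dimensional 2-means, an assumption on our part rather than the paper's estimator) and report the ratio SD/pSD. Variables with clear group structure get a large ratio, noise variables a markedly smaller one:

```python
import numpy as np

def pooled_sd_1d(x):
    """Approximate pooled SD of one variable: best two-group split of the
    sorted values (exact 1-d 2-means), then the pooled within-group SD."""
    x = np.sort(np.asarray(x, float))
    n = len(x)
    best = np.inf
    for cut in range(1, n):  # try every split point of the sorted sample
        left, right = x[:cut], x[cut:]
        ssq = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        best = min(best, ssq)
    return np.sqrt(best / (n - 2))  # degrees of freedom: n minus 2 groups

def scale_ratios(X):
    """SD / pooled-SD per column; large ratios flag variables with visible group structure."""
    X = np.asarray(X, float)
    return np.array([X[:, j].std(ddof=1) / pooled_sd_1d(X[:, j])
                     for j in range(X.shape[1])])

# One bimodal (informative) variable next to one pure-noise variable:
rng = np.random.default_rng(2)
informative = np.concatenate([rng.normal(0, 1, 50), rng.normal(8, 1, 50)])
noise = rng.normal(0, 1, 100)
ratios = scale_ratios(np.column_stack([informative, noise]))
print(ratios)  # the informative variable gets a much larger ratio than the noise variable
```

Ranking variables by this ratio and inspecting the top of the list is exactly the kind of fast screening described above.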

In this example, only 8 of the 496 variables had a pSD different from the SD. These eight variables are listed in Table 3. Figure 4 shows the gene expressions for the four genes with the highest scale ratios. The colors again correspond to the four types of tumors. The top left panel shows the gene GF200:96(8C12):384(2F23), which clearly groups the red and blue points together and also shows high values for the majority of the cyan group. The top right and bottom left panels show the genes GF201:96(88H2):384(11O4) and PEROU:96(7F8):384(20L16), which distinctly separate the green tumors from the others. In the bottom right panel, the red tumors seem to have slightly lower values than the others, the blue tumors are grouped very tightly and the cyan tumors appear to contain a sub-cluster with elevated values for this gene.

Fig. 4.

Genes for which the pSD is smaller than the SD. These genes generally cluster the majority of the true groups together while sometimes identifying potentially interesting subclusters

4 Discussion

We introduced a new approach to variable scaling prior to cluster analysis, based on what we call pooled scale estimators. We compared the performance of pooled scaling with that of the most common competitors across several popular clustering techniques, with a particular focus on high-dimensional data with many uninformative noise variables. Scaling with the pooled scale estimators yields superior cluster recovery across the different clustering methods, in particular when the data contain many noise variables. Since this is a common situation in the clustering of gene expression data, we also illustrated the performance on breast cancer gene expressions, where pooled scaling again outperformed the competition. The pooled scale estimates yield an additional diagnostic tool in the form of the ratio between the pooled and default scales, which quantifies the presence of cluster structure in the individual variables.

Acknowledgements

The authors would like to thank the Associate Editor, Prof. Dr Jonathan Wren, and three anonymous reviewers for their useful comments.

Funding

This work was supported by internal funds of the KU Leuven (to J.R.) and by a Discovery Grant from the Natural Sciences and Engineering Research Council (to R.H.Z.).

Conflict of Interest: none declared.

References

Anderberg, M.R. (2014) Cluster Analysis for Applications: Probability and Mathematical Statistics: A Series of Monographs and Textbooks. Vol. 19. Academic Press, Cambridge, Massachusetts.

Anderson, E. (1935) The irises of the Gaspé Peninsula. Bull. Amer. Iris Soc., 59, 2–5.

Arbelaitz, O. et al. (2013) An extensive comparative study of cluster validity indices. Pattern Recognit., 46, 243–256.

Breiman, L. et al. (1984) Classification and Regression Trees. Taylor & Francis, Monterey, CA.

Brun, M. et al. (2007) Model-based evaluation of clustering validation measures. Pattern Recognit., 40, 807–824.

Eisen, M.B. et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863–14868.

Fisher, R. (1936) The use of multiple measurements in taxonomic problems. Ann. Eugen., 7, 179–188.

Halkidi, M. et al. (2001) On clustering validation techniques. J. Intell. Inf. Syst., 17, 107–145.

Hartigan, J.A. (1975) Clustering Algorithms. John Wiley & Sons, Inc., Hoboken, New Jersey.

Hartigan, J.A., Wong, M.A. (1979) Algorithm AS 136: a k-means clustering algorithm. J. R. Stat. Soc. Ser. C Appl. Stat., 28, 100–108.

Hastie, T. et al. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, New York.

Hubert, L., Arabie, P. (1985) Comparing partitions. J. Classif., 2, 193–218.

Jain, A.K. (2010) Data clustering: 50 years beyond k-means. Pattern Recognit. Lett., 31, 651–666.

Jain, A.K., Dubes, R.C. (1988) Algorithms for Clustering Data. Prentice-Hall, Inc., New Jersey.

Kaufman, L., Rousseeuw, P. (2009) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. Wiley, Hoboken, New Jersey.

Lloyd, S. (1982) Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28, 129–137.

MacQueen, J. et al. (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281–297. Oakland, CA, USA.

Maulik, U., Bandyopadhyay, S. (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Anal. Mach. Intell., 24, 1650–1654.

Milligan, G. (1985) An algorithm for generating artificial test clusters. Psychometrika, 50, 123–127.

Milligan, G., Cooper, M. (1988) A study of standardization of variables in cluster analysis. J. Classif., 5, 181–204.

Milligan, G.W., Cooper, M.C. (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.

Perou, C.M. et al. (2000) Molecular portraits of human breast tumours. Nature, 406, 747–752.

Qiu, W., Joe, H. (2006) Generation of random clusters with specified degree of separation. J. Classif., 23, 315–334.

Qiu, W., Joe, H. (2015) clusterGeneration: Random Cluster Generation (with Specified Degree of Separation). R package version 1.3.4. https://cran.r-project.org/web/packages/clusterGeneration/index.html (April 2020, date last accessed).

Schaffer, C.M., Green, P.E. (1996) An empirical comparison of variable standardization methods in cluster analysis. Multivar. Behav. Res., 31, 149–167.

Steinley, D. (2004) Standardizing variables in K-means clustering. In: Banks, D. et al. (eds) Classification, Clustering, and Data Mining Applications. Studies in Classification, Data Analysis, and Knowledge Organisation. Springer, Berlin, Heidelberg, pp. 53–60.

Stoddard, A.M. (1979) Standardization of measures prior to cluster analysis. Biometrics, 35, 765–773.

Sugar, C.A., James, G.M. (2003) Finding the number of clusters in a dataset. J. Am. Stat. Assoc., 98, 750–763.

Tibshirani, R. et al. (2001) Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Series B Stat. Methodol., 63, 411–423.

Vesanto, J. (2001) Importance of individual variables in the k-means algorithm. In: Cheung, D. et al. (eds) Advances in Knowledge Discovery and Data Mining. Springer, Berlin, Heidelberg, pp. 513–518.

Wang, H., Song, M. (2011) Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming. R J., 3, 29–33.

Ward, J.H., Jr (1963) Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc., 58, 236–244.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Jonathan Wren