Automated calibration of consensus weighted distance-based clustering approaches using sharp

Abstract

Motivation: In consensus clustering, a clustering algorithm is used in combination with a subsampling procedure to detect stable clusters. Previous studies on both simulated and real data suggest that consensus clustering outperforms native algorithms.

Results: We extend consensus clustering to allow for attribute weighting in the calculation of pairwise distances using existing regularized approaches. We propose a procedure for the calibration of the number of clusters (and of the regularization parameter) by maximizing the sharp score, a novel stability score calculated directly from consensus clustering outputs, which makes it extremely computationally efficient. Our simulation study shows better clustering performances of (i) approaches calibrated by maximizing the sharp score compared to existing calibration scores, and (ii) weighted compared to unweighted approaches in the presence of features that do not contribute to cluster definition. Application to real gene expression data measured in lung tissue reveals clear clusters corresponding to different lung cancer subtypes.

Availability and implementation: The R package sharp (version ≥ 1.4.3) is available on CRAN at https://CRAN.R-project.org/package=sharp.


Introduction
Clustering aims at partitioning samples (items) into homogeneous groups with similar features (attributes) [1]. Distance-based clustering algorithms, like hierarchical clustering, minimise the pairwise distances within clusters (compactness) and maximise the pairwise distances across clusters (separation). These approaches have been extensively used in medicine, e.g. for stratifying patients based on their molecular profiles [2].
In consensus clustering, a given clustering algorithm is applied on multiple subsamples of the items to generate more stable clusters [3]. Pairwise co-membership proportions, calculated as the proportions of subsamples in which two items are classified in the same cluster, are stored in the consensus matrix, which is then used as a measure of similarity between the items [3].
Consensus clustering has successfully been applied on gene expression data and enabled the identification of stable molecular phenotypes [4,5]. A simulation study also revealed that consensus clustering is among the best performing clustering approaches when the true number of clusters is known [6]. However, the calibration of hyper-parameters hampers its applicability to real-world datasets.
Existing methods for the calibration of the number of clusters aim at maximising the stability of the clustering procedure [3,7]. It has been proposed to measure clustering stability from the estimated distribution of co-membership proportions by the delta (∆) score [3], the Proportion of Ambiguous Clustering (PAC) score [8], or the discrepancy between clustering on the full sample and on perturbed data [2]. More recently, the Relative Cluster Stability Index (RCSI) was introduced as a measure of how much the data can be partitioned, estimated as the difference between the observed score (e.g. PAC) and its expected value under the hypothesis of a single cluster [9]. The calculation of the RCSI requires the time-consuming application of consensus clustering on several datasets that are simulated from a reference distribution with a single cluster.
In addition, current implementations of consensus clustering with continuous attributes rely on distance metrics defined using vector norms (e.g. Euclidean and Manhattan distances) or correlations (e.g. Pearson's or rank correlation), which assume that all attributes contribute equally to the clustering [10]. Including attributes with low proportions of variance explained by the grouping in the distance calculations may dilute the clustering structure, making it more difficult to detect the clusters [10,11].
To overcome this issue, we define here consensus weighted clustering, where a distance-based clustering algorithm (e.g. hierarchical clustering) is applied on a weighted distance matrix calculated using existing regularised methods. Specifically, we investigate the use of (i) sparse clustering, which estimates attribute-specific weights that are used in the calculation of pairwise distances between items and can be shrunk to exactly zero for attribute selection [12], and (ii) the Clustering Objects on Subsets of Attributes (COSA) algorithm, which estimates attribute- and item-specific weights [11].
We then introduce a novel consensus score measuring the stability of the clustering procedure from (weighted) consensus clustering outputs. We propose to calibrate hyper-parameters by maximising this consensus score.
In the Materials and Methods section, we introduce our consensus weighted clustering approach, our calibration procedure and the simulation models implemented. Then, we conduct several simulation studies comparing clustering performances obtained with (i) hierarchical clustering without subsampling, (ii) consensus clustering calibrated as previously proposed [3,8,9], and (iii) consensus clustering calibrated by maximising our consensus score.
We also evaluate both the clustering and ranking performances of consensus weighted clustering. Finally, we apply (consensus weighted) clustering to a publicly available transcriptomics dataset including 3,312 assayed transcripts measured in 17 normal lung tissues and 46 malignant tumours [13].

Materials and Methods

[Figure 1: overview of the consensus (weighted) clustering procedure, from subsampling to the co-membership count matrix C(λ, G), the co-sampling count matrix H, and the calculation of the consensus matrix (Step 5).]

First, K random subsamples of a proportion τ ∈ [0, 1] of the n items are drawn without replacement. We use K = 100 and τ = 0.5 throughout this paper, as in previous studies [9,14].
The number of subsamples in which each pair of items is drawn is stored in the co-sampling count matrix H. Second, a (weighted) distance matrix is calculated for each of the K subsamples with parameter λ. Third, a distance-based clustering algorithm is applied on each of the K distance matrices to detect G clusters. Based on the resulting cluster memberships in each of the K subsamples, co-membership matrices indicating which pairs of items belong to the same cluster are calculated. Fourth, the matrix of co-membership counts C(λ, G) is calculated as the sum over the K co-membership matrices. Fifth, the consensus matrix Γ(λ, G) is calculated and contains, for each pair of items, the proportion of co-membership out of the number of subsamples in which both items were drawn.
Consensus (weighted) clustering aims at the identification of stable clusters by maximising the within-cluster and minimising the between-cluster co-membership proportions obtained over multiple subsamples. In Step 6, a distance-based clustering algorithm (potentially different from the one in Step 3) is used to generate the G stable clusters in Z(λ, G), using the consensus matrix as a measure of similarity.
These steps are described in detail in the following sections.
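For intuition, Steps 1-5 can be sketched in a few lines. The sketch below is a minimal illustration, not the sharp implementation (which is in R): it assumes hierarchical clustering with complete linkage on Euclidean distances as the distance-based algorithm of Step 3.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def consensus_matrix(X, G, K=100, tau=0.5, seed=0):
    """Steps 1-5: subsample, cluster, and aggregate co-memberships."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    C = np.zeros((n, n))  # co-membership counts over the K subsamples
    H = np.zeros((n, n))  # co-sampling counts over the K subsamples
    for _ in range(K):
        # Step 1: draw a subsample of tau * n items without replacement
        idx = rng.choice(n, size=int(tau * n), replace=False)
        H[np.ix_(idx, idx)] += 1
        # Steps 2-3: distance matrix and distance-based clustering into G clusters
        Z = fcluster(linkage(X[idx], method="complete"), t=G, criterion="maxclust")
        # Step 4: accumulate the co-membership indicator for this subsample
        C[np.ix_(idx, idx)] += (Z[:, None] == Z[None, :]).astype(float)
    # Step 5: co-membership proportions among co-sampled pairs
    Gamma = np.divide(C, H, out=np.zeros_like(C), where=H > 0)
    np.fill_diagonal(Gamma, 1.0)
    return Gamma, C, H
```

Step 6 would then apply a distance-based algorithm to 1 − Γ (the consensus matrix used as a similarity) to produce the stable clusters Z(λ, G).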

Weighted distance calculation
We extend the consensus clustering framework by incorporating an algorithm for weighted distance matrix calculation in Step 2 (Figure 1). We investigate the use of two algorithms: sparse clustering [12] and Clustering Objects on Subsets of Attributes [11].

Sparse clustering
The sparse clustering approach introduced in [12] aims at identifying clusters that are supported by a subset of discriminatory attributes. This is achieved by introducing p attribute-specific weights w_m, m ∈ {1, …, p}, in the distance calculations [12]. For given attribute weights w, the sparse clustering distance d^S_ij(w) between items i and j can be expressed as:

d^S_ij(w) = ∑_{m=1}^{p} w_m d_ijm,

where d_ijm is the pairwise distance between items i and j along attribute m.
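The weighted distance above can be computed directly. The sketch below is a hypothetical helper (not part of sparcl), using absolute differences as the attribute-wise distances d_ijm:

```python
import numpy as np


def weighted_distance_matrix(X, w):
    """Weighted pairwise distances: d_ij(w) = sum_m w_m * |x_im - x_jm|."""
    # attribute-wise distances d_ijm, shape (n, n, p)
    d = np.abs(X[:, None, :] - X[None, :, :])
    # weighted sum over attributes; zero weights remove attributes entirely
    return d @ w
```

Setting some w_m to exactly zero, as sparse clustering does, removes the corresponding attributes from the distance calculation altogether.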
Attribute weights are estimated by solving a regularised version of the clustering objective function. In the present paper, we use the weighted distance matrix corresponding to sparse hierarchical clustering [12], obtained by solving:

maximise over w and U: ∑_{m=1}^{p} w_m ∑_{i,j} d_ijm U_ij, subject to ‖U‖_2 ≤ 1, ‖w‖_2 ≤ 1, ‖w‖_1 ≤ λ, and w_m ≥ 0,

where n is the number of items to cluster and U is the overall (n × n) distance matrix.
The constraint on the ℓ1-norm of w by a regularisation parameter λ > 1 induces sparsity, i.e. results in some weights w_m being shrunk to exactly zero. The use of the ℓ2-norm of w ensures that the clustering is not driven by a single feature. The calibration of the regularisation parameter λ, conditionally on the number of clusters, can be done using an adapted gap statistic measuring the difference between the observed and expected objectives assuming a single cluster.
Clustering Objects on Subsets of Attributes

It has been shown that the estimation of a single weight per attribute may result in clusters that are equally spaced along the set of selected features [10,11]. Clusters that are supported by cluster-specific attributes may therefore be missed. An alternative approach introduces attribute- and item-specific weights in Clustering Objects on Subsets of Attributes (COSA) [11]. In COSA, the weight matrix W of size (n × p) is estimated by minimising the sum of weighted distances between each item and its nearest neighbours under a constraint on the weights [11]. Entries of the COSA weighted distance matrix d^C(W) combine the attribute-wise distances d_ijm using the weights of both items i and j. The (n × p) weights in W are estimated by minimising the sum, over all items i, of the weighted distances between item i and its nearest neighbours KNN(i), under a constraint on W_i·, the i-th row of matrix W. This optimisation problem is solved approximately using an iterative algorithm. The amount of regularisation is controlled by the parameter λ but does not result in attribute selection.

Consensus clustering framework
In Step 3 (Figure 1), a distance-based clustering algorithm is applied on each of the K (weighted) distance matrices. The co-membership status C^k_ij(λ, G) of items i and j in subsample k is defined as:

C^k_ij(λ, G) = 1 if i ≠ j are both in subsample k and are in the same cluster obtained with parameters λ and G, and 0 otherwise.

The matrix C(λ, G) of co-membership counts over the K iterations is then calculated in Step 4 (Figure 1) as:

C_ij(λ, G) = ∑_{k=1}^{K} C^k_ij(λ, G).

To account for the possibility that items i and j may not both be included in a given subsample, we define the matrix H of co-sampling counts, where H_ij is the number of subsamples that include both items i and j. By construction, the element H_ii corresponds to the number of subsamples that include item i. We consider throughout this paper that the same subsamples are used to calculate the co-membership counts for all pairs of parameters (λ, G), so that the matrix H does not depend on λ or G.
In Step 5 (Figure 1), entries of the consensus matrix Γ(λ, G) are defined as co-membership proportions calculated over the subsamples in which both i and j were included:

Γ_ij(λ, G) = C_ij(λ, G) / H_ij.

As previously proposed [3,8,9], the G stable clusters in Z(λ, G) are obtained by applying a distance-based clustering algorithm using the co-membership proportions as a measure of pairwise similarity (Step 6 in Figure 1). The use of co-membership proportions instead of the Euclidean distance in the construction of stable clusters may result in the re-assignment of some items (Supplementary Figure 1).
Note that consensus unweighted clustering is recovered by setting λ = 0.

Existing scores
To select the number of clusters G, it has been proposed in [3,7] to compare consensus matrices Γ(G) obtained with different values of G and select the one yielding the most stable clustering.
Theoretically, the most stable clustering would invariably result in the same partition at all subsampling iterations, yielding a binary consensus matrix. Based on this, the proportion of item pairs with co-membership proportions above a certain threshold x was proposed as a measure of stability [3,15]:

CDF_G(x) = (1/N) ∑_{i<j} 1{Γ_ij(G) > x}.

Using that metric, Monti et al. (2003) propose to determine the number of clusters G as the value generating the largest standardised score ∆_G, where a_G, the integral of CDF_G(x) over [0, 1], increases as co-membership proportions get larger.
Alternatively, the Proportion of Ambiguous Clustering (PAC) score measures the proportion of intermediate co-membership proportions (i.e. those lying between a lower bound x_1 and an upper bound x_2) and is defined as [8]:

PAC(x_1, x_2) = (1/N) ∑_{i<j} 1{x_1 < Γ_ij(G) < x_2}.

The number of clusters G can then be defined as the one yielding the smallest PAC score (i.e. with the smallest proportion of intermediate co-membership proportions). The calculation of the PAC score requires the arbitrary choice of the two parameters x_1 and x_2. Default values of x_1 = 0.1 and x_2 = 0.9 were recommended [8].
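In code, the PAC score reduces to a one-liner over the upper triangle of the consensus matrix (a sketch assuming a symmetric consensus matrix Γ with unit diagonal):

```python
import numpy as np


def pac_score(Gamma, x1=0.1, x2=0.9):
    """Proportion of item pairs with intermediate co-membership proportions."""
    pairs = Gamma[np.triu_indices_from(Gamma, k=1)]  # the N = n(n-1)/2 pairs
    return float(np.mean((pairs > x1) & (pairs < x2)))
```

A perfectly stable (binary) consensus matrix yields PAC = 0; the number of clusters is chosen to minimise this score.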
A score measuring the discrepancy between clustering on the full sample and on perturbed datasets has been proposed to choose the number of clusters in the Perturbation clustering for data INtegration and disease Subtyping (PINS) algorithm [2,16]. The discrepancy score measures clustering stability as the Area Under the Curve (AUC) of the Cumulative Distribution Function (CDF) of the difference between the co-membership matrix obtained by applying the clustering algorithm on the full sample and the consensus matrix.
More recently, a Monte Carlo reference-based technique was proposed and implemented in the R package M3C [9]. The approach is based on the simulation of multiple (package default is 25) datasets from a reference distribution with a single cluster while keeping a similar attribute correlation structure [17]. The number of clusters is then calibrated by maximising the Relative Cluster Stability Index (RCSI), computed from the PAC score or entropy [9,18], which measures the difference between the observed and simulated scores.

New consensus score
In this section, we introduce a novel consensus score S_c measuring clustering stability and used for the joint calibration of hyper-parameter(s). The consensus score S_c is calculated using the matrix C(λ, G) of co-membership counts, the matrix H of co-sampling counts and the consensus clusters Z(λ, G), which are all outputs of consensus clustering (Figure 1). The co-membership count matrix C(λ̂, Ĝ) and consensus clusters Z(λ̂, Ĝ) obtained with the calibrated number of clusters Ĝ and penalty parameter λ̂ (if weighted) maximise the score S_c.
To calculate the consensus score for a given pair of hyper-parameters (λ, G), the N = n × (n − 1)/2 item pairs are first classified as (i) within elements, if the two items belong to the same consensus cluster as defined in Z(λ, G), or (ii) between elements, if the two items belong to different consensus clusters. This is illustrated in Figure 2, where the blue and orange entries of the co-membership count matrix C(λ, G) and co-sampling count matrix H correspond to the within and between elements, respectively.
We introduce the integers X_w(λ, G), X_b(λ, G), N_w(λ, G) and N_b(λ, G), computed as follows (Figure 2):

X_w(λ, G) = ∑_{i<j: Z_i = Z_j} C_ij(λ, G),  N_w(λ, G) = ∑_{i<j: Z_i = Z_j} H_ij,
X_b(λ, G) = ∑_{i<j: Z_i ≠ Z_j} C_ij(λ, G),  N_b(λ, G) = ∑_{i<j: Z_i ≠ Z_j} H_ij.

The quantity X_w(λ, G) is the total number of co-members obtained over the K subsampling iterations among the within pairs of items (in blue in Figure 2). The number N_w(λ, G) is the total number of within pairs of items that are sampled together over the K subsampling iterations. Similarly, X_b(λ, G) and N_b(λ, G) are the total numbers of co-members and of co-sampled pairs, respectively, among the between pairs (in orange in Figure 2).
Note that the co-membership counts C_ij(λ, G) follow independent binomial distributions, conditionally on the co-sampling counts H and consensus clusters Z(λ, G):

C_ij(λ, G) | H_ij, Z(λ, G) ∼ Binomial(H_ij, p_ij(λ, G)).

Our consensus score evaluates whether the probabilities p_ij(λ, G) are larger for pairs of items in the same consensus cluster (Z_i = Z_j) than for pairs of items in different consensus clusters (Z_i ≠ Z_j). To devise this score, we assume for simplicity that p_ij(λ, G) = p_w(λ, G) for all within pairs and p_ij(λ, G) = p_b(λ, G) for all between pairs. As a consequence, the quantities X_w(λ, G) and X_b(λ, G) also follow binomial distributions, conditionally on the co-sampling counts in H:

X_w(λ, G) | H ∼ Binomial(N_w(λ, G), p_w(λ, G)),  X_b(λ, G) | H ∼ Binomial(N_b(λ, G), p_b(λ, G)).

We consider that a stable clustering is characterised by a probability p_w(λ, G) that is larger than p_b(λ, G). To measure clustering stability, we then compare the probabilities p_w(λ, G) and p_b(λ, G) using a two-sample z test where the null hypothesis is p_w(λ, G) ≤ p_b(λ, G). The consensus score S_c is defined as the z statistic, calculated as:

S_c(λ, G) = (p̂_w(λ, G) − p̂_b(λ, G)) / √( p̂_0(λ, G) (1 − p̂_0(λ, G)) (1/N_w(λ, G) + 1/N_b(λ, G)) ),

where p̂_w(λ, G) = X_w(λ, G)/N_w(λ, G), p̂_b(λ, G) = X_b(λ, G)/N_b(λ, G), and p̂_0(λ, G) = (X_w(λ, G) + X_b(λ, G))/(N_w(λ, G) + N_b(λ, G)).

For large enough X_w(λ, G) and X_b(λ, G), the z statistic approximately follows a standard Normal distribution under the null [19,20]. As such, the consensus score S_c(λ, G) is comparable across different numbers of clusters G and penalty parameters λ. The consensus score S_c(λ, G) increases with clustering stability.
To illustrate this score, we now consider the extreme situation where the clustering is the most stable. The most stable clustering would result in a binary consensus matrix, or equivalently in a matrix C(λ, G) where (i) the within elements are the same as in H, and (ii) the between elements are all zero. In this extreme situation, the z statistic is equal to √(N_w(λ, G) + N_b(λ, G)), the square root of the total co-sampling count over all item pairs. This is the maximum value that the z statistic can take, which results in the maximum value of the stability score (see Appendix for proof). In addition, since the total co-sampling count is the sum of the entries of H over all item pairs, the z statistic does not depend on λ or G in this situation. This is a desirable property, as it implies that the most stable clustering (associated with a binary consensus matrix) would result in the same consensus score regardless of the numbers and sizes of the clusters.
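The consensus score can be computed in a few lines from the consensus clustering outputs C, H and Z. The sketch below illustrates the z statistic defined above; it is not the sharp implementation:

```python
import numpy as np


def consensus_score(C, H, Z):
    """Two-sample z statistic comparing within- vs between-cluster co-membership."""
    iu = np.triu_indices_from(H, k=1)        # item pairs i < j
    within = (Z[:, None] == Z[None, :])[iu]  # same consensus cluster?
    Xw, Nw = C[iu][within].sum(), H[iu][within].sum()
    Xb, Nb = C[iu][~within].sum(), H[iu][~within].sum()
    pw, pb = Xw / Nw, Xb / Nb
    p0 = (Xw + Xb) / (Nw + Nb)               # pooled proportion
    return (pw - pb) / np.sqrt(p0 * (1 - p0) * (1 / Nw + 1 / Nb))
```

For a perfectly stable clustering (X_w = N_w and X_b = 0), the score reduces to √(N_w + N_b), its maximum value.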

Grid search
The number of clusters G (and, for weighted clustering, the regularisation parameter λ) is calibrated by maximising the consensus score S_c using a grid search, where the consensus matrix and the stability score are computed for each candidate value of G (and λ). The calibrated (set of) parameter(s) is the one maximising the consensus score.
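The calibration step itself is a plain grid search. In the generic sketch below, the hypothetical `score_fn` would wrap one consensus clustering run and the consensus score computation for a given (λ, G) pair:

```python
import numpy as np


def grid_search(score_fn, lambdas, Gs):
    """Return the (lambda, G) pair maximising the consensus score."""
    best, best_score = None, -np.inf
    for lam in lambdas:
        for G in Gs:
            s = score_fn(lam, G)  # consensus score S_c(lambda, G)
            if s > best_score:
                best, best_score = (lam, G), s
    return best, best_score
```

Because S_c is computed directly from the stored matrices C and H, evaluating the whole grid adds essentially no cost once the consensus matrices are estimated.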

Simulation models

Gaussian mixture model
We simulate data X including n items and p attributes from a Gaussian mixture model, where, ∀i ∈ {1, …, n} [21]:

Z_i ∼ Multinomial(1, κ),  X_i | Z_i ∼ N(µ_{Z_i}, Σ),

where κ is a vector of length G of probabilities that an item belongs to a given cluster, µ_{Z_i} is the mean vector of length p for item i belonging to cluster Z_i, and Σ is the covariance matrix of size (p × p).
Items belonging to different clusters are generated using the same covariance matrix Σ but different mean vectors µ_{Z_i}. The number and sizes of the simulated clusters are controlled via the number of entries in the vector κ. We directly use the vector of true cluster memberships Z as a simulation parameter in our simulations.
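A sketch of this simulation model, sampling Z directly (as in the paper) rather than from κ:

```python
import numpy as np


def simulate_mixture(Z, mu, Sigma, seed=0):
    """X_i | Z_i = g ~ N(mu_g, Sigma), with a shared covariance matrix Sigma."""
    rng = np.random.default_rng(seed)
    return np.vstack([rng.multivariate_normal(mu[g], Sigma) for g in Z])
```

Here `mu` is the (G × p) matrix of cluster means and `Sigma` the shared (p × p) covariance matrix.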

Simulation of cluster means µ_g
To control the level of cluster separation and compactness by attribute, we first sample intermediate cluster- and attribute-specific means η_gj for each cluster g ∈ {1, …, G} and each attribute j ∈ {1, …, p} from a Gaussian distribution with mean zero and unit variance. These (G × p) intermediate cluster- and attribute-specific means are stored in a matrix. The level of separation between the clusters along attribute j is controlled by the expected proportion of explained variance E_j ∈ [0, 1]. To ensure that the desired proportion of explained variance is reached, we generate the mean M_ij for item i along attribute j by rescaling the intermediate mean η_{Z_i j} as a function of E_j, the desired proportion of variance along attribute j explained by the simulated clustering.
The i -th row of the simulated matrix M is the mean vector µ Z i and can directly be used in the simulation model presented in Equation 7.
The random sampling of cluster means introduces some variability in the level of separation between pairs of clusters along a given attribute (Supplementary Figure 2). By aggregating information over multiple attributes, the distances calculated for the simulated data are overall lower for pairs of items belonging to the same cluster than for pairs of items belonging to different clusters. The level of separation and compactness of the clusters is controlled by the number of attributes p and the proportions of explained variance by attribute E_j, j ∈ {1, …, p} (Supplementary Figure 3).
By chance, we may obtain some clusters that are very well separated from all others, and sets of clusters that are not as well separated (Supplementary Figure 4) [22]. For example, in Supplementary Figure 4, cluster 4 (in red) is very compact and well separated from the other clusters, as indicated by the large silhouette widths of all cluster members. On the other hand, clusters 3 and 5 (yellow and green) include items with silhouette widths below zero, indicating poor cluster separation. The proposed simulation procedure controls the overall cluster separation through the proportion of variance explained by the grouping structure for each attribute.

Simulation of covariance matrix Σ
We first simulate a correlation matrix Σ. We consider two simulation scenarios with (i) independent attributes, or (ii) groups of correlated attributes. In the first scenario, the identity matrix is used as the correlation matrix. In the second scenario, the correlation matrix Σ is simulated as previously proposed in the context of graphical modelling [14,23]. Briefly, we (i) simulate the adjacency matrix of a graph with connected components of random subgraphs, (ii) simulate a corresponding precision matrix, (iii) invert it to obtain a covariance matrix, and (iv) compute the correlation matrix Σ from the covariance matrix.
To ensure that the proportion of variance explained by the grouping for variable X_j is equal to E_j, the covariance matrix Σ used in Equation 7 is obtained by rescaling the simulated correlation matrix accordingly. This simulation model has been implemented in the R package fake (version ≥ 1.4.0), available on CRAN [23].
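Steps (i)-(iv) of the correlated-attributes scenario can be sketched as follows. This is a simplified illustration (the fake package performs the full random simulation): a deterministic block structure stands in for the random subgraphs, and the hypothetical parameter `off` controls the off-diagonal precision entries.

```python
import numpy as np


def block_correlation(p, n_blocks, off=0.3):
    """Block precision matrix -> covariance (by inversion) -> correlation."""
    # (i)-(ii): precision matrix with connected components; scaling the
    # off-diagonal entries by the block size keeps it diagonally dominant,
    # hence positive definite
    Omega = np.eye(p)
    for block in np.array_split(np.arange(p), n_blocks):
        for i in block:
            for j in block:
                if i != j:
                    Omega[i, j] = off / len(block)
    # (iii): invert the precision matrix to obtain a covariance matrix
    Sigma = np.linalg.inv(Omega)
    # (iv): rescale the covariance matrix to a correlation matrix
    d = np.sqrt(np.diag(Sigma))
    return Sigma / np.outer(d, d)
```

Because the precision matrix is block-diagonal, attributes in different blocks are exactly uncorrelated, while attributes within a block are correlated.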

Performance metrics
Clustering performance is measured by the Adjusted Rand Index (ARI), calculated by comparing the true and estimated co-memberships over all item pairs [24,25], where TP is the number of True Positives (i.e. true co-members that are in the same reconstructed clusters), TN is the number of True Negatives (i.e. pairs of items that are correctly put in different clusters), FP is the number of False Positives (i.e. pairs of items that are incorrectly put in the same reconstructed cluster), and FN is the number of False Negatives (i.e. true co-members that are in different reconstructed clusters).
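The pair counts and the ARI can be computed as follows. The sketch uses a standard pair-counting identity for the ARI in terms of TP, TN, FP and FN (the same formulation used by scikit-learn's `adjusted_rand_score`):

```python
import numpy as np


def pair_counts(z_true, z_hat):
    """TP, TN, FP, FN over the N = n(n-1)/2 item pairs."""
    iu = np.triu_indices(len(z_true), k=1)
    same_t = np.equal.outer(z_true, z_true)[iu]  # true co-members
    same_h = np.equal.outer(z_hat, z_hat)[iu]    # estimated co-members
    tp = int(np.sum(same_t & same_h))
    tn = int(np.sum(~same_t & ~same_h))
    fp = int(np.sum(~same_t & same_h))
    fn = int(np.sum(same_t & ~same_h))
    return tp, tn, fp, fn


def ari(z_true, z_hat):
    """Adjusted Rand Index via the pair-counting identity."""
    tp, tn, fp, fn = pair_counts(np.asarray(z_true), np.asarray(z_hat))
    return 2.0 * (tp * tn - fn * fp) / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))
```

The ARI equals 1 for a perfect (label-permutation-invariant) recovery and is close to 0 for a random partition.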
Feature selection performance is measured by the F_1-score [26]:

F_1 = 2PR / (P + R),

where P is the precision and R is the recall, calculated by comparing the set of attributes contributing to the clustering in the simulation (i.e. such that E_j ≠ 0) with the set of attributes with the highest median weights estimated for the weighted distances.
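This metric reduces to simple set operations over attribute indices (a sketch with a hypothetical `f1_score` helper, not from the paper's code):

```python
def f1_score(selected, contributing):
    """F1 = 2PR / (P + R) over attribute index sets."""
    selected, contributing = set(selected), set(contributing)
    tp = len(selected & contributing)  # correctly selected attributes
    if tp == 0:
        return 0.0
    precision = tp / len(selected)     # P
    recall = tp / len(contributing)    # R
    return 2 * precision * recall / (precision + recall)
```

When the number of selected attributes is fixed to the true number of contributing attributes, as done for COSA and sparcl here, precision and recall coincide.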

Outline
In this section, we evaluate the proposed approaches on simulated data. For weighted clustering, a total of p = 100 independent attributes are simulated, of which q* = 20 have nonzero proportions of explained variance (E = 0.6). Sensitivity analyses use different numbers of contributing attributes q*, different proportions of attribute variance explained by the clustering, and/or groups of correlated attributes.

Comparison of calibration scores
We represent the clustering performance, as measured by the Adjusted Rand Index (ARI), obtained with the existing and novel calibration scores for different numbers of clusters (Supplementary Figure 5). The simulated number of clusters (G* = 5) is recovered using all scores except for the ∆ score (Ĝ_∆ = 2). We generally observe increasing ARI with the RCSI and consensus scores, suggesting that these metrics are relevant to choose the model with the best clustering performance (Supplementary Figure 5E-G).
Evaluating the calibration performances of these approaches for different values of separation and compactness, we found that the PAC score is only able to detect the correct number of simulated clusters for well separated clusters (Supplementary Figure 6A-B). The correct number of clusters is always missed by the ∆ score in these examples. The RCSI and consensus scores are able to detect the correct number of clusters even under limited levels of separation (Supplementary Figure 6C). When the clustering structure is not detected, the RCSI scores suggest small numbers of clusters (Ĝ_RCSI = 2) while the consensus score chooses a large number (Ĝ_Sc = 18) (Supplementary Figure 6E).

Clustering performance
We evaluate the ability of hierarchical and consensus clustering with different calibration strategies to recover the grouping structure in simulated data with different levels of cluster separation (Figure 3).
As a reference, we report the performances of both clustering models with the number of clusters used for the simulation, G* = 5, and compare them with results based on the calibrated numbers of clusters. With the true number of clusters G*, consensus clustering outperforms hierarchical clustering in the three settings, with a larger increase in ARI for weaker levels of cluster separation (Figure 3). This can be explained by the re-assignment of items (mostly re-attributed to their correct cluster) when using the consensus matrix as a measure of similarity (Supplementary Figure 1).
For hierarchical clustering, calibration maximising the GAP statistic [17] generates better performances than with the silhouette coefficient [22,27] in all three scenarios (Figure 3).
For consensus clustering, models calibrated by maximising the ∆ score perform poorly in the three settings (Figure 3). We observe good performances of calibration by the PAC score for well separated clusters only (E = 0.6), but the inter-quartile range of the ARI is larger than for other approaches (Supplementary Table 1). Models calibrated using the RCSI and consensus scores generate the best clustering performances in these scenarios and are able to recover the true number of clusters G* for most simulated datasets (Supplementary Table 1). For weaker cluster separation (E = 0.4), calibration by maximising the consensus score outperforms all other approaches, including the Monte Carlo based procedures (increase in median ARI of 0.03, Supplementary Table 1). This suggests that our score allows for the detection of more subtle clustering structures. Increasing the number of Monte Carlo sampling iterations from 25 to 100 does not affect the clustering performances of models calibrated using the RCSI scores (Supplementary Table 2).
The estimation of consensus matrices for 2 to 20 clusters on these simulated datasets took less than 5 seconds using 1 CPU and 1 GB of memory (Supplementary Table 1). The time needed to compute the silhouette, ∆, PAC or consensus scores is less than a second. The GAP statistics were calculated in 1-2 seconds. The median time to calculate the RCSI scores was above 2 minutes for N = 25 Monte Carlo sampling iterations (Supplementary Table 1).
Overall, these results suggest that the proposed score performs at least as well as established approaches and generates performances that are very close to those of consensus clustering using the true number of clusters G* in all scenarios (difference in median ARI lower than 0.02), for no increase in computation time once the consensus matrices are estimated (Supplementary Table 1). Conclusions are similar when applying these approaches using partitioning around medoids (Supplementary Figure 7) and for simulated data with different numbers of items (Supplementary Table 3) or clusters (Supplementary Figure 8).

Performance of consensus weighted clustering
We simulate data with G* = 5 clusters that can be observed along q* = 20 of the p = 100 attributes (i.e. with nonzero proportions of explained variance by the grouping). We use the implementations in the R packages sparcl for sparse hierarchical clustering and rCOSA for Clustering Objects on Subsets of Attributes.
Using the true number of clusters G* = 5, we observe an increase in clustering performance when introducing weighting in the algorithm, with a median ARI of 0.78 for (unweighted) consensus clustering compared to 0.95 and 0.94 for the best sparcl and COSA models, respectively (Figure 4A, Supplementary Table 4).
Joint calibration of the number of clusters G and regularisation parameter λ using our consensus score for consensus COSA clustering yields the highest median ARI at 0.94 (Figure 4A, Supplementary Table 4). The COSA algorithm does not inherently perform feature selection, but we evaluate here its ability to give larger weights to the contributing features using the F_1-score, measuring the selection performance when considering the q* = 20 features with the largest median weights as selected. The median F_1-score is very close to 1, suggesting that the q* = 20 features with the largest median weights are almost always the ones used in the simulation model (Figure 4B).
When using the sparse hierarchical clustering algorithm implemented in sparcl, the best clustering performances are as good as those obtained with COSA with the true number of clusters G* (highest median ARI of 0.95). However, these correspond to models that are not sparse, with q = 94 selected attributes on average (Figure 4A). For sparcl, the F_1-score is calculated by considering the q* = 20 features with the largest selection proportions as selected. The median F_1-score remains below 0.75 for all values of λ, suggesting a poorer ability of the sparcl model, compared to COSA, to give larger weights to the contributing features (Figure 4B, Supplementary Figure 9). Calibration of consensus sparcl clustering using our consensus score yields poor clustering performances, with a median ARI of 0.37 and a median calibrated number of clusters Ĝ = 3 (Supplementary Table 4).
The poor weighting and clustering performances when using sparcl in the proposed approach are likely due to the underlying assumption in sparse clustering that the same set of features contributes equally to the definition of all clusters. As this is, in general, not the case with our simulated data (Supplementary Figures 1, 3), sparcl does not seem to be able to detect the relevant attributes. Supplementary Figure 9 suggests that only a subset of the clusters, supported by a subset of the contributing attributes (in red), are stable in consensus sparcl clustering. These results are in line with previously reported limitations of the sparcl algorithm [10].
By estimating weights that are specific to the item and attribute, the COSA algorithm generates a distance matrix from which clusters that are supported by different sets of attributes can be detected (Supplementary Figure 9). Based on these results, we recommend the use of the COSA algorithm in the proposed consensus weighted clustering calibrated by maximising the consensus score.
Outputs generated by consensus COSA clustering include the estimated cluster membership and the distribution of median feature weights estimated from the K COSA models with calibrated regularisation parameter λ̂ (Supplementary Figure 9). We observe increasing median weights with the proportion of explained variance by feature, which suggests that the median weights appropriately capture attribute contribution (Supplementary Figure 10). However, the overlapping quartiles of median weights for features with proportions of explained variance below 0.4 indicate that features with weaker contributions may be difficult to disentangle from non-contributing features (Supplementary Figure 10). Introducing correlation between features does not seem to hamper the weighting performances (Supplementary Figure 11).

Data overview
We use publicly available data with transcriptomics measurements (p = 3,312 attributes) in lung cells from n = 63 participants [13]. The samples consist of 17 normal lung specimens and 46 lung tumours, including histologically defined squamous-cell lung carcinomas (N = 20), pulmonary carcinoids (N = 20) and small-cell lung carcinomas (N = 6). We perform clustering to detect groups of individuals based on their molecular profiles.

Clustering results
First, we apply hierarchical clustering with complete linkage on the Euclidean distances between the n = 63 samples (Figure 5A). As in the original publication, we observe overall lower distances among samples from healthy lungs or from tumours of the same histological subtype than between these groups. The G = 4 clusters obtained from hierarchical clustering include (i) a mixture of N = 20 pulmonary carcinoids, N = 6 small-cell carcinomas and N = 1 squamous-cell carcinoma in Cluster 1, (ii) all normal lung samples in Cluster 2, and (iii) the squamous-cell carcinomas split over Clusters 3 and 4.
For consensus unweighted clustering, calibration maximising our score indicates that the most stable clustering is obtained for Ĝ = 5 (Figure 5B). The five stable clusters correspond to (i) N = 19 squamous-cell carcinomas split over Clusters 1 and 2, (ii) all small-cell carcinomas and the remaining squamous-cell carcinoma in Cluster 3, (iii) all carcinoids in Cluster 4, and (iv) all normal lung tissues in Cluster 5 (Figure 5C).
For consensus weighted clustering, the calibrated number of clusters is Ĝ = 4 and the calibrated COSA regularisation parameter is λ = 0.28 (Figure 5D). The use of weighted distances induces an increase in the consensus score (highest score of 304, compared to 196 in consensus unweighted clustering), which is reflected in the consensus matrix (Figure 5E). The Ĝ = 4 stable clusters in the calibrated consensus weighted clustering correspond to the normal lung samples and the three subtypes of lung cancer, except for one squamous-cell sample that is grouped with the small-cell carcinomas in Cluster 2.

Discussion
As previously reported [6], we observe better clustering performances with consensus clustering compared to a single run of the underlying (e.g. hierarchical) clustering on both simulated and real data. The use of weighted distances further increases clustering performances in the presence of irrelevant attributes.
Our simulation study shows that consensus clustering calibrated by maximising our consensus score performs at least as well as models calibrated using existing scores. Calibration techniques based on cumulative distribution functions (the PAC and PINS discrepancy scores) only perform well when the clustering structure is extremely strong. Calibration procedures using the consensus and RCSI scores yield similar clustering performances for well-separated clusters, but our consensus score allows for the detection of more subtle clustering structures.
Furthermore, our consensus score is far less computationally expensive than the RCSI scores, which require the simulation and clustering of multiple datasets.
The assumption that co-membership probabilities are the same for all within-cluster pairs, and the same for all between-cluster pairs, constitutes a potential limitation of our consensus score.
This assumption implies that the quantities X_w(λ, G) and X_b(λ, G) follow binomial distributions, which allows for the use of the z-test and hence for extremely fast computations. As an alternative, the exact distributions could be recovered using simulations. However, this alternative may quickly become time-consuming due to the resolution required to compare the scores.
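To make the binomial approximation concrete, the sketch below computes a generic pooled two-proportion z statistic from the four counts; this illustrates the type of test described here, not necessarily the exact score implemented in sharp:

```python
# Illustrative two-proportion z statistic under the binomial assumption:
# X_w ~ Bin(N_w, p_w) and X_b ~ Bin(N_b, p_b), testing p_w > p_b.
# This generic pooled form is a stand-in, not the exact sharp score.
import math

def z_statistic(x_w, n_w, x_b, n_b):
    """z statistic comparing within (x_w/n_w) and between (x_b/n_b)
    co-membership proportions, using a pooled variance estimate."""
    p_w, p_b = x_w / n_w, x_b / n_b
    p = (x_w + x_b) / (n_w + n_b)        # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n_w + 1 / n_b))
    return (p_w - p_b) / se

# A sharper consensus (high within, low between proportions) gives a larger z.
assert z_statistic(95, 100, 5, 200) > z_statistic(60, 100, 40, 200)
```
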
Despite its underlying assumptions and approximations, the simulation studies indicate very good performance of our consensus score.
Using either the sparcl or COSA algorithm in consensus weighted clustering increases clustering performance compared to unweighted approaches, for well-chosen regularisation parameters. Joint calibration of the number of clusters and regularisation parameter in consensus COSA clustering using our consensus score generates some of the best performances. However, the use of sparcl in consensus weighted clustering calibrated using our consensus score is not recommended, as it only detects a subset of the clusters, which leads to overall poor clustering performance. Results from our simulation studies are supported by our real data application, where lung sample types are better recovered with consensus weighted clustering calibrated using the consensus score.
Due to the subsampling procedure, the increase in performance with consensus clustering comes at the price of a higher computational burden. Our implementation in the R package sharp allows for parallelisation over the subsampling iterations to reduce computation times.
The current implementation of consensus clustering does not scale to large numbers of items (e.g. >50,000), which increases both the computation time and memory usage. Extensions potentially using a pre-clustering step are required [1,28].

Supplementary materials

Maximum of the consensus score
Recall that the integers X_w(λ, G) and X_b(λ, G) are the total numbers of co-members in the within and between pairs, respectively. The integers N_w(λ, G) and N_b(λ, G) are the total numbers of times each of the within and between pairs, respectively, are drawn together in the subsamples. For clarity, the λ and G indexing is omitted here. The consensus score S_c can be expressed as a function of X_w, X_b, N_w and N_b. In this section, we want to find the maximum value of the consensus score S_c. For this, we introduce the real-valued function f of X_w and X_b, which is equal to the consensus score when X_w, X_b, N_w and N_b are integers, with X_w ∈ [0, N_w] and X_b ∈ [0, N_b]. We consider that X_w and X_b belong to these intervals in the remainder of this proof.
The partial derivatives of f with respect to X_w and X_b can be derived explicitly. Equations 13 and 14 determine the signs of these derivatives for any values of X_w and X_b over the intervals where f is defined.
Combining Equations 10, 11 and 15, we can show that f is monotonically non-decreasing in X_w. Combining Equations 10, 12 and 16, we can show that f is monotonically non-increasing in X_b. Hence, the function f is maximised at X_w = N_w and X_b = 0, which are the largest and smallest possible values for X_w and X_b, respectively. The corresponding maximum is f(N_w, 0). The consensus score can be obtained by applying the function f on integers defined over the same intervals (see Equation 8). As a consequence, the consensus score is also maximised at X_w = N_w and X_b = 0, which corresponds to a binary consensus matrix.
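As a numerical sanity check of this result, the sketch below assumes the score takes the form of a pooled two-proportion z statistic (a stand-in form for illustration, not necessarily the exact score in sharp) and brute-forces all integer pairs (X_w, X_b) to confirm that the maximum is reached at X_w = N_w and X_b = 0:

```python
# Numerical illustration (assumed score form): treat the consensus score as a
# pooled two-proportion z statistic and brute-force all integer (X_w, X_b)
# pairs to confirm the maximum at X_w = N_w and X_b = 0.
import math

def score(x_w, n_w, x_b, n_b):
    p = (x_w + x_b) / (n_w + n_b)            # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n_w + 1 / n_b))
    return (x_w / n_w - x_b / n_b) / se

N_w, N_b = 10, 20
best = max(
    ((x_w, x_b) for x_w in range(N_w + 1) for x_b in range(N_b + 1)
     if 0 < x_w + x_b < N_w + N_b),          # skip degenerate corners (se = 0)
    key=lambda t: score(t[0], N_w, t[1], N_b),
)
# best == (10, 0): a binary consensus matrix maximises this score
```
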

Supplementary Figure 1: Comparison of cluster membership by applying hierarchical clustering with complete linkage using the Euclidean distance (left) or 1 − co-membership proportion (right) as a distance measure. The co-membership proportions are obtained from hierarchical clustering on the Euclidean distances calculated on K = 100 subsamples. We represent the dendrograms obtained with the two approaches. The items are coloured by true (simulated) cluster membership. The same items are present in both dendrograms but may be re-ordered due to the change of distance metric. Positions of the same item in the two dendrograms are connected by edges to improve readability. Cluster re-assignment is indicated by red edges.
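The co-membership-based distance used in this comparison can be sketched as follows (an illustrative Python version under simplified assumptions, e.g. 50% subsamples and an arbitrary random dataset; it is not the sharp implementation):

```python
# Sketch of the assumed workflow: build a consensus matrix of co-membership
# proportions from K subsamples, then use 1 - proportion as a distance.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
n, p, K, G = 60, 10, 100, 3
X = rng.normal(size=(n, p))               # placeholder data

counts = np.zeros((n, n))                 # times each pair was co-clustered
draws = np.zeros((n, n))                  # times each pair was drawn together

for _ in range(K):
    idx = rng.choice(n, size=n // 2, replace=False)       # 50% subsample
    lab = fcluster(linkage(X[idx], method="complete"),
                   t=G, criterion="maxclust")
    same = (lab[:, None] == lab[None, :]).astype(float)   # co-membership
    counts[np.ix_(idx, idx)] += same
    draws[np.ix_(idx, idx)] += 1.0

consensus = np.divide(counts, draws, out=np.zeros_like(counts),
                      where=draws > 0)
np.fill_diagonal(consensus, 1.0)

# 1 - co-membership proportion as a distance for a final clustering
dist = squareform(1.0 - consensus, checks=False)
final = fcluster(linkage(dist, method="average"), t=G, criterion="maxclust")
```
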

Figure 1 :
Figure 1: Flowchart describing the six steps of consensus weighted clustering applied on the data matrix X with hyper-parameters λ and G.

Figure 2 :
Figure 2: Illustration of the calculation of the quantities X_w(λ, G), X_b(λ, G), N_w(λ, G), and N_b(λ, G) given the consensus clustering outputs Z(λ, G), C(λ, G), and H. The blue entries indicate the within elements and the orange entries the between elements.
We apply (consensus) clustering to simulated datasets with different numbers of items, attributes, clusters and levels of separation. Unless specified otherwise, inferences are based on hierarchical clustering with complete linkage applied to the (weighted) Euclidean distance. We use K = 100 subsampling iterations throughout this paper. First, we compare the clustering performances of hierarchical clustering and consensus unweighted clustering calibrated using different strategies. Then, we evaluate both the clustering and weighting performances of consensus weighted clustering. For unweighted clustering, comparisons are conducted using 1,000 simulated datasets with p = 10 attributes and n = 150 items allocated to G* = 5 clusters of sizes N_1 = 20, N_2 = 50, N_3 = 30, N_4 = 10, N_5 = 40. The p attributes all have the same proportion E of explained variance by the clustering. Different levels of cluster separation are investigated, with E ranging from 0.4 to 0.6. Sensitivity analyses include different numbers and sizes of clusters.
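One simple way to generate data with a target proportion of explained variance per feature is sketched below (an assumed construction for illustration only; the fake package used in this paper may implement this differently):

```python
# Sketch of one way to simulate clustered Gaussian data where each feature
# has a target proportion E of variance explained by cluster membership.
# This is an assumed construction, not necessarily the fake package's model.
import numpy as np

def simulate_clusters(sizes, p, E, seed=0):
    rng = np.random.default_rng(seed)
    n, G = sum(sizes), len(sizes)
    labels = np.repeat(np.arange(G), sizes)
    # Cluster means standardised to unit between-cluster variance per feature;
    # noise sd chosen so E = var(means) / (var(means) + var(noise))
    # (ignoring unequal cluster sizes, so the realised E is approximate).
    means = rng.normal(size=(G, p))
    means = (means - means.mean(0)) / means.std(0)
    sigma = np.sqrt((1 - E) / E)             # within-cluster noise sd
    X = means[labels] + rng.normal(scale=sigma, size=(n, p))
    return X, labels

X, labels = simulate_clusters([20, 50, 30, 10, 40], p=10, E=0.6)
```
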

Figure 3 :
Figure 3: Comparison of clustering performances of (consensus) hierarchical clustering with different calibration strategies from N = 1,000 simulated datasets corresponding to different levels of cluster separation. We simulate N = 1,000 datasets with n = 150 items split into G* = 5 clusters such that N_1 = 20, N_2 = 50, N_3 = 30, N_4 = 10, N_5 = 40 across p = 10 features, each with a proportion of explained variance of E = 0.6 (left), E = 0.5 (middle) or E = 0.4 (right). For each scenario, we show a heatmap of Euclidean distances (A) and a barplot of silhouette widths for each of the n = 150 items coloured by simulated cluster membership (B) for one simulated dataset. Median, quartiles, minimum and maximum Adjusted Rand Index (ARI) for hierarchical clustering with the simulated number of clusters (G*), or calibrated by maximising the silhouette and GAP scores, and for consensus hierarchical clustering with G* or calibrated using the ∆, PAC, PINS discrepancy, RCSI and consensus scores are reported (C).

Figure 4 :
Figure 4: Comparison of clustering performances by the Adjusted Rand Index (ARI) (A) and of the attribute weighting (when applicable) by the F_1-score (B) for hierarchical and consensus clustering using unweighted, sparcl or COSA Euclidean distances. Performances are evaluated for N = 1,000 simulated datasets with n = 150 items split into G* = 5 clusters including N_1 = 20, N_2 = 50, N_3 = 30, N_4 = 10, N_5 = 40 items, respectively. The clustering structure is supported by q* = 20 of the p = 100 attributes with nonzero proportion of explained variance (E = 0.6). Hierarchical clustering is conducted using a number of clusters that is simulated (G*) or calibrated using the silhouette coefficient or the GAP statistic. For unweighted consensus clustering, the number of clusters is the simulated (G*) or the calibrated number maximising the consensus score. For weighted clustering using sparcl or COSA distances, we (i) fix the number of clusters to G* and consider ten different values of λ, or (ii) jointly calibrate the number of clusters and the penalty parameter using our consensus score (A). For consensus weighted clustering, the F_1-score measures weighting performance by comparing the sets of (i) the 20 features with highest weights (or selection proportions for sparcl), and (ii) the 20 features supporting the clustering in the simulation (B).

Figure 5 :
Figure 5: Clustering applications on real transcriptomics data measured in lung tissue. Hierarchical clustering with complete linkage is applied on the Euclidean distance (A). The calibration curve (B) and consensus matrix (C) of consensus clustering using the Euclidean distance are represented. For consensus COSA clustering, we use a grid of 10 regularisation parameters in the calibration maximising the consensus score (D) and report the calibrated consensus matrix (E). Sample names are coloured by type (normal lung in green, carcinoids in red, squamous-cell in brown and small-cell in blue). For each of these three clustering approaches, the four estimated clusters are coloured in red, blue, green and beige.
Consensus (weighted) clustering and the proposed calibration procedure have been implemented in the R package sharp (version ≥ 1.4.0). Simulation models have been implemented in the R package fake (version ≥ 1.4.0). The transcriptomics dataset is publicly available and can be downloaded at https://doi.org/10.1073/pnas.191502998. All code to reproduce the analyses presented in this paper is available at https://github.com/barbarabodinier/Consensus_clustering.

Figure A :
Figure A: Heatmap of the consensus score (colour-coded) obtained with N_w = 10 and N_b = 20 and different values of X_w (x-axis) and X_b (y-axis).

Supplementary Figure 2 :
Scatter plots for simulated data using n = 100 items split into G* = 3 clusters such that N_1 = 20 (in blue), N_2 = 50 (in red) and N_3 = 30 (in orange) across p = 3 features with a proportion of explained variance by the grouping structure set to E = 0.8 for all features.

Supplementary Figure 3 :
Example of simulated data using n = 100 items split into G* = 3 clusters such that N_1 = 20 (in blue), N_2 = 50 (in red) and N_3 = 30 (in orange) across p = 5 (A), p = 10 (B) or p = 30 (C) features with a proportion of explained variance by the grouping structure set to E = 0.5 for all features. For each dataset, we show the heatmap of Euclidean distances (left) and score plots along the first three principal components of a Principal Component Analysis (right).

Supplementary Figure 5 :
Consensus unweighted clustering performance as a function of different calibration scores applied on simulated data with G* = 5 clusters. Consensus clustering was conducted using hierarchical clustering with complete linkage on the Euclidean distances computed on K = 100 subsamples. We show the heatmap of pairwise distances in the simulated dataset with n = 150 items split into G* = 5 clusters such that N_1 = 20, N_2 = 50, N_3 = 30, N_4 = 10, N_5 = 40 across p = 10 features, each with a proportion of explained variance of 0.6 (A). The Adjusted Rand Index (ARI) measuring clustering performance is represented as a function of the ∆ (B), PAC (C), PINS discrepancy (D), RCSI for PAC (E), RCSI for entropy (F) and consensus (G) scores for different numbers of clusters (indicated on the points). The calibrated number of clusters obtained with the corresponding calibration score is indicated in red.

Supplementary Figure 6 :
Calibration curves using the PAC, ∆, PINS discrepancy, RCSI and consensus scores on simulated examples with different levels of cluster separation and compactness. Consensus clustering was conducted using hierarchical clustering with complete linkage on the Euclidean distances computed on K = 100 subsamples. Data is simulated for n = 150 items split into G* = 5 clusters such that N_1 = 20, N_2 = 50, N_3 = 30, N_4 = 10, N_5 = 40 across p = 10 features with a proportion of explained variance by the grouping structure set to E = 0.7 (A), E = 0.6 (B), E = 0.5 (C), E = 0.4 (D), or E = 0.3 (E) for all features. Heatmaps of Euclidean distances calculated for the simulated data are reported at the top.

Supplementary Figure 8 :
Comparison of clustering performances of (consensus) hierarchical clustering with different calibration strategies from N = 1,000 simulated datasets corresponding to different numbers of clusters. We simulate N = 1,000 datasets with n = 150 items split into G* = 5 (left), G* = 10 (middle), or G* = 15 (right) clusters of equal sizes across p = 10 features, each with a proportion of explained variance of E = 0.5. For each scenario, we show a heatmap of Euclidean distances (A). Median, quartiles, minimum and maximum Adjusted Rand Index (ARI) for hierarchical clustering with the simulated number of clusters (G*), or calibrated by maximising the silhouette and GAP scores, and for consensus hierarchical clustering with G* or calibrated using the ∆, PAC, PINS discrepancy, RCSI and consensus scores are reported (B).