Motivation: Clustering of individuals into populations on the basis of multilocus genotypes is informative in a variety of settings. In population-genetic clustering algorithms, such as BAPS, STRUCTURE and TESS, individual multilocus genotypes are partitioned over a set of clusters, often using unsupervised approaches that involve stochastic simulation. As a result, replicate cluster analyses of the same data may produce several distinct solutions for estimated cluster membership coefficients, even though the same initial conditions were used. Major differences among clustering solutions have two main sources: (1) ‘label switching’ of clusters across replicates, caused by the arbitrary way in which clusters in an unsupervised analysis are labeled, and (2) ‘genuine multimodality,’ truly distinct solutions across replicates.
Results: To facilitate the interpretation of population-genetic clustering results, we describe three algorithms for aligning multiple replicate analyses of the same data set. We have implemented these algorithms in the computer program CLUMPP (CLUster Matching and Permutation Program). We illustrate the use of CLUMPP by aligning the cluster membership coefficients from 100 replicate cluster analyses of 600 chickens from 20 different breeds.
Availability:CLUMPP is freely available at http://rosenberglab.bioinformatics.med.umich.edu/clumpp.html
A variety of population-genetic applications—such as association mapping, molecular ecological studies and studies of human evolution—make use of the clustering of individual multilocus genotypes into populations. Many clustering algorithms have now been developed for employing population-genetic data to assign individuals—and fractions of individuals—to clusters (Anderson and Thompson, 2002; Chen et al., 2006, 2007; Corander and Marttinen, 2006; Corander et al., 2003, 2004; Dawson and Belkhir, 2001; Falush et al., 2003; François et al., 2006; Huelsenbeck and Andolfatto, 2007; Pella and Masuda, 2006; Pritchard et al., 2000). The result of a single cluster analysis is typically possible to represent as a matrix, where each individual is given a ‘membership coefficient’ for each cluster—interpreted as a probability of membership, or as a fraction of the genome with membership in the cluster, depending on the setting—with membership coefficients summing to 1 across K clusters. The number of clusters is predefined by the user for some methods, and for others it is inferred.
Because clustering algorithms may incorporate stochastic simulation as part of the inference, independent analyses of the same data may result in several distinct outcomes, even though the same initial conditions were used. The main differences across replicates are of two types: ‘label switching’ and ‘genuine multimodality’. ‘Label switching’ refers to a scenario in which different replicates obtain the same membership coefficient estimates, except with a different permutation of the cluster labels (Jasra et al., 2005; Stephens, 2000). In unsupervised cluster analyses, because the meaning of each cluster label is not known in advance, a clustering algorithm may be equally likely to reach any of K ! permutations of the same collection of estimated membership coefficients.
It is also possible for replicate cluster analyses to arrive at truly distinct solutions that are not equivalent up to permutation. Unlike ‘label switching’, which is simply a nuisance that makes interpretation of the clustering results more difficult, this ‘genuine multimodality’ may result from real biological factors that cause multiple parts of the space of possible membership coefficients to provide similarly appropriate explanations for the data.
Regardless of the source of differences in clustering outcomes, some method is needed for handling the results from replicate analyses. We develop three algorithms for searching for optimal alignments of R replicate cluster analyses of the same data, and we have implemented these algorithms in the computer program CLUMPP. Our program takes as input the estimated cluster membership coefficient matrices of multiple runs of a clustering program, for any number of clusters. It outputs these same matrices, permuted so that all replicates have as close a match as possible. Thus, CLUMPP strips away the ‘label switching’ heterogeneity so that the ‘genuine multimodality’ can be detected and quantified. The input file for CLUMPP is a file similar to the output from STRUCTURE (Falush et al., 2003; Pritchard et al., 2000), and the output from CLUMPP can be used directly as input by the cluster visualization program DISTRUCT (Rosenberg, 2004).
We refer to the C×K matrix of membership coefficients for a single cluster analysis as the Q-matrix, with the C rows corresponding to individuals (or populations) and the K columns corresponding to clusters. CLUMPP attempts to maximize a measure of similarity of the Q-matrices of R creplicates over all (K !)R−1 possible alignments of the replicates.
Consider a pair of Q-matrices, Qi and Qj for runs i and j, where the value in the cth row and kth column of Qi is the membership coefficient for individual c in cluster k as inferred in run i. Each matrix consists of non-negative entries, and the sum of the entries in any row is 1. We define the pairwise similarity of matrices Qi and Qj as follows:F is the Frobenius matrix norm (Golub and Van Loan, 1996).
Using G to measure similarity, the optimal alignment of matrices Qi and Qj is defined as the permutation of the columns of Qj that maximizes the similarity G over all permutations P in the set SK of permutations of K clusters. The maximum value, or2005) the ‘symmetric similarity coefficient’ (SSC) of the pair of runs [see also Rosenberg et al., (2002) for an earlier statistic]. The SSC for two runs is bounded above by 1—which it equals if the Q-matrices are identical up to a permutation of the clusters—and it decreases as the similarity of the Q-matrices decreases. The SSC statistic is generally expected to be positive if non-trivial clustering patterns are present in Qi and Qj, although it is possible for it to be negative. 5) is taken over (K !)R−1 vectors.
We make use of three algorithms for attempting to find the optimal alignment of R replicates. In decreasing order of the extent of the search, and in increasing order of computational speed, these algorithms are termed FullSearch, Greedy and LargeKGreedy. These algorithms supersede earlier methods that we described in Nordborg et al. (2005).
Note that our approach can proceed analogously using alternative functions to measure similarity in place of G. Although it is undefined when one of the two matrices equals W, G is designed to have large negative values when one of the two runs reflects substantial population structure and the other has relatively little structure (i.e. little difference from W). We can define a second similarity function G′, which is guaranteed to lie in [0,1]:, which guarantees that G′ lies in [0,1], arises from the definition of the Frobenius norm: . Similarly, . It then follows that .
The quantities SSC′, H′ and rmSSCR' can then be defined by replacing G in Equations (3–5) with G′. We proceed to describe our algorithms using the G statistic to measure similarity; to instead use G′, the approach is analogous with G′, SSC′, H′ and rmSSCR' in place of G, SSC, H, and SSCR.
The FullSearch algorithm computes H for each of the (K!)R−1 alignments of the K clusters in R replicates. Considering all possible vectors of permutations (I, P2, P3, …, PR), the algorithm computes H(I(Q1),P2(Q2),P3(Q3),…,PR(QR)) and returns the vector of permutations that maximizes SSCR. As the number of possible vectors grows quickly with K and R, however, even for moderate values of K and R, it is unrealistic to test every alignment. The FullSearch algorithm runs in time proportional to TFullSearch = (K!)R−1[R(R−1)/2]KC: the number of permutation vectors is (K!)R−1, the number of computations of G in each evaluation of Equation (4) is R(R−1)/2, and the time required for each computation of G is proportional to KC. Although it proceeds slowly, unlike our other algorithms, the FullSearch algorithm is guaranteed to find the optimal alignment of clusters across multiple runs.
The Greedy algorithm employed by CLUMPP proceeds as follows:
Choose one run, Q1.
Choose a second run, Q2, and fix the permutation, P2, that maximizes G(P1(Q1),P(Q2)) over all possible permutations P (where P1 is the identity permutation).
This algorithm runs in time proportional to TGreedy=(K !)[R(R−1)/2]KC. The number of permutations tested for each run from 2 to R is K !. For run r (r ranging from 2 to R), the number of computations of G performed for each permutation is r−1. Thus, considering all runs from 2 to R, the total number of computations of G performed is (K !)[R(R−1)]/2. Each computation of G runs in time proportional to KC.
Because the order in which the runs are considered can affect the result, several different sequences of runs should be tested. CLUMPP offers three options for testing different sequences: test all possible sequences of runs, test a pre-defined number of random sequences and test a specific set of user-defined sequences.
When K≳15, the number of permutations, K !, is very large, and it may not be possible to calculate G for all permutations of a particular pair of Q-matrices. Instead of testing every permutation as in the Greedy algorithm, the LargeKGreedy algorithm proceeds as follows:
(1)Choose one run, Q1.
(2a)Choose a second run, Q2. Compute G for all pairs of columns, one from Q1 and one from Q2. This computation is simply the value of G for two columns—hence no permutations of Q2 are computed, unlike in step 2 for the Greedy algorithm.
(2b)Pick the pair of columns Q1,y1 and Q2,z1 with highest G-value and fix these columns (Q1,y1 refers to column y1 of matrix Q1). Then pick the pairs of columns Q1,y2 and Q2,z2 with the next highest G-value, ignoring all G-values of pairs containing either of the previously chosen columns Q1,y1 and Q2,z1. Repeat this procedure until K pairs of columns, one each from Q1 and Q2, have been picked, and fix the permutation of Q2 that matches up these pairs of columns, or P2(Q2).
(3b)Pick the pair of columns y1 of P1(Q1),P2(Q2),…,Px−1(Qx−1) and z1 of Qx with highest average similarity in Equation (8). Then pick the columns y2 and z2 with the next highest similarity in Equation (8), ignoring similarity scores of pairs containing either of the previously chosen columns y1 and z1. Repeat this procedure until K pairs of columns, one for the matrices P1(Q1),P2(Q2),…,Px−1(Qx−1) and one for Qx, have been picked. Fix the permutation of Qx that matches up these pairs of columns, or Px(Qx).
A candidate for the vector of permutations of the R runs that maximizes H across all possible vectors has now been constructed. This algorithm runs in time proportional to TLargeKGreedy=[R(R−1)/2]K2C. The number of pairs of columns, one from the run under consideration and one from the previously fixed runs, is K2. For run r (r ranging from 2 to R), the number of computations of G performed for each pair of columns is r−1. Considering all runs from 2 to R, the total number of computations of G performed is K2[R(R−1)]/2. Since G is computed only for columns rather than for whole matrices, the time of computation of G is proportional only to C rather than to KC, as in the other algorithms.
As is true for the Greedy algorithm, the order in which the runs are considered can affect the result. For the LargeKGreedy algorithm CLUMPP offers the same three options for selecting the input sequence of runs as it provides for the Greedy algorithm.
To get an idea of which algorithm to use, we have found it useful to compute the quantity D=TN for each algorithm, where T is a quantity proportional to the time required by an algorithm (as described above) and N is the number of input sequences to be tested (for FullSearch, N=1). If D ∼ 1013 for FullSearch, then this algorithm is fast enough and is preferred; otherwise the Greedy algorithm can be used. If D ≳ 1013 for the Greedy algorithm, then this algorithm is probably also too slow. In that case, the LargeKGreedy algorithm should be used, as it can handle K>20, R>100 and C>1000 in reasonable time. We recommend testing at least 100 input sequences for the Greedy and LargeKGreedy algorithms—and preferably many more—to find the approximately highest SSCR value.
Rosenberg et al. (2001) analyzed a data set of 27 microsatellites genotyped in 600 chickens from 20 breeds. Using STRUCTURE (Pritchard et al., 2000), they inferred membership coefficients of the 600 chickens in 19 clusters. By repeating the cluster analysis 100 times, they assessed the robustness of the clusters, and Figure 1A displays the cluster membership estimates from 25 of the 100 replicates (randomly chosen). From Figure 1A, the ‘label-switching’ problem is apparent. Focusing on Runs 15 and 16, we could quickly ‘match up’ the clusters (i.e. the colored segments). However, because of some genuine multimodality, it would be not only tedious, but also difficult, to visually align the clusters for all replicates in an optimal manner. CLUMPP is designed to address exactly this type of situation.
Using CLUMPP and the LargeKGreedy algorithm on the 100 replicates (10 sets of 100 random input sequences of runs—for each set this takes ∼6 min on a 2.4 GHz processor), we find the highest value of H to be in the range of 0.51–0.54 (and the highest value of H′ to be between 0.69 and 0.70). If we allow the algorithm to run longer (30 000 random input sequences), the highest H equals 0.5546. The highest-scoring input sequences tend to produce quite similar alignments of the replicates. Excluding the input sequence that produced the highest H-value, the next 10 highest-scoring input sequences all produced H-values of at least 0.5441, and had on average 2.0% differences compared to the input sequence that led to the highest H. In other words, considering the permuted position of a randomly chosen cluster (among 19) in a randomly chosen run (among 100) based on the output of CLUMPP using a randomly chosen input sequence (among the 10 next highest-scoring sequences, after the one with the highest H-value), the cluster had a 98.0% chance of being aligned in the same way that it was aligned when using the highest-scoring of all input sequences.
The permuted membership coefficients of the 25 runs in Figure 1A for the input sequence among the 30 000 that leads to the highest H are shown in Figure 1B. Comparing the permuted membership coefficients for this input sequence to the clustering behavior in Table 2 of Rosenberg et al. (2001), we recapture the same patterns of clustering. For example, Brown-egg layer lines C and D always share a cluster, and this cluster sometimes (in 8 of 100 runs) includes Godollo Nhx (Run 11 in Fig. 1).
When we consider the LargeKGreedy algorithm with 10 000 fixed input sequences (chosen randomly among the 30 000 described above), for each of the 10 000 sequences, the alignment of the replicates obtained by CLUMPP is identical regardless of which of two statistics—H or H′—is used. The same input sequence that produces the highest H-value (H=0.5508) produces the highest H′-value (H′=0.7099). Excluding the input sequence that produced the highest values of H and H′, the next 10 highest-scoring input sequences—the same sequences for both statistics—all produced H-values of at least 0.5392 and H′-values of at least 0.7022, and had on average 7.0% differences compared to the input sequence that led to the highest H and H′. These results suggest that although matrices can be constructed so that the two statistics can lead to different alignments, in practice, their properties are extremely similar. The larger number of differences for the highest-scoring input sequences from among 10 000 sequences compared to the highest-scoring among the 30 000 sequences described above highlights the importance of employing a large number of input sequences whenever possible.
Interpreting the results of multiple population genetic cluster analyses is not always straightforward (Corander et al., 2004; Evanno et al., 2005; Pritchard et al., 2000; Rosenberg et al., 2001). One could argue that the single replicate with the highest likelihood (according to the criterion of the clustering program) is the optimal clustering solution, and that solutions with lower likelihood can be discarded. However, for complex data sets that produce several distinct solutions with similar likelihoods, this strategy may not be satisfactory, as it disregards both uncertainties of the analysis and important population structure information that may only be visible in different modes. From a Bayesian perspective, a possible approach would be to weight different solutions by their likelihoods. However, this approach can be difficult to apply with individual membership coefficients in the case of genuine multimodality. A third approach is to summarize data on the outcomes of many replicates, such as in Table 2 of Rosenberg et al. (2001). Regardless of the choice of procedure for employing multiple cluster analyses to make biological interpretations from population-genetic data, CLUMPP can provide a useful way of assessing the similarity of the outcomes of individual runs.
We note that our focus here has been on the setting in which the clustering algorithm explores only one mode during the iterative process that constitutes a single replicate analysis, but may explore different modes in different replicates. As was recognized by Pritchard et al. (2000), this setting typically applies to the STRUCTURE approach, for which the results of individual replicates are reasonably straightforward to summarize and interpret, but for which difficulties may arise in interpreting differences across replicates. It is possible to envision an alternative scenario in which multiple important modes (potentially including multiple permutations of each of the genuinely distinct modes) are explored during the course of a single analysis. In this situation, replicate analyses would produce similar results, but because the meaning of a cluster label could vary across intermediate steps in the analysis, it would be challenging to summarize the outcome of any single replicate. Although our analysis has used the language of multiple replicate analyses rather than that of multiple intermediate steps of a single analysis, it is worth noting that the algorithms we have proposed for the matching of clusters can also be applied with each ‘replicate’ described above corresponding to a specific state of the membership coefficient estimates encountered during the course of a single analysis. Thus, in addition to their uses in interpreting multiple replicate analyses, our algorithms can augment existing approaches to the analysis of individual replicates (e.g. Dawson and Belkhir, 2001; Huelsenbeck and Andolfatto, 2007) and can contribute new approaches for the problem of summarizing the intermediate steps of a single run.
We thank O. François and two anonymous reviewers for helpful comments on a previous version of the manuscript. This work was supported by NIH grant HL084729, by grants from the Burroughs Wellcome Fund and the Alfred P. Sloan Foundation, and by a postdoctoral fellowship to M.J. from the University of Michigan Center for Genetics in Health and Medicine.
Conflict of Interest: none declared.