## Abstract

**Motivation:** Clustering of individuals into populations on the basis of multilocus genotypes is informative in a variety of settings. In population-genetic clustering algorithms, such as *BAPS, STRUCTURE* and *TESS*, individual multilocus genotypes are partitioned over a set of clusters, often using unsupervised approaches that involve stochastic simulation. As a result, replicate cluster analyses of the same data may produce several distinct solutions for estimated cluster membership coefficients, even though the same initial conditions were used. Major differences among clustering solutions have two main sources: (1) ‘label switching’ of clusters across replicates, caused by the arbitrary way in which clusters in an unsupervised analysis are labeled, and (2) ‘genuine multimodality,’ truly distinct solutions across replicates.

**Results:** To facilitate the interpretation of population-genetic clustering results, we describe three algorithms for aligning multiple replicate analyses of the same data set. We have implemented these algorithms in the computer program *CLUMPP* (CLUster Matching and Permutation Program). We illustrate the use of *CLUMPP* by aligning the cluster membership coefficients from 100 replicate cluster analyses of 600 chickens from 20 different breeds.

**Availability:** *CLUMPP* is freely available at http://rosenberglab.bioinformatics.med.umich.edu/clumpp.html

**Contact:** mjakob@umich.edu

## 1 INTRODUCTION

A variety of population-genetic applications—such as association mapping, molecular ecological studies and studies of human evolution—make use of the clustering of individual multilocus genotypes into populations. Many clustering algorithms have now been developed for employing population-genetic data to assign individuals—and fractions of individuals—to clusters (Anderson and Thompson, 2002; Chen *et al.*, 2006, 2007; Corander and Marttinen, 2006; Corander *et al.*, 2003, 2004; Dawson and Belkhir, 2001; Falush *et al.*, 2003; François *et al.*, 2006; Huelsenbeck and Andolfatto, 2007; Pella and Masuda, 2006; Pritchard *et al.*, 2000). The result of a single cluster analysis is typically possible to represent as a matrix, where each individual is given a ‘membership coefficient’ for each cluster—interpreted as a probability of membership, or as a fraction of the genome with membership in the cluster, depending on the setting—with membership coefficients summing to 1 across *K* clusters. The number of clusters is predefined by the user for some methods, and for others it is inferred.

Because clustering algorithms may incorporate stochastic simulation as part of the inference, independent analyses of the same data may result in several distinct outcomes, even though the same initial conditions were used. The main differences across replicates are of two types: ‘label switching’ and ‘genuine multimodality’. ‘Label switching’ refers to a scenario in which different replicates obtain the same membership coefficient estimates, except with a different permutation of the cluster labels (Jasra *et al.*, 2005; Stephens, 2000). In unsupervised cluster analyses, because the meaning of each cluster label is not known in advance, a clustering algorithm may be equally likely to reach any of *K* ! permutations of the same collection of estimated membership coefficients.

It is also possible for replicate cluster analyses to arrive at truly distinct solutions that are not equivalent up to permutation. Unlike ‘label switching’, which is simply a nuisance that makes interpretation of the clustering results more difficult, this ‘genuine multimodality’ may result from real biological factors that cause multiple parts of the space of possible membership coefficients to provide similarly appropriate explanations for the data.

Regardless of the source of differences in clustering outcomes, some method is needed for handling the results from replicate analyses. We develop three algorithms for searching for optimal alignments of *R* replicate cluster analyses of the same data, and we have implemented these algorithms in the computer program *CLUMPP*. Our program takes as input the estimated cluster membership coefficient matrices of multiple runs of a clustering program, for any number of clusters. It outputs these same matrices, permuted so that all replicates have as close a match as possible. Thus, *CLUMPP* strips away the ‘label switching’ heterogeneity so that the ‘genuine multimodality’ can be detected and quantified. The input file for *CLUMPP* is a file similar to the output from *STRUCTURE* (Falush *et al.*, 2003; Pritchard *et al.*, 2000), and the output from *CLUMPP* can be used directly as input by the cluster visualization program *DISTRUCT* (Rosenberg, 2004).

## 2 ALGORITHMS

We refer to the *C*×*K* matrix of membership coefficients for a single cluster analysis as the *Q-matrix*, with the *C* rows corresponding to individuals (or populations) and the *K* columns corresponding to clusters. *CLUMPP* attempts to maximize a measure of similarity of the *Q*-matrices of *R* replicates over all (*K*!)^{R−1} possible alignments of the replicates.

Consider a pair of *Q*-matrices, *Q*_{i} and *Q*_{j}, for runs *i* and *j*, where the value in the *c*th row and *k*th column of *Q*_{i} is the membership coefficient for individual *c* in cluster *k* as inferred in run *i*. Each matrix consists of non-negative entries, and the sum of the entries in any row is 1. We define the pairwise similarity of matrices *Q*_{i} and *Q*_{j} as follows:

$$G(Q_i, Q_j) = 1 - \frac{\|Q_i - Q_j\|_F}{\sqrt{\|Q_i - W\|_F \, \|Q_j - W\|_F}}, \tag{1}$$

where *W* is a *C*×*K* matrix with all elements equal to 1/*K* and ‖·‖_{F} is the Frobenius matrix norm (Golub and Van Loan, 1996):

$$\|A\|_F = \sqrt{\sum_{c=1}^{C} \sum_{k=1}^{K} a_{ck}^2}, \tag{2}$$

where *C* and *K* respectively denote the numbers of rows and columns of *A*, and *a*_{ck} is the value in row *c* and column *k*.
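The similarity statistic of Equations (1) and (2) can be sketched in a few lines of Python. This is an illustrative reimplementation, not code from *CLUMPP* itself; the function and variable names (`frobenius`, `mat_diff`, `G`) are our own.

```python
import math

def frobenius(A):
    """Frobenius norm of a matrix A (Equation 2): sqrt of the sum of squared entries."""
    return math.sqrt(sum(a * a for row in A for a in row))

def mat_diff(A, B):
    """Elementwise difference of two matrices given as lists of rows."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def G(Qi, Qj):
    """Pairwise similarity of two Q-matrices (Equation 1).

    W is the C x K matrix with every entry 1/K, representing the absence
    of population structure.
    """
    K = len(Qi[0])
    W = [[1.0 / K] * K for _ in Qi]
    return 1.0 - frobenius(mat_diff(Qi, Qj)) / math.sqrt(
        frobenius(mat_diff(Qi, W)) * frobenius(mat_diff(Qj, W)))

# Two runs with identical membership coefficients have G = 1:
Q1 = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
print(G(Q1, Q1))  # 1.0
```

Note that, as stated below for *G*, this quantity is undefined when one of the matrices equals *W*, since a factor in the denominator is then zero.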

Using *G* to measure similarity, the optimal alignment of matrices *Q*_{i} and *Q*_{j} is defined as the permutation of the columns of *Q*_{j} that maximizes the similarity *G* over all permutations *P* in the set *S*_{K} of permutations of *K* clusters. The maximum value,

$$\mathrm{SSC}(Q_i, Q_j) = \max_{P \in S_K} G(Q_i, P(Q_j)), \tag{3}$$

we term, following Nordborg *et al.* (2005), the ‘symmetric similarity coefficient’ (SSC) of the pair of runs [see also Rosenberg *et al.* (2002) for an earlier statistic]. The SSC for two runs is bounded above by 1—which it equals if the *Q*-matrices are identical up to a permutation of the clusters—and it decreases as the similarity of the *Q*-matrices decreases. The SSC statistic is generally expected to be positive if non-trivial clustering patterns are present in *Q*_{i} and *Q*_{j}, although it is possible for it to be negative.

For a collection of *R* replicates, the average pairwise similarity is defined as

$$H = \frac{2}{R(R-1)} \sum_{1 \le i < j \le R} G(P_i(Q_i), P_j(Q_j)). \tag{4}$$

To align the *R* replicates, we search for the vector of permutations that maximizes this average pairwise similarity:

$$\mathrm{SSC}_R = \max_{(P_1, P_2, \ldots, P_R)} H. \tag{5}$$

Without loss of generality, we take *P*_{1} to be the identity permutation *I*, so that the clusters of runs 2,…,*R* are permuted to align with the clusters of run 1. As *S*_{K} contains *K*! permutations, with *P*_{1} set to equal *I*, the maximum in Equation (5) is taken over (*K*!)^{R−1} vectors.
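The SSC of Equation (3) and the average pairwise similarity *H* of Equation (4) can be sketched as follows. This is an illustrative reimplementation rather than *CLUMPP*'s own code; `permute_cols`, `SSC` and `H` are our names, and `G` is the pairwise similarity of Equation (1).

```python
import itertools
import math

def frobenius(A):
    return math.sqrt(sum(a * a for row in A for a in row))

def G(Qi, Qj):
    """Pairwise similarity of two Q-matrices (Equation 1)."""
    K = len(Qi[0])
    W = [[1.0 / K] * K for _ in Qi]
    diff = lambda A, B: [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    return 1.0 - frobenius(diff(Qi, Qj)) / math.sqrt(
        frobenius(diff(Qi, W)) * frobenius(diff(Qj, W)))

def permute_cols(Q, perm):
    """Column k of the result is column perm[k] of Q."""
    return [[row[p] for p in perm] for row in Q]

def SSC(Qi, Qj):
    """Maximize G over all K! column permutations of Qj (Equation 3)."""
    K = len(Qi[0])
    return max(G(Qi, permute_cols(Qj, perm))
               for perm in itertools.permutations(range(K)))

def H(perms, Qs):
    """Average pairwise G over all pairs of permuted replicates (Equation 4)."""
    R = len(Qs)
    permuted = [permute_cols(Q, p) for Q, p in zip(Qs, perms)]
    return sum(G(permuted[i], permuted[j])
               for i in range(R) for j in range(i + 1, R)) / (R * (R - 1) / 2)

# Label switching: run 2 equals run 1 with the cluster labels swapped, so SSC = 1.
Q1 = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
Q2 = [[0.1, 0.9], [0.8, 0.2], [0.9, 0.1]]
print(SSC(Q1, Q2))  # 1.0
```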

We make use of three algorithms for attempting to find the optimal alignment of *R* replicates. In decreasing order of the extent of the search, and in increasing order of computational speed, these algorithms are termed *FullSearch, Greedy* and *LargeKGreedy*. These algorithms supersede earlier methods that we described in Nordborg *et al.* (2005).

Note that our approach can proceed analogously using alternative functions to measure similarity in place of *G*. Although it is undefined when one of the two matrices equals *W*, *G* is designed to have large negative values when one of the two runs reflects substantial population structure and the other has relatively little structure (i.e. little difference from *W*). We can define a second similarity function *G*′, which is guaranteed to lie in [0,1]:

$$G'(Q_i, Q_j) = 1 - \frac{\|Q_i - Q_j\|_F}{\sqrt{2C}}. \tag{6}$$

The fact that *G*′ lies in [0,1] arises from the definition of the Frobenius norm: if *A* and *B* have non-negative entries and row sums of 1, then −2*a*_{ck}*b*_{ck} ≤ 0 and Σ_{k=1}^{K} *a*_{ck}^{2} ≤ (Σ_{k=1}^{K} *a*_{ck})^{2} = 1. Similarly, Σ_{k=1}^{K} *b*_{ck}^{2} ≤ 1. It then follows that

$$\|A - B\|_F^2 = \sum_{c=1}^{C} \sum_{k=1}^{K} \left(a_{ck}^2 - 2 a_{ck} b_{ck} + b_{ck}^2\right) \le 2C.$$

The quantities SSC′, *H*′ and SSC′_{R} can then be defined by replacing *G* in Equations (3)–(5) with *G*′. We proceed to describe our algorithms using the *G* statistic to measure similarity; to instead use *G*′, the approach is analogous, with *G*′, SSC′, *H*′ and SSC′_{R} in place of *G*, SSC, *H* and SSC_{R}.
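The bound underlying *G*′ can be checked numerically. The sketch below (our own illustrative code, not part of *CLUMPP*) draws random row-stochastic matrices and verifies that *G*′ never leaves [0,1].

```python
import math
import random

def g_prime(Qi, Qj):
    """G' of Equation (6): 1 - ||Qi - Qj||_F / sqrt(2C)."""
    C = len(Qi)
    d2 = sum((a - b) ** 2 for ra, rb in zip(Qi, Qj) for a, b in zip(ra, rb))
    return 1.0 - math.sqrt(d2) / math.sqrt(2 * C)

def random_q(C, K, rng):
    """A random C x K matrix with non-negative entries and row sums of 1."""
    rows = []
    for _ in range(C):
        w = [rng.random() for _ in range(K)]
        s = sum(w)
        rows.append([x / s for x in w])
    return rows

rng = random.Random(0)
for _ in range(1000):
    A, B = random_q(5, 3, rng), random_q(5, 3, rng)
    assert 0.0 <= g_prime(A, B) <= 1.0
print("G' stayed in [0,1] for 1000 random pairs")
```

Unlike *G*, this statistic remains defined when one of the matrices equals *W*, since its denominator does not depend on the matrices.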

*FullSearch*

The *FullSearch* algorithm computes *H* for each of the (*K*!)^{R−1} alignments of the *K* clusters in *R* replicates. Considering all possible vectors of permutations (*I*, *P*_{2}, *P*_{3},…, *P*_{R}), the algorithm computes *H*(*I*(*Q*_{1}), *P*_{2}(*Q*_{2}), *P*_{3}(*Q*_{3}),…, *P*_{R}(*Q*_{R})) and returns the vector of permutations that maximizes SSC_{R}. As the number of possible vectors grows quickly with *K* and *R*, however, even for moderate values of *K* and *R*, it is unrealistic to test every alignment. The *FullSearch* algorithm runs in time proportional to *T*_{FullSearch} = (*K*!)^{R−1}[*R*(*R*−1)/2]*KC*: the number of permutation vectors is (*K*!)^{R−1}, the number of computations of *G* in each evaluation of Equation (4) is *R*(*R*−1)/2, and the time required for each computation of *G* is proportional to *KC*. Although it proceeds slowly, unlike our other algorithms, the *FullSearch* algorithm is guaranteed to find the optimal alignment of clusters across multiple runs.
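The exhaustive search can be sketched as follows. This is an illustrative reimplementation (our names, not *CLUMPP*'s internals), feasible only for small *K* and *R*, exactly as the runtime analysis above predicts.

```python
import itertools
import math

def frobenius(A):
    return math.sqrt(sum(a * a for row in A for a in row))

def G(Qi, Qj):
    """Pairwise similarity of two Q-matrices (Equation 1)."""
    K = len(Qi[0])
    W = [[1.0 / K] * K for _ in Qi]
    diff = lambda A, B: [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    return 1.0 - frobenius(diff(Qi, Qj)) / math.sqrt(
        frobenius(diff(Qi, W)) * frobenius(diff(Qj, W)))

def permute_cols(Q, perm):
    return [[row[p] for p in perm] for row in Q]

def full_search(Qs):
    """Return (best H, best permutation vector) over all (K!)^(R-1) vectors."""
    R, K = len(Qs), len(Qs[0][0])
    identity = tuple(range(K))
    best = (-float("inf"), None)
    # P_1 is fixed to the identity; iterate over the remaining R-1 permutations.
    for tail in itertools.product(itertools.permutations(range(K)), repeat=R - 1):
        perms = (identity,) + tail
        permuted = [permute_cols(Q, p) for Q, p in zip(Qs, perms)]
        h = sum(G(permuted[i], permuted[j])
                for i in range(R) for j in range(i + 1, R)) / (R * (R - 1) / 2)
        if h > best[0]:
            best = (h, perms)
    return best

# Three label-switched copies of the same solution align perfectly, giving H = 1.
Q = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
swap = [[row[1], row[0]] for row in Q]
h, perms = full_search([Q, swap, Q])
print(round(h, 6))  # 1.0
```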

*Greedy*

The *Greedy* algorithm employed by *CLUMPP* proceeds as follows:

1. Choose one run, *Q*_{1}.
2. Choose a second run, *Q*_{2}, and fix the permutation *P*_{2} that maximizes *G*(*P*_{1}(*Q*_{1}), *P*(*Q*_{2})) over all possible permutations *P* (where *P*_{1} is the identity permutation).
3. Continue sequentially with each remaining run, *Q*_{x}, where *x* = 3,…,*R*, and fix the permutation *P*_{x} of *Q*_{x} that maximizes the average similarity with the previously fixed *x*−1 runs, or

$$\frac{1}{x-1} \sum_{\ell=1}^{x-1} G(P_\ell(Q_\ell), P(Q_x)), \tag{7}$$

over all permutations *P*.

This algorithm runs in time proportional to *T*_{Greedy} = (*K*!)[*R*(*R*−1)/2]*KC*. The number of permutations tested for each run from 2 to *R* is *K*!. For run *r* (*r* ranging from 2 to *R*), the number of computations of *G* performed for each permutation is *r*−1. Thus, considering all runs from 2 to *R*, the total number of computations of *G* performed is (*K*!)[*R*(*R*−1)]/2. Each computation of *G* runs in time proportional to *KC*.

Because the order in which the runs are considered can affect the result, several different sequences of runs should be tested. *CLUMPP* offers three options for testing different sequences: test all possible sequences of runs, test a pre-defined number of random sequences and test a specific set of user-defined sequences.
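The greedy procedure, for one fixed input sequence of runs, can be sketched as below. Again this is illustrative code with our own names, not *CLUMPP*'s internals, and `G` is the similarity of Equation (1).

```python
import itertools
import math

def frobenius(A):
    return math.sqrt(sum(a * a for row in A for a in row))

def G(Qi, Qj):
    """Pairwise similarity of two Q-matrices (Equation 1)."""
    K = len(Qi[0])
    W = [[1.0 / K] * K for _ in Qi]
    diff = lambda A, B: [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    return 1.0 - frobenius(diff(Qi, Qj)) / math.sqrt(
        frobenius(diff(Qi, W)) * frobenius(diff(Qj, W)))

def permute_cols(Q, perm):
    return [[row[p] for p in perm] for row in Q]

def greedy_align(Qs):
    """For each run x = 2..R in the given order, fix the permutation of Q_x
    that maximizes its average G with the x-1 runs already fixed."""
    K = len(Qs[0][0])
    fixed = [Qs[0]]  # P_1 is the identity permutation
    for Qx in Qs[1:]:
        best = max(itertools.permutations(range(K)),
                   key=lambda p: sum(G(F, permute_cols(Qx, p)) for F in fixed))
        fixed.append(permute_cols(Qx, best))
    return fixed

Q = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
swap = [[row[1], row[0]] for row in Q]
aligned = greedy_align([Q, swap])
print(aligned[1] == Q)  # True: the label switch has been undone
```

In a fuller sketch, `greedy_align` would be called on several orderings of the runs and the ordering yielding the highest *H* kept, as described above.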

*LargeKGreedy*

When *K* ≳ 15, the number of permutations, *K*!, is very large, and it may not be possible to calculate *G* for all permutations of a particular pair of *Q*-matrices. Instead of testing every permutation as in the *Greedy* algorithm, the *LargeKGreedy* algorithm proceeds as follows:

(1) Choose one run, *Q*_{1}.

(2a) Choose a second run, *Q*_{2}. Compute *G* for all pairs of columns, one from *Q*_{1} and one from *Q*_{2}. This computation is simply the value of *G* for two columns—hence no permutations of *Q*_{2} are computed, unlike in Step 2 for the *Greedy* algorithm.

(2b) Pick the pair of columns *Q*_{1,y₁} and *Q*_{2,z₁} with highest *G*-value and fix these columns (*Q*_{1,y₁} refers to column *y*_{1} of matrix *Q*_{1}). Then pick the pair of columns *Q*_{1,y₂} and *Q*_{2,z₂} with the next highest *G*-value, ignoring all *G*-values of pairs containing either of the previously chosen columns *Q*_{1,y₁} and *Q*_{2,z₁}. Repeat this procedure until *K* pairs of columns, one each from *Q*_{1} and *Q*_{2}, have been picked, and fix the permutation of *Q*_{2} that matches up these pairs of columns, or *P*_{2}(*Q*_{2}).

(3a) Continue sequentially with each remaining run, *Q*_{x}, where *x* = 3,…,*R*. For each *y* and *z* from 1 to *K*, compute the average similarity

$$\frac{1}{x-1} \sum_{\ell=1}^{x-1} G(P_{\ell,y}, Q_{x,z}), \tag{8}$$

where *P*_{ℓ,y} denotes column *y* of the permuted matrix *P*_{ℓ}(*Q*_{ℓ}) and *Q*_{x,z} denotes column *z* of *Q*_{x}. This quantity is the similarity of column *z* of *Q*_{x} to column *y* of each of the previously fixed permutations, averaged across all runs previously considered. No permutations of *Q*_{x} are computed, unlike in Step 3 for the *Greedy* algorithm.

(3b) Pick the pair of columns *y*_{1} of *P*_{1}(*Q*_{1}), *P*_{2}(*Q*_{2}),…, *P*_{x−1}(*Q*_{x−1}) and *z*_{1} of *Q*_{x} with highest average similarity in Equation (8). Then pick the columns *y*_{2} and *z*_{2} with the next highest similarity in Equation (8), ignoring similarity scores of pairs containing either of the previously chosen columns *y*_{1} and *z*_{1}. Repeat this procedure until *K* pairs of columns, one for the matrices *P*_{1}(*Q*_{1}), *P*_{2}(*Q*_{2}),…, *P*_{x−1}(*Q*_{x−1}) and one for *Q*_{x}, have been picked. Fix the permutation of *Q*_{x} that matches up these pairs of columns, or *P*_{x}(*Q*_{x}).

A candidate for the vector of permutations of the *R* runs that maximizes *H* across all possible vectors has now been constructed. This algorithm runs in time proportional to *T*_{LargeKGreedy} = [*R*(*R*−1)/2]*K*^{2}*C*. The number of pairs of columns, one from the run under consideration and one from the previously fixed runs, is *K*^{2}. For run *r* (*r* ranging from 2 to *R*), the number of computations of *G* performed for each pair of columns is *r*−1. Considering all runs from 2 to *R*, the total number of computations of *G* performed is *K*^{2}[*R*(*R*−1)]/2. Since *G* is computed only for columns rather than for whole matrices, the time of computation of *G* is proportional only to *C* rather than to *KC*, as in the other algorithms.

As is true for the *Greedy* algorithm, the order in which the runs are considered can affect the result. For the *LargeKGreedy* algorithm *CLUMPP* offers the same three options for selecting the input sequence of runs as it provides for the *Greedy* algorithm.
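The column-matching idea of Steps 2a and 2b can be sketched as follows for a single pair of runs. This is our illustrative code, not *CLUMPP*'s; in particular, the column-wise similarity `g_col` below is an assumption — it mirrors Equation (1) applied to single columns, taking the reference column *w* to have all entries 1/*K* — and it is undefined for a column equal to *w*.

```python
import math

def g_col(a, b, K):
    """Assumed column-wise analog of G: Equation (1) applied to two columns,
    with w the all-1/K column."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    da = math.sqrt(sum((x - 1.0 / K) ** 2 for x in a))
    db = math.sqrt(sum((y - 1.0 / K) ** 2 for y in b))
    return 1.0 - d / math.sqrt(da * db)

def column(Q, k):
    return [row[k] for row in Q]

def largek_greedy_pair(Q1, Q2):
    """Match the columns of Q2 to those of Q1 (Steps 2a-2b): repeatedly fix
    the highest-scoring unused column pair until K pairs are fixed."""
    K = len(Q1[0])
    scores = sorted(((g_col(column(Q1, y), column(Q2, z), K), y, z)
                     for y in range(K) for z in range(K)), reverse=True)
    perm, used_y, used_z = [None] * K, set(), set()
    for _, y, z in scores:
        if y not in used_y and z not in used_z:
            perm[y] = z  # column y of the result is column z of Q2
            used_y.add(y)
            used_z.add(z)
    return [[row[p] for p in perm] for row in Q2]

Q = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
swap = [[row[1], row[0]] for row in Q]
print(largek_greedy_pair(Q, swap) == Q)  # True
```

Steps 3a and 3b extend this matching to each subsequent run, averaging the column similarities over the previously fixed runs as in Equation (8).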

To get an idea of which algorithm to use, we have found it useful to compute the quantity *D* = *TN* for each algorithm, where *T* is a quantity proportional to the time required by an algorithm (as described above) and *N* is the number of input sequences to be tested (for *FullSearch*, *N* = 1). If *D* ≲ 10^{13} for *FullSearch*, then this algorithm is fast enough and is preferred; otherwise, the *Greedy* algorithm can be used. If *D* ≳ 10^{13} for the *Greedy* algorithm, then this algorithm is probably also too slow. In that case, the *LargeKGreedy* algorithm should be used, as it can handle *K* > 20, *R* > 100 and *C* > 1000 in reasonable time. We recommend testing at least 100 input sequences for the *Greedy* and *LargeKGreedy* algorithms—and preferably many more—to find the approximately highest SSC_{R} value.
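This rule of thumb can be sketched numerically. The helper below is illustrative (our name and structure, not *CLUMPP*'s) and uses the proportional run times *T*_{FullSearch} and *T*_{Greedy} given above with the ~10^{13} threshold.

```python
import math

def choose_algorithm(K, R, C, N):
    """Pick an algorithm by the D = T*N rule of thumb with threshold ~1e13."""
    pairs = R * (R - 1) // 2
    t_full = math.factorial(K) ** (R - 1) * pairs * K * C
    t_greedy = math.factorial(K) * pairs * K * C
    if t_full <= 1e13:  # FullSearch tests a single input sequence (N = 1)
        return "FullSearch"
    if t_greedy * N <= 1e13:
        return "Greedy"
    return "LargeKGreedy"

# The chicken example below: K = 19 clusters, R = 100 runs, C = 600 individuals.
print(choose_algorithm(19, 100, 600, N=100))  # LargeKGreedy
```

With *K* = 19, already *K*! ≈ 1.2 × 10^{17} exceeds the threshold, so only *LargeKGreedy* is practical for that data set.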

### Example

Rosenberg *et al.* (2001) analyzed a data set of 27 microsatellites genotyped in 600 chickens from 20 breeds. Using *STRUCTURE* (Pritchard *et al.*, 2000), they inferred membership coefficients of the 600 chickens in 19 clusters. By repeating the cluster analysis 100 times, they assessed the robustness of the clusters, and Figure 1A displays the cluster membership estimates from 25 of the 100 replicates (randomly chosen). From Figure 1A, the ‘label-switching’ problem is apparent. Focusing on Runs 15 and 16, we could quickly ‘match up’ the clusters (i.e. the colored segments). However, because of some genuine multimodality, it would be not only tedious, but also difficult, to visually align the clusters for *all* replicates in an optimal manner. *CLUMPP* is designed to address exactly this type of situation.

**Fig. 1.** Cluster membership estimates for 25 of the 100 replicate analyses of the chicken data: (**A**) unaligned replicates; (**B**) the same replicates after alignment by *CLUMPP*.

Using *CLUMPP* and the *LargeKGreedy* algorithm on the 100 replicates (10 sets of 100 random input sequences of runs—for each set this takes ∼6 min on a 2.4 GHz processor), we find the highest value of *H* to be in the range of 0.51–0.54 (and the highest value of *H*′ to be between 0.69 and 0.70). If we allow the algorithm to run longer (30 000 random input sequences), the highest *H* equals 0.5546. The highest-scoring input sequences tend to produce quite similar alignments of the replicates. Excluding the input sequence that produced the highest *H*-value, the next 10 highest-scoring input sequences all produced *H*-values of at least 0.5441, and had on average 2.0% differences compared to the input sequence that led to the highest *H*. In other words, considering the permuted position of a randomly chosen cluster (among 19) in a randomly chosen run (among 100) based on the output of *CLUMPP* using a randomly chosen input sequence (among the 10 next highest-scoring sequences, after the one with the highest *H*-value), the cluster had a 98.0% chance of being aligned in the same way that it was aligned when using the highest-scoring of all input sequences.

The permuted membership coefficients of the 25 runs in Figure 1A for the input sequence among the 30 000 that leads to the highest *H* are shown in Figure 1B. Comparing the permuted membership coefficients for this input sequence to the clustering behavior in Table 2 of Rosenberg *et al.* (2001), we recapture the same patterns of clustering. For example, Brown-egg layer lines C and D always share a cluster, and this cluster sometimes (in 8 of 100 runs) includes Godollo Nhx (Run 11 in Fig. 1).

When we consider the *LargeKGreedy* algorithm with 10 000 fixed input sequences (chosen randomly among the 30 000 described above), for each of the 10 000 sequences, the alignment of the replicates obtained by *CLUMPP* is identical regardless of which of two statistics—*H* or *H*′—is used. The same input sequence that produces the highest *H*-value (*H*=0.5508) produces the highest *H*′-value (*H*′=0.7099). Excluding the input sequence that produced the highest values of *H* and *H*′, the next 10 highest-scoring input sequences—the same sequences for both statistics—all produced *H*-values of at least 0.5392 and *H*′-values of at least 0.7022, and had on average 7.0% differences compared to the input sequence that led to the highest *H* and *H*′. These results suggest that although matrices can be constructed so that the two statistics can lead to different alignments, in practice, their properties are extremely similar. The larger number of differences for the highest-scoring input sequences from among 10 000 sequences compared to the highest-scoring among the 30 000 sequences described above highlights the importance of employing a large number of input sequences whenever possible.

## 3 DISCUSSION

Interpreting the results of multiple population genetic cluster analyses is not always straightforward (Corander *et al.*, 2004; Evanno *et al.*, 2005; Pritchard *et al.*, 2000; Rosenberg *et al.*, 2001). One could argue that the single replicate with the highest likelihood (according to the criterion of the clustering program) is the optimal clustering solution, and that solutions with lower likelihood can be discarded. However, for complex data sets that produce several distinct solutions with similar likelihoods, this strategy may not be satisfactory, as it disregards both uncertainties of the analysis and important population structure information that may only be visible in different modes. From a Bayesian perspective, a possible approach would be to weight different solutions by their likelihoods. However, this approach can be difficult to apply with individual membership coefficients in the case of genuine multimodality. A third approach is to summarize data on the outcomes of many replicates, such as in Table 2 of Rosenberg *et al.* (2001). Regardless of the choice of procedure for employing multiple cluster analyses to make biological interpretations from population-genetic data, *CLUMPP* can provide a useful way of assessing the similarity of the outcomes of individual runs.

We note that our focus here has been on the setting in which the clustering algorithm explores only one mode during the iterative process that constitutes a single replicate analysis, but may explore different modes in different replicates. As was recognized by Pritchard *et al.* (2000), this setting typically applies to the *STRUCTURE* approach, for which the results of individual replicates are reasonably straightforward to summarize and interpret, but for which difficulties may arise in interpreting differences across replicates. It is possible to envision an alternative scenario in which multiple important modes (potentially including multiple permutations of each of the genuinely distinct modes) are explored during the course of a single analysis. In this situation, replicate analyses would produce similar results, but because the meaning of a cluster label could vary across intermediate steps in the analysis, it would be challenging to summarize the outcome of any single replicate. Although our analysis has used the language of multiple replicate analyses rather than that of multiple intermediate steps of a single analysis, it is worth noting that the algorithms we have proposed for the matching of clusters can also be applied with each ‘replicate’ described above corresponding to a specific state of the membership coefficient estimates encountered during the course of a single analysis. Thus, in addition to their uses in interpreting multiple replicate analyses, our algorithms can augment existing approaches to the analysis of individual replicates (e.g. Dawson and Belkhir, 2001; Huelsenbeck and Andolfatto, 2007) and can contribute new approaches for the problem of summarizing the intermediate steps of a single run.

## ACKNOWLEDGEMENTS

We thank O. François and two anonymous reviewers for helpful comments on a previous version of the manuscript. This work was supported by NIH grant HL084729, by grants from the Burroughs Wellcome Fund and the Alfred P. Sloan Foundation, and by a postdoctoral fellowship to M.J. from the University of Michigan Center for Genetics in Health and Medicine.

*Conflict of Interest*: none declared.

## REFERENCES
