## Abstract

**Motivation:** We introduce a dual multiple change-point (MCP) model for recombination detection among aligned nucleotide sequences. The dual MCP model is an extension of the model introduced previously by Suchard and co-workers. In the original single MCP model, one change-point process is used to model spatial phylogenetic variation. Here, we show that using two change-point processes, one for spatial variation of tree topologies and the other for spatial variation of substitution process parameters, increases recombination detection accuracy. Statistical analysis is done in a Bayesian framework using reversible jump Markov chain Monte Carlo sampling to approximate the joint posterior distribution of all model parameters.

**Results:** We use primate mitochondrial DNA data with simulated recombination break-points at specific locations to compare the two models. We also analyze two real HIV sequences to identify recombination break-points using the dual MCP model.

**Availability:** A software program ‘DualBrothers’ implementing the dual MCP model is available in the form of a Java package at http://www.biomath.ucla.edu/msuchard/DualBrothers

**Contact:**msuchard@ucla.edu

**Supplementary information:**http://www.biomath.ucla.edu/msuchard/DualBrothers

## 1 INTRODUCTION

Recombination plays an important role in the evolution of almost all living organisms. In rapidly evolving viruses, homologous recombination is one way in which the viruses adapt quickly to changing environmental conditions (Worobey and Holmes, 1999). At least 10% of the circulating human immunodeficiency virus-1 (HIV-1) strains are believed to be recombinants containing genetic material from different viral subtypes (Robertson *et al*., 1995a,b). Recombination in HIV holds implications for vaccine development (Korber *et al*., 2001) and emerging drug resistance (Kellam and Larder, 1995). While point mutation processes have been used to study HIV immune response escape (Wei *et al*., 2003) and drug resistance development (Chen *et al*., 2004), the contributions of recombination are not well understood. Accurate knowledge of the frequency and location of recombination break-points may improve our understanding of these phenomena, but reliable and statistically rigorous methods are needed to provide this break-point information.

Most methods that can detect recombination from a multiple sequence alignment use statistical, phylogenetic procedures (Hein, 1990). These methods exploit the observation that if recombination occurred in the evolutionary history of a set of aligned sequences, then different segments of the alignment should support alternative phylogenies (Li *et al*., 1988). One of the most popular approaches to recombination detection is to slide a window along a sequence alignment and look for differences in the phylogenetic tree support within each window (Grassly and Holmes, 1997; McGuire *et al*., 1997; Husmeier and Wright, 2001). Although this approach can successfully detect recombination, it suffers from a multiple testing problem when assessing the significance of recombination (Suchard *et al*., 2002) and low resolution for locating recombination break-points, limited by the window size.

Husmeier and McGuire (2003) develop a Bayesian hidden Markov model (HMM), where the hidden states are phylogenetic trees and the observable states are consecutive nucleotide sites of a multiple sequence alignment. This method can predict recombination sites more accurately than sliding window methods as shown by the authors, but their current implementation is limited to only four sequences, because the dynamic programming, required for HMM inference, is computationally very expensive. This method also assumes that all regions of the alignment are under the same evolutionary pressure. This assumption is known to lead to false recombination identification under some circumstances (Dorman *et al*., 2002; Husmeier and McGuire, 2002). Finally, trees in the HMM are updated based only on the phylogenetic information from the neighboring sites. Therefore, noisy and sparse data can also reduce the accuracy of Bayesian HMM methods.

Suchard *et al*. (2003b) proposed a single multiple change-point (MCP) model that can capture spatial phylogenetic variation in both the trees and the evolutionary pressures. In this model, an alignment is partitioned into an unknown number of segments. Each segment has a vector of phylogenetic parameters associated with it, such as parameters describing the nucleotide substitution process and a bifurcating tree topology that specifies evolutionary relationships between sequences. End points between partitions are called change-points. Recombination is inferred if at least one change in topology is observed across a change-point. This model has been successfully applied to test recombination hypotheses in HIV strains (Suchard *et al*., 2002).

Modeling spatial variation of all parameters with a single change-point process results in prior correlation between sites where substitution parameters change and those where topologies change. This prior correlation can lead to biased break-point estimation when recombination occurs near the boundary of regions with different evolutionary pressures. When both substitution process parameters and topologies change at close, yet distinct, sites the single MCP will probably produce only one change-point in the neighborhood of the two true change-points. To overcome this difficulty, we develop a dual MCP model that decouples substitution process change-points from topology break-points by introducing two a priori independent change-point processes to describe spatial phylogenetic variation.

## 2 METHODS

### 2.1 Dual MCP model

We start with *N* aligned DNA or RNA sequences of length *S*. Columns of the alignment, also called sites, **Y**_{1}, **Y**_{2},…,**Y**_{S} serve as observations of the evolutionary process. Each site **Y**_{s} = (*Y*_{s1},*Y*_{s2},…,*Y*_{sN})′ contains a nucleotide or gap from each of the *N* sequences, such that *Y*_{sn} ∈ {*A*,*G*,*C*,*T*/*U*,−}, for *s* = 1,2,…,*S*, and *n* = 1,2,…,*N*. We follow conventional likelihood-based phylogenetic approaches (Felsenstein, 2004) and model the evolutionary process in terms of an evolutionary tree (τ,**T**) and a rate matrix **Q**, where τ is a bifurcating tree topology and **T** = (*t*_{1},*t*_{2},…,*t*_{2N−3}) is a vector of branch lengths of τ. Matrix **Q** = {*q*_{uv}}, *u*,*v* ∈ {*A*,*G*,*C*,*T*/*U*}, defines the rates for a continuous-time Markovian nucleotide substitution process along each branch of τ. We follow the parameterization of **Q** by Hasegawa *et al*. (1985),

_{A}, π

_{G}, π

_{C}, π

_{T}) is the stationary distribution of the nucleotides and the diagonal elements of

**Q**are set such that the rows of

**Q**sum to 0. The resulting finite-time transition matrix for substitutions is

**P**(

*t*

_{b}) =

*e*

^{tbQ}= {

*p*(

*u*,

*v*|

*t*

_{b})}, where

*p*(

*u*,

*v*|

*t*

_{b}) is the probability of nucleotide

*u*mutating to

*v*along branch

*b*,

*b*= 1,2,…,2

*N*−3. For identifiability between

*t*

_{b}and

**Q**, we fix β such that ∑

*q*

_{uu}π

_{u}= −1 and branch lengths are expressed in terms of the expected number of substitutions per site. The transition/transversion ratio κ = α/β remains a free parameter ranging from 0 to ∞. This parameterization differs from the parameterization used by Suchard

*et al*. (2003b) where transition rate α ∈ [0,1] plays the role of the free parameter. Although \boldsymbolπ can be estimated simultaneously with other model parameters, the resulting estimates normally differ little from the observed frequencies in the alignment. Therefore, we fix the stationary distribution π to the observed nucleotide frequencies in the alignment to avoid unnecessary computations (Li

*et al*., 2000). Felsenstein (1981) provides an efficient algorithm for integrating out gaps and computing the site likelihood

*f*(

**Y**

_{s}|τ(

*s*),

**T**(

*s*),

**Q**(

*s*)), where (τ(

*s*),

**T**(

*s*),

**Q**(

*s*)) are the evolutionary parameters associated with site

*s*. To reduce computations further, we assume a coalescent-like prior on branch lengths, such that

*p*(

*t*

_{b}(

*s*)) ∝ exp(−

*t*

_{b}(

*s*)/μ(

*s*)) (Sinsheimer

*et al*., 2003). We refer to the prior mean of branch lengths μ(

*s*) as the average divergence at site

*s*. Then

**T**(

*s*) can be integrated out of the likelihood as shown in Suchard

*et al*. (2003b) producing marginal probabilities of mutation

*p*(

*u*,

*v*|μ(

*s*)) and marginal likelihoods

*f*(

**Y**

_{s}|τ(

*s*),μ(

*s*),

**Q**(

*s*)). Assuming independence across sites conditional on the evolutionary parameters of the model, the total likelihood of the alignment becomes

**Q**on

*s*is introduced for a convenient representation of the total likelihood (2).

For statistical inference, it is very important to maintain a balance between the sample size of the data and the number of model parameters. This is the well-known bias-variance trade-off paradigm. Allowing independent parameters τ(*s*), μ(*s*) and **Q**(*s*) for each site *s* makes the unrealistic evolutionary assumption of site independence and leads to unreasonably large uncertainty in parameter estimation. On the other hand, fixing all parameters to be equal across sites ignores the natural spatial variation in the data, especially in the presence of recombination (Grassly and Holmes, 1997). Suchard *et al*. (2003b) introduce an MCP model that serves as a middle ground between these two extremes. In their model, the authors divide an alignment into an unknown number of contiguous segments, allowing parameters (τ, μ, **Q**) to differ across segments, but keeping them equal inside each segment. We further extend this model by decoupling tree topologies τ and substitution process parameters (μ, **Q**) into two separate change-point processes.

To define the unknown segments for both processes, let ρ_{j} for *j* = 0,…, *J* + 1 be the substitution process change-points that divide an alignment into *J* + 1 non-overlapping intervals, subject to the constraint 1 = ρ_{0} < ρ_{1} < ⋯ < ρ_{J} < ρ_{J+1} = *S*+1. We allow substitution parameters to vary only across change-points, keeping them constant inside each segment. Specifically, (μ(*s*),**Q**(*s*)) = (μ_{j}, **Q**_{j}) for all *s* ∈ [ρ_{j−1}, ρ_{j}). Similarly, we introduce topology break-points ξ_{m}, for *m* = 0,…, *M*+1, subject to the constraint 1 = ξ_{0} < ξ_{1} < ⋯ < ξ_{M} < ξ_{M+1} = *S* + 1 with τ(*s*) = τ_{m}, for all *s* ∈ [ξ_{m−1}, ξ_{m}). To ensure ξ_{m} is truly an identifiable break-point, we place the additional constraint τ_{m} ≠ τ_{m+1}, for all *m* ∈ {1,…,*M*}. A similar constraint on (μ_{j},**Q**_{j}) is unnecessary because the probability that two independent continuous random variables are equal across a change-point is zero. The random change-point ρ = (ρ_{1},ρ_{2},…,ρ_{J}) and break-point ξ = (ξ_{1},ξ_{2},…,ξ_{M}) positions are independent of each other and may coincide. We illustrate this decoupling process in Figure 1. In this figure, a possible realization of change-points from the single MCP model is represented as a sequence of square flags on a multiple sequence alignment. The two change-point processes of the dual MCP model are shown on the bottom, with circular and triangular flags denoting substitution process change-points and topology break-points, respectively. After decoupling, substitution process change-points and topology break-points do not necessarily align with the original single MCP change-points.

### 2.2 Priors

To complete our model specification, we assume independent truncated Poisson priors on both the number of random topology break-points *M* and the number of random substitution process change-points *J*, such that

*et al*., 2001). Given

*J*and

*M*, locations of break-points ξ and change-points ρ are a priori uniform over all possible unordered selections of

*M*or

*J*locations from

*S*−1 choices, such that

For a small number of sequences *N*, we include all possible *E*_{N} = (2*N* − 5)!/2^{N−3}/(*N*−3)! unrooted choices (Felsenstein, 1978) in the space of tree topologies Ω considered by the model. However, when handling a larger number of sequences, we suggest two ways to reduce the size of Ω. First, if the phylogenetic relationship between the potential parental sequences that could have recombined to produce the putative recombinant is known, we restrict Ω to the 2*N* − 5 topologies that could result by adding a new leaf to any branch in the fixed parental tree with *N* − 1 taxa (Suchard *et al*., 2002). The second method exploits the fact that only a few tree topologies are supported by the data. All other topologies can be excluded from the analysis. To identify this subset of topologies, we follow Haake *et al*. (2004) and use MrBayes (Huelsenbeck and Ronquist, 2001; Paraskevis *et al*., 2003) to precalculate the posterior probability distribution of tree topologies for several small alignments created by breaking the full alignment into subsections. If a tree garners a posterior probability greater than some fixed threshold for at least one of the alignment subsections, it is included in the tree space Ω. In all three of the above approaches, we assume τ_{m} is drawn from a uniform prior distribution over Ω.

Evolutionary parameters κ_{j} and μ_{j} for individual segments are a priori independent and come from log–normal distributions:

_{κ},

_{μ}and

_{κ}= 2,

_{μ}= −2,

_{j}and μ

_{j}. In particular, the prior median of κ

_{j}is 7 and κ

_{j}∈ (1, 50) with 95% probability; the prior median μ

_{j}is 0.1 and μ

_{j}∈ (0.003, 7) with 95% probability. If we expect four or more segments with varying evolutionary pressures, we continue the hierarchical construction by assuming diffuse, conjugate hyperpriors on ν

_{κ},

_{μ}and

### 2.3 Sampling algorithm

In Bayesian analysis, one attempts to describe the posterior distribution of all model parameters given the observed data and then use this distribution for estimation or hypothesis testing. The parameters in our dual MCP model can be represented as

_{1},…,τ

_{M+1}), κ = (κ

_{1},…,κ

_{J+1}), μ = (μ

_{1},…,μ

_{J+1}) and

Since analytic expression of our dual MCP model posterior is intractable, we employ Markov chain Monte Carlo (MCMC) to generate random samples from the posterior (Tierney, 1994). When simulating from a posterior distribution via MCMC, the states of the Markov chain are points in the parameter space of the model and the proportion of time the chain spends at each state approximates the posterior probability (density) of this state. In an MCP model, the dimension of the parameter space is not fixed, but depends on the number of change-points in the model, necessitating a means of transitioning between states with different numbers of components. Green (1995) has developed a reversible jump MCMC (rjMCMC) algorithm by extending the Metropolis–Hastings (MH) sampling scheme (Hastings, 1970) to allow moves between spaces of different dimensions. This method introduces auxiliary variables to construct a bijective map between parameter states with unequal numbers of components. This dimension-matching procedure is concluded by adjusting MH acceptance probabilities with the Jacobian of the transformation.

In developing an rjMCMC sampler for Equation (8), we follow the scheme introduced by Suchard *et al*. (2003b) adjusting it when necessary. In particular, steps involving adding or removing change-points are modified to accommodate two kinds of change-points, and the proposal distributions for continuous parameters are altered to improve convergence of the chain. We describe our general sampling scheme first and then focus the attention of the readers on the differences between the MCMC implementations for the single and the dual MCP models.

Our sampler achieves mobility between spaces with different dimensions by proposing birth and death steps for both substitution process change-points and topology break-points. At each step, one of the following moves is attempted: inserting or deleting topology break-point(s), inserting or deleting a substitution process change-point or updating all model parameters conditional on the current values of *J* and *M*. The probabilities of choosing a particular move follow with slight modification from the single MCP sampler. Proposals for changes in the number of substitution process change-points replicate the equivalent proposals for the single MCP sampler, but changes in the number of break-points necessitate novel proposals in order to preserve inequality between adjacent topologies. We demonstrate this requirement with a simple example. Suppose the current state θ in the Markov chain has three break-points separating four topologies: (τ_{1}, τ_{2}, τ_{3}, τ_{4}) = (*A*,*B*,*A*,*B*), where *A* ≠ *B*. If the sampler proposes to remove the single break-point ξ_{2} that separates topologies τ_{2} and τ_{3}, then two segments collapse into one. The new segment should reasonably inherit its topology from either of the original segments with some probability, but both possible realizations (A,B,B) or (A,A,B) from this proposal violate the identifiability restriction: τ_{m} ≠ τ_{m+1}, for all *m*. To avoid this problem, we propose to add or delete two break-points in a single step. When the sampler encounters the situation above, it also removes the additional break-point that is producing the violation. To preserve detailed balance, we introduce a complementary birth step that adds two topology break-points at a time. A tuning parameter *c* is employed to adjust the fraction of time that the sampler uses the double rather than single break-point birth step. In our experience, setting *c* = 1/*E*_{N} results in consistent and satisfactory mixing of topology break-points.

We update (ξ,τ,ρ,κ,μ,ϕ) conditional on *M* and *J* in a Metropolis-within-Gibbs cycle. During this step, we sequentially propose new values for tree topologies and substitution model parameters for each partition, accepting or rejecting them according to the MH rule. All updates are as in Suchard *et al*. (2003b) except new values of κ_{j} and μ_{j} are proposed by generating a uniform random variable *U* ∈ [0,1] and multiplying current values of κ_{j} and μ_{j} by *e*^{U − 0.5}. Under this scheme, our sampler can take bigger jumps as current values of the variables get larger. These long jumps allow for faster exploration of the parameter state space. After the dual MCP sampler updates model parameters for all partitions, the locations of any change-points and break-points are updated. If we let hyperparameters ϕ vary, they are also updated during this round. Since full conditional distributions of ν_{κ},

_{μ}and

*et al*., 2003a), we use Gibbs sampling to update them.

We generate MCMC chains of length 2 100 000. Each chain starts with one partition (i.e. no change-points or break-points). Values for κ_{1} and μ_{1} are generated uniformly from the intervals (0,100) and (0,1), τ_{1} is drawn uniformly from Ω. The first 100 000 iterations of each chain are discarded as burn-in and every 200th iteration thereafter is saved, resulting in posterior samples containing 10 000 draws.

### 2.4 Bayes factors

Bayesian model selection is often accomplished by comparing prior and posterior probabilities of competing hypotheses via Bayes factors (Kass and Raftery, 1995). We adopt this approach to test for the presence of recombination in the evolutionary history of a putative recombinant. Let *H*_{1} = {*M* > 0} be a hypothesis postulating that there is at least one site in an alignment where recombination has occurred, and let *H*_{2} = {*M* = 0} be the alternative hypothesis in which the evolutionary history under investigation does not contain recombination. Then the Bayes factor in favor of recombination is

*M*approximately follows a Poisson distribution with mean δ, we obtain

*Pr*(

*M*> 0)/

*Pr*(

*M*= 0) ≈

*e*

^{δ}− 1. The posterior probability of recombination

*Pr*(

*M*>0|

*) can be directly estimated as the fraction of MCMC simulants satisfying the condition*

**Y***M*> 0. The simplicity of this procedure is deceptive, because the standard error of such an estimator can be quite large when

*Pr*(

*M*>0|

**) approaches 0 or 1 (Weiss**

*Y**et al*., 1999). To minimize this standard error, Carlin and Chib (1995) propose adjusting prior probabilities of competing hypotheses so that a posteriori the hypotheses are approximately equiprobable. In line with this idea, we employ the logistic regression model for Bayes factor estimation introduced by Suchard

*et al*. (2005). This approach benefits from averaging across several values of prior odds to arrive at a more efficient estimate of the Bayes factor.

## 3 DATA

To demonstrate the advantages of the dual MCP sampler over the single MCP sampler, we examine three datasets containing either simulated data or real sequence alignments including HIV recombinants.

We start with a previously used test example involving mitochondrial DNA (mtDNA) coding subunits 4 and 5 of the NADH-dehydrogenase enzyme and three transfer RNAs (tRNAs) from humans (H), orangutans (O), squirrel monkeys (S) and lemurs (L) (Hayasaka *et al*., 1988). Previous phylogenetic studies report the consensus tree (H,O,(S,L)) as the evolutionary relationship among these taxa (Yang and Rannala, 1997; Larget and Simon, 1999; Suchard *et al*., 2001). We construct an artificial alignment from these data, where sites are rearranged to form four distinct partitions: the first three partitions comprise the three codon positions from the protein-coding region of the alignment, while the last partition consists of tRNA sites. Larget and Simon (1999) have demonstrated that these partitions differ greatly in their evolutionary pressures. The greatest evolutionary divergence can be observed in the third (codon) partition where silent mutations are common. We therefore expect, among others, a substitution process change-point around site 464, the starting site for partition three. To investigate the effect of distance between change-points and break-points on the accuracy of recombination detection, we generate 18 alignments, each with a single simulated recombination break-point. We simulate the recombination event by permuting the nucleotides from the H and the L sequences starting at a fixed site in each alignment. Such permutations change the inferred relationship between taxa to the alternative topology (L,O,(S,H)) after the chosen site. The artificial break-points are placed every 10 sites starting at 405 and ending at 575.

We also apply the dual MCP model to two real HIV datasets. The first dataset consists of a portion of the *gag* gene from HIV-1 isolate RW024 (accession number U86548) aligned with the eight HIV-1 subtype consensus sequences A, B, C, D, F, G, H and J from the Los Alamos HIV Database. The alignment contains 729 sites encoding *gag* proteins p17 and p24. We assume that the phylogenetic tree describing evolutionary relationships among consensus subtype sequences is fixed and equals ((((((A,H),G),J),C),F),B,D) as reported by Robertson *et al*. (1999). This assumption allows us to consider only those topologies that can be obtained by adding sequence RW024 to any branch in the subtype tree, thus reducing the size of the tree space Ω to 13 possible topologies. The isolate RW024 has been analyzed by Cornelissen *et al*. (1996) who found that different portions of the *gag* gene (p17, p24) support different parental heritage (A, H) of the isolate. Dorman *et al*. (2002) and Suchard *et al*. (2002) find support in favor of recombination generating this isolate with a *P*-value <0.0001 and a Bayes factor of 10^{19}, respectively.

In the third example, we analyze the full genome of HIV-1 isolate KAL153 from Kaliningrad, Russia (accession number AF193276). Liitsola *et al*. (1998) show that the *gag* and *env* genes of this isolate originated from subtypes A and B, respectively. The resulting A/B recombinant viral strain is suspected to be the causative agent of an explosive HIV-1 epidemic in Kaliningrad among intravenous drug users. We use an alignment of the KAL153 isolate along with subtypes A, B and F consensus sequences.

Both of the HIV putative recombinants have been successfully analyzed in a single MCP framework, but correlation between the inferred locations of substitution process change-points and topology break-points is observed (Suchard *et al*., 2002; 2003b). A dual MCP model analysis should reveal if this correlation is supported by the data or is an artifact of the a priori correlation between change-points and break-points from which the single MCP model suffers.

Although all the data used in our examples can be found in public databases, the multiple sequence alignments may not be easily reproduced without the knowledge of original alignment algorithm settings. Therefore, we make the unpermuted alignment of the mtDNA and alignments of the putative HIV recombinants available from http://www.biomath.ucla.edu/msuchard/DualBrothers as Supplementary information.

## 4 RESULTS

### 4.1 Prior choice and sensitivity analysis

Most parameters in the dual MCP model receive non-informative prior distributions, requiring little if any a priori knowledge to specify. However, the numbers of change-points and break-points follow Poisson distributions that can be tuned to incorporate prior information. Suchard *et al*. (2003b) outline one procedure of choosing a prior mean for the number of change-points in the single MCP model. Since most of the change-points in the single MCP model correspond to substitution change-points in the dual MCP model, we readily apply these guidelines to specify the prior mean number of substitution change-points λ. Following the suggestions of Suchard *et al*. (2003b) we assign λ roughly to the number of boundaries between genes or gene products in the alignment. In the primate mtDNA example we set λ = 3 corresponding to the four artificial partitions of the nucleotide sites. For the RW024 HIV example, we expect a priori λ = 1 substitution process change-point somewhere near the boundary between the two *gag* gene products, and for the KAL153 analysis, we recall that the alignment contains all 10 HIV genes, suggesting λ = 9 divisions. Usually, no prior information about recombination is available, especially for newly sequenced strains. Therefore, in both our synthetic and real examples, we set the prior mean number of break-points δ = log 2; this translates into a prior probability of at least one recombination point

Naturally, the results of any Bayesian analysis depend on the choice of prior distributions (O'Hagan, 2003). The more informative priors, *p*(*J*) and *p*(*M*) in our case, tend to have the greatest impact on results. After perturbing the prior mean number of substitution change-points in the primate example, we find that despite the differences in the posterior distributions of *J*, posterior profiles of the evolutionary parameters remain robust to the choice of λ (see Supplementary information at http://www.biomath.ucla.edu/msuchard/DualBrothers). Since the number of topology break-points is the most important parameter for recombination detection, we examined its sensitivity to the choice of prior in more detail. We apply the single and dual MCP models to the KAL153 example nine times by varying the prior probability of at least one recombination break-point from 0.1 to 0.9. Manipulating this prior probability is trivial in the dual MCP model since *Pr*(*M* > 0) ≈ 1 − *e*^{−δ}. On the contrary, controlling the prior probability of recombination is rather challenging in the single MCP model (Suchard *et al*., 2002) as this probability is a function of both the prior mean number of change-points and the probability that two adjacent segments share the same topology. In our sensitivity analysis we set the former to 9 and vary the latter. Vertical bars in Figure 2 denote the posterior probability that the number of topology break-points attains values 2, 3 and 4. Other values are not shown as they do not gain substantial posterior support under either single or dual MCP models. Both models support two topology break-points under conservative priors on recombination. As the prior probability of recombination increases, the single MCP model more quickly favors the next most probable configuration with four topology break-points than the dual MCP model. The two additional break-points are located near the boundary of *pol* and *vif* genes as can be seen in Figure 5. This region contains very few sparsely distributed informative sites. This sparseness explains the sensitivity of the MCP models to the prior in this part of the alignment. We conclude that the contribution of the prior to posterior estimates of the number of topology break-points is minimal in the presence of sufficient phylogenetic information, but can be significant when the data are sparse or noisy. The fact that even for high values of the prior probability of recombination the posterior mode of *M* under the dual MCP model remains at 2 in contrast to the results of the single MCP analysis indicates that the dual MCP model is more robust to misspecification of the prior probability of recombination.

### 4.2 Results of posterior simulations

We first discuss the simulation study aimed at demonstrating the improved accuracy of the dual MCP model. In Figure 3, we plot the inferred against simulated recombination sites using the single and dual MCP models (open circles). Ideally one expects all plotted points to lie on the line *y* = *x*, indicating perfect estimation of recombination sites. The dashed horizontal line in each plot marks site 464, where a substitution process change-point is inferred by the dual MCP. The single MCP model shows a strong attraction between the inferred recombination sites and the substitution process change-point, with inferred recombination sites clustering along the dashed horizontal line. The dual MCP model yields more accurate inference with small variation about the diagonal. To interpret the off-diagonal variation, we define two classes of topologically informative sites in the alignment, those supportive or contradictory of the simulated recombinant structure. A site is called supportive when it supports the consensus topology and sits left of a simulated break-point or when it supports the alternative topology and lies right of the break-point. Supportive sites provide both the single and the dual MCP algorithms information with which to infer the location of recombination. Sites contradictory to the recombinant structure are sites that support the alternative topology to the left of the break-point or support the consensus topology to the right of the break-point. Careful thought shows that all sites can be classified based solely on the topology supported by the unpermuted data and regardless of the actual break-point location. We mark supportive and contradictory sites in Figure 3 as light and dark grey circles, respectively. Sites that do not support either of the two competing topologies are not plotted. As expected, the greatest inaccuracies in detection occur when the simulated recombination events are located in uninformative regions, especially those bordered by contradictory sites.

We analyze the RW024 putative recombinant with the single and the dual MCP models and summarize the results of our posterior simulations in Figure 4. For each site in the alignment we plot the marginal posterior probabilities of the tree topologies as well as medians and 95% Bayesian credible intervals (BCIs) for κ and μ. In Figure 4, arrows mark distinctions in the results between the two models. As estimated by the single MCP model, changes in the evolutionary substitution process and the most probable topology effectively occur at the same location. Over 95% of the change-points located between nt 280 and 320 are both substitution process change-points and topology break-points. Decoupling of the MCP process results in a shift in the estimated break-point and change-point in opposite directions. Change-points and break-points located between sites 280 and 320 coincide only 0.3% of the time during MCMC simulation. Both models predict a higher κ in the 3′ end of the genome. The BCIs of κ calculated under the dual MCP model are larger than those of the single MCP model for sites in the middle of the alignment. Visual examination of this alignment region reveals that the dual MCP model correctly identifies a region with a relatively high transition/transversion ratio κ, but lack of information leads to great uncertainty about its actual value. The single MCP model places the substitution process change-point further 5′, near the topology break-point; therefore, averaging κ among sites with high and low transition/transversion ratio in the region downstream of the change-point.

We plot the results of posterior simulations for the KAL153 putative recombinant in Figure 5. This figure follows the same arrangement as in Figure 4. As before, differences in performance of the single and the dual MCP models are marked by arrows. The dual MCP model estimate of the first recombination site lies 5′ of the single MCP estimate. Both models identify a strongly supported (A,B,A) recombinant structure and a small region on the boundary of *pol* and *vif* with somewhat uncertain origin. A Bayes factor calculated under the dual MCP model amounts to 10^{77} providing decisive support for at least one recombination point in the KAL153 sequence. We find no evidence of recombination when we substitute KAL153 with a pure subtype A representative (see Supplementary information at http://www.biomath.ucla.edu/msuchard/DualBrothers).

Table 1 reports estimates of recombination break-points together with their 95% BCIs calculated under both models. In the KAL153 example, despite the increase in the number of parameters, the dual MCP break-point estimates have roughly the same size BCIs as the estimates calculated under the single MCP model. Therefore, the dual MCP model improves the accuracy of recombination detection while preserving the precision of the single MCP model. However, the topology break-point in the RW024 has a noticeably larger BCI under the dual MCP model. The roots of the increased uncertainty lie not only in the different complexities of the two models, but also in differences in the utilization of sequence information. In this example, it is not surprising that the change-point inferred by the single MCP has smaller posterior variance since its location was deduced from both topological incongruence and heterogeneous sequence divergence. Pulling these events together may not be advantageous. We believe that recombination inference should be based solely on the discordance in tree topologies since the contribution of other sequence features to recombination has not been rigorously established.

We report convergence of our MCMC sampler for the KAL153 example using the Gelman–Rubin potential scale reduction factor (PSRF), a convergence statistic based on a comparison of the within-chain and between-chain variances from multiple MCMC runs (Gelman and Rubin, 1992). We follow Brooks and Giudici (1998) and identify site-specific transition/transversion ratios κ(1),…,κ(*S*) as a set of parameters that retain their interpretation even as the dimension of the parameter space changes. We compute PSRFs using 10 independent MCMC chains with overdispersed starting values. Figure 6 plots point estimates of the PSRFs and the corresponding 97.5% quantiles for each site in the alignment. The closeness of the estimates to 1 indicates good convergence of the sampler. The dual MCP model shows better overall convergence of the site-specific transition/transversion ratios, in spite of the fact that both samplers use the same transition kernels.

## 5 DISCUSSION

In this paper, we improve the accuracy of recombination detection afforded by Bayesian phylogenetic methods. Accuracy is improved by modeling spatial phylogenetic variation with two independent change-point processes, an extension of the single change-point process used by Suchard *et al*. (2003b). One process models changes in the tree topology along a multiple sequence alignment; the other process allows nucleotide substitution pressures to vary. Decoupling these processes eliminates the prior correlation between locations of changes in topology and evolutionary pressure, yielding more accurate estimation of both change-point types. The dual MCP model inherits a major strength from the original MCP model in its realistic modeling of spatial phylogenetic variation using a parsimonious number of parameters. In addition, similar to the single MCP model, the dual model allows simultaneous estimation of regions with different evolutionary pressures, uncertainty in topologies and locations of recombination sites.

Analysis of alignments with recombination simulated in an area with a significant change in evolutionary pressure shows that dual MCP estimates of break-points are more accurate than single MCP results. The recombination sites inferred under the single MCP model are clearly attracted to the evolutionary change-point location owing to the prior correlation between the two kinds of change-points. In the RW024 example under the single MCP model, the estimates of change in evolutionary pressure and recombination site locations effectively coincide. In contrast, the dual model predicts the substitution change-points and topology break-points to occur at distinct and fairly distant positions along the genome. Similar separation of substitution change-points and topology break-points are observed in the KAL153 example within the *pol* coding region. We conclude that strong posterior correlation between locations of change in evolutionary pressure and recombination is an artifact of the single MCP model. The dual MCP model should result in more accurate estimation of locations of substitution change-points and topology break-points. Moreover, as the KAL153 examples show, the dual MCP model is less sensitive to the choice of prior on recombination in the presence of sparse and noisy data.

We observe improved convergence of the rjMCMC sampler after decoupling of change-points. Because the dual MCP model selects more accurate partitions of the data, it may produce better sampler mixing by regularizing the posterior landscape. For example, if the marginal posterior distribution of the transition/transversion ratio κ is bimodal in a given partition, splitting this partition by adding an appropriate change-point may result in two new partitions with unimodal posterior distributions of κ. Two unimodal distributions may be more efficiently explored by the rjMCMC sampler. Since the dual MCP model identifies partition boundaries more accurately, fewer multimodal distributions and faster exploration of the posterior distribution of model parameters are expected.

In addition to the increased accuracy in recombination site identification, using two change-point processes allows us to model recombination sites explicitly as model parameters. This explicit representation of recombination makes estimation and hypothesis testing more rigorous and the specification of the prior on recombination sites more flexible. For instance, it is now possible to use a site-specific recombination prior, where each site is explicitly assigned a prior probability of being a topology break-point. Such prior specification allows one to incorporate information from previous recombination detection studies, directing the algorithm to regions where recombination is more likely to occur. We believe that the dual MCP shows great promise for accurately detecting recombination and finding patterns in the spatial distribution of recombination sites.

**Fig. 1**

**Fig. 1**

**Fig. 2**

**Fig. 2**

**Fig. 3**

**Fig. 3**

**Fig. 4**

**Fig. 4**

**Fig. 5**

**Fig. 5**

**Table 1**

Single MCP | Dual MCP | ||
---|---|---|---|

Median | 95% BCI | Median | 95% BCI |

KAL153 example | |||

1886 | 1666–1988 | 1825 | 1669–1985 |

7643 | 7631–7677 | 7652 | 7631–7678 |

RW024 example | |||

306 | 290–313 | 298 | 283–337 |

Single MCP | Dual MCP | ||
---|---|---|---|

Median | 95% BCI | Median | 95% BCI |

KAL153 example | |||

1886 | 1666–1988 | 1825 | 1669–1985 |

7643 | 7631–7677 | 7652 | 7631–7678 |

RW024 example | |||

306 | 290–313 | 298 | 283–337 |

**Fig. 6**

**Fig. 6**

We thank Benjamin Redelings for helpful discussions and Robert Rovetti for careful reading of the manuscript. We are also grateful to the two anonymous reviewers for their helpful comments and suggestions that greatly improved the manuscript. This work was supported by NIH grant GM068955, UCLA AIDS Institute and the James B. Pendleton Charitable Trust.

## REFERENCES

*env*show high frequency of recombinant

*gag*genes.

## Comments