## Abstract

**Motivation:** Recombination plays an important role in the evolution of many pathogens, such as HIV or malaria. Despite substantial prior work, there is still a pressing need for efficient and effective methods of detecting recombination and analyzing recombinant sequences.

**Results:** We introduce Recco, a novel fast method that, given a multiple sequence alignment, scores the cost of obtaining one of the sequences from the others by mutation and recombination. The algorithm comes with an illustrative visualization tool for locating recombination breakpoints. We analyze the sequence alignment with respect to all choices of the parameter α weighting recombination cost against mutation cost. The analysis of the resulting cost curve yields additional information as to which sequence might be recombinant. On random genealogies Recco is comparable in its power of detecting recombination with the algorithm Geneconv (Sawyer, 1989). For specific relevant recombination scenarios Recco significantly outperforms Geneconv.

**Availability:** Recco is available at

**Contact:**jmaydt@mpi-inf.mpg.de

## 1 INTRODUCTION

In comparison with tree-based phylogenetic analysis procedures, procedures for analyzing recombination are immature. Recent power studies on recombination detection methods uncovered that it can be hard even to decide whether there is recombination in a set of aligned sequences (Posada and Crandall, 2001; Wiuf *et al*., 2001). The questions, which sequence is the recombinant and where there are recombination breakpoints, are even more challenging.

Methods for analyzing recombination in molecular sequences fall into at least four categories: (1) recombination detection methods, (2) methods for deriving bounds on the number of recombination events (Song and Hein, 2005; Song *et al*., 2005), (3) network methods such as SplitsTree (Huson, 1998) and (4) inference methods based on the coalescent (Kuhner *et al*., 2000; Fearnhead and Donnelly, 2001). Each category is appropriate for different problem settings and can provide independent information on the recombination signal contained in a dataset. As recombination detection programs are conceptually closest to our approach, we limit our comments to this category in the following paragraphs.

In the past 20 years, more than 20 methods have been developed for detecting the presence of recombination in a sequence dataset. More details and an evaluation of their accuracy can be found in recent power studies (Brown *et al*., 2001; Posada and Crandall, 2001; Wiuf *et al*., 2001) and in book chapters (Salminen, 2003; Husmeier and Wright, 2004) dealing with recombination detection. Links to implementations are on the website of D. Robertson ().

The earliest methods use statistical tests for checking whether substitutions in the alignment are non-uniformly distributed, i.e. whether substitutions are significantly clustered owing to recombination, back mutation or other effects. Popular methods using this principle are the maximum χ^{2}-test (Smith, 1992; Spencer, 2003) and Geneconv (Sawyer, 1989). Geneconv searches for the longest conserved fragment between two sequences and determines whether it is significant. Extensions also allow for including mutations in the fragments. Despite the lack of any explicit model of evolution, substitution distribution methods are quite competitive (Posada and Crandall, 2001), with Geneconv performing as one of the best methods.

Many other methods detect a change of the phylogenetic distance signal in adjacent areas of the alignment. Some popular methods are PLATO (Grassly and Holmes, 1997), TOPAL (McGuire and Wright, 2000), PhyPro (Weiller, 1998) and SimPlot (Lole *et al*., 1999). These methods either use a global reference tree or a sliding window to detect local changes in the topology of the phylogenetic distance signal. A global reference tree is problematic for strong recombination signals, as a dominant phylogenetic tree cannot be identified anymore and artifacts are introduced into the tree (Schierup and Hein, 2000). A fixed window size determines the trade-off between localizing the breakpoints accurately and the ability to correctly infer recombination. Even though all these approaches use a model of (tree-like) evolution, they lack a model for recombination. Consequently, they only look for indirect evidence of recombination and thus may falsely detect recombination.

Another group of methods is more closely related to the coalescent framework as they infer a restricted version of the evolutionary history subject to recombination. RecPars (Hein, 1993) allows for a different tree-like evolutionary history at each position and tries to heuristically minimize the cost for substitution at each position under the associated tree as well as the cost for topology changes along the sequence of trees. Thus, RecPars does away with the window-size parameter, but adds recombination and mutation cost parameters. Husmeier and McGuire (2003) translate the idea of RecPars into a statistical framework and maximize the likelihood of topology changes and mutations. While being accurate, a major drawback of this approach is the high computational effort that makes this method inapplicable for datasets with more than a few sequences.

We present Recco, a fast and simple method for detecting recombination in a set of sequences and locating putative recombination breakpoints that is based on cost minimization and dynamic programming. The basic idea is to construct each sequence in the alignment (temporarily considered the putative recombinant) in turn, from the other sequences in the alignment using only the mutation and recombination operators. The output of Recco bears some resemblance with SimPlot (Lole *et al*., 1999) and represents local sequence similarity of the putative recombinant with the other sequences. In contrast to SimPlot, there is no need for a sliding window and hence no limitation of spatial resolution. The minimum cost solution identifies the best recombination breakpoints and also the parental sequences. Recco has only two tunable parameters, recombination and mutation cost. In practice, we only need to consider a single parameter α representing the cost of mutation relative to recombination. We present an approach for finding the values for α at which the solution changes structurally. This can be condensed further into a single indicator for the presence of recombination in the alignment.

The dynamic program we use is based on the work of Kececioglu and Gusfield (1998). It employs insertion, deletion, recombination and mutation for producing a single sequence from two other sequences. Our method is a restricted version of the dynamic programs introduced in Lajoie and El-Mabrouk (2005) and Spang *et al*. (2002). Lajoie and El-Mabrouk use recombination, mutation and gene conversion in order to explain a single sequence from an alignment of other sequences. Spang *et al*. (2002) allow for insertion, deletion, recombination and mutation in order to derive a single sequence from an alignment of other sequences. Even though such a more general formulation is usually preferable, our visualization technique introduced in Section 2.3.1 depends on the restrictions that we have imposed. More general formulations would need at least three dimensions for visualization. The parametric analysis which we describe in Section 2.3.2 is also unique to our work, but in principle applicable to other methods as well. For example, RecPars could benefit from this type of analysis.

## 2 METHODS

In order to illustrate our method, we use the same sequence data throughout the discussion. The data were generated by the program Seq-Gen (Rambaut and Grassly, 1997) using the genealogy depicted in Figure 1. First we eliminate the sequence R2 from the dataset. Then the dataset does not contain two similar recombinants which simplifies the problem for purposes of exposition. The general problem is discussed in the results section. Hence, sequence R1 is the only recombinant and is derived from an ancestor of A1 to the left and an ancestor of B1 to the right. The breakpoint is located in the middle of the alignment. The total length of the alignment is 100 nt. We used the Jukes–Cantor model of nucleotide evolution (Jukes and Cantor 1969) with an average number of 0.25 substitutions per site from the root to any of the tips of the genealogy. The full dataset including sequence R2 is analyzed in the Results section.

**Fig. 1**

**Fig. 1**

### 2.1 Recombination analysis through cost minimization

Let *A* be a multiple sequence alignment with *M* rows and *N* columns, with *A _{i}* representing the

*i*-th sequence and

*A*denoting the

_{ip}*p*-th nucleotide of the

*i*-th sequence. Let

*S*=

*s*

_{1}, … ,

*s*be the sequence of the putative recombinant and define

_{N}*m*(

*i*,

*p*) as the cost of substituting

*A*with

_{ip}*s*. Recombination cost is measured by

_{p}*r*(

*i*,

*j*,

*p*) and represents the cost of recombining

*A*on the left side of position

_{i}*p*with

*A*on the right side, hence resulting in the sequence

_{j}*A*

_{i}_{1}, … ,

*A*

_{i,p}_{−}

_{1},

*A*, … ,

_{jp}*A*. Usually it makes sense to set

_{jN}*r*(

*i, i, p*) = 0 for all

*i*and

*p*, as this does not represent a visible recombination. Our goal is to find a sequence of mutation and recombination events that produce

*S*from the sequences in

*A*. In the formal model we use a decision variable

*k*∈ {1, … ,

_{p}*M*} to determine the sequence of

*A*explaining

*s*, so that

_{p}*S*is explained by $Ak1,1,\u2026,AkN,N$. The most parsimonious solution then minimizes the following cost function with respect to

**k**= (

*k*

_{1}, … ,

*k*):

_{N}**k**infers a mutation event for each position

*p*where $Akp,p\u2260sp$ and a recombination event for each

*p*where

*k*

_{p}_{−1}≠

*k*. The parameter α ∈ [0,1] weighs mutation against recombination cost and accounts for the fact that recombination can always be explained by mutation. If α is close to zero the solution favors mutations. For α = 1 recombination costs nothing and the optimal solution uses the nucleotide

_{p}*A*most similar to

_{ip}*s*at each position

_{p}*p*. In the following, we use unit cost for recombination and mutation, i.e.

*r*(

*i*,

*j*,

*p*) = δ(

*i*,

*j*) and

*m*(

*i*,

*p*) = δ(

*A*,

_{ip}*s*) where δ is the Kronecker delta function.

_{p}While α selects the global preference for recombination against mutation, *m*(*i*, *p*) and *r*(*i*, *j*, *p*) can be adapted to model local changes in recombination and mutation frequency and lead to a simple strategy for incorporating prior knowledge on the recombination and mutation process. A mutation hotspot or variable region *p*′ might have a lower cost *m*(*i*, *p*′) for substitution than other positions *p*. Also, mutation to a rare nucleotide can incur a higher cost than mutation to a common nucleotide, thereby accommodating nucleotide composition bias. The recombination cost parameter *r*(*i*, *j*, *p*) can also vary among positions and capture the lower cost at recombination hotspots. However, it is also possible to model much more complex effects, as *r*(*i*, *j*, *p*) also depends on the sequence pair (*i*, *j*). For example, several mechanisms proposed for HIV recombination (Negroni and Buc, 2001) depend on the donor sequence *i*, on the acceptor sequence *j* or on the pair of parental sequences (*i*, *j*) that produce the recombinant. Such effects can easily be accounted for by increasing or decreasing the respective recombination cost *r*(*i*, *j*, *p*).

### 2.2 Formulation as a dynamic program

Despite its expressive power, the proposed optimization problem can be solved efficiently. As the cost function (1) is separable in *k _{p}*, we can minimize it by dynamic programming, see Figure 2 for the corresponding tableau. To be more specific, define a subproblem of (1) by only considering the columns 1, … ,

*p*of the alignment, with the additional constraint that solutions end using some sequence

*k*. The minimal cost solution of this subproblem is the minimal cost for generating

_{p}*S*via mutation and recombination up to and including position

*p*, using all sequences in

*A*and ending in sequence

*k*:

_{p}**Fig. 2**

**Fig. 2**

This can be rewritten as the following forward recurrence for all *p* = 1, … , *N* as the optimal value of *k _{p}* does not depend on

*k*

_{1}, … ,

*k*

_{p}_{−2}:

*f*represents the minimal cost for the subproblem restricted to columns 1, … ,

_{jp}*p*. The minimal cost for the complete problem is then $minjfjN$ and the optimal solution can be retrieved by backtracing. The time complexity is O(

*NM*

^{2}) for the forward run as there are

*N*stages with each stage performing

*M*

^{2}operations. The backtracing step needs O(

*NM*) operations and O(

*NM*) space as the tableau

*f*has to be stored and each value is accessed exactly once.

_{jp}Alternatively to the forward recurrence, we can also use the backward recurrence for *p* = *N* − 1, … , 0 and describe the minimal cost of a solution generating *S* backwards, starting from position *N* up to and excluding position *p* ending in sequence *k _{p}*. The backward recurrence is

### 2.3 Sensitivity analysis

Usually, we are not only interested in the optimal solution itself, but also in the sensitivity of the solution with respect to small changes of the involved variables. Sensitivity is commonly defined in one of two ways: sensitivity with respect to small changes of the optimal solution or sensitivity with respect to small changes of input parameters, such as α, *m*(*i*, *p*) and *r*(*i*, *j*, *p*) in our case. A sensitivity analysis with respect to the optimal solution, subsequently called ‘analysis of the solution’, studies the cost surface in the neighborhood of the optimal solution. The second approach, subsequently called ‘parametric analysis’, quantifies how the optimal solution depends on a set of parameters. More exactly, it partitions the space of parameter settings into regions that yield the same optimal solution and assesses the impact of uncertainty of parameters, e.g. due to measurement or estimation errors, on the structure of the optimal solution. Both types of analyses have been applied separately to the problem of sequence alignment; see e.g. Mevissen and Vingron (1996) and Zimmer and Lengauer (1997). In the following we use both types of analysis.

#### 2.3.1 Analysis of the solution

In the following we transfer an idea by Mevissen and Vingron (1996) from traditional sequence alignment to our problem.

For each position *p* and each sequence *i* we compute the minimum cost *c _{ip}* of a solution that runs through nucleotide

*A*. In other words, we constrain

_{ip}**k**to solutions with

*k*=

_{p}*i*. We can compute the minimal cost of the constrained problem by adding the values of the forward and backward tableaus

*c*=

_{ip}*f*+

_{ip}*b*, since

_{ip}*f*is the minimum cost of generating the recombinant up to and including position

_{ip}*p*ending in sequence

*i*, and

*b*is the minimum cost starting from position

_{ip}*N*up to and excluding position

*p*ending in sequence

*i*(Fig. 2).

We define a robustness measure *r _{ip}* by $rip=minj\u2260i{cjp}\u2212cip$. In words,

*r*is the sacrifice in cost that is incurred by forcing a solution that avoids position

_{ip}*p*in sequence

*i*in our alignment. If

*r*is small then the choice of position

_{ip}*p*in sequence

*i*does not buy us much in terms of alignment cost. Larger values of

*r*indicate more robust sequence positions. The values

_{ip}*c*and

_{ip}*r*can be computed in time

_{ip}*O*(

*MN*) for all combinations of

*p*and

*i*if the forward and backward tableau

*f*and

_{ip}*b*are given.

_{ip}Both values, *c _{ip}* and

*r*, are useful in our scenario.

_{ip}*c*is particularly attractive for visualizing the optimal solution of the dynamic program. All nucleotides

_{ip}*A*that are part of an optimal solution have the same, minimal

_{ip}*c*. We can visualize these by coloring the background of the alignment

_{ip}*A*according to the values

_{ip}*c*(Fig. 3a). For positions at which the solution is uncertain or close to a recombination breakpoint, low values of

_{ip}*c*occur for several sequences to indicate the ambiguity. On the other hand, if we ask how robust the choice of a specific nucleotide

_{ip}*A*is, we can rely on the values of

_{ip}*r*. Regions that are uncertain are characterized by a lower robustness of the optimal solution (Fig. 3b).

_{ip}**Fig. 3**

**Fig. 3**

#### 2.3.2 Parametric analysis

Even though the analysis of the previous section analyzes the cost surface of solutions close to the minimum solution and also results in an appealing visual representation, it cannot capture the uncertainty caused by varying the input parameters. In particular, for many pathogens we do not have reliable estimates of the recombination rate and cannot set the parameter α appropriately. But even if we know the recombination rate and the appropriate setting of α, it is still informative to find the region of α for which the optimal solution does not change structurally. In the following we study how *φ _{α}*(

**k**) changes as α is varied, both for a fixed solution

**k**and for

**k**subject to optimization.

If **k** is fixed, *φ _{α}*(

**k**) is linear in α and has the derivative

*R*and

*M*are the total cost of recombination and mutation not weighted by (1 − α) or α. If

**k**is also subject to optimization, the minimal cost is a piecewise-linear concave function $g(\alpha )=mink\varphi \alpha (k)$. Each linear piece corresponds to a fixed set of minimum cost solutions. This set can be retrieved by the dynamic program and only changes at the edges of the piecewise-linear cost function. Hence, it is simple to retrieve all minimum cost solutions for α ∈ [0,1], given the parametric cost curve

*g*.

In order to compute the parametric cost curve *g*, we use the algorithm of Eisner and Severance (1976). In order to apply this algorithm, we need a method for computing the optimal cost and the derivative at a given point α. In our case it is simple to compute the minimal cost with the dynamic program. To get the derivative, we extended the backtrace algorithm to compute *R* and *M*.

The result of the parametric optimization for our example dataset is shown in Figure 4, where each sequence of the alignment was taken as the putative recombinant, in turn. Each linear piece of a cost curve is defined by some values of *R* and *M* and has the functional form (1 − α)*R* + *αM*. Hence, we can read-off the values of *R* and *M* as the abscissa for α = 0 and α = 1, respectively. For example, the first linear piece of each cost curve intersects the origin and corresponds to a solution with *R* = 0, that means without recombination. All other solutions contain at least one recombination. The unweighted cost for recombinations *R* is increasing with increasing α, as mutation cost grows relative to recombination cost. The transition from a solution containing only mutations to a solution with at least one recombination is of particular interest. If the transition happens at a small value of α, recombinations are introduced, even though their weighted cost (1 − α)*r*(*i*, *j*, *p*) is high relative to the cost of mutations. To be more specific, if recombinations and mutations have unit cost, each recombination reduces the number of mutations in the solution by (1 − α)/α. For instance, for sequence R1 in Figure 4, we have α = 0.077 and the sole recombination in the solution associated with α ∈ [0.077, 0.5] removes about 0.923/0.077 ≈ 12 mutations. Hence, a small α value is indicative for a true recombinant and can be used for filtering interesting sequences from the dataset. In the following section we use a similar property in an automated method for detecting recombinant sequences.

**Fig. 4**

**Fig. 4**

### 2.4 Recombination detection

The previous section introduced sensitivity analysis as a means of investigating a sequence dataset manually and pointing out single recombination events. In the following, we focus on extending this type of analysis quantitatively, such that it can also be used in an automated procedure discerning recombinant from non-recombinant datasets. This does not only yield *P*-values for the presence of recombination, but also forms the basis for the power study in Section 3. A quick overview of the features we introduce in the following sections is given in Figure 5.

**Fig. 5**

**Fig. 5**

Given the segments of a cost curve ordered by increasing α, let *R _{i}* and

*M*,

_{i}*i*= 1, … ,

*K*be the unweighted cost of recombination and mutation for the

*i*-th linear segment. The first feature that we introduce is the amount of mutation cost saved per unit of recombination cost. More exactly,

*R*

_{i}_{+1}−

*R*is the additional amount of recombination cost incurred by solution

_{i}*i*+ 1, and

*M*−

_{i}*M*

_{i}_{+1}is the amount of mutation cost saved. The quotient Savings

*= (*

_{i}*M*−

_{i}*M*

_{i}_{+1})/(

*R*

_{i}_{+1}−

*R*) quantifies the additional amount of mutation cost saved by each additional unit of recombination cost, where Savings

_{i}*is decreasing with increasing*

_{i}*i*. Consequently, we define our first feature to be

_{1}. This is a strictly monotonic transformation as FirstAlpha ∈ [0,1], and thus a

*P*-value based on FirstAlpha is exactly the same as one based on MaxSavings. Therefore, we do not consider the feature FirstAlpha in our analysis. Another feature that is related but not equivalent to MaxSavings is FirstAngle, the angle in radians between the first and second linear segment of a cost curve (Fig. 5). A large angle points to a significant decrease of mutation cost and an increase of recombination cost. The last feature we propose is MaxCost, the maximum that the cost curve attains. This feature measures how hard it is to derive the putative recombinant from the other sequences in the alignment. The value of MaxCost depends mostly on the time when the sequence diverged from the other sequences in the dataset. A sequence that diverged early in history tends to have a higher value of MaxCost than a sequence that diverged only recently.

In general we cannot rely on absolute feature values in identifying recombination. The features depend on parameters of the dataset, such as sequence diversity or sequence length. Hence, we use a permutation test to obtain *P*-values instead of absolute feature values for each sequence. The *P*-value is computed by comparing the observed feature value with the feature value expected under the null-hypothesis that no recombination occurred. More exactly, we permute the columns of an alignment 100 times in order to generate independent samples of datasets with the same sequence diversity, but without a recombination signal. In the following, a sequence with a *P*-value ≤ 0.05 is assumed to be recombinant. As the number of sequences is usually small, we do not consider any correction for multiple testing.

The *P*-values suggest whether a certain sequence is recombinant or not. They do not imply that the whole dataset contains a significant amount of recombination. In order to test the latter hypothesis we have to condense the *P*-values or feature values for the whole dataset into a single *P*-value, while controlling the level of false positives. A simple choice is to use the smallest *P*-value over all sequences in the alignment and to apply a Bonferroni correction for multiple testing. This is very conservative and frequently reduces power. Alternatively, we can compute the minimum, or the most extreme, feature value over all sequences of a dataset and repeat this step for every permutation. The resulting *P*-value for the whole dataset still controls the false positive rate nicely and performs better than the Bonferroni correction. Again, a significance cutoff of 0.05 is used to indentify recombination.

### 2.5 Detection of the breakpoint

The backtrace contains information on the recombination breakpoints of a solution **k**. However, the inferred breakpoints depend on the selected sequence *S* and on α. In the following, we present a method that assigns *P*-values to each position without depending on *S* or α.

We address the dependence on α first. The method is based on a recombination profile *p _{j}*,

*j*= 1, … ,

*N*, that stores how much mutation cost can be saved per unit of recombination cost if we introduce a recombination at position

*j*in the solution explaining a specific sequence

*S*. More exactly, for each linear piece

*i*= 1, … ,

*K*of the parametric cost curve we denote the set of breakpoints that occur in any optimal solution of

*i*as

*B*. Then $pj=maxi=2,\u2026,K{0}\u222a{Savingsi\u22121\u2223j\u2208Bi}$, where Savings

_{i}

_{i}_{−1}is the savings value for solution

*i*.

*p*thus stores the maximum Savings value that any breakpoint at position

_{j}*j*can achieve. To obtain

*P*-values for each breakpoint position, we use a similar approach as in Section 2.4: by computing recombination profiles for column permutations of the original alignment, we can estimate the distribution of

*p*s for each position

_{j}*j*. The result is a

*P*-value for recombination at each position and for each sequence in the alignment.

To detect breakpoints without referring to a specific sequence *S*, we compute the maximum of *p _{j}* over all sequences in the alignment for each position

*j*(Fig. 6). Computing

*P*-values for the aggregated recombination profile is then analogous to the procedure for a single sequence. Figure 6 shows that a fixed Savings value is more significant for a position close to the boundary than for a position in the middle of the alignment. This is due to the fact that recombinations close to the boundary of the alignment exchange fewer nucleotides and thus cannot achieve high values of Savings.

**Fig. 6**

**Fig. 6**

Detecting the breakpoint position entails a multiple testing problem. Each position has an expected false detection rate of 5% and even our short example alignment could result in five falsely detected breakpoints, on average. Therefore, we would have to adjust for multiple testing, e.g. using the Bonferroni or false discovery rate method (Benjamini and Hochberg, 1995). In order to do so, we would have to perform, say, 10 000 permutations per dataset in order to obtain accurate *P*-values after adjustment. This is computationally forbidding in the context of a power study. Thus, for this validation we only consider the basic methodology without adjustment for multiple testing.

## 3 RESULTS

Validation is difficult for recombination detection programs and even harder for programs finding the recombinant sequences. One reason is that real datasets either show a recombination signal that is easy to detect, or there is still uncertainty about whether recombination actually took place, see e.g. Posada (2002). In comparison, synthetic datasets have many advantages, such as an unlimited amount of datasets and total control over the simulated recombination scenario. Synthetic datasets are usually generated by one of two approaches. The first is to randomly draw an ancestral recombination graph given several population genetic parameters (Griffiths and Marjoram, 1997) and then simulate sequences along the resulting genealogy. Alternatively, we can specify a fixed genealogy for simulating sequences. The first approach works well for comparing methods regarding their power of detecting recombination. The second approach allows studying the power of identifying recombinant sequences or recombination breakpoints as both are known. In the following we use both kinds of synthetic datasets in order to compare our method with existing recombination detection programs.

### 3.1 Random genealogies

In analogy to the study of Posada and Crandall (2001), we used the coalescent to generate 1000 sequence datasets for different parameter combinations. For all simulations we chose the effective population size *N*_{e} = 1000, the sequence length of *N* = 1000 nt, a sample size of *M* = 10 sequences and the Jukes–Cantor substitution model. We let the scaled recombination rate ρ and the scaled mutation rate θ vary such that ρ ∈ {0, 1, 4, 16} and θ ∈ {10, 30, 50, 100, 200}. For diploid populations the coalescent defines θ = 4*N*_{e}*μN* and ρ = 4*N*_{e}*rN*, where μ is the mutation rate per generation and loci, and *r* is the recombination rate per generation and loci. Thus, our simulations correspond to *r* ∈ {0, 2.5 × 10^{−7}, 1 × 10^{−6}, 4 × 10^{−6}} and μ ∈ {2.5 × 10^{−6}, 7.5 × 10^{−6}, 1.25 × 10^{−5}, 2.5 × 10^{−5}, 5 × 10^{−5}}. To obtain a measure of the false detection rate, we also simulated data without recombination and different mutation rates and rate variations among sites. Rate variation was modeled using the gamma distribution with shape parameter γ. The gamma distribution was approximated with eight rate categories (Yang, 1996). The simulations without recombination used the same values for θ and γ ∈ {0.01, 0.5, 2, ∞}. A higher value of γ represents a lower amount of rate variation, and γ = ∞ denotes no rate variation at all. All simulations were carried out using treevolve (Grassly *et al*., 1999). Treevolve did not terminate for recombination rates ρ ≥ 32, thus we could not cover the whole range of parameters as in Posada and Crandall (2001). Moreover, treevolve generated datasets that are different from the ones used in the original study. For comparison, we report the results on our datasets for Geneconv (Sawyer, 1989), one of the best performing methods in the original study. The parameters of Geneconv were set as in the original study, i.e. using global permutation values and a gscale value of 0. A gscale value of 0 does not allow for mismatches in the conserved fragments between sequences and performed best for detecting recombination in random genealogies.

The probability of detecting recombination for Geneconv and the MaxSavings feature of Recco is depicted in Figure 7. The other features we have proposed did not perform well in this scenario and failed to detect recombination in many cases. Geneconv has a higher probability of detecting recombination than MaxSavings, but does not control the false positive rate of 5% for high mutation rates. Our permutation-based approach controls the false detection rate always to a level below 5%. In order to compare the two methods, we have adjusted the MaxSavings feature such as to classify a dataset as recombinant if the *P*-values do not exceed 0.12. This setting increases the false detection rate to a level slightly below that of Geneconv, but also increases the power to a comparable level. Thus, our method performs as well as Geneconv in this scenario. However, we have reasons to believe that the power of Recco is reduced by a simple limitation: if two recombinants are similar, they are explained by mutation instead of recombination. This is investigated in more detail in the next subsection.

**Fig. 7**

**Fig. 7**

### 3.2 Fixed recombination scenarios

Fixed recombination scenarios can be used for studying the power of identifying the recombinant or the recombination breakpoint, as the recombination scenario is known in detail. We use a total of four recombination scenarios. (S1) and (S2) are generated using the genealogy in Figure 1, but (S1) does not contain sequence R2. (S3) and (S4) were originally introduced in Holland *et al*. (2002) and are given in Figure 8. As all scenarios only contain recent recombinations, the recombination signal has a very simple structure and should be fairly easy to detect. For each scenario we simulated 1000 alignments using Seq-Gen (Rambaut and Grassly, 1997). We then applied Recco and computed the *P*-values for recombination per dataset, per sequence and for each position. We observed that in this scenario Geneconv performed much better using a gscale value of 1, i.e. allowing for mismatches in the conserved fragments. Thus, we only report the results using a gscale value of 1 for Geneconv. The result for recombination detection and the parameters used in the simulation are given in Table 1. The most predictive feature for recombination detection in most scenarios was MaxSavings. MaxSavings performed slightly worse than Geneconv only for its worst-case scenario (S2), where the recombinants are very similar. As (S1) and (S2) use a very short sequence length, we repeated this experiment with a sequence length of 1000 nt and one-tenth of the mutation rate. The results did not differ qualitatively and MaxSavings still performed best for scenarios (S2), (S3) and (S4). Recco was also quite fast: a full analysis with 100 column permutations of a dataset from scenario (S3) took ∼44 s on an AMD Opteron 2.4 GHz. This included computing the parametric cost curve 101 times for each of the 11 sequences with 1000 nt, constructing breakpoint profiles and summarizing the results as *P*-values.

**Fig. 8**

**Fig. 8**

**Table 1**

Sequence length (nt) | Tree height | Mutation model | ts/tv ratio | Geneconv | MaxSavings | FirstAngle | MaxCost | |
---|---|---|---|---|---|---|---|---|

(S1) | 100 | 0.25 | JC | — | 0.518 | 0.467 | 0.012 | 0.106 |

(S2) | 100 | 0.25 | JC | — | 0.472 | 0.875 | 0.042 | 0.572 |

(S3) | 1000 | 0.15 | K2P | 2 | 1.000 | 1.000 | 0.303 | 0.933 |

(S4) | 1000 | 0.15 | K2P | 2 | 1.000 | 1.000 | 0.950 | 0.988 |

Sequence length (nt) | Tree height | Mutation model | ts/tv ratio | Geneconv | MaxSavings | FirstAngle | MaxCost | |
---|---|---|---|---|---|---|---|---|

(S1) | 100 | 0.25 | JC | — | 0.518 | 0.467 | 0.012 | 0.106 |

(S2) | 100 | 0.25 | JC | — | 0.472 | 0.875 | 0.042 | 0.572 |

(S3) | 1000 | 0.15 | K2P | 2 | 1.000 | 1.000 | 0.303 | 0.933 |

(S4) | 1000 | 0.15 | K2P | 2 | 1.000 | 1.000 | 0.950 | 0.988 |

Tree height is given in mutations per site from the root to any tip. The mutation model is either JC (Jukes and Cantor, 1969), or K2P (Kimura, 1980). ts/tv ratio is the transition to transversion ratio.

Finding the recombinant sequence is more difficult (Fig. 9), particularly because it is very hard to discriminate between the true recombinant and sequences involved in the recombination. This is an intrinsic problem. As expected, the recombinant sequences had the highest probability of being detected as recombinant for (S1), (S3) and (S4). For (S2) the parental sequences A1 and B1 are the only sequences that are detected as recombinant. R1 and R2 are masked as they are very similar to each other. Consequently, the evidence for recombination is derived from sequences A1 and B1, and not from R1 for scenario (S2). Another interesting aspect is that the features MaxCost and FirstAngle performed quite well, despite their failure to detect recombination for random genealogies. Particularly FirstAngle seems to discriminate better between true recombinant and parental sequences. MaxCost is almost as good as MaxSavings here, and achieved a false detection rate <2.5% in all datasets of Subsection 3.1 (data not shown).

**Fig. 9**

**Fig. 9**

To evaluate breakpoint detection performance, we computed the probability of detecting a recombination breakpoint at each position for Geneconv and for the MaxSavings feature of Recco. A breakpoint for Geneconv was defined as the start or end position of a gene conversion fragment with a significant *P*-value. The result for detecting recombination breakpoints in scenarios (S1), (S2) and (S3) is given in Figure 10. (S4) shows a very similar behavior to (S3). As a side effect of our breakpoint definition for Geneconv there is a high chance of identifying the first or last positions of an alignment as a breakpoint. This should be ignored. Geneconv has a smaller false detection rate than the MaxSavings feature of Recco, but fails to detect the true breakpoint in many cases. The scenarios (S1) and (S2) have a low diversity and are very difficult for Geneconv. Recco is more suitable for these datasets and may assign a highly significant *P*-value to breakpoints even after correcting for multiple testing (Fig. 6). However, Recco's false detection rate is too high for long sequences if multiple testing is not accounted for.

**Fig. 10**

**Fig. 10**

## 4 CONCLUSION

We have introduced the new fast method Recco for analyzing sequences subject to recombination. The MaxSavings feature of Recco performs as well as or better than many other methods for recombination detection. While the underlying dynamic program is quite simple, the output of Recco is enhanced considerably by succeeding sensitivity analysis. Sensitivity analysis on the solution provides an intuitive visualization of the solution and alternative recombination hypotheses. We have also introduced a sensitivity analysis on the input parameter α, which controls the ambiguity between mutation and recombination. We believe that this type of analysis can be of benefit in other approaches to recombination detection as well, e.g. RecPars (Hein, 1993). A major limitation of our approach is that two similar recombinant sequences mask each other and are not detected as recombinant. We believe that resolving this issue will lead to an algorithm that is even more powerful. A manual work-around is to iteratively remove the closest sequence to the sequence under investigation until it is detected as recombinant.

The authors would like to thank Andreas Meyerhans for helpful discussions on the requirements of a recombination analysis tool and on the biology of the recombination process in HIV.

*Conflict of Interest*: none declared.

## Comments