RDP4: Detection and analysis of recombination patterns in virus genomes

RDP4 is the latest version of the recombination detection program (RDP), a Windows computer program that implements an extensive array of methods for detecting and visualising recombination in, and stripping evidence of recombination from, virus genome sequence alignments. RDP4 is capable of analysing twice as many sequences (up to 2,500) that are up to three times longer (up to 10 Mb) than those that could be analysed by older versions of the program. RDP4 is therefore also applicable to the analysis of bacterial full-genome sequence datasets. Other novelties in RDP4 include (1) the capacity to differentiate between recombination and genome segment reassortment, (2) the estimation of recombination breakpoint confidence intervals, (3) a variety of ‘recombination aware’ phylogenetic tree construction and comparison tools, (4) new matrix-based visualisation tools for examining both individual recombination events and the overall phylogenetic impacts of multiple recombination events, and (5) new tests to detect the influences of gene arrangements, encoded protein structure, nucleic acid secondary structure, nucleotide composition, and nucleotide diversity on recombination breakpoint patterns. The key feature of RDP4 that differentiates it from other recombination detection tools is its flexibility. It can be run either in fully automated mode from the command line interface or with a graphically rich user interface that enables detailed exploration of both individual recombination events and overall recombination patterns.


Introduction
In many different groups of viruses, genetic recombination is an important evolutionary process that generates much of the genetic diversity upon which natural selection acts. Recombination patterns that are evident within the genomes of such viruses can reveal a great deal about their biology and evolution. Non-random patterns of sequence exchange between individuals within a species can provide direct evidence of geographical or host-range-imposed population subdivisions that prevent certain individuals from recombining (Lam et al. 2013; Monjane et al. 2014). Similarly, sequence exchange patterns between viruses in different species can reveal otherwise undetectable ecological links between some species and barriers between others (Beiko, Harlow, and Ragan 2005; Lefeuvre et al. 2010; Prasanna et al. 2010). The distributions of recombination breakpoints that are evident within virus genomes can also reveal details of the mechanistic and biochemical processes underlying recombination (Magiorkinis et al. 2003; Rohayem, Münch, and Rethwilm 2005; Lefeuvre et al. 2009; Dedepsidis et al. 2010; Simon-Loriere et al. 2010) and the selective forces that constrain the survival and proliferation of recombinants (Lefeuvre et al. 2007; Simon-Loriere et al. 2009; Golden et al. 2014; Woo, Robertson, and Lovell 2014). The epidemiological and/or ecological context of recombinants and the distributions of detected recombination breakpoints can also be crucial in identifying instances where recombinants have been artefactually generated in the laboratory (Boni et al. 2008; Han and Worobey 2011; Martin, Lemey, and Posada 2011; Tan et al. 2012; Lam et al. 2013).
Besides an interest in recombination itself, another important reason for analysing recombination patterns in virus genomes is to minimise the disruptive impact that recombination can have on other phylogeny-based analyses of molecular evolution (Schierup and Hein 2000b; Scheffler, Martin, and Seoighe 2006; Arenas and Posada 2010). Specifically, unaccounted-for recombination events within a set of sequences can seriously undermine the accuracy of phylogenetic trees constructed from these sequences (Schierup and Hein 2000a; Posada and Crandall 2002). Therefore, before carrying out selection, molecular clock, phylogeographic, or any other analyses of virus genome sequences that may be misled by incorrectly inferred phylogenetic trees, it is often desirable to either exclude recombinant sequences or identify recombination breakpoint positions and focus analyses exclusively on those genome regions that are unbroken by these breakpoints.
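The idea of restricting analyses to genome regions unbroken by breakpoints can be illustrated with a minimal Python sketch. It is not part of RDP4: the function names, the dictionary-based alignment, and the half-open coordinate convention are all assumptions made for illustration.

```python
def breakpoint_free_regions(length, breakpoints):
    """Return half-open (start, end) intervals lying between consecutive breakpoints.

    `length` is the alignment length; `breakpoints` are alignment column
    indices (hypothetical input format). Duplicates and out-of-range
    positions are ignored.
    """
    cuts = sorted({b for b in breakpoints if 0 < b < length})
    edges = [0] + cuts + [length]
    return list(zip(edges[:-1], edges[1:]))


def slice_alignment(alignment, start, end):
    """Extract columns [start, end) from every sequence in a name -> sequence dict."""
    return {name: seq[start:end] for name, seq in alignment.items()}


if __name__ == "__main__":
    toy = {"A": "AAAAAAAAAA", "B": "CCCCCCCCCC"}
    for start, end in breakpoint_free_regions(10, [4, 7]):
        print(start, end, slice_alignment(toy, start, end))
```

Each sub-alignment produced this way could then be analysed separately, on the assumption that no undetected breakpoints remain within it.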

Detecting individual recombination events with RDP4
RDP4 is a computer program that was developed with all of these applications in mind. Given a set of aligned nucleotide sequences, it identifies and characterises individual recombination events, providing detailed information on which sequences in the analysed dataset carry evidence of the same recombination event, the likely positions of recombination breakpoints, and the identities of sequences that are most closely related to the parental sequences. Key elements of the RDP4 program interface are illustrated in Fig. 1.
Crucially, RDP4 is able to perform recombination analyses without any need for predefined sets of non-recombinant reference sequences: a factor which makes it more generally applicable than many other available recombination analysis tools (see http://www.bioinf.manchester.ac.uk/recombination/programs.shtml; Martin, Lemey, and Posada 2011). RDP4 is able to do this using a range of fast and powerful heuristic recombination detection methods that sequentially test every combination of three sequences in an input alignment for evidence that one of the three sequences is a recombinant and the other two are its parents. Besides the original RDP method (Martin and Rybicki 2000), these methods include BOOTSCAN (Salminen et al. 1995), MAXCHI (Maynard Smith 1992), CHIMAERA (Posada and Crandall 2001), 3SEQ (Boni, Posada, and Feldman 2007), GENECONV (Padidam, Sawyer, and Fauquet 1999), LARD (Holmes, Worobey, and Rambaut 1999), and SISCAN (Gibbs, Armstrong, and Gibbs 2000). Following the detection of a 'recombination signal' with these methods, RDP4 determines approximate breakpoint positions using a hidden Markov model, BURT, and then identifies the recombinant sequence using the PHYLPRO (Weiller 1998), VISRD (Lemey et al. 2009), and EEEP methods (Beiko and Hamilton 2006; Heath et al. 2006; see the manual that is distributed with RDP4 for a detailed account of how all of these methods work).
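The triplet-scanning strategy described above can be illustrated with a deliberately simplified Python sketch. It is not the RDP method or any other method implemented in RDP4: the function names, the window and step sizes, and the crude criterion (the parent most similar to the putative recombinant changes along the alignment) are assumptions chosen for illustration. The real methods attach a statistical test to each signal, whereas this criterion would also flag ordinary noise in real data.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of aligned positions at which two equal-length strings match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_match_switches(child, p1, p2, window=10, step=5):
    """Return True if the parent closest to `child` differs between windows."""
    closest = []
    for start in range(0, len(child) - window + 1, step):
        end = start + window
        i1 = identity(child[start:end], p1[start:end])
        i2 = identity(child[start:end], p2[start:end])
        if i1 != i2:  # windows in which both parents tie are uninformative
            closest.append(1 if i1 > i2 else 2)
    return len(set(closest)) > 1


def scan_triplets(alignment, window=10, step=5):
    """Test every combination of three sequences, trying each as the recombinant."""
    signals = []
    for trio in combinations(sorted(alignment), 3):
        for child in trio:
            p1, p2 = [name for name in trio if name != child]
            if best_match_switches(alignment[child], alignment[p1],
                                   alignment[p2], window, step):
                signals.append((child, p1, p2))
    return signals


if __name__ == "__main__":
    # Toy alignment: R matches A in its first half and B in its second half.
    toy = {"A": "A" * 40, "B": "C" * 40, "R": "A" * 20 + "C" * 20}
    print(scan_triplets(toy))  # [('R', 'A', 'B')]
```

In the toy alignment only `R` is reported, because it is the only sequence whose closest relative switches from `A` to `B` part-way along the alignment.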
Figure 1. Key elements of the RDP4 program interface: [...] (2) interchangeable tree/matrix/information displays that provide information on individual user-selected recombination events such as inferred breakpoint locations (and statistically plausible alternative locations), parental sequences (and phylogenetically plausible alternative parents), analysis warnings (such as if there is a high probability of recombinants and/or recombination breakpoints having been misidentified), and relative degrees of support by different analysis methods for detected recombination signals; (3) a schematic sequence display depicting colour-coded representations of the analysed sequences and the locations of detected recombination events; and (4) a plot display graphically illustrating the statistical evidence underlying the detection of individual user-selected recombination events.
Having detected all of the recombination signals that are evident within an input alignment, RDP4 will then proceed to infer the minimum number of recombination events needed to account for these signals. It does so by sequentially disassembling identified recombinant sequences into their component parts (i.e., each recombinant sequence is split into two pieces) and iteratively rescanning the resulting expanded dataset until no further recombination signals are evident.
This fully exploratory approach means that, without any prior information, RDP4 can be used to characterise complex patterns of recombination such as those arising when recombination events occur between parental sequences that are themselves recombinant.
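The disassemble-and-rescan procedure can be pictured as a simple loop. The following Python sketch only illustrates the control flow and makes several assumptions: `detect` is a hypothetical callback standing in for the full set of detection methods, events are represented as `(name, start, end)` tuples with half-open coordinates, and the `_major`/`_minor` naming of the two pieces is invented here.

```python
def split_recombinant(seq, start, end, gap="-"):
    """Split a sequence into two equal-length pieces around the region [start, end)."""
    major = seq[:start] + gap * (end - start) + seq[end:]
    minor = gap * start + seq[start:end] + gap * (len(seq) - end)
    return major, minor


def disassemble(alignment, detect, max_rounds=100):
    """Repeatedly detect one event, split the recombinant, and rescan.

    `detect` (hypothetical) takes a name -> sequence dict and returns
    (name, start, end) for one recombination signal, or None when no
    further signals are evident.
    """
    alignment = dict(alignment)  # work on a copy
    events = []
    for _ in range(max_rounds):
        event = detect(alignment)
        if event is None:
            break
        name, start, end = event
        major, minor = split_recombinant(alignment.pop(name), start, end)
        alignment[name + "_major"] = major
        alignment[name + "_minor"] = minor
        events.append(event)
    return alignment, events


if __name__ == "__main__":
    # Toy detector that reports a single event in sequence "R" and then nothing.
    def toy_detect(aln):
        return ("R", 2, 4) if "R" in aln else None

    expanded, found = disassemble({"A": "AAAAAA", "R": "AACCAA"}, toy_detect)
    print(found)
    print(expanded)
```

Because each split adds a sequence to the dataset, the rescan operates on an expanded alignment, which is how recombinants whose parents are themselves recombinant can be handled in later rounds.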
It is important to note, however, that there are also drawbacks to this approach. Primary among these is that when analysing datasets that contain large numbers of recombinant sequences, it can become very difficult for RDP4 to accurately identify the recombinants. Similarly, when numerous ancient recombination events have occurred such that multiple sequences in a dataset carry evidence of the same ancestral recombination events, RDP4 will often incorrectly attribute recombination signals arising from multiple different recombination events to a single ancestral event (i.e., it will under-count the number of recombination events evident within a dataset).
To partially rectify such deficits, RDP4 includes an array of tools which can be used to manually check, and correct if necessary, any perceived inference errors that the program has made. These tools are all accessible via a point-and-click graphical user interface and enable a user to directly test alternative hypotheses relating to the misidentification of recombination breakpoints, parental sequences, and groups of sequences sharing evidence of the same ancestral recombination events. Among others, these cross-checking tools include the following:

Accounting for recombination during phylogenetics-based analyses
In cases where recombination is only being analysed with the intention of minimising its impact on other molecular evolution analyses, RDP4 can export sequence alignments in a multitude of formats either with recombinant sequences/fragments of sequences removed or with recombinant sequences split into their constituent parts. Such alignments will be stripped of all readily detectable evidence of individual recombination events and can then be used with other computer programs such as BEAST (Bouckaert et al. 2014) or HYPHY (Kosakovsky-Pond et al. 2005) to make more accurate estimates of evolutionary rates or less error-prone inferences of positive selection. RDP4 can also be used to directly construct minimum evolution (with FastTree2; Price, Dehal, and Arkin 2010) and maximum-likelihood (with RAxML8; Stamatakis 2014) phylogenetic trees that account for the recombination events that it has detected. Specifically, it will construct trees using edited versions of the input alignment where fragments of sequence derived through recombination have either been removed altogether or have been re-added to the alignment as new sequences. Further, the program can carry out 'recombination aware' inferences of ancestral sequences using parsimony (with PHYLIP;
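The two ways of editing an alignment mentioned above (removing recombinant fragments, or re-adding them as new sequences) can be sketched in Python as follows. This does not reproduce RDP4's export code or file formats: the event tuples, the `_frag` naming scheme, the `mode` argument, and the plain FASTA writer are assumptions made for illustration.

```python
def apply_events(alignment, events, mode="remove", gap="-"):
    """Return a new alignment with recombinant fragments removed or re-added.

    `events` is an iterable of (name, start, end) tuples with half-open
    coordinates (hypothetical format). In "remove" mode each fragment is
    replaced by gap characters; in "split" mode it is also re-added as a
    new, gap-padded sequence.
    """
    if mode not in ("remove", "split"):
        raise ValueError("mode must be 'remove' or 'split'")
    result = dict(alignment)
    for i, (name, start, end) in enumerate(events, 1):
        seq = result[name]
        fragment = seq[start:end]
        result[name] = seq[:start] + gap * (end - start) + seq[end:]
        if mode == "split":
            padded = gap * start + fragment + gap * (len(seq) - end)
            result[f"{name}_frag{i}"] = padded
    return result


def to_fasta(alignment):
    """Serialise a name -> sequence dict as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in alignment.items())


if __name__ == "__main__":
    toy = {"A": "AAAAAA", "R": "AACCAA"}
    print(to_fasta(apply_events(toy, [("R", 2, 4)], mode="remove")))
    print(to_fasta(apply_events(toy, [("R", 2, 4)], mode="split")))
```

Either output keeps every sequence at the original alignment length, so it could be passed to downstream programs that expect aligned input.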

Operational limits
RDP4 can be used to productively analyse datasets containing up to 200 million nucleotides within 72 hours on a standard 2 GHz processor with 2 GB of RAM. Such datasets might, e.g., consist of sixty 3-Mb-long bacterial genome sequences, or 1,500 10-kb-long viral genome sequences. With default program settings, RDP4 can analyse 100 10-kb-long sequences in 10 minutes on a standard desktop computer.
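The dataset sizes quoted above can be checked with simple arithmetic. The short Python sketch below computes the total nucleotide count for each example and, as an additional illustration that is not taken from the text, the number of sequence triplets that an exhaustive triplet scan would need to consider; the function names are hypothetical.

```python
from math import comb


def total_nucleotides(n_sequences, length):
    """Total number of nucleotides in an alignment of equal-length sequences."""
    return n_sequences * length


def triplet_count(n_sequences):
    """Number of distinct combinations of three sequences."""
    return comb(n_sequences, 3)


if __name__ == "__main__":
    examples = {
        "sixty 3-Mb bacterial genomes": (60, 3_000_000),
        "1,500 10-kb viral genomes": (1_500, 10_000),
        "100 10-kb sequences": (100, 10_000),
    }
    for label, (n, length) in examples.items():
        print(f"{label}: {total_nucleotides(n, length):,} nt, "
              f"{triplet_count(n):,} triplets")
```

The bacterial example amounts to 180 million nucleotides, close to the stated 200 million limit, whereas the viral example amounts to only 15 million nucleotides but more than 500 million triplets. This suggests, as an assumption rather than a statement from the text, that the number of sequences as well as the total alignment size influences the practical limits.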

Availability
RDP4 is available for free download from http://web.cbio.uct.ac.za/darren/rdp.html. It is distributed along with programs for generating (SDT; Muhire, Varsani, and Martin 2014) and aligning (IMPALE) datasets and an extensive manual that contains detailed descriptions of the various methods implemented in RDP4 and a step-by-step guide describing how best to use these. The manual and RDP4 site also contain information on how RDP4 can be run on Mac and Linux computers.