Summary: TreeMos is a novel high-throughput graphical analysis application that allows the user to search for phylogenetic mosaicism among one or more DNA or protein sequence multiple alignments and additional unaligned sequences. TreeMos uses a sliding window and local alignment algorithm to identify the nearest neighbour of each sequence segment, and visualizes instances of sequence segments whose nearest neighbour differs from that identified using the global alignment. Data sets can include whole genome sequences, allowing phylogenomic analyses in which mosaicism may be attributed to recombination between any two points in the genome. TreeMos can be run from the command line, or within a web browser, where the relationships between taxa can be explored by drill-through.
Supplementary information: Supplementary data are available at Bioinformatics online.
Recombination events between DNA and RNA molecules occur during meiosis in eukaryotes, by gene conversion, by illegitimate pairing between paralogons, and by lateral gene transfer. Methods of phylogenetic tree reconstruction, used to infer evolutionary relationships between sequences, are predicated on a model of sequence evolution without recombination. Recombination between sequences violates the assumptions of this model, potentially resulting in a group of sequences with several underlying phylogenies (Posada et al., 2002).
Recombination detection methods have included sequence similarity, distance, phylogeny, compatibility, and substitution distribution methods (Posada et al., 2002). A range of tools is available to identify recombination (e.g. Etherington et al., 2005; Milne et al., 2004). The limitation of most tools is that the sequences to be searched must first be aligned in a single multiple alignment. This is appropriate when searching for evidence of recombination in a conserved gene family within or between genomes. However, a single multiple alignment cannot be achieved in several circumstances. First, in a large gene superfamily, where sequence similarity is low, robust alignment may only be possible for each subfamily separately. Second, in genome-scale comparisons, chromosomal rearrangements can mean that alignments of whole chromosomes or large chromosomal segments cannot be made. Third, when considering a conserved gene family, a sequence may, through recombination, contain fragments of a non-homologous sequence, from elsewhere in the same or another genome, that will not align with the remaining sequences in the alignment. TreeMos addresses these three types of case.
The TreeMos approach is a phylogenomic one because it allows the genetic information of entire genomes to be incorporated within a single analysis. TreeMos was developed to search for phylogenetic mosaicism within sequences which could not be analysed within a single multiple alignment, using the proteins of the rhodopsin G-Protein Coupled Receptor (GPCR) gene superfamily, an example of the first type above, as a test case (Allaby and Woodwark, 2007). TreeMos considers sequences within a data set and looks for high local similarities between non-aligned regions within an alignment, regions in separate alignments, and sequences which may be distant homologues, or contain homologous segments (through recombination) in otherwise non-homologous sequences. The high local similarities are subjected to phylogenetic analyses, in order to identify instances of a change in the nearest neighbour—phylogenetic mosaicism—which are displayed as a reticulate relationship (Fig. 1).
Figure 1 illustrates how visualization can lead to the discernment of higher order patterns of phylogenetic mosaicism, such as correlated mosaicism in which many members of a sequence family resemble a distantly related family more than each other within a localized sequence region (Allaby and Woodwark, 2007).
TreeMos can be run on Mac OS X, Windows or Linux, through a user-friendly web browser interface, from the command line, or as part of a high-throughput pipeline. The program searches multiple alignments and individual DNA or protein sequences, in FASTA format, for phylogenetic anomalies. Default parameters, which can be adjusted by the user, set the window size over which anomalies are detected, the increment by which the window slides, and the maximum genetic distance between a pair of sequences for them to be considered related (Allaby and Woodwark, 2004).
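The three adjustable parameters can be pictured with a minimal sketch. The class name, field names and default values below are hypothetical, chosen only for illustration, and are not taken from TreeMos itself; the sketch simply shows how a window size and an increment partition a sequence into overlapping segments.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class WindowParams:
    # Hypothetical names and values, not the actual TreeMos defaults.
    window_size: int = 100      # residues per window
    increment: int = 50         # step between successive window starts
    max_distance: float = 1.0   # pairs further apart are treated as unrelated


def sliding_windows(seq_len: int, params: WindowParams) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) windows covering a sequence of seq_len."""
    if params.window_size <= 0 or params.increment <= 0:
        raise ValueError("window_size and increment must be positive")
    if seq_len <= params.window_size:
        # A sequence no longer than one window is covered by a single window.
        if seq_len > 0:
            yield (0, seq_len)
        return
    start = 0
    while start + params.window_size < seq_len:
        yield (start, start + params.window_size)
        start += params.increment
    # The final window is anchored to the end so the tail is not skipped.
    yield (seq_len - params.window_size, seq_len)
```

Anchoring the last window to the end of the sequence is one possible design choice; it keeps every window the same length at the cost of a smaller final step.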
For each sequence, its Global Nearest Neighbour (GNN) is identified by comparison with all sequences in the data set, as are the Local Nearest Neighbours (LNNs) of each window within the sequence. Throughout, an automated data screening procedure is used to filter out sequences that are too dissimilar to be reliably aligned (see Supplementary Fig. 1) (Allaby and Woodwark, 2007). Tree-building is used to identify nearest neighbours where enough data are available; otherwise, distance methods are used. Where the LNN of a particular window differs from the GNN, the window is identified as having a phylogenetically anomalous affiliation. Typically, this entails hundreds or thousands of phylogenetic analyses. The resulting set of anomalies is reported in tab-separated text format, and visualized as an image for each sequence and for each alignment. Log files are generated, recording processing steps and any errors. Sets of results can be archived for browsing at a future date.
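The comparison of local and global nearest neighbours can be illustrated with a deliberately simplified sketch. It assumes that all sequences are already aligned to equal length and uses the uncorrected proportion of differing sites (p-distance) in place of the local alignment, tree-building and corrected distance steps that TreeMos actually performs. All function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def p_distance(a: str, b: str) -> Optional[float]:
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return None  # no comparable sites
    return sum(x != y for x, y in pairs) / len(pairs)


def nearest_neighbour(name: str, seqs: Dict[str, str], start: int, end: int,
                      max_distance: float) -> Optional[str]:
    """Closest other sequence over columns [start, end), or None if none qualifies."""
    best_name, best_d = None, None
    for other, seq in seqs.items():
        if other == name:
            continue
        d = p_distance(seqs[name][start:end], seq[start:end])
        if d is None or d > max_distance:
            continue  # screened out as too dissimilar
        if best_d is None or d < best_d:
            best_name, best_d = other, d
    return best_name


def find_anomalies(seqs: Dict[str, str], window_size: int, increment: int,
                   max_distance: float) -> List[Tuple[str, int, int, str, str]]:
    """Report windows whose local nearest neighbour differs from the global one."""
    length = len(next(iter(seqs.values())))
    anomalies = []
    for name in seqs:
        gnn = nearest_neighbour(name, seqs, 0, length, max_distance)
        for start in range(0, length - window_size + 1, increment):
            end = start + window_size
            lnn = nearest_neighbour(name, seqs, start, end, max_distance)
            if gnn is not None and lnn is not None and lnn != gnn:
                anomalies.append((name, start, end, gnn, lnn))
    return anomalies


# Toy data: Q resembles A overall, but its second half is closer to C.
toy = {
    "Q": "ACTTAC" + "GGGGGG",
    "A": "ACGTAC" + "GGGGTT",
    "C": "TTTTTT" + "CGGGGG",
}
result = find_anomalies(toy, window_size=6, increment=6, max_distance=0.9)
```

In this toy data set the only anomalous window is the second half of Q, whose local nearest neighbour is C although its global nearest neighbour is A.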
The web browser interface allows all functions to be accessed through local web pages, and the resulting set of anomalous affiliations to be visualized interactively with drill-through between affiliated sequences (no connection to the internet is required).
External packages are used to carry out the analyses underlying the algorithm. In release 1.0, BLAST (Altschul et al., 1990) is used to search for local alignments, CLUSTAL W (Thompson et al., 1994) is used to construct multiple alignments, and the neighbor, dnadist and protdist programs from the PHYLIP package (Felsenstein, 2005) are used to carry out neighbor-joining tree and distance calculations using gamma-distributed among-site rate heterogeneity with a fixed shape parameter value of 4, based on a recent review of real data (Bofkin and Goldman, 2007). The algorithm is designed to be package-neutral, and the software could be readily modified to use alternative packages with potential for improved performance. Future plans include incorporating S-Search (Smith and Waterman, 1981) for local alignments and MUSCLE (Edgar, 2004) or MAFFT (Katoh et al., 2005) for multiple alignments. For phylogenetic analyses, we intend to assess the use of PHYML (Guindon and Gascuel, 2003) with parameter optimization, and also to carry out substitution model selection using likelihood ratio tests.
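To indicate what a gamma correction with a fixed shape parameter does to a pairwise distance, the following sketch uses the Jukes-Cantor model with gamma-distributed rates. The choice of Jukes-Cantor is an assumption made for simplicity; the substitution models actually used by dnadist and protdist in TreeMos are not stated here. PHYLIP expresses rate variation as a coefficient of variation, which relates to the shape parameter alpha as 1/sqrt(alpha).

```python
import math


def jc_distance(p: float) -> float:
    """Jukes-Cantor distance from the observed proportion of differing sites p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_gamma_distance(p: float, alpha: float = 4.0) -> float:
    """Jukes-Cantor distance with gamma-distributed rates among sites.

    alpha is the gamma shape parameter; the model itself is an illustrative
    assumption, not necessarily the one used by TreeMos.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)


def coefficient_of_variation(alpha: float) -> float:
    """Coefficient of variation of rates among sites for a given gamma shape."""
    return 1.0 / math.sqrt(alpha)
```

For any observed difference greater than zero, the gamma-corrected distance exceeds the uncorrected Jukes-Cantor distance, and the two converge as alpha grows large.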
TreeMos is coded in Perl and has been tested on Mac OS X 10.4.9, Windows XP with ActivePerl 5.8.6 installed, and SuSE Linux 9.3. Output files are platform-independent, so results generated on Windows will successfully load on Mac OS X, for example. For graphical navigation, TreeMos uses a personal webserver to execute Perl CGI scripts, which have been tested using Mac OS X 10.4.9 personal web sharing, and using the XAMPP installation of Apache on Windows XP and SuSE Linux 9.3. Executables of the NCBI BLAST (Altschul et al., 1990), PHYLIP (Felsenstein, 2005), and CLUSTAL W (Thompson et al., 1994) packages are distributed with the TreeMos installer. Graphical reporting is accomplished through the GD graphics library (Joye and Boutell, 2007), and a binary version for Mac OS X Intel platforms is included in the installer. On Mac OS X PowerPC, Linux, and Windows platforms, the installer attempts to use the fink, rpm, and cpan tools respectively to install the GD library.
The development of TreeMos was part-funded by the Biotechnology and Biological Sciences Research Council (BBSRC), UK. We thank the anonymous reviewers for helpful suggestions.
Conflict of Interest: none declared.