Abstract

Summary: TreeMos is a novel high-throughput graphical analysis application that allows the user to search for phylogenetic mosaicism among one or more multiple alignments of DNA or protein sequences, together with additional unaligned sequences. TreeMos uses a sliding window and a local alignment algorithm to identify the nearest neighbour of each sequence segment, and visualizes those segments whose nearest neighbour differs from that identified using the global alignment. Data sets can include whole genome sequences, allowing phylogenomic analyses in which mosaicism may be attributed to recombination between any two points in the genome. TreeMos can be run from the command line, or within a web browser, where the relationships between taxa can be explored by drill-through.

Availability: http://www2.warwick.ac.uk/fac/sci/whri/research/archaeobotany

Contact: jonathan.moore@warwick.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Recombination events between DNA and RNA molecules occur during meiosis in eukaryotes, by gene conversion, by illegitimate pairing between paralogons, and by lateral gene transfer. Methods of phylogenetic tree reconstruction, used to infer evolutionary relationships between sequences, are predicated on a model of sequence evolution without recombination. Recombination between sequences violates the assumptions of this model, potentially resulting in a group of sequences with several underlying phylogenies (Posada et al., 2002).

Recombination detection methods have included sequence similarity, distance, phylogeny, compatibility, and substitution distribution methods (Posada et al., 2002). A range of tools is available to identify recombination (e.g. Etherington et al., 2005; Milne et al., 2004). A limitation of most tools is that the sequences to be searched must first be aligned in a single multiple alignment. This is appropriate when searching for evidence of recombination in a conserved gene family within or between genomes. However, a single multiple alignment cannot be achieved in several circumstances. First, in a large gene superfamily, where sequence similarity is low, robust alignment may only be possible for each subfamily separately. Second, in genome-scale comparisons, chromosomal rearrangements can mean that alignments of whole chromosomes or large chromosomal segments cannot be made. Third, when considering a conserved gene family, a sequence may, through recombination, contain fragments of a non-homologous sequence, from elsewhere in the same or another genome, that will not align with the remaining sequences in the alignment. TreeMos addresses all three types of case.

The TreeMos approach is a phylogenomic one because it allows the genetic information of entire genomes to be incorporated within a single analysis. TreeMos was developed to search for phylogenetic mosaicism within sequences which could not be analysed within a single multiple alignment, using the proteins of the rhodopsin G-Protein Coupled Receptor (GPCR) gene superfamily, an example of the first type above, as a test case (Allaby and Woodwark, 2007). TreeMos considers sequences within a data set and looks for high local similarities between non-aligned regions within an alignment, regions in separate alignments, and sequences which may be distant homologues, or contain homologous segments (through recombination) in otherwise non-homologous sequences. The high local similarities are subjected to phylogenetic analyses, in order to identify instances of a change in the nearest neighbour—phylogenetic mosaicism—which are displayed as a reticulate relationship (Fig. 1).

Fig. 1.

Sample graphical output from TreeMos for the Human AA2A gene protein sequence, with respect to other proteins in the Human GPCR gene superfamily. AA2A is represented on the left with scale in residues. Other members of the superfamily which have phylogenetically anomalous relationships with AA2A are represented on the right at reduced scales.

Figure 1 illustrates how visualization can lead to the discernment of higher order patterns of phylogenetic mosaicism, such as correlated mosaicism in which many members of a sequence family resemble a distantly related family more than each other within a localized sequence region (Allaby and Woodwark, 2007).

2 FEATURES

TreeMos can be run on Mac OS X, Windows or Linux, through a user-friendly web browser interface, from the command line, or as part of a high-throughput pipeline. The program searches multiple alignments and individual DNA or protein sequences, in FASTA format, for phylogenetic anomalies. Default parameters, which can be adjusted by the user, identify the window size over which anomalies are detected, the increment to slide the window, and the maximum genetic distance between a pair of sequences for them to be considered related (Allaby and Woodwark, 2004).
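The role of the window size and increment parameters can be illustrated with a minimal Python sketch. This is not TreeMos code (TreeMos is written in Perl); the function name `sliding_windows`, the half-open coordinates and the clipping of the final window are assumptions made for illustration only.

```python
from typing import Iterator, Tuple


def sliding_windows(seq_len: int, window: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) coordinates of successive windows.

    Hypothetical helper: the final window is clipped to the sequence end
    so that every residue falls in at least one window. Whether TreeMos
    treats the last window this way is not stated in the text.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    start = 0
    while start < seq_len:
        end = min(start + window, seq_len)
        yield (start, end)
        if end == seq_len:
            break  # the sequence end has been reached
        start += step


# A 10-residue sequence scanned with a window of 4 and an increment of 3.
print(list(sliding_windows(10, 4, 3)))  # [(0, 4), (3, 7), (6, 10)]
```

A smaller increment gives finer resolution of anomaly boundaries at the cost of proportionally more analyses per sequence.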

For each sequence, its Global Nearest Neighbour (GNN) is identified by comparison with all sequences in the data set, as are the Local Nearest Neighbours (LNNs) of each window within the sequence. Throughout, an automated data screening procedure is used to filter out sequences that are too dissimilar to be reliably aligned (see Supplementary Fig. 1) (Allaby and Woodwark, 2007). Tree-building is used to identify nearest neighbours in cases where enough data are available; otherwise distance methods are used. Where the LNN of a particular window differs from the GNN, the window is identified as having a phylogenetically anomalous affiliation. Typically, this entails hundreds or thousands of phylogenetic analyses. The resulting set of anomalies is reported in tab-separated text format, and visualized as an image for each sequence and for each alignment. Log files are generated, recording processing steps and any errors. Sets of results can be archived for browsing at a future date.
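The comparison of global and local nearest neighbours can be sketched in Python under strong simplifying assumptions: all sequences are already aligned to equal length, and the nearest neighbour is taken as the sequence with the smallest proportion of differing sites (p-distance). TreeMos itself uses local alignment and tree-building or distance calculations through external packages, so the functions `p_distance`, `nearest_neighbour` and `find_anomalies` below are hypothetical stand-ins rather than the published method.

```python
from typing import Dict, List, Optional, Tuple


def p_distance(a: str, b: str) -> Optional[float]:
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return None  # nothing comparable
    return sum(x != y for x, y in pairs) / len(pairs)


def nearest_neighbour(name: str, seqs: Dict[str, str], start: int, end: int,
                      max_dist: float) -> Optional[str]:
    """Closest other sequence over [start, end), or None if none is within max_dist."""
    best_name, best_d = None, None
    for other, seq in seqs.items():
        if other == name:
            continue
        d = p_distance(seqs[name][start:end], seq[start:end])
        if d is None or d > max_dist:
            continue  # screened out as too dissimilar
        if best_d is None or d < best_d:
            best_name, best_d = other, d
    return best_name


def find_anomalies(name: str, seqs: Dict[str, str], window: int, step: int,
                   max_dist: float) -> Tuple[Optional[str], List[Tuple[int, int, str]]]:
    """Return the GNN and the windows whose LNN differs from it."""
    length = len(seqs[name])
    gnn = nearest_neighbour(name, seqs, 0, length, max_dist)
    anomalies = []
    # Trailing residues not covered by a full window are ignored in this sketch.
    for start in range(0, max(length - window, 0) + 1, step):
        end = min(start + window, length)
        lnn = nearest_neighbour(name, seqs, start, end, max_dist)
        if lnn is not None and lnn != gnn:
            anomalies.append((start, end, lnn))
    return gnn, anomalies


# Toy data: Q resembles A overall, but its last four sites match B.
toy = {"Q": "AAAAAAAACCCC", "A": "AAAAAAAAGGGG", "B": "TTTTTTTTCCCC"}
gnn, anomalies = find_anomalies("Q", toy, window=4, step=4, max_dist=1.0)
print(gnn, anomalies)  # A [(8, 12, 'B')]
```

In the toy data the query is closest to A over its full length, while its final window is closest to B, so that window is reported as an anomalous affiliation.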

The web browser interface allows all functions to be accessed through local web pages, and the resulting set of anomalous affiliations to be visualized interactively with drill-through between affiliated sequences (no connection to the internet is required).

External packages are used to carry out the analyses underlying the algorithm. In release 1.0, BLAST (Altschul et al., 1990) is used to search for local alignments, CLUSTAL W (Thompson et al., 1994) is used to construct multiple alignments, and the neighbor, dnadist and protdist programs from the PHYLIP package (Felsenstein, 2005) are used to carry out neighbor-joining tree and distance calculations using gamma-distributed among-site rate heterogeneity with a fixed shape parameter value of 4, based on a recent review of real data (Bofkin and Goldman, 2007). The algorithm is designed to be package-neutral, and the software could be readily modified to use alternative packages with potential for improved performance. Future plans include incorporating S-Search (Smith and Waterman, 1981) for local alignments and MUSCLE (Edgar, 2004) or MAFFT (Katoh et al., 2005) for multiple alignments. For phylogenetic analyses, we intend to assess the use of PHYML (Guindon and Gascuel, 2003) with parameter optimization, and also to carry out substitution model selection using likelihood ratio tests.
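The effect of gamma-distributed rate heterogeneity on a distance estimate can be illustrated with the gamma-corrected Jukes-Cantor distance for nucleotides, d = (3/4) α [(1 - 4p/3)^(-1/α) - 1], where p is the observed proportion of differing sites and α is the shape parameter. The Jukes-Cantor model is chosen here only for brevity; it is an assumption of this sketch and not necessarily the substitution model applied by the PHYLIP programs within TreeMos.

```python
import math


def jc_gamma_distance(p: float, alpha: float = 4.0) -> float:
    """Gamma-corrected Jukes-Cantor distance for an observed p-distance.

    alpha is the gamma shape parameter; the default of 4 mirrors the fixed
    value quoted in the text. The model choice is illustrative only.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the Jukes-Cantor model")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)


def jc_distance(p: float) -> float:
    """Standard Jukes-Cantor distance (equal rates at all sites)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the Jukes-Cantor model")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Allowing rate variation among sites increases the estimated distance.
print(jc_distance(0.3) < jc_gamma_distance(0.3, alpha=4.0))  # True
```

As α grows large the gamma-corrected value converges on the equal-rates distance, so a smaller shape parameter corresponds to a stronger correction for multiple substitutions at fast-evolving sites.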

3 IMPLEMENTATION

TreeMos is coded in Perl and has been tested on Mac OS X 10.4.9, Windows XP with ActivePerl 5.8.6 installed, and SuSE Linux 9.3. Output files are platform-independent, so results generated on Windows will successfully load on Mac OS X, for example. For graphical navigation, TreeMos uses a personal webserver to execute Perl CGI scripts, which have been tested using Mac OS X 10.4.9 personal web sharing, and using the XAMPP installation of Apache on Windows XP and SuSE Linux 9.3. Executables of the NCBI BLAST (Altschul et al., 1990), PHYLIP (Felsenstein, 2005), and CLUSTAL W (Thompson et al., 1994) packages are distributed with the TreeMos installer. Graphical reporting is accomplished through the GD graphics library (Joye and Boutell, 2007), and a binary version for Mac OS X Intel platforms is included in the installer. On Mac OS X PowerPC, Windows, and Linux platforms, the installer attempts to use the fink, rpm, and cpan modules respectively to install the GD library.

ACKNOWLEDGEMENTS

The development of TreeMos was part-funded by the Biotechnology and Biological Sciences Research Council (BBSRC), UK. We thank the anonymous reviewers for helpful suggestions.

Conflict of Interest: none declared.

REFERENCES

Allaby RG, Woodwark M. Phylogenetic analysis reveals extensive phylogenetic mosaicism in the Human GPCR superfamily. Evol. Bioinformatics, 2007, vol. 3 (pg. 155-168).

Allaby RG, Woodwark M. Phylogenetics in the bioinformatics culture of understanding. Compar. Funct. Genomics, 2004, vol. 5 (pg. 128-146).

Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol., 1990, vol. 215 (pg. 403-410).

Bofkin L, Goldman N. Variation in evolutionary processes at different codon positions. Mol. Biol. Evol., 2007, vol. 24 (pg. 513).

Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl. Acids Res., 2004, vol. 32 (pg. 1792-97).

Etherington JG, et al. Recombination Analysis Tool (RAT): a program for the high-throughput detection of recombination. Bioinformatics, 2005, vol. 21 (pg. 278-281).

Felsenstein J. PHYLIP (Phylogeny Inference Package) version 3.6. 2005. Department of Genome Sciences, University of Washington, Seattle. Distributed by the author.

Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology, 2003, vol. 52:5 (pg. 696-704).

Joye PA, Boutell T. gdLibrary 2.0.34 software application. 2007.

Katoh K, et al. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucl. Acids Res., 2005, vol. 33 (pg. 511).

Milne I, et al. TOPALi: software for automatic identification of recombinant sequences within DNA multiple alignments. Bioinformatics, 2004, vol. 20 (pg. 1806-1807).

Posada D, et al. Recombination in evolutionary genomics. Annu. Rev. Genet., 2002, vol. 36 (pg. 75-97).

Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol., 1981, vol. 147 (pg. 195-197).

Thompson JD, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res., 1994, vol. 22 (pg. 4673-4680).

Author notes

Associate Editor: Martin Bishop
