Robert Lanfear, Paul B. Frandsen, April M. Wright, Tereza Senfeld, Brett Calcott, PartitionFinder 2: New Methods for Selecting Partitioned Models of Evolution for Molecular and Morphological Phylogenetic Analyses, Molecular Biology and Evolution, Volume 34, Issue 3, March 2017, Pages 772–773, https://doi.org/10.1093/molbev/msw260
PartitionFinder 2 is a program for automatically selecting best-fit partitioning schemes and models of evolution for phylogenetic analyses. PartitionFinder 2 is substantially faster and more efficient than version 1, and incorporates many new methods and features. These include the ability to analyze morphological datasets, new methods to analyze genome-scale datasets, new output formats to facilitate interoperability with downstream software, and many new models of molecular evolution. PartitionFinder 2 is freely available under an open source license and works on Windows, OS X, and Linux operating systems. It can be downloaded from www.robertlanfear.com/partitionfinder. The source code is available at https://github.com/brettc/partitionfinder.
Main Text
In phylogenetic analyses it is important to account for variation in rates and patterns of evolution among sites (Yang 1996; Kumar et al. 2012). Partitioning attempts to achieve this by estimating independent models of molecular evolution for subsets of sites that are deemed to have evolved in similar ways. It can be challenging to choose a good partitioning scheme, because the number of possible schemes can be extremely large.
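To give a sense of scale: if a partitioning scheme is taken to be any grouping of the data blocks into non-empty subsets, then the number of possible schemes for n data blocks is the Bell number B(n). The sketch below is illustrative only and is not part of PartitionFinder; the function name `bell_number` is an assumption made for this example. It computes B(n) with the Bell triangle.

```python
def bell_number(n: int) -> int:
    """Return the Bell number B(n): the number of ways to split
    n distinct items into non-empty, unordered subsets."""
    # Build the Bell triangle row by row; after k iterations,
    # the first entry of the current row is B(k).
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]  # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


if __name__ == "__main__":
    for blocks in (3, 5, 10):
        print(blocks, "data blocks ->", bell_number(blocks), "possible schemes")
```

Even 10 data blocks admit 115,975 schemes under this counting, so exhaustive evaluation quickly becomes impractical as the number of data blocks grows.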
The original version of PartitionFinder (Lanfear et al. 2012) proposed algorithms to automate the selection of a partitioning scheme given a set of user-defined data blocks as input. By combining these algorithms with the selection of models of molecular evolution, PartitionFinder improved and simplified phylogenetic analyses for many users. However, PartitionFinder was written before the advent of phylogenomic datasets such as those produced by sequencing whole genomes (e.g., Jarvis et al. 2014) and transcriptomes (e.g., Misof et al. 2014), and it is too slow to be practical for such datasets. We therefore designed new features and rewrote all of the methods and routines in PartitionFinder; we present the result as PartitionFinder 2.
PartitionFinder 2 includes a number of new features. First, we wrote faster versions of the k-means, relaxed-clustering, and greedy algorithms (Lanfear et al. 2014; Frandsen et al. 2015). We urge caution, however, in relying on purely data-driven approaches to partitioning such as k-means, because we still lack evidence that they perform appropriately under a wide range of simulation conditions (Frandsen et al. 2015). Second, we included a range of new models of evolution, including important recent advances such as the LG4X and LG4M mixture models (Le et al. 2012). Third, we implemented Maximum-Likelihood (ML) starting trees for all analyses, motivated by our observation that model selection methods can be biased by the choice of starting tree (Frandsen et al. 2015). Fourth, we implemented the ability to analyze morphological datasets. Finally, we included a variety of new output formats to improve interoperability with downstream software.
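This note does not describe the search algorithms themselves. As a rough illustration of the general idea behind a greedy search over partitioning schemes, the sketch below starts with every data block in its own subset and repeatedly applies the single pairwise merge that most improves a score, stopping when no merge helps. It is a generic sketch, not PartitionFinder's implementation: the `toy_score` function, the per-block `rates` values, and the `penalty` parameter are all hypothetical stand-ins for a real information criterion computed from fitted models.

```python
from itertools import combinations


def toy_score(scheme, rates, penalty=1.0):
    """Hypothetical score (lower is better): within-subset squared
    deviation of block rates plus a fixed penalty per subset."""
    total = 0.0
    for subset in scheme:
        values = [rates[i] for i in subset]
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total + penalty * len(scheme)


def greedy_search(rates, penalty=1.0):
    """Greedy agglomerative search over groupings of data blocks."""
    # Start with each data block in its own subset.
    scheme = [frozenset([i]) for i in range(len(rates))]
    best = toy_score(scheme, rates, penalty)
    while len(scheme) > 1:
        # Score every scheme reachable by merging one pair of subsets.
        candidates = []
        for a, b in combinations(scheme, 2):
            merged = [s for s in scheme if s not in (a, b)] + [a | b]
            candidates.append((toy_score(merged, rates, penalty), merged))
        cand_score, cand_scheme = min(candidates, key=lambda c: c[0])
        if cand_score >= best:
            break  # no merge improves the score, so stop
        best, scheme = cand_score, cand_scheme
    return sorted(sorted(s) for s in scheme)


if __name__ == "__main__":
    # Four hypothetical data blocks: two slow and two fast.
    print(greedy_search([1.0, 1.1, 5.0, 5.2]))
```

With the toy rates above, the two slow blocks end up in one subset and the two fast blocks in another. Each round scores every pairwise merge, which is why a search of this kind becomes expensive as the number of data blocks grows.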
In addition to new features, we also implemented a number of improvements that enable the efficient analysis of genome-scale datasets. These include: a new alignment parser; more efficient use of multiple processors; a dramatic reduction in the number of files that are written and read; and many improvements in internal and external data storage and processing. These improvements streamline analyses and help to make the best use of the available computational resources.
The net result of the new features and improvements is that PartitionFinder 2 can be dramatically faster than its predecessor, particularly for very large datasets analyzed on computers with many processors. To illustrate this, we compared the performance of version 2.0.0 to version 1.1.1 on two datasets: an insect dataset comprising 2,868 protein domains (each specified as a separate data block) and 595,033 sites from 144 taxa (Misof et al. 2014); and a vertebrate dataset of 56 genes (split into 168 codon-position data blocks) and 25,919 sites from 110 taxa (Fong et al. 2012). We used Maximum Parsimony starting trees in all analyses to enable direct comparisons of execution times. We analyzed the insect dataset on a server with fifty-six 2.6 GHz processors, using the new fast relaxed clustering (rclusterf) algorithm in version 2.0.0, and the original relaxed clustering algorithm in version 1.1.1, both with default settings. Version 2.0.0 was more than 100 times faster than version 1.1.1: it completed the analysis in 35 h, while version 1.1.1 finished less than 1% of the analysis in the same time. We analyzed the vertebrate dataset on a desktop Macintosh computer with eight 4 GHz processors, using the greedy algorithm with precisely the same settings in versions 1.1.1 and 2.0.0. Version 2.0.0 was five times faster than version 1.1.1: it completed the analysis in 108 min compared with 534 min for version 1.1.1.
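The reported figures can be checked with simple arithmetic. The short sketch below uses only numbers stated above; the variable names are chosen for illustration.

```python
# Vertebrate dataset: 56 genes, each split by codon position (3 per gene).
codon_position_blocks = 56 * 3

# Vertebrate dataset, greedy algorithm: run times in minutes.
v1_minutes = 534
v2_minutes = 108
vertebrate_speedup = v1_minutes / v2_minutes

# Insect dataset: version 2.0.0 finished in 35 h, while version 1.1.1
# completed less than 1% of the analysis in the same time. A full run of
# version 1.1.1 would therefore need more than 35 / 0.01 hours, assuming
# its progress is roughly linear in time.
v2_hours = 35
v1_fraction_done_upper_bound = 0.01
v1_hours_lower_bound = v2_hours / v1_fraction_done_upper_bound
insect_speedup_lower_bound = v1_hours_lower_bound / v2_hours

if __name__ == "__main__":
    print("codon-position data blocks:", codon_position_blocks)
    print("vertebrate speedup: %.2f" % vertebrate_speedup)
    print("insect speedup: more than %.0f" % insect_speedup_lower_bound)
```

The vertebrate ratio comes out at about 4.94, consistent with the stated fivefold speedup, and the insect comparison gives a lower bound of 100 under the linear-progress assumption.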
PartitionFinder 2 can be installed by downloading it from the website above, or installing it via GitHub. No other programs need to be compiled, but it does require the installation of Python and a small number of dependencies. These can be managed via a point-and-click installer, following the details outlined in the manual. We hope that PartitionFinder 2 will be useful to the phylogenetics community.
Acknowledgments
RML was supported by the Australian Research Council. AMW was supported by NSF DEB-1256993. This work was supported by the Macquarie University Genes to Geoscience center.
Associate editor: Michael S. Rosenberg