Abstract

Summary: Count is a software package for the analysis of numerical profiles on a phylogeny. It is primarily designed to deal with profiles derived from the phyletic distribution of homologous gene families, but is suited to study any other integer-valued evolutionary characters. Count performs ancestral reconstruction, and infers family- and lineage-specific characteristics along the evolutionary tree. It implements popular methods employed in gene content analysis such as Dollo and Wagner parsimony, propensity for gene loss, as well as probabilistic methods involving a phylogenetic birth-and-death model.

Availability: Count is available as a stand-alone Java application, as well as an application bundle for MacOS X, at the web site http://www.iro.umontreal.ca/~csuros/gene_content/count.html. It can also be launched using Java Webstart from the same site. The software is distributed under a BSD-style license. Source code is available upon request from the author.

Contact: csuros@iro.umontreal.ca

1 INTRODUCTION

Some aspects of genome evolution are best captured by integer quantities. Given a phylogeny with terminal taxa 𝒳, such a quantity forms a numerical profile Φ : 𝒳 → {0, 1, 2, …}, which extends the so-called phylogenetic profile of presence–absence (Koonin and Galperin, 2002; Pellegrini et al., 1999). In a typical application, Φ[x] denotes the number of genes in genome x ∈ 𝒳 for a certain homolog gene family: a homolog family comprises all descendants of the same ancestral gene (Fitch, 2000) in evolutionary lineages. Such families are routinely identified by pairwise sequence comparisons, coupled with the clustering of postulated homolog pairs (Alexeyenko et al., 2006; Tatusov et al., 1997). In other interesting examples, Φ[x] might be the size (Caetano-Anollés, 2005) of genome x or a sequence length polymorphism in population x (Witmer et al., 2003).
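
As a minimal illustration of these definitions, a numerical profile can be held as a mapping from terminal taxa to counts, and the classical presence–absence profile is obtained by thresholding. The taxon names and counts below are hypothetical, and the representation is an assumption of this sketch rather than Count's internal data structure.

```python
# Hypothetical family sizes: profile[x] plays the role of Phi[x],
# the number of genes of one homolog family in genome x.
profile = {"genomeA": 2, "genomeB": 0, "genomeC": 1, "genomeD": 5}


def presence_absence(profile):
    """Reduce a numerical profile to a 0/1 phylogenetic profile."""
    return {taxon: int(count > 0) for taxon, count in profile.items()}


binary_profile = presence_absence(profile)
print(binary_profile)
```

The numerical profile keeps the family size in each genome, which the binary profile discards.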

Given a phylogeny, an evolutionary character's history can be inferred by various means in order to reconstruct its state at ancestral nodes or to estimate the tempo of evolution (Pagel, 1999). The Count software package provides a convenient graphical user interface to sophisticated computational methods for such analyses, and for the manipulation of datasets involving numerical profiles. Count has already been used to study the evolution of gene repertoire in Archaea (Csűrös and Miklós, 2009) and nucleo-cytoplasmic DNA viruses (Yutin et al., 2009).

2 FEATURES

Count is designed primarily to work with a dataset of numerical profiles for homolog gene families. It allows for combining multiple profiles with various annotations, as found in databases of clustered homolog families such as COG (Tatusov et al., 1997). Profiles can be filtered by criteria based on presence, membership count and annotations, in order to compile winnowed datasets for further analysis.
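
The kind of filtering described above can be sketched as follows. This is an illustrative example only: the function name, its parameters and the sample families are hypothetical and do not reflect Count's actual interface. It keeps families that are present in a minimum number of lineages and that do not exceed a total membership count.

```python
def filter_families(profiles, min_lineages=1, max_members=None):
    """Keep families present in at least `min_lineages` taxa and, if
    `max_members` is given, with at most that many genes in total."""
    kept = {}
    for family, profile in profiles.items():
        lineages = sum(1 for count in profile.values() if count > 0)
        members = sum(profile.values())
        if lineages < min_lineages:
            continue
        if max_members is not None and members > max_members:
            continue
        kept[family] = profile
    return kept


# Hypothetical dataset: three families over three genomes.
profiles = {
    "fam1": {"A": 1, "B": 0, "C": 0},
    "fam2": {"A": 2, "B": 1, "C": 3},
    "fam3": {"A": 0, "B": 0, "C": 0},
}
widespread = filter_families(profiles, min_lineages=2)
small = filter_families(profiles, min_lineages=1, max_members=5)
```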

Given an evolutionary tree T, Count computes the states ξ[u] at tree nodes u ∈ T, based on each profile Φ, by imposing ξ[u] = Φ[u] for all terminal taxa u. In parsimony approaches, the ancestral reconstruction minimizes a criterion based on the implied state changes ξ[u] → ξ[v] over the edges uv. Alternatively, Count works with so-called phylogenetic birth-and-death models that consider (ξ[u]: u ∈ T) as a random variable with a well-defined distribution.

2.1 Parsimony

Count implements Dollo parsimony (Farris, 1977) and Wagner parsimony (Farris, 1970). In case of the latter, it also implements an asymmetric version (Csűrös, 2008) that penalizes losses and gains differently. Count also computes Propensity for Gene Loss (Krylov et al., 2003), which quantifies the frequency of loss for each family using Dollo parsimony.
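
To illustrate the asymmetric Wagner criterion, the sketch below minimizes the total cost of state changes with a generic Sankoff-style dynamic program. It is not the algorithm of Csűrös (2008) and not Count's implementation. It assumes that a gain of k copies on an edge costs gain_penalty × k, that a loss of k copies costs k, and that ancestral states are restricted to the range from 0 to the largest observed count. The tree, node names and profile are hypothetical.

```python
def wagner_parsimony(children, root, profile, gain_penalty=1.0):
    """Return (minimum cost, node states) under asymmetric Wagner parsimony.

    children: dict mapping each internal node to a list of its child nodes
    profile:  dict mapping each terminal taxon to its observed count
    """
    states = range(max(profile.values()) + 1)
    inf = float("inf")

    def edge_cost(parent_state, child_state):
        diff = child_state - parent_state
        return gain_penalty * diff if diff > 0 else -diff

    cost = {}  # cost[u][s]: best score of the subtree below u, given state s at u
    best = {}  # best[(u, v)][s]: best state at child v, given state s at u

    def up(u):
        if u in profile:  # terminal node: the state is fixed by the profile
            cost[u] = [0.0 if s == profile[u] else inf for s in states]
            return
        cost[u] = [0.0 for _ in states]
        for v in children[u]:
            up(v)
            choice = []
            for s in states:
                t = min(states, key=lambda t: cost[v][t] + edge_cost(s, t))
                choice.append(t)
                cost[u][s] += cost[v][t] + edge_cost(s, t)
            best[(u, v)] = choice

    up(root)
    root_state = min(states, key=lambda s: cost[root][s])
    assignment = {}

    def down(u, s):  # trace back the optimal choices from the root
        assignment[u] = s
        for v in children.get(u, []):
            down(v, best[(u, v)][s])

    down(root, root_state)
    return cost[root][root_state], assignment


# Hypothetical four-taxon tree ((A,B)n1,(C,D)n2)root and profile.
children = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}
profile = {"A": 2, "B": 2, "C": 0, "D": 1}
total_cost, states = wagner_parsimony(children, "root", profile, gain_penalty=2.0)
```

With a gain penalty above 1, the reconstruction favors larger ancestral states and explains the data by losses rather than gains.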

2.2 Phylogenetic birth-and-death models

The probabilistic model employed in Count relies on linear birth-death-immigration processes (Kendall, 1949), commonly used to model population growth and queuing systems. In the general phylogenetic birth-and-death model, three rates are assigned to each branch: gene loss rate μ, gene duplication rate λ and a gain rate κ. ‘Gain’ covers multiple phenomena without specifying the origin of the gain, including de novo gene formation and lateral gene transfer. Specifically, character evolution on each edge uv with length τ is stochastically determined by a continuous time Markov process X with X(0) = ξ[u] and X(τ) = ξ[v]. The process is characterized by the gain rate κ, loss rate μ and duplication rate λ: for 0 < n, 0 ≤ t ≤ τ and any 0 < δ,  

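\[
\begin{aligned}
\mathbb{P}\{X(t+\delta) = n+1 \mid X(t) = n\} &= (\kappa + n\lambda)\,\delta + o(\delta),\\
\mathbb{P}\{X(t+\delta) = n-1 \mid X(t) = n\} &= n\mu\,\delta + o(\delta),\\
\mathbb{P}\{X(t+\delta) = n \mid X(t) = n\} &= 1 - \bigl(\kappa + n(\lambda+\mu)\bigr)\,\delta + o(\delta).
\end{aligned}
\]
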
Less general models may forbid gain (κ = 0), or duplication (λ = 0), or even both. Paralogs evolve independently in this model, capturing the birth-and-death evolution of multigene families (Nei and Rooney, 2005), as opposed to concerted evolution, or events involving multiple members at a time. The standard pruning algorithm (Felsenstein, 1973) for computing likelihoods cannot be used with numerical characters, because the ancestral state space is not bounded. Adequate algorithms were proposed for κ = 0, λ > 0 (Arvestad et al., 2004, 2009) and for κ, λ > 0 (Csűrös and Miklós, 2006). Count computes the likelihood using our previously described algorithm (Csűrös and Miklós, 2009), which applies to the general model and all the restricted models. Count allows for rate variation across branches and gene families. Model parameters are set by maximizing the likelihood. The optimized model can be used for ancestral reconstruction and to infer lineage-specific trends by using posterior probabilities conditioned on the profiles.
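
To make the rate parameters concrete, the process on a single edge can be simulated directly. The sketch below assumes the standard linear birth-death-immigration rates, in which a family of size n grows to n + 1 at rate κ + nλ and shrinks to n − 1 at rate nμ. It uses a plain Gillespie simulation, which is not how Count computes likelihoods, and all function names and parameter values are hypothetical. The simulated mean is compared with the closed form that follows from dm/dt = κ + (λ − μ)m.

```python
import math
import random


def simulate_branch(n0, kappa, lam, mu, tau, rng):
    """Simulate X(tau) on an edge of length tau, starting from X(0) = n0."""
    n, t = n0, 0.0
    while True:
        up = kappa + n * lam   # rate of n -> n + 1 (gain or duplication)
        down = n * mu          # rate of n -> n - 1 (loss)
        total = up + down
        if total == 0.0:       # absorbing state: n = 0 and no gain
            return n
        t += rng.expovariate(total)  # waiting time to the next event
        if t > tau:
            return n
        n += 1 if rng.random() * total < up else -1


def expected_size(n0, kappa, lam, mu, tau):
    """Mean of X(tau), solving dm/dt = kappa + (lam - mu) m (lam != mu)."""
    r = lam - mu
    growth = math.exp(r * tau)
    return n0 * growth + kappa / r * (growth - 1.0)


rng = random.Random(1)
samples = [simulate_branch(2, 0.5, 0.3, 0.8, 1.0, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
analytic_mean = expected_size(2, 0.5, 0.3, 0.8, 1.0)
```

Setting kappa or lam to zero gives the restricted models without gain or without duplication.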

2.3 User interaction

Figure 1 illustrates the rich graphical user interface of Count. The program can work with multiple datasets and models at the same time, which helps in comparing different analyses. Entire work sessions can be saved, and individual analysis results can be exported into tab-delimited text files for use with other programs such as spreadsheet tools. Main software components (rate optimization and ancestral reconstruction) can also be launched from the command line without invoking the graphical interface.

Fig. 1.

Some graphical displays in Count. On the left, ancestral reconstruction using posterior probabilities. On the right, display of a phylogenetic birth-and-death model. Annotated features:

(1) Dataset of numerical (phylogenetic) profiles.
(2) Gene family annotations loaded from separate file.
(3) Small profile logo for each family (black bars show Φ[u] at terminal nodes).
(4) Aggregate family-specific information on number of branches where the family was lost, gained, expanded and contracted (estimated as expectations, hence the fractional values).
(5) Multiple family selection by cell content or the mouse. Selection is reflected on the content of the top-right table and the bottom tree. The top-right table shows lineage-specific aggregate information on number of families lost, gained, expanded and contracted on each branch. The bottom tree shows the inferred probabilities for family presence and absence at ancestral nodes (filled rectangles).
(6) Lineage selection by the mouse in the table row or the node of the bottom tree. The selection brings up more detailed information at the corresponding node in the bottom tree.
(7) Lineage-specific rates displayed in the top-left table, and depicted on the bottom tree, along with a legend.
(8) Family-specific rate variation depicted on the top-right.
(9) Lineage selection by the mouse in the table row or the node of the bottom tree. The selection brings up more detailed information at the corresponding node in the bottom tree.

2.4 Implementation

Count is written entirely in Java (Java SE 6), and was tested on various computer platforms, including Microsoft Windows, MacOS X and Linux. Count is also available as an integrated application bundle on MacOS X and as a Java Webstart application. The software is distributed with test data and a detailed User's Guide.

ACKNOWLEDGEMENTS

I am grateful for valuable feedback on the software from Aaron Darling, Dannie Durand, Maureen Stoltzer, Gergely Szöllősi, Natalya Yutin and Yuri Wolf.

Funding: Natural Sciences and Engineering Research Council of Canada grant.

Conflict of Interest: none declared.

REFERENCES

Alexeyenko A, et al. Automatic clustering of orthologs and inparalogs shared by multiple genomes. Bioinformatics, 2006, vol. 22 (pg. e9-e15).
Arvestad L, et al. Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. In Gusfield D (ed.), RECOMB '04: Proceedings of the Eighth Annual International Conference on Research in Computational Molecular Biology. New York, NY: ACM, 2004 (pg. 326-335).
Arvestad L, et al. The gene evolution model and computing its associated probabilities. J. ACM, 2009, vol. 56 (pg. 7).
Caetano-Anollés G. Evolution of genome size in the grasses. Crop Science, 2005, vol. 45 (pg. 1809-1816).
Csűrös M. Ancestral reconstruction by asymmetric Wagner parsimony over continuous characters and squared parsimony over distributions. Proceedings of the Sixth RECOMB Comparative Genomics Satellite Workshop. Vol. 5267 of Springer Lecture Notes in Bioinformatics. Heidelberg, 2008 (pg. 72-86).
Csűrös M, Miklós I. A probabilistic model for gene content evolution with duplication, loss, and horizontal transfer. Proceedings of the Tenth Annual International Conference on Research in Computational Molecular Biology (RECOMB). Vol. 3909 of Springer Lecture Notes in Bioinformatics. Heidelberg, 2006 (pg. 206-220).
Csűrös M, Miklós I. Streamlining and large ancestral genomes in Archaea inferred with a phylogenetic birth-and-death model. Mol. Biol. Evol., 2009, vol. 26 (pg. 2087-2095).
Farris JS. Methods for computing Wagner trees. Syst. Zool., 1970, vol. 19 (pg. 83-92).
Farris JS. Phylogenetic analysis under Dollo's law. Syst. Zool., 1977, vol. 26 (pg. 77-88).
Felsenstein J. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Syst. Zool., 1973, vol. 22 (pg. 240-249).
Fitch WM. Homology: a personal view on some of the problems. Trends Genet., 2000, vol. 16 (pg. 227-231).
Kendall DG. Stochastic processes and population growth. J. R. Stat. Soc. B, 1949, vol. 11 (pg. 230-282).
Koonin EV, Galperin MY. Sequence-Evolution-Function: Computational Approaches in Comparative Genomics. New York: Kluwer Academic Publishers, 2002.
Krylov DM, et al. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res., 2003, vol. 13 (pg. 2229-2235).
Nei M, Rooney AP. Concerted and birth-and-death evolution of multigene families. Ann. Rev. Genet., 2005, vol. 39 (pg. 121-152).
Pagel M. Inferring the historical patterns of biological evolution. Nature, 1999, vol. 401 (pg. 877-884).
Pellegrini M, et al. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 1999, vol. 96 (pg. 4285-4288).
Tatusov RL, et al. A genomic perspective on protein families. Science, 1997, vol. 278 (pg. 631-637).
Witmer PD, et al. The development of a highly informative mouse simple sequence length polymorphism (SSLP) marker set and construction of a mouse family tree using parsimony analysis. Genome Res., 2003, vol. 13 (pg. 485-491).
Yutin N, et al. Eukaryotic large nucleo-cytoplasmic DNA viruses: clusters of orthologous genes and reconstruction of viral genome evolution. Virol. J., 2009, vol. 6 (pg. 223).

Author notes

Associate Editor: David Posada
