Yongbing Zhao, Xinmiao Jia, Junhui Yang, Yunchao Ling, Zhang Zhang, Jun Yu, Jiayan Wu, Jingfa Xiao, PanGP: A tool for quickly analyzing bacterial pan-genome profile, Bioinformatics, Volume 30, Issue 9, 1 May 2014, Pages 1297–1299, https://doi.org/10.1093/bioinformatics/btu017
Abstract
Summary: Pan-genome analyses have shed light on the dynamics and evolution of bacterial genomes at the population level. The explosive growth of bacterial genome sequences has also made pan-genome profile analysis extremely challenging. We developed a tool, named PanGP, to perform pan-genome profile analysis efficiently for large numbers of strains. PanGP integrates two sampling algorithms, totally random (TR) and distance guide (DG). The DG algorithm draws sample strain combinations on the basis of the genome diversity of the bacterial population. The performance of these two algorithms was evaluated on four bacterial populations with strain numbers varying from 30 to 200, and the DG algorithm showed a clear advantage over the TR algorithm in accuracy and stability.
Availability: PanGP was developed with a user-friendly graphical interface and is available at http://PanGP.big.ac.cn.
Contact: xiaojingfa@big.ac.cn or wujy@big.ac.cn
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Since the concept of the ‘pan-genome’ was introduced in 2005 (Medini et al., 2005), pan-genome analyses have shed light on the genomic diversity and evolution of bacterial genomes. To make pan-genome analyses more convenient, several software tools have been designed. Panseq can identify single nucleotide polymorphisms in the core genome and in strain-specific regions (Laing et al., 2010), and PGAP was designed to perform five analytic functions with a single command (Zhao et al., 2012). At the same time, a series of mathematical models has been proposed to characterize the pan-genome profile of bacteria, such as the Streptococcus agalactiae pan-genome model (Tettelin et al., 2005), the Haemophilus influenzae pan-genome model (Hogg et al., 2007), the Heaps' law model (Tettelin et al., 2008) and the infinitely many genes model (Baumdicker et al., 2010). All current models can depict the pan-genome profile of a group of bacteria well, but with a time complexity of $O(2^n)$, where n is the strain number. In this article, we developed two sampling algorithms to perform pan-genome profile analysis efficiently for larger numbers of genomes. These two algorithms were integrated in PanGP, a software tool with a graphical interface.
2 METHODS
2.1 Test dataset
Gene clusters from seven bacterial populations were used to test the performance of the sampling algorithms; the detailed strain list and the gene clustering method are described in Supplementary Material.
2.2 Sampling algorithm
To analyse the pan-genome profile of N strains, the pan-genome size and core genome size of n strains (1 ≤ n ≤ N) are calculated. If a population has N strains, there are $C_N^n$ combinations of n strains. When $C_N^n$ grows large, two sampling algorithms, totally random (TR) and distance guide (DG), are available to sample s combinations of n strains (s: sample size).
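To give a feel for how quickly the number of strain combinations grows, the following minimal sketch counts the n-strain combinations for a population of 30 strains. The function name `combination_counts` is illustrative and is not part of PanGP.

```python
from math import comb

def combination_counts(total_strains: int) -> list[int]:
    """Number of n-strain combinations for n = 1..N."""
    return [comb(total_strains, n) for n in range(1, total_strains + 1)]

counts = combination_counts(30)
print(max(counts))  # largest single term, reached at n = 15
print(sum(counts))  # all non-empty combinations, equal to 2**30 - 1
```

Even at 30 strains, the total exceeds one billion combinations, which is why exhaustive traversal becomes impractical and sampling is used instead.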
For the TR algorithm: when $C_N^n > s \times r$ (r: sample repeat times), randomly sample s non-redundant combinations from the total $C_N^n$ combinations. The average pan-genome size and the average core genome size of these s non-redundant combinations were recorded.
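The TR step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not PanGP's implementation: each strain is represented as a set of gene-cluster IDs, the pan-genome of a combination is taken as the union of those sets and the core genome as their intersection, and the names `pan_core_sizes`, `tr_sample` and the toy data are hypothetical. It assumes 1 ≤ n ≤ N and s ≥ 1.

```python
import random
from math import comb

def pan_core_sizes(strains, combo):
    """Pan-genome = union, core genome = intersection of the gene-cluster sets."""
    sets = [strains[name] for name in combo]
    return len(set.union(*sets)), len(set.intersection(*sets))

def tr_sample(strains, n, s, rng):
    """Draw up to s distinct n-strain combinations uniformly at random and
    return the mean pan-genome and mean core-genome sizes over them."""
    names = sorted(strains)
    s = min(s, comb(len(names), n))  # cannot draw more than exist
    chosen = set()
    while len(chosen) < s:
        # Sorting makes each combination order-independent, so the set
        # rejects duplicates (the "non-redundant" requirement).
        chosen.add(tuple(sorted(rng.sample(names, n))))
    sizes = [pan_core_sizes(strains, c) for c in chosen]
    mean_pan = sum(p for p, _ in sizes) / len(sizes)
    mean_core = sum(c for _, c in sizes) / len(sizes)
    return mean_pan, mean_core

# Toy data (hypothetical): strain name -> set of gene-cluster IDs.
toy = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 2, 6},
    "D": {1, 2, 3, 4, 7},
}
print(tr_sample(toy, n=2, s=3, rng=random.Random(0)))
```

Rejection sampling into a set is adequate when s is small relative to the number of combinations, which is the regime in which sampling is used at all.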
For the DG algorithm: when $C_N^n > s \times k \times r$, sample $s \times k$ non-redundant combinations from the total $C_N^n$ combinations (k: amplification coefficient, which can be modified to control the sample size used for evaluating the genome diversity of the total combinations). Calculate the Dev_geneCluster value of these $s \times k$ combinations, sort all $s \times k$ combinations by their Dev_geneCluster value and sample s combinations from end to end at a constant interval. The average pan-genome size and the average core genome size of these s combinations were recorded.
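The DG step can be sketched in the same style: draw an enlarged pool of combinations, rank the pool by a diversity score and keep s combinations spaced evenly along the ranking. This is a hedged sketch, not PanGP's code. In particular, the exact Dev_geneCluster formula is not given here, so `mean_pairwise_difference` (the mean number of gene clusters present in only one strain of each pair) is an assumed stand-in, and all function names and the toy data are hypothetical.

```python
import random
from itertools import combinations
from math import comb

def mean_pairwise_difference(strains, combo):
    """Assumed stand-in for Dev_geneCluster: mean size of the symmetric
    difference between the gene-cluster sets of every strain pair."""
    pairs = list(combinations(combo, 2))
    if not pairs:  # a single strain has no pairs
        return 0.0
    return sum(len(strains[a] ^ strains[b]) for a, b in pairs) / len(pairs)

def evenly_spaced(items, s):
    """Pick s items from the first to the last at a constant index interval."""
    if s >= len(items):
        return list(items)
    if s == 1:
        return [items[len(items) // 2]]
    step = (len(items) - 1) / (s - 1)
    return [items[round(i * step)] for i in range(s)]

def dg_sample(strains, n, s, k, rng):
    """Sample s*k distinct combinations, rank them by diversity, keep s spread
    across the ranking, and return mean pan-genome and core-genome sizes."""
    names = sorted(strains)
    pool_size = min(s * k, comb(len(names), n))
    pool = set()
    while len(pool) < pool_size:
        pool.add(tuple(sorted(rng.sample(names, n))))
    # The combination itself breaks ties so the ranking is deterministic.
    ranked = sorted(pool, key=lambda c: (mean_pairwise_difference(strains, c), c))
    picked = evenly_spaced(ranked, s)
    pans = [len(set.union(*(strains[x] for x in c))) for c in picked]
    cores = [len(set.intersection(*(strains[x] for x in c))) for c in picked]
    return sum(pans) / len(pans), sum(cores) / len(cores)

# Toy data (hypothetical): strain name -> set of gene-cluster IDs.
toy = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 2, 6},
    "D": {1, 2, 3, 4, 7},
}
print(dg_sample(toy, n=2, s=3, k=2, rng=random.Random(0)))
```

Taking samples at a constant interval along the ranking means that low-, mid- and high-diversity combinations are all represented in each repeat, which is the intuition behind the stability the DG algorithm aims for.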
The above sampling process was repeated r times for both the TR and DG algorithms, and the averages over the r repeats were taken as the pan-genome size and core genome size of n strains, respectively.
2.3 Genome diversity characterizing models in the DG algorithm
In the DG algorithm, genome diversity was taken as the criterion for sampling a certain number of strain combinations. To find a suitable model for characterizing genome diversity, three models were proposed. In the following models, G(n) represents a combination consisting of n strains, and i and j represent the ith and jth strains in G(n). In a population with N strains, there are $C_N^n$ G(n) combinations.
Model A: the branch length and the node depth between the ith and jth strains on the phylogenetic tree were calculated (the calculation methods are described in Supplementary Fig. S1 in Supplementary Material). The average branch length or the average node depth over every strain pair in a given combination G(n) represented the genome diversity of G(n). In model A, six methods were provided, which are described in Supplementary Material.
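Read literally, the average over every strain pair in G(n) can be written as below. This is a sketch under assumptions: the symbols $BL_{ij}$ and $ND_{ij}$ (branch length and node depth between strains i and j) are introduced here for illustration and may not match the authors' notation, and the six methods of model A may differ from this plain arithmetic mean.

```latex
\overline{BL}_{G(n)} = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} BL_{ij},
\qquad
\overline{ND}_{G(n)} = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} ND_{ij}
```

The factor $2/(n(n-1))$ is simply one over the number of unordered strain pairs in a combination of n strains.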
Model B: genome diversity was characterized from the gene number of each strain in G(n), using the Sum_geneNum value and the Dev_geneNum value.
Model C: the Dev_geneCluster value of G(n) was calculated from the number of different gene clusters between the ith and jth strains in G(n).
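The pairwise quantity used here, the number of different gene clusters between two strains, can be illustrated with set operations. The sketch assumes that "different" means present in exactly one of the two strains (a symmetric difference); the function name and toy data are hypothetical.

```python
from itertools import combinations

def pairwise_cluster_differences(strains):
    """Number of gene clusters present in exactly one strain of each pair."""
    return {
        (a, b): len(strains[a] ^ strains[b])
        for a, b in combinations(sorted(strains), 2)
    }

# Toy data (hypothetical): strain name -> set of gene-cluster IDs.
toy = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {1, 2, 6}}
diffs = pairwise_cluster_differences(toy)
print(diffs)  # {('A', 'B'): 2, ('A', 'C'): 3, ('B', 'C'): 3}
```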
2.4 Evaluate the performance of sampling algorithm
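Accuracy was evaluated from the difference between the simulation result and the real data, for both pan-genome size and core genome size, at each point.
Stability was evaluated from the difference between the ith and the jth simulation results, for both pan-genome size and core genome size, at each point (1 ≤ i < j ≤ 100).
2.5 Nonlinear fitting for the pan-genome profile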
When the pan-genome size and core genome size of n strains (1 ≤ n ≤ N) were available, a series of mathematical models (Supplementary Material) were used to depict pan-genome size, core genome size and new gene size.
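As an illustration of fitting a curve to such a profile, the sketch below fits a generic power law y = a·x^b to new gene counts. The models PanGP actually uses are defined in the Supplementary Material and are not reproduced here; the power-law form is an assumption in the spirit of the Heaps' law model, and ordinary least squares on log-transformed data is swapped in for true nonlinear fitting to keep the example dependency-free. The data are synthetic.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via linear regression on (ln x, ln y).
    Requires all x and y values to be strictly positive."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum(
        (u - mx) ** 2 for u in lx
    )
    a = math.exp(my - b * mx)
    return a, b

# Synthetic new-gene counts (hypothetical), generated from y = 120 * n**-0.8.
ns = list(range(2, 31))
new_genes = [120 * n ** -0.8 for n in ns]
a, b = fit_power_law(ns, new_genes)
print(round(a, 3), round(b, 3))
```

Fitting in log space weights relative rather than absolute errors, so on noisy real data the result can differ from a direct nonlinear least-squares fit.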
3 RESULTS
The performance of sampling algorithms. Supplementary Fig. S1 showed that genome diversity, characterized by the Dev_geneCluster method in Model C, was strongly correlated with pan-genome size, and that differences in gene cluster number may be the primary factor affecting pan-genome size. Hence, we took Dev_geneCluster as the method for characterizing genome diversity in the DG algorithm. The performance of the two sampling algorithms was assessed in three respects: accuracy, stability and time cost. As shown in Supplementary Figs S2–S4, the result from the DG algorithm was closer to the real data and more stable than the TR result with the same sample size. For the same species, population size had almost no effect on the accuracy and stability of the two sampling algorithms. When the sample size was set to 500, the error as a proportion of the total gene clusters, either between the DG result and the real data or between any two sample results from the DG algorithm, was <0.1% in all test populations. Regarding time cost, the DG algorithm took slightly more time than the TR algorithm with the same sample size (Supplementary Table S1). When the sample size was set to 500, the analysis for ≤50 genomes could be finished within 3 min.
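The error proportion quoted above can be illustrated as follows. The exact formula is not stated in this text, so the sketch assumes it is the absolute difference between two curves at a given point divided by the total number of gene clusters; the function names and all numbers are hypothetical.

```python
def error_proportion(estimate, reference, total_gene_clusters):
    """Absolute difference at one point, as a fraction of all gene clusters."""
    return abs(estimate - reference) / total_gene_clusters

def max_error_proportion(estimates, references, total_gene_clusters):
    """Worst-case error proportion over all points of a profile."""
    return max(
        error_proportion(e, r, total_gene_clusters)
        for e, r in zip(estimates, references)
    )

# Hypothetical pan-genome sizes at three points: sampled versus exhaustive.
sampled = [4000.0, 5203.5, 6011.0]
real = [4000, 5200, 6015]
worst = max_error_proportion(sampled, real, total_gene_clusters=10000)
print(worst < 0.001)  # True: below the 0.1% level
```

The same function applies to the stability comparison by passing two sampled profiles instead of a sampled profile and the real data.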
Application for PanGP. Among pan-genome analysis tools, PGAP, a comprehensive pan-genome analysis pipeline with five modules, can only deal with small numbers of genomes in its pan-genome profile analysis module, whereas PanGP is a highly efficient tool for large-scale bacterial pan-genome profile analysis based on sampling algorithms. Besides the TR and DG algorithms, a third algorithm, traverse all (TA), is also provided in PanGP. When the TA algorithm is selected, the pan-genome profile is analysed without sampling. PanGP requires ortholog information as input data, which can be generated by a range of software, such as PGAP (Zhao et al., 2012), OrthoMCL (Li et al., 2003) and PanOCT (Fouts et al., 2012), in the format described in the online manual. When sampling has finished, the pan-genome profile is presented as curve images (shown in Supplementary Fig. S5). The curve images can be revised and exported via the user-friendly graphical interface, and non-linear fitting with three mathematical models is also available for pan-genome size, core genome size and new gene size. Further usage details for PanGP are provided in the online manual.
Funding: The National Basic Research and Development Program (973 Program; 2010CB126604 to J.X.); National Programs for High Technology Research and Development (863 Program; 2012AA020409), the Ministry of Science and Technology of People’s Republic of China; Key Program of Chinese Academy of Sciences grant (KSZD-EW-TZ-009-02 to J.X).
Conflict of Interest: none declared.
Author notes
Associate Editor: John Hancock







