ConSpeciFix: classifying prokaryotic species based on gene flow

Bobay, Louis-Marie; Ellis, Brian Shin-Hua; Ochman, Howard

doi:10.1093/bioinformatics/bty400

Abstract

Summary

Classification of prokaryotic species is usually based on sequence similarity thresholds, which are easy to apply but lack a biologically-relevant foundation. Here, we present ConSpeciFix, a program that classifies prokaryotes into species using criteria set forth by the Biological Species Concept, thereby unifying species definition in all domains of life.

Availability and implementation

ConSpeciFix’s webserver is freely available at www.conspecifix.com. The local version of the program can be freely downloaded from https://github.com/Bobay-Ochman/ConSpeciFix. ConSpeciFix is written in Python 2.7 and requires the following dependencies: Usearch, MCL, MAFFT and RAxML.

1 Introduction

The most common method for defining species is the Biological Species Concept (BSC) (Mayr, 1942), which assigns individuals as members of the same species by their ability to reproduce with one another. But because the BSC relies on sexual reproduction (i.e. gene flow, homologous recombination), it is not thought to be applicable to prokaryotes, i.e. bacteria and archaea, which reproduce asexually. However, gene exchange is now known to be prevalent among prokaryotes (Shapiro and Polz, 2015; Vos and Didelot, 2009) opening the possibility that prokaryotes can be classified into species according to the BSC (Bobay and Ochman, 2017). To this end, we developed a unique tool, ConSpeciFix, that uses genomic information to classify organisms into species based on gene flow. In this context, ‘gene flow’ refers to homologous recombination events among genes in the core genome, i.e. single-copy orthologs that are shared by at least 85% of the strains considered. This contrasts recent approaches that assess species boundaries based on non-homologous gene transfers (Moldovan and Gelfand, 2018). The ConSpeciFix pipeline is available on GitHub and also implemented on a webserver, allowing users to directly upload and classify microbial genomes against a pre-processed database.

2 Materials and methods

Our procedure consists of testing for gene flow between candidate genomes and the genomes in our database, which have already been assigned to species. Gene flow is estimated by the inference of homoplasic alleles (h) relative to non-homoplasic alleles (m). Homoplasies are alleles that are not inherited vertically from a single ancestor and likely result from gene flow. We infer homoplasies with a distance-based approach: For each set of genomes, we first compute the pairwise distances D on the concatenate of the core genome using a maximum likelihood method (Stamatakis, 2006). In the simple scenario in which only two variants are present at a site, we classify the more frequent variant as the major allele (N₀) and the least frequent as the minor allele (N₁). We then compare the core genome distances between strains harboring alleles N₁ and N₀. The minor allele is considered homoplasic if max(D_N1N1) > min(D_N0N1), where max(D_N1N1) represents the maximal core genome distance between strains harboring the minor allele and min(D_N0N1) represents the minimal core genome distance between pairs of strains harboring the minor allele and the major allele. When more than three or four alleles are present at a site, multiple minor alleles are defined, and each is compared to the major allele (without including the other minor alleles in the comparison). Because homoplasies can also be introduced by independent convergent mutations, we include a feature to display the proportion normally attributable to this process based on simulations computed across 93 prokaryotic species.

3 Implementation

ConSpeciFix is hosted online at www.conspecifix.com. This server accommodates individual user-uploaded genomes, which can be evaluated for gene flow against a pre-processed database of curated species. When analyzing multiple genomes (e.g. in endeavors to define new species), a variant of ConSpeciFix and installation instructions can be downloaded from GitHub.

Database building. To build the pre-processed database, we re-classified all complete genomes of prokaryotes available on the NCBI RefSeq repository as of October 2017. We downloaded the genomes for each named species represented by 15 or more genomes. (Analyses based on fewer than 15 genomes generally do not yield robust estimates of gene flow.) A core genome was defined for each species, and gene flow was estimated with a distance-based approach, as in Bobay and Ochman (2017). Within each named species, genomes with substantially lower h/m ratios were removed from the dataset, and all remaining genomes were considered members of the same BSC-defined species. The redefined dataset of species currently contains 11 127 genomes and 236 BSC-defined species.

Testing candidate genomes. Our webserver allows users to test candidate genomes for gene flow against any of the designated BSC-defined species in the database. After selecting one of the 236 species and uploading the query genome (which can be complete or incomplete), ConSpeciFix identifies the core genes shared between the candidate genome and the selected species using Usearch Global v6.1 (Edgar, 2010). This core genome is used to re-estimate the distance matrix with RAxML v8.0 (Stamatakis, 2006), from which homoplasic and non-homoplasic alleles are inferred. Ratios of homoplasic and non-homoplasic alleles (h/m) are computed for randomly sampled subsets of genomes; and from this step, graphs and statistics comparing h/m ratios between sets of genomes that include and exclude the candidate genome are generated. Finally, ConSpeciFix estimates whether the candidate genome should be considered as part of the selected species based on an exclusion criterion by testing whether the inclusion of the candidate genome leads to a substantial drop in gene flow. The stringency of our original exclusion criterion (Bobay and Ochman, 2017) sometimes resulted in a failure to identify sexually-isolated strains in cases where the h/m graph did not quickly reach a plateau. To address this issue, the exclusion criterion now relies on the Chi² outlier method (Dixon, 1950) implemented in the R package ‘outliers’, which allows detection of sexually-isolated strains when h/m graphs do not quickly reach a plateau. Additional metrics are returned to the user, such as the number of orthologous genes shared between the candidate genome and the number of core genes shared with the tested species. Note that our database includes numerous clonal species, which are not expected to engage in gene flow even when strains have very similar genomes. Clonal species can be identified based on their consistently low h/m ratios (typically below 0.2), as expected under evolution strictly driven by mutations (red lines in graphs, Fig. 1). For such species, we recommend that users resort to the average nucleotide identity (ANI) metric that are returned in the result statistics, a metric that is commonly applied to define prokaryotic species based on genomic data (Konstantinidis and Tiedje, 2005).

Fig. 1.

Open in new tab Download slide

ConSpeciFix output graphs with and without the tested candidate genome. Graphs show h/m ratios estimated across different subsets of B.bifidum genomes (A) and across different subsets of B.bifidum genomes that include the B.longum 2-2B genome (B). For each number of subsampled genomes (x-axis), black dots represent the median h/m ratio estimated for 100 samplings, with grey areas indicating the standard deviation. Dashed and dotted lines indicate average and maximal h/m ratios, respectively, expected when homoplasies are generated solely by convergent mutations, as estimated for 93 species (Bobay and Ochman, 2017)

Exploratory phase. Users can forego the initial step of species selection and compare candidate genomes against an extended database (which includes our redefined dataset of 236 species plus a representative genome for each additional species present in RefSeq, yielding a total of 7105 species) to determine which species might be most suitable for evaluation. Using the ‘exploratory phase’ option, ConSpeciFix detects the most closely related species in the extended database based on a set of 45 protein coding genes that are shared across prokaryotes (Raymann, et al., 2015) by computing the number of orthologous genes shared with each species and their average nucleotide identity.

4 Application

To illustrate the ConSpeciFix process, we tested whether Bifidobacterium longum strain 2-2B engages in gene flow with strains of Bifidobacterium bifidum. The analysis returns two main graphs, one (Fig. 1A) showing the estimated rate of gene flow (with h/m ratios) across randomly sampled subsets of B.bifidum genomes, and a second (Fig. 1B) representing h/m ratios computed across subsets of B.bifidum genomes but including the tested B.longum genome. This example shows that inclusion of the B.longum 2-2B genome induces a sharp drop in h/m ratios, indicating that this strain does not engage in gene flow with B.bifidum.

Acknowledgements

We thank Kim Hammond for assistance with figures and two anonymous reviewers.

Funding

This work was supported by NIH grant R35GM118038 to HO.

Conflict of Interest: none declared.

References

Bobay

L.M.

,

Ochman

H.

(

2017

)

Biological species are universal across Life's domains

.

Genome Biol. Evol

.,

9

,

491

–

501

.

Dixon

W.J.

(

1950

)

Analysis of extreme values

.

Ann. Math. Stat

.,

21

,

488

–

506

.

Google Scholar

Crossref

WorldCat

Edgar

R.C.

(

2010

)

Search and clustering orders of magnitude faster than BLAST

.

Bioinformatics

,

26

,

2460

–

2461

.

Konstantinidis

K.T.

,

Tiedje

J.M.

(

2005

)

Genomic insights that advance the species definition for prokaryotes

.

Proc. Natl. Acad. Sci. USA

,

102

,

2567

–

2572

.

Google Scholar

Crossref

WorldCat

Mayr

E.

(

1942

)

Systematics and the Origin of Species

.

Columbia University Press

,

New York

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Moldovan

M.A.

,

Gelfand

M.S.

(

2018

)

Pangenomic definition of prokaryotic species and the phylogenetic structure of Prochlorococcus spp

.

Front. Microbiol

.,

9

,

428.

Raymann

K.

et al. (

2015

)

The two-domain tree of life is linked to a new root for the Archaea

.

Proc. Natl. Acad. Sci. USA

,

112

,

6670

–

6675

.

Google Scholar

Crossref

WorldCat

Shapiro

B.J.

,

Polz

M.F.

(

2015

)

Microbial speciation

.

Cold Spring Harb. Perspect. Biol

.,

7

,

a018143.

Stamatakis

A.

(

2006

)

RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models

.

Bioinformatics

,

22

,

2688

–

2690

.

Vos

M.

,

Didelot

X.

(

2009

)

A comparison of homologous recombination rates in bacteria and archaea

.

ISME J

.,

3

,

199

–

208

.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Associate Editor:

Download all slides

Month:	Total Views:
May 2018	56
June 2018	39
July 2018	16
August 2018	41
September 2018	23
October 2018	120
November 2018	100
December 2018	24
January 2019	18
February 2019	27
March 2019	27
April 2019	25
May 2019	20
June 2019	17
July 2019	16
August 2019	12
September 2019	17
October 2019	17
November 2019	29
December 2019	25
January 2020	27
February 2020	20
March 2020	22
April 2020	35
May 2020	16
June 2020	60
July 2020	39
August 2020	15
September 2020	20
October 2020	27
November 2020	15
December 2020	16
January 2021	19
February 2021	8
March 2021	18
April 2021	15
May 2021	45
June 2021	33
July 2021	17
August 2021	18
September 2021	30
October 2021	35
November 2021	12
December 2021	16
January 2022	22
February 2022	53
March 2022	19
April 2022	39
May 2022	25
June 2022	29
July 2022	32
August 2022	18
September 2022	14
October 2022	34
November 2022	28
December 2022	14
January 2023	14
February 2023	45
March 2023	47
April 2023	73
May 2023	26
June 2023	13
July 2023	32
August 2023	20
September 2023	15
October 2023	37
November 2023	20
December 2023	33
January 2024	30
February 2024	34
March 2024	32
April 2024	30

Article Contents

ConSpeciFix: classifying prokaryotic species based on gene flow

Abstract

1 Introduction

2 Materials and methods

3 Implementation

4 Application

Acknowledgements

Funding

References

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

ConSpeciFix: classifying prokaryotic species based on gene flow

Abstract

1 Introduction

2 Materials and methods

3 Implementation

4 Application

Acknowledgements

Funding

References

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only