Abstract

AcCNET (Accessory genome Constellation Network) is a Perl application that aims to compare accessory genomes of a large number of genomic units, both at qualitative and quantitative levels. Using the proteomes extracted from the analysed genomes, AcCNET creates a bipartite network compatible with standard network analysis platforms. AcCNET allows merging phylogenetic and functional information about the concerned genomes, thus improving the capability of current methods of network analysis. The AcCNET bipartite network opens a new perspective to explore the pangenome of bacterial species, focusing on the accessory genome behind the idiosyncrasy of a particular strain and/or population.

Availability and Implementation

AcCNET is available under GNU General Public License version 3.0 (GPLv3) from http://sourceforge.net/projects/accnet

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Recent developments in network-based methods (sequence-similarity networks, genome networks, gene family graphs) have significantly improved the usability of molecular datasets in comparative genomics analysis. Furthermore, they open a way to explore reticulate evolutionary processes (Corel et al., 2016). A wide range of bioinformatic tools is available for characterizing the core genome and pangenome of bacterial species or for carrying out phylogenomic analysis. Unfortunately, only a fraction of them allow a specific analysis of accessory genomes (Agren et al., 2012; Contreras-Moreira and Vinuesa, 2013). Here, we present AcCNET (Accessory Genome Constellation Network), a tool based on bipartite networks that permits the simultaneous analysis of the accessory genes of a large number of genomes. Because it works with proteomes, AcCNET does not require synteny information in the datasets, which makes it well suited to contig sets produced by de novo genome sequencing. AcCNET offers a comprehensive 2D view of genome-proteome relationships associated with the adaptive attributes of single strains and/or specialized populations. Therefore, AcCNET helps to elucidate the dynamics of species with high plasticity and lateral gene exchange from which populations can emerge or evolve (Georgiades and Raoult, 2010).

2 Implementation

AcCNET is a Perl application based exclusively on freeware. From a given set of accessory genomes, it builds a bipartite network with two types of nodes: Genomic Units (GUs, the vehicles or containers of genes, i.e. chromosomes, plasmids, viruses) and Homologous protein Clusters (HpCs, the sets of gene-encoded proteins clustered according to identity, coverage and e-value). The edges represent links between HpC and GU components. The edge-weights reflect the phylogenetic distance between the corresponding protein in a given GU and the consensus of the HpC. Some layout and clustering algorithms use edge-weight values, and for some clustering methods they are mandatory.
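The bipartite structure described above can be illustrated with a minimal Python sketch. It is not AcCNET's Perl code: the function name `build_bipartite`, the input layout (a mapping from HpC identifier to its member proteins) and the example identifiers are all assumptions made for illustration.

```python
from collections import namedtuple

# An edge always joins one GU node to one HpC node, never two nodes of the same type.
Edge = namedtuple("Edge", ["source", "target"])


def build_bipartite(cluster_members):
    """Build node sets and an edge list.

    cluster_members maps an HpC id to a list of (gu_id, protein_id) pairs
    (hypothetical input layout chosen for this sketch).
    """
    gu_nodes, hpc_nodes, edges = set(), set(), []
    for hpc_id, members in cluster_members.items():
        hpc_nodes.add(hpc_id)
        for gu_id, _protein_id in members:
            gu_nodes.add(gu_id)
            # One edge per member protein of the cluster.
            edges.append(Edge(gu_id, hpc_id))
    return gu_nodes, hpc_nodes, edges


# Made-up example: two plasmids sharing one cluster, plus one cluster unique to plasmid_B.
clusters = {
    "HpC_1": [("plasmid_A", "p1"), ("plasmid_B", "p7")],
    "HpC_2": [("plasmid_B", "p9")],
}
gu_nodes, hpc_nodes, edges = build_bipartite(clusters)
```

Keeping the two node types in separate sets makes it easy to check that every edge crosses from one set to the other, which is the defining property of a bipartite network.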

The application workflow comprises three steps: (1) protein cluster construction, (2) assignment of phylogenetic distances and (3) network set-up. The first step uses the kClust software (Hauser et al., 2013) to identify homologous proteins. kClust allows the clustering parameters (identity, coverage and e-value) to be set in order to adjust the thresholds applied to the input sequences. The second step calculates the edge-weights. Proteins belonging to a given HpC are aligned using MUSCLE (Edgar, 2004). The multiple alignment is then trimmed with trimAl (Capella-Gutiérrez et al., 2009) to optimize it for the calculation of the protein distance matrix. Finally, a distance matrix is calculated using protdist from the PHYLIP package (Felsenstein, 1989). The third step of the workflow translates phylogenetic distances between protein sequences into distances between GU nodes and HpC nodes using Equation (1), where Di is the distance between a protein i and its related cluster, dij is the distance between protein i and protein j, and n is the number of proteins in the HpC.
$$D_i = \frac{\sum_{j=1}^{n} d_{ij}}{n^{2}} \qquad (1)$$
Edge-weight is considered an attraction force between nodes and is thus proportional to the inverse of the phylogenetic distance (i.e. the shorter the distance, the stronger the attraction). To scale the edge-weight, the final value is computed as the base-10 logarithm of the inverse distance (2).
$$\mathrm{Edgeweight} = \log_{10}\left(\frac{1}{D_i}\right) \qquad (2)$$
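As a worked illustration, Equations (1) and (2) can be sketched in Python. The sketch assumes that Equation (1) reads as the sum of dij over j divided by n squared, and that the logarithm is base 10 as written in Equation (2). The function names and the example distance matrix are invented for illustration and are not part of AcCNET.

```python
import math


def cluster_distances(dist):
    """Equation (1): D_i = (sum over j of d_ij) / n**2.

    dist is an n x n list of lists of pairwise protein distances for one HpC.
    """
    n = len(dist)
    return [sum(row) / n**2 for row in dist]


def edge_weight(d_i):
    """Equation (2): base-10 logarithm of the inverse distance."""
    return math.log10(1.0 / d_i)


# Made-up distances for a three-protein cluster.
dist = [
    [0.0, 0.2, 0.4],
    [0.2, 0.0, 0.6],
    [0.4, 0.6, 0.0],
]
D = cluster_distances(dist)
weights = [edge_weight(d) for d in D]
```

In this example the first protein is closest to the rest of the cluster, so it receives the largest edge-weight, in line with the idea that a shorter distance means a stronger attraction.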
Some cases deserve special attention. Negative edge-weight values should be set to 0.01 (as a minimum value) in order to avoid potential flaws with layout and clustering algorithms. The edge-weight of an HpC that comprises a single member is set to half the total number of GUs. This value keeps these edge-weights on a common scale relative to the size of the study, so that they do not disturb the network arrangement or the clustering algorithms. In HpCs with 100% identity among all members, the edge-weight is set to half the number of members of the cluster. This rule was checked to be concordant with the other edge values. An HpC with more than one protein belonging to the same GU has redundant edges, one for each of the proteins involved. The user can either preserve redundant edges, to inspect gene duplications and the origin of each homolog using the phylogenetic information stored in the edges, or remove duplicate edges if this information is not needed for the study. By default, AcCNET considers the set of proteins present in all input GUs as the core genome. This threshold can be modified according to user criteria (e.g. increased to 101% to include the whole pangenome, or decreased ad libitum to remove the pseudo-core genome).
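The special cases and the core-genome threshold can be sketched as follows. This is an illustrative Python reading of the rules in the paragraph above, not AcCNET's implementation. Several points are assumptions: 100% identity is modelled as all distances being zero, the threshold is compared with a strict "less than", and all names are hypothetical.

```python
import math

MIN_WEIGHT = 0.01  # minimum value quoted in the text for negative edge-weights


def special_case_weight(d_values, i, total_gus):
    """Edge-weight of member i of one HpC, applying the special cases.

    d_values holds the D_i values of the cluster members (Equation 1).
    A zero D_i outside the all-identical case is not covered by the text
    and is not handled by this sketch.
    """
    n = len(d_values)
    if n == 1:
        # Single-member HpC: half the total number of GUs.
        return total_gus / 2
    if all(d == 0 for d in d_values):
        # Assumption: 100% identity corresponds to all distances being zero.
        return n / 2
    weight = math.log10(1.0 / d_values[i])
    # Negative weights are raised to the minimum value.
    return MIN_WEIGHT if weight < 0 else weight


def accessory_clusters(presence, total_gus, core_threshold=100.0):
    """Keep HpCs found in fewer than core_threshold percent of the GUs.

    presence maps an HpC id to the set of GU ids that contain it.
    """
    kept = {}
    for hpc_id, gus in presence.items():
        percent = 100.0 * len(gus) / total_gus
        if percent < core_threshold:  # assumed strict comparison
            kept[hpc_id] = gus
    return kept


presence = {"HpC_core": {"g1", "g2", "g3"}, "HpC_acc": {"g1"}}
default_view = accessory_clusters(presence, total_gus=3)
pangenome_view = accessory_clusters(presence, total_gus=3, core_threshold=101.0)
```

Under this reading, a threshold of 101% can never be reached, so no cluster is classified as core and the whole pangenome is kept. That would explain why the text suggests this value.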

2.1 Input data

AcCNET works with proteome files in FASTA format, each proteome being represented in an individual file. Proteomes can be linked to individual genomic elements (chromosomes, plasmids, viruses…) or to complex sets (genomes, pangenomes, genomic communities…). In either case, each element is treated in the network as a GU node.
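The one-proteome-per-GU input convention can be illustrated with a small Python sketch. It reads FASTA records from text handles and groups them by GU. In-memory handles stand in for the individual files, and the function names and example sequences are assumptions, not AcCNET code.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_proteomes(sources):
    """sources maps a GU name to an open text handle holding that GU's proteome."""
    return {gu: list(read_fasta(handle)) for gu, handle in sources.items()}


# Made-up proteomes; in practice each handle would be one FASTA file per GU.
example = {
    "plasmid_A": io.StringIO(">p1 hypothetical protein\nMKT\nLLV\n>p2 relaxase\nMSE\n"),
    "plasmid_B": io.StringIO(">p7 hypothetical protein\nMKTLLV\n"),
}
proteomes = load_proteomes(example)
```

Each key of `proteomes` then corresponds to one GU node, whether the underlying file represents a single plasmid or a whole genome.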

2.2 Output data

AcCNET produces three output files: a network, a table and a FASTA file. The network file has a tabular format with four columns: Source, Target, Weight and Type. The network file is compatible with most graph visualization software, such as Cytoscape (Smoot et al., 2011), Gephi (Bastian et al., 2009) or R (https://www.r-project.org/). The table file contains the attributes of the network nodes and has four columns: ID, Type, TwinGroup and Description. The Type attribute discriminates between HpC nodes and GU nodes. The TwinGroup attribute refers to HpCs with the same neighbours (Corel et al., 2016). AcCNET takes the annotation of the representative sequence of each HpC for the node Description field. The FASTA file contains the representative sequence of each HpC with the header adapted to the network index.
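The layout of the network and table files can be sketched in Python. The column names come from the description above; everything else is an assumption made for illustration. This includes the tab delimiter, the "Undirected", "GU" and "HpC" values, the numbering of twin groups, the empty fields for GU nodes and all function names.

```python
import csv
import io
from collections import defaultdict


def twin_groups(edges):
    """Label HpC nodes so that HpCs with identical GU neighbours share a label."""
    neighbours = defaultdict(set)
    for gu, hpc, _weight in edges:
        neighbours[hpc].add(gu)
    groups = defaultdict(list)
    for hpc, gus in neighbours.items():
        groups[frozenset(gus)].append(hpc)
    labels = {}
    ordered = sorted(sorted(members) for members in groups.values())
    for number, members in enumerate(ordered, start=1):
        for hpc in members:
            labels[hpc] = number
    return labels


def write_network(edges, handle):
    """Write the four-column network file (assumed tab-separated)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight", "Type"])
    for gu, hpc, weight in edges:
        writer.writerow([gu, hpc, weight, "Undirected"])  # Type value is assumed


def write_table(edges, descriptions, handle):
    """Write the four-column node attribute table."""
    labels = twin_groups(edges)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "Type", "TwinGroup", "Description"])
    for gu in sorted({gu for gu, _hpc, _weight in edges}):
        writer.writerow([gu, "GU", "", ""])
    for hpc in sorted(labels):
        writer.writerow([hpc, "HpC", labels[hpc], descriptions.get(hpc, "")])


# Made-up edges: HpC_1 and HpC_2 have the same neighbours, HpC_3 does not.
edges = [
    ("gA", "HpC_1", 1.2), ("gB", "HpC_1", 0.9),
    ("gA", "HpC_2", 1.5), ("gB", "HpC_2", 1.5),
    ("gA", "HpC_3", 1.0),
]
labels = twin_groups(edges)
network_out, table_out = io.StringIO(), io.StringIO()
write_network(edges, network_out)
write_table(edges, {"HpC_1": "relaxase"}, table_out)
network_lines = network_out.getvalue().splitlines()
table_lines = table_out.getvalue().splitlines()
```

A tab-separated edge list with Source, Target and Weight columns is a common way to describe a network to tools such as Cytoscape or Gephi. The node table can then be loaded as a separate attribute file keyed on the ID column.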

3 Biological application

Lateral gene transfer contributes extensively to the evolution and diversification of Archaea and Bacteria by interchanging different adaptive traits. A plethora of bacterial species of medical, veterinary or agricultural relevance is able to adapt to disparate lifestyles. Thus, the core genome and the dynamics of pangenomes are increasingly analysed to identify mechanisms of evolution, to find the traits that make populations functionally different and, finally, to understand the concept of species. AcCNET can be configured not only to analyse accessory genomes but also to represent whole pangenomes. This feature is especially useful for studying the dynamics of mobile genetic elements. In fact, AcCNET facilitates the analysis of large collections of plasmids, which makes the tool especially useful for the large-scale characterization of conjugative elements such as plasmids or ICEs (e.g. surveillance of antibiotic resistance, highly pathogenic species, characterization of emerging lineages). As an example of AcCNET applicability, we analysed a set of 138 genomes of Escherichia coli taken from the NCBI database. The example is comprehensively explained in the Supplementary material.

4 Conclusion

AcCNET is a novel freeware tool based on bipartite networks that aims to analyse accessory genomes with high resolution and accuracy. By providing an interactive visualization tool, AcCNET allows a deep and comprehensive inspection of the network. Although initially designed for the analysis of bacterial genome sequences, its features allow the exploration of all types of GUs, from transposons or viruses to mammals or primates.

Acknowledgements

The authors acknowledge Rocio Rodríguez-Villanueva for revision of the English grammar.

Funding

This work was supported by the European Commission, Seventh Framework Programme (EVOTAR-FP7-HEALTH-282004 for VFL, FB, FdlC and TMC (PI15-00466) and PLASWIRES-612146/FP7-ICT-2013-10 for FdlC). The authors also acknowledge the European Regional Development Fund ‘A way to achieve Europe’ (ERDF) for cofunding the Plan Nacional de I+D+I 2012-2015 (PI12-01581 to TMC and BFU2014-55534-C2-1-P for FdlC) and CIBER actions (CIBER in Epidemiology and Public Health, CIBERESP; CB06/02/0053 to FB) and the Regional Government of Madrid (PROMPT-S2010/BMD2414). Val F. Lanza is further supported by a Research Award Grant 2016 of the European Society for Clinical Microbiology and Infectious Diseases (ESCMID).

Conflict of Interest: none declared.

References

Agren J. et al. (2012) Gegenees: fragmented alignment of multiple genomes for determining phylogenomic distances and genetic signatures unique for specified target groups. PloS One, 7/6, e39107.

Bastian M. et al. (2009) Gephi: An open source software for exploring and manipulating networks. ICWSM, 8, 361–362.

Capella-Gutiérrez S. et al. (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25/15, 1972–1973.

Contreras-Moreira B., Vinuesa P. (2013) GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Appl. Environ. Microbiol., 79/24, 7696–7701.

Corel E. et al. (2016) Network-Thinking: graphs to analyze microbial complexity and evolution. Trends Microbiol., 24, 1–14. Elsevier Ltd.

Edgar R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32/5, 1792–1797.

Felsenstein J. (1989) PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics, 5, 164–166.

Georgiades K., Raoult D. (2011) Defining pathogenic bacterial species in the genomic era. Frontiers in Microbiology, DOI: 10.3389/fmicb.2010.00151

Hauser M. et al. (2013) kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinformatics, 14, 248.

Smoot M.E. et al. (2011) Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics (Oxford, England), 27/3, 431–432.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
Associate Editor: John Hancock
