Abstract

This paper presents an update on the content, accessibility and analytical tools of the EnteroBase platform for web-based pathogen genome analysis. EnteroBase provides manually curated databases of genome sequence data and associated metadata from currently >1.1 million bacterial isolates, more recently including Streptococcus spp. and Mycobacterium tuberculosis, in addition to Salmonella, Escherichia/Shigella, Clostridioides, Vibrio, Helicobacter, Yersinia and Moraxella. We have implemented the genome-based detection of antimicrobial resistance determinants and the new bubble plot graphical tool for visualizing bacterial genomic population structures, based on pre-computed hierarchical clusters. Access to data and analysis tools is provided through an enhanced graphical user interface and a new application programming interface (RESTful API). EnteroBase is now being developed and operated by an international consortium, to accelerate the development of the platform and ensure the longevity of the resources built. EnteroBase can be accessed at https://enterobase.warwick.ac.uk as well as https://enterobase.dsmz.de.

Introduction

EnteroBase is a publicly accessible, free-to-use, web-based platform for pathogen genomic analyses, currently providing genome sequence data from 1 144 499 bacterial isolates (as of 31 July 2024) in a uniformly assembled and annotated format. EnteroBase is updated daily by automatically scanning the NCBI Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) (1) for newly published sequence data. Users can also upload sequencing data and corresponding metadata directly. Genome assemblies that pass quality control are genotyped by using core genome multilocus sequence typing (cgMLST) followed by a unique, multilevel hierarchical clustering approach. Genome data are presented together with associated metadata, including geographic origin, isolation date, source sample, type of disease, etc. Expert microbiologists with specialized knowledge of specific pathogens actively curate the isolate-associated metadata based on scientific literature and interaction with the scientists who submitted the sequence data. By providing users with tools for phylogenetic, epidemiological and comparative genomic analyses, augmented by powerful graphical visualization and search capabilities, EnteroBase serves a broad user community, ranging from medical microbiologists to clinicians, epidemiologists, population geneticists and bioinformaticians.

EnteroBase was founded by Mark Achtman and his team at the University of Warwick in 2014, and individual databases (2,3), implemented tools (4,5) and the results obtained with them (6–8) were reported previously. The last major overview of EnteroBase was published in December 2019 (9). Since then, a number of advances have been made. In this paper, we outline selected improvements to EnteroBase content, accessibility and analytical tools.

The EnteroBase platform is now operated and developed by an international consortium including the University of Warwick (UK) and the Leibniz Institute DSMZ (Germany), accelerating the development of the platform, increasing its data security and ensuring the longevity of the resources built. Accordingly, EnteroBase can be accessed at https://enterobase.warwick.ac.uk as well as https://enterobase.dsmz.de. The number of EnteroBase users and analysis jobs has increased steadily over recent years, and over the last 12 months, EnteroBase has been accessed by users in 151 countries, indicating that the platform is widely and routinely used (Figure 1).

Figure 1. (A) Active EnteroBase users, 2016–2024. (B) Cumulative number of user jobs run on the EnteroBase platform. (C) The user access statistics by country as collected by Google Analytics for the period July 2023 to July 2024. The table shows the seven countries with the highest number of user accesses.

Content

The total number of genome sequences in EnteroBase now exceeds 1 million, compared to 300 000 genomes in 2019 (Figure 2). This increase is partly due to the establishment of databases for additional pathogens, including Streptococcus (Warwick University site) and Mycobacterium tuberculosis (Leibniz Institute DSMZ site), which currently hold >120 000 genomes each. Most of these data were retrieved from public sequence read archives (1), with automatic daily updates, whereas user uploads account for ∼10% of the total data.

Figure 2. Increasing number of bacterial genome sequences provided by EnteroBase databases. The ‘other’ section includes the smaller databases (Vibrio, Helicobacter, Moraxella and Yersinia). Data sources are public sequence read archives (1) and user uploads, respectively.

The genome-based detection of antimicrobial resistance (AMR) determinants was implemented for the Salmonella, Escherichia and Clostridioides databases, and results are currently available for all 858 942 individual genomes in these databases (31 July 2024). Based on this new feature, AMR genotypes can be readily compared across tens of thousands of genomes to elucidate geographical distributions or temporal trends of specific resistance determinants, or their associations with specific sample types, disease patterns and pathogen hosts (e.g. livestock versus human). While other tools are available to predict AMR from genome sequences (10–12), they cannot easily compare AMR properties among large numbers of genomes, or across space and time.

We also found that the quality of AMR genotyping varied with the bacterial species under study and could be significantly improved by pathogen-specific data curation. For example, dedicated web tools failed to detect most genetic determinants conferring resistance to antibiotics recommended for the treatment of Clostridioides difficile infections, because associated databases were not up to date for this species (7). We have therefore recently implemented a modified pipeline for AMR detection for C. difficile in EnteroBase, together with an expert-curated database of genetic determinants to be detected (Figure 3). This pipeline detects plasmid sequences by using BLASTn (13) (version 2.2.31+; thresholds: sequence coverage ≥95%, sequence identity ≥99%), considering the circular nature of plasmids, and it detects mutations in promoter sequences by using BLASTn-short (thresholds: sequence coverage = 100%, sequence identity ≥90%). In contrast, for the detection of point mutations in protein-coding genes (including base substitutions, insertions and deletions), it relies on the AMRFinderPlus tool (10) (version 3.11.26) with an adapted database, screening for resistance determinants using both protein and nucleotide reference sequences.
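The thresholds quoted above can be illustrated with a minimal Python sketch that applies them to alignment hits. This is not the EnteroBase pipeline code: the `Hit` record, the rule dictionaries and the example lengths are hypothetical, coverage is assumed to be the aligned fraction of the reference sequence, and the handling of circular plasmids is left out.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment between a reference determinant and a genome assembly (hypothetical record)."""
    reference_id: str        # name of the reference sequence, e.g. a plasmid or promoter
    reference_length: int    # length of the reference sequence in bp
    aligned_length: int      # number of reference positions covered by the alignment
    percent_identity: float  # identity over the aligned region, in percent


# Thresholds as stated in the text; the dictionary layout is an assumption.
PLASMID_RULE = {"min_coverage": 95.0, "min_identity": 99.0}
PROMOTER_RULE = {"min_coverage": 100.0, "min_identity": 90.0}


def coverage(hit: Hit) -> float:
    """Percentage of the reference sequence covered by the alignment."""
    return 100.0 * hit.aligned_length / hit.reference_length


def passes(hit: Hit, rule: dict) -> bool:
    """Return True if the hit meets both the coverage and the identity threshold."""
    return (coverage(hit) >= rule["min_coverage"]
            and hit.percent_identity >= rule["min_identity"])
```

For instance, with illustrative lengths, `passes(Hit("plasmid", 7000, 6900, 99.5), PLASMID_RULE)` is accepted, whereas a promoter hit covering 49 of 50 reference positions is rejected by `PROMOTER_RULE` regardless of its identity.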

Figure 3. (A) Genomes can be queried for specific genetic determinants for AMR using the Experiment Type ‘AMR analysis’. The returned genome entries are displayed in the spreadsheet. (B) Bubble plot illustrating the genetic population structure of C. difficile based on 30 599 genome sequences. All genomes are classified into hierarchical clusters at multiple levels, e.g. HC10 clusters identify chains of related genomes with pairwise differences of up to 10 cgMLST alleles (see EnteroBase user guide/HierCC equivalents for correlates of hierarchical clusters in different bacterial species). Hierarchical clusters at levels HC150, HC2000 and HC2500, respectively, are indicated by gray shading, whereas HC10 clusters are colored based on the presence of genetic determinants of resistance to the antibiotic metronidazole, including plasmids and point mutations in both protein-coding genes and a promoter. It appears that the mutation in the nimB gene promoter (PnimBG), conferring increased metronidazole tolerance (15), is restricted to a limited number of genotypes, including epidemic PCR ribotype 027 strains FQR1 and FQR2 [correlating with hierarchical clusters HC10_9 and HC10_4, respectively (3)]. Other determinants of metronidazole resistance, including the plasmid pCD-METRO, are rare in C. difficile (7). The interactive bubble plot shown here is available among public ‘workspaces’ within EnteroBase.
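The notion of a cluster as a chain of genomes linked by pairwise differences of up to a given number of cgMLST alleles can be sketched as single-linkage clustering. The Python example below is a simplified illustration only and not the HierCC implementation (5): the function names are hypothetical, allele calls are assumed to be integers with 0 marking a missing locus, and all pairs are compared naively.

```python
def allelic_distance(a, b):
    """Count loci with different allele calls, skipping loci missing (0) in either profile."""
    return sum(1 for x, y in zip(a, b) if x and y and x != y)


def single_linkage(profiles, threshold):
    """Group genomes into chains in which each genome is within `threshold`
    allelic differences of at least one other member of its cluster."""
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if allelic_distance(profiles[a], profiles[b]) <= threshold:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy profiles with four loci each (made-up data).
toy_profiles = {
    "g1": [1, 1, 1, 1],
    "g2": [1, 1, 1, 2],
    "g3": [1, 1, 2, 2],
    "g4": [5, 5, 5, 5],
}
```

With a threshold of 1, `single_linkage(toy_profiles, 1)` places g1, g2 and g3 in one cluster even though g1 and g3 differ at two loci, because g2 links them; this chaining is what distinguishes single linkage from a strict all-pairs cutoff.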

The AMR genotyping pipeline for Salmonella and Escherichia is also based on the AMRFinderPlus tool (10) (version 3.11.26). A transcription of the standard AMRFinderPlus output for each strain is available for viewing and download. On the graphical user interface (GUI), AMR determinants are organized into columns by drug class. This mostly follows the AMRFinderPlus ‘class’, with the beta-lactam class further divided into penicillinases, extended-spectrum beta-lactamases and carbapenemases based on the ‘subclass’ categorization. This distinction is important to identify remaining options for antibiotic treatment. Rare or clinically less relevant classes are grouped as ‘Other’ in the GUI.
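As a rough illustration of this column layout, the following Python sketch sorts AMRFinderPlus result rows into display columns. It assumes the ‘Gene symbol’, ‘Class’ and ‘Subclass’ fields of the AMRFinderPlus tabular output; the subclass-to-column mapping, the set of displayed classes and the example rows are hypothetical and do not reproduce the rules used in EnteroBase.

```python
from collections import defaultdict

# Hypothetical mapping from AMRFinderPlus subclass to a beta-lactam display column.
BETA_LACTAM_COLUMNS = {
    "BETA-LACTAM": "Penicillinases",
    "CEPHALOSPORIN": "Extended-spectrum beta-lactamases",
    "CARBAPENEM": "Carbapenemases",
}

# Hypothetical set of classes that get their own column; everything else is 'Other'.
DISPLAYED_CLASSES = {"AMINOGLYCOSIDE", "QUINOLONE", "TETRACYCLINE"}


def column_for(drug_class, subclass):
    """Choose the display column for one resistance determinant."""
    if drug_class == "BETA-LACTAM":
        return BETA_LACTAM_COLUMNS.get(subclass, "Other")
    if drug_class in DISPLAYED_CLASSES:
        return drug_class.capitalize()
    return "Other"


def to_columns(rows):
    """Group gene symbols by display column for a single genome."""
    columns = defaultdict(list)
    for row in rows:
        columns[column_for(row["Class"], row["Subclass"])].append(row["Gene symbol"])
    return {name: sorted(genes) for name, genes in columns.items()}


# Made-up rows in the shape of AMRFinderPlus output records.
example_rows = [
    {"Gene symbol": "blaKPC-2", "Class": "BETA-LACTAM", "Subclass": "CARBAPENEM"},
    {"Gene symbol": "blaCTX-M-15", "Class": "BETA-LACTAM", "Subclass": "CEPHALOSPORIN"},
    {"Gene symbol": "blaTEM-1", "Class": "BETA-LACTAM", "Subclass": "BETA-LACTAM"},
    {"Gene symbol": "tet(A)", "Class": "TETRACYCLINE", "Subclass": "TETRACYCLINE"},
    {"Gene symbol": "fosA", "Class": "FOSFOMYCIN", "Subclass": "FOSFOMYCIN"},
]
```

Calling `to_columns(example_rows)` yields one entry per display column, so that a carbapenemase gene is kept apart from a penicillinase gene when assessing which beta-lactams may remain usable.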

EnteroBase can now be used to search for bacterial strains with specific resistance determinants or classes, and AMR genotyping results can be displayed for a list of strains by selecting ‘AMR analysis’ from the ‘Experimental Data’ drop-down menu (Figure 3A). Additional details that may help to assess the confidence of the resistance genotype call are available by selecting the eye icon (e.g. the reference sequence type, length, and accession number, sequence coverage and identity, contig position, scientific literature).

Analytical tools

We have implemented the new bubble plot graphical tool for visualizing genetic population structures, based on pre-computed hierarchical clusters. Similar bubble plots had previously been generated with an external tool (14), and as the results were very informative, we have added a unique and more powerful version to EnteroBase. In these plots, bubbles represent clusters of bacteria that are related at a specified level, with the bubble size indicating the number of genomes included, and they are each shown nested within the next level hierarchical cluster. Users can choose to plot all levels of taxonomy, from clonal outbreaks to epidemics, and from endemic strains to the level of bacterial species. The bubble plots can be colored using metadata or experimental data of interest to identify any correlations with population structure (Figure 3).
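The nesting that underlies such a plot can be sketched in a few lines of Python. The example below is an assumption-laden illustration, not EnteroBase code: it takes pre-computed cluster assignments per genome (the level names follow those mentioned for Figure 3, the cluster numbers are made up) and builds a nested structure in which each node records the number of genomes it contains, i.e. the quantity that would determine bubble size.

```python
# Levels ordered from the coarsest to the finest cluster (names as used for Figure 3).
LEVELS = ["HC2500", "HC2000", "HC150", "HC10"]


def nest(genomes, levels=LEVELS):
    """Build a nested dict: cluster label -> {"size": genome count, "children": {...}}."""
    root = {}
    for genome in genomes:
        node = root
        for level in levels:
            label = f"{level}_{genome[level]}"
            entry = node.setdefault(label, {"size": 0, "children": {}})
            entry["size"] += 1          # one more genome inside this bubble
            node = entry["children"]    # descend to the next, finer level
    return root


# Made-up cluster assignments for four genomes.
example_genomes = [
    {"HC2500": 1, "HC2000": 1, "HC150": 100, "HC10": 200},
    {"HC2500": 1, "HC2000": 1, "HC150": 100, "HC10": 200},
    {"HC2500": 1, "HC2000": 1, "HC150": 100, "HC10": 201},
    {"HC2500": 2, "HC2000": 7, "HC150": 300, "HC10": 400},
]
```

In `nest(example_genomes)`, the bubble `HC2500_1` has size 3 and contains, three levels down, an `HC10_200` bubble of size 2 and an `HC10_201` bubble of size 1. Coloring by metadata would then amount to attaching a further attribute to each node.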

User interface

The graphical user interface of EnteroBase has been improved recently, with a unified design, interactive features and enhanced search functions. The EnteroBase websites are accessible from mobile devices, enabling genomic surveillance in the field, including rapid data synthesis and reporting. The homepage has been made more intuitive, with descriptions of the main features that give a quick overview. All databases hosted by the University of Warwick and the Leibniz Institute DSMZ are displayed, indicating the genotyping schemes included and the total number of genomes available. Links have been added to the footer of the homepage for quick access to the user guide, contact information, terms and conditions, and acknowledgments. Additional pages provide information about the curators and developers involved in the maintenance and development of the EnteroBase software and databases, and on the literature to be cited. For easy navigation, pages are titled and the currently displayed page is highlighted in the menu. The search functionality has been improved, allowing the user to view their query after search results have loaded. The query window is now integrated into the page (Figure 3) and automatically minimizes once a query has been submitted, to make room for the search results to be displayed. Registered users can save their search queries.

To enable integration with other platforms, EnteroBase now provides programmatic access to data and analysis tools via an application programming interface (RESTful API). It is designed for rapid live updates of small amounts of data. Queryable data include cgMLST schemes, genome assemblies, genome metadata, locus sequences, allele profiles and AMR predictions. All API requests must use HTTP basic authentication and be authenticated with a valid token. Requests can be wrapped in a programming language of the user’s choice; we recommend using Python and provide some example scripts at https://bitbucket.org/enterobase/enterobase-scripts. Responses are returned as JSON-formatted data. To help users formulate their API requests, a Swagger sandbox is available at https://enterobase.warwick.ac.uk/api/v2.0/swagger-ui, which provides interactive documentation of the EnteroBase API, including information on endpoints, inputs, outputs and response codes.
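A token-authenticated request of this kind might look as follows in Python, using only the standard library. This is a minimal sketch under stated assumptions: the API root is inferred from the Swagger URL above, the token is assumed to be sent as the basic-authentication user name with an empty password, and the helper names are hypothetical. Actual endpoint paths and parameters should be taken from the Swagger documentation.

```python
import base64
import json
import urllib.request

# Assumption: API root inferred from the Swagger sandbox address.
API_ROOT = "https://enterobase.warwick.ac.uk/api/v2.0"


def build_request(url, token):
    """Create a request carrying the token via HTTP basic authentication.

    Assumption: the token acts as the user name and the password is empty.
    """
    credentials = base64.b64encode(f"{token}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {credentials}"})


def fetch_json(url, token):
    """Send the request and decode the JSON-formatted response."""
    with urllib.request.urlopen(build_request(url, token)) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `fetch_json(API_ROOT + "/<endpoint>", my_token)` would then return the parsed response, where `<endpoint>` and `my_token` are placeholders for an endpoint listed in the Swagger sandbox and the user's personal token.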

User support

The documentation of EnteroBase has been significantly improved and updated. Documentation on AMR analysis has been added and a guide to using API access to EnteroBase has been introduced, including several working examples. A wealth of detailed information for advanced users and developers has been added, to make it easier to install EnteroBase locally and to set up new databases and genotyping schemes.

Personal user support is provided through the support email address. The interactions with EnteroBase users continue to provide useful feedback that enables us to update the documentation and the tool itself to improve usability. New users are often guided to the specific documentation associated with the tasks they wish to perform. Occasionally, users inform EnteroBase support of inconsistencies in the data held on EnteroBase, often arising from errors when data were originally uploaded to EnteroBase or NCBI. Such inconsistencies are then investigated and corrected. Users also ask about potential new features, which has helped set priorities for development, such as the introduction of AMR analysis.

System architecture

At the long-established Warwick University site, the EnteroBase software runs on four servers, with two servers each replicating each other’s functions to provide complete redundancy should one server fail. The PostgreSQL databases are mirrored in the same way. The four servers all reside on a secure network, with only a lightweight front-end web server on the public internet. This machine serves some static HTML files, but its main role is to act as a load balancer, forwarding requests to the backend servers. The data are managed in a shared filesystem. In addition to the servers mentioned above, there are further compute nodes that run time- and resource-consuming jobs such as tree calculations and assemblies. At the Leibniz Institute DSMZ site, we are currently expanding the server structure to achieve a similar level of redundancy.

Future

EnteroBase will continue to be operated and actively developed at two locations. We are currently working on implementing AMR genotyping for additional species (e.g. M. tuberculosis). We will establish databases for other pathogens and develop and install additional bioinformatics tools (e.g. for assembling long sequencing reads and identifying mobile genetic elements), and expand networking with other platforms through programmatic interfaces. EnteroBase will continue to be an important resource for making efficient use of the increasing amount of pathogen sequencing data and realizing the full potential of high-throughput genome sequencing.

Data availability

EnteroBase can be accessed at https://enterobase.warwick.ac.uk as well as https://enterobase.dsmz.de.

Acknowledgements

The EnteroBase system was originally developed by Prof. Mark Achtman and his research group at the University of Warwick. M. Achtman provided helpful comments on an earlier version of the manuscript. We are most grateful to our expert database curators, including M. Chattaway, M. Pardos, F.-X. Weill, E. Litrup, S. Beatson, C. Nodari, M. Kilian, M. Frentrup, B. Gomez-Gil, C. Constantinidou, K. Thorell, R. Torres, S. Reuter and S. Nguyen. The authors at Leibniz Institute DSMZ acknowledge continuous support from their scientific computing team. The graphical abstract was created using freely available images from Flaticon.com.

Funding

Recent EnteroBase work at the University of Warwick was funded by the PATH-SAFE programme (Pathogen Surveillance in Agriculture, Food and the Environment), which was funded by HM Treasury (the UK government's economic and finance ministry) through the Shared Outcomes Fund. EnteroBase work at the Leibniz Institute DSMZ was partially funded by the Global Health EDCTP3 (European and Developing Countries Clinical Trials Partnership) supported by the European Union (project PANGenS, to U.N.) and by the German Center for Infection Research (DZIF, TI 12.901/12.902, to J.O. and TTU 09.720, to U.N.).

Conflict of interest statement. None declared.

References

1. Sayers E.W., Beck J., Bolton E.E., Brister J.R., Chan J., Comeau D.C., Connor R., DiCuccio M., Farrell C.M., Feldgarden M. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2024; 52:D33–D43.

2. Alikhan N.F., Zhou Z., Sergeant M.J., Achtman M. A genomic overview of the population structure of Salmonella. PLoS Genet. 2018; 14:e1007261.

3. Frentrup M., Zhou Z., Steglich M., Meier-Kolthoff J.P., Göker M., Riedel T., Bunk B., Spröer C., Overmann J., Blaschitz M. et al. A publicly accessible database for Clostridioides difficile genome sequences supports tracing of transmission chains and epidemics. Microb. Genom. 2020; 6:e000410.

4. Zhou Z., Alikhan N.F., Sergeant M.J., Luhmann N., Vaz C., Francisco A.P., Carrico J.A., Achtman M. GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res. 2018; 28:1395–1404.

5. Zhou Z., Charlesworth J., Achtman M. HierCC: a multi-level clustering scheme for population assignments based on core genome MLST. Bioinformatics. 2021; 37:3645–3646.

6. Frentrup M., Thiel N., Junker V., Behrens W., Münch S., Siller P., Kabelitz T., Faust M., Indra A., Baumgartner S. et al. Agricultural fertilization with poultry manure results in persistent environmental contamination with the pathogen Clostridioides difficile. Environ. Microbiol. 2021; 23:7591–7602.

7. Kolte B., Nübel U. Genetic determinants of resistance to antimicrobial therapeutics are rare in publicly available Clostridioides difficile genome sequences. J. Antimicrob. Chemother. 2024; 79:1320–1328.

8. Behrens W., Kolte B., Junker V., Frentrup M., Dolsdorf C., Börger M., Jaleta M., Kabelitz T., Amon T., Werner D. et al. Bacterial genome sequencing tracks the housefly-associated dispersal of fluoroquinolone- and cephalosporin-resistant Escherichia coli from a pig farm. Environ. Microbiol. 2023; 25:1174–1185.

9. Zhou Z., Alikhan N.F., Sergeant M.J., Mohamed K., Group A.S., Achtman M. The EnteroBase user’s guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny and Escherichia core genomic diversity. Genome Res. 2020; 30:138–152.

10. Feldgarden M., Brover V., Gonzalez-Escalona N., Frye J.G., Haendiges J., Haft D.H., Hoffmann M., Pettengill J.B., Prasad A.B., Tillman G.E. et al. AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci. Rep. 2021; 11:12728.

11. Alcock B.P., Raphenya A.R., Lau T.T.Y., Tsang K.K., Bouchard M., Edalatmand A., Huynh W., Nguyen A.V., Cheng A.A., Liu S. et al. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Res. 2020; 48:D517–D525.

12. Florensa A.F., Kaas R.S., Clausen P., Aytan-Aktug D., Aarestrup F.M. ResFinder – an open online resource for identification of antimicrobial resistance genes in next-generation sequencing data and prediction of phenotypes from genotypes. Microb. Genom. 2022; 8:000748.

13. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990; 215:403–410.

14. Achtman M., Zhou Z., Charlesworth J., Baxter L. EnteroBase: hierarchical clustering of 100 000s of bacterial genomes into species/subspecies and populations. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2022; 377:20210240.

15. Olaitan A.O., Dureja C., Youngblom M.A., Topf M.A., Shen W.J., Gonzales-Luna A.J., Deshpande A., Hevener K.E., Freeman J., Wilcox M.H. et al. Decoding a cryptic mechanism of metronidazole resistance among globally disseminated fluoroquinolone-resistant Clostridioides difficile. Nat. Commun. 2023; 14:4130.

Author notes

The first two authors should be regarded as Joint First Authors.

The last two authors should be regarded as Joint Last Authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] for reprints and translation rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site—for further information please contact [email protected].
