ChlamDB: a comparative genomics database of the phylum Chlamydiae and other members of the Planctomycetes-Verrucomicrobiae-Chlamydiae superphylum

Abstract ChlamDB is a comparative genomics database containing 277 genomes covering the entire Chlamydiae phylum as well as their closest relatives belonging to the Planctomycetes-Verrucomicrobiae-Chlamydiae (PVC) superphylum. Genomes can be compared, analyzed and retrieved using accession numbers of the most widely used databases including COG, KEGG ortholog, KEGG pathway, KEGG module, Pfam and InterPro. Gene annotations from multiple databases including UniProt (curated and automated protein annotations), KEGG (annotation of pathways), COG (orthology), TCDB (transporters), STRING (protein–protein interactions) and InterPro (domains and signatures) can be accessed in a comprehensive overview page. Candidate effectors of the Type III secretion system (T3SS) were identified using four in silico methods. The identification of orthologs among all PVC genomes allows users to perform large-scale comparative analyses and to identify orthologs of any protein in all genomes integrated in the database. Phylogenetic relationships of PVC proteins and their closest homologs in RefSeq, comparison of transmembrane domains and Pfam domains, conservation of gene neighborhood and taxonomic profiles can be visualized using dynamically generated graphs, available for download. As a central resource for researchers working on chlamydia, chlamydia-related bacteria, verrucomicrobia and planctomyces, ChlamDB facilitates access to comprehensive annotations, integrates multiple tools for comparative genomic analyses and is freely available at https://chlamdb.ch/. Database URL: https://chlamdb.ch/


INTRODUCTION
All known members of the phylum Chlamydiae are obligate intracellular bacteria exhibiting a unique life cycle. Described chlamydial species cause a broad range of diseases in various species of birds, fishes, reptiles, amphibians, marsupials and mammals (1), and include major human pathogens such as Chlamydia trachomatis, a leading cause of blindness and infertility (1,2). Chlamydiae are difficult to cultivate and genetic manipulations are only available for a few species, which drastically slows down the understanding of their fascinating biology. Other members of the Planctomycetes-Verrucomicrobiae-Chlamydiae (PVC) superphylum include the closest relatives of the Chlamydiae. The Planctomycetes are extremely attractive for the field of evolutionary cell biology given their peculiar intracellular compartments (3). Like Chlamydiae, they replicate using an FtsZ-independent mechanism but, in contrast to the Chlamydiae, Planctomycetales were shown to have a complete peptidoglycan cell wall (4–7). There is currently no database allowing easy access to, and comparison of, comprehensive genomics data for members of the PVC superphylum. A database focusing on the curation of chlamydial genome annotation was recently published (8), but it is limited to three species of the genus Chlamydia. A phylum-scale perspective including comparative data with the closest free-living relatives of the Chlamydiae would provide significant added value for the research community given the conserved intracellular lifestyle of these bacteria, which are estimated to have diverged over 700 million years ago (9). The PVCbase (10) provides updated automated protein annotations of forty-two PVC genomes, but offers only limited browsing capabilities and no comparative data. ChlamDB offers a centralized resource for genomic data and annotations of the entire PVC superphylum.
Its simple search engine allows browsing protein annotations, identifying orthologs in PVC genomes and performing a variety of comparative analyses.
The database provides various tools for comparing, analyzing and retrieving genomic data. A simple Boolean search interface allows querying the database for specific entries using NCBI protein accessions and locus tags or UniProt accessions. Accession numbers of widely used databases such as COG (14), KEGG ortholog (KO) (15), KEGG pathway (16), KEGG module, Pfam (17) and InterPro (18) are also recognized and can be used to search for proteins with specific annotations. The annotation of individual genomes can be browsed in tables of genes that are accessible directly from the front web page. In addition, sequence homology searches can be performed through a BLAST interface integrating the different BLAST flavours (BLASTp, BLASTn, tBLASTn and BLASTx) (19).
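As an illustration of how a search term could be routed to the right annotation type, the following minimal sketch classifies a query by its accession format. The regular expressions reflect the public identifier formats of each database; the function name and the dispatch logic are hypothetical and are not taken from the ChlamDB code base.

```python
import re

# Patterns follow the public accession formats of each database.
# The table and the function below are illustrative assumptions,
# not the actual ChlamDB implementation.
ACCESSION_PATTERNS = [
    ("COG", re.compile(r"^COG\d{4}$")),
    ("KEGG ortholog", re.compile(r"^K\d{5}$")),
    ("KEGG module", re.compile(r"^M\d{5}$")),
    ("KEGG pathway", re.compile(r"^(map|ko)\d{5}$")),
    ("Pfam", re.compile(r"^PF\d{5}$")),
    ("InterPro", re.compile(r"^IPR\d{6}$")),
]


def classify_accession(term):
    """Return the database a search term appears to belong to, or None."""
    cleaned = term.strip()
    for database, pattern in ACCESSION_PATTERNS:
        if pattern.match(cleaned):
            return database
    # Anything else (locus tag, protein accession, free text) is left
    # to other search strategies.
    return None
```

A term such as IPR004667 would be recognized as an InterPro entry, whereas a locus tag would fall through and could be handled by a separate lookup.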

Individual protein annotation view
Searching for a protein gives access to a 'locus' page, which summarizes automated and imported functional annotations and provides comprehensive comparative data to facilitate the interpretation of annotations (Figure 1). It integrates annotations from multiple databases including UniProt (curated and automated protein annotations) (20), KEGG (annotation of pathways), COG (orthology), TCDB (transporters) (21), STRING (protein-protein interactions) (22) and InterPro (domains and signatures). The different tabs at the top of the page link to additional data such as the list of orthologs in other PVC genomes (Figure 1C), identified using OrthoFinder (23). Orthologs are listed in a table containing the locus tag, the gene name, the name of the organism, the product, the percentage of amino acid identity as compared to the reference locus and the UniProt annotation score. Orthologs that were reviewed in SwissProt are flagged to quickly identify orthologs with manually curated annotations. Additional tabs link to (i) a precomputed phylogeny of the orthologous group, (ii) a second phylogeny that includes the closest non-PVC RefSeq hits of each sequence of the orthogroup, making it possible to investigate the phylogenetic relationship of PVC proteins and their closest homologs available in public databases (Figures 1J and 2J), precomputed homology searches against the (iii) RefSeq and (iv) SwissProt databases (top 200 hits), (v) links to published literature based on text-mining from the STRING database (24) and PaperBLAST hits (25) and (vi) candidate functional interactors. Putative interactors were predicted in-house from genomic data alone using phylogenetic profiling and investigation of conserved gene neighborhood (see online methods) (Figure 1G). See (26) and (27) for the rationale justifying the use of these two approaches.
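The idea behind phylogenetic profiling can be sketched as follows: two orthologous groups that are present and absent in the same genomes are candidate functional partners. The sketch below scores profile similarity with the Jaccard index; this metric, the 0.5 cutoff and the function names are assumptions made for illustration, and the scoring actually used by ChlamDB is described in the online methods.

```python
def jaccard_similarity(profile_a, profile_b):
    """Jaccard index between two sets of genome identifiers."""
    union = profile_a | profile_b
    if not union:
        return 0.0
    return len(profile_a & profile_b) / len(union)


def rank_profile_partners(query, profiles, min_similarity=0.5):
    """Rank orthogroups by similarity of their presence profile to `query`.

    `profiles` maps an orthogroup identifier to the set of genomes in
    which it is present. The cutoff is an arbitrary illustrative value.
    """
    query_profile = profiles[query]
    scored = []
    for group, profile in profiles.items():
        if group == query:
            continue
        score = jaccard_similarity(query_profile, profile)
        if score >= min_similarity:
            scored.append((group, score))
    # Highest similarity first; ties broken by identifier for stability.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```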
We put a strong emphasis on the visual representation of the data (Figure 2). The pattern of presence/absence of orthologous groups within the PVC superphylum can be visualized with the help of an annotated reference phylogeny (Figures 1D and 2D). The reference phylogeny was reconstructed with FastTree (28) (default parameters, JTT+CAT model) based on the concatenated alignment of 32 single copy orthologs conserved in at least 266 out of the 277 genomes.
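The marker selection step can be sketched as a simple filter over a table of gene copy numbers. The data structure and function name below are hypothetical, and the sketch assumes that an orthologous group is rejected as soon as any genome carries more than one copy; the source only states the thresholds (single copy, at least 266 of 277 genomes).

```python
def select_core_single_copy(copy_numbers, min_genomes):
    """Select orthogroups usable as phylogenetic markers.

    `copy_numbers` maps an orthogroup identifier to a dictionary
    {genome: number of copies}. A group is kept when it has no paralog
    in any genome (an assumption of this sketch) and is present in at
    least `min_genomes` genomes.
    """
    selected = []
    for group, counts in copy_numbers.items():
        present = [n for n in counts.values() if n > 0]
        if any(n > 1 for n in present):
            continue  # multi-copy in at least one genome
        if len(present) >= min_genomes:
            selected.append(group)
    return sorted(selected)
```

With the thresholds reported above, the call would be `select_core_single_copy(copy_numbers, min_genomes=266)` on a table covering all 277 genomes.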
The organization of transmembrane and Pfam domains in orthologs can be easily compared along the phylogeny of the orthologous group (Figures 1H and 2H). The conservation of proteins encoded in the direct neighborhood (23 kb upstream and downstream) of the protein of interest can also be visualized (Figures 1E and 2E).
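Extracting such a neighborhood amounts to a window query on gene coordinates. The following minimal sketch assumes linear coordinates (no wrap-around on circular replicons) and keeps every gene overlapping the window; the `Gene` record and the function name are hypothetical.

```python
from collections import namedtuple

# Hypothetical minimal gene record: coordinates in base pairs.
Gene = namedtuple("Gene", ["locus_tag", "start", "end"])


def neighborhood(genes, target_locus_tag, flank=23_000):
    """Return locus tags of genes within `flank` bp of the target gene.

    Assumes linear coordinates; a circular chromosome would additionally
    require wrapping the window around the origin.
    """
    target = next(g for g in genes if g.locus_tag == target_locus_tag)
    window_start = target.start - flank
    window_end = target.end + flank
    return [
        g.locus_tag
        for g in sorted(genes, key=lambda g: g.start)
        if g.end >= window_start and g.start <= window_end
    ]
```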
The 'orthogroup' link (Figure 1K) provides an overview of the annotation of orthologs including gene name, product, COG annotation, KEGG annotation, InterPro annotations, number of transmembrane domains and sequence length. It allows verifying the consistency of annotations among putative orthologs and identifying wrongly grouped proteins (e.g. non-orthologous proteins sharing a domain).

Annotation of candidate type III secretion system effectors
Chlamydiae use a type III secretion system (T3SS) to deliver effector proteins that allow the bacterium to overcome eukaryotic host defenses and to manipulate host cells. Effectors are difficult to identify because they evolve quickly and are much less conserved than proteins encoding components of the T3SS apparatus (29,30). Between 5 and 8% of Chlamydia spp. coding sequences (CDS) are estimated to be effectors (31). Candidate T3SS effectors were identified using four different machine-learning classifiers that were trained with known effector sequences: BPBAac (32), effectiveT3 (33), DeepT3 (34) and T3_MM (35). In addition, we tagged proteins harboring eukaryotic domains rarely found in bacterial genomes. Such domains are known to be frequently involved in bacteria-host interactions (36,37). The ADP/ATP transporter domain (InterPro accession IPR004667) is for instance frequently found in both bacteria (70.48%) and eukaryotes (29.52%) (Figure 1L). A dedicated page allows visualizing the taxonomic distribution of each COG and Pfam domain across 2,031 (for COG) and 6,677 (for Pfam) representative genomes of Archaea, Bacteria, Eukaryotes and Viruses (Figures 1M and 2M). The detailed list of identified homologs can, for instance, be used to quickly determine whether a candidate effector protein harbors a domain predominantly identified in the genomes of eukaryotes and other intracellular bacterial parasites such as Rickettsia or Legionella.
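A taxonomic profile of the kind reported above is simply the share of hits falling in each taxonomic group. The sketch below computes such percentages from raw counts; the function name and the toy counts used in the usage note are illustrative assumptions, not data from ChlamDB.

```python
def taxonomic_profile(domain_counts):
    """Convert per-taxon hit counts into percentages (two decimals).

    `domain_counts` maps a taxonomic group (e.g. "Bacteria") to the
    number of representative genomes in which the domain was found.
    """
    total = sum(domain_counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in domain_counts}
    return {
        taxon: round(100.0 * count / total, 2)
        for taxon, count in domain_counts.items()
    }
```

For example, a domain found in three bacterial genomes and one eukaryotic genome would yield 75% and 25%; a domain dominated by eukaryotic hits could then be flagged as eukaryote-like.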

Comparative genomics and data mining tools
Since the C. trachomatis genome became one of the first sequenced genomes (38), hundreds of Chlamydiae genomes have been sequenced. Comparisons of complete genomes of different strains and species can help identify genetic variations that can be involved in defining tissue tropism or host specificity (39), or identify genes essential to the unique intracellular lifestyle of Chlamydiae. ChlamDB allows users to perform various comparative analyses based on orthologous proteins to identify highly conserved and genome-specific or clade-specific orthologous groups (Figures 3.1 and 3.2). Whole genome comparisons can be visualized using interactive circular genome maps, Venn diagrams or heat maps (Figures 3.3, 3.4 and 3.5). In addition, ChlamDB enables the alignment of local genomic regions in two or more genomes (Figure 3.6).
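Identifying clade-specific orthologous groups reduces to set operations on the orthogroup content of each genome: keep the groups shared by all genomes of interest, then remove those found in any genome of the comparison set. The following is a minimal sketch under that assumption; the function name and data layout are hypothetical.

```python
def clade_specific_orthogroups(presence, include, exclude):
    """Orthogroups present in every `include` genome and in no `exclude` genome.

    `presence` maps a genome identifier to the set of orthogroup
    identifiers it encodes. `include` must contain at least one genome.
    """
    if not include:
        raise ValueError("at least one genome must be included")
    shared = set.intersection(*(set(presence[g]) for g in include))
    for genome in exclude:
        shared -= presence[genome]
    return shared
```

Calling the function with an empty `exclude` list returns the groups conserved across all selected genomes, which corresponds to the highly conserved case.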
Pfam domains, KEGG orthologs and InterPro entries can also be compared to identify clade-specific or highly conserved protein features (Figure 3). Annotations from the KEGG database were used to classify proteins into metabolic pathways and modules (16). Data for individual pathways and modules can be retrieved by searching KEGG accessions in the main search bar. In addition, KEGG annotations in various genomes can be compared as annotated phylogenies (Figure 4.1) and interactive bar charts or accessed from summary tables available for each genome (Figure 4.2). Module and pathway pages detail the KEGG orthologs associated with a given entry (Figure 4.3) and report the list of orthologs identified in each PVC genome (Figure 4.4).

Implementation, methods and updates
The interface was developed using the Django framework (https://www.djangoproject.com/). Data are stored on a MySQL server and visualized with existing JavaScript libraries for drawing interactive plots and tables such as jvenn.js (41), datatables.js (https://datatables.net), cytoscape.js (42) and feature-viewer.js (https://github.com/calipho-sib/feature-viewer) (43). The Python module GenomeDiagram is used to draw genome schematics, including alignments of multiple genomic locations (44). Circular representations of genomes and plasmids are made with Circos (45). The Ete3 Python module is used to draw phylogenetic trees with associated metadata (46). Some plots are also made using R (47), ggplot2 (48) and plotly (https://plot.ly). Annotations, phylogenetic trees and multiple sequence alignments can be downloaded from the website. A detailed description of the methods used to precompute functional and comparative analyses and to set up the database is available online (https://www.chlamdb.ch/docs/index.html). The source code of the website is freely available on GitHub and issues can be reported online (https://github.com/metagenlab/chlamdb). This database has been developed at the Centre for Research on Intracellular Bacteria (CRIB) in Lausanne and will be maintained and updated at least once a year.

CONCLUSION AND FUTURE DIRECTIONS
As the number of genome sequences quickly increases, there is a need for a centralized genomics resource providing updated annotations and extensive comparative genomics capabilities for the PVC superphylum. A superphylum-specific database has a significant added value with respect to large-scale genomic databases such as PATRIC (49) or MicroScope (50): ChlamDB greatly facilitates access to comprehensive annotations and comparative data meaningful to the Chlamydia and PVC research community, with an intuitive interface and a special focus on visual representations of comparative data. Easy access to precomputed homology searches and phylogenetic reconstructions will help researchers to investigate the function and evolutionary history of proteins encoded in PVC genomes. Annotations of proteins specific for intracellular life such as predictions of type III secretion system effectors and identification of eukaryote-like domains will also facilitate the identification of uncharacterized proteins that might be involved in chlamydia-host interactions.
Since the annotation of PVC genomes stored in GenBank is generally not up-to-date with the most recent research, the existing ChlamDB could be extended to allow manual curation of the annotation and tracking of protein annotation history. Indeed, successful examples of community-curated databases exist for major pathogens, such as the Pseudomonas Database (www.pseudomonas.com) (51). The inference of orthologous relationships could be used to propagate the annotation of characterized proteins to less studied members of the phylum.