PAIDB v2.0: exploration and analysis of pathogenicity and resistance islands

Pathogenicity is a complex multifactorial process confounded by the concerted activity of genetic regions associated with virulence and/or resistance determinants. Pathogenicity islands (PAIs) and resistance islands (REIs) are key to the evolution of pathogens and appear to play complementary roles in the process of bacterial infection. While PAIs promote disease development, REIs give a fitness advantage to the host against multiple antimicrobial agents. The Pathogenicity Island Database (PAIDB, http://www.paidb.re.kr) has been the only database dedicated to providing comprehensive information on all reported PAIs and candidate PAIs in prokaryotic genomes. In this study, we present PAIDB v2.0, whose functionality is extended to incorporate REIs. PAIDB v2.0 contains 223 types of PAIs with 1331 accessions, and 88 types of REIs with 108 accessions. With an improved detection scheme, 2673 prokaryotic genomes were analyzed to locate candidate PAIs and REIs. With additional quantitative and qualitative advancements in database content and detection accuracy, PAIDB will continue to facilitate pathogenomic studies of both pathogenic and non-pathogenic organisms.


INTRODUCTION
Increased awareness of infectious diseases of humans, animals and plants caused by microbial pathogens has accelerated the genome-wide study of microbial pathogenicity, called pathogenomics (1)(2)(3). Genomic islands (GIs) are regions of the genome that are acquired through horizontal gene transfer (HGT) (4). The genomes of pathogenic bacteria often contain pathogenicity islands (PAIs), a subset of GIs that mediate the horizontal transfer of genes encoding numerous virulence factors. Known PAIs encode, for example, the type III secretion system (e.g. LEE PAI in pathogenic Escherichia coli and Hrp PAI in Pseudomonas syringae), superantigen (e.g. SaPI1 and SaPI2 in Staphylococcus aureus), colonization factor (e.g. VPI in Vibrio cholerae), iron uptake system (e.g. SHI-2 in Shigella flexneri) and enterotoxin (e.g. espC PAI in E. coli and she PAI in S. flexneri). PAIs confer virulence upon the recipient, resulting in the dissemination and diversification of bacterial pathogens (5).
Antimicrobial resistance islands (REIs) are another class of GIs that are linked to pathogenesis by conferring simultaneous resistance to multiple antibiotics and facilitating the emergence of multidrug-resistant pathogens (6)(7)(8). For example, acquisition of the staphylococcal cassette chromosome mec (SCCmec) resulted in the emergence of methicillin-resistant S. aureus (9). The Salmonella genomic island 1 (SGI1) is associated with the multidrug-resistant form of Salmonella typhimurium (10). Pseudomonas aeruginosa genomic island 1 (PAGI-1) is found in the majority of clinical isolates (11). AbaR1 was reported to contain over 85% of the resistance genes of Acinetobacter baumannii AYE, explaining the remarkable ability of this emerging opportunistic pathogen to rapidly acquire multidrug resistance within a few decades (12).
Pathogenomic studies necessitate specialized data resources related to pathogens. Public database servers have been developed for searching virulence factors (e.g. VFDB (13), MvirDB (14)) and PAIs (e.g. PAIDB (15), PAI-IDA (16), PredictBias (17), IslandViewer (18)). A recently developed software suite, PIPS (19), was specifically designed to predict PAIs, but requires installation of multiple programs and databases on a Linux computer. Compared with most PAI-related databases, which focus on predicting PAIs by searching for HGT (20), PAIDB remains the only database dedicated to providing comprehensive information on all annotated and predicted PAIs in prokaryotic genomes (21). PAIDB also allows users to predict PAI-like regions that are homologous to known PAIs using an automated identification system. Several databases of resistance genes have also been described, such as ARDB (22), CARD (23) and BacMet (24). Although numerous REIs have been reported, to our knowledge, a REI-related database has yet to be developed.
In 2007, we released PAIDB, which contained 112 types of PAIs and 889 GenBank accessions of complete or partial PAI loci previously described in 497 pathogenic bacterial strains (15). Since the release of PAIDB, there have been continuous requests for an expanded collection of PAIs and candidate regions in newly sequenced genomes (21). Here, we present PAIDB v2.0, which contains 223 types of PAIs from 1331 accessions, and 88 types of REIs from 108 accessions. This update to PAIDB reflects a dramatic increase in the number of analyzed genomes, improved accuracy of candidate region detection and a functional update of the web application.

Definition of terms
We have previously defined a 'PAI-like region' as a predicted genomic region that is homologous to known PAI(s) and contains at least one virulence gene homolog from the PAI loci (15,25). If a PAI-like region overlaps a GI, we call it a 'candidate PAI (cPAI)'; otherwise, the region is a 'nonprobable PAI (nPAI)'. Likewise, in this study, a REI-like region overlapping GI(s) is termed a cREI and a REI-like region not overlapping a GI is termed an nREI (Figure 1).
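The naming rule above can be sketched in a few lines of Python. This is only an illustration of the definitions, not the PAIDB implementation: the interval representation (1-based, inclusive coordinates) and the names `overlaps` and `classify_region` are assumptions introduced here.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) in 1-based inclusive coordinates.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_region(region: Interval, genomic_islands: List[Interval], kind: str) -> str:
    """Label a PAI-like or REI-like region as candidate (c) or nonprobable (n).

    `kind` must be "PAI" or "REI". A region overlapping any GI becomes
    cPAI/cREI; a region overlapping no GI becomes nPAI/nREI.
    """
    if kind not in ("PAI", "REI"):
        raise ValueError("kind must be 'PAI' or 'REI'")
    prefix = "c" if any(overlaps(region, gi) for gi in genomic_islands) else "n"
    return prefix + kind
```

For instance, under these assumptions a PAI-like region at 100..500 with a GI at 450..900 would be labelled a cPAI, whereas the same region with a GI at 600..900 would be an nPAI.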

PAI and REI data
GenBank accession numbers for PAI and REI loci were collected via an exhaustive search of GenBank and academic literature using a variety of terms related to 'pathogenicity island' and 'resistance island' (Supplementary Table S1). We also added PAIs and REIs that were reported in genome sequencing papers in a GenBank-like flat file format (Supplementary Table S2). Via expert review, we collected 223 types of PAIs, consisting of 1331 accessions for complete or partial PAI loci previously described in 804 pathogenic bacterial strains. Similarly, we collected 88 types of REIs with 108 accessions from 99 bacterial strains (Table 1).

Potential PAIs and REIs in prokaryotic genomes
As of October 2013, the sequence files of 2673 prokaryotic genomes (including 160 archaea) had been downloaded from the NCBI FTP server (Supplementary Table S3). To determine the pathogenicity of the retrieved organisms, we referred to related publications and to the Genomes Online Database (GOLD) (26). We considered an organism pathogenic if any of the bacterial strains caused any adverse effects in any host (human, animal, bird, fish, insect or bacteria). Aside from the 70 organisms without pathogenicity information, we tagged 1226 organisms as pathogenic and 1377 as non-pathogenic (Supplementary Table S3). The genomes were analyzed to predict potential PAIs and REIs, producing 3579 regions that were PAI-like or REI-like in 966 strains. Of these regions, 1596 cPAIs were detected in 560 strains and 210 cREIs were found in 178 strains (Figure 2, Supplementary Table S4). In total, 49.3% of the pathogenic strains (604 strains) were predicted to have 1366 cPAIs. Intriguingly, 424 cPAIs were also found in 18.6% of the non-pathogenic genomes (256 genomes). In contrast to cPAIs, cREIs were detected in a relatively small number of genomes (137 pathogenic and 38 non-pathogenic).
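The genome totals and the two percentages quoted above can be rechecked with simple arithmetic. The snippet below is only a sanity check on the figures given in the text; the variable names are illustrative.

```python
# Counts taken from the text above.
pathogenic, non_pathogenic, unknown = 1226, 1377, 70

# The three pathogenicity groups should add up to the analyzed genomes.
total_genomes = pathogenic + non_pathogenic + unknown

# Share of genomes in each group predicted to carry at least one cPAI.
pct_pathogenic_with_cpai = round(100 * 604 / pathogenic, 1)
pct_non_pathogenic_with_cpai = round(100 * 256 / non_pathogenic, 1)

print(total_genomes, pct_pathogenic_with_cpai, pct_non_pathogenic_with_cpai)
```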

METHODOLOGICAL IMPROVEMENTS
To detect candidate regions in genome sequences, we modified the method previously described in (25) (Figure 1). In a given genome sequence, each open reading frame (ORF) was searched for homology against the collected PAI and REI dataset at the nucleotide and amino acid level using BLAT (27) and BLAST+ (28), respectively. A pair of sequences was considered homologous if the identity of the resulting hit was over 80% for the DNA sequence of a non-protein-coding ORF (e.g. tRNA, rRNA and pseudogene), or over 40% for a protein sequence, and the aligned region covered over 70% of the length of both the query and the hit. Overlapping or adjacent genomic regions corresponding to the same or different PAI and REI loci were joined into a larger region (Figure 3). Small genomic regions below 8 kb in size were excluded (20). Of these regions, PAI-like or REI-like regions were identified by checking for the presence of at least one virulence or resistance gene homolog, respectively. Finally, a region was considered a cPAI or cREI only if the PAI-like or REI-like region partly or entirely spanned a GI. The remaining regions that did not span a GI were denoted nPAIs or nREIs. We detail further updates in the methods for detecting GIs, virulence factors and resistance genes in the following sections.
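The homology criterion, region joining and size filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PAIDB code: the `Hit` record, its field names and the `max_gap` parameter (how far apart two regions may lie and still count as adjacent) are hypothetical, and parsing of BLAT or BLAST+ output is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hit:
    """Hypothetical summary of one ORF-versus-PAI/REI alignment."""
    start: int          # ORF start in the genome (1-based, inclusive)
    end: int            # ORF end in the genome (inclusive)
    identity: float     # percent identity of the alignment
    aligned_len: int    # length of the aligned region
    query_len: int      # length of the query sequence
    subject_len: int    # length of the database hit
    is_protein: bool    # True for protein hits, False for DNA hits


def is_homolog(hit: Hit) -> bool:
    """Apply the identity (>40% protein, >80% DNA) and >70% coverage rules."""
    min_identity = 40.0 if hit.is_protein else 80.0
    covers_both = (hit.aligned_len > 0.7 * hit.query_len
                   and hit.aligned_len > 0.7 * hit.subject_len)
    return hit.identity > min_identity and covers_both


def merge_regions(intervals: List[Tuple[int, int]], max_gap: int = 0) -> List[Tuple[int, int]]:
    """Join overlapping or adjacent intervals; max_gap is an assumed tolerance."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + max_gap + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def homologous_regions(hits: List[Hit], min_size: int = 8000, max_gap: int = 0) -> List[Tuple[int, int]]:
    """Keep homologous ORFs, join them, and drop regions below 8 kb."""
    kept = [(h.start, h.end) for h in hits if is_homolog(h)]
    return [(s, e) for s, e in merge_regions(kept, max_gap) if e - s + 1 >= min_size]
```

The remaining steps (requiring at least one virulence or resistance gene homolog and testing for overlap with a GI) would then be applied to the regions returned by `homologous_regions`.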

Detection of genomic islands
GIs are a heterogeneous class of mobile elements that contain a large collection of genes acquired by HGT. Various methods have been suggested for their detection in microbial genomes (20). In the original version of PAIDB (15), genes were considered as acquired by HGT if their G+C content and codon usage were both aberrant (25). A GI was then identified by merging neighboring HGT genes. However, the P-value for codon usage deviation was calculated assuming a normal distribution of codon frequencies, which was later suggested to be suboptimal (29). Hence, to detect HGT regions in this update we have used SIGI-HMM (30), which measures the codon adaptation index, and IslandPath-DIMOB (31), which uses dinucleotide bias in combination with the presence of mobility gene(s). Both methods were reported to be the most accurate GI predictors (32) and were applied in the IslandViewer web server (18). HGT regions detected by these methods were merged into a larger GI as described previously (25).

Identification of virulence and resistance genes in candidate regions
In our detection scheme, the presence of virulence- or resistance-related genes is a crucial criterion for identifying candidate regions in a genome (Figure 1). We tagged virulence and resistance genes of PAIs and REIs through a literature search for verified genes. In addition, we adopted known virulence genes from the Virulence Factor Database (VFDB) (13) and resistance genes from the Comprehensive Antibiotic Resistance Database (CARD) (23) and the Antibacterial Biocide and Metal Resistance Genes Database (BacMet) (24). Transposase genes and integrase genes were excluded from the list. The sequence identifiers of the known virulence and resistance genes (e.g. NCBI accession number) were used to retrieve amino acid sequences from the GenBank or UniProt website: 2266 from VFDB, 1833 from CARD and 702 from BacMet. PAI/REI-like regions were identified by checking for the presence of at least one virulence/resistance gene homolog, as described above.

Figure 2 (partial caption). In each stacked bar, the total number is denoted on the top and the proportion (as a percentage) is shown inside, according to the organism's pathogenicity status: pathogenic (black), non-pathogenic (light gray) and unknown pathogenicity (dark gray). In a group of barplots for predicted regions, the left bar denotes the total number related to homologous regions, and the right bar represents the number related to candidate regions.

Browse
PAIDB is freely accessible at http://www.paidb.re.kr. The web-based database was redesigned to offer a user-friendly graphic interface with clear visualization of PAIs, REIs and candidate regions in bacterial genomes. The organization of the website follows the previous version of PAIDB (15). The web pages were modified to reflect the new addition of REI data and to accommodate the significantly expanded content (Figure 4A). The menus 'PAIs' and 'REIs' enable users to browse annotated information on each PAI and REI. The 'Genomes' menu provides a list of candidate regions of PAIs and REIs in each microbial genome. When a genome accession number is clicked, the 'Genome Information' page shows a circular genome map and tables for PAIs, cPAIs, nPAIs, REIs, cREIs and nREIs (Figure 4B). The circular genome map is clickable and links to a linear genome browser view of the selected genomic region. Each candidate region listed in the tables is linked to a feature table, which contains the genes and virulence/resistance determinants.

Search tools
The 'Search' menu enables users to retrieve PAI and REI data stored in PAIDB through text- and homology-based searches (Figure 4C). In addition to the PAIDB data, this version allows users to explore virulence factors from PAIDB and VFDB (13) and resistance determinants from PAIDB, CARD (23) and BacMet (24). To facilitate follow-up research, the search results are linked to internal and external databases. The phylogenetic relationship of the selected genes can be inferred through multiple sequence alignment using ClustalW2 (33).

PAI finder
In addition to discovering candidate PAI regions in query sequences, 'PAI Finder' was modified to also locate candidate REI regions. The overall detection scheme follows Figure 1, except for the GI prediction step: BLAT and BLASTX searches against PAIs and REIs, and BLASTX searches against virulence genes and resistance genes. The allowed number of DNA sequences in the multiple FASTA input was increased to 1000 ORFs (approximately 1 Mb). Multithreading, multiprocessing and queuing were implemented to accommodate the volume of the database, the increased number of input sequences and multiple requests by users.
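Before submitting a multiple FASTA file to PAI Finder, users may wish to check that it stays within the 1000-sequence limit stated above. The following is a minimal client-side sketch; the function names are hypothetical and it is not part of PAIDB itself.

```python
from typing import Iterable

# Limit on the number of sequences, as stated in the text above.
MAX_SEQUENCES = 1000


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def check_input(fasta_text: str, limit: int = MAX_SEQUENCES) -> int:
    """Return the number of records, or raise ValueError if empty or over the limit."""
    n_records = count_fasta_records(fasta_text.splitlines())
    if n_records == 0:
        raise ValueError("no FASTA records found")
    if n_records > limit:
        raise ValueError(f"{n_records} sequences exceed the limit of {limit}")
    return n_records
```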

PAIDB v2.0 allows comprehensive exploration and analysis of PAIs and REIs
Virulence factors and resistance factors are overrepresented in large mobile genetic elements of PAIs and REIs present in bacterial pathogens (4,5,34). PAIDB (15) has been a specialized reservoir of all the annotated and candidate PAIs predicted by a method described previously (25). In addition to PAIs, PAIDB v2.0 is now a centralized resource of REIs described so far in the academic literature. The updates included in PAIDB v2.0 are manifold: (i) inclusion of REI data, (ii) improvement of GI detection accuracy, (iii) significantly increased inventory of virulence and resistance genes, (iv) dramatic increase in the number of genomes analyzed and (v) improvement in text- and homology-based searches and in the identification system for candidate regions in query sequences.

Figure 3. Example of detection of a candidate REI in a genome sequence. A 27.6 kb genomic region in the chromosome of methicillin-resistant S. aureus ST80-IV (GenBank accession number: NC_017351) was identified as a cREI by merging genomic regions homologous to known REI loci (yellow bar). The stitched-together genomic region contains homologs of seven resistance genes from REI loci and CARD datasets (red arrow). The region spans a GI (gray bar) and has a G+C content (-2.56%, P-value ≈ 0) lower than that of the rest of the chromosome. Therefore, this REI-like region is considered as a cREI. Red arrows in yellow bars denote resistance genes.

Detection of genomic segments homologous to the reported REIs, rather than individual homolog(s), can identify antimicrobial resistance regions in a sequenced genome
GIs are hotspots for the stepwise insertion of different genetic fragments carrying virulence and resistance determinants (5). PAIs often have mosaic-like structures, such as Hrp PAI in P. syringae (35), SPI-2 in S. typhimurium (36) and PAI I in verocytotoxin-producing E. coli (37). This is also true for REIs, such as SGI1 in S. typhimurium (10), PAGI-1 in P. aeruginosa (11) and AbaR1 in A. baumannii (12). We have previously developed an algorithm that reflects the evolutionary process of PAIs: it detects genomic segments homologous to known PAIs and merges them into a large PAI-like region (25). It should be noted that this approach also reflects disruption and reorganization of a gene cluster during genome reorganization (38) (Figure 3). The algorithm was successfully applied to identify potential PAIs in prokaryotic genomes (15). In this study, we modified and applied the algorithm to identify REIs in prokaryotic genomes, yielding 210 cREIs in 178 organisms. As shown in Figure 3, when our method is applied to a genome with primary annotation (39), potential regions related to known PAIs and REIs can be located and demarcated without human intervention. Each predicted region carries information about the PAIs and REIs constituting it, providing insights into its function and origin.

The unexpected locations of candidate regions in non-pathogenic organisms allow pathogenomic study of non-pathogenic strains
Virulence factors involved in bacterial pathogenesis are often found in genomes of non-pathogenic bacteria (40,41). Comparative analysis of numerous genome sequences of both pathogenic and non-pathogenic strains of diverse bacterial genera can deepen our understanding of the roles of different classes of virulence factors (34,42). In the early version of PAIDB, 171 pathogenic and 108 non-pathogenic prokaryotic genomes derived from 35 classes were analyzed to identify potential PAIs (15). In PAIDB v2.0, the number of genomes analyzed has drastically increased to 1226 pathogenic and 1377 non-pathogenic strains from 90 classes (Figure 2, Supplementary Table S3). While the majority of cPAIs (86%) and cREIs (79%) were detected in pathogenic genomes, they were also found in a small portion of non-pathogenic organisms. The unexpected locations of potential PAIs and REIs in non-pathogenic genomes and their comparison with counterparts in pathogenic genomes may help to clarify the role and mechanism of virulence determinants. Importantly, such analysis may facilitate reassessment of the virulence potential of presumed non-pathogens in light of a better understanding and interpretation of virulence factors.

CONCLUSION
As the number and diversity of sequenced microbial genomes continue to grow rapidly, this web-based, user-friendly resource will continue to contribute to the investigation of genomic regions related to pathogenicity and to give insight into the evolution of pathogenesis. We envision that PAIDB will be of significant use in detecting PAIs and REIs in newly sequenced genomes and mining virulence determinants from metagenomic analyses. Furthermore, as a unique resource for experimentally verified and computationally predicted PAIs and REIs, PAIDB should be particularly useful for designing clinical biosensors for pathogen detection and infectious disease diagnostics. PAIDB will continue to incorporate newly discovered PAIs and REIs in a timely manner to keep pace with the rapidly developing field of pathogenomics.