Elevated levels of adaption in Helicobacter pylori genomes from Japan; a link to higher incidences of gastric cancer?

Helicobacter pylori is a bacterium that lives in the human stomach and is a major risk factor for gastric cancer and ulcers. H.pylori is host dependent and has been carried with human populations around the world after their departure from Africa. We wished to investigate how H.pylori has coevolved with its host during that time, focusing on strains from Japanese and European populations, given that gastric cancer incidence is high in Japanese populations, while low in European. A positive selection analysis of eight H.pylori genomes was conducted, using maximum likelihood based pairwise comparisons in order to maximize the number of strain-specific genes included in the study. Using the genic Ka/Ks ratio, comparisons of four Japanese H.pylori genomes suggests 25–34 genes under positive selection, while four European H.pylori genomes suggests 16–21 genes; few of the genes identified were in common between lineages. Of the identified genes which were annotated, 38% possessed homologs associated with pathogenicity and / or host adaptation, consistent with their involvement in a coevolutionary ‘arms race’ with the host. Given the efficacy of identifying host interaction factors de novo, in the absence of functionally annotated homologs our evolutionary approach may have value in identifying novel genes which H.pylori employs to interact with the human gut environment. In addition, the larger number of genes inferred as being under positive selection in Japanese strains compared to European implies a stronger overall adaptive pressure, potentially resulting from an elevated immune response which may be linked to increased inflammation, an initial stage in the development of gastric cancer.


Introduction
Helicobacter pylori is a Gram negative bacterium that colonizes the human stomach and is adapted to its harsh acidic conditions to such an extent that it is completely dependent on its host. The majority of people around the world are infected with H.pylori, although infection rates may vary according to the region [1]. H.pylori is often pathogenic, with infection linked to gastric cancer and ulcers [2]. Gastric cancer is the second most common cause of death from cancer worldwide, with Korea, Japan and China demonstrating the highest incidences [3]. In contrast, European populations have a low incidence of both gastric cancer and ulcers [4]. While H.pylori infection is a major risk factor for gastric cancer [5], not all epidemiological studies have shown a direct relationship between H.pylori infection and gastric cancer [6]. These inconsistencies may be due in part to additional factors such as smoking, alcohol consumption and diet, variations in the immune response, as well as the H.pylori genotype [7,8]. It is also notable that the incidence of gastric cancer has fallen over the past few decades in Europe [9] from a 19th century high in some countries [10,11], and this reduction has been attributed to a variety of factors such as a drop in smoking, changes in diet and possibly a reduction in H.pylori infection [12]. Interestingly, there is growing evidence that H.pylori may be beneficial early in life, protecting from acid-related esophageal diseases [13][14][15], asthma [16,17], and potentially obesity [18]. Thus, the effects of H.pylori infection on the host appear more complex than simply a host-pathogen interaction.
The H.pylori genotype may predispose an individual towards the development of gastric cancer if certain pathogenic factors are present [2]. One such factor is the cag pathogenicity island (cag PaI), which encodes a Type IV secretion system that injects the oncogenic CagA protein into host cells [19]. Strains that possess it are linked with an elevated incidence of gastric cancer and ulcers [2]. In East Asian countries, elevated gastric cancer rates are linked to the presence H.pylori strains that possess CagA with an EPIYA-D motif [20]. Another gene linked to regional variation in gastric cancer incidence is the cytotoxin VacA. The vacA s1m1 genotype predominates in Japan and Korea and has been linked to a higher incidence of gastric cancer, compared to the vacA s2m2 genotype that is more common in the West [20].
Additional genes associated with pathogenicity in H.pylori include the blood group antigen binding and sialic acid-binding adhesin genes [babA and sabA; 21,22], the iceA gene [23], outer inflammatory protein and duodenal ulcer promoting genes [oipA and dupA; 24,25] and flagellin subunit A gene [flaA ; 26]. babA, sabA and iceA are involved in cell adhesion, oipA and dupA are involved in the induction of inflammation and flaA is involved in motility. Given that H.pylori varies substantially in strain-specific genes [27], the possibility exists that there are additional unidentified genes associated with pathogenicity and that these are partly responsible for regional differences in gastric cancer and ulcer incidences.
In pathogen genomes, a proportion of genes under positive selection are expected to be pathogenic factors. This is because pathogenic factors are involved in a coevolutionary 'arms race' with the host, particularly those interacting directly with the immune system and other host receptor proteins. Competition with the host leads to rapid adaptation and signatures of positive selection [28]. In addition, a role in pathogenicity may be indicated by involvement on the cell surface, where many pathogenic factors have a role in cell adhesion, toxin secretion and host recognition [29]. Here, in an effort to identify novel pathogenic factors in the H.pylori genome, and strain-specific differences in overall adaptive pressure which might be linked to differences in the host immune response, we used a maximum likelihood Ka/Ks-based method to infer which genes were under positive selection.

Data source and phylogenetic analysis
A phylogenetic tree of 37 H.pylori strains with complete genomes available was constructed (Fig. 1). The strains, with NCBI Refseq ID, country of origin and associated disease status where available (Ga; gastritis, PUD; peptic ulcer disease, ML; MALT lymphoma, GC; gastric cancer, C; commensal) were as follows. SouthAfrica7 (NC_017361.  The phylogenetic tree was constructed as follows. The housekeeping genes atpA, atpD, efp, glnA, mutY, ppa, trpC, ureI and yhpC were obtained from each genome, concatenated, and a multiple alignment constructed using the Muscle program [30] and default parameters. This was used as input to the JModelTest program [31] to identify an appropriate substitution model for the phylogenetic analysis. The best model selected using the Akaike Information Criterion was GTR + I + D, with values of I = 0.638 and D = 0.454. Bayesian phylogenetic inference was conducted using the MrBayes 3.2 program [32], running the analysis for 8 000 000 generations and discarding as burn-in the first 25% of the samples generated. Genome-wide positive selection pipeline Pairwise genome comparisons, using four European and four Asian H.pylori strains were conducted, using a third strain as an outgroup in each comparison. The maximum likelihood-based branches test of PAML (Phylogenetic Analysis by Maximum Likelihood) package version 4 [33] was utilized for the comparisons, which produces an estimate of the genic Ka/Ks ratio, which is calculated over the entire length of the protein coding gene. Because of this, the genic Ka/Ks ratio is a strong indicator of positive selection when detected [34, a review]. A value of Ka/Ks > 1 is consistent with positive selection, Ka/Ks < 1 is consistent with negative selection and Ka/Ks = 1 is consistent with an absence of selection (neutrality). The maximum likelihood approach that we follow is more accurate than the commonly used 'approximate' methods to calculate the genic Ka/Ks [35][36][37][38], and also has the advantage of giving lineage specific information. The alternative branches sites test [39] looks for individual sites that may be under positive selection, using the Ka/Ks ratio at each site. While more sensitive at detecting lineage specific positive selection, it is also prone to false positives resulting from sequencing and alignment errors [40][41][42], the effects of recombination [43,44], and segregating polymorphisms [45]. Lastly, for pairwise comparisons its use is inappropriate given that substantial numbers of orthologous sequences are required for the alignment [33,46] and this necessitates a known phylogeny with 100 % certainty [47, a review], which is not available for H.pylori. Lastly, more limited pairwise comparisons allow the analysis of strainspecific genes that would be excluded if the H.pylori core genome of larger numbers of strains were used, given that known pathogenic factors are often strain specific in H.pylori.
Pairs of H.pylori strains were selected using the phylogenetic tree as a basis; those pairs that formed monophyletic groups were utilized. For the European strains these were B38/G27, and 26695/ p12. For the Asian strains these were 35A/83 and F16/F57. The Asian strains were all from Japanese populations; 35A and 83 isolated from individuals from Kyoto, F16 and F57 were isolated from individuals from Fukui (Yoshio Yamaoka, Michael E. DeBakey Veterans Affairs Medical Center, Baylor College of Medicine, personal communication). The HUP-B14 strain was used as an outgroup for both European pairwise comparisons and strain 52 was used for both Japanese pairwise comparisons. All pairs chosen were statistically supported (posterior probability >0.95), with the exception of the F16/F57 pair, which were statistically superior to the alternative F16/F32 and F16/F52 pairs (Fig. 1). In this case, while the use of alternative pairs would not affect the nature of the analysis, their choice would be arbitrary and we wished to use the most likely monophyletic pair according to the phylogenetic analysis. Because an unrooted tree is used for the analysis, the phylogenetic relationship of the pair to the outgroup is not expected to affect the outcome, and this is expected to minimize the potential effects of recombination given that the sequences are equidistant phylogenetically.
A pipeline was developed using Python, integrating a variety of different software packages (Fig. 2); the script is available from the authors upon request. The first step of the analysis was to identify all orthologous genes present in each set of three strains (pairwise comparison and outgroup). This was accomplished using the OrthoMCL program [48]. OrthoMCL begins with a reciprocal allagainst-all BLASTP search within strains, to identify putative paralogs, and between strains, to identify putative orthologs. The default cutoff e-value of 1e-5 was used. Putative paralogous and orthologous relationships are converted into a graph, which is then subjected to the MCL algorithm [49] in order to identify orthologous groups. An inflation value of 1.5 (default) was used for this stage of the analysis. The output was used to construct multiple alignments of orthologous protein sequences for each set of strains, using Muscle and default parameters. Then, each alignment was converted to DNA using the PAL2NAL program [50] and the respective DNA sequence corresponding to each protein sequence. This procedure ensures the correct placement of gaps.
To infer genes under positive selection, the Ka/Ks (or o) ratio was calculated using the branches test of the codeml module of PAML [33], for each set of orthologous genes. The null model was the one ratio model, where the value of o is fixed, while the alternative model allows the value of o to vary for individual branches (free ratio test). Likelihood values were generated for both models and a likelihood ratio test was applied using a V 2 distribution to calculate P values. Those gene comparisons that were not significant (P > 0.05) were excluded from further analysis. Values of Ka, Ks and Ka/Ks were extracted for each gene pair. Saturation of synonymous substitutions can lead to an underestimate of Ks and thus an overestimate of Ka/Ks [51]. Synonymous site saturation was considered as Ks > 1 (which means that all sites have been substituted at least once on average), and sequences with this value were excluded from the analysis. Given that there were some differences in branch lengths between the European and Japanese strains ( Fig. 1), analyses that estimated genes with Ks = 0 were excluded, given that this may lead to an inflated value of Ka/ Ks. Genes under positive selection were inferred with a value of Ka/Ks > 1 and P < 0.05. A P value < 0.05 has been shown in simulation studies to reliably minimize the occurrence of false positives, while maintaining the power of the maximum likelihood approach [52], which is important given that the genic Ka/Ks ratio is a conservative test for positive selection.

Functional characterization
Genes inferred to be under positive selection which possessed a functional assignment from the genome annotation, were categorized using their gene ontology (GO) classifications [53]. GO enrichment was conducted using Blast2Go [54]. Unannotated genes (identified as 'hypothetical proteins') were blast searched against the non-redundant Genbank database. If they possessed significant sequence similarity with proteins of known function (over 30% sequence identity at the amino acid level and e-value < e À 15), then they were classified as homologs. Genes with no known homologs that were inferred as being under positive selection were subject to localization prediction in order to infer whether they are membrane localized. The CELLO v.2.5 program [55] and PSORTb v.3.0 program [56] were used to predict the membrane localization of the proteins.

Results and Discussion
The complete genome sequences of the eight strains chosen for this study were aligned and show the recombinergic nature of the H.pylori genome, as previously noted [57] (Supplementary Fig. S1). Table 1 presents statistics from the eight pairwise genome comparisons. The numbers of genes analyzed after filtering are similar, with the 26695/P12 comparison presenting the highest number of gene pairs (1247), and the 35a/83 comparison the least (1173). The genome pairs were closely related to each other, with average lineage specific Ks values ranging from 0.023 substitutions per site (strain 35a) to 0.058 substitutions per site (strain 26695). Thus, the level of divergence between genome pairs was such that a substantial proportion of the genes in each genome were included in the analysis after filtering, which removes saturated and non-divergent sequences. The similarity in the numbers of genes retained is important for comparing the results of the different comparisons.
Overall, the majority of the genes examined from the eight comparisons had Ka/Ks < 1, consistent with the expectation that the majority of the genes in a genome should be under purifying selection. The average Ka/Ks for each genome was calculated only from these genes, with values ranging from 0.10 (strains 35a and 83) to 0.17 (strain 26695; Table 1). These values indicate that the genomes are under similar evolutionary pressures in different human populations regarding purifying selection, and imply that modes of transmission, host genetics and environmental factors do not exert a significant influence on the strength of purifying selection.
Positive selection was inferred to be acting on a number of protein coding genes from all H.pylori pairwise comparisons. The Japanese strain 83 presented the highest number of genes with Ka/Ks > 1 (34 genes), while the European strains B38 and 26695 had the least, with 16 genes each. Genes inferred as being under positive selection were significantly elevated in Japanese strains compared to European strains (P < 0.05, two-tailed Mann-Whitney test). Thus, the numbers of genes inferred as being under positive selection in the H.pylori strains may vary by a factor of 2. Strain B38 possesses a phage present in the genome [58], but none of the genes inferred as being under positive selection were phage associated. GO enrichment shows that transport proteins are particularly highly represented, reflecting a major role for membrane proteins in adaptation (Fig. 3). Tables 2 and 3 list those genes inferred to be under positive selection for the four comparisons for Japan and Europe, respectively. A total of 190 genes were inferred to be under positive selection, of these 65 were unannotated (described as 'hypothetical protein' or 'conserved hypothetical protein'). When the Ka/Ks values for different individual genes are plotted around the genome for strains F16 and F57 a mosaic pattern of selection values is observed, with no regions strongly associated with enhanced negative selection pressure (Fig. 4). However, strain F16 shows a number of distinct clusters of genes inferred as being under positive selection; these could be termed 'islands of adaptation'. This clustering is not observed with strain F57, where the distribution appears largely random.
Of the genes identified in the analysis that were annotated, 38% had homologs with known roles in host adaptation and/or pathogenicity (Supplementary Material S1). Among these were genes involved in cell adhesion (CMP-Nacetylneuraminic acid synthetase, strain F57), secretion (sec-independent translocase strain G27 and P12), host recognition (neuraminyllactose-binding hemagglutinin, strain 35a) and host colonization (urease-enhancing factor and flgM protein, strain F16). These results indicate the efficacy of using an evolutionary rationale to identify potential strainspecific pathogenic factors. In addition, a number of genes were identified that have been proposed as drug targets, such as shikimate kinase (strain 83) [62], adenine-specific DNA methyltransferase (strain F16) [63], sialidases (strain F16) [64] and chorismate mutase (strain 83) [65]. While previously the cagA oncogene has been inferred to be under positive selection [66][67][68][69][70], this was not detected using our methodology. There are two potential explanations for this; firstly, our methodology uses Ka/ Ks, which is a conservative measure. Second,    H.pylori isolated from Amerindian populations show the strongest degree of inferred positive selection on cagA [69], and these were not included in our study.
While the function of the unannotated proteins inferred as being under positive selection is unknown, it is possible to predict whether these proteins are membrane localized. Predicted membrane localization is indicated in Tables 2 and 3. Out of 65 unannotated proteins inferred as being under positive selection, 13 were predicted to be membrane proteins. These are good candidates as novel pathogenic factors, given that many pathogenic factors are membrane localized, and may represent potential markers for gastric cancer. Table 4 shows genes inferred as being under positive selection that were identified in more than one pairwise comparison; these represent examples of convergent evolution, with inferred positive selection occurring in parallel in different strains. All these genes are located in the cytoplasm, with the exception of membrane localized sec-independent translocase, and participate in various cellular housekeeping functions. The genes are probably under similar selective pressures in the different human populations, and so are probably not greatly influenced by differences in host genetics or the stomach environment. Interestingly, sec-independent translocases are involved in secretion and so have a potential link to pathogenicity [71].
Potential causes of differences in the strength of adaptive evolution between Japanese and European strains The causes of lineage-specific variation in rates of adaptation are not fully understood. While Darwin proposed slow, gradualist evolutionary change, bursts of adaptation such as is observed in adaptive The area of Japan from where the strains were isolated, and the disease status of the individuals from whom the H.pylori strains were isolated is indicated in brackets ('PUD' signifies peptic ulcer disease, 'Ga' gastritis, 'GC' gastric cancer). References describing the respective disease status for each strain are as follows:  radiations are well known, as are examples of the contrasting case of evolutionary stasis. So, while it is clear that rates of adaptation may vary at the phenotypic level, on a genomic level this has been little explored. Here, we present an example of differential adaptive evolution between eight strains of H.pylori, which contrasts with their similarity in levels of purifying selection. Hence, the overall strength of adaptation, using the measure of number of genes inferred as being under positive selection, is markedly stronger in the Japanese strains. In prokaryotes, lineage-specific differences in genome wide adaptive pressure have also been reported in the Streptococcus [72] and Campylobacter [73] genera. The identities of the genes inferred as being under positive selection themselves are mostly different in the eight strains. This indicates lineage-specific differences in adaptive pressures. There could be three different explanations for these differences, which are not mutually exclusive; (1) Differences in the stomach environment unrelated to host genetics. Differences in the stomach environment may be linked to dietary factors, smoking, levels of exercise, and differences in microbiota [74] and are expected to be different in the populations examined.
(2) Differences in the bacterial genetic background. The genetic background of the bacteria may exert an influence via epistatic effects, which may lead to differences in evolutionary trajectories [75]. Alternatively, specific bacterial factors may influence the overall immune response to the bacteria, thus increasing overall selection pressure (3) Differences in host genetics. This is discussed in the following section.
Potential role of the immune system Positive selection on microbial pathogenic factors often occurs via a selective pressure exerted by the host immune response [28], with antigenic moieties on the surface of the pathogenic factor undergoing adaptive evolution in order to avoid host immune recognition. Thus, greater levels of adaptive evolution in the Japanese strains may reflect an enhanced immune response and corresponding differences in host genetics. Strong differences in immune response, reflecting host genetics, is observed for a range of human pathogens, and differences in response may also be observed depending on the human population. In the case of H.pylori, polymorphisms in a number of immune genes have been shown to increase the immune response to infection and increase the risk of gastric cancer [76, a review]. While polymorphisms are often show differential distributions in different human populations, surprisingly few studies have focussed on population specific host genetic susceptibility factors to H.pylori. In one example, blood group, which shows strong populational biases, influences H.pylori infection via differential binding efficiency of H.pylori BabA protein to blood group antigens [77]. Host genetic background may also play a role; in one study IL-1B and IL-1 receptor antagonist gene polymorphisms were associated with gastric cancer risk in Caucasians but not Asians [78]. An elevated immune response would be expected to result in stronger selective pressure on the bacteria, which would be indicated by enhanced adaptive evolution of individual genes, given that only a subset of H.pylori genes are involved in interactions with the immune system [79]. This would occur via a more potent recognition of bacterial antigens, exerting a greater mortality on the bacterial population. The observation that 38% of the genes inferred as being under positive selection in the analysis have homologs involved in pathogenicity is consistent with this scenario. This allows a hypothesis to be formulated that can explain both the elevated levels of adaptation observed in Japanese strains and the elevated levels of gastric cancer in Japanese populations. A heightened immune response leads to increased inflammation and this is turn may be linked to an elevated incidence of gastric cancer, given that inflammation has been identified as a key stage in its development [80]. There is a strong etiological link between H.pylori driven inflammation and gastric lymphoma [81], although the mechanistic link between inflammation and gastric carcinoma is not so clear [82]. Thus, a heightened immune response in the Japanese population may lead to both increased adaptation in H.pylori strains and increased oncogenesis. The strains examined in this analysis were isolated from individuals with a variety of different pathologies, but the signatures of positive selection that we detected are likely to have been accumulated over a longer time period than the lifespan of a single individual. More genome sequences will be required in order to fully test the hypothesis that elevated adaptation is related to increased inflammation, and the genes that are responsible for increased inflammation need to be established. A number of potential candidates have been identified in this analysis and experimental verification could be conducted on these.

Relationship between transmission mode of Helicobacter pylori and pathogenicity
Helicobacter pylori is often regarded as vertically inherited, but increasing evidence shows that horizontal transmission may be more common than previously recognized, especially in developing countries [83][84][85]. The argument can be made that if H.pylori is entirely vertically transmitted, the observation of signatures of positive selection we detected in known and potential pathogenic factors of H.pylori is contradictory. This is because if the bacteria are vertically transmitted they are expected to evolve to minimize harmful effects [86], while if they are horizontally transmitted then they are expected to have a degree of pathogenicity [87]. Thus, our results are inconsistent with a strictly vertical mode of transmission. The age of transmission may be important; vertical transmission has been observed from mother to child, i.e. during reproductive age. This might help to explain the late onset of gastric cancer; it occurs after transmission has occurred and is consistent with the increasing evidence that H.pylori is beneficial, with pathogenic effects exerting themselves later in life [19].

Conclusion
While there are similarities in the levels of purifying selection, Japanese H.pylori genomes have a greater number of genes inferred as being under positive selection than European strains and present new potential protein factors that interact with the host. These results help to further our understanding of the host interaction and pathogenesis of H.pylori. In particular, we propose that elevated levels of adaptation in these genomes may indicate an elevated immune response, and hypothesise that this provides a connection with elevated levels of gastric cancer in Japanese populations. In addition, in the absence of functionally annotated homologs, our procedure may have value in identifying potential novel pathogenic factors from pathogen genomes which could be crucial for adaptation to the host.

supplementary data
Supplementary data is available at EMPH online.