HGV&TB: a comprehensive online resource on human genes and genetic variants associated with tuberculosis

Abstract
Tuberculosis (TB) is an infectious disease caused by the fastidious pathogen Mycobacterium tuberculosis. TB has emerged as one of the major causes of mortality in the developing world. The role of host genetic factors that modulate disease susceptibility has not been studied widely. Recent studies have reported a few genetic loci that provide impetus to this area of research. The availability of tools has enabled genome-wide scans for disease susceptibility loci associated with infectious diseases. Until now, information on human genetic variations and their associated genes that modulate TB susceptibility has not been systematically compiled. In this work, we have created a resource, HGV&TB, which hosts genetic variations reported to be associated with TB susceptibility in humans. It currently houses information on 307 variations in 98 genes. In total, 101 of these variations are exonic, whereas 78 fall in intronic regions. We also analysed the pathogenicity of the genetic variations, their phenotypic consequences and ethnic origin. Using various computational analyses, 30 of the 101 exonic variations were predicted to be pathogenic. The resource is freely available at http://genome.igib.res.in/hgvtb/index.html. Using integrative analysis, we have shown that the disease-associated variants are selectively enriched in the immune signalling pathways which are crucial in the pathophysiology of TB. Database URL: http://genome.igib.res.in/hgvtb/index.html


Introduction
Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis (Mtb), an air-borne, nosocomial, gram-positive and acid-fast bacterium (1). Nearly one-third of the world's population is estimated to be infected with this pathogen (2). The disease has emerged as one of the major causes of mortality and morbidity in the developing world (1,3). An estimated 8.8 million new cases of TB were reported and 1.1 million affected individuals died in 2010. The majority of people infected with M. tuberculosis have latent infection with no evidence of clinical symptoms, but 10% of infected individuals develop clinical symptoms (1). Although the precise factors influencing disease predisposition have not been well studied, several of them, such as pathogen virulence (4), host nutrition (5) and host genetic factors (6), have been implicated in causing the disease.
Differences in disease susceptibility observed among different human populations, followed by twin studies (7), suggested that host genetic factors could at least in part influence predisposition to TB. In the early years, numerous studies aimed at understanding the genetic susceptibility to TB were performed and extensively reviewed. However, their findings have largely been ambiguous (8). The first clear role of genetic factors in TB susceptibility was suggested through the pivotal experimental work of Lurie (9). Studies on different ethnic groups (10) and twins (7) provided additional evidence suggesting a larger role of host genetic factors in determining susceptibility to M. tuberculosis infection and progression to disease. More genetic associations have been shown in recent years through approaches involving candidate genes (11-17). The use of genome-wide approaches has also revealed significant genetic associations with TB infection in recent years (18).
A number of distinct approaches and studies on diverse populations and ethnic groups have revealed genetic associations with respect to pathogenesis and outcome of TB. Though the data are available in the public domain, they are in disparate formats. A systematic attempt to collect, curate and perform integrative analysis of all the data on genetic factors that influence susceptibility and outcome of TB could provide immense insights into the major pathways and mechanisms involved in the pathogenesis of TB and also open new avenues of investigation.
We have systematically collected evidence on host genetic associations with TB from peer-reviewed literature and compiled it into a comprehensive and easily searchable online resource on human genes and genetic variants associated with TB (HGV&TB). The resource hosts information on 307 genetic variants from 162 studies. It provides a standardized view of genes and genetic variants, closely integrated with other online resources for gene and variant function analysis. For ease of future integration efforts, the genetic variant annotations in the resource have been standardized and conform to the recommendations of the Human Genome Variation Society (HGVS) (19) and recommendations for curation of Locus Specific Databases (20). Similarly, the gene names conform to the HUGO Gene Nomenclature Committee (HGNC) recommendations.

Data and resources
An exhaustive literature search was performed to retrieve all available evidence in the literature documenting association of host genetic variability with TB. For the literature review, PubMed was searched using keywords such as 'susceptibility', 'SNPs', 'variants' and 'genetics' in combination with 'TB'. Data for each associated variant were manually curated and classified. Gene annotations were mapped and standardized according to the HUGO Gene Nomenclature Committee (HGNC). Genetic variants were also remapped and standardized to conform to the recommendations of the HGVS (19). The data were distributed among the annotators and collected on a shared document system implemented in Google Drive. Individual research papers depicting sequence variations and their association with TB were systematically referred back to. Mutalyzer 2.0 b-8 (21) (HGVS nomenclature version 2.0) was used to further check the entries made. dbSNP was used to verify the IDs of the variant data. HGVS values were obtained from Mutalyzer 2.0 and dbSNP. For variations whose RSIDs were not available in dbSNP, the genomic variant change and location were taken directly from the literature. All variants were cross-checked and published only when no discrepancy was observed in the entries.
The sets of annotations were independently scrutinized manually by a team of database curators. Further in-depth bioinformatics analysis was performed using computational tools [SIFT (22) and PolyPhen2 (23)] to assess the potential biological significance of the variants and the genes that harbour them. The functional enrichment analysis was performed using the DAVID Bioinformatics Resource v6.7 (24). Disease classes and interaction pathways of the genes that have been associated with genetic susceptibility to TB were analysed. All P-values were reported after the Bonferroni correction for multiple testing. Assessment of functional interaction of genes was performed using STRING v.9.0 (Search Tool for the Retrieval of Interacting Genes/Proteins) (25).
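The Bonferroni correction mentioned above can be illustrated with a minimal sketch. It assumes the standard form of the correction (each P-value multiplied by the number of tests and capped at 1); the function name and the example P-values are hypothetical and are not taken from the DAVID output used in this work.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted P-values (standard form, capped at 1.0)."""
    m = len(p_values)  # number of tests performed
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw P-values for three tested pathways (illustration only).
raw = [0.01, 0.04, 0.5]
adjusted = bonferroni(raw)
for p, q in zip(raw, adjusted):
    print(f"raw={p:.3g} adjusted={q:.3g}")
```

The cap at 1.0 keeps each adjusted value interpretable as a probability when the number of tests is large.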

Database construction and features
The HGV&TB database was built in MySQL, and the browsable interface was created in HTML and Perl/CGI. Information for each mutation was compiled in annotation tables and made available through the searchable web interface. The database was built considering data interoperability and recommendations for curation of data using the guidelines provided by HUGO Gene Nomenclature Committee (HGNC) (26). For each mutation, information is provided at the molecular level, such as DNA change, exon, predicted amino acid change, type of mutation, reported and concluded pathogenicity, source of material, technique used and unique database ID. The gene and variant annotations comply with the HGNC recommendations. A brief citation of the source manuscript is also available in the database.

Annotation of the variations
Two independent methods, SIFT and PolyPhen2, were used to annotate the pathogenicity of the variants. While SIFT annotates variants as tolerated or deleterious, PolyPhen2 uses the terms benign, possibly damaging and probably damaging. The variants were annotated independently by each method, and a consensus was derived from the annotations. For each variant, the combination of tolerated and benign was reported as 'non-pathogenic', tolerated and possibly damaging was reported as 'probably pathogenic', and deleterious or probably damaging was reported as 'pathogenic'. The annotations of function and gene interactions were analysed using two popular online tools, DAVID (24) and STRING (25), respectively. Allele frequency of the variants was retrieved from the HapMap (27) for each variation (28).
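The consensus rule described above can be written as a short function. This is a sketch under stated assumptions: the function name and the 'unclassified' fallback for combinations the text does not cover (for example, a missing prediction) are hypothetical and are not part of the published pipeline.

```python
def consensus_pathogenicity(sift, polyphen):
    """Combine a SIFT call and a PolyPhen2 call into one consensus label."""
    sift = sift.strip().lower()
    polyphen = polyphen.strip().lower()
    # A strong call from either tool is enough for 'pathogenic'.
    if sift == "deleterious" or polyphen == "probably damaging":
        return "pathogenic"
    if sift == "tolerated" and polyphen == "possibly damaging":
        return "probably pathogenic"
    if sift == "tolerated" and polyphen == "benign":
        return "non-pathogenic"
    # Assumption: anything else (e.g. no prediction) is left unclassified.
    return "unclassified"


print(consensus_pathogenicity("tolerated", "benign"))
print(consensus_pathogenicity("Deleterious", "benign"))
```

Checking the 'pathogenic' condition first reflects the "deleterious or probably damaging" wording, so a single strong call overrides a milder call from the other tool.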

Results and discussion
Data summary
The HGV&TB database harbours data for human genetic variations associated with susceptibility to different forms of TB, such as general TB, pulmonary, extra-pulmonary, pleural, miliary, spinal, cavitary, paediatric, meningeal and various HIV-associated forms of TB. The database hosts information on 98 genes and 307 variants. Of the total number of genes, 7 belong to the HLA class of genes, whereas 91 are non-HLA genes. The non-HLA genes, including CCL1, CCL2, CCL5, IFNG, IFNGR1, IFNGR2, IL10, IL12RB1, MBL2, NOS2, P2RX7, SLC11A1 (NRAMP1), SP110, TLR2, TLR4, TNF (TNFA) and VDR, have a large number of variants associated with TB. In addition, information on associated variants in the HLA genes HLA-A, B, C, DPB1, DQA1, DQB1 and DRB1, which have been extensively shown to be associated with TB in multiple studies, has also been indexed in the database.
In HGV&TB, all genes and variants are indexed by unique database identifiers (HGVID). The resource can be searched using a variety of identifiers including HGVID, gene name, RSID, PMID, technique, template, geographic location, phenotype and concluded pathogenicity. Associated information, including the HGVS nomenclature for the variants, the dbSNP RSID, genomic location and pathogenicity status of the variant as described in the primary literature, has been included for each variant. In cases where the dbSNP RSIDs were not available, the variant change and position were obtained directly from the literature and compared with the databases using Mutalyzer (Supplementary Table S1).
Haplotypes, sets of variations located on the same chromosome, are considered to be better determinants for establishing phenotypic association than single nucleotide variations (29). Thus, in addition to SNVs, we have also compiled information on haplotypes showing significant associations. The HGV&TB database contains information on 75 haplotypes in 37 genes (Supplementary Table S2). For example, the SLC22A5 haplotype c.652+77A>G-c.1052+237T>C-c.1053-550G>C was found to confer disease susceptibility only in a Thai trio family study (30). Other genes, such as CCL5, CTSZ, IL12RB1, IRGM, MBL2, SP110, TIRAP, TLR1 and VDR, show both independent and haplotypic disease effects in various populations. In addition, gene studies in various Chinese populations revealed that variants could confer TB susceptibility both when present individually and when present as part of a haplotype. Variants in the AKT1 gene affect pulmonary TB susceptibility in the Chinese Han population both individually and in the form of the haplotype c.175+18C>T (rs3730358)-c.726G>A (rs1130233) (31). A similar pattern is observed in the BTNL2 gene (individual variants: rs3763313, rs9268494, rs9268492; haplotype: rs9268492-rs3763313-rs9268494-rs9405098-rs3763317-rs2076530) (32) and the MARCO gene (individual variant: rs17009726; haplotype 1: rs17009726-rs2278588; haplotype 2: rs17795618-rs1371562-rs6761637-rs2011839) (33).

Analysis of genomic loci and gene position for associated variations
For each variant in all 98 genes, the specific genomic location was determined from the literature and public databases, such as dbSNP (34) and Ensembl (35). The chromosomal map of the variants is depicted in Figure 2. Of the total number of variations, 101 mapped to the coding sequence, 78 to intronic regions, 38 to intergenic regions, 11 to the 3′ untranslated regions and 11 to the 5′ untranslated regions. An additional 27 variants mapped to upstream and 5 variants to downstream regions of the genes. Apart from these, 10 variations are haplotypes and a total of 11 fall in splice sites or within the intronic, exonic and 5′-UTR regions of ncRNA. Fifteen of the 307 variations have not been assigned any genomic locus (Figure 3A; Supplementary Table S4). A total of 197 variants did not fall in close proximity to protein-coding genes (2 kb from TSS). Of these, 129 mapped to potential long non-coding RNAs. The genomic loci and gene position mapping of the variants are summarized in Figure 2.
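As a consistency check on the counts above, the following sketch tallies the reported categories. It assumes the categories are mutually exclusive, which the text does not state explicitly; the dictionary keys are shorthand labels chosen here for illustration.

```python
# Counts as reported in the text; keys are shorthand labels (assumed,
# not database field names).
region_counts = {
    "coding sequence": 101,
    "intronic": 78,
    "intergenic": 38,
    "3' UTR": 11,
    "5' UTR": 11,
    "upstream": 27,
    "downstream": 5,
    "haplotypes": 10,
    "splice site or ncRNA": 11,
    "unassigned": 15,
}

total = sum(region_counts.values())
print(total)  # 307, matching the number of variants in the database
```

That the categories sum to the 307 variants hosted in the database is consistent with each variant having been counted once.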

Functional consequences of the variations as predicted by PolyPhen2 and SIFT
Apart from the reported pathogenicity of each of the variants, we performed an independent analysis of the potential functional consequences of each of the variations. We used two independent tools, SIFT and PolyPhen2, for the analysis of the functional consequences of the variants in the database. Both methods have been extensively used in the past for annotation of deleterious effects of variations on protein structure and thereby on function (36). Of the 101 variants which mapped to protein-coding genes, 27 (out of 60 predictions) and 11 (out of 46 predictions) were predicted to be deleterious by SIFT and PolyPhen2, respectively (Figure 3B). A total of eight variants were predicted to be deleterious by both tools (Supplementary Table S5; Supplementary Figure S1).

Associations of variants with other diseases and/or traits
Many genes reported to be associated with TB are also found to be associated with other traits and conditions (Supplementary Table S1). For example, CHIT1 genomic variation is associated with atopy, allergic rhinitis, contact dermatitis, food or drug allergy and asthma. Similarly, variants in other genes, such as BTNL2, CD209 promoter, CISH, CR-1, IL12A, MBL2 and TLR8, were also found to be involved in various diseases. BTNL2 polymorphism has recently been associated with inflammatory autoimmune diseases, such as sarcoidosis, and also with leprosy. Likewise, the CD209 promoter polymorphism -336A/G is associated with human susceptibility to dengue and HIV-1 besides TB. A complete list of disease classes with which each gene has been associated is presented in Supplementary Table S3.

Functional interactions between associated genes
The genes harbouring associated variants were further analysed for mutual interactions using online tools available at DAVID Bioinformatics Resource v6.7 (24). Functional annotation analysis from the DAVID bioinformatics resource revealed that most of the genes formed close gene-gene interactions. This was observed from interactions based on direct or physical contact or interactions deduced using text-mining, gene context and high-throughput experiments including correlations in expression. This points to the close biological context in the functional organization of the genes. A close analysis of the genes also revealed their enrichment in cell signalling pathways, especially in the Toll-like receptor signalling, cytokine-cytokine receptor interaction and JAK-STAT pathways. No functional interaction was observed between the genes reported in the catalogue of published genome-wide association studies for TB. The enriched pathways are summarized in Table 1. A highly connected network of all the genes associated with TB was generated by STRING v.9.0, with almost every gene linked to more than one other gene (Supplementary Figure S2).
We also found evidence of gene interaction from the haplotypes associated with susceptibility to TB. A few variations showed association only in concurrence with the presence of other variants in a different set of genes. For example, the TLR6 gene variant (rs5743810) was not independently associated with TB, whereas in combination with TLR1 variants (rs4833095 and rs76798247) it was found to affect susceptibility to TB in the African American population (37). Some disease susceptibility gene variations are also found in diverse human populations, both independently and in association with variants of the same or other genes. TNF alpha (TNF gene) variants (-238G/A, -308G->A, -836 A/C) were associated with various forms of TB in diverse populations either independently (38-44) or in combination with variants in the TLR4 gene [rs7791836 (TNF)-rs1399431 (TLR4)] (45). Similarly, variations in other genes, such as CCL2 (46-50) and PSMB48 (51), were jointly associated with susceptibility to TB. It was also observed that variations in NOS2A, TLR4 and IFNGR1 were associated with susceptibility in different populations, both individually and in association with variants of the same as well as different genes. In African populations, however, they showed strong gene-gene interactions leading towards various forms of TB.

Analysis of population frequencies of variations in HGV&TB
Independent evaluation of the variations in the HGV&TB database revealed that the largest number of variants have been discovered in African populations (N = 90), followed by Chinese (N = 74) and Indian (N = 69) populations (Supplementary Table S6). Most of the reports of genes and genetic variants were confined to one population or ethnic group, barring a handful of genes, such as IFNG, IFNGR1, SLC11A1 and VDR, which have been shown to be associated in a number of populations, suggesting a robust association.
In addition, allele frequencies of variations in the database were independently analysed in the world populations, using data from the HapMap project (Supplementary Figure S3; Supplementary Table S7). Additional analysis was performed on the basis of the integrated haplotype score (iHS), a statistical parameter used to detect evidence for positive selection. iHS data corresponding to different populations and chromosomes for HapMap phase 2 were downloaded and parsed for the entire variant dataset with respect to rsIDs. A total of 20 variants showed evidence of selection (iHS < -2) in populations (CEU, YRI) (Supplementary Table S8).
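The iHS filtering step described above can be sketched as follows. The tab-separated layout with columns 'rsid', 'population' and 'ihs', the function name and the rsIDs in the example are all assumptions made for illustration; the actual HapMap phase 2 iHS files may be laid out differently.

```python
import csv
import io


def select_ihs_hits(handle, variant_rsids, threshold=-2.0):
    """Return (rsid, population, iHS) for database variants with iHS < threshold.

    Assumes a tab-separated input with 'rsid', 'population' and 'ihs' columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        score = float(row["ihs"])
        # Keep only rows for variants in the database that pass the cutoff.
        if row["rsid"] in variant_rsids and score < threshold:
            hits.append((row["rsid"], row["population"], score))
    return hits


# Placeholder rsIDs and scores, for illustration only.
example = io.StringIO(
    "rsid\tpopulation\tihs\n"
    "rs0000001\tCEU\t-2.4\n"
    "rs0000002\tYRI\t-1.1\n"
    "rs0000003\tCEU\t-3.0\n"
)
hits = select_ihs_hits(example, {"rs0000001", "rs0000002"})
print(hits)
```

Here only the first row is retained: the second row does not pass the iHS < -2 cutoff and the third is not in the set of database variants.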

Database usage and navigation
The interface of the database has been built as a user-friendly GUI in which the homepage provides a brief summary and a search box for different query options. Each query directs the user to a table reporting all the genes and associated variations with their corresponding P-value, odds ratio, geographic location, pathogenicity and reference of the study. HGVID (HGV&TB identifiers) and rsID on this page are linked to a detailed report for the respective query, covering the gene, variant, study details and external links. The 'Gene' panel reports the name of the gene, the haplotype reported (if any) and the genomic location of the gene and the variant. The 'Variant' panel provides a description of the type of the variant, reported phenotype, P-value and odds ratio, the reported and concluded pathogenicity and HGVS values corresponding to the variant. The 'Details' panel reports the details of the study which reports the respective TB susceptibility variation, including the detection template, detection technique and the origin, ethnicity and geographic location of the population under study. External links to dbSNP, PubMed, UCSC (52) and GeneCards (53) have also been provided.

Discussion and future perspective
The advent of newer technologies for analysing genetic variations, including whole-genome sequencing methods, would enable researchers to query genomic signatures and to fine-map functional variations in genes previously shown to be associated with disease susceptibility. This would also unearth more genetic loci which confer susceptibility to various forms of TB. Systematic curation of such variations and their association from literature and sources of evidence needs to be done on a continuous basis. The involvement of the community would help achieve this goal much faster and more accurately. HGV&TB provides a starting point towards involving a larger community of researchers in the field who would on one end contribute their time and expertise curating variants, and on the other end see how these genetic variations could potentially be used in diverse clinical applications. In addition to the data being up-to-date, it is also important to ensure that the data are inter-operable with platforms and systems for analysis. The raw data have been provided on the server which can be used to perform meta-analysis to identify patterns and interesting relationships in context of multiple studies. To this end we foresee co-operation and collaboration with systems which facilitate exchange of data between resources and tools like Cafe Variome, where data are shared within different laboratories having common interest or with the wider world (http://www.cafevariome.org/).
We also foresee the potential integration of the data in resources which could enable automated analysis of whole-genome sequencing data, including data from personal genomes.

Table 1. Pathways enriched
Cytokine-cytokine receptor interaction; 26 genes: CCL1, IL1R1, TNF, CCL2, IL18, CCL5, CXCL12, IL10, IL12RB2, TNFRSF1A, TNFRSF1B, IL12RB1, IL10RA, IFNG, IL1B, IFNGR2, IFNGR1, LTA, IL1A, IL4, IL6, IL23R, IL8
Toll-like receptor signalling pathway; 16 genes: IL6, TNF, IL8, TOLLIP, TLR1, TIRAP, TLR2, TLR4, TLR6, CCL5, TLR8, TLR9, AKT1, IL1B, IL12B, CD14; values: 1.51E-12, 1.10E-13
Jak-STAT signalling pathway; 15 genes: IL4, IL6, IL23R, IL6R, CISH, IL10, AKT1, IL12RB2, IL12RB1, IL10RA, IFNG, IL12B, IFNGR2, IFNGR1,

Supplementary Data
Supplementary data are available at Database Online.