RADB: a database of rheumatoid arthritis-related polymorphisms

Rheumatoid arthritis (RA) is an autoimmune disease that has a complex genetic basis. Therefore, it is important to explore the genetic background of RA. The extensive recent application of polymorphic genetic markers, especially single nucleotide polymorphisms, has presented us with a large quantity of genetic data. In this study, we developed the Database of Rheumatoid Arthritis-related Polymorphisms (RADB) to integrate all the RA-related genetic polymorphisms and provide a useful resource for researchers. We manually extracted the RA-related polymorphisms from 686 published reports, including RA susceptibility loci, polymorphisms associated with particular clinical features of RA, polymorphisms associated with drug response in RA and polymorphisms associated with a higher risk of cardiovascular disease in RA. Currently, RADB V1.0 contains 3235 polymorphisms that are associated with 636 genes and were reported in studies from 68 countries. The detailed information extracted from the literature includes basic information about the articles (e.g. PubMed ID, title and abstract), population information (e.g. country, geographic area and sample size) and polymorphism information (e.g. polymorphism name, gene, genotype, odds ratio and 95% confidence interval, P-value and risk allele). In addition, useful annotations, such as hyperlinks to dbSNP, GenBank, UCSC, Gene Ontology and Kyoto Encyclopedia of Genes and Genomes pathway, are included, and a tool for meta-analysis was developed to summarize the results of multiple studies. The database is freely available at http://www.bioapp.org/RADB. Database URL: http://www.bioapp.org/RADB.


Introduction
Rheumatoid arthritis (RA) is a systemic inflammatory autoimmune disorder affected by genetic and environmental factors (1). The genetic component of RA has been estimated to be between 50 and 60% (2). Unlike single-gene disorders, RA is believed to be associated with multiple genes and their interactions (3). The strongest association has been shown to be with the HLA-DRB1 region (6p21), explaining ~30% of the total genetic effect (4). In addition to the HLA region, non-HLA genes (e.g. PTPN22, PADI4) have also been reported to contribute to RA susceptibility (5,6). Currently, many loci that have convincing evidence for association with RA have been identified. However, the results are often poorly replicated, especially in different populations, increasing the complexity of the research. Collecting and collating the information about RA risk loci will facilitate systematic exploration of the genetic mechanisms of RA.
Currently, several genetic association databases [e.g. Online Mendelian Inheritance in Man (OMIM) (7) and the Genetic Association Database (GAD) (8)] store disease susceptibility loci. OMIM focuses on high-quality data of high significance for Mendelian disorders. Although non-Mendelian diseases (also known as 'common' or 'complex' diseases) have been included in recent years, some biases still exist because of its history. In addition, OMIM is largely based on text and is a narrative history of disease research; thus, it is not designed to compare or analyze large sets of genetic data. More importantly, association studies of non-Mendelian diseases often have low-significance values, and findings of lower significance or negative findings are not routinely included in OMIM. Although GAD overcomes some disadvantages of OMIM, it is not a specialized database for RA, and polymorphisms associated with RA are not collected comprehensively. In addition, polymorphism genotype data are not collected in GAD, making some studies (e.g. meta-analysis) difficult. Therefore, a comprehensive, exhaustive and specialized database that includes all available genetic association study data from the published literature is urgently needed.
In addition to RA susceptibility, its clinical features [e.g. rheumatoid factor (RF) status, age of onset], drug response and cardiovascular (CV) events are also significantly influenced by genetic variation. Integrated management of these genetic variations and their relevant experimental information is also necessary, but so far, there is no database in which to store them.
Here, we present the Database of Rheumatoid Arthritis-related Polymorphisms (RADB) to integrate and analyze RA-related genetic polymorphisms extracted from published papers. The information collected comprises susceptibility loci for RA, polymorphisms associated with the clinical features of RA, polymorphisms associated with drug response in RA and polymorphisms associated with a higher risk of CV disease in RA. We collected not only polymorphisms that are significantly associated with RA, but also polymorphisms of lower significance and non-associated polymorphisms from RA-related research. To help users summarize the results of multiple studies, a linked tool for meta-analysis was developed. In addition, useful annotations, such as those from dbSNP (9), the National Center for Biotechnology Information (NCBI) GenBank (10), University of California Santa Cruz (UCSC) (11) and Gene Ontology (GO) (12), were integrated into RADB to complement and extend the information from the literature. We extracted the important information from the published reports, including basic information about the article [e.g. PubMed ID (PMID), title and abstract], population information (e.g. country, geographic area and sample size) and polymorphism information [e.g. polymorphism name, gene, genotype, odds ratio (OR) with 95% confidence interval (CI), P-value and risk allele].

Data collection and database content
Different laboratories may have different standards to describe the same polymorphism or gene; it is essential to standardize them. Polymorphisms may have multiple names: for example, rs2476601, PTPN22 1858C/T and PTPN22 R620W represent the same polymorphism. To standardize the name, we merged the synonyms for each polymorphism. For genes, we used the approved gene name/symbol and Entrez Gene ID.
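The synonym merging described above can be illustrated with a minimal sketch. It is not the RADB implementation: the synonym table, the choice of the dbSNP 'rs' number as the canonical name and the function names are all assumptions made for illustration. Only the three PTPN22 names come from the text.

```python
from typing import Dict, List, Optional

# Hypothetical synonym table: canonical name -> known aliases.
# Using the dbSNP 'rs' number as the canonical name is an assumption.
SYNONYMS: Dict[str, List[str]] = {
    "rs2476601": ["PTPN22 1858C/T", "PTPN22 R620W"],
}


def _key(name: str) -> str:
    """Normalize case and whitespace so that lookups are forgiving."""
    return " ".join(name.split()).lower()


def build_index(synonyms: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every normalized alias (and the canonical name itself) to the canonical name."""
    index: Dict[str, str] = {}
    for canonical, aliases in synonyms.items():
        index[_key(canonical)] = canonical
        for alias in aliases:
            index[_key(alias)] = canonical
    return index


def canonical_name(name: str, index: Dict[str, str]) -> Optional[str]:
    """Return the canonical polymorphism name, or None if the name is unknown."""
    return index.get(_key(name))
```

With such an index, a record reported under 'PTPN22 R620W' and another reported under 'rs2476601' resolve to the same entry and can be retrieved by a single query.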

Data categories
Using our criteria, we identified 3235 polymorphisms from 636 genes. The polymorphisms were divided into four classes: (i) susceptibility loci for RA; (ii) polymorphisms associated with particular clinical features of RA; (iii) polymorphisms associated with drug response in RA; and (iv) polymorphisms associated with a higher risk of CV disease in RA. Although these four classes are not independent (for example, PTPN22 rs2476601 exists in all four classes), we believe that such classification will enable users to interrogate our database quickly and in more depth. The primary relationships between the classes are shown in Table 1.
Although an association has been reported between PADI4 rs2240340 and RA in East Asian populations, it was not replicated in those of European ancestry (36,37). It was important, therefore, for our database to contain population information. The genes and genetic regions that have the strongest association with RA susceptibility are shown in Supplementary File S1 on the Web site: http://www.bioapp.org/research/RA.
(iv) Polymorphisms associated with a higher risk of CV disease in RA
RA is associated with an increased risk of CV events, causing increased CV morbidity and mortality (61). Currently, RADB contains 48 reports that examined the relationships between 83 polymorphisms (37 genes/regions) and a higher risk of CV disease in RA. Among these, 18 polymorphisms (17 genes/regions) have P-values < 0.05, and 2 polymorphisms (2 genes/regions) have P-values < 1 × 10⁻³, namely, LCE3C_LCE3B-del and CCR5 d32 (62)(63)(64). Although the number of polymorphisms associated with CV events is still relatively small, we expect the amount of data to expand with further research.

Meta-analysis module
The results of different association studies are often inconsistent, so a comprehensive evaluation of these results is important. Thus, we developed a module to perform a direct meta-analysis on the polymorphisms in RADB. Users can choose the parameters, such as the type of study (e.g. case-control study), the assumed risk allele and the genetic model. In addition, users can either analyze just their own data or supplement them with RADB data. In our meta-analysis module, the OR and 95% CI are calculated to assess the strength of association. Statistical heterogeneity among the studies is assessed with Woolf's test (65). A fixed-effects model using the Mantel-Haenszel method (66) and the random-effects model of DerSimonian and Laird (67) are used to summarize the results. The summary results are presented in tabular form and as forest plots. We also provide a funnel plot to detect publication bias. Hyperlinks to the full text of the included studies are provided for users who want more detailed information about the samples.
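The pooling steps named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the RADB module: each study is assumed to be reduced to a 2x2 allele-count table with no zero cells, the function names are hypothetical, and the 95% CI uses the normal approximation (1.96 standard errors on the log-OR scale).

```python
import math
from typing import List, Tuple

# One study as a 2x2 table (assumed layout, no zero cells):
# (a, b, c, d) = (risk allele in cases, other allele in cases,
#                 risk allele in controls, other allele in controls)
Table = Tuple[int, int, int, int]


def mantel_haenszel_or(tables: List[Table]) -> float:
    """Fixed-effects pooled OR using the Mantel-Haenszel estimator."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


def _log_or_and_var(table: Table) -> Tuple[float, float]:
    """Per-study log OR and its (Woolf) variance."""
    a, b, c, d = table
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def woolf_q(tables: List[Table]) -> Tuple[float, int]:
    """Woolf heterogeneity statistic Q and its degrees of freedom."""
    stats = [_log_or_and_var(t) for t in tables]
    weights = [1 / v for _, v in stats]
    pooled = sum(w * y for w, (y, _) in zip(weights, stats)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, (y, _) in zip(weights, stats))
    return q, len(tables) - 1


def dersimonian_laird(tables: List[Table]) -> Tuple[float, float, float]:
    """Random-effects pooled OR with an approximate 95% CI."""
    stats = [_log_or_and_var(t) for t in tables]
    weights = [1 / v for _, v in stats]
    q, df = woolf_q(tables)
    sw = sum(weights)
    scale = sw - sum(w * w for w in weights) / sw
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / scale) if df > 0 and scale > 0 else 0.0
    re_weights = [1 / (v + tau2) for _, v in stats]
    pooled = sum(w * y for w, (y, _) in zip(re_weights, stats)) / sum(re_weights)
    se = math.sqrt(1 / sum(re_weights))
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se),
            math.exp(pooled + 1.96 * se))
```

Under the null hypothesis of homogeneity, Q is compared with a chi-square distribution with the returned degrees of freedom. When Q does not exceed its degrees of freedom, the between-study variance is estimated as zero and the random-effects estimate reduces to the inverse-variance fixed-effects estimate.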

Querying the database
To meet the needs of different users, we offer different ways to search our database, including searching by polymorphism, searching by gene, searching by population, searching by different types of research (including candidate gene linkage analysis studies, candidate gene association studies and GWAS) and searching by chromosome.
Searching RADB by polymorphism name is a basic function. There are several types of polymorphism, such as single nucleotide polymorphisms, HLA alleles and microsatellites. Users can use the dbSNP 'rs' number, gene symbol plus mutation position or gene symbol plus type of mutation to query RADB: for instance, 'rs2476601', 'PTPN22 1858C/T', 'HLA-DRB1*0401' or 'IL1RN 86 bp VNTR' (Figure 1a). As mentioned above, the data are divided into four classes, and users can optionally choose a category of interest at this step. For ease of use, an auto-complete function is provided. The query results are reference centered (i.e. each record is a reference) and are displayed by publication date on a new page (Figure 1d). For instance, if a polymorphism has been described in 10 references, there will be 10 records.
The query results include basic information about the articles (e.g. PMID, title, source and important results/conclusions), population information (e.g. geographic area, population, population details and sample description) and polymorphism information (e.g. polymorphism name, gene symbol, Entrez Gene ID, genotype, OR and 95% CI, P-value and risk allele). If an article also examined other polymorphisms, a button will appear at the bottom of each record; users can click this button to display the other polymorphisms studied in the same paper.
Users can query the database using a gene name as a keyword (Figure 1b) or list all the genes in RADB. Both Entrez Gene ID and gene symbol are currently supported (e.g. 26191, PTPN22). The results are displayed on a new page (Figure 1c). The results include gene-related information (e.g. number of references, number of polymorphisms and polymorphism list) and hyperlinked gene annotations (e.g. gene name, location, Entrez Gene, EMBL-EBI, UCSC, GenBank, RefSeq, Unigene, Uniprot, Pfam, Prosite, GO and KEGG pathway).
In addition to querying RADB by polymorphism name and gene name, users can search RADB by population, type of research and chromosome. If the user queries RADB by population, the results will list all studies undertaken within the same population and their corresponding polymorphisms. If the user searches RADB by type of research, the results will list all reports of the same study type and their corresponding polymorphisms. If the user searches RADB by chromosome (such as '6', 'X' or 'mitochondrion'), the results will list all the genes and their corresponding polymorphisms located in the queried chromosome or chromosomal region.

Submitting new data
To continually improve our database, we welcome the ongoing submission of new data. The submission process is simple: users are only required to submit the article's PMID and the corresponding polymorphism names. We will manually verify the submitted data and, if they meet our requirements, add them to the database as soon as possible.

Discussion and conclusion
Over the past 4 years, we have extracted a large number of polymorphisms associated with RA from the published literature. These polymorphisms were collected and collated manually to obtain detailed and reliable data. The polymorphisms are associated with different phenotypes in different studies. For example, the purpose of some studies is to determine whether a certain polymorphism is an RA susceptibility locus; thus, we need to examine the association between the polymorphism and the presence of RA. In these cases, the samples are patients with RA and healthy controls. However, the purpose of other studies is to determine whether a certain polymorphism is associated with a particular clinical feature of RA, such as positivity for RF. In these cases, the samples are RF+ and RF− patients. Our four data classifications make it convenient for researchers to access and query RADB for a specific purpose.
To obtain all the studies from a certain population, and to compare data for the same polymorphism in different populations, we collected population information that includes detailed geographical information, and we provide a corresponding method of query. Currently, RADB contains data from 68 countries (see Supplementary File S2 on the  (23 studies). Patients with RA are found worldwide, and the prevalence has been estimated at ~1% (2). Interestingly, the prevalence is higher (>2%) in some Native American populations, and is lower (<0.3%) in East Asian, Southeast Asian and African populations (68). More research on different populations will be beneficial to the understanding of the different genetic mechanisms involved in RA.
Compared with analysis at the single-gene level, GO term enrichment analysis may provide further insight into the biological function of RA-related genes at the system level. GO term enrichment analysis for RA-related genes can be performed using Fisher's exact test as implemented in the topGO package (69). A total of 477 genes (at least one polymorphism with a P-value < 0.05) have been associated with RA, and 364 of them have been successfully assigned GO terms. Table 2 lists the top 40 most significant GO terms (for more details, see Supplementary File S3 on the Web site: http://www.bioapp.org/research/RA), which include 'inflammatory response', 'antigen processing and presentation' and 'cytokine imbalances'; this is in agreement with a previous study (70).
Over the past half century, several hypotheses have been proposed to explain the pathogenesis of RA. The key hypotheses are (i) the immune complex hypothesis and (ii) the T cells and cytokines hypothesis. The immune complex hypothesis states that immune complexes formed by antibodies and anti-antibodies (RFs) activate the complement cascade, which releases chemotactic factors such as C5a, resulting in inflammation and tissue damage (71). The T cells and cytokines hypothesis suggests that an imbalance between T helper 1 and T helper 2 cells and changes in cytokine expression (e.g. IL1, TNF-α and IL6) cause the immunopathological damage observed in RA (72). There is a close relationship between the enriched GO terms and these hypotheses. 'Inflammatory response' (GO:0006955 and GO:0006954) is a prominent characteristic of RA; 'antigen processing and presentation' (GO:0019882) is the initial step in the immune response (73). Moreover, 'cytokine imbalances' (GO:0001817) have been shown to be associated with many immunological processes, including promoting autoimmunity, chronic inflammation and tissue damage. Our results of GO term enrichment analysis show that the pathogenesis of RA is very complex.
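The enrichment test mentioned above can be illustrated for a single GO term. The sketch below is not the topGO implementation (topGO is an R package). It is a plain Python version of the one-sided Fisher's exact test, computed as a hypergeometric tail probability, and the function and parameter names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact test P-value for over-representation.

    N: genes in the background set that have GO annotations
    K: background genes annotated to the GO term of interest
    n: genes in the study set (e.g. RA-associated genes with GO annotations)
    k: study-set genes annotated to the GO term

    Returns P(X >= k) for X drawn from the hypergeometric distribution.
    """
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible table configurations contribute nothing to the tail.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total
```

In such an analysis, n would be the 364 RA-associated genes with GO annotations, whereas N, K and k depend on the annotation release used and are not given here. Because one test is performed per GO term, the resulting P-values would normally be interpreted with the number of tested terms in mind.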
We suggest that more attention should be given to the enriched GO terms and the genes annotated to these categories.
RADB is a genetic database that has been developed for basic research and clinical application for RA. RADB has several advantages over OMIM and GAD. First, more detailed phenotypic data are provided in RADB. Second, RADB contains the genotype data of RA-related polymorphisms, which are not given in the other genetic databases. Third, meta-analysis can be directly performed in RADB. Finally, RADB offers an easy-to-use interface in which the data can be readily compared.
In the future, we intend to add proteomic and epigenetic information to RADB, to reflect the growing importance of mRNA expression, DNA methylation and microRNAs in the pathogenesis of RA (74)(75)(76). Because of its ability to integrate and analyze the data from different sources, we believe that RADB will be helpful in studying and identifying the genetic and molecular basis of RA.

Supplementary data
Supplementary data are available at Database Online.