Julia Stoyanovich, Itsik Pe'er, MutaGeneSys: estimating individual disease susceptibility based on genome-wide SNP array data, Bioinformatics, Volume 24, Issue 3, February 2008, Pages 440–442, https://doi.org/10.1093/bioinformatics/btm587
Abstract
Summary: We present MutaGeneSys: a system that uses genome-wide genotype data to estimate disease susceptibility. Our system integrates three data sources: the International HapMap project, whole-genome marker correlation data and the Online Mendelian Inheritance in Man (OMIM) database. It accepts SNP data of individuals as query input and delivers disease susceptibility hypotheses even if the original set of typed SNPs is incomplete. Our system is scalable and flexible: it produces population, technology and confidence-specific predictions in interactive time.
Availability: Our system is available as an online resource at http://magnet.c2b2.columbia.edu/mutagenesys/. Our findings have also been incorporated into the HapMap Genome Browser as the OMIM_Disease_Associations track.
Contact: [email protected]
1 INTRODUCTION
Availability of genetic information continues to revolutionize the way we perceive medicine, with an ever stronger trend towards personalized diagnostics and treatment of heritable conditions. One challenge towards this goal is per-patient evaluation of susceptibility to disease and potential to gain from treatment based on single nucleotide polymorphisms (SNPs), which are DNA sequence variations occurring when a single nucleotide in the genome differs between individuals. Significant attention of the research community is devoted to determining direct causal association between the genotype and the phenotype, and many interesting associations have already been reported. The Online Mendelian Inheritance in Man (OMIM) (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM) database is currently the most complete source of such associations. A text search of OMIM yields, for example, a correlation between a C/T SNP (rs908832) in exon 14 of the ABCA2 gene and Alzheimer disease, and a connection between a SNP in the IFIH1 gene (rs1990760) and type-I diabetes.
Much individual genetic data is now being collected in the context of association studies, typing thousands of individuals for millions of variants. Yet, fully exploiting genetic information for disease prediction is difficult for two reasons. First and foremost, genetic information remains expensive to collect, and it is currently economically prohibitive to make a complete set of an individual's SNP genotypes available for analysis. Nature Genetics' ‘Question of the Year’ (www.nature.com/ng/qoty) announced the sequencing of the entire human genome for $1000 as a goal for the genetics community. Cost-effective methods (e.g. SNP arrays) currently exist for collecting genetic data from 1% to 5% of all 11 million human SNPs. This calls for the development of techniques that can effectively utilize partial genetic information for disease prediction. The second reason is that OMIM is only accessible in text form, and the limited cross-reference between OMIM and other NCBI databases makes it difficult to use this data for automated genome-wide diagnostics.
Several studies, culminating with the International HapMap project (International HapMap consortium), report on a significant amount of correlation among markers in the genome (Pe'er et al., 2006). This genomic redundancy enables one to experimentally type an incomplete set of SNPs, and expand this set by including correlated proxies. Indirect association between proxy genotype and phenotype thus facilitates effective and efficient association analysis.
In the simplest case, SNPs are correlated pairwise, and one of them may be predicted by the other; such correlations are referred to as single-marker predictors. Many two-marker and three-marker predictors are also known. Correlations between causal SNPs and their proxies are associated with a coefficient of determination (squared Pearson's correlation coefficient) r² ∈ [0, 1]. Marker correlation is population-dependent (de Bakker et al., 2006); the typed SNPs, and hence, the allowed predictors also depend on the genotyping technology. For example, according to our marker correlation dataset (www.cs.columbia.edu/~itsik/StandardGenotyping.htm), we can best predict the minor allele T of rs1205 on chromosome 1 based on the Affymetrix GeneChip 500K genotyping technology, and the prediction accuracy is r² = 0.733 (in the Japanese and Chinese population). OMIM links the predicted SNP rs1205 with Systemic Lupus Erythematosus (SLE) and antinuclear antibody production.
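To make the coefficient of determination concrete, the following minimal sketch computes r² for a single-marker predictor as the squared Pearson correlation between two 0/1 allele-indicator vectors. The function name `r_squared` and the haplotype data are illustrative assumptions, not part of MutaGeneSys or of the marker correlation dataset.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length 0/1 allele vectors."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("vectors must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A marker with no variation carries no predictive information.
        raise ValueError("r^2 is undefined for a monomorphic marker")
    return cov * cov / (var_x * var_y)


# Hypothetical haplotypes: 1 = minor allele present, 0 = absent.
causal = [1, 1, 0, 0, 1, 0, 0, 0]
proxy = [1, 1, 0, 0, 0, 0, 0, 0]

print(round(r_squared(causal, proxy), 3))
```

A value near 1 means the proxy almost fully determines the causal allele, while a value near 0 means the proxy is uninformative.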
Genome-wide correlation can be used to augment an individual's genetic information, greatly enhancing its diagnostic value. In the example above, if rs1205 was not typed, but rs12076827 and rs1572970 were, probabilities for the presence of the SLE-associated variant may be estimated. Our project is the first step in this direction. Our goal is to streamline the process of correlating SNP information with heritable disorders, and to enable real-time retrieval of disease susceptibility hypotheses on a genome-wide scale. Our system is scalable and flexible: it produces population, technology and confidence-specific hypotheses in real time. MutaGeneSys (Mutation Genome System) is available to the scientific community as an online resource (magnet.c2b2.columbia.edu/mutagenesys/). Our findings have also been incorporated into HapMap GBrowse (International HapMap consortium).
2 METHODS
2.1 Data processing and integration
We integrate three datasets: the International HapMap project (International HapMap consortium), Online Mendelian Inheritance in Man (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM), and a dataset of marker correlations (www.cs.columbia.edu/~itsik/StandardGenotyping.htm). We use HapMap, a comprehensive repository of SNP genotypes, to compile population-specific lists of SNP alleles and frequencies. Our prediction dataset consists of single- and two-marker correlations; consequently, these are the types of correlation that our system supports.
Both HapMap and marker correlation datasets are clean, non-redundant, and available in machine-processable form. The challenge with these datasets is the sheer volume of data. However, while there is a lot of information regarding correlations along the genome (the marker correlation dataset is large), relatively little is still known about correlations between SNPs and heritable disorders. We observe that our system can take advantage of marker correlations only if they ultimately lead to a hypothesis of disease susceptibility, and use available marker-to-disorder data as the limiting factor. A correlation between SNP1 and SNP2 is only useful if at least one of these SNPs is associated with a heritable disorder.
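The pruning rule described above can be sketched as a simple filter: a correlation record is kept only if at least one SNP it mentions has a known disorder association. The record layout, the function name `prune_correlations`, and the placeholder identifiers `rsA` and `rsB` are assumptions made for illustration; the actual storage schema is not described in the text.

```python
def prune_correlations(correlations, disease_snps):
    """Keep only correlations in which at least one SNP is linked to a disorder."""
    return [
        record
        for record in correlations
        if any(snp in disease_snps for snp in record["snps"])
    ]


# SNPs that the text reports as linked to a disorder in OMIM.
disease_snps = {"rs1205", "rs908832"}

correlations = [
    # Two-marker predictor of rs1205, mirroring the example in the Introduction.
    {"snps": ("rs12076827", "rs1572970", "rs1205"), "r2": 0.733},
    # Placeholder identifiers with no disorder link (hypothetical record).
    {"snps": ("rsA", "rsB"), "r2": 0.95},
]

kept = prune_correlations(correlations, disease_snps)
print(len(kept))
```

Filtering on the much smaller marker-to-disorder set first keeps the working dataset proportional to what is known about disease, not to the full volume of genome-wide correlations.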
We currently use OMIM, a repository of publications about human genes and genetic disorders, as our data source for marker to disorder associations. Associations between SNPs and diseases are not readily available and we resort to parsing this information from the text. We process OMIM record by record, looking for occurrences of rs numbers (cross-references from OMIM to dbSNP). We then assume that the mentioned SNP is associated with the heritable trait to which the current OMIM record pertains.
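As a rough illustration of this record-by-record scan, the sketch below searches free text for rs numbers with a regular expression and maps each one to the records that mention it. The record identifiers, the sample sentences, and the function name `extract_snp_associations` are hypothetical; the real OMIM record format and the authors' parser are not shown in the text.

```python
import re
from collections import defaultdict

# An rs number is "rs" followed by digits, delimited by word boundaries.
RS_PATTERN = re.compile(r"\brs\d+\b")


def extract_snp_associations(records):
    """Map each rs number to the set of record IDs whose text mentions it."""
    associations = defaultdict(set)
    for record_id, text in records.items():
        for rs_id in RS_PATTERN.findall(text):
            associations[rs_id].add(record_id)
    return dict(associations)


# Hypothetical record IDs and sentences built from the examples in the Introduction.
records = {
    "record_1": "A C/T SNP (rs908832) in exon 14 of the ABCA2 gene was reported.",
    "record_2": "A SNP in the IFIH1 gene, rs1990760, was reported.",
}

associations = extract_snp_associations(records)
print(sorted(associations))
```

Under the assumption stated above, every rs number found in a record is treated as associated with that record's trait, so a SNP mentioned only in passing would also be picked up.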
2.2 Using MutaGeneSys
The complete version of our system is available as an online resource (magnet.c2b2.columbia.edu/mutagenesys/). The web interface accepts individual-specific genotype queries, and estimates potential disease susceptibility by looking for population-specific disease associations that also meet the specified correlation coefficient cutoff. The online version of the system reasons with both single- and two-marker correlations. Query results present the causal and the proxy SNPs, the correlation coefficient, and provide a link to the relevant OMIM record and to the relevant portion of the genome in HapMap GBrowse. The system generates HTML and XML output.
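The query logic described above can be approximated as follows: given the SNPs typed for an individual, a population, and an r² cutoff, report a hypothesis whenever all proxies of a predictor are typed and the predicted SNP is linked to a disorder. This is a minimal sketch under stated assumptions, not the system's implementation. The `Predictor` class, the function name, and the population label `JPT+CHB` are illustrative, and directly typed causal SNPs are left out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predictor:
    proxies: tuple   # one or two typed SNPs that predict the causal SNP
    causal: str      # SNP assumed to be linked to a disorder
    population: str  # population in which the correlation was estimated
    r2: float        # coefficient of determination of the predictor


def susceptibility_hypotheses(typed_snps, population, min_r2, predictors, disorders_by_snp):
    """Return (disorder, causal SNP, proxies, r2) tuples supported by the typed SNPs."""
    typed = set(typed_snps)
    hypotheses = []
    for p in predictors:
        if p.population != population or p.r2 < min_r2:
            continue  # wrong population or below the requested confidence
        if p.causal in typed or p.causal not in disorders_by_snp:
            continue  # observed directly, or no known disorder link
        if all(proxy in typed for proxy in p.proxies):
            for disorder in disorders_by_snp[p.causal]:
                hypotheses.append((disorder, p.causal, p.proxies, p.r2))
    return hypotheses


# Values mirror the rs1205 example in the Introduction.
predictors = [Predictor(("rs12076827", "rs1572970"), "rs1205", "JPT+CHB", 0.733)]
disorders_by_snp = {"rs1205": ["Systemic Lupus Erythematosus"]}

result = susceptibility_hypotheses(
    ["rs12076827", "rs1572970"], "JPT+CHB", 0.7, predictors, disorders_by_snp
)
print(result)
```

Raising the cutoff trades coverage for confidence: with `min_r2` set to 0.8 the same query would return no hypotheses.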
Our findings have also been incorporated directly into the HapMap genome browser and into James Watson's personal genome sequence browser. These sites only use single-marker correlations, and display links to potentially relevant OMIM records irrespective of the population and of the coefficient of determination.
3 RESULTS
Our database contains a significant amount of SNP and marker correlation data, but only a limited number of SNP to OMIM associations. Across all populations and platforms we store over 10 million SNPs, close to 50 million single correlations, and over 20 million two-marker correlations. OMIM contains about 18 000 scientific articles, many of which are not relevant to association studies. However, we still identify 187 articles that mention associations between heritable disorders and SNPs, with 133 unique participating SNPs. Combining OMIM with marker correlation data, we are able to estimate disease susceptibility for an additional 328 unique single-marker pairs, and 396 double-marker sets. The dataset is enriched with a total of 1312 population-specific correlations. The number of susceptibility hypotheses will grow as more information about direct associations between SNPs and heritable disorders becomes available.
For an example of the effectiveness of MutaGeneSys, consider age-related macular degeneration (ARMD). According to OMIM, two SNPs are implicated in this disorder: rs3793784 in the ERCC6 gene and rs380390 in the CFH gene. MutaGeneSys associates 72 additional SNPs with ARMD. As another example, Systemic Lupus Erythematosus (SLE) is associated with two CRP polymorphisms in OMIM; MutaGeneSys uses 15 additional SNPs to indicate potential susceptibility to SLE.
4 DISCUSSION
MutaGeneSys cannot yet be considered as a source of diagnostic predictions, because of a number of uncertainties involved in going from a specific marker to disease. Given the available data, we made our best effort to control for population and correlation-specific effects: marker associations are computed separately for different populations, and susceptibility results include correlation parameters. For lack of information, we make two assumptions about OMIM markers: that they are causal, and that they correspond to the minor allele (we incorporate allele frequencies from HapMap to determine which allele to consider as minor). Another source of uncertainty is that specific markers may have different levels of penetrance, and therefore have different value as diagnostic predictors. Because of these factors we are currently only able to estimate, not diagnose, individual disease susceptibility.
While OMIM is the state-of-the-art collection of connections between typically rare variants and phenotypes, data is now being accumulated in the literature about associations between common alleles and disease (Altshuler and Daly, 2007). In the future, we plan to use such data in addition to OMIM to enhance the coverage of MutaGeneSys.
Our effort only partially overlaps with systems such as the Genetic Association Database (GAD) (geneticassociation.nih.gov) and the database of Genotype and Phenotype (dbGaP) (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gap). GAD correlates different kinds of genetic data across a variety of sources, including OMIM. To the best of our knowledge, GAD has no facilities to use individual genotype data as part of a query. We see a potential for including GAD and similar databases as sources of association data, either by integrating such data into our framework or by having MutaGeneSys integrated into theirs. dbGaP focuses on archiving and distribution of results of studies that investigate the interaction of genotype and phenotype. The system provides customizable browsing and navigation facilities, but does not implement automatic integration of genome-wide marker correlation data with study results, and could be enhanced by integrating a system such as ours.
5 CONCLUSION
We presented MutaGeneSys: a scalable system that uses genome-wide genotype data for disease prediction. MutaGeneSys can be used to estimate potential susceptibility to OMIM disorders among participants of whole-genome association studies, an as yet unexplored perspective of such data. We believe this system and its successors will pave the way for using whole-genome SNP arrays as practical diagnostic tools, advancing them from bench to bedside.
In the future we plan to incorporate three-marker correlations into our dataset. We intend to incorporate new high-quality resources of genotype-to-phenotype associations which may be developed in the future. Finally, we plan to release a programmatic API to MutaGeneSys, to enhance the interoperability of our tool with other systems.
ACKNOWLEDGEMENTS
This material is based in part upon work supported by the National Institutes of Health grant 5 U54 CA121852-03 and by the National Science Foundation grant IIS-0121239.
Conflict of Interest: none declared.
REFERENCES
Author notes
Associate Editor: Jonathan Wren